Can Large Models Fool the Eye? A New Turing Test for Biological Animation

1Shanghai Jiao Tong University, 2Shanghai AI Laboratory, 3Macao Polytechnic University, *Corresponding author
Teaser Image

We propose BioMotion Arena, the first biological motion-based visual preference evaluation framework for large models. We focus on ten typical human motions and introduce fine-grained control over gender, weight, mood, and direction. In total, we collect more than 45k votes for 53 mainstream LLMs and MLLMs on 90 biological motion variants.

Even young infants can readily interpret biological motion from point-light displays, without any prior knowledge.

A Glance

Abstract

Evaluating the abilities of large models and revealing the gaps between them is challenging. Current benchmarks adopt either ground-truth-based, score-form evaluation on static datasets or coarse, chatbot-style collection of textual human preferences, which may not provide users with immediate, intuitive, and perceptible feedback on performance differences.

In this paper, we introduce BioMotion Arena, a novel framework for evaluating large language models (LLMs) and multimodal large language models (MLLMs) via visual animation. Our methodology draws inspiration from the innate human ability to perceive the motion patterns of living organisms, and uses point-light imaging to amplify the performance discrepancies between models. Specifically, we employ pairwise comparison evaluation and collect more than 45k votes for 53 mainstream LLMs and MLLMs on 90 biological motion variants.

Data analyses show that the crowd-sourced human votes agree well with those of expert raters, demonstrating the strength of BioMotion Arena in offering discriminative feedback. We also find that over 90% of the evaluated models, including the cutting-edge open-source InternVL3 and the proprietary Claude-4 series, fail to produce even basic humanoid point-light groups, much less smooth and biologically plausible motion. This enables BioMotion Arena to serve both as a challenging benchmark for performance visualization and as a flexible evaluation framework that does not require ground truth.

Average number of lines of code generated by different models to represent biological motion.


Win-rate Comparison

Win-rate (Model A beats Model B) between a subset of models in BioMotion Arena.


The rate of 'Both-are-bad' votes within the code-specific comparisons.


For battles between LLMs, DeepSeek-R1-0528 consistently outperforms the others (except for OpenAI's o3-mini) by a large margin, winning 62% of its battles against the second-best model, OpenAI's o1. For battles between MLLMs, where reference images are provided during motion generation, Gemini 2.5 Pro establishes a commanding lead over the entire field of competitors. In general, LLMs exhibit slightly weaker generative performance than MLLMs, primarily due to their lack of visual grounding.
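To make the win-rate matrix concrete, the following is a minimal sketch of how pairwise win rates can be tallied from raw votes. The vote schema (two model names plus an outcome label) and the choice to exclude ties and 'Both-are-bad' votes from the denominator are illustrative assumptions, not the exact aggregation used in BioMotion Arena.

# A minimal sketch of tallying pairwise win rates from raw votes.
# The vote schema and the exclusion of ties/'Both-are-bad' votes are
# illustrative assumptions, not the exact aggregation used in the paper.
from collections import defaultdict

def win_rate_matrix(votes):
    """votes: iterable of (model_a, model_b, outcome), outcome in {'a', 'b', 'tie', 'both_bad'}."""
    wins = defaultdict(int)      # (winner, loser) -> number of wins
    battles = defaultdict(int)   # unordered pair -> number of decisive battles
    for model_a, model_b, outcome in votes:
        if outcome == "a":
            wins[(model_a, model_b)] += 1
        elif outcome == "b":
            wins[(model_b, model_a)] += 1
        else:
            continue  # ties and 'Both-are-bad' votes are not counted as decisive
        battles[frozenset((model_a, model_b))] += 1
    return {pair: w / battles[frozenset(pair)] for pair, w in wins.items()}

# Hypothetical example: DeepSeek-R1 beats o1 in 2 of 3 decisive battles.
votes = [("DeepSeek-R1", "o1", "a"), ("DeepSeek-R1", "o1", "a"), ("DeepSeek-R1", "o1", "b")]
print(win_rate_matrix(votes)[("DeepSeek-R1", "o1")])  # -> 0.666...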

Elo scores of a subset of models in the BioMotion Arena


We can observe a significant gap between open-source and proprietary models.
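For reference, here is a minimal sketch of an online Elo update of the kind commonly used to turn pairwise votes into ratings like those plotted above. The K-factor, base rating, and simple sequential update are illustrative assumptions; the exact rating procedure used for BioMotion Arena may differ.

# A minimal sketch of an online Elo update over pairwise votes.
# The K-factor, base rating, and sequential update are illustrative
# assumptions; the paper's exact rating procedure may differ.
def update_elo(ratings, model_a, model_b, score_a, k=32, base=1000):
    """score_a: 1.0 if model_a wins, 0.0 if it loses, 0.5 for a tie."""
    ra = ratings.get(model_a, base)
    rb = ratings.get(model_b, base)
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400))  # expected score of model_a
    ratings[model_a] = ra + k * (score_a - expected_a)
    ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))
    return ratings

ratings = {}
update_elo(ratings, "Gemini 2.5 Pro", "GPT-4o", 1.0)  # hypothetical single battle
print(ratings)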

Presentation of Biological Motions

In BioMotion Arena, we focus on ten typical human actions: walking, running, waving a hand, jumping up, jumping forward, bowing, lying down, sitting down, turning around, and forward rolling. Moreover, since biological motion conveys the agent's emotions, intentions, personality traits, and biological attributes, all of which a human observer can retrieve at a glance, we further incorporate finer-grained dimensions, including gender, weight, mood, and direction. These pose stricter scenarios for large models to understand and generate. Below, we display point-light animations generated by various LLMs and MLLMs.
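To give a sense of the task, the following is a minimal sketch of the kind of program a model might be asked to produce: a 15-point point-light walker animated with matplotlib. The joint layout and sinusoidal gait below are illustrative placeholders, not the reference stimuli or the prompt used in BioMotion Arena.

# A minimal illustrative sketch (not the BioMotion Arena reference stimulus):
# animate a 15-point point-light walker with matplotlib.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

def walker_points(t):
    """Return a (15, 2) array of point-light positions at gait phase t (radians)."""
    sway = 0.05 * np.sin(2 * t)          # small vertical bounce of the torso
    arm = np.sin(t)                       # arm swing phase (counter-swings the legs)
    leg = np.sin(t)                       # leg swing phase
    head      = [0.00, 1.70 + sway]
    chest     = [0.00, 1.30 + sway]
    pelvis    = [0.00, 1.00 + sway]
    shoulders = [[-0.25, 1.45 + sway], [0.25, 1.45 + sway]]
    elbows    = [[-0.30 - 0.10 * arm, 1.15], [0.30 + 0.10 * arm, 1.15]]
    wrists    = [[-0.35 - 0.20 * arm, 0.90], [0.35 + 0.20 * arm, 0.90]]
    hips      = [[-0.15, 0.95 + sway], [0.15, 0.95 + sway]]
    knees     = [[-0.15 + 0.15 * leg, 0.55], [0.15 - 0.15 * leg, 0.55]]
    ankles    = [[-0.15 + 0.30 * leg, 0.10], [0.15 - 0.30 * leg, 0.10]]
    return np.array([head, chest, pelvis, *shoulders, *elbows, *wrists,
                     *hips, *knees, *ankles])

fig, ax = plt.subplots(figsize=(3, 5))
ax.set_facecolor("black")
ax.set_xlim(-1, 1)
ax.set_ylim(-0.2, 2.0)
ax.axis("off")
dots = ax.scatter([], [], s=30, c="white")

def update(frame):
    dots.set_offsets(walker_points(frame * 0.2))  # advance the gait phase each frame
    return dots,

anim = FuncAnimation(fig, update, frames=120, interval=40, blit=True)
plt.show()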

GPT-5 (New!!!)

Walking

Jumping up

Happy-light-woman-walking

Sad-heavy-man-walking

OpenAI's o3

Walking

Jumping up

Happy-light-woman-walking

Sad-heavy-man-walking

OpenAI's o4-mini

Walking

Jumping up

Happy-light-woman-walking

Sad-heavy-man-walking

GPT-4o

Walking

Jumping up

Happy-light-woman-walking

Sad-heavy-man-walking

Gemini 2.5 Pro

Walking

Jumping up

Happy-light-woman-walking

Sad-heavy-man-walking

Gemini 2.5 Flash

Walking

Jumping up

Happy-light-woman-walking

Sad-heavy-man-walking

Grok-4

Walking

Jumping up

Happy-light-woman-walking

Sad-heavy-man-walking

Claude-4-Opus

Walking

Jumping up

Happy-light-woman-walking

Sad-heavy-man-walking

DeepSeek-R1-20250120

Walking

Jumping up

Happy-light-woman-walking

Sad-heavy-man-walking

Qwen3-32B

Walking

Jumping up

Happy-light-woman-walking

Sad-heavy-man-walking