Can Large Models Fool the Eye? A New Turing Test for Biological Animation

1Shanghai Jiao Tong University, 2Shanghai AI Laboratory, 3Macao Polytechnic University, *Corresponding author
Teaser Image

We propose BioMotion Arena, the first biological motion-based visual preference evaluation framework for large models. We focus on ten typical human motions and introduce fine-grained control over gender, weight, mood, and direction. In total, we collect more than 45k votes for 53 mainstream LLMs and MLLMs on 90 biological motion variants.

Even young infants can readily interpret biological motion from point-light displays, without any prior knowledge.

A Glance

Abstract

Evaluating the abilities of large models and revealing the gaps between them is challenging. Current benchmarks adopt either ground-truth-based, score-form evaluation on static datasets or coarse, chatbot-style collection of textual human preferences, which may not provide users with immediate, intuitive, and perceptible feedback on performance differences.

In this paper, we introduce BioMotion Arena, a novel framework for evaluating large language models (LLMs) and multimodal large language models (MLLMs) via visual animation. Our methodology draws inspiration from the innate human ability to perceive the motion patterns of living organisms, and uses point-light imaging to amplify the performance discrepancies between models. Specifically, we employ pairwise comparison evaluation and collect more than 45k votes for 53 mainstream LLMs and MLLMs on 90 biological motion variants.

Data analyses show that the crowd-sourced human votes agree well with those of expert raters, demonstrating the strength of BioMotion Arena in offering discriminative feedback. We also find that over 90% of the evaluated models, including the cutting-edge open-source InternVL3 and the proprietary Claude-4 series, fail to produce even basic humanoid point-light groups, much less smooth and biologically plausible motion. This enables BioMotion Arena to serve both as a challenging benchmark for performance visualization and as a flexible evaluation framework that does not require ground truth.

Average number of lines of code generated by different models to represent biological motion.


Win-rate Comparison

Win-rate (Model A beats Model B) between a subset of models in BioMotion Arena.


The rate of 'Both-are-bad' votes within the code-specific comparisons.


For battles between LLMs, DeepSeek-R1-0528 consistently outperforms the others (except for OpenAI's o3-mini) by a large margin, winning 62% of its battles against the second-best model, OpenAI's o1. For battles between MLLMs, where reference images are provided during motion generation, Gemini 2.5 Pro establishes a commanding lead over the entire field of competitors. In general, LLMs exhibit slightly weaker generative performance than MLLMs, primarily due to their lack of visual grounding.
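To make the win-rate matrix concrete, the following is a minimal sketch of how pairwise win rates can be tallied from raw votes. The vote schema (two model names plus an outcome label) and the choice to exclude ties and 'Both-are-bad' votes from the denominator are illustrative assumptions, not the exact aggregation used in BioMotion Arena.

# A minimal sketch of tallying pairwise win rates from raw votes.
# The vote schema and the exclusion of ties/'Both-are-bad' votes are
# illustrative assumptions, not the exact aggregation used in the paper.
from collections import defaultdict

def win_rate_matrix(votes):
    """votes: iterable of (model_a, model_b, outcome), outcome in {'a', 'b', 'tie', 'both_bad'}."""
    wins = defaultdict(int)      # (winner, loser) -> number of wins
    battles = defaultdict(int)   # unordered pair -> number of decisive battles
    for model_a, model_b, outcome in votes:
        if outcome == "a":
            wins[(model_a, model_b)] += 1
        elif outcome == "b":
            wins[(model_b, model_a)] += 1
        else:
            continue  # ties and 'Both-are-bad' votes are not counted as decisive
        battles[frozenset((model_a, model_b))] += 1
    return {pair: w / battles[frozenset(pair)] for pair, w in wins.items()}

# Hypothetical example: DeepSeek-R1 beats o1 in 2 of 3 decisive battles.
votes = [("DeepSeek-R1", "o1", "a"), ("DeepSeek-R1", "o1", "a"), ("DeepSeek-R1", "o1", "b")]
print(win_rate_matrix(votes)[("DeepSeek-R1", "o1")])  # -> 0.666...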

Elo scores of a subset of models in the BioMotion Arena


We can observe a significant gap between open-source and proprietary models.
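For reference, here is a minimal sketch of an online Elo update of the kind commonly used to turn pairwise votes into ratings like those plotted above. The K-factor, base rating, and simple sequential update are illustrative assumptions; the exact rating procedure used for BioMotion Arena may differ.

# A minimal sketch of an online Elo update over pairwise votes.
# The K-factor, base rating, and sequential update are illustrative
# assumptions; the paper's exact rating procedure may differ.
def update_elo(ratings, model_a, model_b, score_a, k=32, base=1000):
    """score_a: 1.0 if model_a wins, 0.0 if it loses, 0.5 for a tie."""
    ra = ratings.get(model_a, base)
    rb = ratings.get(model_b, base)
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400))  # expected score of model_a
    ratings[model_a] = ra + k * (score_a - expected_a)
    ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))
    return ratings

ratings = {}
update_elo(ratings, "Gemini 2.5 Pro", "GPT-4o", 1.0)  # hypothetical single battle
print(ratings)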

Presentation of Biological Motions

In BioMotion Arena, we focus on ten typical human actions: walking, running, waving a hand, jumping up, jumping forward, bowing, lying down, sitting down, turning around, and forward rolling. Moreover, since biological motion conveys the agent's emotions, intentions, personality traits, and biological attributes, all of which a human observer can retrieve at a glance, we further incorporate finer-grained dimensions, including gender, weight, mood, and direction. These pose stricter scenarios for large models to understand and generate. Below, we display point-light animations generated by various LLMs and MLLMs.
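To give a sense of the task, the following is a minimal sketch of the kind of program a model might be asked to produce: a 15-point point-light walker animated with matplotlib. The joint layout and sinusoidal gait below are illustrative placeholders, not the reference stimuli or the prompt used in BioMotion Arena.

# A minimal illustrative sketch (not the BioMotion Arena reference stimulus):
# animate a 15-point point-light walker with matplotlib.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

def walker_points(t):
    """Return a (15, 2) array of point-light positions at gait phase t (radians)."""
    sway = 0.05 * np.sin(2 * t)          # small vertical bounce of the torso
    arm = np.sin(t)                       # arm swing phase (counter-swings the legs)
    leg = np.sin(t)                       # leg swing phase
    head      = [0.00, 1.70 + sway]
    chest     = [0.00, 1.30 + sway]
    pelvis    = [0.00, 1.00 + sway]
    shoulders = [[-0.25, 1.45 + sway], [0.25, 1.45 + sway]]
    elbows    = [[-0.30 - 0.10 * arm, 1.15], [0.30 + 0.10 * arm, 1.15]]
    wrists    = [[-0.35 - 0.20 * arm, 0.90], [0.35 + 0.20 * arm, 0.90]]
    hips      = [[-0.15, 0.95 + sway], [0.15, 0.95 + sway]]
    knees     = [[-0.15 + 0.15 * leg, 0.55], [0.15 - 0.15 * leg, 0.55]]
    ankles    = [[-0.15 + 0.30 * leg, 0.10], [0.15 - 0.30 * leg, 0.10]]
    return np.array([head, chest, pelvis, *shoulders, *elbows, *wrists,
                     *hips, *knees, *ankles])

fig, ax = plt.subplots(figsize=(3, 5))
ax.set_facecolor("black")
ax.set_xlim(-1, 1)
ax.set_ylim(-0.2, 2.0)
ax.axis("off")
dots = ax.scatter([], [], s=30, c="white")

def update(frame):
    dots.set_offsets(walker_points(frame * 0.2))  # advance the gait phase each frame
    return dots,

anim = FuncAnimation(fig, update, frames=120, interval=40, blit=True)
plt.show()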

GPT-5 (New!!!)

Walking

Jumping up

Happy-light-woman-walking

Sad-heavy-man-walking

OpenAI's o3

Walking

Jumping up

Happy-light-woman-walking

Sad-heavy-man-walking

OpenAI's o4-mini

Walking

Jumping up

Happy-light-woman-walking

Sad-heavy-man-walking

GPT-4o

Walking

Jumping up

Happy-light-woman-walking

Sad-heavy-man-walking

Gemini 2.5 Pro

Walking

Jumping up

Happy-light-woman-walking

Sad-heavy-man-walking

Gemini 2.5 Flash

Walking

Jumping up

Happy-light-woman-walking

Sad-heavy-man-walking

Grok-4

Walking

Jumping up

Happy-light-woman-walking

Sad-heavy-man-walking

Claude-4-Opus

Walking

Jumping up

Happy-light-woman-walking

Sad-heavy-man-walking

DeepSeek-R1-20250120

Walking

Jumping up

Happy-light-woman-walking

Sad-heavy-man-walking

Qwen3-32B

Walking

Jumping up

Happy-light-woman-walking

Sad-heavy-man-walking