LMM-JND

Just Noticeable Difference for Large Multimodal Models

Zijian Chen1,2, Yuan Tian1,2, Yuze Sun1, Wei Sun3, Zicheng Zhang1,2, Weisi Lin4, Guangtao Zhai1,2, Wenjun Zhang1

1Shanghai Jiao Tong University, 2Shanghai AI Laboratory, 3East China Normal University, 4Nanyang Technological University

arXiv 2025

Overview: Recent years have witnessed unprecedented developments in large multimodal models (LMMs). Beyond these advancements, few studies have investigated the perception limits of LMMs, which include not only the breadth of their capabilities but also their granularity. As the ultimate receiver and appreciator of an ever-growing volume of visual content shifts from the human visual system (HVS) to LMMs, concerns regarding their perceptual capabilities and safety are gaining prominence. Many LMMs struggle to detect and respond differently to changes that humans can easily perceive. A question therefore naturally arises: what is the minimal magnitude of change that LMMs can perceive? In this paper, we propose the LMM-JND concept, i.e., Just Noticeable Difference for LMMs, to explore the perceptual redundancy characteristics of LMMs. In addition, we introduce a large-scale LMM-oriented just noticeable difference image dataset (VPA-JND) containing over 489k stimuli with more than 21k reference images to quantitatively evaluate the perceptual limits of LMMs and facilitate future LMM-JND studies.

LMM-JND Concept              

(1) Quantify the perceptual redundancy characteristic of LMMs (a probing sketch follows this list).

(2) Act as a benefit zone that allows an LMM to ignore insignificant or irrelevant variations in the input, improving processing efficiency.

(3) Explain security concerns, e.g., data tampering and adversarial attacks, and indicate the robustness of an LMM in certain scenarios.
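To make point (1) concrete, here is a minimal sketch of how an LMM-JND could be probed in practice. The `query_lmm(reference, stimulus, prompt)` helper and `distort(image, level)` stimulus generator are hypothetical placeholders, and the prompt and stopping rule are illustrative, not the paper's exact protocol.

```python
# Minimal LMM-JND probing sketch (illustrative; not the paper's exact protocol).
# `query_lmm(reference, stimulus, prompt)` and `distort(image, level)` are
# hypothetical helpers: the former returns the model's text answer, the latter
# produces a stimulus at a given distortion level (e.g., blur radius).

def find_first_jnd(reference, levels, distort, query_lmm,
                   prompt="Are these two images identical? Answer yes or no."):
    """Step through increasing distortion levels and return the first level
    at which the model reports a perceptible difference (its 1st JND)."""
    for level in sorted(levels):
        stimulus = distort(reference, level)
        answer = query_lmm(reference, stimulus, prompt).strip().lower()
        if answer.startswith("no"):  # the model notices the change
            return level
    return None  # no difference perceived within the tested range
```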

📊 VPA-JND Dataset

(1) Large-scale: Contains 21,598 reference images and 489,065 stimuli.

(2) Human visual system-aligned: Draws inspiration from the signal-processing mechanisms of the human visual cortex (V1-V4), strategically aligning with the human-like development objectives of contemporary LMMs.

(3) Comprehensive: Involves 7 common low-level distortions, 2 content-injection patterns, and 2 spatial FoV changes.

🎯 Comprehensive Evaluation     

(1) 1st JND: Gemini 2.0 Flash shows the finest perception granularity on most low-level and content-injection distortions while approaching human-level performance on spatial-awareness stimuli.

(2) LMM-JND curves: Reveal distinct perceptual redundancy intervals in the GPT-4o and Qwen series; InternVL2.5, by contrast, displays more frequent fluctuations.

(3) Vision/Language Backbone: LMMs with stronger detail-perception capabilities generally have a lower language-to-vision parameter ratio (Param.L / Param.V).

(4) Beyond Visual Signals: LMM-JND also exists in textual inputs when adversarial textual attacks are applied in a quantitative manner (see the sketch below).
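To illustrate point (4), the sketch below applies graded character-level perturbations to a prompt and records the smallest perturbation strength that changes the model's answer. The `query_lmm_text` call and the swap-based perturbation are assumptions for illustration, not the paper's attack method.

```python
import random

# Illustrative textual-JND probe. `query_lmm_text(prompt)` is a hypothetical
# call that returns the model's answer; the adjacent-character swaps below are
# a stand-in for adversarial textual attacks, not the paper's method.

def perturb(text, n_edits, seed=0):
    """Swap n_edits randomly chosen adjacent character pairs."""
    rng = random.Random(seed)
    chars = list(text)
    for _ in range(n_edits):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def find_textual_jnd(prompt, query_lmm_text, max_edits=20):
    """Return the smallest number of edits that changes the model's answer."""
    baseline = query_lmm_text(prompt)
    for n in range(1, max_edits + 1):
        if query_lmm_text(perturb(prompt, n)) != baseline:
            return n
    return None  # answer unchanged within the tested perturbation budget
```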

VPA-JND Dataset
VPA-JND is a large-scale, HVS-aligned, and comprehensive dataset with 21,598 reference images and 489,065 stimuli covering three categories: (1) low-level distorted images, (2) content-injected images, and (3) 3D field of view (FoV) images. Each category is further divided into multiple fine-grained subsets to address diverse perception and safety scenarios.
The three categories are organized as follows:
    • Low-level Distortion: We consider 7 typical low-level distortions, including blur, brightness, color saturation, contrast change, JPEG compression, and banding artifacts, which arise from signal acquisition, transmission, or quantization in efficient visual coding (a stimulus-generation sketch follows this list).
    • Content-injection: Grounded in visual acuity tests and robustness benchmarks, as well as input-security considerations for LMMs, we outline two content perturbation types, i.e., benign and malicious, that require qualitative and quantitative perception of semantic content, location, and visibility.
    • 3D FoV: We build two virtual 3D environments using Ansys Speos to achieve precise and controllable camera FoV adjustment. We focus on the rotation and revolution of the camera and generate 10k stimuli with varying viewing distances and angles, including panning, zooming in, horizontal flipping, and pitch transformations.
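As a concrete companion to the low-level distortion subset described above, the snippet below sketches how graded stimuli of this kind could be generated with Pillow; the level schedules are illustrative placeholders, not VPA-JND's actual parameters.

```python
from io import BytesIO
from PIL import Image, ImageEnhance, ImageFilter

# Illustrative level schedules only -- not VPA-JND's actual settings.
BLUR_RADII = [0.5 * i for i in range(1, 11)]        # Gaussian blur radius
BRIGHTNESS = [1.0 + 0.1 * i for i in range(1, 11)]  # brightness scaling factor
JPEG_QUALITY = list(range(95, 45, -5))              # decreasing JPEG quality

def blur_series(img):
    """Progressively blurred versions of a reference image."""
    return [img.filter(ImageFilter.GaussianBlur(r)) for r in BLUR_RADII]

def brightness_series(img):
    """Progressively brightened versions of a reference image."""
    return [ImageEnhance.Brightness(img).enhance(f) for f in BRIGHTNESS]

def jpeg_series(img):
    """Re-encoded versions at decreasing JPEG quality levels."""
    out = []
    for q in JPEG_QUALITY:
        buf = BytesIO()
        img.convert("RGB").save(buf, format="JPEG", quality=q)
        buf.seek(0)
        out.append(Image.open(buf).copy())
    return out

# Usage: stimuli = blur_series(Image.open("reference.png"))
```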