LMM-JND

Just Noticeable Difference for Large Multimodal Models

Zijian Chen1,2, Yuan Tian1,2, Yuze Sun1, Wei Sun3, Zicheng Zhang1,2, Weisi Lin4, Guangtao Zhai1,2, Wenjun Zhang1

1Shanghai Jiao Tong University, 2Shanghai AI Laboratory, 3East China Normal University, 4Nanyang Technological University

arXiv 2025

Overview: Recent years have witnessed unprecedented developments in large multimodal models (LMMs). Beyond these advancements, few studies have investigated the perception limits of LMMs, which include not only the breadth of their capabilities but also their granularity. As the ultimate receiver and appreciator of an ever-growing volume of visual content shifts from the human visual system (HVS) to LMMs, concerns regarding their perceptual capabilities and safety are gaining prominence. Many LMMs struggle to detect and respond differently to changes that humans can easily perceive. A question therefore naturally arises: what is the minimal magnitude of change that LMMs can perceive? In this paper, we propose the LMM-JND concept, i.e., Just Noticeable Difference for LMMs, to explore the perceptual redundancy characteristics of LMMs. In addition, we introduce a large-scale LMM-oriented just noticeable difference image dataset (VPA-JND) containing over 489k stimuli with more than 21k reference images to quantitatively evaluate the perceptual limits of LMMs and facilitate future LMM-JND studies.

LMM-JND Concept              

(1) Quantify the perceptual redundancy characteristic of LMMs (a probing sketch follows this list).

(2) Act as a benefit zone that allows an LMM to ignore insignificant or irrelevant variations in the input, improving processing efficiency.

(3) Explain security concerns, e.g., data tampering and adversarial attacks, and indicate the robustness of an LMM in certain scenarios.
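To make point (1) concrete, here is a minimal sketch of how an LMM-JND could be probed in practice. The `query_lmm(reference, stimulus, prompt)` helper and `distort(image, level)` stimulus generator are hypothetical placeholders, and the prompt and stopping rule are illustrative, not the paper's exact protocol.

```python
# Minimal LMM-JND probing sketch (illustrative; not the paper's exact protocol).
# `query_lmm(reference, stimulus, prompt)` and `distort(image, level)` are
# hypothetical helpers: the former returns the model's text answer, the latter
# produces a stimulus at a given distortion level (e.g., blur radius).

def find_first_jnd(reference, levels, distort, query_lmm,
                   prompt="Are these two images identical? Answer yes or no."):
    """Step through increasing distortion levels and return the first level
    at which the model reports a perceptible difference (its 1st JND)."""
    for level in sorted(levels):
        stimulus = distort(reference, level)
        answer = query_lmm(reference, stimulus, prompt).strip().lower()
        if answer.startswith("no"):  # the model notices the change
            return level
    return None  # no difference perceived within the tested range
```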

📊 VPA-JND Dataset

(1) Large-scale: Contains 21,598 reference images and 489,065 stimuli.

(2) Human visual system-aligned: Draws inspiration from the signal-processing mechanisms of the human visual cortex (V1-V4), strategically aligning with the human-like development objectives of contemporary LMMs.

(3) Comprehensive: Involves 7 common low-level distortions, 2 content-injection patterns, and 2 spatial FoV changes.

🎯 Comprehensive Evaluation     

(1) 1st JND: Gemini 2.0 Flash shows the finest perception granularity on most low-level and content-injection distortions while approaching human-level performance on spatial-awareness stimuli.

(2) LMM-JND curves: Reveal distinct perceptual redundancy intervals in the GPT-4o and Qwen series; InternVL2.5, by contrast, displays more frequent fluctuations.

(3) Vision/Language Backbone: LMMs with stronger detail-perception capabilities generally have a lower language-to-vision parameter ratio (Param.L / Param.V).

(4) Beyond Visual Signals: LMM-JND also exists in textual inputs when adversarial textual attacks are applied in a quantitative manner (see the sketch below).
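To illustrate point (4), the sketch below applies graded character-level perturbations to a prompt and records the smallest perturbation strength that changes the model's answer. The `query_lmm_text` call and the swap-based perturbation are assumptions for illustration, not the paper's attack method.

```python
import random

# Illustrative textual-JND probe. `query_lmm_text(prompt)` is a hypothetical
# call that returns the model's answer; the adjacent-character swaps below are
# a stand-in for adversarial textual attacks, not the paper's method.

def perturb(text, n_edits, seed=0):
    """Swap n_edits randomly chosen adjacent character pairs."""
    rng = random.Random(seed)
    chars = list(text)
    for _ in range(n_edits):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def find_textual_jnd(prompt, query_lmm_text, max_edits=20):
    """Return the smallest number of edits that changes the model's answer."""
    baseline = query_lmm_text(prompt)
    for n in range(1, max_edits + 1):
        if query_lmm_text(perturb(prompt, n)) != baseline:
            return n
    return None  # answer unchanged within the tested perturbation budget
```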

VPA-JND Dataset
VPA-JND is a large-scale, HVS-aligned, and comprehensive dataset with 21,598 reference images and 489,065 stimuli covering three categories: (1) low-level distorted images, (2) content-injected images, and (3) 3D field of view (FoV) images. Each category is further divided into multiple fine-grained subsets to address diverse perception and safety scenarios.
The three categories are organized as follows:
    • Low-level Distortion: We consider 7 typical low-level distortions, including blur, brightness, color saturation, contrast change, JPEG compression, and banding artifacts, which arise from signal acquisition, transmission, or quantization in efficient visual coding (a stimulus-generation sketch follows this list).
    • Content-injection: Grounded in visual acuity tests and robustness benchmarks, as well as input-security considerations for LMMs, we outline two content perturbation types, i.e., benign and malicious, that require qualitative and quantitative perception of semantic content, location, and visibility.
    • 3D FoV: We build two virtual 3D environments using Ansys Speos to achieve precise and controllable camera FoV adjustment. We focus on the rotation and revolution of the camera and generate 10k stimuli with varying viewing distances and angles, including panning, zooming in, horizontal flipping, and pitch transformations.
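As a concrete companion to the low-level distortion subset described above, the snippet below sketches how graded stimuli of this kind could be generated with Pillow; the level schedules are illustrative placeholders, not VPA-JND's actual parameters.

```python
from io import BytesIO
from PIL import Image, ImageEnhance, ImageFilter

# Illustrative level schedules only -- not VPA-JND's actual settings.
BLUR_RADII = [0.5 * i for i in range(1, 11)]        # Gaussian blur radius
BRIGHTNESS = [1.0 + 0.1 * i for i in range(1, 11)]  # brightness scaling factor
JPEG_QUALITY = list(range(95, 45, -5))              # decreasing JPEG quality

def blur_series(img):
    """Progressively blurred versions of a reference image."""
    return [img.filter(ImageFilter.GaussianBlur(r)) for r in BLUR_RADII]

def brightness_series(img):
    """Progressively brightened versions of a reference image."""
    return [ImageEnhance.Brightness(img).enhance(f) for f in BRIGHTNESS]

def jpeg_series(img):
    """Re-encoded versions at decreasing JPEG quality levels."""
    out = []
    for q in JPEG_QUALITY:
        buf = BytesIO()
        img.convert("RGB").save(buf, format="JPEG", quality=q)
        buf.seek(0)
        out.append(Image.open(buf).copy())
    return out

# Usage: stimuli = blur_series(Image.open("reference.png"))
```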