The increasing sophistication of multimodal models necessitates benchmarks that can rigorously evaluate their understanding and reasoning in complex, safety-pertinent, open-world scenarios. This study introduces M4R (Measuring Massive Multimodal Understanding and Reasoning), a large-scale benchmark designed to assess reasoning capabilities across diverse open spaces, covering land, air, and water environments. M4R comprises approximately 2,000 videos and over 19,000 human-annotated question-answer pairs. These videos, varying in length (short, medium, long) and presenting tasks of tiered difficulty (interval-based choices and accuracy-based choices), encompass distinct operational domains: the land-based scenarios primarily focus on traffic environments, particularly traffic collisions and accident cases; the air-based scenarios center on airplane navigation; and the water-based scenarios involve ship movements. M4R systematically evaluates models on temporal-causal reasoning, spatial understanding, and intent and goal planning within these dynamic contexts. By providing a unified platform across this broad spectrum of domains, M4R aims to drive the development of more robust and generalizable AI systems. Benchmarking state-of-the-art multimodal models on our dataset reveals that even leading models, such as GPT-4o and Gemini 2.5 Pro, achieve only around 30% average accuracy on the hard-level tasks, highlighting the significant challenges that remain in open-space multimodal reasoning.
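
As a rough illustration of how per-category accuracy on multiple-choice questions could be scored, the sketch below groups answers by reasoning category (temporal, spatial, intent) and reports the percentage answered correctly. It is not M4R's evaluation code: the record keys (`category`, `prediction`, `answer`) and the sample records are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return percentage accuracy per reasoning category.

    Each record is a dict with hypothetical keys 'category',
    'prediction', and 'answer' (choice labels such as 'A' to 'D').
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        category = record["category"]
        total[category] += 1
        # Compare choice labels case-insensitively, ignoring stray whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[category] += 1
    return {c: 100.0 * correct[c] / total[c] for c in total}


# Made-up records, for illustration only (not benchmark data).
records = [
    {"category": "temporal", "prediction": "A", "answer": "A"},
    {"category": "temporal", "prediction": "b", "answer": "C"},
    {"category": "spatial", "prediction": " d ", "answer": "D"},
    {"category": "intent", "prediction": "B", "answer": "A"},
]
scores = accuracy_by_category(records)
print(scores)
```

Keeping the three categories separate, rather than pooling all questions, mirrors how the results tables report Temporal, Spatial, and Intent columns side by side.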

Difficulty | Models | Size | Over. Avg. | Temporal | Spatial | Intent |
---|---|---|---|---|---|---|
Hard | GPT 4o | - | 22.21 | 24.92 | 27.14 | 13.80 |
Hard | Gemini 2.5 Pro 🥇 | - | 31.01 | 38.18 | 30.08 | 25.20 |
Hard | Gemini 2.5 flash think | - | 28.52 | 31.74 | 30.33 | 26.26 |
Hard | Gemini 2.5 flash no think | - | 24.33 | 24.80 | 30.41 | 20.53 |
Hard | Gemini 1.5 Pro | - | 19.07 | 22.53 | 21.57 | 17.25 |
Hard | Claude 3.5 | - | 28.89 | 32.84 | 29.18 | 23.41 |
Hard | InternVL2.5 | 26B | 22.45 | 25.33 | 27.42 | 12.64 |
Hard | InternVL2.5 | 8B | 20.39 | 21.30 | 29.41 | 11.42 |
Hard | InternVL2.5 | 4B | 17.31 | 17.39 | 23.04 | 13.13 |
Hard | LLaVA Next | 32B | 17.83 | 11.28 | 26.09 | 10.10 |
Hard | LLaVA Video | 7B | 17.35 | 13.02 | 27.49 | 10.18 |
Hard | LLaVA OneVision | 7B | 14.27 | 9.55 | 24.74 | 10.15 |
Hard | Qwen2.5 VL | 32B | 19.39 | 13.19 | 27.85 | 14.05 |
Hard | Qwen2.5 VL | 7B | 20.34 | 12.31 | 28.40 | 15.48 |
Medium | GPT 4o | - | 41.21 | 44.89 | 47.03 | 28.19 |
Medium | Gemini 2.5 Pro | - | 41.07 | 41.31 | 48.33 | 33.06 |
Medium | Gemini 2.5 flash think 🥇 | - | 41.45 | 46.83 | 45.89 | 35.51 |
Medium | Gemini 2.5 flash no think | - | 40.36 | 41.97 | 42.93 | 33.61 |
Medium | Gemini 1.5 Pro | - | 37.13 | 40.69 | 43.81 | 31.06 |
Medium | Claude 3.5 | - | 37.99 | 36.46 | 47.34 | 31.09 |
Medium | InternVL2.5 | 26B | 36.39 | 37.85 | 47.51 | 27.55 |
Medium | InternVL2.5 | 8B | 35.44 | 39.85 | 51.07 | 18.98 |
Medium | InternVL2.5 | 4B | 36.53 | 31.21 | 45.36 | 32.68 |
Medium | LLaVA Next | 32B | 21.07 | 13.57 | 33.08 | 14.24 |
Medium | LLaVA Video | 7B | 24.04 | 19.33 | 30.50 | 19.72 |
Medium | LLaVA OneVision | 7B | 17.76 | 17.81 | 24.71 | 17.12 |
Medium | Qwen2.5 VL | 32B | 29.93 | 23.34 | 41.94 | 25.82 |
Medium | Qwen2.5 VL | 7B | 28.79 | 22.18 | 34.64 | 22.89 |
Easy | GPT 4o | - | 45.01 | 55.33 | 38.08 | 43.72 |
Easy | Gemini 2.5 Pro 🥇 | - | 59.36 | 61.16 | 54.51 | 58.09 |
Easy | Gemini 2.5 flash think | - | 53.14 | 58.70 | 55.86 | 48.47 |
Easy | Gemini 2.5 flash no think | - | 50.52 | 52.16 | 51.30 | 44.41 |
Easy | Gemini 1.5 Pro | - | 48.05 | 53.22 | 47.85 | 45.37 |
Easy | Claude 3.5 | - | 50.14 | 53.28 | 48.51 | 46.40 |
Easy | InternVL2.5 | 26B | 55.08 | 58.41 | 53.46 | 44.45 |
Easy | InternVL2.5 | 8B | 51.03 | 53.64 | 54.52 | 42.20 |
Easy | InternVL2.5 | 4B | 48.93 | 46.55 | 52.31 | 43.65 |
Easy | LLaVA Next | 32B | 35.32 | 31.22 | 40.09 | 34.34 |
Easy | LLaVA Video | 7B | 30.44 | 29.41 | 34.12 | 31.64 |
Easy | LLaVA OneVision | 7B | 31.10 | 29.46 | 33.78 | 29.88 |
Easy | Qwen2.5 VL | 32B | 48.35 | 50.68 | 47.82 | 44.97 |
Easy | Qwen2.5 VL | 7B | 37.97 | 38.87 | 33.20 | 36.45 |


Difficulty | Models | Size | Over. Avg. | Short Video Avg. | Short Video Temporal | Short Video Spatial | Short Video Intent | Medium Video Avg. | Medium Video Temporal | Medium Video Spatial | Medium Video Intent | Long Video Avg. | Long Video Temporal | Long Video Spatial | Long Video Intent |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Hard | GPT 4o | - | 24.41 | 26.78 | 34.65 | 34.69 | 11 | 35.70 | 43.14 | 32.14 | 31.82 | 11.00 | 6 | 26 | 1 |
Hard | Gemini 2.5 Pro 🥇 | - | 29.76 | 34.84 | 36.63 | 44.90 | 23.0 | 35.76 | 45.10 | 30.36 | 31.82 | 18.67 | 10.0 | 28.0 | 18.0 |
Hard | Gemini 2.5 flash think | - | 28.67 | 32.13 | 35.64 | 37.75 | 23.00 | 35.20 | 37.25 | 41.07 | 27.27 | 18.67 | 6.00 | 36.00 | 14.00 |
Hard | Gemini 2.5 flash no-think | - | 24.34 | 24.74 | 30.69 | 26.53 | 17.00 | 30.94 | 52.94 | 23.21 | 16.67 | 17.33 | 14.00 | 24.00 | 14.00 |
Hard | Gemini 1.5 Pro | - | 18.76 | 19.72 | 23.76 | 20.41 | 15 | 24.55 | 33.33 | 16.07 | 24.24 | 12.00 | 2 | 26 | 8 |
Hard | Claude 3.5 | - | 28.71 | 33.76 | 35.64 | 31.63 | 34.0 | 28.87 | 37.26 | 35.71 | 13.63 | 16.0 | 12.0 | 26.0 | 10.0 |
Hard | InternVL2.5 | 26B | 23.78 | 21.33 | 26.0 | 31.0 | 7.0 | 32.00 | 46.0 | 32.0 | 18.0 | 18.00 | 16.0 | 24.0 | 14.0 |
Hard | InternVL2.5 | 8B | 22.67 | 20.00 | 18.0 | 33.0 | 9.0 | 30.00 | 46.0 | 30.0 | 14.0 | 18.00 | 16.0 | 28.0 | 10.0 |
Hard | InternVL2.5 | 4B | 19.56 | 18.67 | 18.0 | 28.0 | 8.0 | 28.00 | 34.0 | 24.0 | 26.0 | 12.00 | 8.0 | 22.0 | 6.0 |
Hard | LLaVA Next | 32B | 16.22 | 20.67 | 16.0 | 32.0 | 14.0 | 11.33 | 12.0 | 12.0 | 10.0 | 16.67 | 10.0 | 30.0 | 10.0 |
Hard | LLaVA Video | 7B | 19.78 | 19.33 | 12.0 | 35.0 | 11.0 | 24.67 | 26.0 | 30.0 | 18.0 | 15.33 | 10.0 | 28.0 | 8.0 |
Hard | LLaVA OneVision | 7B | 13.67 | 14.33 | 5.0 | 27.0 | 11.0 | 14.67 | 18.0 | 8.0 | 18.0 | 12.0 | 6.0 | 22.0 | 8.0 |
Hard | Qwen2.5 VL | 32B | 22.66 | 19.33 | 11.0 | 34.0 | 13.0 | 35.33 | 46.0 | 24.0 | 36.0 | 13.33 | 4.0 | 26.0 | 10.0 |
Hard | Qwen2.5 VL | 7B | 22.89 | 26.00 | 17.0 | 30.0 | 31.0 | 30.00 | 40.0 | 32.0 | 18.0 | 12.67 | 2.0 | 30.0 | 6.0 |
Medium | GPT 4o | - | 36.99 | 45.49 | 48.48 | 55 | 33 | 33.89 | 41.67 | 26.67 | 33.33 | 31.33 | 24 | 44 | 26 |
Medium | Gemini 2.5 Pro | - | 36.46 | 42.79 | 38.38 | 59.0 | 31.0 | 33.93 | 39.58 | 28.89 | 33.33 | 32.67 | 28.0 | 44.0 | 26.0 |
Medium | Gemini 2.5 flash think 🥇 | - | 37.52 | 47.82 | 46.47 | 56.00 | 41.00 | 36.99 | 43.75 | 42.22 | 25.00 | 28.00 | 12.00 | 44.00 | 28.00 |
Medium | Gemini 2.5 flash no-think | - | 36.70 | 47.50 | 48.49 | 58.00 | 36.00 | 33.93 | 39.58 | 28.89 | 33.33 | 28.67 | 24.00 | 42.00 | 20.00 |
Medium | Gemini 1.5 Pro | - | 33.89 | 39.47 | 42.42 | 42 | 34 | 33.52 | 33.33 | 42.22 | 25 | 28.67 | 12 | 52 | 22 |
Medium | Claude 3.5 | - | 35.35 | 41.78 | 35.35 | 50.0 | 40.0 | 35.60 | 39.58 | 42.22 | 25.0 | 28.67 | 16.0 | 44.0 | 26.0 |
Medium | InternVL2.5 | 26B | 35.11 | 36.00 | 39.0 | 50.0 | 19.0 | 36.67 | 50.0 | 36.0 | 24.0 | 32.67 | 30.0 | 40.0 | 28.0 |
Medium | InternVL2.5 | 8B | 34.66 | 37.33 | 43.0 | 57.0 | 12.0 | 35.33 | 42.0 | 46.0 | 18.0 | 31.33 | 26.0 | 44.0 | 24.0 |
Medium | InternVL2.5 | 4B | 33.89 | 39.67 | 38.0 | 53.0 | 28.0 | 32.67 | 44.0 | 28.0 | 26.0 | 29.33 | 16.0 | 46.0 | 26.0 |
Medium | LLaVA Next | 32B | 20.0 | 27.33 | 16.0 | 49.0 | 17.0 | 10.67 | 14.0 | 10.0 | 8.0 | 22.0 | 16.0 | 36.0 | 14.0 |
Medium | LLaVA Video | 7B | 25.67 | 25.00 | 20.0 | 34.0 | 26.0 | 28.67 | 36.0 | 28.0 | 22.0 | 23.33 | 14.0 | 40.0 | 16.0 |
Medium | LLaVA OneVision | 7B | 16.67 | 16.00 | 26.0 | 30.0 | 16.0 | 14.67 | 18.0 | 8.0 | 18.0 | 19.33 | 12.0 | 30.0 | 16.0 |
Medium | Qwen2.5 VL | 32B | 28.55 | 28.33 | 21.0 | 44.0 | 20.0 | 33.33 | 40.0 | 30.0 | 30.0 | 24.00 | 8.0 | 40.0 | 24.0 |
Medium | Qwen2.5 VL | 7B | 29.89 | 39.00 | 37.0 | 42.0 | 38.0 | 30.67 | 32.0 | 40.0 | 20.0 | 20.00 | 16.0 | 26.0 | 18.0 |
Easy | GPT 4o | - | 42.17 | 52.35 | 59 | 47.06 | 51 | 47.16 | 54.9 | 44.9 | 41.67 | 27.00 | 44 | 5 | 32 |
Easy | Gemini 2.5 Pro 🥇 | - | 54.56 | 62.96 | 70.0 | 55.88 | 63.0 | 54.73 | 52.94 | 59.18 | 52.08 | 46.00 | 40.0 | 54.0 | 44.0 |
Easy | Gemini 2.5 flash think | - | 50.00 | 67.56 | 69.00 | 65.69 | 68.00 | 44.45 | 52.94 | 40.82 | 39.58 | 38.00 | 32.00 | 38.00 | 44.00 |
Easy | Gemini 2.5 flash no-think | - | 51.40 | 58.97 | 70.00 | 54.90 | 52.00 | 46.56 | 52.94 | 36.74 | 50.00 | 48.67 | 38.00 | 56.00 | 52.00 |
Easy | Gemini 1.5 Pro | - | 46.00 | 51.33 | 60 | 50 | 44 | 36.92 | 49.02 | 36.73 | 25 | 50.00 | 58 | 44 | 48 |
Easy | Claude 3.5 | - | 48.59 | 60.33 | 61.0 | 50.0 | 70.0 | 36.35 | 35.29 | 51.02 | 22.73 | 49.33 | 64.0 | 44.0 | 40.0 |
Easy | InternVL2.5 | 26B | 52.55 | 61.00 | 62.0 | 59.0 | 62.0 | 45.33 | 58.0 | 44.0 | 34.0 | 51.33 | 62.0 | 62.0 | 30.0 |
Easy | InternVL2.5 | 8B | 50.11 | 55.67 | 55.0 | 60.0 | 52.0 | 44.67 | 58.0 | 42.0 | 34.0 | 50.00 | 54.0 | 64.0 | 32.0 |
Easy | InternVL2.5 | 4B | 44.89 | 53.33 | 46.0 | 60.0 | 54.0 | 37.33 | 48.0 | 38.0 | 26.0 | 44.00 | 44.0 | 48.0 | 40.0 |
Easy | LLaVA Next | 32B | 31.25 | 38.00 | 35.0 | 45.0 | 34.0 | 21.33 | 12.0 | 14.0 | 38.0 | 34.67 | 20.0 | 50.0 | 34.0 |
Easy | LLaVA Video | 7B | 31.44 | 33.00 | 30.0 | 31.0 | 38.0 | 33.33 | 38.0 | 36.0 | 26.0 | 28.00 | 16.0 | 32.0 | 36.0 |
Easy | LLaVA OneVision | 7B | 29.78 | 32.00 | 31.0 | 33.0 | 32.0 | 24.00 | 26.0 | 30.0 | 16.0 | 33.33 | 28.0 | 36.0 | 36.0 |
Easy | Qwen2.5 VL | 32B | 43.22 | 51.00 | 58.0 | 50.0 | 45.0 | 41.33 | 46.0 | 38.0 | 40.0 | 37.33 | 32.0 | 44.0 | 36.0 |
Easy | Qwen2.5 VL | 7B | 40.67 | 51.33 | 55.0 | 42.0 | 57.0 | 36.00 | 32.0 | 42.0 | 34.0 | 34.67 | 34.0 | 28.0 | 42.0 |

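
In the table above, each per-length Avg. value appears to be the unweighted mean of that length's Temporal, Spatial, and Intent scores, and Over. Avg. appears to be the unweighted mean of the three per-length averages. This is an observation from the reported numbers, not a documented aggregation rule, and a few rows deviate by a few hundredths. The sketch below reproduces the Hard-difficulty Gemini 2.5 Pro row under that assumption.

```python
def mean(values):
    """Unweighted arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)


# Gemini 2.5 Pro, Hard difficulty, copied from the table above:
# (Temporal, Spatial, Intent) accuracy for each video length.
row = {
    "short": (36.63, 44.90, 23.0),
    "medium": (45.10, 30.36, 31.82),
    "long": (10.0, 28.0, 18.0),
}

# Assumed aggregation: average the three categories per video length,
# then average the three per-length values.
scenario_avg = {name: round(mean(scores), 2) for name, scores in row.items()}
overall_avg = round(mean(list(scenario_avg.values())), 2)

print(scenario_avg)  # {'short': 34.84, 'medium': 35.76, 'long': 18.67}
print(overall_avg)   # 29.76
```

These values match the reported 34.84, 35.76, 18.67, and 29.76 for that row. Because the mean is unweighted, a video length with fewer questions counts as much as one with more.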

Difficulty | Models | Size | Over. Avg. | Short Video Avg. | Short Video Temporal | Short Video Spatial | Short Video Intent | Medium Video Avg. | Medium Video Temporal | Medium Video Spatial | Medium Video Intent | Long Video Avg. | Long Video Temporal | Long Video Spatial | Long Video Intent |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Hard | GPT 4o | - | 18.11 | 21.33 | 16.00 | 26.00 | 22.00 | 14.67 | 12.00 | 30.00 | 2.00 | 18.33 | 5.00 | 35.00 | 15.00 |
Hard | Gemini 2.5 Pro 🥇 | - | 31.39 | 32.83 | 36.0 | 24.49 | 38.0 | 24.67 | 32.0 | 22.0 | 20.0 | 36.67 | 30.0 | 15.0 | 65.0 |
Hard | Gemini 2.5 flash think | - | 25.78 | 26.00 | 26.00 | 18.00 | 34.00 | 21.33 | 28.00 | 18.00 | 18.00 | 30.00 | 30.00 | 10.00 | 50.00 |
Hard | Gemini 2.5 flash no-think | - | 25.44 | 25.33 | 22.00 | 28.00 | 26.00 | 26.00 | 26.00 | 28.00 | 24.00 | 25.00 | 0.00 | 40.00 | 35.00 |
Hard | Gemini 1.5 Pro | - | 22.34 | 26.67 | 24.00 | 26.00 | 30.00 | 18.67 | 20.00 | 22.00 | 14.00 | 21.67 | 10.00 | 25.00 | 30.00 |
Hard | Claude 3.5 | - | 24.22 | 26.00 | 18.0 | 32.0 | 28.0 | 23.33 | 20.0 | 28.0 | 22.0 | 23.33 | 10.0 | 40.0 | 20.0 |
Hard | InternVL2.5 | 26B | 17.33 | 19.33 | 24.00 | 26.00 | 10.00 | 19.33 | 16.00 | 32.00 | 10.00 | 13.33 | 10.00 | 10.00 | 20.00 |
Hard | InternVL2.5 | 8B | 18.22 | 18.67 | 20.00 | 28.00 | 8.00 | 19.33 | 16.00 | 30.00 | 12.00 | 16.67 | 5.00 | 35.00 | 10.00 |
Hard | InternVL2.5 | 4B | 15.33 | 15.33 | 14.00 | 10.00 | 22.00 | 14.00 | 16.00 | 18.00 | 8.00 | 16.67 | 15.00 | 30.00 | 5.00 |
Hard | LLaVA Next | 32B | 17.89 | 18.67 | 14.0 | 34.0 | 8.0 | 16.67 | 6.0 | 32.0 | 12.0 | 18.33 | 5.0 | 40.0 | 10.0 |
Hard | LLaVA Video | 7B | 14.78 | 16.67 | 14.00 | 28.00 | 8.00 | 12.67 | 6.00 | 22.00 | 10.00 | 15.00 | 5.00 | 30.00 | 10.00 |
Hard | LLaVA OneVision | 7B | 15.67 | 16.00 | 12.00 | 28.00 | 8.00 | 16.00 | 12.00 | 26.00 | 10.00 | 15.00 | 10.00 | 25.00 | 10.00 |
Hard | Qwen2.5 VL | 32B | 16.22 | 20.00 | 6.00 | 36.00 | 18.00 | 15.33 | 4.00 | 24.00 | 18.00 | 13.33 | 0.00 | 30.00 | 10.00 |
Hard | Qwen2.5 VL | 7B | 16.55 | 19.33 | 0.00 | 30.00 | 28.00 | 15.33 | 2.00 | 30.00 | 14.00 | 15.00 | 5.00 | 30.00 | 10.00 |
Medium | GPT 4o | - | 38.45 | 38.67 | 38.00 | 56.00 | 22.00 | 30.00 | 38.00 | 34.00 | 18.00 | 46.67 | 65.00 | 30.00 | 45.00 |
Medium | Gemini 2.5 Pro | - | 43.11 | 44.67 | 42.0 | 40.0 | 52.0 | 31.33 | 34.0 | 34.0 | 26.0 | 53.33 | 60.0 | 35.0 | 65.0 |
Medium | Gemini 2.5 flash think | - | 39.78 | 39.33 | 32.00 | 38.00 | 48.00 | 30.00 | 34.00 | 28.00 | 28.00 | 50.00 | 65.00 | 15.00 | 70.00 |
Medium | Gemini 2.5 flash no-think 🥇 | - | 49.67 | 43.33 | 30.00 | 48.00 | 52.00 | 40.67 | 38.00 | 50.00 | 34.00 | 65.00 | 60.00 | 65.00 | 70.00 |
Medium | Gemini 1.5 Pro | - | 38.78 | 38.00 | 32.00 | 48.00 | 34.00 | 36.67 | 34.00 | 52.00 | 24.00 | 41.67 | 30.00 | 55.00 | 40.00 |
Medium | Claude 3.5 | - | 39.67 | 38.00 | 26.0 | 40.0 | 48.0 | 36.00 | 32.0 | 54.0 | 22.0 | 45.00 | 50.0 | 35.0 | 50.0 |
Medium | InternVL2.5 | 26B | 28.67 | 31.33 | 28.00 | 58.00 | 8.00 | 24.67 | 12.00 | 50.00 | 12.00 | 30.00 | 25.00 | 45.00 | 20.00 |
Medium | InternVL2.5 | 8B | 34.33 | 30.00 | 20.00 | 58.00 | 12.00 | 34.67 | 32.00 | 50.00 | 22.00 | 38.33 | 40.00 | 45.00 | 30.00 |
Medium | InternVL2.5 | 4B | 32.22 | 29.33 | 28.00 | 44.00 | 16.00 | 34.00 | 30.00 | 54.00 | 18.00 | 33.33 | 35.00 | 40.00 | 25.00 |
Medium | LLaVA Next | 32B | 26.11 | 24.67 | 18.0 | 40.0 | 16.0 | 25.33 | 18.0 | 40.0 | 18.0 | 28.33 | 25.0 | 40.0 | 20.0 |
Medium | LLaVA Video | 7B | 24.00 | 25.33 | 24.00 | 36.00 | 16.00 | 20.00 | 16.00 | 26.00 | 18.00 | 26.67 | 15.00 | 45.00 | 20.00 |
Medium | LLaVA OneVision | 7B | 23.67 | 23.33 | 20.00 | 34.00 | 16.00 | 22.67 | 20.00 | 32.00 | 16.00 | 25.00 | 20.00 | 35.00 | 20.00 |
Medium | Qwen2.5 VL | 32B | 33.34 | 32.67 | 12.00 | 48.00 | 38.00 | 30.67 | 22.00 | 50.00 | 20.00 | 36.67 | 20.00 | 60.00 | 30.00 |
Medium | Qwen2.5 VL | 7B | 28.00 | 24.67 | 16.00 | 24.00 | 34.00 | 26.00 | 24.00 | 26.00 | 28.00 | 33.33 | 35.00 | 20.00 | 45.00 |
Easy | GPT 4o | - | 40.67 | 35.33 | 30.00 | 28.00 | 48.00 | 36.67 | 24.00 | 38.00 | 48.00 | 50.00 | 45.00 | 50.00 | 55.00 |
Easy | Gemini 2.5 Pro 🥇 | - | 52.56 | 56.00 | 60.0 | 48.0 | 60.0 | 40.00 | 40.0 | 36.0 | 44.0 | 61.67 | 75.0 | 35.0 | 75.0 |
Easy | Gemini 2.5 flash think | - | 50.67 | 49.33 | 40.00 | 46.00 | 62.00 | 46.00 | 46.00 | 44.00 | 48.00 | 56.67 | 55.00 | 40.00 | 75.00 |
Easy | Gemini 2.5 flash no-think | - | 50.78 | 49.33 | 36.00 | 52.00 | 60.00 | 48.00 | 40.00 | 50.00 | 54.00 | 55.00 | 60.00 | 50.00 | 55.00 |
Easy | Gemini 1.5 Pro | - | 43.00 | 45.33 | 36.00 | 44.00 | 56.00 | 42.00 | 48.00 | 32.00 | 46.00 | 41.67 | 35.00 | 50.00 | 40.00 |
Easy | Claude 3.5 | - | 42.45 | 38.00 | 34.0 | 38.0 | 42.0 | 42.67 | 30.0 | 56.0 | 42.0 | 46.67 | 40.0 | 45.0 | 55.0 |
Easy | InternVL2.5 | 26B | 36.11 | 35.33 | 36.00 | 44.00 | 26.00 | 34.67 | 28.00 | 46.00 | 30.00 | 38.33 | 30.00 | 40.00 | 45.00 |
Easy | InternVL2.5 | 8B | 38.44 | 36.67 | 28.00 | 46.00 | 36.00 | 35.33 | 32.00 | 42.00 | 32.00 | 43.33 | 60.00 | 40.00 | 30.00 |
Easy | InternVL2.5 | 4B | 40.33 | 43.33 | 42.00 | 50.00 | 38.00 | 39.33 | 30.00 | 44.00 | 44.00 | 38.33 | 35.00 | 60.00 | 20.00 |
Easy | LLaVA Next | 32B | 33.22 | 36.67 | 36.00 | 42.0 | 32.0 | 31.33 | 36.0 | 32.0 | 26.0 | 31.67 | 35.0 | 30.0 | 30.0 |
Easy | LLaVA Video | 7B | 33.22 | 33.33 | 34.00 | 38.00 | 28.00 | 34.67 | 34.00 | 38.00 | 32.00 | 31.67 | 35.00 | 30.00 | 30.00 |
Easy | LLaVA OneVision | 7B | 33.22 | 33.33 | 34.00 | 38.00 | 28.00 | 34.67 | 34.00 | 38.00 | 32.00 | 31.67 | 35.00 | 30.00 | 30.00 |
Easy | Qwen2.5 VL | 32B | 52.45 | 50.00 | 34.00 | 56.00 | 60.00 | 50.67 | 40.00 | 54.00 | 58.00 | 56.67 | 55.00 | 60.00 | 55.00 |
Easy | Qwen2.5 VL | 7B | 39.89 | 33.33 | 28.00 | 18.00 | 54.00 | 38.00 | 48.00 | 16.00 | 50.00 | 48.33 | 55.00 | 30.00 | 60.00 |


Difficulty | Models | Size | Over. Avg. | River Avg. | River Temporal | River Spatial | River Intent | Ocean Avg. | Ocean Temporal | Ocean Spatial | Ocean Intent |
---|---|---|---|---|---|---|---|---|---|---|---|
Hard | GPT 4o | - | 22.10 | 28.20 | 38.46 | 26.92 | 19.23 | 16.00 | 18.00 | 18.00 | 12.00 |
Hard | Gemini 2.5 Pro 🥇 | - | 29.64 | 34.62 | 23.08 | 34.62 | 46.15 | 24.67 | 38.0 | 16.0 | 20.0 |
Hard | Gemini 2.5 flash think | - | 27.36 | 32.05 | 30.77 | 26.92 | 38.46 | 22.67 | 30.00 | 22.00 | 16.00 |
Hard | Gemini 2.5 flash no-think | - | 27.44 | 28.21 | 42.31 | 19.23 | 23.08 | 26.67 | 36.00 | 20.00 | 24.00 |
Hard | Gemini 1.5 Pro | - | 26.02 | 26.92 | 23.08 | 30.77 | 26.92 | 25.11 | 34.00 | 20.93 | 20.41 |
Hard | Claude 3.5 | - | 25.44 | 28.20 | 19.23 | 19.23 | 46.15 | 22.67 | 26.0 | 22.0 | 20.0 |
Hard | InternVL2.5 | 26B | 22.54 | 23.08 | 15.38 | 19.23 | 34.62 | 22.00 | 18.00 | 28.00 | 20.00 |
Hard | InternVL2.5 | 8B | 21.90 | 21.79 | 7.69 | 26.92 | 30.77 | 22.00 | 16.00 | 28.00 | 22.00 |
Hard | InternVL2.5 | 4B | 20.92 | 20.51 | 19.23 | 19.23 | 23.08 | 21.33 | 16.00 | 26.00 | 22.00 |
Hard | LLaVA Next | 32B | 14.39 | 11.54 | 7.69 | 19.23 | 7.69 | 15.33 | 8.0 | 30.0 | 8.0 |
Hard | LLaVA Video | 7B | 14.00 | 16.67 | 15.38 | 23.08 | 11.54 | 11.33 | 8.00 | 20.00 | 6.00 |
Hard | LLaVA OneVision | 7B | 15.67 | 16.67 | 11.54 | 26.92 | 11.54 | 14.67 | 8.00 | 28.00 | 8.00 |
Hard | Qwen2.5 VL | 32B | 13.39 | 14.10 | 7.69 | 23.08 | 11.54 | 12.67 | 8.0 | 24.0 | 6.0 |
Hard | Qwen2.5 VL | 7B | 14.67 | 16.67 | 7.69 | 30.77 | 11.54 | 12.67 | 6.00 | 24.00 | 8.00 |
Medium | GPT 4o | - | 38.49 | 42.31 | 50.00 | 53.85 | 23.08 | 34.67 | 36.00 | 48.00 | 20.00 |
Medium | Gemini 2.5 Pro | - | 41.77 | 44.87 | 30.77 | 61.54 | 42.31 | 38.67 | 48.0 | 46.0 | 22.0 |
Medium | Gemini 2.5 flash think 🥇 | - | 48.26 | 53.85 | 61.54 | 57.70 | 42.31 | 42.67 | 52.00 | 42.00 | 34.00 |
Medium | Gemini 2.5 flash no-think | - | 46.12 | 50.00 | 46.15 | 57.69 | 46.15 | 42.00 | 56.00 | 44.00 | 26.00 |
Medium | Gemini 1.5 Pro | - | 46.31 | 53.84 | 46.15 | 65.38 | 50.00 | 38.78 | 34.00 | 49.02 | 33.33 |
Medium | Claude 3.5 | - | 38.62 | 35.90 | 34.62 | 50.0 | 23.08 | 41.33 | 42.0 | 54.0 | 28.0 |
Medium | InternVL2.5 | 26B | 41.77 | 44.87 | 30.77 | 57.69 | 46.15 | 38.67 | 24.00 | 62.00 | 30.00 |
Medium | InternVL2.5 | 8B | 41.08 | 46.15 | 34.62 | 61.54 | 42.31 | 36.00 | 34.00 | 60.00 | 14.00 |
Medium | InternVL2.5 | 4B | 44.36 | 48.72 | 23.08 | 65.38 | 57.69 | 40.00 | 28.00 | 60.00 | 32.00 |
Medium | LLaVA Next | 32B | 20.88 | 23.08 | 11.54 | 38.46 | 19.23 | 18.67 | 10.00 | 30.00 | 16.00 |
Medium | LLaVA Video | 7B | 21.92 | 20.51 | 19.23 | 26.92 | 15.38 | 23.33 | 20.00 | 30.00 | 20.00 |
Medium | LLaVA OneVision | 7B | 22.54 | 23.08 | 19.23 | 30.77 | 19.23 | 22.00 | 14.00 | 34.00 | 18.00 |
Medium | Qwen2.5 VL | 32B | 33.31 | 34.62 | 19.23 | 50.00 | 34.62 | 32.00 | 20.00 | 50.00 | 26.00 |
Medium | Qwen2.5 VL | 7B | 24.08 | 29.49 | 19.23 | 30.77 | 38.46 | 18.67 | 18.00 | 26.00 | 12.00 |
Easy | GPT 4o | - | 50.51 | 57.69 | 57.69 | 50.00 | 65.38 | 43.33 | 66.00 | 34.00 | 30.00 |
Easy | Gemini 2.5 Pro | - | 61.05 | 64.10 | 57.69 | 57.69 | 76.92 | 58.00 | 72.0 | 50.0 | 52.0 |
Easy | Gemini 2.5 flash think 🥇 | - | 62.03 | 65.39 | 80.77 | 42.31 | 73.08 | 58.67 | 70.00 | 52.00 | 54.00 |
Easy | Gemini 2.5 flash no-think | - | 58.18 | 57.69 | 57.69 | 38.46 | 76.92 | 58.67 | 80.00 | 42.00 | 54.00 |
Easy | Gemini 1.5 Pro | - | 50.69 | 52.56 | 42.31 | 61.54 | 53.85 | 48.81 | 50.00 | 46.43 | 50.00 |
Easy | Claude 3.5 | - | 49.39 | 47.44 | 50.0 | 53.85 | 38.46 | 51.33 | 62.0 | 52.0 | 40.0 |
Easy | InternVL2.5 | 26B | 55.05 | 64.10 | 65.38 | 57.69 | 69.23 | 46.00 | 50.00 | 50.00 | 38.00 |
Easy | InternVL2.5 | 8B | 53.47 | 60.26 | 69.23 | 46.15 | 65.38 | 46.67 | 46.00 | 54.00 | 40.00 |
Easy | InternVL2.5 | 4B | 53.87 | 56.41 | 53.85 | 57.69 | 57.69 | 51.33 | 52.00 | 56.00 | 46.00 |
Easy | LLaVA Next | 32B | 35.59 | 37.18 | 26.92 | 53.85 | 30.77 | 34.00 | 30.00 | 38.00 | 34.00 |
Easy | LLaVA Video | 7B | 31.03 | 32.05 | 30.77 | 34.62 | 30.77 | 30.00 | 22.00 | 38.00 | 30.00 |
Easy | LLaVA OneVision | 7B | 33.00 | 33.33 | 34.62 | 34.62 | 30.77 | 32.67 | 28.00 | 38.00 | 32.00 |
Easy | Qwen2.5 VL | 32B | 52.77 | 61.54 | 53.85 | 61.54 | 69.23 | 44.00 | 40.00 | 54.00 | 38.00 |
Easy | Qwen2.5 VL | 7B | 31.31 | 34.62 | 38.46 | 19.23 | 46.15 | 28.00 | 36.00 | 22.00 | 26.00 |

@article{gu2025m4r,
title={Measuring Massive Multimodal Understanding and Reasoning in Open Space},
author={Gu, Shangding and Wang, Xiaohan and Ying, Donghao and Zhao, Haoyu and Yang, Runing and Li, Boyi and Jin, Ming and Pavone, Marco and Yeung-Levy, Serena and Wang, Jun and Song, Dawn and Spanos, Costas},
journal={Github},
year={2025}
}