Abstract

The increasing sophistication of multimodal models calls for benchmarks that rigorously evaluate their understanding and reasoning in complex, safety-critical, open-world scenarios. This study introduces M4R (Measuring Massive Multimodal Understanding and Reasoning), a large-scale benchmark uniquely designed to assess reasoning capabilities across diverse open spaces, comprehensively covering land, air, and water environments. M4R comprises approximately 2,000 videos and over 19,000 human-annotated question-answer pairs. The videos vary in length (short, medium, long) and pose tasks of tiered difficulty (interval-based and accuracy-based choices) across distinct operational domains: land-based scenarios focus primarily on traffic environments, particularly traffic collisions and accident cases; air-based scenarios center on airplane navigation; and water-based scenarios involve ship movements. M4R systematically evaluates models on temporal-causal reasoning, spatial understanding, and intent and goal planning within these dynamic contexts. By providing a unified platform across this broad spectrum of domains, M4R aims to drive the development of more robust and generalizable AI systems. Benchmarking state-of-the-art multimodal models on our dataset reveals that even leading models, such as GPT-4o and Gemini, achieve only around a 20% success rate at the hardest difficulty level, highlighting the significant challenges that remain in open-space multimodal reasoning.

Open Space Scenarios

Leaderboard of Benchmark Evaluation in the Open Space (Land and Air) Domain

Evaluation of the Open Space domain using Short, Medium, and Long videos, categorized by reasoning types: temporal, spatial, and intent reasoning. All reported numbers reflect the combined results of Land Space and Air Space. Water Space is listed separately, as it includes only river and ocean videos without categorization by video length (short, medium, long). The background color transitions from light blue to light purple, reflecting an increase in video length and indicating a gradual rise in task difficulty.
| Difficulty | Model | Size | Overall Avg. | Short Avg. | Short Temp. | Short Spat. | Short Int. | Medium Avg. | Medium Temp. | Medium Spat. | Medium Int. | Long Avg. | Long Temp. | Long Spat. | Long Int. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Hard | GPT 4o | - | 21.26 | 24.05 | 25.32 | 30.34 | 16.5 | 25.19 | 27.57 | 31.07 | 16.91 | 14.66 | 5.5 | 30.5 | 8 |
| Hard | Gemini 2.5 Pro 🥇 | - | 30.58 | 33.84 | 36.32 | 34.70 | 30.50 | 30.22 | 38.55 | 26.18 | 25.91 | 27.67 | 20.00 | 21.50 | 41.50 |
| Hard | Gemini 1.5 Pro | - | 20.55 | 23.2 | 23.88 | 23.2 | 22.5 | 21.61 | 26.66 | 19.04 | 19.12 | 16.84 | 6 | 25.5 | 19 |
| Hard | Claude 3.5 | - | 26.46 | 29.88 | 26.82 | 31.82 | 31.0 | 26.10 | 28.63 | 31.86 | 17.82 | 19.66 | 11.0 | 33.0 | 15.0 |
| Hard | InternVL2.5 | 26B | 20.55 | 20.33 | 25 | 28.5 | 8.5 | 25.66 | 31 | 32 | 14 | 15.66 | 13 | 17 | 17 |
| Hard | InternVL2.5 | 8B | 20.45 | 19.34 | 19 | 30.5 | 8.5 | 24.66 | 31 | 30 | 13 | 17.34 | 10.5 | 31.5 | 10 |
| Hard | InternVL2.5 | 4B | 17.45 | 17 | 16 | 19 | 15 | 21 | 25 | 21 | 17 | 14.34 | 11.5 | 26 | 5.5 |
| Hard | LLaVA Next | 32B | 17.05 | 19.67 | 15 | 33 | 11 | 14 | 9 | 22 | 11 | 17.5 | 7.5 | 35 | 10 |
| Hard | LLaVA Video | 7B | 17.28 | 18 | 13 | 31.5 | 9.5 | 18.67 | 16 | 26 | 14 | 15.16 | 7.5 | 29 | 9 |
| Hard | LLaVA OneVision | 7B | 14.67 | 15.16 | 8.5 | 27.5 | 9.5 | 15.34 | 15 | 17 | 14 | 13.5 | 8 | 23.5 | 9 |
| Hard | Qwen2.5 VL | 32B | 19.44 | 19.66 | 8.5 | 35 | 15.5 | 25.33 | 25 | 24 | 27 | 13.33 | 2 | 28 | 10 |
| Hard | Qwen2.5 VL | 7B | 19.72 | 22.66 | 8.5 | 30 | 29.5 | 22.66 | 21 | 31 | 16 | 13.84 | 3.5 | 30 | 8 |
| Medium | GPT 4o | - | 37.72 | 42.08 | 43.24 | 55.5 | 27.5 | 31.95 | 39.84 | 30.34 | 25.66 | 39 | 44.5 | 37 | 35.5 |
| Medium | Gemini 2.5 Pro 🥇 | - | 39.78 | 43.73 | 40.19 | 49.5 | 41.5 | 32.63 | 36.79 | 31.44 | 29.66 | 43.0 | 44.0 | 39.5 | 45.5 |
| Medium | Gemini 1.5 Pro | - | 36.34 | 38.73 | 37.21 | 45 | 34 | 35.09 | 33.66 | 47.11 | 24.5 | 35.17 | 21 | 53.5 | 31 |
| Medium | Claude 3.5 | - | 37.51 | 39.89 | 30.68 | 45.0 | 44.0 | 35.80 | 35.79 | 48.11 | 23.5 | 36.84 | 33.0 | 39.5 | 38.0 |
| Medium | InternVL2.5 | 26B | 31.89 | 33.66 | 33.5 | 54 | 13.5 | 30.67 | 31 | 43 | 18 | 31.34 | 27.5 | 42.5 | 24 |
| Medium | InternVL2.5 | 8B | 34.49 | 33.66 | 31.5 | 57.5 | 12 | 35 | 37 | 48 | 20 | 34.83 | 33 | 44.5 | 27 |
| Medium | InternVL2.5 | 4B | 33.05 | 34.5 | 33 | 48.5 | 22 | 33.34 | 37 | 41 | 22 | 31.33 | 25.5 | 43 | 25.5 |
| Medium | LLaVA Next | 32B | 23.05 | 26 | 17 | 44.5 | 16.5 | 18 | 16 | 25 | 13 | 25.16 | 20.5 | 38 | 17 |
| Medium | LLaVA Video | 7B | 24.84 | 25.16 | 22 | 35 | 21 | 24.34 | 26 | 27 | 20 | 25 | 14.5 | 42.5 | 18 |
| Medium | LLaVA OneVision | 7B | 20.17 | 19.66 | 23 | 32 | 16 | 18.67 | 19 | 20 | 17 | 22.16 | 16 | 32.5 | 18 |
| Medium | Qwen2.5 VL | 32B | 30.95 | 30.5 | 16.5 | 46 | 29 | 32 | 31 | 40 | 25 | 30.34 | 14 | 50 | 27 |
| Medium | Qwen2.5 VL | 7B | 28.95 | 31.84 | 26.5 | 33 | 36 | 28.34 | 28 | 33 | 24 | 26.66 | 25.5 | 23 | 31.5 |
| Easy | GPT 4o | - | 41.42 | 43.84 | 44.5 | 37.53 | 49.5 | 41.91 | 39.45 | 41.45 | 44.84 | 38.5 | 44.5 | 27.5 | 43.5 |
| Easy | Gemini 2.5 Pro 🥇 | - | 53.56 | 59.48 | 65.00 | 51.94 | 61.50 | 47.36 | 46.47 | 47.59 | 48.04 | 53.84 | 57.50 | 44.50 | 59.50 |
| Easy | Gemini 1.5 Pro | - | 44.5 | 48.33 | 48 | 47 | 50 | 39.46 | 48.51 | 34.36 | 35.5 | 45.84 | 46.5 | 47 | 44 |
| Easy | Claude 3.5 | - | 45.52 | 49.16 | 47.5 | 44.0 | 56.0 | 39.51 | 32.64 | 53.51 | 32.36 | 48.0 | 52.0 | 44.5 | 47.5 |
| Easy | InternVL2.5 | 26B | 44.33 | 48.16 | 49 | 51.5 | 44 | 40 | 43 | 45 | 32 | 44.83 | 46 | 51 | 37.5 |
| Easy | InternVL2.5 | 8B | 44.27 | 46.17 | 41.5 | 53 | 44 | 40 | 45 | 42 | 33 | 46.66 | 57 | 52 | 31 |
| Easy | InternVL2.5 | 4B | 42.61 | 48.33 | 44 | 55 | 46 | 38.33 | 39 | 41 | 35 | 41.16 | 39.5 | 54 | 30 |
| Easy | LLaVA Next | 32B | 32.23 | 37.34 | 35.5 | 43.5 | 33 | 26.33 | 24 | 23 | 32 | 33.17 | 27.5 | 40 | 32 |
| Easy | LLaVA Video | 7B | 32.33 | 33.16 | 32 | 34.5 | 33 | 34 | 36 | 37 | 29 | 29.84 | 25.5 | 31 | 33 |
| Easy | LLaVA OneVision | 7B | 31.5 | 32.66 | 32.5 | 35.5 | 30 | 29.34 | 30 | 34 | 24 | 32.5 | 31.5 | 33 | 33 |
| Easy | Qwen2.5 VL | 32B | 47.84 | 50.5 | 46 | 53 | 52.5 | 46 | 43 | 46 | 49 | 47 | 43.5 | 52 | 45.5 |
| Easy | Qwen2.5 VL | 7B | 40.28 | 42.33 | 41.5 | 30 | 55.5 | 37 | 40 | 29 | 42 | 41.5 | 44.5 | 29 | 51 |
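The averages in these leaderboards appear to follow a simple two-level aggregation: each per-length average is the mean of the temporal, spatial, and intent scores, and the overall average is the mean of the three per-length averages. The sketch below illustrates this; it is an assumption inferred from the table structure, not the authors' released evaluation code.

```python
# Sketch of the aggregation implied by the leaderboard columns (assumed,
# not the authors' released code).

def length_avg(temporal: float, spatial: float, intent: float) -> float:
    """Mean accuracy over the three reasoning categories for one video length."""
    return (temporal + spatial + intent) / 3

def overall_avg(short: float, medium: float, long: float) -> float:
    """Mean of the short/medium/long per-length averages."""
    return (short + medium + long) / 3

# Example using GPT 4o's Hard-difficulty Open Space row:
short = length_avg(25.32, 30.34, 16.5)    # ≈ 24.05
medium = length_avg(27.57, 31.07, 16.91)  # ≈ 25.18
long = length_avg(5.5, 30.5, 8.0)         # ≈ 14.67
overall = overall_avg(short, medium, long)
```

Small deviations between these unweighted means and a few reported cells (e.g., 21.30 computed here versus the reported 21.26 overall for GPT 4o) suggest the official numbers may weight categories by question count.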

Land Space Scenarios

Leaderboard of Benchmark Evaluation in the Land Space Domain

Evaluation of the Land Space domain using Short, Medium, and Long videos, categorized by reasoning types: temporal, spatial, and intent reasoning. The background color transitions from light blue to light purple, reflecting an increase in video length and indicating a gradual rise in task difficulty.
| Difficulty | Model | Size | Overall Avg. | Short Avg. | Short Temp. | Short Spat. | Short Int. | Medium Avg. | Medium Temp. | Medium Spat. | Medium Int. | Long Avg. | Long Temp. | Long Spat. | Long Int. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Hard | GPT 4o | - | 24.41 | 26.78 | 34.65 | 34.69 | 11 | 35.70 | 43.14 | 32.14 | 31.82 | 11.00 | 6 | 26 | 1 |
| Hard | Gemini 2.5 Pro 🥇 | - | 29.76 | 34.84 | 36.63 | 44.90 | 23.0 | 35.76 | 45.10 | 30.36 | 31.82 | 18.67 | 10.0 | 28.0 | 18.0 |
| Hard | Gemini 1.5 Pro | - | 18.76 | 19.72 | 23.76 | 20.41 | 15 | 24.55 | 33.33 | 16.07 | 24.24 | 12.00 | 2 | 26 | 8 |
| Hard | Claude 3.5 | - | 28.71 | 33.76 | 35.64 | 31.63 | 34.0 | 28.87 | 37.26 | 35.71 | 13.63 | 16.0 | 12.0 | 26.0 | 10.0 |
| Hard | InternVL2.5 | 26B | 23.78 | 21.33 | 26.0 | 31.0 | 7.0 | 32.00 | 46.0 | 32.0 | 18.0 | 18.00 | 16.0 | 24.0 | 14.0 |
| Hard | InternVL2.5 | 8B | 22.67 | 20.00 | 18.0 | 33.0 | 9.0 | 30.00 | 46.0 | 30.0 | 14.0 | 18.00 | 16.0 | 28.0 | 10.0 |
| Hard | InternVL2.5 | 4B | 19.56 | 18.67 | 18.0 | 28.0 | 8.0 | 28.00 | 34.0 | 24.0 | 26.0 | 12.00 | 8.0 | 22.0 | 6.0 |
| Hard | LLaVA Next | 32B | 16.22 | 20.67 | 16.0 | 32.0 | 14.0 | 11.33 | 12.0 | 12.0 | 10.0 | 16.67 | 10.0 | 30.0 | 10.0 |
| Hard | LLaVA Video | 7B | 19.78 | 19.33 | 12.0 | 35.0 | 11.0 | 24.67 | 26.0 | 30.0 | 18.0 | 15.33 | 10.0 | 28.0 | 8.0 |
| Hard | LLaVA OneVision | 7B | 13.67 | 14.33 | 5.0 | 27.0 | 11.0 | 14.67 | 18.0 | 8.0 | 18.0 | 12.0 | 6.0 | 22.0 | 8.0 |
| Hard | Qwen2.5 VL | 32B | 22.66 | 19.33 | 11.0 | 34.0 | 13.0 | 35.33 | 46.0 | 24.0 | 36.0 | 13.33 | 4.0 | 26.0 | 10.0 |
| Hard | Qwen2.5 VL | 7B | 22.89 | 26.00 | 17.0 | 30.0 | 31.0 | 30.00 | 40.0 | 32.0 | 18.0 | 12.67 | 2.0 | 30.0 | 6.0 |
| Medium | GPT 4o 🥇 | - | 36.99 | 45.49 | 48.48 | 55 | 33 | 33.89 | 41.67 | 26.67 | 33.33 | 31.33 | 24 | 44 | 26 |
| Medium | Gemini 2.5 Pro | - | 36.46 | 42.79 | 38.38 | 59.0 | 31.0 | 33.93 | 39.58 | 28.89 | 33.33 | 32.67 | 28.0 | 44.0 | 26.0 |
| Medium | Gemini 1.5 Pro | - | 33.89 | 39.47 | 42.42 | 42 | 34 | 33.52 | 33.33 | 42.22 | 25 | 28.67 | 12 | 52 | 22 |
| Medium | Claude 3.5 | - | 35.35 | 41.78 | 35.35 | 50.0 | 40.0 | 35.60 | 39.58 | 42.22 | 25.0 | 28.67 | 16.0 | 44.0 | 26.0 |
| Medium | InternVL2.5 | 26B | 35.11 | 36.00 | 39.0 | 50.0 | 19.0 | 36.67 | 50.0 | 36.0 | 24.0 | 32.67 | 30.0 | 40.0 | 28.0 |
| Medium | InternVL2.5 | 8B | 34.66 | 37.33 | 43.0 | 57.0 | 12.0 | 35.33 | 42.0 | 46.0 | 18.0 | 31.33 | 26.0 | 44.0 | 24.0 |
| Medium | InternVL2.5 | 4B | 33.89 | 39.67 | 38.0 | 53.0 | 28.0 | 32.67 | 44.0 | 28.0 | 26.0 | 29.33 | 16.0 | 46.0 | 26.0 |
| Medium | LLaVA Next | 32B | 20.0 | 27.33 | 16.0 | 49.0 | 17.0 | 10.67 | 14.0 | 10.0 | 8.0 | 22.0 | 16.0 | 36.0 | 14.0 |
| Medium | LLaVA Video | 7B | 25.67 | 25.00 | 20.0 | 34.0 | 26.0 | 28.67 | 36.0 | 28.0 | 22.0 | 23.33 | 14.0 | 40.0 | 16.0 |
| Medium | LLaVA OneVision | 7B | 16.67 | 16.00 | 26.0 | 30.0 | 16.0 | 14.67 | 18.0 | 8.0 | 18.0 | 19.33 | 12.0 | 30.0 | 16.0 |
| Medium | Qwen2.5 VL | 32B | 28.55 | 28.33 | 21.0 | 44.0 | 20.0 | 33.33 | 40.0 | 30.0 | 30.0 | 24.00 | 8.0 | 40.0 | 24.0 |
| Medium | Qwen2.5 VL | 7B | 29.89 | 39.00 | 37.0 | 42.0 | 38.0 | 30.67 | 32.0 | 40.0 | 20.0 | 20.00 | 16.0 | 26.0 | 18.0 |
| Easy | GPT 4o | - | 42.17 | 52.35 | 59 | 47.06 | 51 | 47.16 | 54.9 | 44.9 | 41.67 | 27.00 | 4 | 45 | 32 |
| Easy | Gemini 2.5 Pro 🥇 | - | 54.56 | 62.96 | 70.0 | 55.88 | 63.0 | 54.73 | 52.94 | 59.18 | 52.08 | 46.00 | 40.0 | 54.0 | 44.0 |
| Easy | Gemini 1.5 Pro | - | 46.00 | 51.33 | 60 | 50 | 44 | 36.92 | 49.02 | 36.73 | 25 | 50.00 | 58 | 44 | 48 |
| Easy | Claude 3.5 | - | 48.59 | 60.33 | 61.0 | 50.0 | 70.0 | 36.35 | 35.29 | 51.02 | 22.73 | 49.33 | 64.0 | 44.0 | 40.0 |
| Easy | InternVL2.5 | 26B | 52.55 | 61.00 | 62.0 | 59.0 | 62.0 | 45.33 | 58.0 | 44.0 | 34.0 | 51.33 | 62.0 | 62.0 | 30.0 |
| Easy | InternVL2.5 | 8B | 50.11 | 55.67 | 55.0 | 60.0 | 52.0 | 44.67 | 58.0 | 42.0 | 34.0 | 50.00 | 54.0 | 64.0 | 32.0 |
| Easy | InternVL2.5 | 4B | 44.89 | 53.33 | 46.0 | 60.0 | 54.0 | 37.33 | 48.0 | 38.0 | 26.0 | 44.00 | 44.0 | 48.0 | 40.0 |
| Easy | LLaVA Next | 32B | 31.25 | 38.00 | 35.0 | 45.0 | 34.0 | 21.33 | 12.0 | 14.0 | 38.0 | 34.67 | 20.0 | 50.0 | 34.0 |
| Easy | LLaVA Video | 7B | 31.44 | 33.00 | 30.0 | 31.0 | 38.0 | 33.33 | 38.0 | 36.0 | 26.0 | 28.00 | 16.0 | 32.0 | 36.0 |
| Easy | LLaVA OneVision | 7B | 29.78 | 32.00 | 31.0 | 33.0 | 32.0 | 24.00 | 26.0 | 30.0 | 16.0 | 33.33 | 28.0 | 36.0 | 36.0 |
| Easy | Qwen2.5 VL | 32B | 43.22 | 51.00 | 58.0 | 50.0 | 45.0 | 41.33 | 46.0 | 38.0 | 40.0 | 37.33 | 32.0 | 44.0 | 36.0 |
| Easy | Qwen2.5 VL | 7B | 40.67 | 51.33 | 55.0 | 42.0 | 57.0 | 36.00 | 32.0 | 42.0 | 34.0 | 34.67 | 34.0 | 28.0 | 42.0 |

Air Space Scenarios

Leaderboard of Benchmark Evaluation in the Air Space Domain

Evaluation of the Air Space domain using Short, Medium, and Long videos, categorized by reasoning types: temporal, spatial, and intent reasoning. The background color transitions from light blue to light purple, reflecting an increase in video length and indicating a gradual rise in task difficulty.
| Difficulty | Model | Size | Overall Avg. | Short Avg. | Short Temp. | Short Spat. | Short Int. | Medium Avg. | Medium Temp. | Medium Spat. | Medium Int. | Long Avg. | Long Temp. | Long Spat. | Long Int. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Hard | GPT 4o | - | 18.11 | 21.33 | 16.00 | 26.00 | 22.00 | 14.67 | 12.00 | 30.00 | 2.00 | 18.33 | 5.00 | 35.00 | 15.00 |
| Hard | Gemini 2.5 Pro 🥇 | - | 31.39 | 32.83 | 36.0 | 24.49 | 38.0 | 24.67 | 32.0 | 22.0 | 20.0 | 36.67 | 30.0 | 15.0 | 65.0 |
| Hard | Gemini 1.5 Pro | - | 22.34 | 26.67 | 24.00 | 26.00 | 30.00 | 18.67 | 20.00 | 22.00 | 14.00 | 21.67 | 10.00 | 25.00 | 30.00 |
| Hard | Claude 3.5 | - | 24.22 | 26.00 | 18.0 | 32.0 | 28.0 | 23.33 | 20.0 | 28.0 | 22.0 | 23.33 | 10.0 | 40.0 | 20.0 |
| Hard | InternVL2.5 | 26B | 17.33 | 19.33 | 24.00 | 26.00 | 10.00 | 19.33 | 16.00 | 32.00 | 10.00 | 13.33 | 10.00 | 10.00 | 20.00 |
| Hard | InternVL2.5 | 8B | 18.22 | 18.67 | 20.00 | 28.00 | 8.00 | 19.33 | 16.00 | 30.00 | 12.00 | 16.67 | 5.00 | 35.00 | 10.00 |
| Hard | InternVL2.5 | 4B | 15.33 | 15.33 | 14.00 | 10.00 | 22.00 | 14.00 | 16.00 | 18.00 | 8.00 | 16.67 | 15.00 | 30.00 | 5.00 |
| Hard | LLaVA Next | 32B | 17.89 | 18.67 | 14.0 | 34.0 | 8.0 | 16.67 | 6.0 | 32.0 | 12.0 | 18.33 | 5.0 | 40.0 | 10.0 |
| Hard | LLaVA Video | 7B | 14.78 | 16.67 | 14.00 | 28.00 | 8.00 | 12.67 | 6.00 | 22.00 | 10.00 | 15.00 | 5.00 | 30.00 | 10.00 |
| Hard | LLaVA OneVision | 7B | 15.67 | 16.00 | 12.00 | 28.00 | 8.00 | 16.00 | 12.00 | 26.00 | 10.00 | 15.00 | 10.00 | 25.00 | 10.00 |
| Hard | Qwen2.5 VL | 32B | 16.22 | 20.00 | 6.00 | 36.00 | 18.00 | 15.33 | 4.00 | 24.00 | 18.00 | 13.33 | 0.00 | 30.00 | 10.00 |
| Hard | Qwen2.5 VL | 7B | 16.55 | 19.33 | 0.00 | 30.00 | 28.00 | 15.33 | 2.00 | 30.00 | 14.00 | 15.00 | 5.00 | 30.00 | 10.00 |
| Medium | GPT 4o | - | 38.45 | 38.67 | 38.00 | 56.00 | 22.00 | 30.00 | 38.00 | 34.00 | 18.00 | 46.67 | 65.00 | 30.00 | 45.00 |
| Medium | Gemini 2.5 Pro 🥇 | - | 43.11 | 44.67 | 42.0 | 40.0 | 52.0 | 31.33 | 34.0 | 34.0 | 26.0 | 53.33 | 60.0 | 35.0 | 65.0 |
| Medium | Gemini 1.5 Pro | - | 38.78 | 38.00 | 32.00 | 48.00 | 34.00 | 36.67 | 34.00 | 52.00 | 24.00 | 41.67 | 30.00 | 55.00 | 40.00 |
| Medium | Claude 3.5 | - | 39.67 | 38.00 | 26.0 | 40.0 | 48.0 | 36.00 | 32.0 | 54.0 | 22.0 | 45.00 | 50.0 | 35.0 | 50.0 |
| Medium | InternVL2.5 | 26B | 28.67 | 31.33 | 28.00 | 58.00 | 8.00 | 24.67 | 12.00 | 50.00 | 12.00 | 30.00 | 25.00 | 45.00 | 20.00 |
| Medium | InternVL2.5 | 8B | 34.33 | 30.00 | 20.00 | 58.00 | 12.00 | 34.67 | 32.00 | 50.00 | 22.00 | 38.33 | 40.00 | 45.00 | 30.00 |
| Medium | InternVL2.5 | 4B | 32.22 | 29.33 | 28.00 | 44.00 | 16.00 | 34.00 | 30.00 | 54.00 | 18.00 | 33.33 | 35.00 | 40.00 | 25.00 |
| Medium | LLaVA Next | 32B | 26.11 | 24.67 | 18.0 | 40.0 | 16.0 | 25.33 | 18.0 | 40.0 | 18.0 | 28.33 | 25.0 | 40.0 | 20.0 |
| Medium | LLaVA Video | 7B | 24.00 | 25.33 | 24.00 | 36.00 | 16.00 | 20.00 | 16.00 | 26.00 | 18.00 | 26.67 | 15.00 | 45.00 | 20.00 |
| Medium | LLaVA OneVision | 7B | 23.67 | 23.33 | 20.00 | 34.00 | 16.00 | 22.67 | 20.00 | 32.00 | 16.00 | 25.00 | 20.00 | 35.00 | 20.00 |
| Medium | Qwen2.5 VL | 32B | 33.34 | 32.67 | 12.00 | 48.00 | 38.00 | 30.67 | 22.00 | 50.00 | 20.00 | 36.67 | 20.00 | 60.00 | 30.00 |
| Medium | Qwen2.5 VL | 7B | 28.00 | 24.67 | 16.00 | 24.00 | 34.00 | 26.00 | 24.00 | 26.00 | 28.00 | 33.33 | 35.00 | 20.00 | 45.00 |
| Easy | GPT 4o | - | 40.67 | 35.33 | 30.00 | 28.00 | 48.00 | 36.67 | 24.00 | 38.00 | 48.00 | 50.00 | 45.00 | 50.00 | 55.00 |
| Easy | Gemini 2.5 Pro 🥇 | - | 52.56 | 56.00 | 60.0 | 48.0 | 60.0 | 40.00 | 40.0 | 36.0 | 44.0 | 61.67 | 75.0 | 35.0 | 75.0 |
| Easy | Gemini 1.5 Pro | - | 43.00 | 45.33 | 36.00 | 44.00 | 56.00 | 42.00 | 48.00 | 32.00 | 46.00 | 41.67 | 35.00 | 50.00 | 40.00 |
| Easy | Claude 3.5 | - | 42.45 | 38.00 | 34.0 | 38.0 | 42.0 | 42.67 | 30.0 | 56.0 | 42.0 | 46.67 | 40.0 | 45.0 | 55.0 |
| Easy | InternVL2.5 | 26B | 36.11 | 35.33 | 36.00 | 44.00 | 26.00 | 34.67 | 28.00 | 46.00 | 30.00 | 38.33 | 30.00 | 40.00 | 45.00 |
| Easy | InternVL2.5 | 8B | 38.44 | 36.67 | 28.00 | 46.00 | 36.00 | 35.33 | 32.00 | 42.00 | 32.00 | 43.33 | 60.00 | 40.00 | 30.00 |
| Easy | InternVL2.5 | 4B | 40.33 | 43.33 | 42.00 | 50.00 | 38.00 | 39.33 | 30.00 | 44.00 | 44.00 | 38.33 | 35.00 | 60.00 | 20.00 |
| Easy | LLaVA Next | 32B | 33.22 | 36.67 | 36.0 | 42.0 | 32.0 | 31.33 | 36.0 | 32.0 | 26.0 | 31.67 | 35.0 | 30.0 | 30.0 |
| Easy | LLaVA Video | 7B | 33.22 | 33.33 | 34.00 | 38.00 | 28.00 | 34.67 | 34.00 | 38.00 | 32.00 | 31.67 | 35.00 | 30.00 | 30.00 |
| Easy | LLaVA OneVision | 7B | 33.22 | 33.33 | 34.00 | 38.00 | 28.00 | 34.67 | 34.00 | 38.00 | 32.00 | 31.67 | 35.00 | 30.00 | 30.00 |
| Easy | Qwen2.5 VL | 32B | 52.45 | 50.00 | 34.00 | 56.00 | 60.00 | 50.67 | 40.00 | 54.00 | 58.00 | 56.67 | 55.00 | 60.00 | 55.00 |
| Easy | Qwen2.5 VL | 7B | 39.89 | 33.33 | 28.00 | 18.00 | 54.00 | 38.00 | 48.00 | 16.00 | 50.00 | 48.33 | 55.00 | 30.00 | 60.00 |

Water Space Scenarios

Leaderboard of Benchmark Evaluation in the Water Space Domain

Evaluation of the Water Space domain using River and Ocean videos, categorized by reasoning types: temporal, spatial, and intent reasoning. The background color transitions from light blue to light purple, indicating a gradual rise in task difficulty.
| Difficulty | Model | Size | Overall Avg. | River Avg. | River Temp. | River Spat. | River Int. | Ocean Avg. | Ocean Temp. | Ocean Spat. | Ocean Int. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Hard | GPT 4o | - | 22.10 | 28.20 | 38.46 | 26.92 | 19.23 | 16.00 | 18.00 | 18.00 | 12.00 |
| Hard | Gemini 2.5 Pro 🥇 | - | 29.64 | 34.62 | 23.08 | 34.62 | 46.15 | 24.67 | 38.0 | 16.0 | 20.0 |
| Hard | Gemini 1.5 Pro | - | 26.02 | 26.92 | 23.08 | 30.77 | 26.92 | 25.11 | 34.00 | 20.93 | 20.41 |
| Hard | Claude 3.5 | - | 25.44 | 28.20 | 19.23 | 19.23 | 46.15 | 22.67 | 26.0 | 22.0 | 20.0 |
| Hard | InternVL2.5 | 26B | 22.54 | 23.08 | 15.38 | 19.23 | 34.62 | 22.00 | 18.00 | 28.00 | 20.00 |
| Hard | InternVL2.5 | 8B | 21.90 | 21.79 | 7.69 | 26.92 | 30.77 | 22.00 | 16.00 | 28.00 | 22.00 |
| Hard | InternVL2.5 | 4B | 20.92 | 20.51 | 19.23 | 19.23 | 23.08 | 21.33 | 16.00 | 26.00 | 22.00 |
| Hard | LLaVA Next | 32B | 14.39 | 11.54 | 7.69 | 19.23 | 7.69 | 15.33 | 8.0 | 30.0 | 8.0 |
| Hard | LLaVA Video | 7B | 14.00 | 16.67 | 15.38 | 23.08 | 11.54 | 11.33 | 8.00 | 20.00 | 6.00 |
| Hard | LLaVA OneVision | 7B | 15.67 | 16.67 | 11.54 | 26.92 | 11.54 | 14.67 | 8.00 | 28.00 | 8.00 |
| Hard | Qwen2.5 VL | 32B | 13.39 | 14.10 | 7.69 | 23.08 | 11.54 | 12.67 | 8.0 | 24.0 | 6.0 |
| Hard | Qwen2.5 VL | 7B | 14.67 | 16.67 | 7.69 | 30.77 | 11.54 | 12.67 | 6.00 | 24.00 | 8.00 |
| Medium | GPT 4o | - | 38.49 | 42.31 | 50.00 | 53.85 | 23.08 | 34.67 | 36.00 | 48.00 | 20.00 |
| Medium | Gemini 2.5 Pro | - | 41.77 | 44.87 | 30.77 | 61.54 | 42.31 | 38.67 | 48.0 | 46.0 | 22.0 |
| Medium | Gemini 1.5 Pro 🥇 | - | 46.31 | 53.84 | 46.15 | 65.38 | 50.00 | 38.78 | 34.00 | 49.02 | 33.33 |
| Medium | Claude 3.5 | - | 38.62 | 35.90 | 34.62 | 50.0 | 23.08 | 41.33 | 42.0 | 54.0 | 28.0 |
| Medium | InternVL2.5 | 26B | 41.77 | 44.87 | 30.77 | 57.69 | 46.15 | 38.67 | 24.00 | 62.00 | 30.00 |
| Medium | InternVL2.5 | 8B | 41.08 | 46.15 | 34.62 | 61.54 | 42.31 | 36.00 | 34.00 | 60.00 | 14.00 |
| Medium | InternVL2.5 | 4B | 44.36 | 48.72 | 23.08 | 65.38 | 57.69 | 40.00 | 28.00 | 60.00 | 32.00 |
| Medium | LLaVA Next | 32B | 20.88 | 23.08 | 11.54 | 38.46 | 19.23 | 18.67 | 10.00 | 30.00 | 16.00 |
| Medium | LLaVA Video | 7B | 21.92 | 20.51 | 19.23 | 26.92 | 15.38 | 23.33 | 20.00 | 30.00 | 20.00 |
| Medium | LLaVA OneVision | 7B | 22.54 | 23.08 | 19.23 | 30.77 | 19.23 | 22.00 | 14.00 | 34.00 | 18.00 |
| Medium | Qwen2.5 VL | 32B | 33.31 | 34.62 | 19.23 | 50.00 | 34.62 | 32.00 | 20.00 | 50.00 | 26.00 |
| Medium | Qwen2.5 VL | 7B | 24.08 | 29.49 | 19.23 | 30.77 | 38.46 | 18.67 | 18.00 | 26.00 | 12.00 |
| Easy | GPT 4o | - | 50.51 | 57.69 | 57.69 | 50.00 | 65.38 | 43.33 | 66.00 | 34.00 | 30.00 |
| Easy | Gemini 2.5 Pro 🥇 | - | 61.05 | 64.10 | 57.69 | 57.69 | 76.92 | 58.00 | 72.0 | 50.0 | 52.0 |
| Easy | Gemini 1.5 Pro | - | 50.69 | 52.56 | 42.31 | 61.54 | 53.85 | 48.81 | 50.00 | 46.43 | 50.00 |
| Easy | Claude 3.5 | - | 49.39 | 47.44 | 50.0 | 53.85 | 38.46 | 51.33 | 62.0 | 52.0 | 40.0 |
| Easy | InternVL2.5 | 26B | 55.05 | 64.10 | 65.38 | 57.69 | 69.23 | 46.00 | 50.00 | 50.00 | 38.00 |
| Easy | InternVL2.5 | 8B | 53.47 | 60.26 | 69.23 | 46.15 | 65.38 | 46.67 | 46.00 | 54.00 | 40.00 |
| Easy | InternVL2.5 | 4B | 53.87 | 56.41 | 53.85 | 57.69 | 57.69 | 51.33 | 52.00 | 56.00 | 46.00 |
| Easy | LLaVA Next | 32B | 35.59 | 37.18 | 26.92 | 53.85 | 30.77 | 34.00 | 30.00 | 38.00 | 34.00 |
| Easy | LLaVA Video | 7B | 31.03 | 32.05 | 30.77 | 34.62 | 30.77 | 30.00 | 22.00 | 38.00 | 30.00 |
| Easy | LLaVA OneVision | 7B | 33.00 | 33.33 | 34.62 | 34.62 | 30.77 | 32.67 | 28.00 | 38.00 | 32.00 |
| Easy | Qwen2.5 VL | 32B | 52.77 | 61.54 | 53.85 | 61.54 | 69.23 | 44.00 | 40.00 | 54.00 | 38.00 |
| Easy | Qwen2.5 VL | 7B | 31.31 | 34.62 | 38.46 | 19.23 | 46.15 | 28.00 | 36.00 | 22.00 | 26.00 |

Opportunities!

  1. Long-Horizon Temporal Reasoning: Future work can explore models with stronger memory and temporal abstraction capabilities to handle long-duration videos with multiple events and delayed causal effects, especially in air and water scenarios.
  2. Generalization Across Domains and Modalities: Cross-domain generalization—from land to air or water—and transfer learning across modalities (e.g., combining video, text, and audio) remain underexplored and crucial for building versatile systems.
  3. Safety-Aware and Verifiable Reasoning: Given the high-stakes nature of open-space applications (e.g., autonomous driving or aircraft control), future benchmarks and methods should integrate safety constraints and provide interpretable or verifiable reasoning processes.

BibTeX

@article{gu2025m4r,
  title={Measuring Massive Multimodal Understanding and Reasoning in Open Space},
  author={Gu, Shangding and Wang, Xiaohan and Ying, Donghao and Zhao, Haoyu and Yang, Runing and Li, Boyi and Jin, Ming and Pavone, Marco and Yeung-Levy, Serena and Wang, Jun and Song, Dawn and Spanos, Costas},
  journal={Github},
  year={2025}
}