---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- multimodal-llm
- reasoning
- math
datasets:
- MMR1/MMR1-SFT
- MMR1/MMR1-RL
---

[arXiv](https://arxiv.org/abs/2509.21268) · [Paper on Hugging Face](https://huggingface.co/papers/2509.21268)

# MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources

## Paper Abstract

Large multimodal reasoning models have achieved rapid progress, but their advancement is constrained by two major limitations: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms in post-training. Group Relative Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone to gradient vanishing when reward variance is low, which weakens optimization signals and impairs convergence. This work makes three contributions: (1) We propose Variance-Aware Sampling (VAS), a data selection strategy guided by the Variance Promotion Score (VPS), which combines outcome variance and trajectory diversity to promote reward variance and stabilize policy optimization. (2) We release large-scale, carefully curated resources containing ~1.6M long CoT cold-start examples and ~15k RL QA pairs, designed to ensure quality, difficulty, and diversity, along with a fully reproducible end-to-end training codebase. (3) We open-source a family of multimodal reasoning models at multiple scales, establishing standardized baselines for the community. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS. Comprehensive ablation studies and analyses provide further insight into the contributions of each component. In addition, we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude, with VAS serving as a practical mechanism to realize this guarantee. Our code, data, and checkpoints are available at [https://github.com/LengSicong/MMR1](https://github.com/LengSicong/MMR1).

## Code and Project Links

- **GitHub Repository:** [https://github.com/LengSicong/MMR1](https://github.com/LengSicong/MMR1)
- **Paper on Hugging Face:** [https://huggingface.co/papers/2509.21268](https://huggingface.co/papers/2509.21268)
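## Usage

This card does not list a specific checkpoint id, so the snippet below is a minimal sketch assuming the released checkpoints follow the standard `transformers` image-text-to-text interface (as declared by the `pipeline_tag` above). The model id is a placeholder; substitute an actual MMR1 checkpoint, and see the GitHub repository for the official instructions.

```python
# Minimal inference sketch. Assumptions: the checkpoint supports the
# standard transformers "image-text-to-text" pipeline; the model id is a
# placeholder, not a confirmed repository name.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="MMR1/MMR1-7B",  # placeholder id; replace with a released MMR1 checkpoint
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            # Sample image from the transformers docs; substitute your own problem image.
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "Reason step by step about what is shown in the image."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=512, return_full_text=False)
print(out[0]["generated_text"])
```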
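## Variance-Aware Sampling at a Glance

To make the idea in the abstract concrete: VAS scores each training prompt with a Variance Promotion Score (VPS) that combines outcome variance with trajectory diversity, then samples prompts in proportion to that score so that GRPO sees batches with non-degenerate reward variance. The sketch below is illustrative only: the binary-reward variance, the token-set diversity proxy, the weights, and all function names are our assumptions, not the paper's actual implementation (see the codebase linked above for that).

```python
# Illustrative sketch of Variance-Aware Sampling (VAS); all details here
# (reward form, diversity proxy, weights) are assumptions for exposition.
import random
from itertools import combinations

def outcome_variance(rewards: list[float]) -> float:
    """Empirical variance of rollout rewards; for 0/1 rewards this is p*(1-p),
    which vanishes when a prompt is always solved or always failed."""
    p = sum(rewards) / len(rewards)
    return p * (1.0 - p)

def trajectory_diversity(trajectories: list[str]) -> float:
    """Toy diversity proxy: fraction of rollout pairs with different token sets.
    The paper's actual trajectory-diversity measure may differ."""
    pairs = list(combinations(trajectories, 2))
    if not pairs:
        return 0.0
    return sum(set(a.split()) != set(b.split()) for a, b in pairs) / len(pairs)

def variance_promotion_score(rewards, trajectories, w_var=1.0, w_div=0.5):
    """VPS as a weighted sum of outcome variance and trajectory diversity
    (the weighting scheme here is an assumption)."""
    return w_var * outcome_variance(rewards) + w_div * trajectory_diversity(trajectories)

def sample_prompts(pool: list[dict], k: int) -> list[dict]:
    """Draw k prompts (with replacement, for simplicity) with probability
    proportional to VPS, favoring prompts that yield informative gradients."""
    weights = [variance_promotion_score(p["rewards"], p["rollouts"]) for p in pool]
    if sum(weights) == 0:  # every prompt fully solved or fully failed
        return random.sample(pool, k)
    return random.choices(pool, weights=weights, k=k)
```

Each `pool` entry is assumed to carry the latest rollouts and their rewards for that prompt, e.g. `{"prompt": ..., "rollouts": [...], "rewards": [...]}`; in practice these statistics would be refreshed as the policy updates.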
## Related Projects

Other multimodal LLM projects from our team that may interest you:

> [**VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding**](https://github.com/DAMO-NLP-SG/VideoLLaMA3)
> Boqiang Zhang*, Kehan Li*, Zesen Cheng*, Zhiqiang Hu*, Yuqian Yuan*, Guanzheng Chen*, Sicong Leng*, Yuming Jiang*, Hang Zhang*, Xin Li*, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao
[GitHub](https://github.com/DAMO-NLP-SG/VideoLLaMA3) · [arXiv](https://arxiv.org/abs/2501.13106)
> [**VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs**](https://github.com/DAMO-NLP-SG/VideoLLaMA2)
> Zesen Cheng*, Sicong Leng*, Hang Zhang*, Yifei Xin*, Xin Li*, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing
[GitHub](https://github.com/DAMO-NLP-SG/VideoLLaMA2) · [arXiv](https://arxiv.org/abs/2406.07476)
> [**Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding**](https://arxiv.org/abs/2311.16922)
> Sicong Leng*, Hang Zhang*, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, Lidong Bing
[GitHub](https://github.com/DAMO-NLP-SG/VCD) · [arXiv](https://arxiv.org/abs/2311.16922)
> [**The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio**](https://arxiv.org/abs/2410.12787)
> Sicong Leng*, Yun Xing*, Zesen Cheng*, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing
[GitHub](https://github.com/DAMO-NLP-SG/CMM) · [arXiv](https://arxiv.org/abs/2410.12787)
> [**Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss**](https://arxiv.org/abs/2410.17243)
> Zesen Cheng*, Hang Zhang*, Kehan Li*, Sicong Leng, Zhiqiang Hu, Fei Wu, Deli Zhao, Xin Li, Lidong Bing
[GitHub](https://github.com/DAMO-NLP-SG/Inf-CLIP) · [arXiv](https://arxiv.org/abs/2410.17243)
> [**VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM**](https://arxiv.org/abs/2501.00599)
> Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, Jianke Zhu, Lidong Bing
[GitHub](https://github.com/DAMO-NLP-SG/VideoRefer) · [arXiv](https://arxiv.org/abs/2501.00599)