OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration Paper • 2602.05400 • Published 5 days ago • 32
Grounding and Enhancing Informativeness and Utility in Dataset Distillation Paper • 2601.21296 • Published 12 days ago • 18
MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal Critique Paper • 2511.09067 • Published Nov 12, 2025 • 2
Grounding Computer Use Agents on Human Demonstrations Paper • 2511.07332 • Published Nov 10, 2025 • 106
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation Paper • 2511.02778 • Published Nov 4, 2025 • 101
BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution Paper • 2510.08697 • Published Oct 9, 2025 • 39
MMCode: Evaluating Multi-Modal Code Large Language Models with Visually Rich Programming Problems Paper • 2404.09486 • Published Apr 15, 2024 • 2
AdamMeme: Adaptively Probe the Reasoning Capacity of Multimodal Large Language Models on Harmfulness Paper • 2507.01702 • Published Jul 2, 2025 • 4
Article BigCodeArena: Judging code generations end to end with code executions Oct 7, 2025 • 21
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use Paper • 2509.24002 • Published Sep 28, 2025 • 175
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers Paper • 2508.14704 • Published Aug 20, 2025 • 43
ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use Paper • 2504.07981 • Published Apr 4, 2025 • 4
GraphOmni: A Comprehensive and Extendable Benchmark Framework for Large Language Models on Graph-theoretic Tasks Paper • 2504.12764 • Published Apr 17, 2025 • 42
Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers Paper • 2505.21497 • Published May 27, 2025 • 109
Article ✴️ ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use Jan 3, 2025 • 23