VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models • Paper • 2603.22003 • Published 5 days ago • 11
Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing • Paper • 2603.12254 • Published 16 days ago • 21
Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought • Paper • 2603.22847 • Published 4 days ago • 22
RealMaster: Lifting Rendered Scenes into Photorealistic Video • Paper • 2603.23462 • Published 4 days ago • 26
UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation • Paper • 2603.23500 • Published 4 days ago • 34
SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning • Paper • 2603.23483 • Published 4 days ago • 57
DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models • Paper • 2603.23499 • Published 4 days ago • 47
From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents • Paper • 2603.22386 • Published 5 days ago • 51
SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM • Paper • 2603.23386 • Published 4 days ago • 41
PEARL: Personalized Streaming Video Understanding Model • Paper • 2603.20422 • Published 8 days ago • 37
MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding • Paper • 2603.22458 • Published 5 days ago • 124
WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG • Paper • 2603.23497 • Published 4 days ago • 84
Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding • Paper • 2603.18472 • Published 9 days ago • 19
Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation • Paper • 2603.19220 • Published 9 days ago • 61
LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs • Paper • 2603.19217 • Published 9 days ago • 28
F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World • Paper • 2603.19223 • Published 9 days ago • 30
SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing • Paper • 2603.19228 • Published 9 days ago • 66