MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods • Paper • 2601.21821 • Published 14 days ago • 59
PaddleOCR-VL • Collection • Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model • 5 items • Updated 1 day ago • 28
PaddleOCR-VL-1.5 • Collection • Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing • 6 items • Updated 1 day ago • 9
Video-As-Prompt: Unified Semantic Control for Video Generation • Paper • 2510.20888 • Published Oct 23, 2025 • 50
FlowAct-R1: Towards Interactive Humanoid Video Generation • Paper • 2601.10103 • Published 28 days ago • 74
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents • Paper • 2410.10594 • Published Oct 14, 2024 • 29
VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation • Paper • 2510.09733 • Published Oct 10, 2025 • 5
VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning • Paper • 2509.24650 • Published Sep 29, 2025 • 3
RigMo: Unifying Rig and Motion Learning for Generative Animation • Paper • 2601.06378 • Published Jan 10 • 12
VideoPrism • Collection • VideoPrism is a foundational video encoder that enables state-of-the-art performance on a large variety of video understanding tasks. • 5 items • Updated Jul 16, 2025 • 17