Collections
Collections including paper arxiv:2510.12798
- SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights
  Paper • 2509.22944 • Published • 79
- Robot Learning: A Tutorial
  Paper • 2510.12403 • Published • 114
- UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE
  Paper • 2510.13344 • Published • 62
- Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding
  Paper • 2510.06308 • Published • 53

- NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints
  Paper • 2510.08565 • Published • 19
- Detect Anything via Next Point Prediction
  Paper • 2510.12798 • Published • 46
- PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
  Paper • 2510.14528 • Published • 103
- DeepEyesV2: Toward Agentic Multimodal Model
  Paper • 2511.05271 • Published • 42

- Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer
  Paper • 2508.09131 • Published • 16
- Detect Anything via Next Point Prediction
  Paper • 2510.12798 • Published • 46
- OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes
  Paper • 2510.26800 • Published • 21
- Canvas-to-Image: Compositional Image Generation with Multimodal Controls
  Paper • 2511.21691 • Published • 32

- LinFusion: 1 GPU, 1 Minute, 16K Image
  Paper • 2409.02097 • Published • 34
- Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion
  Paper • 2409.11406 • Published • 27
- Diffusion Models Are Real-Time Game Engines
  Paper • 2408.14837 • Published • 126
- Segment Anything with Multiple Modalities
  Paper • 2408.09085 • Published • 22

- Diffusion Transformers with Representation Autoencoders
  Paper • 2510.11690 • Published • 165
- Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model
  Paper • 2510.12276 • Published • 145
- FlashWorld: High-quality 3D Scene Generation within Seconds
  Paper • 2510.13678 • Published • 71
- ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints
  Paper • 2510.14847 • Published • 55

- A Survey of Context Engineering for Large Language Models
  Paper • 2507.13334 • Published • 259
- GUI-G^2: Gaussian Reward Modeling for GUI Grounding
  Paper • 2507.15846 • Published • 133
- ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents
  Paper • 2507.22827 • Published • 99
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  Paper • 2508.18265 • Published • 208

- Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models
  Paper • 2410.02740 • Published • 54
- From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging
  Paper • 2410.01215 • Published • 39
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
  Paper • 2409.17146 • Published • 121
- EuroLLM: Multilingual Language Models for Europe
  Paper • 2409.16235 • Published • 29