Collections

Discover the best community collections!

Collections including paper arxiv:2510.12798

Detect Anything via Next Point Prediction

Paper • 2510.12798 • Published Oct 14 • 46

SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights

Paper • 2509.22944 • Published Sep 26 • 79
Robot Learning: A Tutorial

Paper • 2510.12403 • Published Oct 14 • 114
UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE

Paper • 2510.13344 • Published Oct 15 • 62
Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

Paper • 2510.06308 • Published Oct 7 • 53

Visual Multi Modal LLM

NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints

Paper • 2510.08565 • Published Oct 9 • 19
Detect Anything via Next Point Prediction

Paper • 2510.12798 • Published Oct 14 • 46
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

Paper • 2510.14528 • Published Oct 16 • 103
DeepEyesV2: Toward Agentic Multimodal Model

Paper • 2511.05271 • Published 29 days ago • 42

Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer

Paper • 2508.09131 • Published Aug 12 • 16
Detect Anything via Next Point Prediction

Paper • 2510.12798 • Published Oct 14 • 46
OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes

Paper • 2510.26800 • Published Oct 30 • 21
Canvas-to-Image: Compositional Image Generation with Multimodal Controls

Paper • 2511.21691 • Published 10 days ago • 32

LinFusion: 1 GPU, 1 Minute, 16K Image

Paper • 2409.02097 • Published Sep 3, 2024 • 34
Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion

Paper • 2409.11406 • Published Sep 17, 2024 • 27
Diffusion Models Are Real-Time Game Engines

Paper • 2408.14837 • Published Aug 27, 2024 • 126
Segment Anything with Multiple Modalities

Paper • 2408.09085 • Published Aug 17, 2024 • 22

Diffusion Transformers with Representation Autoencoders

Paper • 2510.11690 • Published Oct 13 • 165
Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model

Paper • 2510.12276 • Published Oct 14 • 145
FlashWorld: High-quality 3D Scene Generation within Seconds

Paper • 2510.13678 • Published Oct 15 • 71
ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints

Paper • 2510.14847 • Published Oct 16 • 55

Detect Anything via Next Point Prediction

Paper • 2510.12798 • Published Oct 14 • 46

A Survey of Context Engineering for Large Language Models

Paper • 2507.13334 • Published Jul 17 • 259
GUI-G^2: Gaussian Reward Modeling for GUI Grounding

Paper • 2507.15846 • Published Jul 21 • 133
ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

Paper • 2507.22827 • Published Jul 30 • 99
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Paper • 2508.18265 • Published Aug 25 • 208

Interesting Papers

These papers are interesting (to me)

Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

Paper • 2410.02740 • Published Oct 3, 2024 • 54
From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging

Paper • 2410.01215 • Published Oct 2, 2024 • 39
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Paper • 2409.17146 • Published Sep 25, 2024 • 121
EuroLLM: Multilingual Language Models for Europe

Paper • 2409.16235 • Published Sep 24, 2024 • 29

