How to use Dream-org/Dream-VL-7B with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("image-text-to-text", model="Dream-org/Dream-VL-7B", trust_remote_code=True)
messages = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
{"type": "text", "text": "What animal is on the candy?"}
]
},
]
pipe(text=messages)
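# The pipeline returns a list of result dicts; a minimal sketch of reading the reply
# (assumption: the standard image-text-to-text pipeline output exposes a "generated_text" field)
result = pipe(text=messages)
print(result[0]["generated_text"])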
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("Dream-org/Dream-VL-7B", trust_remote_code=True, dtype="auto")
How to use Dream-org/Dream-VL-7B with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Dream-org/Dream-VL-7B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Dream-org/Dream-VL-7B",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image in one sentence."
},
{
"type": "image_url",
"image_url": {
"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
}
}
]
}
]
}'
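# Or call the server from Python with the OpenAI client (a minimal sketch;
# the api_key value is a placeholder, since vLLM does not require one by default):
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Dream-org/Dream-VL-7B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)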
How to use Dream-org/Dream-VL-7B with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "Dream-org/Dream-VL-7B" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Dream-org/Dream-VL-7B",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image in one sentence."
},
{
"type": "image_url",
"image_url": {
"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
}
}
]
}
]
}'
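# Or call the OpenAI-compatible endpoint from Python with requests
# (a minimal sketch reusing the same payload as the curl call above):
import requests
payload = {
    "model": "Dream-org/Dream-VL-7B",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}},
        ],
    }],
}
response = requests.post("http://localhost:30000/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])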
# Or launch the SGLang server with Docker:
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "Dream-org/Dream-VL-7B" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Dream-org/Dream-VL-7B",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image in one sentence."
},
{
"type": "image_url",
"image_url": {
"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
}
}
]
}
]
}'
How to use Dream-org/Dream-VL-7B with Docker Model Runner:
docker model run hf.co/Dream-org/Dream-VL-7B
Dream-VL 7B is an open diffusion vision-language model trained on 12M multimodal instruction samples from the MAmmoTH-VL-Instruct-12M dataset. The model takes language instructions and images as input and generates language outputs.
All Dream-VL checkpoints, as well as our training codebase, are released under the Apache 2.0 License.
For full details, please read our blog and the paper: Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone.
Dream-VL-7B is built on Dream-7B with a Qwen2ViT vision backbone.
import torch
from transformers import AutoProcessor, AutoModel
model_name = "Dream-org/Dream-VL-7B"
model = AutoModel.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
trust_remote_code=True,
).to('cuda')
processor = AutoProcessor.from_pretrained(
model_name,
trust_remote_code=True
)
####### Method 1: load the image with PIL and pass it directly to the processor
from PIL import Image
import requests
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
messages = [
{
"role": "user","content": [{"type": "image"}, {"type": "text", "text": "Describe this image"}]
}
]
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
print(text)
inputs = processor(
text=[text], images=[image], padding=True, return_tensors="pt"
)
####### Method 2: use qwen_vl_utils
# messages = [
# {
# "role": "user",
# "content": [
# {
# "type": "image",
# "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
# },
# {"type": "text", "text": "Describe this image."},
# ],
# }
# ]
# text = processor.apply_chat_template(
# messages, tokenize=False, add_generation_prompt=True
# )
# from qwen_vl_utils import process_vision_info
# image_inputs, video_inputs = process_vision_info(messages)
# inputs = processor(
# text=[text],
# images=image_inputs,
# videos=video_inputs,
# padding=True,
# return_tensors="pt",
# )
inputs = inputs.to("cuda")
input_ids = inputs.pop("input_ids")
output = model.diffusion_generate(
input_ids,
max_new_tokens=128,
output_history=True,
return_dict_in_generate=True,
steps=128,  # number of diffusion decoding steps
temperature=0.1,
top_p=1,
alg="maskgit_plus",  # token unmasking strategy
alg_temp=0,  # randomness of the confidence-based unmasking order
use_cache=False,
**inputs
)
generations = [
processor.tokenizer.decode(g[len(p):].cpu().tolist())
for p, g in zip(input_ids, output.sequences)
]
for j in range(len(messages)):
print("output:", j, generations[j].split(processor.tokenizer.eos_token)[0])
# output: The image depicts a serene beach scene featuring a young woman and a golden retriever.
# The woman, dressed in a plaid shirt and dark pants, is seated on the sandy shore, smiling warmly at the camera.
# The golden retriever, adorned with a colorful harness, sits attentively beside her, its gaze fixed on the woman.
# The background reveals the vast expanse of the ocean, with waves gently kissing the shore. The sky above is a clear blue, suggesting a sunny day.
# The overall atmosphere exudes a sense of peace and companionship between the woman and her dog.
BibTeX:
@article{ye2025dreamvla,
title={Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone},
author={Ye, Jiacheng and Gong, Shansan and Gao, Jiahui and Fan, Junming and Wu, Shuang and Bi, Wei and Bai, Haoli and Shang, Lifeng and Kong, Lingpeng},
journal={arXiv preprint arXiv:2512.22615},
year={2025}
}