Instructions to use lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only")
model = AutoModelForCausalLM.from_pretrained("lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
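
The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library, assuming the server is reachable at the address shown in the curl example. The helper names `build_chat_request` and `send_chat_request` are hypothetical and not part of vLLM. The same request shape should apply to the SGLang server on port 30000 by changing `base_url`.

```python
import json
import urllib.request


def build_chat_request(prompt, model, base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the first choice's message text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]


request = build_chat_request(
    "What is the capital of France?",
    model="lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only",
)
# Uncomment once the server is running:
# print(send_chat_request(request))
```

Building the request separately from sending it keeps the payload easy to inspect before any network traffic happens.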

Use Docker

docker model run hf.co/lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only

SGLang

How to use lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only with Docker Model Runner:
```
docker model run hf.co/lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only
```

LLaVA-Video-7B-Qwen2-Video-Only

Model Summary

In contrast to lmms-lab/LLaVA-NeXT-Video-7B-Qwen2, this 7B model is trained on LLaVA-Video-178K only. It is based on the Qwen2 language model and has a context window of 32K tokens.

This model supports up to 110 frames and achieves results comparable to those of lmms-lab/LLaVA-Video-7B-Qwen2 on video benchmarks.
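
Because of the 110-frame limit, longer videos have to be subsampled before they reach the model. The sketch below shows one way to pick uniformly spaced frame indices under that cap, in plain Python. The function name `uniform_frame_indices` is hypothetical and not part of the LLaVA-NeXT code base. It uses integer arithmetic so the first and last frames are always included.

```python
def uniform_frame_indices(total_frames, max_frames=110):
    """Return at most max_frames frame indices spread evenly over the video."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short video: every frame fits under the cap
        return list(range(total_frames))
    if max_frames == 1:
        return [0]
    last = total_frames - 1
    # Integer division avoids floating-point rounding at the final index
    return [(i * last) // (max_frames - 1) for i in range(max_frames)]


indices = uniform_frame_indices(total_frames=3000)
print(len(indices), indices[0], indices[-1])
```

Decoding only the selected indices, rather than every frame, also keeps preprocessing cost proportional to the cap instead of the video length.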

Project Page: Project Page.
Paper: Video Instruction Tuning With Synthetic Data (arXiv:2410.02713)
Repository: LLaVA-VL/LLaVA-NeXT
Point of Contact: Yuanhan Zhang
Languages: English, Chinese

Use

Intended use

The model was trained on LLaVA-Video-178K and has the ability to interact with videos.

Feel free to share your generations in the Community tab!

Generation

We provide a simple generation example for using our model. For more details, refer to the GitHub repository (LLaVA-VL/LLaVA-NeXT).

# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import requests
import copy
import torch
import sys
import warnings
from decord import VideoReader, cpu
import numpy as np
warnings.filterwarnings("ignore")
def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    # Returns the sampled frames, their timestamps as a string, and the video length in seconds
    if max_frames_num == 0:
        # Placeholder frame with no timing information
        return np.zeros((1, 336, 336, 3)), "", 0.0
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    avg_fps = vr.get_avg_fps()
    video_time = total_frame_num / avg_fps
    # Stride in frames that yields roughly `fps` sampled frames per second
    stride = max(1, round(avg_fps / fps))
    frame_idx = list(range(0, total_frame_num, stride))
    frame_time = [i / avg_fps for i in frame_idx]
    if len(frame_idx) > max_frames_num or force_sample:
        # Fall back to max_frames_num frames spread uniformly over the whole video
        frame_idx = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int).tolist()
        frame_time = [i / avg_fps for i in frame_idx]
    frame_time = ",".join([f"{t:.2f}s" for t in frame_time])
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time
pretrained = "lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map)  # Add any other thing you want to pass in llava_model_args
model.eval()
video_path = "XXXX"
max_frames_num = 64
video,frame_time,video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()
video = [video]
conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models
time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. These frames are located at {frame_time}. Please answer the following questions related to this video."
question = DEFAULT_IMAGE_TOKEN + f"{time_instruction}\nPlease describe this video in detail."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
cont = model.generate(
    input_ids,
    images=video,
    modalities= ["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
print(text_outputs)
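
The prompt in the example above tells the model how long the video is, how many frames were sampled, and where those frames fall in time. The sketch below isolates that step in plain Python so it can be checked without a GPU or decord. The helper names `frame_timestamps` and `build_time_instruction` are hypothetical and not part of LLaVA-NeXT, and the wording follows the prompt used in the example.

```python
def frame_timestamps(frame_idx, avg_fps):
    """Convert frame indices to timestamps in seconds."""
    if avg_fps <= 0:
        raise ValueError("avg_fps must be positive")
    return [i / avg_fps for i in frame_idx]


def build_time_instruction(video_time, timestamps):
    """Format the timing preamble that precedes the user question."""
    frame_time = ",".join(f"{t:.2f}s" for t in timestamps)
    return (
        f"The video lasts for {video_time:.2f} seconds, and {len(timestamps)} "
        f"frames are uniformly sampled from it. These frames are located at "
        f"{frame_time}. Please answer the following questions related to this video."
    )


# Three frames from a 3-second clip recorded at 30 frames per second
timestamps = frame_timestamps([0, 30, 60], avg_fps=30.0)
instruction = build_time_instruction(video_time=3.0, timestamps=timestamps)
print(instruction)
```

Dividing each index by the average frame rate gives seconds from the start of the video, which is what the prompt reports to the model.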

Training

Model

Architecture: SO400M + Qwen2
Initialized Model: lmms-lab/llava-onevision-qwen2-7b-si
Data: video data only, 1 epoch, full model
Precision: bfloat16

Hardware & Software

GPUs: 256 * Nvidia Tesla A100 (for whole model series training)
Orchestration: Hugging Face Trainer
Neural networks: PyTorch

Citations


@misc{zhang2024videoinstructiontuningsynthetic,
    title={Video Instruction Tuning With Synthetic Data}, 
    author={Yuanhan Zhang and Jinming Wu and Wei Li and Bo Li and Zejun Ma and Ziwei Liu and Chunyuan Li},
    year={2024},
    eprint={2410.02713},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2410.02713}, 
}

Safetensors

Model size: 8B params
Tensor type: BF16

Model tree for lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only

Base model: lmms-lab/llava-onevision-qwen2-7b-si

Collection including lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only

LLaVA-Video: models focused on video understanding (previously known as LLaVA-NeXT-Video).

Paper for lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only

Video Instruction Tuning With Synthetic Data (arXiv 2410.02713, published Oct 3, 2024)

Evaluation results (self-reported)

ActNet-QA accuracy: 58.2
EgoSchema accuracy: 57.3
MLVU accuracy: 69.8
MVBench accuracy: 58.4
NextQA accuracy: 82.2
PercepTest accuracy: 71.7
VideoChatGPT score: 3.54
VideoDC score: 3.71
LongVideoBench accuracy: 57.3
VideoMME accuracy: 63.2
