Instructions to use lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only") messages = [ {"role": "user", "content": "Who are you?"}, ] pipe(messages)# Load model directly from transformers import AutoProcessor, AutoModelForCausalLM processor = AutoProcessor.from_pretrained("lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only") model = AutoModelForCausalLM.from_pretrained("lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only") messages = [ {"role": "user", "content": "Who are you?"}, ] inputs = processor.apply_chat_template( messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt", ).to(model.device) outputs = model.generate(**inputs, max_new_tokens=40) print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:])) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only
- SGLang
How to use lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }' - Docker Model Runner
How to use lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only with Docker Model Runner:
docker model run hf.co/lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only
LLaVA-Video-7B-Qwen2-Video-Only
Table of Contents
Model Summary
In contrast to lmms-lab/LLaVA-NeXT-Video-7B-Qwen2, this is a 7B model trained on LLaVA-Video-178K only, based on Qwen2 language model with a context window of 32K tokens.
This model supports up to 110 frames and achieves comparable results to those of lmms-lab/LLaVA-Video-7B-Qwen2 in terms of video benchmarks.
- Project Page: Project Page.
- Paper: For more details, please check our paper
- Repository: LLaVA-VL/LLaVA-NeXT
- Point of Contact: Yuanhan Zhang
- Languages: English, Chinese
Use
Intended use
The model was trained on LLaVA-Video-178K and have the ability to interact with videos.
Feel free to share your generations in the Community tab!
Generation
We provide the simple generation process for using our model. For more details, you could refer to Github.
# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import requests
import copy
import torch
import sys
import warnings
from decord import VideoReader, cpu
import numpy as np
warnings.filterwarnings("ignore")
def load_video(self, video_path, max_frames_num,fps=1,force_sample=False):
if max_frames_num == 0:
return np.zeros((1, 336, 336, 3))
vr = VideoReader(video_path, ctx=cpu(0),num_threads=1)
total_frame_num = len(vr)
video_time = total_frame_num / vr.get_avg_fps()
fps = round(vr.get_avg_fps()/fps)
frame_idx = [i for i in range(0, len(vr), fps)]
frame_time = [i/fps for i in frame_idx]
if len(frame_idx) > max_frames_num or force_sample:
sample_fps = max_frames_num
uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int)
frame_idx = uniform_sampled_frames.tolist()
frame_time = [i/vr.get_avg_fps() for i in frame_idx]
frame_time = ",".join([f"{i:.2f}s" for i in frame_time])
spare_frames = vr.get_batch(frame_idx).asnumpy()
# import pdb;pdb.set_trace()
return spare_frames,frame_time,video_time
pretrained = "lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only "
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map) # Add any other thing you want to pass in llava_model_args
model.eval()
video_path = "XXXX"
max_frames_num = "64"
video,frame_time,video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()
video = [video]
conv_template = "qwen_1_5" # Make sure you use correct chat template for different models
time_instruciton = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. These frames are located at {frame_time}.Please answer the following questions related to this video."
question = DEFAULT_IMAGE_TOKEN + f"{time_instruciton}\nPlease describe this video in detail."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
cont = model.generate(
input_ids,
images=video,
modalities= ["video"],
do_sample=False,
temperature=0,
max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
print(text_outputs)
Training
Model
- Architecture: SO400M + Qwen2
- Initialized Model: lmms-lab/llava-onevision-qwen2-7b-si
- Data: video data only, 1 epoch, full model
- Precision: bfloat16
Hardware & Software
- GPUs: 256 * Nvidia Tesla A100 (for whole model series training)
- Orchestration: Huggingface Trainer
- Neural networks: PyTorch
Citations
@misc{zhang2024videoinstructiontuningsynthetic,
title={Video Instruction Tuning With Synthetic Data},
author={Yuanhan Zhang and Jinming Wu and Wei Li and Bo Li and Zejun Ma and Ziwei Liu and Chunyuan Li},
year={2024},
eprint={2410.02713},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.02713},
}
- Downloads last month
- 260
Model tree for lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only
Base model
lmms-lab/llava-onevision-qwen2-7b-siDataset used to train lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only
Spaces using lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only 3
Collection including lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only
Paper for lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only
Evaluation results
- accuracy on ActNet-QAself-reported58.200
- accuracy on EgoSchemaself-reported57.300
- accuracy on MLVUself-reported69.800
- accuracy on MVBenchself-reported58.400
- accuracy on NextQAself-reported82.200
- accuracy on PercepTestself-reported71.700
- score on VideoChatGPTself-reported3.540
- score on VideoDCself-reported3.710
- accuracy on LongVideoBenchself-reported57.300
- accuracy on VideoMMEself-reported63.200