Fine-Tuning VLMs

In Unit 1, we explored supervised fine-tuning of LLMs, including efficient strategies using TRL. In this section, we adapt those techniques to Vision Language Models (VLMs), focusing on efficiency and task-specific performance.

Key Efficiency Techniques

When fine-tuning VLMs, memory and computation can quickly become a bottleneck. Here are the main strategies:

Quantization

Quantization reduces the precision of model weights and activations, lowering memory usage and speeding up computation.

  • bfloat16 / float16 halves memory requirements relative to float32 with little to no loss in accuracy.
  • 8-bit / 4-bit quantization reduces memory further, with minor performance trade-offs.

    ⚠️ Especially relevant for VLMs, where image features increase memory demands.
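
For example, here is a minimal sketch of loading a VLM in 4-bit precision, assuming a recent transformers release (with AutoModelForImageTextToText) and bitsandbytes installed; the quantization settings are illustrative:

import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
   load_in_4bit=True,                      # store weights in 4-bit NF4
   bnb_4bit_quant_type="nf4",
   bnb_4bit_compute_dtype=torch.bfloat16,  # run compute in bfloat16
)

model = AutoModelForImageTextToText.from_pretrained(
   "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
   quantization_config=bnb_config,
)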

PEFT & LoRA

Low-Rank Adaptation (LoRA) freezes the base model weights and trains compact rank-decomposition matrices, drastically reducing the number of trainable parameters. When combined with PEFT, fine-tuning requires millions of trainable parameters instead of billions, making large VLMs accessible on limited hardware.
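
As a hedged sketch, a typical LoRA configuration with PEFT looks like the following; the rank, scaling factor, and target modules are illustrative and depend on the model architecture:

from peft import LoraConfig

peft_config = LoraConfig(
   r=16,                                 # rank of the low-rank update matrices
   lora_alpha=32,                        # scaling factor applied to the LoRA updates
   lora_dropout=0.05,
   target_modules=["q_proj", "v_proj"],  # attention projections to adapt (model-dependent)
   task_type="CAUSAL_LM",
)

Passing this peft_config to the SFTTrainer (peft_config=peft_config) keeps the base weights frozen and trains only the adapter matrices.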

Batch Size Optimization

Memory-efficient training can be achieved with:

  • Gradient accumulation: maintain effective batch size over multiple steps.
  • Gradient checkpointing: recompute intermediate activations to save memory.
  • Batch size tuning: start with a larger batch size, reduce it if out-of-memory (OOM) errors occur, and combine with LoRA/quantization for best results (see the sketch below).
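
For instance, a sketch of an SFTConfig that keeps an effective batch size of 32 on a memory-constrained GPU (all values are illustrative):

from trl import SFTConfig

training_args = SFTConfig(
   output_dir="./checkpoints",
   per_device_train_batch_size=2,   # lower this first if you hit OOM
   gradient_accumulation_steps=16,  # 2 x 16 = effective batch size of 32 per device
   gradient_checkpointing=True,     # recompute activations instead of storing them
)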

Supervised Fine-Tuning (SFT)

SFT adapts a pre-trained VLM to a specific task using labeled datasets (image-text pairs). Examples include:

  • Visual question answering (VQA)
  • Image captioning
  • Chart or diagram interpretation

When to Use SFT

  • Specialize a VLM in a domain where the base model struggles.
  • Learn domain-specific vocabulary or visual patterns.

Limitations

  • Requires high-quality, labeled datasets.
  • Can be computationally intensive.
  • Risk of overfitting if fine-tuning is too narrow.

Usage Example

The SFTTrainer supports training VLMs directly.
Your dataset should include an additional images column containing the visual inputs. See the dataset format docs for details.

from trl import SFTConfig, SFTTrainer

training_args = SFTConfig(
   output_dir="./fine_tuned_model",
   per_device_train_batch_size=4,
   num_train_epochs=3,
   learning_rate=5e-5,
   save_steps=1000,
   bf16=True,
   gradient_checkpointing=True,
   gradient_accumulation_steps=16,
   logging_steps=50,
)

trainer = SFTTrainer(
   model=model,                 # the pre-trained VLM loaded earlier
   args=training_args,
   train_dataset=dataset,       # dataset with "messages" and "images" columns
   processing_class=processor,  # the model's processor (tokenizer + image processor)
)
trainer.train()

⚠️ Important: Set max_length=None in the SFTConfig.
Otherwise, truncation may remove image tokens during training.

SFTConfig(max_length=None, ...)
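
For reference, a single training example in this conversational vision format might look like the sketch below; the "images" and "messages" column names follow the TRL dataset format docs, while the image variable and text content are made up:

example = {
   "images": [chart_image],  # a PIL.Image (or list of images) for this sample
   "messages": [
      {"role": "user", "content": [
         {"type": "image"},
         {"type": "text", "text": "What does this chart show?"},
      ]},
      {"role": "assistant", "content": [
         {"type": "text", "text": "Monthly revenue broken down by region."},
      ]},
   ],
}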

Practical Steps

  1. Data Preparation

    • Use image-text pairs, e.g., HuggingFaceM4/ChartQA.
  2. Model Setup

    • Load a pre-trained VLM such as HuggingFaceTB/SmolVLM2-2.2B-Instruct.
    • Initialize a processor to prepare text and image inputs.
  3. Fine-Tuning Process

    • Format data into chat-like messages (system, user, assistant).
    • Configure optimizer, batch size, and gradient accumulation.
    • Apply quantization and LoRA for memory-efficient training.
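
Putting steps 2 and 3 together, here is a hedged sketch of loading SmolVLM2 with its processor and rendering one chat-formatted sample; the question text is illustrative:

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16)

messages = [
   {"role": "user", "content": [
      {"type": "image"},
      {"type": "text", "text": "What is the highest value in this chart?"},
   ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)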

Preference Optimization (DPO)

Direct Preference Optimization (DPO) aligns a VLM with human preferences instead of strict instruction following.

  • Useful for creative tasks, subjective judgments, or multi-choice answers.
  • The model learns to select the more human-aligned response, even if it isn’t strictly “correct.”

Limitations

  • Requires high-quality preference-labeled datasets.
  • Training involves pairwise preference sampling and careful resource management.

Usage Example

  • Dataset: Each example contains a prompt (image + question) and two candidate responses:
      Question: How many families?
      Rejected: The image does not provide information about families.
      Chosen: The image shows a Union Organization table setup with 18,000 families.
  • Model Setup: Load the pre-trained VLM, integrate with TRL DPO, and prepare the processor.
  • Training Pipeline:
    • Format dataset into chat-like messages.
    • Apply a preference-based loss function.
    • Use gradient accumulation, checkpointing, LoRA, and quantization for efficiency.
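
A minimal DPO training sketch with TRL is shown below; the preference dataset (HuggingFaceH4/rlaif-v_formatted, which provides "images", "prompt", "chosen", and "rejected" columns) and all hyperparameters are illustrative:

import torch
from datasets import load_dataset
from transformers import AutoModelForImageTextToText, AutoProcessor
from trl import DPOConfig, DPOTrainer

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)

# Preference dataset with image prompts and chosen/rejected responses.
dataset = load_dataset("HuggingFaceH4/rlaif-v_formatted", split="train")

training_args = DPOConfig(
   output_dir="./dpo_model",
   per_device_train_batch_size=2,
   gradient_accumulation_steps=16,
   gradient_checkpointing=True,
   bf16=True,
)

trainer = DPOTrainer(
   model=model,
   args=training_args,
   train_dataset=dataset,
   processing_class=processor,
)
trainer.train()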

SFT vs DPO Comparison

Feature   | SFT                       | DPO
----------|---------------------------|------------------------------------------------
Input     | Labeled image-text pairs  | Image-text prompts + preference-ranked outputs
Loss      | Standard supervised loss  | Preference-based loss
Goal      | Task-specific adaptation  | Human-aligned output
Use Case  | Domain specialization     | Creative, subjective, or multi-choice tasks

Practical Tips

  • Start small: test with a subset of the dataset before full training.
  • Use gradient checkpointing + LoRA + quantization to reduce memory usage.
  • Monitor checkpoint frequency to balance storage and safety.
  • Validate on a small set to avoid overfitting.

Next Steps

After fine-tuning, evaluate your VLM’s performance on multimodal tasks using benchmarks and custom test sets, applying techniques from Unit 2.

Fine-Tuning a VLM in hf jobs using TRL

As introduced in earlier units, Hugging Face Jobs make fine-tuning Vision Language Models (VLMs) straightforward. You can run Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO) directly on the Hugging Face infrastructure with minimal setup, adjusting the training parameters we discussed previously.

Quick Example

hf jobs uv run \
   --flavor a100-large \
   --secrets HF_TOKEN \
   --timeout 2h \
   "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py" \
   --model_name_or_path HuggingFaceTB/SmolVLM2-2.2B-Instruct \
   --dataset_name HuggingFaceM4/ChartQA \
   --report_to trackio

  • --flavor a100-large: GPU type for training.
  • --secrets HF_TOKEN: Your Hugging Face token.

The script handles processor setup, data formatting, and model training automatically. Once the job finishes, your fine-tuned VLM is ready to download and use in downstream tasks.

For memory-efficient fine-tuning of large VLMs, consider combining techniques like LoRA adapters, gradient accumulation, and quantization. These strategies help reduce memory usage while maintaining performance.
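
For example, here is a hedged variant of the command above that enables LoRA adapters and 4-bit loading; the flag names assume the standard TRL script arguments, and the values are illustrative:

hf jobs uv run \
   --flavor a100-large \
   --secrets HF_TOKEN \
   --timeout 2h \
   "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py" \
   --model_name_or_path HuggingFaceTB/SmolVLM2-2.2B-Instruct \
   --dataset_name HuggingFaceM4/ChartQA \
   --use_peft \
   --lora_r 16 \
   --load_in_4bit \
   --gradient_accumulation_steps 16 \
   --gradient_checkpointing \
   --bf16 \
   --report_to trackio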

Resources
