Introduction to Vision Language Models

Vision Language Models (VLMs) can understand both images and text simultaneously, enabling tasks like image captioning, visual question answering, and multimodal reasoning. Just like LLMs, VLMs are trained to predict the next token — but with the added ability to process visual information. For example, HuggingFaceTB/SmolVLM2-2.2B-Base is a base model, while HuggingFaceTB/SmolVLM2-2.2B-Instruct is instruction-tuned for chat-like interactions with users.

In this unit, we will explore how these models are built, how they work, and, most importantly, how you can use and adapt them for your own projects.

By the end of this unit, you’ll fine-tune a VLM using the same techniques you’ve already learned in previous units (like SFT). As ever, this unit is smol but fast!

If you’re looking for a deeper dive into computer vision, check out The Community Computer Vision Course.

After completing this unit (and the assignment), don’t forget to test your knowledge with the quiz!

What are Vision Language Models?

VLMs process images alongside text to enable tasks like image captioning, visual question answering, and multimodal reasoning.

A typical VLM architecture consists of an image encoder to extract visual features, a projection layer to align visual and textual representations, and a language model to process or generate text. This allows the model to establish connections between visual elements and language concepts.
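
To make these components concrete, here is a toy sketch of how an image encoder, a projection layer, and a language model fit together. This is not the actual SmolVLM2 implementation; the layer choices and dimensions are purely illustrative stand-ins.

```python
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Illustrative wiring of the three blocks described above (not a real VLM)."""

    def __init__(self, vision_dim=768, text_dim=960, vocab_size=49_280):
        super().__init__()
        # Stand-ins for a ViT-style image encoder and a decoder-only language model.
        self.image_encoder = nn.TransformerEncoderLayer(d_model=vision_dim, nhead=8, batch_first=True)
        self.projector = nn.Linear(vision_dim, text_dim)
        self.language_model = nn.Linear(text_dim, vocab_size)

    def forward(self, patch_embeds, text_embeds):
        # patch_embeds: (batch, num_patches, vision_dim); text_embeds: (batch, seq_len, text_dim)
        visual_features = self.image_encoder(patch_embeds)
        visual_tokens = self.projector(visual_features)            # align visual features to the text space
        sequence = torch.cat([visual_tokens, text_embeds], dim=1)  # image tokens join the text sequence
        return self.language_model(sequence)                       # next-token logits over the vocabulary

# Shape check only: 64 image patches + 16 text tokens -> logits for 80 positions.
logits = ToyVLM()(torch.randn(1, 64, 768), torch.randn(1, 16, 960))
print(logits.shape)  # torch.Size([1, 80, 49280])
```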

VLMs can be used in different configurations depending on the use case. Base models handle general vision-language tasks, while chat-optimized variants support conversational interactions. Some models include additional components for grounding predictions in visual evidence or specializing in specific tasks like object detection.
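
As a quick illustration of the chat-optimized configuration, here is a minimal sketch of running the instruction-tuned SmolVLM2 checkpoint with 🤗 transformers. The exact classes and chat-template behavior depend on your transformers version, and the image URL below is only a placeholder.

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# One user turn containing an image and a question about it.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/cat.png"},  # placeholder URL
            {"type": "text", "text": "What is in this picture?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```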

Latest trends

Adding vision to language models has unlocked many exciting directions, including:

  • Reasoning-focused VLMs: solve complex problems using visual inputs.

  • Specialized VLMs: e.g. object detection, segmentation, or document understanding.

  • Vision-Language-Action models: generate end actions for robotics.

  • Agentic VLMs: enable complex workflows like chatting with documents or interacting with a computer through screenshots.

  • Any-to-any models: expanding beyond vision and text to handle multiple input/output modalities (text, image, audio, video, etc.).

Adapting Vision Language Models for specific needs

Fine-tuning a VLM means adapting a pre-trained model to your dataset or task. You’ve already seen strategies like supervised fine-tuning (SFT) and preference alignment in previous units; the same ideas apply here.

While the core tools and techniques remain similar to those used for LLMs, fine-tuning VLMs brings additional challenges. A key one is data representation: images must be carefully prepared so the model can effectively combine visual and textual information. Another factor is model size. VLMs are often much larger than LLMs, making efficiency critical.
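
To make the data-representation point concrete, here is a hedged sketch of turning one (image, question, answer) training example into model inputs with the model's processor. The field names (`question`, `answer`, `image`) are hypothetical placeholders for whatever your dataset actually uses, and the exact chat-template output depends on the processor.

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct")

def format_example(example):
    # Express the pair as a chat: the user supplies an image and a question,
    # the assistant supplies the answer we want the model to learn.
    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": example["question"]},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": example["answer"]},
        ]},
    ]
    # The chat template inserts the model's special image tokens into the prompt;
    # the processor then packs token ids and pixel values together.
    prompt = processor.apply_chat_template(messages, add_generation_prompt=False)
    # example["image"] is assumed to be a PIL image here.
    return processor(text=prompt, images=[example["image"]], return_tensors="pt")
```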

To keep training practical and cost-effective, we can rely on techniques like quantization and PEFT (Parameter-Efficient Fine-Tuning), as we explored in Unit 1. These approaches make fine-tuning more lightweight, enabling more users to adapt and experiment with powerful VLMs.
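
As a rough sketch of what that looks like in practice, the snippet below combines 4-bit quantization (via bitsandbytes) with LoRA adapters (via peft). It assumes a GPU with bitsandbytes and accelerate installed; the hyperparameters and target module names are illustrative and depend on the checkpoint's architecture.

```python
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the model weights in 4-bit to cut memory usage.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach small LoRA adapters instead of updating all weights.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical attention projections
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights is trained
```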

Evaluating Vision Language Models

As we saw in Unit 2, evaluation is a crucial step both during development and in production. For Vision Language Models (VLMs), the same principle applies: we need benchmarks to assess their capabilities and limitations during development, and real-world testing to ensure reliability and practical usefulness once deployed.

Some widely used general-purpose benchmarks include:

  • MMMU & MMMU-Pro: large multi-discipline benchmarks requiring reasoning across domains like arts, science, and engineering.
  • MMBench: over 3,000 single-choice questions testing skills such as OCR, localization, and reasoning.
  • MMT-Bench: focuses on expert-level multimodal tasks, including recognition, localization, reasoning, and planning.

There are also domain-specific benchmarks designed to test specialized skills:

  • MathVista: evaluates mathematical reasoning in the context of images.
  • AI2D: focuses on diagram understanding.
  • ScienceQA: science question answering.
  • OCRBench: assesses document understanding and OCR capabilities.

Finally, for a streamlined evaluation workflow, the OpenVLM Leaderboard provides a toolkit to evaluate VLMs across multiple benchmarks with a single command.

What You’ll Build

By the end of this module, you will:

  • Learn how to use VLMs with the 🤗 transformers library
  • Understand chat templates and conversation formatting for VLMs
  • Fine-tune SmolVLM on your own dataset
  • Run both programmatic and CLI-based training workflows

Let’s dive into the fascinating world of Vision Language Models!
