# Multimodal Support in llama.cpp

This directory provides multimodal capabilities for `llama.cpp`. Initially intended as a showcase for running LLaVA models, it has expanded significantly over time to include various other vision-capable models. As a result, LLaVA is no longer the only multimodal architecture supported.

> [!IMPORTANT]
>
> Multimodal support can be viewed as a sub-project within `llama.cpp`. It is under **very heavy development**, and **breaking changes are expected**.

The naming and structure related to multimodal support have evolved, which might cause some confusion. Here's a brief timeline to clarify:

- [#3436](https://github.com/ggml-org/llama.cpp/pull/3436): Initial support for LLaVA 1.5 was added, introducing `llava.cpp` and `clip.cpp`. The `llava-cli` binary was created for model interaction.
- [#4954](https://github.com/ggml-org/llama.cpp/pull/4954): Support for MobileVLM was added, making it the second supported vision model. This built upon the existing `llava.cpp`, `clip.cpp`, and `llava-cli` infrastructure.
- **Expansion & Fragmentation:** Many new models were subsequently added (e.g., [#7599](https://github.com/ggml-org/llama.cpp/pull/7599), [#10361](https://github.com/ggml-org/llama.cpp/pull/10361), [#12344](https://github.com/ggml-org/llama.cpp/pull/12344), and others). However, `llava-cli` lacked support for the increasingly complex chat templates required by these models. This led to the creation of model-specific binaries like `qwen2vl-cli`, `minicpmv-cli`, and `gemma3-cli`. While functional, this proliferation of command-line tools became confusing for users.
- [#12849](https://github.com/ggml-org/llama.cpp/pull/12849): `libmtmd` was introduced as a replacement for `llava.cpp`. Its goals include providing a single, unified command-line interface, improving the user/developer experience (UX/DX), and supporting both audio and image inputs.
- [#13012](https://github.com/ggml-org/llama.cpp/pull/13012): `mtmd-cli` was added, consolidating the various model-specific CLIs into a single tool powered by `libmtmd`.

## Pre-quantized models

See the list of pre-quantized models [here](../../docs/multimodal.md).
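
For a quick start with one of those pre-quantized models, a minimal sketch is to use the `-hf` flag, which fetches the files from Hugging Face. The repository name below is only an example; substitute any entry from the linked list:

```sh
# Example only: replace the repository with one from the pre-quantized model list.
# -hf downloads the model (and its mmproj, when one is published alongside it).
llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF

# The same flag works with the server:
llama-server -hf ggml-org/gemma-3-4b-it-GGUF
```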

## How it works and what is `mmproj`?

Multimodal support in `llama.cpp` works by encoding images into embeddings using a separate model component, and then feeding these embeddings into the language model.

This approach keeps the multimodal components distinct from the core `libllama` library. Separating these allows for faster, independent development cycles. While many modern vision models are based on Vision Transformers (ViTs), their specific pre-processing and projection steps can vary significantly. Integrating this diverse complexity directly into `libllama` is currently challenging.

Consequently, running a multimodal model typically requires two GGUF files:
1. The standard language model file.
2. A corresponding **multimodal projector (`mmproj`)** file, which handles the image encoding and projection.
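
For example, `llama-mtmd-cli` takes the text model via `-m` and the projector via `--mmproj`. The file names below are placeholders, not a required naming convention:

```sh
# Both GGUF files are passed explicitly; paths and names are placeholders.
llama-mtmd-cli \
    -m       ./models/model-7b-q4_k_m.gguf \
    --mmproj ./models/mmproj-model-7b-f16.gguf \
    --image  ./example.jpg \
    -p "Describe this image."
```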

## What is `libmtmd`?

As outlined in the history, `libmtmd` is the modern library designed to replace the original `llava.cpp` implementation for handling multimodal inputs.

Built upon `clip.cpp` (similar to `llava.cpp`), `libmtmd` offers several advantages:

- **Unified Interface:** Aims to consolidate interaction for various multimodal models.
- **Improved UX/DX:** Features a more intuitive API, inspired by the `Processor` class in the Hugging Face `transformers` library.
- **Flexibility:** Designed to support multiple input types (text, audio, images) while respecting the wide variety of chat templates used by different models.

## How to obtain `mmproj`

Multimodal projector (`mmproj`) files are specific to each model architecture.

For the following models, you can use `convert_hf_to_gguf.py` with the `--mmproj` flag to get the `mmproj` file (see the example after the list below):

- [Gemma 3](https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d) - see the guide [here](../../docs/multimodal/gemma3.md) (note: the 1B variant does not have vision support)
- SmolVLM (from [HuggingFaceTB](https://huggingface.co/HuggingFaceTB))
- SmolVLM2 (from [HuggingFaceTB](https://huggingface.co/HuggingFaceTB))
- [Pixtral 12B](https://huggingface.co/mistral-community/pixtral-12b) - only works with the `transformers`-compatible checkpoint
- Qwen 2 VL and Qwen 2.5 VL (from [Qwen](https://huggingface.co/Qwen))
- [Mistral Small 3.1 24B](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503)
- InternVL 2.5 and InternVL 3 from [OpenGVLab](https://huggingface.co/OpenGVLab) (note: conversion of `InternVL3-*-hf` checkpoints is not supported, only the non-HF version; the `InternLM2Model` **text** model is also not supported)
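
As a sketch of that conversion step (paths and output file names are assumptions, not fixed conventions), the text model and the projector are exported by two separate runs of the same script:

```sh
# 1) Convert the language model itself (output name is an example).
python convert_hf_to_gguf.py ./path-to-hf-model --outfile model-f16.gguf --outtype f16

# 2) Export only the multimodal projector from the same checkpoint.
python convert_hf_to_gguf.py ./path-to-hf-model --mmproj --outfile mmproj-model-f16.gguf
```

The resulting pair of GGUF files can then be loaded together as described in the section above.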

For older models, please refer to the relevant guide for instructions on how to obtain or create them:

NOTE: conversion scripts are located under `tools/mtmd/legacy-models`

- [LLaVA](../../docs/multimodal/llava.md)
- [MobileVLM](../../docs/multimodal/MobileVLM.md)
- [GLM-Edge](../../docs/multimodal/glmedge.md)
- [MiniCPM-V 2.5](../../docs/multimodal/minicpmv2.5.md)
- [MiniCPM-V 2.6](../../docs/multimodal/minicpmv2.6.md)
- [MiniCPM-o 2.6](../../docs/multimodal/minicpmo2.6.md)
- [IBM Granite Vision](../../docs/multimodal/granitevision.md)