Dynamic Quants producing garbage thinking output (llama.cpp + CUDA 13 issue)
I just pulled both UD-Q3_K_XL and UD-Q4_K_XL and get the same result with each. I was previously running M2.5; I pointed my setup at these M2.7 quants and updated llama.cpp to the latest version at the same time.
My logs become flooded with:
res send: sending result for task id = 0
res send: task id = 0 pushed to result queue
slot process_toke: id 3 | task 0 | n_decoded = 7446, n_remaining = -1, next token: 23 ''
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 7446
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 7447, front = 0
slot update_slots: id 3 | task 0 | slot decode token, n_ctx = 196608, n_tokens = 7485, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
srv update_chat_: Parsing chat message:
! $''
Parsing PEG input with format peg-native: [e~[
]~b]ai
<think>
! $''
srv operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":"\u0017"}}],"created":1775973650,"id":"chatcmpl-edmSe4RiidSxhopP0cNN8g7Juh8uvO5A","model":"UD-Q4_K_XL.gguf","system_fingerprint":"b0-unknown","object":"chat.completion.chunk"}
Here are my llama-server params:
llama-server --port 5512 \
  -m /models/MiniMax-M2.7/UD-Q4_K_XL.gguf \
  -fa on \
  -ctk q8_0 -ctv q8_0 \
  --no-mmap \
  -b 4096 -ub 2048 \
  --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.0
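The streamed chunk in the log above carries "\u0017" as its reasoning_content, i.e. a raw control character. A minimal sketch of a smoke test for that symptom, assuming a server is listening on port 5512 as in the parameters above. The function names has_control_escapes and smoke_test are hypothetical, and the check is only a heuristic, not an official llama.cpp diagnostic:

```shell
# Succeeds if stdin contains a JSON escape for an ASCII control
# character (\u0000 to \u001f), like the "\u0017" seen in the log.
has_control_escapes() {
  grep -Eq '\\u00[01][0-9a-fA-F]'
}

# Hypothetical smoke test: send one short chat request to a running
# llama-server and report whether the reply contains control characters.
# $1: chat completions URL (default assumes the port used in this thread).
smoke_test() {
  url="${1:-http://localhost:5512/v1/chat/completions}"
  response=$(curl -s "$url" \
    -H "Content-Type: application/json" \
    --data '{"messages":[{"role":"user","content":"What is the capital of France?"}],"max_tokens":64}') || return 2
  if printf '%s' "$response" | has_control_escapes; then
    echo "garbled: control characters in response"
    return 1
  fi
  echo "no control characters found"
}
```

Calling smoke_test with no arguments queries the default URL; a clean result does not prove the build is healthy, it only means this particular symptom did not show up in a short reply.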
Update
The issue seems to be caused by newer versions of llama.cpp and also by CUDA 13. Building llama.cpp at commit https://github.com/ggml-org/llama.cpp/commit/009a1133268d040a7c574a7b9c95413b0be369a9 resolved the issue for me.
I just tried it after rebuilding from source, and Q3_K_XL works fine.
I also tried Q4_K_XL and it works.
I used the exact same commands as you:
./master_llama_cpp/llama.cpp/llama-cli --model MiniMax-M2.7-GGUF/UD-Q4_K_XL/MiniMax-M2.7-UD-Q4_K_XL-00001-of-00004.gguf --flash-attn on --no-mmap -b 4096 -ub 2048 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.0
I am also getting the same output as OP. I'm running V100 GPUs, and the thinking output seems to have slowed to a crawl while also producing random characters.
I even tested your Unsloth Studio and it's not working with this model either.
You both got llama.cpp from source right? Can you try other quant providers and see if they work?
I did actually try other uploaders and they were the same (I have fast internet). I tried the Q3_K_L and Q3_K_M quants from 2 other uploaders, and your UD_Q3_K_L. I was wondering if that's a llama.cpp bug, some kind of V100 incompatibility with MiniMax 2.7, because none of them worked for me. For reference, I have used other model types just fine, like Qwen or MiniMax 2.5. I also used the latest llama.cpp source, uploaded 13 hours ago.
I tried offloading MoE as well and it works fine through -ot ".ffn_.*_exps.=CPU"
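The part of the -ot argument before "=CPU" is a regular expression matched against tensor names. A rough illustration of which names it selects, assuming the usual llama.cpp MoE tensor naming (blk.N.ffn_gate_exps.weight and so on) and using grep -E as a stand-in for llama.cpp's own regex matching. matches_override is a hypothetical helper written for this example:

```shell
# Pattern taken from the -ot argument above (the text before "=CPU").
pattern='.ffn_.*_exps.'

# Hypothetical helper: succeeds if a tensor name matches the pattern.
matches_override() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Example tensor names (assumed naming, not read from the actual GGUF).
for name in \
  blk.0.ffn_gate_exps.weight \
  blk.0.ffn_up_exps.weight \
  blk.0.ffn_down_exps.weight \
  blk.0.attn_q.weight
do
  if matches_override "$name"; then
    echo "$name -> CPU"
  else
    echo "$name -> default device"
  fi
done
```

Under these assumptions only the expert feed-forward tensors are kept on the CPU, while attention tensors stay on the default device.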
Can you try installing llama.cpp on commit https://github.com/ggml-org/llama.cpp/commit/009a1133268d040a7c574a7b9c95413b0be369a9 ie the one before https://github.com/ggml-org/llama.cpp/pull/19378
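For reference, pinning a source build to that commit could look roughly like this. It is a sketch, not a verified recipe: the clone directory name and the build_llama_at_commit wrapper are hypothetical, and -DGGML_CUDA=ON assumes a CUDA build is wanted.

```shell
# Commit suggested in this thread.
LLAMA_COMMIT=009a1133268d040a7c574a7b9c95413b0be369a9

# Hypothetical wrapper: clone llama.cpp, check out the pinned commit,
# and build llama-server and llama-cli with CUDA enabled.
# $1: directory to clone into (the default name is an assumption).
# The body runs in a subshell so the caller's working directory is kept.
build_llama_at_commit() (
  dir="${1:-llama.cpp-pinned}"
  git clone https://github.com/ggml-org/llama.cpp.git "$dir" &&
  cd "$dir" &&
  git checkout "$LLAMA_COMMIT" &&
  cmake -B build -DGGML_CUDA=ON &&
  cmake --build build --config Release -j --target llama-server llama-cli
)
```

Running build_llama_at_commit should leave the binaries under llama.cpp-pinned/build/bin/, so an existing installation is not overwritten.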
Looks like it is a llama.cpp issue. I am just so used to pulling new models and updating llama.cpp when I do. Backing up to the llama.cpp I was using before and running this I am no longer getting the garbage output
Wait a second, are you guys using CUDA 13.2 by any chance?
Yeah, I'm on that version:
Driver Version: 595.45.04, CUDA Driver Version: 13.2
But even after building at the suggested commit (https://github.com/ggml-org/llama.cpp/commit/009a1133268d040a7c574a7b9c95413b0be369a9), I'm still seeing the problem.
I used an April 5th build of ik_llama.cpp and it's working now, so it's definitely a regression in newer llama.cpp builds. That is odd in itself, because I thought Unsloth quants were not compatible with ik_llama.cpp.
Oh my, CUDA 13.2 is the culprit.
Do NOT use CUDA 13.2!!! NVIDIA is fixing the issue of bad outputs.
For me, UD-Q4_K_XL works fine: no garbage thinking output. I just used the most recent llama.cpp CUDA server Docker image.
See https://github.com/unslothai/unsloth/issues/4849. We have already pinged NVIDIA. CUDA 13.2 breaks llama.cpp quants under 4-bit; use CUDA 13.1 or lower, or as a fallback use our pre-compiled binaries from https://github.com/unslothai/llama.cpp/releases/tag/b8746, which use CUDA 13.0.
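A quick way to see which CUDA version is in play before rebuilding is sketched below. The thread does not say whether the toolkit or the driver-reported version is the one that matters, so this prints both. The helpers is_cuda_13_2 and report_cuda are hypothetical, and the sed patterns assume the usual nvcc and nvidia-smi output layouts:

```shell
# Hypothetical helper: succeeds if a version string is in the 13.2 series.
is_cuda_13_2() {
  case "$1" in
    13.2|13.2.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: print one line per version source.
# $1: label, $2: version string (empty when the tool is missing).
report_cuda() {
  if [ -z "$2" ]; then
    echo "$1: not found"
  elif is_cuda_13_2 "$2"; then
    echo "$1: $2 (reported as problematic in this thread)"
  else
    echo "$1: $2"
  fi
}

# Toolkit version that a source build would compile against.
toolkit=""
if command -v nvcc >/dev/null 2>&1; then
  toolkit=$(nvcc --version | sed -n 's/.*release \([0-9][0-9.]*\).*/\1/p')
fi
report_cuda "CUDA toolkit (nvcc)" "$toolkit"

# Highest CUDA version the installed driver reports.
driver=""
if command -v nvidia-smi >/dev/null 2>&1; then
  driver=$(nvidia-smi | sed -n 's/.*CUDA[A-Za-z ]*Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1)
fi
report_cuda "CUDA version reported by driver (nvidia-smi)" "$driver"
```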
llama.cpp at https://github.com/ggml-org/llama.cpp/commit/009a1133268d040a7c574a7b9c95413b0be369a9 also works for me using CUDA 12. I confirmed I was always using CUDA 12, so something in the newer builds broke it for me as well. I might try to bisect in the next few days to find what it is.
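A bisect along those lines could be set up roughly as follows. This is only a sketch: it assumes the current HEAD of a llama.cpp checkout shows the garbage output and that the commit above is good. start_bisect is a hypothetical wrapper around standard git bisect commands.

```shell
# Commit reported as working in this thread.
GOOD_COMMIT=009a1133268d040a7c574a7b9c95413b0be369a9

# Hypothetical wrapper: begin a bisect inside an existing llama.cpp
# checkout, marking HEAD as bad and the known-good commit as good.
start_bisect() {
  git bisect start &&
  git bisect bad HEAD &&
  git bisect good "$GOOD_COMMIT"
}

# After each step git checks out a candidate commit. Rebuild, run the
# model, then record the result with one of:
#   git bisect good    # output looks normal
#   git bisect bad     # garbage thinking output
#   git bisect skip    # this commit does not build
# When git names the first bad commit, finish with:
#   git bisect reset
```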
So this fixed the outputs in my case:
- I'm using llama.cpp at the suggested commit: https://github.com/ggml-org/llama.cpp/commit/009a1133268d040a7c574a7b9c95413b0be369a9
- Driver Version: 595.45.04 CUDA Driver Version: 13.2
- ❌ UD-IQ4_XS doesn't work.
- ✅ UD-Q3_K_XL works.
Finally got it tested, and yes, it's working with the build you linked.
Can you edit your main post so people can find the fix easily without reading the thread? Thank you!! :)
@orlandocollins - try our CUDA 13.0 pre-compiled binaries at https://huggingface.co/unsloth/MiniMax-M2.7-GGUF/discussions/1#69db47965d14cc8ca1c6d3e4







