Update MiMo-V2-Flash config.json for native Transformers compatibility
Hello Xiaomi MiMo team,
I am currently working on a PR to add the MiMo-V2-Flash model to the Transformers library, and I’ve been asked whether a few entries could be added to your config.json file. The goal is a single unified config.json that follows native Transformers hyperparameter conventions while still matching your current file.
This PR therefore updates your config.json to include native Transformers config entries while keeping full backward compatibility with your current remote implementation (a single merged config.json, no change in model behavior).
Added entries
- rms_norm_eps (native alias of layernorm_epsilon)
- layer_types (native alias of hybrid_layer_pattern)
- mlp_layer_types (native alias of moe_layer_freq)
- rope_parameters (native RoPE structure for full/sliding attention)
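As a rough illustration of the alias pairs listed above, the sketch below checks that each added native key sits next to the legacy key it mirrors. The helper names (`missing_native_entries`, `scalar_alias_agrees`) and the toy config values are hypothetical and not taken from the actual config.json. Only the key names come from this PR description.

```python
# Hypothetical sanity check for a merged config dict.
# Key names come from the PR description; all values below are placeholders.
ALIASES = {
    "rms_norm_eps": "layernorm_epsilon",
    "layer_types": "hybrid_layer_pattern",
    "mlp_layer_types": "moe_layer_freq",
}


def missing_native_entries(config: dict) -> list[str]:
    """Return native keys that are absent or whose legacy counterpart is absent."""
    problems = [
        native
        for native, legacy in ALIASES.items()
        if native not in config or legacy not in config
    ]
    # rope_parameters is listed as a new structure, not as an alias of one legacy key.
    if "rope_parameters" not in config:
        problems.append("rope_parameters")
    return problems


def scalar_alias_agrees(config: dict) -> bool:
    """rms_norm_eps is described as an alias, so it should equal layernorm_epsilon."""
    return config["rms_norm_eps"] == config["layernorm_epsilon"]


# Toy config with made-up placeholder values (not the real MiMo-V2-Flash settings).
toy_config = {
    "layernorm_epsilon": 1e-5,
    "rms_norm_eps": 1e-5,
    "hybrid_layer_pattern": [0, 1],
    "layer_types": ["placeholder_a", "placeholder_b"],
    "moe_layer_freq": [0, 1],
    "mlp_layer_types": ["placeholder_c", "placeholder_d"],
    "rope_parameters": {},
}

problems = missing_native_entries(toy_config)
```

The check only covers presence and the one scalar alias. How the list-valued legacy fields map onto `layer_types` and `mlp_layer_types` is not spelled out here, so the sketch does not attempt to validate that mapping.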
Changed entries (non-breaking changes)
- routed_scaling_factor: null -> 1.0
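One way to see why this change can be non-breaking: if the loading code treats a null scaling factor as "no scaling", then an explicit 1.0 gives the same result. The sketch below assumes that behavior. It is not taken from modeling_mimo_v2_flash.py, and `effective_scale` and `scale_routed_output` are hypothetical names.

```python
from typing import Optional


def effective_scale(routed_scaling_factor: Optional[float]) -> float:
    """Assumption: a null (None) factor means the routed output is left unscaled."""
    return 1.0 if routed_scaling_factor is None else float(routed_scaling_factor)


def scale_routed_output(values: list[float], routed_scaling_factor: Optional[float]) -> list[float]:
    """Multiply each routed-expert output value by the effective scaling factor."""
    scale = effective_scale(routed_scaling_factor)
    return [v * scale for v in values]


# Under the stated assumption, the old (null) and new (1.0) settings are equivalent.
old_result = scale_routed_output([0.5, 2.0], None)
new_result = scale_routed_output([0.5, 2.0], 1.0)
```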
All legacy fields used by modeling_mimo_v2_flash.py are preserved, so this remains compatible with existing loading paths.
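The backward-compatibility claim can be spot-checked by comparing the previous and merged configs key by key. The following is a minimal sketch, assuming both files have already been parsed into dicts. The function name `changed_legacy_fields` and the sample dicts are hypothetical.

```python
def changed_legacy_fields(
    old: dict,
    new: dict,
    allowed_changes: frozenset = frozenset({"routed_scaling_factor"}),
) -> list[str]:
    """Return legacy keys that were dropped or altered, ignoring the allowed changes."""
    return sorted(
        key
        for key, value in old.items()
        if key not in allowed_changes and (key not in new or new[key] != value)
    )


# Placeholder dicts standing in for the old and merged config.json contents.
old_cfg = {"layernorm_epsilon": 1e-5, "routed_scaling_factor": None}
new_cfg = {"layernorm_epsilon": 1e-5, "routed_scaling_factor": 1.0, "rms_norm_eps": 1e-5}

changed = changed_legacy_fields(old_cfg, new_cfg)
```

An empty result means every legacy field kept its value apart from the intended routed_scaling_factor change. Added keys such as rms_norm_eps are ignored by design.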
Tagging @AntonV from the Hugging Face team, who also worked on this and is aware of the alignment request.
Thanks! Kind regards.
Thanks a lot!! This would be very nice to merge 🤗