Instructions to use moonshotai/Kimi-K2-Instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use moonshotai/Kimi-K2-Instruct with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="moonshotai/Kimi-K2-Instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("moonshotai/Kimi-K2-Instruct", trust_remote_code=True, dtype="auto")
```
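The direct-load snippet above stops before generation. Here is a minimal sketch of the remaining steps, assuming a machine with enough GPU memory for the checkpoint; `device_map="auto"` and the chat-template helpers are standard Transformers APIs, not Kimi-specific:

```python
# Sketch: complete the direct-load path with tokenization and generation.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2-Instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-K2-Instruct", trust_remote_code=True, dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Who are you?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```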
- Inference
- HuggingChat
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use moonshotai/Kimi-K2-Instruct with vLLM:
Install from pip and serve the model:
```sh
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "moonshotai/Kimi-K2-Instruct"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "moonshotai/Kimi-K2-Instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
Use Docker
```sh
docker model run hf.co/moonshotai/Kimi-K2-Instruct
```
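Once the server is up, it can also be called from Python. A minimal sketch using the official `openai` client, assuming the default host/port from the pip instructions above and `pip install openai`:

```python
# Sketch: query the vLLM server's OpenAI-compatible API from Python.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # local server needs no real key

resp = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(resp.choices[0].message.content)
```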
- SGLang
How to use moonshotai/Kimi-K2-Instruct with SGLang:
Install from pip and serve the model:
```sh
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "moonshotai/Kimi-K2-Instruct" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "moonshotai/Kimi-K2-Instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
Use Docker images
```sh
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "moonshotai/Kimi-K2-Instruct" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "moonshotai/Kimi-K2-Instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
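Besides curl, a running SGLang server can be driven from Python with SGLang's frontend language. A minimal sketch, assuming the server launched above is reachable on port 30000 (function and field names follow the SGLang docs; verify against your installed version):

```python
# Sketch: use SGLang's Python frontend against the server started above.
import sglang as sgl

@sgl.function
def ask(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=128))

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = ask.run(question="What is the capital of France?")
print(state["answer"])
```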
- Docker Model Runner
How to use moonshotai/Kimi-K2-Instruct with Docker Model Runner:
```sh
docker model run hf.co/moonshotai/Kimi-K2-Instruct
```
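Docker Model Runner also exposes an OpenAI-compatible endpoint. A sketch, assuming host TCP access has been enabled in Docker Desktop (port 12434 is the documented default; treat the port and base path as assumptions to verify against your setup):

```python
# Sketch: call Docker Model Runner's OpenAI-compatible API (assumed endpoint).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:12434/engines/v1", api_key="none")

resp = client.chat.completions.create(
    model="hf.co/moonshotai/Kimi-K2-Instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(resp.choices[0].message.content)
```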
Run the 1T-parameter model on 8× A100/H100 (80 GB) using FP4
Docker Instructions (from https://hub.docker.com/r/tutelgroup/deepseek-671b):
```sh
# For A100/A800/H100/H800/H20/H200 (80G x 8):

# Step 1: Download the 1TB model
huggingface-cli download moonshotai/Kimi-K2-Instruct --local-dir ./moonshotai/Kimi-K2-Instruct

# Step 2: Run with A100/H100 (80G x 8)
docker run -it --rm --ipc=host --net=host --shm-size=8g --ulimit memlock=-1 \
  --ulimit stack=67108864 --gpus=all -v /:/host -w /host$(pwd) \
  tutelgroup/deepseek-671b:a100x8-chat-20250712 \
  --try_path ./moonshotai/Kimi-K2-Instruct \
  --serve --listen_port 8000 \
  --prompt "Calculate the indefinite integral of 1/sin(x) + x"
```
Great work! Thanks a lot.
Could you please explain what framework is used for reasoning?
Do you mean inference framework?
We integrate a couple of well-tuned MoE operators (e.g. Kimi fused gating, low-precision MoE FFN forwarding, etc., all compatible with inexpensive GPUs) into Tutel, a library containing a collection of efficient MoE computation and communication operators. The model leverages these fixes, which remain unoptimized in public stacks, to eliminate its slow execution phases and ultimately sustain very effective overall inference throughput even on A100s.
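For readers unfamiliar with MoE gating, here is a plain-PyTorch sketch of what a top-k gate computes; fused kernels like the ones described above collapse these steps into a single pass, and this toy version is illustrative only, not the actual implementation:

```python
import torch

def topk_gate(x, w_gate, k=2):
    """Reference top-k MoE gate: score experts, keep the best k per token."""
    logits = x @ w_gate                                    # [tokens, num_experts]
    weights, experts = torch.topk(logits.softmax(dim=-1), k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize kept probs
    return weights, experts                                # combine weights + routed expert ids
```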
This is FP4? I think you mean int4?
It inline-quantizes to FP4 so that eight A100s (80 GB) can run this 1T model.
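For intuition about what FP4 means numerically, a toy fake-quantization sketch: round each weight to the nearest FP4 (E2M1) value under a per-row scale, then dequantize. This only illustrates the rounding grid, not the actual inline kernel:

```python
import torch

# The representable FP4 (E2M1) magnitudes are {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
_POS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = torch.cat([-_POS.flip(0), _POS])

def fake_fp4(w: torch.Tensor) -> torch.Tensor:
    """Round each row of w to the nearest scaled FP4 value (illustrative)."""
    scale = (w.abs().amax(dim=-1, keepdim=True) / 6.0).clamp(min=1e-12)  # row max -> 6.0
    idx = ((w / scale).unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    return FP4_GRID[idx] * scale
```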