AxionLab-official committed on Mar 8

Commit 3f7e5c2 (verified) · 1 Parent(s): 321d5f9

Update README.md

Files changed (1)

README.md +34 -9

@@ -93,7 +93,7 @@ active_params/tok : ~160,000
 ## Usage
 ```python
-from transformers import AutoModelForCausalLM
+from transformers import AutoModelForCausalLM, LogitsProcessor, LogitsProcessorList
 from tokenizer import BPETokenizer
 import torch
@@ -103,9 +103,25 @@ model = AutoModelForCausalLM.from_pretrained(
 )
 model.eval()
-# model.vocab and model.model must be in the same folder
 tok = BPETokenizer.load("model.vocab", "model.model")
+# Block EOS and PAD for the first min_tokens generated tokens
+class MinNewTokens(LogitsProcessor):
+    def __init__(self, min_tokens: int, eos_id: int, pad_id: int):
+        self.min_tokens = min_tokens
+        self.bad = [eos_id, pad_id]
+        self.generated = 0
+    def __call__(self, input_ids, scores):
+        if self.generated < self.min_tokens:
+            for bid in self.bad:
+                scores[:, bid] = float("-inf")
+        self.generated += 1
+        return scores
+eos_id = tok.token2id["<eos>"]
+pad_id = tok.token2id["<pad>"]
 prompt = "# Pergunta:\nQuanto é 5 + 3?\n--\n# Resposta:\n"
 ids = tok.encode(prompt, add_bos=True, add_eos=False)
 input_ids = torch.tensor([ids])
@@ -113,16 +129,25 @@ input_ids = torch.tensor([ids])
 with torch.no_grad():
     output = model.generate(
         input_ids,
-        max_new_tokens=60,
-        temperature=0.8,
+        max_new_tokens=80,
+        temperature=0.9,
         do_sample=True,
-        top_k=40,
-        top_p=0.9,
-        eos_token_id=tok.token2id["<eos>"],
-        pad_token_id=tok.token2id["<pad>"],
+        top_k=50,
+        top_p=0.95,
+        eos_token_id=eos_id,
+        pad_token_id=pad_id,
+        use_cache=False,
+        logits_processor=LogitsProcessorList([
+            MinNewTokens(min_tokens=5, eos_id=eos_id, pad_id=pad_id)
+        ]),
     )
-print(tok.decode(output[0].tolist()))
+new_tokens = output[0][len(ids):].tolist()
+# Strip a trailing EOS if present
+if new_tokens and new_tokens[-1] == eos_id:
+    new_tokens = new_tokens[:-1]
+print("Resposta:", tok.decode(new_tokens))
 ```
 ---
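
Note on the added `MinNewTokens` processor: it tracks progress in `self.generated`, which is never reset, so an instance reused for a second `generate` call would no longer block EOS/PAD. The sketch below is one possible stateless variant, not the author's code. It works on plain Python lists as a stand-in for tensors so that it runs without any dependencies, and the names `StatelessMinNewTokens`, `prompt_len`, `blocked_ids`, and `strip_prompt_and_eos` are hypothetical. It also mirrors the post-processing step the commit adds (drop the prompt tokens, then a trailing EOS). Recent Transformers releases also expose a `min_new_tokens` generation argument that may cover the EOS case; whether it fits this custom model is untested here.

```python
NEG_INF = float("-inf")


class StatelessMinNewTokens:
    """List-based illustration of the diff's MinNewTokens idea (hypothetical names)."""

    def __init__(self, prompt_len, min_tokens, blocked_ids):
        self.prompt_len = prompt_len          # number of prompt tokens fed to generate
        self.min_tokens = min_tokens          # minimum tokens before EOS/PAD are allowed
        self.blocked_ids = list(blocked_ids)  # e.g. [eos_id, pad_id]

    def __call__(self, input_ids, scores):
        # Derive the generated count from the current sequence length instead of
        # a counter, so the same instance can be reused across generations.
        generated = len(input_ids[0]) - self.prompt_len
        if generated < self.min_tokens:
            for row in scores:
                for token_id in self.blocked_ids:
                    row[token_id] = NEG_INF
        return scores


def strip_prompt_and_eos(output_ids, prompt_len, eos_id):
    """Keep only newly generated tokens and drop one trailing EOS, if present."""
    new_tokens = list(output_ids[prompt_len:])
    if new_tokens and new_tokens[-1] == eos_id:
        new_tokens = new_tokens[:-1]
    return new_tokens
```

With real tensors the same length check would sit inside a `LogitsProcessor.__call__` and the assignment would become `scores[:, token_id] = float("-inf")`, as in the diff above.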