library_name: transformers
tags:
  - custom_generate

Overview

This generation sampling method is based on the paper Top-N Sigma: A Simple and Effective Sampling Method for Language Models.

Most token sampling techniques operate on the probabilities obtained after temperature scaling and softmax. The softmax function distorts the underlying distribution of logit scores, which makes it hard to choose a meaningful top-p or top-k value.

As a result, noisy or irrelevant tokens can remain in the candidate set after a top-p, min-p, or top-k threshold is applied.
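To make the difficulty concrete, the sketch below counts how many tokens a fixed top-p threshold keeps as the temperature changes. It is written in plain Python for illustration only: the logit values are made up, and the helper names `softmax` and `top_p_count` are hypothetical and not part of this repository.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_count(logits, top_p, temperature):
    """Size of the smallest set of tokens whose cumulative probability reaches top_p."""
    probs = sorted(softmax(logits, temperature), reverse=True)
    cumulative = 0.0
    for count, p in enumerate(probs, start=1):
        cumulative += p
        if cumulative >= top_p:
            return count
    return len(probs)

# Illustrative logits (an assumption): three plausible tokens and 1000 low-scoring noise tokens.
logits = [10.0, 9.0, 8.0] + [0.0] * 1000

counts = {t: top_p_count(logits, top_p=0.9, temperature=t) for t in (0.5, 1.0, 2.0)}
for t, c in counts.items():
    print(f"temperature={t}: top-p=0.9 keeps {c} tokens")
```

In this toy setup the same top-p value keeps two or three tokens at low temperature but several hundred noise tokens at temperature 2.0, so a single threshold does not transfer across temperatures.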

The authors observed that most logit scores follow a Gaussian distribution, and that this Gaussian bulk corresponds to the noisy or irrelevant tokens, while the informative tokens stand apart as high-valued outliers.

In the authors' words: "We observe that the majority of logits follow a Gaussian distribution in the lower-value region, which corresponds to the low-probability tails that are commonly treated as noise in the probability distribution. This pattern suggests the potential for more meaningful truncation in the logit space."

😎 Top-NSigma

Top-NSigma is a simple sampling algorithm that operates directly on the logit scores. Here's how it works:

1. Find the max logit score for the current generation step.
2. Compute the standard deviation of all the logit scores.
3. Filter out tokens whose logit scores fall more than N standard deviations below the max logit score.
4. Apply temperature and softmax to convert the logit scores of the remaining tokens to probabilities.
5. Sample a token from these probabilities.
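The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code shipped in this repository: the function names `top_n_sigma_filter` and `top_n_sigma_sample` are hypothetical, the logits are made up, the standard deviation is the population standard deviation (the repository may use a different estimator), and a real implementation would operate on batched tensors.

```python
import math
import random
import statistics

def top_n_sigma_filter(logits, n_sigma):
    """Return indices of tokens whose logit is at least max - n_sigma * std."""
    max_logit = max(logits)                  # step 1: max logit
    sigma = statistics.pstdev(logits)        # step 2: std of all logits (population std, an assumption)
    threshold = max_logit - n_sigma * sigma  # step 3: cutoff in logit space
    return [i for i, x in enumerate(logits) if x >= threshold]

def top_n_sigma_sample(logits, n_sigma=1.0, temperature=1.0, rng=random):
    """Sample one token index: filter first, then apply temperature and softmax."""
    kept = top_n_sigma_filter(logits, n_sigma)
    scaled = [logits[i] / temperature for i in kept]   # step 4: temperature
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]        # unnormalized softmax weights
    return rng.choices(kept, weights=weights, k=1)[0]  # step 5: sample

# Illustrative logits (an assumption): three plausible tokens and 1000 noise tokens.
logits = [10.0, 9.5, 9.0] + [-1.0, 1.0] * 500

kept = top_n_sigma_filter(logits, n_sigma=1.0)
token = top_n_sigma_sample(logits, n_sigma=1.0, temperature=1.5, rng=random.Random(0))
print("kept indices:", kept)
print("sampled index:", token)
```

Dividing every logit by a temperature scales the max and the standard deviation by the same factor, so the kept set is the same whether the filter runs before or after temperature scaling.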

Base model:

Qwen/Qwen2.5-0.5B-Instruct

Model compatibility

Most models. More specifically, any transformer LLM/VLM trained for causal language modeling.

Additional Arguments

This implementation of Top-NSigma requires the user to pass in a new argument n_sigma to the generation function.

It is used to filter out tokens whose logit scores are more than n_sigma standard deviations below the max logit score.

The authors recommend using n_sigma=1.0 for most use cases, but you can experiment with values in the range (0.0, 2√3].
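To get a feel for how n_sigma controls the cutoff, the sketch below counts the surviving tokens for a few values. The evenly spaced logits and the helper name `kept_count` are assumptions made for illustration, and the population standard deviation is used. For values spread uniformly like this, the full range is just under 2√3 standard deviations, so nothing is filtered at n_sigma = 2√3. This is one plausible reading of the upper end of the range, not a statement of the authors' derivation.

```python
import math
import statistics

def kept_count(logits, n_sigma):
    """Count tokens whose logit is at least max - n_sigma * std (population std, an assumption)."""
    threshold = max(logits) - n_sigma * statistics.pstdev(logits)
    return sum(1 for x in logits if x >= threshold)

# Illustrative, evenly spaced logits: 0.0, 1.0, ..., 999.0
logits = [float(i) for i in range(1000)]

for n_sigma in (0.5, 1.0, 2.0, 2 * math.sqrt(3)):
    kept = kept_count(logits, n_sigma)
    print(f"n_sigma={n_sigma:.3f}: {kept} of {len(logits)} tokens kept")
```

Smaller values keep only the tokens closest to the max logit, while larger values admit progressively more of the distribution. Real model logits are not evenly spaced, so the counts here only illustrate the trend.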

Output Type changes

(none)

Example usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", device_map="auto")
generation_config = GenerationConfig(temperature=1.5, max_length=128)

messages = [{"role":"user", "content": "Write a story about a dog and cat becoming friends."}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# Note: the custom generation method prints a hardcoded message when it runs.
gen_out = model.generate(
    **model_inputs,
    n_sigma=1.0,
    generation_config=generation_config,
    custom_generate="Pramodith/topN_sigma_generation",
    trust_remote_code=True,
)

print(tokenizer.batch_decode(gen_out, skip_special_tokens=True)[0])

Citation

@inproceedings{tang2025top,
    title={Top-n𝜎: Eliminating Noise in Logit Space for Robust Token Sampling of LLM},
    author={Tang, Chenxia and Liu, Jianchun and Xu, Hongli and Huang, Liusheng},
    booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    pages={10758--10774},
    year={2025}
}