library_name: transformers
tags:
- custom_generate
Overview
This generation sampling method is based on the paper Top-N Sigma: A Simple and Effective Sampling Method for Language Models.
Most token sampling techniques operate on the probabilities obtained after temperature has been applied. The softmax function distorts the underlying distribution of logit scores, which makes it hard to choose a meaningful top-p or top-k value.
As a result, noisy or irrelevant tokens can end up in the candidate set after the top-p, min-p, or top-k threshold is applied.
The authors observed that most logit scores follow a Gaussian distribution, and that this Gaussian bulk is made up of the noisy, irrelevant tokens, while the informative tokens stand out as high-value outliers. As the paper puts it:
We observe that the majority of logits follow a Gaussian distribution in the lower-value region, which corresponds to the low-probability tails that are commonly treated as noise in the probability distribution. This pattern suggests the potential for more meaningful truncation in the logit space.
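To make the temperature sensitivity of top-p concrete, here is a small plain-Python sketch on made-up logits. The helper names `softmax` and `top_p_size` and the toy values are assumptions for illustration only; they are not taken from the paper or from this repository.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_size(logits, top_p, temperature):
    """Number of tokens kept by nucleus (top-p) filtering at a given temperature."""
    probs = sorted(softmax(logits, temperature), reverse=True)
    cumulative = 0.0
    for count, p in enumerate(probs, start=1):
        cumulative += p
        if cumulative >= top_p:
            return count
    return len(probs)


# Toy logits: 2 plausible tokens followed by 98 low-scoring "noise" tokens.
toy_logits = [10.0, 9.0] + [2.0] * 98

# Size of the top-p candidate set at several temperatures, same top_p each time.
sizes = {t: top_p_size(toy_logits, top_p=0.9, temperature=t) for t in (0.5, 1.0, 2.0, 4.0)}
print(sizes)
```

With these toy values, the same `top_p=0.9` keeps only the two plausible tokens at low temperature but pulls in most of the noise tokens once the temperature is raised, which is the behaviour the paragraph above describes.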
😎 Top-NSigma
Top-NSigma is a simple sampling algorithm that operates directly on the logit scores. Here's how it works:
- Find the max logit score for the given time step of generation.
- Compute the standard deviation of all the logit scores.
- Filter out tokens whose logit scores are more than N standard deviations below the max logit score.
- Apply temperature and softmax to convert the logit scores of the remaining tokens to probabilities.
- Sample tokens.
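The steps above can be sketched in plain Python as follows. This is a minimal illustration on a toy list of logits, not the code shipped in this repository: the function names `top_n_sigma_filter` and `sample_top_n_sigma` are hypothetical, using the population standard deviation is an assumption, and a real implementation would operate on batched tensors.

```python
import math
import random
import statistics


def top_n_sigma_filter(logits, n_sigma=1.0):
    """Return indices of tokens whose logit is within n_sigma std devs of the max."""
    max_logit = max(logits)
    sigma = statistics.pstdev(logits)  # assumption: population std over the vocabulary
    threshold = max_logit - n_sigma * sigma
    return [i for i, score in enumerate(logits) if score >= threshold]


def sample_top_n_sigma(logits, n_sigma=1.0, temperature=1.0, rng=random):
    """Sample one token index: filter in logit space, then temperature + softmax."""
    kept = top_n_sigma_filter(logits, n_sigma)
    scaled = [logits[i] / temperature for i in kept]
    # Numerically stable softmax over the surviving tokens only.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(kept, weights=probs, k=1)[0]


# Toy vocabulary: two high-scoring tokens and a tail of low-scoring ones.
toy_logits = [10.0, 9.5, 2.0, 1.5, 1.0, 0.5, 0.0, -0.5]
kept_indices = top_n_sigma_filter(toy_logits, n_sigma=1.0)
token = sample_top_n_sigma(toy_logits, n_sigma=1.0, temperature=1.5)
print(kept_indices, token)
```

On this toy input the standard deviation is roughly 3.97, so the threshold sits near 6.03 and only the first two tokens survive the filter; the sampled token is always one of those two.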
Base model:
Qwen/Qwen2.5-0.5B-Instruct
Model compatibility
Most models. More specifically, any transformer LLM/VLM trained for causal language modeling.
Additional Arguments
This implementation of Top-NSigma requires the user to pass in a new argument n_sigma to the generation function.
It is used to filter out tokens whose logit scores are more than n_sigma standard deviations below the max logit score.
The authors recommend using n_sigma=1.0 for most use cases, but you can experiment with values in the range (0.0, 2√3].
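As a rough feel for how `n_sigma` changes the candidate set, the sketch below counts the surviving tokens for a few values on made-up logits, and checks that rescaling the logits by a temperature leaves the kept set unchanged. The helper `kept_tokens` and the toy values are hypothetical, and the population standard deviation is an assumption; this is not the repository's implementation.

```python
import statistics


def kept_tokens(logits, n_sigma):
    """Indices whose logit is at most n_sigma std devs below the max logit."""
    threshold = max(logits) - n_sigma * statistics.pstdev(logits)
    return [i for i, score in enumerate(logits) if score >= threshold]


toy_logits = [10.0, 9.5, 7.0, 2.0, 1.5, 1.0, 0.5, 0.0]

# A larger n_sigma lowers the threshold, so it never keeps fewer candidates.
counts = {n: len(kept_tokens(toy_logits, n)) for n in (0.5, 1.0, 2.0, 3.0)}

# Dividing every logit by a temperature scales the max and the standard
# deviation by the same factor, so the kept set does not change.
hot_logits = [x / 2.0 for x in toy_logits]
same_set = kept_tokens(toy_logits, 1.0) == kept_tokens(hot_logits, 1.0)
print(counts, same_set)
```

Small values keep only the top few tokens, while values near the upper end of the range keep most or all of this toy vocabulary.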
Output Type changes
(none)
Example usage
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import GenerationConfig
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", device_map="auto")
generation_config = GenerationConfig(temperature=1.5, max_length=128)
messages = [{"role":"user", "content": "Write a story about a dog and cat becoming friends."}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# There is a print message hardcoded in the custom generation method
gen_out = model.generate(
    **model_inputs,
    n_sigma=1.0,
    generation_config=generation_config,
    custom_generate="Pramodith/topN_sigma_generation",
    trust_remote_code=True,
)
print(tokenizer.batch_decode(gen_out, skip_special_tokens=True)[0])
Citation
@inproceedings{tang2025top,
title={Top-n𝜎: Eliminating Noise in Logit Space for Robust Token Sampling of LLM},
author={Tang, Chenxia and Liu, Jianchun and Xu, Hongli and Huang, Liusheng},
booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
pages={10758--10774},
year={2025}
}