Predictions contain `\n<ctrl100>`
#2
opened by jo-mengr
Hi, when running the predictions based on the suggested code (see below), I get the following output:
```
Predicted Cell Type:
CD4-positive, alpha-beta T cell.\n <ctrl100>
```
Should this special token maybe be removed automatically? I also pasted my env specs below. Note that besides `pip install accelerate transformers sentencepiece` as mentioned in the model card, I had to install `protobuf`.
```python
# pip install accelerate transformers sentencepiece protobuf
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model directly from Hugging Face Hub
model_id = "vandijklab/C2S-Scale-Gemma-2-2B"

# Load tokenizer; requires sentencepiece to be installed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

# Format prompt (see previous section)
cell_sentence = "MALAT1 TMSB4X B2M EEF1A1 H3F3B ACTB FTL RPL13 ..."  # Truncated for example; use at least 200 genes for inference
num_genes = 1000
organism = "Homo sapiens"
prompt = f"""The following is a list of {num_genes} gene names ordered by descending expression level in a {organism} cell. Your task is to give the cell type which this cell belongs to based on its gene expression.
Cell sentence: {cell_sentence}.
The cell type corresponding to these genes is:"""

# Prepare tokenized inputs (a BatchEncoding holding input_ids and attention_mask)
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate response
outputs = model.generate(**inputs, max_new_tokens=20)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

# The predicted cell type is the text immediately following the prompt
predicted_cell_type = response.split("The cell type corresponding to these genes is:")[1].strip()
print(f"Predicted Cell Type: {predicted_cell_type}")
```
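For now I'm post-processing the output myself. This is a minimal sketch of that workaround (the helper name and regex are my own, not from the model card): it strips any `<ctrlNN>` token text left in the decoded string and trims surrounding whitespace.

```python
import re

def clean_prediction(text: str) -> str:
    """Remove Gemma <ctrlNN> control-token text and trim whitespace.

    Assumes the token appears literally in the decoded string, as in the
    output above; adjust the pattern if other residue shows up.
    """
    text = re.sub(r"<ctrl\d+>", "", text)
    return text.strip()

raw = "CD4-positive, alpha-beta T cell.\n<ctrl100>"
print(clean_prediction(raw))  # -> CD4-positive, alpha-beta T cell.
```

Alternatively, registering the token via `tokenizer.add_special_tokens({"additional_special_tokens": ["<ctrl100>"]})` should let `skip_special_tokens=True` drop it during decoding, though I haven't verified that this works with the Gemma tokenizer.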
Env:
Using Python 3.12.9 environment at: c2s
Package Version
----------------------- -----------
accelerate 1.12.0
anyio 4.11.0
appnope 0.1.4
asttokens 3.0.1
certifi 2025.11.12
charset-normalizer 3.4.4
click 8.3.1
comm 0.2.3
debugpy 1.8.17
decorator 5.2.1
executing 2.2.1
filelock 3.20.0
fsspec 2025.10.0
h11 0.16.0
hf-xet 1.2.0
httpcore 1.0.9
httpx 0.28.1
huggingface-hub 0.36.0
idna 3.11
ipykernel 7.1.0
ipython 9.7.0
ipython-pygments-lexers 1.1.1
jedi 0.19.2
jinja2 3.1.6
jupyter-client 8.6.3
jupyter-core 5.9.1
markupsafe 3.0.3
matplotlib-inline 0.2.1
mpmath 1.3.0
nest-asyncio 1.6.0
networkx 3.6
numpy 2.3.5
packaging 25.0
parso 0.8.5
pexpect 4.9.0
platformdirs 4.5.0
prompt-toolkit 3.0.52
protobuf 6.33.1
psutil 7.1.3
ptyprocess 0.7.0
pure-eval 0.2.3
pygments 2.19.2
python-dateutil 2.9.0.post0
pyyaml 6.0.3
pyzmq 27.1.0
regex 2025.11.3
requests 2.32.5
safetensors 0.7.0
sentencepiece 0.2.1
setuptools 80.9.0
shellingham 1.5.4
six 1.17.0
sniffio 1.3.1
stack-data 0.6.3
sympy 1.14.0
tokenizers 0.22.1
torch 2.9.1
tornado 6.5.2
tqdm 4.67.1
traitlets 5.14.3
transformers 4.57.3
typer-slim 0.20.0
typing-extensions 4.15.0
urllib3 2.5.0
wcwidth 0.2.14