Predictions contain `\n<ctrl100>`

#2 by jo-mengr

Hi, when running the predictions based on the suggested code (see below), I get the following output:
Predicted Cell Type:
CD4-positive, alpha-beta T cell.\n<ctrl100>
Should this special token maybe be removed automatically? For now I strip it manually (see the workaround sketch after the snippet). I also pasted my env specs below. Note that besides `pip install accelerate transformers sentencepiece` as mentioned in the model card, I additionally had to install protobuf.

# pip install accelerate transformers sentencepiece
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model directly from Hugging Face Hub
model_id = "vandijklab/C2S-Scale-Gemma-2-2B"

# Load tokenizer; requires sentencepiece to be installed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
).to(device)

# Format prompt (as described in the model card)
cell_sentence = "MALAT1 TMSB4X B2M EEF1A1 H3F3B ACTB FTL RPL13 ..." # Truncated for example, use at least 200 genes for inference
num_genes = 1000
organism = "Homo sapiens"

prompt = f"""The following is a list of {num_genes} gene names ordered by descending expression level in a {organism} cell. Your task is to give the cell type which this cell belongs to based on its gene expression.
Cell sentence: {cell_sentence}.
The cell type corresponding to these genes is:"""

# Prepare tokenized inputs
input_ids = tokenizer(prompt, return_tensors="pt").to(device)

# Generate response
outputs = model.generate(**input_ids, max_new_tokens=20)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

# The predicted cell type will be the text immediately following the prompt
predicted_cell_type = response.split("The cell type corresponding to these genes is:")[1].strip()
print(f"Predicted Cell Type: {predicted_cell_type}")

Env:

Using Python 3.12.9 environment at: c2s
Package                 Version
----------------------- -----------
accelerate              1.12.0
anyio                   4.11.0
appnope                 0.1.4
asttokens               3.0.1
certifi                 2025.11.12
charset-normalizer      3.4.4
click                   8.3.1
comm                    0.2.3
debugpy                 1.8.17
decorator               5.2.1
executing               2.2.1
filelock                3.20.0
fsspec                  2025.10.0
h11                     0.16.0
hf-xet                  1.2.0
httpcore                1.0.9
httpx                   0.28.1
huggingface-hub         0.36.0
idna                    3.11
ipykernel               7.1.0
ipython                 9.7.0
ipython-pygments-lexers 1.1.1
jedi                    0.19.2
jinja2                  3.1.6
jupyter-client          8.6.3
jupyter-core            5.9.1
markupsafe              3.0.3
matplotlib-inline       0.2.1
mpmath                  1.3.0
nest-asyncio            1.6.0
networkx                3.6
numpy                   2.3.5
packaging               25.0
parso                   0.8.5
pexpect                 4.9.0
platformdirs            4.5.0
prompt-toolkit          3.0.52
protobuf                6.33.1
psutil                  7.1.3
ptyprocess              0.7.0
pure-eval               0.2.3
pygments                2.19.2
python-dateutil         2.9.0.post0
pyyaml                  6.0.3
pyzmq                   27.1.0
regex                   2025.11.3
requests                2.32.5
safetensors             0.7.0
sentencepiece           0.2.1
setuptools              80.9.0
shellingham             1.5.4
six                     1.17.0
sniffio                 1.3.1
stack-data              0.6.3
sympy                   1.14.0
tokenizers              0.22.1
torch                   2.9.1
tornado                 6.5.2
tqdm                    4.67.1
traitlets               5.14.3
transformers            4.57.3
typer-slim              0.20.0
typing-extensions       4.15.0
urllib3                 2.5.0
wcwidth                 0.2.14
