DistilBERT — Sentiment Classification

Fine-tuned DistilBERT for Amazon product review sentiment analysis. Classifies reviews into Negative (1–2★), Neutral (3★), and Positive (4–5★).

Architecture: distilbert-base-uncased fine-tuned with a classification head (3 classes)
Parameters: 66M (6 transformer layers, 768 hidden dim, 12 attention heads)
Tokenizer: BERT WordPiece uncased (vocab_size=30,522)
Dataset: Amazon Reviews 2023 — 149,761 train / 32,092 test (stratified 70/15/15 split)
Weights: model.safetensors (SafeTensors format, 255 MB)
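
The star-to-class bucketing described above can be sketched as a small helper. The function name `stars_to_label` is hypothetical, and the integer ids are assumed to follow the `labels` mapping used in the inference examples of this card (0 = Negative, 1 = Neutral, 2 = Positive).

```python
LABELS = {0: "Negative", 1: "Neutral", 2: "Positive"}

def stars_to_label(stars: int) -> int:
    """Map a 1-5 star rating to a class id (hypothetical helper)."""
    if not 1 <= stars <= 5:
        raise ValueError(f"expected a rating from 1 to 5, got {stars}")
    if stars <= 2:
        return 0  # Negative: 1-2 stars
    if stars == 3:
        return 1  # Neutral: 3 stars
    return 2      # Positive: 4-5 stars

# Show the class name for every possible rating.
print([LABELS[stars_to_label(s)] for s in range(1, 6)])
```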

How this model was trained

This is a supervised fine-tuning of DistilBERT for 3-class sentiment classification:

                   149,761 reviews (train)
                           │
                           ▼
              DistilBERT Tokenizer
              (max_length=256, truncation)
                           │
                           ▼
              distilbert-base-uncased
              (66M params, pretrained on BooksCorpus + English Wikipedia)
                           │
                           ▼
              Sequence Classification Head
              (dropout=0.3, 3 output logits → Negative / Neutral / Positive)
                           │
                           ▼
              Training: 5 epochs
              lr=2e-5, batch_size=32
              weight_decay=0.01, warmup_steps=2,340
              optimizer: AdamW
                           │
            ┌──────────────┴──────────────┐
            ▼                             ▼
  ┌───────────────────┐   ┌──────────────────────────────┐
  │ model.safetensors │   │ Evaluation on                │
  │ (255 MB)          │   │ 32,092 test reviews          │
  │                   │   │ → metrics_distilbert.json    │
  │ SERVED WEIGHTS    │   │ → predictions_distilbert.csv │
  └───────────────────┘   │ PRODUCTION ARTIFACTS         │
                          └──────────────────────────────┘

Training stability

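The loss converged smoothly over 5 epochs. Training loss started at 1.097 at step 50 (close to ln 3 ≈ 1.099, the chance level for three classes) and decreased steadily to ~0.50 by epoch 5. Accuracy and F1 improved every epoch, while eval loss flattened after epoch 3 without a meaningful rise, so there is no clear sign of overfitting: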

Epoch	Eval Loss	Accuracy	F1
1	0.601	75.04%	74.66%
2	0.551	76.74%	76.68%
3	0.546	77.10%	77.05%
4	0.551	77.32%	77.14%
5	0.550	77.51%	77.48%

Eval loss plateaus after epoch 3 (0.546 → 0.551 → 0.550) while F1 keeps inching up (77.05% → 77.48%), which suggests the model had reached a stable optimum by epoch 5.
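
To make the checkpoint choice concrete, the sketch below picks the best epoch under two criteria, lowest eval loss and highest F1, using the numbers from the table above. It only illustrates how the table can be read; the card does not say which criterion was used to select the released weights.

```python
# Per-epoch evaluation numbers copied from the table above.
history = [
    {"epoch": 1, "eval_loss": 0.601, "accuracy": 75.04, "f1": 74.66},
    {"epoch": 2, "eval_loss": 0.551, "accuracy": 76.74, "f1": 76.68},
    {"epoch": 3, "eval_loss": 0.546, "accuracy": 77.10, "f1": 77.05},
    {"epoch": 4, "eval_loss": 0.551, "accuracy": 77.32, "f1": 77.14},
    {"epoch": 5, "eval_loss": 0.550, "accuracy": 77.51, "f1": 77.48},
]

# The two common selection criteria can disagree.
best_by_loss = min(history, key=lambda row: row["eval_loss"])
best_by_f1 = max(history, key=lambda row: row["f1"])

print(f"lowest eval loss: epoch {best_by_loss['epoch']} ({best_by_loss['eval_loss']})")
print(f"highest F1:       epoch {best_by_f1['epoch']} ({best_by_f1['f1']}%)")
```

The criteria point to different epochs (3 by loss, 5 by F1), but the gaps are small: 0.004 in eval loss and 0.43 F1 points.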

Full training history with per-step loss tracking: metrics_distilbert.json.

How to use `model.safetensors`

from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
import torch

# Load from local files
model_path = "data/models/distilbert"
model = DistilBertForSequenceClassification.from_pretrained(model_path)
tokenizer = DistilBertTokenizer.from_pretrained(model_path)

model.eval()

To classify a new review:

def classify(text: str) -> dict:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
        probs = torch.softmax(logits, dim=-1).squeeze()
        pred = torch.argmax(probs).item()

    labels = {0: "Negative", 1: "Neutral", 2: "Positive"}
    return {
        "label": labels[pred],
        "confidence": round(probs[pred].item(), 4),
        "probabilities": {
            "Negative": round(probs[0].item(), 4),
            "Neutral": round(probs[1].item(), 4),
            "Positive": round(probs[2].item(), 4),
        }
    }

The model expects raw review text. The tokenizer lowercases the input (uncased) and, with the arguments shown above, pads and truncates it to at most 256 tokens. No HTML cleaning or special preprocessing is needed: DistilBERT was trained on raw text and handles artifacts gracefully.

What the predictions mean

The model assigns each review to one of three sentiment classes. The test set is class-balanced (10,697 / 10,697 / 10,698 reviews), and the stratified split preserves that balance:

Class	Test samples	Times predicted	Precision	Recall	F1
Negative (1–2★)	10,697	11,073	75.42%	78.07%	76.72%
Neutral (3★)	10,697	10,261	68.52%	65.73%	67.10%
Positive (4–5★)	10,698	10,758	87.49%	87.98%	87.73%

Confusion matrix

Where the model gets it right — and where it doesn't:

	Predicted Negative	Predicted Neutral	Predicted Positive
True Negative	8,351 (78.1%)	2,140 (20.0%)	206 (1.9%)
True Neutral	2,526 (23.6%)	7,031 (65.7%)	1,140 (10.7%)
True Positive	196 (1.8%)	1,090 (10.2%)	9,412 (88.0%)
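
The per-class precision, recall, and F1 values reported in this card can be re-derived from the confusion matrix. The sketch below does so with plain Python; the helper name `per_class_metrics` is illustrative and not part of the released code.

```python
CLASSES = ["Negative", "Neutral", "Positive"]

# Rows are true classes, columns are predicted classes (counts from the table above).
CM = [
    [8351, 2140, 206],
    [2526, 7031, 1140],
    [196, 1090, 9412],
]

def per_class_metrics(cm):
    """Return {class: (precision, recall, f1)} computed from a square confusion matrix."""
    metrics = {}
    for i, name in enumerate(CLASSES):
        true_positives = cm[i][i]
        predicted = sum(row[i] for row in cm)  # column total
        actual = sum(cm[i])                    # row total
        precision = true_positives / predicted
        recall = true_positives / actual
        f1 = 2 * precision * recall / (precision + recall)
        metrics[name] = (precision, recall, f1)
    return metrics

metrics = per_class_metrics(CM)
accuracy = sum(CM[i][i] for i in range(len(CM))) / sum(map(sum, CM))

for name, (p, r, f1) in metrics.items():
    print(f"{name}: precision={p:.2%} recall={r:.2%} f1={f1:.2%}")
print(f"accuracy={accuracy:.2%}")
```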

Key observations

Positive → Positive is the safest path (88.0% correct). When the model says "Positive", you can trust it: precision is 87.5%. Only 1.8% of truly positive reviews are mistaken for Negative.
Neutral is the leaky class (65.7% correct). True Neutral reviews spill 23.6% into Negative but only 10.7% into Positive. The model has a negative bias on uncertainty — when it can't tell if a review is Neutral, it defaults to Negative rather than Positive. This makes sense: lukewarm reviews ("it's okay but the handle broke") share more vocabulary with complaints than with praise.
Cross-polarity errors are rare: only 1.9% of Negative reviews are called Positive, and only 1.8% of Positive reviews are called Negative. The model rarely makes the catastrophic mistake of flipping sentiment polarity.
Negative detection is solid (78.1% correct). The 20.0% that spill into Neutral are typically mildly negative reviews ("not great", "could be better", "disappointed but usable") that the model hedges on.
Overall accuracy: 77.26% — competitive for a 3-class sentiment task on e-commerce reviews. The class-balanced test set means accuracy is not inflated by majority-class bias.
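
The error flows discussed above can be quantified directly from the confusion matrix. This sketch computes the row-normalised rates (where each true class ends up) and the net over- or under-prediction per class; the variable names are illustrative.

```python
CLASSES = ["Negative", "Neutral", "Positive"]

# Rows are true classes, columns are predicted classes (counts from the confusion matrix).
CM = [
    [8351, 2140, 206],
    [2526, 7031, 1140],
    [196, 1090, 9412],
]

# Share of each true class that lands in each predicted class.
rates = [[count / sum(row) for count in row] for row in CM]

# Predicted count minus true count: positive means the class is over-predicted.
net = {
    CLASSES[j]: sum(row[j] for row in CM) - sum(CM[j])
    for j in range(len(CLASSES))
}

neutral_to_negative = rates[1][0]
neutral_to_positive = rates[1][2]
print(f"Neutral -> Negative: {neutral_to_negative:.1%}")
print(f"Neutral -> Positive: {neutral_to_positive:.1%}")
print(f"net over-prediction: {net}")
```

True Neutral reviews leak into Negative roughly 2.2 times as often as into Positive, and Neutral is the only class that is under-predicted overall.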

Model strengths & limitations

Strength	Limitation
Excellent at polarity detection — rarely confuses Negative with Positive (only ~2% cross-polarity errors)	Struggles with Neutral reviews — 34% of true Neutrals are misclassified, mostly as Negative
Positive reviews: 88% correct, 87.5% precision. Can reliably filter 4–5★ reviews for recommendation systems	Negative bias on uncertainty — the model over-predicts Negative by 376 reviews. A Neutral review is more likely to be called Negative than Positive
Fast inference — 66M params, ~20ms per review on CPU. No GPU required for production use at moderate scale	Sarcasm and irony are not handled — the model reads words at face value ("Great, another broken product" → Positive)
No preprocessing needed — raw review text goes straight into the tokenizer. Handles HTML artifacts, emojis, and typos gracefully	Non-English reviews — the tokenizer is English-only (uncased). Reviews in other languages will produce garbage predictions
Confidence scores: outputs softmax probabilities, not just labels, so you can set custom confidence thresholds (calibration was not evaluated in this card)	Domain-specific: trained on Amazon product reviews. Performance on restaurant reviews, movie reviews, or tweets will likely be lower
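
A confidence threshold can be applied on top of the probabilities returned by `classify`. The sketch below is one possible approach: the helper name `label_with_threshold`, the 0.7 cut-off, and the `"Uncertain"` fallback are assumptions for illustration. Softmax scores are not guaranteed to be calibrated, so a real threshold should be tuned on held-out data.

```python
def label_with_threshold(probabilities: dict, threshold: float = 0.7,
                         fallback: str = "Uncertain") -> str:
    """Return the top label, or `fallback` when its probability is below `threshold`."""
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    return label if confidence >= threshold else fallback

# Probabilities taken from the CLI sample output in this card.
confident = {"Negative": 0.0021, "Neutral": 0.0132, "Positive": 0.9847}
# Made-up probabilities for a borderline review (illustrative only).
borderline = {"Negative": 0.40, "Neutral": 0.45, "Positive": 0.15}

print(label_with_threshold(confident))
print(label_with_threshold(borderline))
```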

Per-review predictions with confidence scores: predictions_distilbert.csv.

Metrics

From metrics_distilbert.json:

Metric	Value
Accuracy	77.26%
Weighted Precision	77.14%
Weighted Recall	77.26%
Weighted F1	77.18%
Negative F1	76.72%
Neutral F1	67.10%
Positive F1	87.73%

F1 interpretation: The weighted F1 of 77.18% is competitive for a 3-class sentiment task on e-commerce reviews. The 20-point gap between Positive (87.7%) and Neutral (67.1%) reflects a well-known challenge in sentiment analysis: extreme sentiments (very positive, very negative) are linguistically easier to separate than moderate/mixed ones. The Neutral class often absorbs sarcasm, factual-but-unemotional reviews, and genuinely mixed opinions — all hard to classify.
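
For reference, the weighted averages in the table are support-weighted means of the per-class scores. The sketch below recomputes them from the rounded per-class percentages published in this card, so the last digit could in principle differ from values computed on raw counts.

```python
# (support, precision %, recall %, F1 %) per class, from the per-class table in this card.
per_class = {
    "Negative": (10697, 75.42, 78.07, 76.72),
    "Neutral": (10697, 68.52, 65.73, 67.10),
    "Positive": (10698, 87.49, 87.98, 87.73),
}

total_support = sum(row[0] for row in per_class.values())

def weighted(column: int) -> float:
    """Support-weighted mean of one metric column (1=precision, 2=recall, 3=F1)."""
    return sum(row[0] * row[column] for row in per_class.values()) / total_support

weighted_precision = weighted(1)
weighted_recall = weighted(2)
weighted_f1 = weighted(3)
print(f"precision={weighted_precision:.2f} recall={weighted_recall:.2f} f1={weighted_f1:.2f}")
```

Because the classes are almost exactly balanced, the weighted averages are practically identical to the unweighted (macro) averages, and weighted recall equals overall accuracy.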

Training hyperparameters: learning_rate=2e-5, batch_size=32, epochs=5, max_length=256, weight_decay=0.01, warmup_steps=2,340, train_samples=149,761.
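
The warmup length can be related to the overall schedule with a little arithmetic. The sketch assumes one optimizer step per batch of 32 (single device, no gradient accumulation) and that the last partial batch is kept; the card does not state these details, so treat the result as an inference.

```python
import math

train_samples = 149_761
batch_size = 32
epochs = 5
warmup_steps = 2_340

# Assumption: one optimizer step per batch, last partial batch included.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"warmup fraction: {warmup_fraction:.1%}")
```

Under these assumptions the warmup covers about 10% of training, i.e. roughly the first half of epoch 1.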

Usage Examples

Full pipeline: raw text → DistilBERT tokenizer → model inference → sentiment label + confidence.

Example 1 — Enthusiastic positive review → Positive (high confidence)

from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
import torch

model_path = "data/models/distilbert"
model = DistilBertForSequenceClassification.from_pretrained(model_path)
tokenizer = DistilBertTokenizer.from_pretrained(model_path)

review = "This blender is absolutely amazing! Smoothies every morning and it's so quiet. Best purchase ever."
inputs = tokenizer(review, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze()
    pred = torch.argmax(probs).item()

labels = {0: "Negative", 1: "Neutral", 2: "Positive"}
print(f"Label: {labels[pred]} (confidence: {probs[pred]:.2%})")

Expected output: Label: Positive (confidence: ~98%).
Keywords like "amazing" and "best", together with the exclamation mark, strongly trigger the Positive class.

Example 2 — Genuinely mixed 3-star review → Neutral

review = "The fabric is nice and the color is beautiful, but the sizing runs small and the zipper feels cheap."
inputs = tokenizer(review, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze()
    pred = torch.argmax(probs).item()

labels = {0: "Negative", 1: "Neutral", 2: "Positive"}
print(f"Label: {labels[pred]} (confidence: {probs[pred]:.2%})")

Expected output: Label: Neutral (confidence: ~65%).
The "but" structure and mixture of praise ("nice", "beautiful") and criticism ("cheap", "runs small") are classic Neutral signals. Confidence is moderate because the model finds linguistic overlap with both Negative and Positive.

Example 3 — Frustrated negative review → Negative (high confidence)

review = "Stopped working after 3 days. Complete waste of money. Don't buy this garbage."
inputs = tokenizer(review, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze()
    pred = torch.argmax(probs).item()

labels = {0: "Negative", 1: "Neutral", 2: "Positive"}
print(f"Label: {labels[pred]} (confidence: {probs[pred]:.2%})")

Expected output: Label: Negative (confidence: ~95%).
"Stopped working", "waste of money", "don't buy", "garbage" — classic complaint vocabulary. DistilBERT catches these patterns reliably.

Batch inference (N reviews at once)

reviews = [
    "Great product, highly recommend!",
    "It's okay. Nothing special but does the job.",
    "Terrible quality. Fell apart in a week."
]
inputs = tokenizer(reviews, return_tensors="pt", truncation=True, max_length=256, padding=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)
    preds = torch.argmax(probs, dim=-1)

labels = ["Negative", "Neutral", "Positive"]
for i, review in enumerate(reviews):
    print(f"Review: {review[:60]}... → {labels[preds[i]]} ({probs[i][preds[i]]:.2%})")

Serving & Integration

All examples assume the model files live at data/models/distilbert/.

1. Python CLI script (zero dependencies beyond transformers and torch)

Save as classify_sentiment.py and run: python classify_sentiment.py "Your review text here"

"""classify_sentiment.py — classify a review from the command line."""
import sys, json
import torch
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

MODEL_DIR = "data/models/distilbert"
LABELS = {0: "Negative", 1: "Neutral", 2: "Positive"}

# Load once at module level
model = DistilBertForSequenceClassification.from_pretrained(MODEL_DIR)
tokenizer = DistilBertTokenizer.from_pretrained(MODEL_DIR)
model.eval()

def classify(text: str) -> dict:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze()
        pred = torch.argmax(probs).item()
    return {
        "text": text,
        "label": LABELS[pred],
        "confidence": round(probs[pred].item(), 4),
        "probabilities": {LABELS[i]: round(probs[i].item(), 4) for i in range(3)},
    }

if __name__ == "__main__":
    text = " ".join(sys.argv[1:]) if len(sys.argv) > 1 else input("Review: ")
    result = classify(text)
    print(json.dumps(result, indent=2))

Sample output:

{
  "text": "This keyboard is fantastic, the mechanical switches feel incredible.",
  "label": "Positive",
  "confidence": 0.9847,
  "probabilities": {
    "Negative": 0.0021,
    "Neutral": 0.0132,
    "Positive": 0.9847
  }
}
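
Downstream code that consumes this JSON may want to sanity-check it first. The following is a minimal sketch of such a check; `check_result` and its tolerance are hypothetical and not part of the script above. The tolerance allows for the four-decimal rounding applied to each probability.

```python
import json
import math

EXPECTED_LABELS = {"Negative", "Neutral", "Positive"}

def check_result(result: dict, tolerance: float = 1e-3) -> None:
    """Raise ValueError if a classifier result is internally inconsistent."""
    probs = result["probabilities"]
    if set(probs) != EXPECTED_LABELS:
        raise ValueError(f"unexpected labels: {sorted(probs)}")
    if not math.isclose(sum(probs.values()), 1.0, abs_tol=tolerance):
        raise ValueError("probabilities do not sum to 1")
    top_label = max(probs, key=probs.get)
    if result["label"] != top_label:
        raise ValueError("label is not the highest-probability class")
    if not math.isclose(result["confidence"], probs[top_label], abs_tol=tolerance):
        raise ValueError("confidence does not match the winning probability")

# The sample output shown above.
sample = json.loads("""
{
  "text": "This keyboard is fantastic, the mechanical switches feel incredible.",
  "label": "Positive",
  "confidence": 0.9847,
  "probabilities": {"Negative": 0.0021, "Neutral": 0.0132, "Positive": 0.9847}
}
""")

check_result(sample)
print("sample output is consistent")
```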

2. FastAPI microservice (REST JSON endpoint)

"""api.py — lightweight REST API. Run: uvicorn api:app --port 8000"""
from contextlib import asynccontextmanager
from fastapi import FastAPI
from pydantic import BaseModel
import torch
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

MODEL_DIR = "data/models/distilbert"
LABELS = {0: "Negative", 1: "Neutral", 2: "Positive"}

model = None
tokenizer = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    global model, tokenizer
    model = DistilBertForSequenceClassification.from_pretrained(MODEL_DIR)
    tokenizer = DistilBertTokenizer.from_pretrained(MODEL_DIR)
    model.eval()
    yield

app = FastAPI(lifespan=lifespan)

class ReviewInput(BaseModel):
    text: str

class SentimentResult(BaseModel):
    label: str
    confidence: float
    probabilities: dict

@app.post("/classify", response_model=SentimentResult)
def classify(review: ReviewInput):
    inputs = tokenizer(review.text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze()
        pred = torch.argmax(probs).item()
    return SentimentResult(
        label=LABELS[pred],
        confidence=round(probs[pred].item(), 4),
        probabilities={LABELS[i]: round(probs[i].item(), 4) for i in range(3)},
    )

Call it:

curl -X POST http://localhost:8000/classify \
  -H "Content-Type: application/json" \
  -d '{"text": "This book was a page-turner from start to finish."}'

{"label":"Positive","confidence":0.9756,"probabilities":{"Negative":0.0042,"Neutral":0.0202,"Positive":0.9756}}
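
The same request can be sent from Python using only the standard library. This is a sketch that assumes the service from api.py is running on localhost:8000; the helper names `build_request` and `classify_remote` are illustrative.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/classify"  # same endpoint as the curl call above

def build_request(text: str, url: str = API_URL) -> urllib.request.Request:
    """Build the JSON POST request expected by the /classify endpoint."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def classify_remote(text: str, url: str = API_URL, timeout: float = 10.0) -> dict:
    """Send one review to the API and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(text, url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# Usage (requires the API to be running):
# result = classify_remote("This book was a page-turner from start to finish.")
# print(result["label"], result["confidence"])
```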

3. HTML form + vanilla JavaScript (browser)

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Sentiment Classifier</title>
  <style>
    body { font-family: system-ui; max-width: 600px; margin: 3rem auto; padding: 0 1rem; }
    textarea { width: 100%; height: 100px; margin-bottom: 0.5rem; }
    pre { background: #f5f5f5; padding: 1rem; border-radius: 6px; white-space: pre-wrap; }
    .Positive { border-left: 4px solid #0891B2; }
    .Negative { border-left: 4px solid #DC2626; }
    .Neutral { border-left: 4px solid #EA580C; }
  </style>
</head>
<body>
  <h2>What sentiment is this review?</h2>
  <textarea id="review" placeholder="Paste a product review..."></textarea>
  <button onclick="classify()">Classify</button>
  <pre id="result"></pre>

  <script>
    async function classify() {
      const text = document.getElementById("review").value;
      const res = await fetch("http://localhost:8000/classify", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ text }),
      });
      const data = await res.json();
      document.getElementById("result").textContent = JSON.stringify(data, null, 2);
      document.getElementById("result").className = data.label;
    }
  </script>
</body>
</html>

Note: this page calls the API at http://localhost:8000 from a different origin (for example a file:// URL or another port), so the browser sends a CORS preflight request first. The api.py shown in the previous section does not enable CORS, so the request is blocked unless the app adds FastAPI's CORSMiddleware with this page's origin allowed.

4. Google Colab interactive widget

# Run in a Colab cell — instant text box + classify button
import ipywidgets as widgets
from IPython.display import display, JSON
import torch
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

MODEL_DIR = "data/models/distilbert"
LABELS = {0: "Negative", 1: "Neutral", 2: "Positive"}

model = DistilBertForSequenceClassification.from_pretrained(MODEL_DIR)
tokenizer = DistilBertTokenizer.from_pretrained(MODEL_DIR)
model.eval()

text_input = widgets.Textarea(placeholder="Paste a review...", layout={"width": "100%", "height": "80px"})
button = widgets.Button(description="Classify", button_style="primary")
output = widgets.Output()

def on_click(_):
    with output:
        output.clear_output()
        inputs = tokenizer(text_input.value, return_tensors="pt", truncation=True, max_length=256)
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze()
            pred = torch.argmax(probs).item()
        display(JSON({
            "label": LABELS[pred],
            "confidence": round(probs[pred].item(), 4),
            "probabilities": {LABELS[i]: round(probs[i].item(), 4) for i in range(3)}
        }))

button.on_click(on_click)
display(text_input, button, output)

5. Streamlit dashboard

"""Save as streamlit_app.py — run: streamlit run streamlit_app.py"""
import streamlit as st
import torch
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

MODEL_DIR = "data/models/distilbert"
LABELS = {0: "Negative", 1: "Neutral", 2: "Positive"}
# Streamlit's :color[text] markdown syntax accepts named colors, not hex codes.
COLORS = {"Negative": "red", "Neutral": "orange", "Positive": "blue"}

@st.cache_resource
def load_model():
    model = DistilBertForSequenceClassification.from_pretrained(MODEL_DIR)
    tokenizer = DistilBertTokenizer.from_pretrained(MODEL_DIR)
    model.eval()
    return model, tokenizer

model, tokenizer = load_model()

st.title("Review Sentiment Classifier")
review = st.text_area("Paste a product review:", height=100)

if st.button("Classify"):
    inputs = tokenizer(review, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze()
        pred = torch.argmax(probs).item()

    st.markdown(f"## Sentiment: :{COLORS[LABELS[pred]]}[{LABELS[pred]}]")
    st.metric("Confidence", f"{probs[pred]:.1%}")

    col1, col2, col3 = st.columns(3)
    with col1:
        st.metric("Negative", f"{probs[0]:.1%}")
    with col2:
        st.metric("Neutral", f"{probs[1]:.1%}")
    with col3:
        st.metric("Positive", f"{probs[2]:.1%}")

    for i, label in enumerate(["Negative", "Neutral", "Positive"]):
        st.progress(float(probs[i]), text=f"{label}: {probs[i]:.1%}")

6. Batch CSV processor (process thousands of reviews at once)

"""batch_classify.py — reads a CSV with a 'review_text' column, writes results."""
import pandas as pd
import torch
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
from torch.utils.data import DataLoader, Dataset

MODEL_DIR = "data/models/distilbert"
LABELS = {0: "Negative", 1: "Neutral", 2: "Positive"}
BATCH_SIZE = 32

model = DistilBertForSequenceClassification.from_pretrained(MODEL_DIR)
tokenizer = DistilBertTokenizer.from_pretrained(MODEL_DIR)
model.eval()

class ReviewDataset(Dataset):
    def __init__(self, texts):
        self.texts = texts
    def __len__(self):
        return len(self.texts)
    def __getitem__(self, idx):
        return tokenizer(self.texts[idx], truncation=True, max_length=256, padding="max_length", return_tensors="pt")

df = pd.read_csv("input_reviews.csv")
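# Empty CSV cells are read as NaN; convert them to empty strings so the tokenizer does not fail.
dataset = ReviewDataset(df["review_text"].fillna("").astype(str).tolist())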
loader = DataLoader(dataset, batch_size=BATCH_SIZE)

all_preds = []
with torch.no_grad():
    for batch in loader:
        batch = {k: v.squeeze(1) for k, v in batch.items()}
        logits = model(**batch).logits
        preds = torch.argmax(logits, dim=-1)
        all_preds.extend(preds.tolist())

df["sentiment"] = [LABELS[p] for p in all_preds]
df.to_csv("classified_reviews.csv", index=False)
print(f"Done — {len(df)} reviews classified.")
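
The script above pads every review to 256 tokens. One possible refinement, not part of the original script, is to group reviews of similar length so that each batch can be tokenized with `padding=True` (as in the batch inference example) and padded only to its longest member. The sketch below shows just the grouping step; `length_sorted_batches` is a hypothetical helper, and whitespace word count is used as a rough proxy for token count.

```python
def length_sorted_batches(texts, batch_size):
    """Yield lists of row indices, grouping texts of similar length together."""
    # Sort indices by approximate length (word count), keeping ties in original order.
    order = sorted(range(len(texts)), key=lambda i: len(texts[i].split()))
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]

texts = [
    "ok",
    "a much longer review with many more words in it",
    "fine product",
    "bad",
]
batches = list(length_sorted_batches(texts, batch_size=2))
print(batches)
```

Because the batches are lists of original row indices, predictions can be written back to the DataFrame in the right order regardless of how the reviews were grouped.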

Files in this folder

File	Description
`model.safetensors`	DistilBERT fine-tuned weights (SafeTensors, 255 MB, 66M params)
`config.json`	Model architecture config: DistilBertForSequenceClassification, label mappings, dropout values
`tokenizer.json`	BERT WordPiece tokenizer vocabulary (30,522 tokens, uncased)
`tokenizer_config.json`	Tokenizer configuration: max_length, special tokens, truncation side
`metrics_distilbert.json`	Full training history: per-step loss, per-epoch eval metrics, hyperparameters
`predictions_distilbert.csv`	Per-review predictions on 32,092 test samples: text, true_label, predicted_label, confidence
`confusion_matrix.png`	Heatmap visualization of the confusion matrix
`README.md`	This file
