en-bn-transformer-v1

A baseline English-to-Bengali Neural Machine Translation model built from scratch using a custom Transformer architecture.

Model Description

This is an encoder-decoder Transformer model trained for English → Bengali translation, implemented entirely from scratch in PyTorch (no HuggingFace Transformers dependency).

Architecture

Parameter	Value
d_model	512
Encoder/Decoder layers	6
Attention heads	8
Feed-forward dimension	2048
Dropout	0.1
Max sequence length	256
Source vocab size	30,000
Target vocab size	30,000
Total parameters	~90.5M
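
As a rough sanity check on the parameter total, the count can be estimated from the table values. The sketch below assumes a vanilla Transformer layout: biased linear layers, layer norms with a scale and shift per feature, untied source and target embeddings, and a separate output projection. The actual `transformer_work` implementation may differ in these details, and `estimate_params` is a hypothetical helper, not part of the repository.

```python
def estimate_params(d_model=512, d_ff=2048, layers=6,
                    src_vocab=30_000, tgt_vocab=30_000):
    """Approximate parameter count for a vanilla encoder-decoder Transformer."""
    # Q, K, V and output projections, each with a bias (assumed)
    attn = 4 * (d_model * d_model + d_model)
    # Two feed-forward linear layers with biases
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Layer norm scale and shift (assumed per-feature)
    norm = 2 * d_model

    encoder_layer = attn + ffn + 2 * norm      # self-attention + FFN
    decoder_layer = 2 * attn + ffn + 3 * norm  # self- and cross-attention + FFN

    embeddings = (src_vocab + tgt_vocab) * d_model  # untied embeddings (assumed)
    projection = d_model * tgt_vocab + tgt_vocab    # final layer to target vocab
    final_norms = 2 * norm                          # one per stack (assumed)

    return (layers * (encoder_layer + decoder_layer)
            + embeddings + projection + final_norms)


total = estimate_params()
print(f"{total / 1e6:.2f}M")
```

Under these assumptions the estimate is about 90.25M, close to the reported ~90.5M. The remaining gap depends on details this card does not state, such as whether positional embeddings are learned.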

Training Details

Dataset: BanglaNMT (filtered to keep only samples with at least 3 words)
Training steps: 500,000
Epochs: 8
Batch size: 8 (gradient accumulation ×16 = effective batch size 128)
Optimizer: Adam (lr=5e-5, eps=1e-7)
Scheduler: WarmupCosine (warmup 2000 steps, cosine decay to 1% of max LR)
Label smoothing: 0.1
Mixed precision: AMP with GradScaler
torch.compile: Enabled
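
The scheduler settings above can be illustrated with a small pure-Python function. This is a minimal sketch, not the repository's scheduler: it assumes linear warmup from zero, that the cosine decay spans the remaining steps up to 500,000, and that "steps" means scheduler steps. The name `warmup_cosine_lr` and its parameters are hypothetical.

```python
import math


def warmup_cosine_lr(step, max_lr=5e-5, warmup_steps=2000,
                     total_steps=500_000, min_ratio=0.01):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 to max_lr (assumed warmup shape)
        return max_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1]
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Decay from max_lr down to min_ratio * max_lr (1% of the peak)
    return max_lr * (min_ratio + (1.0 - min_ratio) * cosine)


for s in (0, 1000, 2000, 250_000, 500_000):
    print(s, warmup_cosine_lr(s))
```

With these values the rate peaks at 5e-5 after 2000 steps and ends near 5e-7.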

Tokenizer

This model uses WordLevel tokenizers (one per language) with Whitespace pre-tokenization.

Note: This is a baseline tokenizer. A future version will use BPE with ByteLevel pre-tokenization for better handling of Bengali morphology and out-of-vocabulary words.
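
To make the out-of-vocabulary behaviour concrete, here is a simplified pure-Python stand-in for word-level lookup. It does not use the `tokenizers` library, it splits on whitespace only (the real Whitespace pre-tokenizer also separates punctuation), and the `[UNK]` token name and toy vocabulary are assumptions for illustration.

```python
def word_level_encode(text, vocab, unk_token="[UNK]"):
    """Map each whitespace-separated word to its id, or to the unknown id."""
    unk_id = vocab[unk_token]
    return [vocab.get(word, unk_id) for word in text.split()]


# Toy vocabulary; the real tokenizers hold 30,000 entries per language
vocab = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3}

ids = word_level_encode("the cats sat", vocab)
print(ids)  # "cats" is not in the vocabulary, so it maps to the unknown id
```

Here "cat" is known but "cats" is not, so the inflected form loses all of its information. A subword scheme such as BPE would instead split it into known pieces.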

Known Limitations

WordLevel tokenizer: Every token is a whole word, so words outside the 30,000-entry vocabulary, including unseen inflected forms of known Bengali words, cannot be represented. This is the biggest limitation.
Moderate translation quality: This is a first training run (val loss ≈ 4.47), so output quality is baseline-level.
No beam search: Inference uses greedy decoding with a repetition penalty.
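
For readers unfamiliar with the decoding strategy named above, the following is a minimal sketch of greedy decoding with a repetition penalty. It is not the code in evaluate.py: the penalty rule shown (divide positive logits by the penalty, multiply negative ones) is one common convention and may differ from the repository's, and `next_logits` is a hypothetical callable standing in for a forward pass of the model.

```python
def apply_repetition_penalty(logits, generated, penalty):
    """Lower the scores of tokens that were already generated."""
    out = list(logits)
    for tok in set(generated):
        # Dividing a positive logit or multiplying a negative one both reduce it
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def greedy_decode(next_logits, sos_id, eos_id, max_len, penalty=1.2):
    """Pick the highest-scoring token at each step until EOS or max_len."""
    ids = [sos_id]
    while len(ids) < max_len:
        logits = apply_repetition_penalty(next_logits(ids), ids, penalty)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids


def toy_logits(ids):
    """Stand-in for the model: always prefers token 2. Ids: 0=SOS, 1=EOS."""
    return [-5.0, 1.2, 2.0, 1.5]


print(greedy_decode(toy_logits, sos_id=0, eos_id=1, max_len=10, penalty=2.0))
print(greedy_decode(toy_logits, sos_id=0, eos_id=1, max_len=5, penalty=1.0))
```

With the penalty disabled (1.0) the toy model repeats token 2 until the length limit. With a penalty of 2.0 it moves on to other tokens and reaches EOS.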

How to Use

import torch
from tokenizers import Tokenizer
from model import transformer_work
from config import get_config

config = get_config()

# Load tokenizers
tokenizer_src = Tokenizer.from_file("tokenizeren.json")
tokenizer_tgt = Tokenizer.from_file("tokenizerbn.json")

# Build model
model = transformer_work(
    src_vocab=tokenizer_src.get_vocab_size(),
    tgt_vocab=tokenizer_tgt.get_vocab_size(),
    src_seq_len=config["seq_len"],
    tgt_seq_len=config["seq_len"],
    d_model=config["d_model"],
)

# Load the full training checkpoint (it also holds optimizer, scheduler, and scaler state);
# only the model weights are used here
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load("model_full_checkpoint.pt", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])
model.to(device)
model.eval()

For inference, see evaluate.py in the source repository.
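
One possible pitfall when loading: because training used torch.compile, a state dict saved from the compiled module may have every key prefixed with `_orig_mod.`, which makes `load_state_dict` fail on an uncompiled model. Whether the published checkpoints carry that prefix is not stated here, so treat the helper below as a precaution. `strip_compile_prefix` is a hypothetical name, and plain values stand in for tensors in the demo.

```python
def strip_compile_prefix(state_dict, prefix="_orig_mod."):
    """Return a copy of state_dict with the torch.compile key prefix removed."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Demo with placeholder values instead of tensors
saved = {"_orig_mod.encoder.weight": 1, "decoder.bias": 2}
cleaned = strip_compile_prefix(saved)
print(sorted(cleaned))

# Possible usage with the loading code above:
# model.load_state_dict(strip_compile_prefix(checkpoint["model_state_dict"]))
```

Keys without the prefix pass through unchanged, so the helper is safe to apply either way.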

Files

File	Description
`config.json`	Model architecture and training configuration
`model_full_checkpoint.pt`	Full training checkpoint (model + optimizer + scheduler + scaler)
`model_weights.pt`	Model weights only (for inference)
`tokenizeren.json`	English WordLevel tokenizer
`tokenizerbn.json`	Bengali WordLevel tokenizer

Intended Uses

Baseline English → Bengali translation model
Research and experimentation with custom Transformer architectures
Starting point for improved versions with better tokenization

License

MIT
