en-bn-transformer-v1
A baseline English-to-Bengali Neural Machine Translation model built from scratch using a custom Transformer architecture.
Model Description
This is an encoder-decoder Transformer model trained for English → Bengali translation, implemented entirely from scratch in PyTorch (no HuggingFace Transformers dependency).
Architecture
| Parameter | Value |
|---|---|
| d_model | 512 |
| Encoder/Decoder layers | 6 |
| Attention heads | 8 |
| Feed-forward dimension | 2048 |
| Dropout | 0.1 |
| Max sequence length | 256 |
| Source vocab size | 30,000 |
| Target vocab size | 30,000 |
| Total parameters | ~90.5M |
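As a sanity check on the table above, the parameter total can be estimated from the listed hyperparameters. This is a minimal sketch, not the repository's code: it assumes untied source and target embeddings, a separate output projection, biases on every linear layer, two-parameter-per-feature layer norms, and fixed (non-learned) positional encodings. `linear` and `count_params` are hypothetical helper names.

```python
def linear(n_in, n_out, bias=True):
    """Parameter count of one dense layer."""
    return n_in * n_out + (n_out if bias else 0)


def count_params(d_model=512, n_layers=6, d_ff=2048,
                 src_vocab=30_000, tgt_vocab=30_000):
    # Assumption: Q, K, V and output projections, each d_model x d_model.
    attn = 4 * linear(d_model, d_model)
    # Position-wise feed-forward block: d_model -> d_ff -> d_model.
    ffn = linear(d_model, d_ff) + linear(d_ff, d_model)
    # Assumption: each layer norm has a gain and a bias per feature.
    norm = 2 * d_model

    enc_layer = attn + ffn + 2 * norm        # self-attention + FFN
    dec_layer = 2 * attn + ffn + 3 * norm    # self-attn + cross-attn + FFN

    embeddings = (src_vocab + tgt_vocab) * d_model   # untied embeddings
    projection = linear(d_model, tgt_vocab)          # output layer to vocab
    final_norms = 2 * norm                           # one after each stack

    return (embeddings + n_layers * (enc_layer + dec_layer)
            + projection + final_norms)


total = count_params()
print(f"{total / 1e6:.2f}M parameters")
```

Under these assumptions the estimate comes to about 90.25M, close to the reported ~90.5M. The remaining gap depends on implementation details not stated here, such as exact vocabulary sizes, bias usage, or learned positional embeddings.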
Training Details
- Dataset: BanglaNMT (filtered for samples with ≥3 words)
- Training steps: 500,000
- Epochs: 8
- Batch size: 8 (gradient accumulation ×16 = effective batch size 128)
- Optimizer: Adam (lr=5e-5, eps=1e-7)
- Scheduler: WarmupCosine (warmup 2000 steps, cosine decay to 1% of max LR)
- Label smoothing: 0.1
- Mixed precision: AMP with GradScaler
- torch.compile: Enabled
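The WarmupCosine schedule listed above can be sketched as a plain function of the step index. This is an illustration under assumptions, not the training script: the warmup is assumed to be linear from zero, and `total_steps` is assumed to equal the 500,000 training steps (whether that figure counts optimizer updates or micro-batches is not stated). `warmup_cosine_lr` and `min_ratio` are hypothetical names.

```python
import math


def warmup_cosine_lr(step, max_lr=5e-5, warmup_steps=2000,
                     total_steps=500_000, min_ratio=0.01):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Assumption: linear ramp from 0 up to max_lr.
        return max_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Decay from max_lr down to min_ratio * max_lr (1% of the peak).
    return max_lr * (min_ratio + (1.0 - min_ratio) * cosine)


for s in (0, 1000, 2000, 250_000, 500_000):
    print(s, warmup_cosine_lr(s))
```

In PyTorch, a function of this shape, divided by `max_lr` so that it returns a multiplier, could be passed to `torch.optim.lr_scheduler.LambdaLR`.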
Tokenizer
This model uses WordLevel tokenizers (one per language) with Whitespace pre-tokenization.
Note: This is a baseline tokenizer. A future version will use BPE with ByteLevel pre-tokenization for better handling of Bengali morphology and out-of-vocabulary words.
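To illustrate why a word-level vocabulary struggles with unseen word forms, here is a simplified pure-Python stand-in. It is not the `tokenizers` library and not the model's actual tokenizer: it splits on whitespace only, and the `[UNK]` token name, `build_vocab`, and `encode` are assumptions made for the example.

```python
from collections import Counter

UNK = "[UNK]"  # assumed name for the unknown-word token


def build_vocab(sentences, vocab_size=30_000):
    """Map the most frequent whitespace-separated words to integer ids."""
    counts = Counter(word for s in sentences for word in s.split())
    vocab = {UNK: 0}
    for word, _ in counts.most_common(vocab_size - 1):
        vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Look up each word; anything unseen collapses to the UNK id."""
    return [vocab.get(word, vocab[UNK]) for word in sentence.split()]


vocab = build_vocab(["the cat walks", "the dog walks"])
ids = encode("the cat walked", vocab)
print(ids)  # the last id is the UNK id, because "walked" was never seen
```

Every inflected form needs its own vocabulary entry, so a morphologically rich language such as Bengali produces many unknown tokens. Subword schemes such as BPE avoid this by splitting unseen words into known pieces.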
Known Limitations
- WordLevel tokenizer: Cannot handle out-of-vocabulary words or Bengali morphological variations well. This is the biggest limitation.
- Moderate translation quality: This is a first training run (val loss ≈ 4.47). Quality is baseline-level.
- No beam search: Inference uses greedy decoding with repetition penalty.
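Greedy decoding with a repetition penalty can be sketched as follows. This is a toy illustration, not the logic in `evaluate.py`: the penalty form (divide positive logits by the penalty, multiply negative ones) is one common convention and is assumed here, and `step_fn` is a hypothetical stand-in for a decoder forward pass that returns next-token logits as a list.

```python
def apply_repetition_penalty(logits, generated, penalty=1.2):
    """Lower the score of every token that has already been generated."""
    out = list(logits)
    for tok in set(generated):
        # Assumed convention: shrink positive logits, push negative ones lower.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def greedy_decode(step_fn, sos_id, eos_id, max_len=256, penalty=1.2):
    """Pick the highest-scoring token at each step until EOS or max_len."""
    tokens = [sos_id]
    while len(tokens) < max_len:
        logits = apply_repetition_penalty(step_fn(tokens), tokens[1:], penalty)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens


def toy_step(tokens):
    # Toy model: always prefers token 2, with token 3 a close second.
    return [0.0, 0.1, 2.0, 1.9]


print(greedy_decode(toy_step, sos_id=0, eos_id=1, max_len=4, penalty=1.0))
print(greedy_decode(toy_step, sos_id=0, eos_id=1, max_len=4, penalty=1.2))
```

With the penalty disabled (`penalty=1.0`) the toy model repeats token 2 at every step. With `penalty=1.2` the second step switches to token 3. Unlike beam search, greedy decoding keeps only one hypothesis at a time.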
How to Use
import torch
from tokenizers import Tokenizer
from model import transformer_work
from config import get_config
config = get_config()
# Load tokenizers
tokenizer_src = Tokenizer.from_file("tokenizeren.json")
tokenizer_tgt = Tokenizer.from_file("tokenizerbn.json")
# Build model
model = transformer_work(
    src_vocab=tokenizer_src.get_vocab_size(),
    tgt_vocab=tokenizer_tgt.get_vocab_size(),
    src_seq_len=config["seq_len"],
    tgt_seq_len=config["seq_len"],
    d_model=config["d_model"],
)
# Load full checkpoint (includes optimizer state)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load("model_full_checkpoint.pt", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])
model.to(device)
model.eval()
For inference, see evaluate.py in the source repository.
Files
| File | Description |
|---|---|
| config.json | Model architecture and training configuration |
| model_full_checkpoint.pt | Full training checkpoint (model + optimizer + scheduler + scaler) |
| model_weights.pt | Model weights only (for inference) |
| tokenizeren.json | English WordLevel tokenizer |
| tokenizerbn.json | Bengali WordLevel tokenizer |
Intended Uses
- Baseline English → Bengali translation model
- Research and experimentation with custom Transformer architectures
- Starting point for improved versions with better tokenization
License
MIT