en-bn-transformer-v1

A baseline English-to-Bengali Neural Machine Translation model built from scratch using a custom Transformer architecture.

Model Description

This is an encoder-decoder Transformer model trained for English → Bengali translation, implemented entirely from scratch in PyTorch (no HuggingFace Transformers dependency).

Architecture

Parameter                 Value
d_model                   512
Encoder/Decoder layers    6
Attention heads           8
Feed-forward dimension    2048
Dropout                   0.1
Max sequence length       256
Source vocab size         30,000
Target vocab size         30,000
Total parameters          ~90.5M
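
As a rough sanity check on the parameter total, the count can be estimated from the table values alone. The sketch below assumes a standard layout that the card does not confirm: biases on every linear layer, separate (untied) source and target embeddings, an untied output projection, per-feature LayerNorm weights and biases, and one final norm each for the encoder and decoder. The function names are illustrative and not part of the repository.

```python
def linear(n_in: int, n_out: int, bias: bool = True) -> int:
    """Parameter count of a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


def estimate_params(d_model=512, n_layers=6, d_ff=2048,
                    src_vocab=30_000, tgt_vocab=30_000) -> int:
    attn = 4 * linear(d_model, d_model)           # Q, K, V and output projections
    ffn = linear(d_model, d_ff) + linear(d_ff, d_model)
    norm = 2 * d_model                            # LayerNorm weight + bias (assumed)
    enc_layer = attn + ffn + 2 * norm             # self-attention + FFN
    dec_layer = 2 * attn + ffn + 3 * norm         # self-attn + cross-attn + FFN
    embeddings = (src_vocab + tgt_vocab) * d_model
    projection = linear(d_model, tgt_vocab)       # untied output layer (assumed)
    final_norms = 2 * norm                        # encoder + decoder final norm (assumed)
    return n_layers * (enc_layer + dec_layer) + embeddings + projection + final_norms


total = estimate_params()
print(f"{total / 1e6:.2f}M parameters")
```

Under these assumptions the estimate lands close to 90M, in line with the reported ~90.5M. The remaining gap would come from implementation details such as the exact vocabulary sizes or extra parameters that the table does not list.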

Training Details

  • Dataset: BanglaNMT (filtered for samples with ≥3 words)
  • Training steps: 500,000
  • Epochs: 8
  • Batch size: 8 (gradient accumulation ×16 = effective batch size 128)
  • Optimizer: Adam (lr=5e-5, eps=1e-7)
  • Scheduler: WarmupCosine (warmup 2000 steps, cosine decay to 1% of max LR)
  • Label smoothing: 0.1
  • Mixed precision: AMP with GradScaler
  • torch.compile: Enabled
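
The warmup-plus-cosine schedule listed above can be sketched as a plain function of the step index. This is an illustration under assumptions, not the repository's scheduler: the warmup is taken to be linear from zero, and `total_steps` is set to the 500,000 figure even though the card does not say whether that number counts optimizer updates or micro-batches.

```python
import math


def warmup_cosine_lr(step: int, max_lr: float = 5e-5, warmup_steps: int = 2000,
                     total_steps: int = 500_000, min_ratio: float = 0.01) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay to a floor."""
    if step < warmup_steps:
        # Linear ramp from 0 to max_lr (assumed shape of the warmup).
        return max_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Decay from max_lr down to min_ratio * max_lr (1% of the peak here).
    return max_lr * (min_ratio + (1.0 - min_ratio) * cosine)


for s in (0, 1000, 2000, 250_000, 500_000):
    print(s, warmup_cosine_lr(s))
```

With these defaults the rate peaks at 5e-5 at step 2000 and bottoms out at 5e-7 at the final step.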

Tokenizer

This model uses WordLevel tokenizers (one per language) with Whitespace pre-tokenization.

Note: This is a baseline tokenizer. A future version will use BPE with ByteLevel pre-tokenization for better handling of Bengali morphology and out-of-vocabulary words.
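
To make the out-of-vocabulary issue concrete, here is a minimal pure-Python sketch of word-level tokenization. It is not the `tokenizers` library code used by this model: it splits on whitespace only, the special-token names are assumptions, and the example is in English for readability.

```python
from collections import Counter

UNK = "[UNK]"  # assumed name for the unknown-word token


def build_vocab(corpus, max_size=30_000, specials=(UNK, "[PAD]", "[SOS]", "[EOS]")):
    """Map the most frequent whole words to ids; special tokens come first."""
    counts = Counter(word for sentence in corpus for word in sentence.split())
    words = [w for w, _ in counts.most_common(max_size - len(specials))]
    return {token: idx for idx, token in enumerate(list(specials) + words)}


def encode(sentence, vocab):
    """Look up each word; anything unseen at training time becomes [UNK]."""
    return [vocab.get(word, vocab[UNK]) for word in sentence.split()]


corpus = ["the cat sat", "the dog sat"]
vocab = build_vocab(corpus)

# "cats" never appeared in the corpus, so it maps to the [UNK] id
# even though "cat" is in the vocabulary.
ids = encode("the cats sat", vocab)
print(ids)
```

A subword scheme such as BPE would instead split an unseen inflected form into known pieces, which is why it tends to suit morphologically rich languages like Bengali better.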

Known Limitations

  1. WordLevel tokenizer: Cannot handle out-of-vocabulary words or Bengali morphological variations well. This is the biggest limitation.
  2. Baseline translation quality: This is a first training run (val loss ≈ 4.47), so output quality is baseline-level.
  3. No beam search: Inference uses greedy decoding with repetition penalty.
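
Greedy decoding with a repetition penalty can be sketched as follows. The exact penalty used in evaluate.py is not described here, so this example assumes one common formulation: logits of already-generated tokens are divided by the penalty when positive and multiplied by it when negative. The `next_logits` callable and the toy scores are hypothetical stand-ins for the model.

```python
from typing import Callable, List, Sequence


def apply_repetition_penalty(logits: Sequence[float], generated: Sequence[int],
                             penalty: float) -> List[float]:
    """Make tokens that were already generated less attractive."""
    out = list(logits)
    for tok in set(generated):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def greedy_decode(next_logits: Callable[[List[int]], Sequence[float]],
                  sos_id: int, eos_id: int, max_len: int,
                  penalty: float = 1.2) -> List[int]:
    """Pick the highest-scoring token at every step until EOS or max_len."""
    tokens = [sos_id]
    while len(tokens) < max_len:
        logits = apply_repetition_penalty(next_logits(tokens), tokens[1:], penalty)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens


def toy_logits(prefix: List[int]) -> List[float]:
    # Hypothetical fixed scores: ids are 0=SOS, 1=EOS, 2 and 3 = words.
    return [0.0, 1.5, 2.0, 1.9]


print(greedy_decode(toy_logits, sos_id=0, eos_id=1, max_len=5, penalty=1.0))
print(greedy_decode(toy_logits, sos_id=0, eos_id=1, max_len=5, penalty=1.5))
```

Without a penalty (1.0) the toy model repeats its top token until `max_len`. With a penalty of 1.5 each chosen token is demoted on the next step, so decoding moves on and eventually reaches EOS. Beam search would instead keep several candidate sequences alive at each step.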

How to Use

import torch
from tokenizers import Tokenizer
from model import transformer_work
from config import get_config

config = get_config()

# Load tokenizers
tokenizer_src = Tokenizer.from_file("tokenizeren.json")
tokenizer_tgt = Tokenizer.from_file("tokenizerbn.json")

# Build model
model = transformer_work(
    src_vocab=tokenizer_src.get_vocab_size(),
    tgt_vocab=tokenizer_tgt.get_vocab_size(),
    src_seq_len=config["seq_len"],
    tgt_seq_len=config["seq_len"],
    d_model=config["d_model"],
)

# Load the full checkpoint (it also holds optimizer, scheduler, and scaler state);
# only the model weights are used here
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load("model_full_checkpoint.pt", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])
model.to(device)
model.eval()

For inference, see evaluate.py in the source repository.
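
One possible snag when loading: a state dict saved from a `torch.compile`-wrapped module has its keys prefixed with `_orig_mod.`, which an uncompiled model will not accept. Whether the checkpoints in this repository carry that prefix is not stated, so the helper below is a defensive sketch rather than a required step. It works on any key-to-value mapping, and `strip_compile_prefix` is an illustrative name.

```python
from typing import Dict, TypeVar

T = TypeVar("T")
COMPILE_PREFIX = "_orig_mod."  # prefix added by torch.compile's wrapper module


def strip_compile_prefix(state_dict: Dict[str, T]) -> Dict[str, T]:
    """Return a copy of the state dict with any torch.compile prefix removed."""
    return {
        (key[len(COMPILE_PREFIX):] if key.startswith(COMPILE_PREFIX) else key): value
        for key, value in state_dict.items()
    }


# Hypothetical usage, assuming model_weights.pt stores a bare state dict:
#   state = torch.load("model_weights.pt", map_location=device)
#   model.load_state_dict(strip_compile_prefix(state))
example = {"_orig_mod.encoder.weight": 1, "decoder.weight": 2}
print(strip_compile_prefix(example))
```

Keys without the prefix pass through unchanged, so the helper is safe to apply either way.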

Files

File                        Description
config.json                 Model architecture and training configuration
model_full_checkpoint.pt    Full training checkpoint (model + optimizer + scheduler + scaler)
model_weights.pt            Model weights only (for inference)
tokenizeren.json            English WordLevel tokenizer
tokenizerbn.json            Bengali WordLevel tokenizer

Intended Uses

  • Baseline English → Bengali translation model
  • Research and experimentation with custom Transformer architectures
  • Starting point for improved versions with better tokenization

License

MIT
