ModernBERTić-large
The first modern-architecture encoder for Bosnian, Croatian, Montenegrin, and Serbian (BCMS). 395M parameters, native 8192-token context, FlashAttention 2.
State of the art on SuperGLUE-SR. Live leaderboard: balkanbench.com/leaderboard.
Looking for the smaller variant? See `permitt/galton-modernbertic-base` (149M params).
TL;DR
| | |
|---|---|
| Architecture | ModernBERT-large (28 layers, 1024 hidden, 16 heads) |
| Parameters | 395M |
| Context length | 8192 tokens (RoPE base 160K) |
| Attention | Sliding window 256 + global every 2nd layer, FlashAttention 2 |
| Tokenizer | BPE, 50,304 vocab, Latin-only, cased |
| Pretraining tokens | 66B BCMS tokens, 22 sources |
| Compute | 64× A100-64GB on Leonardo HPC, ~10h wall clock |
Why this model exists
The de facto encoder for BCMS has been classla/bcms-bertic since 2021: 110M parameters, 512-token context, ELECTRA. Excellent within its envelope, but insufficient for production tasks that require long-document understanding (CV parsing, legal documents, retrieval over knowledge bases).
ModernBERTić ports the ModernBERT recipe to BCMS:
- 8K native context instead of 512, via RoPE
- FlashAttention 2 + unpadding for ~3.5× faster inference on identical hardware
- Alternating attention (sliding window 256 + full attention every 2nd layer) for near-linear cost on long inputs; see the sketch after this list
- A Latin-only, BCMS-native tokenizer that produces ~31% fewer tokens per character than mmBERT's multilingual SentencePiece vocabulary (see the Tokenizer section below)
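For intuition, here is a minimal sketch of the alternating pattern as boolean attention masks. It is illustrative only: interpreting "sliding window 256" as ±128 tokens and taking even-indexed layers as the global ones are both assumptions, and the released model implements this inside its FlashAttention kernels rather than as explicit masks.

```python
import torch

def sliding_window_mask(seq_len: int, window: int = 256) -> torch.Tensor:
    # True where attention is allowed: each token sees window // 2 tokens per side
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window // 2

def layer_mask(layer_idx: int, seq_len: int) -> torch.Tensor:
    # Assumed parity: full (global) attention on every 2nd layer, local elsewhere
    if layer_idx % 2 == 0:
        return torch.ones(seq_len, seq_len, dtype=torch.bool)
    return sliding_window_mask(seq_len)
```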
Results: SuperGLUE Serbian edition
Evaluation from BalkanBench v1.0. 5 random seeds per cell; means reported below, standard deviations in the leaderboard UI.

| Task | Score |
|---|---|
| **Average (6 tasks)** | **73.44** |
| BoolQ | 80.70 |
| CB | 78.52 |
| COPA | 76.84 |
| RTE | 73.13 |
| MultiRC | 67.90 |
| WSC | 63.56 |
Live, sortable leaderboard with all 9 evaluated models, per-task standard deviations, and reproducibility info: balkanbench.com/leaderboard.
Quickstart
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model_id = "permitt/galton-modernbertic-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",  # requires flash-attn; pass "sdpa" if it is unavailable
    torch_dtype=torch.bfloat16,
).to("cuda")

text = "Glavni grad Crne Gore je [MASK]."
inputs = tokenizer(text, return_tensors="pt").to("cuda")
with torch.no_grad():
    logits = model(**inputs).logits

mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted = tokenizer.decode(logits[0, mask_idx].argmax(dim=-1))
print(predicted)  # should decode to "Podgorica" (or its first subword)
```
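Equivalently, the same check through the `fill-mask` pipeline; this is a convenience sketch (device and dtype arguments omitted for brevity):

```python
from transformers import pipeline

fill = pipeline("fill-mask", model=model_id)
print(fill("Glavni grad Crne Gore je [MASK].")[0]["token_str"])
```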
Long context
```python
# 8192 tokens supported natively, no positional interpolation needed
tokenizer.model_max_length = 8192
inputs = tokenizer(very_long_document, return_tensors="pt", truncation=True).to("cuda")
```
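To check whether a document actually fits the window before truncating, count its tokens first (a small sketch; `very_long_document` as above):

```python
n_tokens = len(tokenizer(very_long_document, truncation=False).input_ids)
print(n_tokens, "tokens -", "fits" if n_tokens <= 8192 else "will be truncated at 8192")
```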
Fine-tuning
```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "permitt/galton-modernbertic-large",
    num_labels=3,
    attn_implementation="flash_attention_2",
)
# standard HF Trainer flow from here
```
Recommended hyperparameters (from our SuperGLUE-SR sweeps):
| Task type | Learning rate | Batch size | Epochs |
|---|---|---|---|
| Sequence classification | 2e-5 to 5e-5 | 16-32 | 3-5 |
| Token classification (NER, POS) | 3e-5 | 32 | 5-10 |
| Long-context tasks (>512 tok) | 1e-5 to 3e-5 | 8-16 | 3-5 |
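A minimal sketch of that Trainer flow using values from the table above; `train_ds` and `eval_ds` are hypothetical pre-tokenized datasets, and the output directory is illustrative:

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="modernbertic-seqcls",  # hypothetical output path
    learning_rate=2e-5,                # sequence classification: 2e-5 to 5e-5
    per_device_train_batch_size=16,
    num_train_epochs=3,
    bf16=True,                         # matches the bfloat16 pretraining precision
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # hypothetical pre-tokenized datasets
    eval_dataset=eval_ds,
)
trainer.train()
```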
Tokenizer
| Tokenizer | Vocab | Tokens / character | OOV rate |
|---|---|---|---|
| ModernBERTić | 50,304 | 0.229 | 0.000% |
| BERTić | 32,000 | 0.242 | 0.006% |
| XLM-R-BERTić | 250,002 | 0.274 | 0.008% |
| mmBERT | 256,000 | 0.334 | 0.000% |
Measured on 55.8M characters of held-out BCMS text.
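To sanity-check the tokens-per-character numbers on your own text, a small measurement sketch (`bcms_sample.txt` is a hypothetical file; `tokenizer` as loaded in the Quickstart):

```python
sample = open("bcms_sample.txt", encoding="utf-8").read()  # hypothetical corpus file
ids = tokenizer(sample, add_special_tokens=False).input_ids
print(f"{len(ids) / len(sample):.3f} tokens per character")
```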
The vocabulary is Latin-only and cased. Transliterate Cyrillic input upstream (a sketch follows), and prefer cased input: lowercasing reduces tokenizer efficiency by ~14%.
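A minimal transliteration sketch with a hand-rolled Serbian Cyrillic-to-Latin table; in practice a maintained library such as `cyrtranslit` is the more robust choice, since mixed-script text and edge cases need care:

```python
# Minimal Serbian Cyrillic -> Latin transliteration (illustrative subset).
# One Cyrillic letter can map to two Latin letters (lj, nj, dž),
# so a per-character dict lookup is used instead of str.translate alone.
CYR2LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}
CYR2LAT.update({k.upper(): v.capitalize() for k, v in CYR2LAT.items()})

def to_latin(text: str) -> str:
    return "".join(CYR2LAT.get(ch, ch) for ch in text)

print(to_latin("Подгорица је главни град Црне Горе."))
# Podgorica je glavni grad Crne Gore.
```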
Pretraining
- Corpus: 66B tokens, 227M documents, assembled from 22 BCMS sources (FineWiki, BERTić-MaCoCu, FineWeb-2, HPLT 3.0, FinePDFs, CLASSLA web, books, news, plus others). Tiered source priority, BCMS-specific quality filters (gambling/content-farm/stop-word heuristics), and MinHash LSH cross-source deduplication at a 0.8 Jaccard threshold (sketch after this list).
- Objective: Masked Language Modeling, 30% masking ratio.
- Optimizer: AdamW, peak LR 5e-4, warmup-stable-decay schedule with ~9% decay phase.
- Batch: 4096 sequences global, kept constant across GPU counts (strong scaling).
- Precision: bfloat16.
- Framework: MosaicML Composer + FlexBERT. MDS streaming dataset format with deterministic resume across the 24-hour Leonardo job limit.
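For intuition, a minimal sketch of the cross-source deduplication step, assuming the `datasketch` library and character 5-gram shingles; the actual pipeline's shingling scheme, permutation count, and keep/drop policy are not specified on this card:

```python
from datasketch import MinHash, MinHashLSH

def doc_minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(max(len(text) - 4, 1))}:
        m.update(shingle.encode("utf-8"))  # character 5-gram shingles
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # 0.8 Jaccard, as on the card
kept = []
for doc_id, text in corpus:  # hypothetical iterable of (id, text) pairs
    m = doc_minhash(text)
    if not lsh.query(m):     # no near-duplicate among documents kept so far
        lsh.insert(doc_id, m)
        kept.append(doc_id)
```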
Intended uses and limitations
Intended uses. Sequence classification, token classification (NER, POS), masked language modeling, long-document understanding, and as a base model for fine-tuned retrievers and rerankers (see EmbedBERTić and RerankerBERTić, releasing soon).
Out of scope. This is an encoder, not a generative model. It does not produce open-ended text. For text generation in BCMS, see the national LLM initiative announced April 2026 or general-purpose multilingual LLMs.
Limitations.
- Latin script only. Cyrillic input should be transliterated before tokenization. Raw Cyrillic falls back to byte-level encoding and burns context for no signal.
- Domain skew. Training data is heavy on web text, news, encyclopedic content, and PDFs (academic + literary). Code, conversational chat, and highly technical scientific text are underrepresented.
- Variants. All four BCMS varieties (Bosnian, Croatian, Montenegrin, Serbian) are represented, but Croatian and Serbian dominate the corpus volume. Montenegrin in particular is upsampled 4× during mixing to compensate.
Production note
ModernBERTić powers production features at Recrewty, an AI-assisted talent management platform for the Balkans, including long-document CV understanding, psychometric inference, and the candidate retrieve-then-rerank pipeline. The model is the same artifact in production and on this card; nothing is held back.
Citation
```bibtex
@misc{perovic2026modernbertic,
  title  = {{ModernBERTić}: A Modern Encoder for {BCMS} Languages},
  author = {Perovic, Mitar},
  year   = {2026},
  url    = {https://huggingface.co/permitt/galton-modernbertic-large},
  note   = {Recrewty, EU-funded grant}
}
```
Acknowledgments
This work was developed at Recrewty as part of an EU-funded grant. Compute on Leonardo HPC was provided through the consortium grant.
Standing on the shoulders of:
- Nikola Ljubešić and the CLASSLA team for BERTić, BENCHić, and the broader BCMS NLP infrastructure that made this work possible.
- The ModernBERT team (Warner et al., 2024) for the architecture and the FlexBERT codebase.
- MosaicML / Databricks for Composer and the MDS streaming format.
- HuggingFace for the model hub and the `datasets` and `tokenizers` libraries.
- JeRTeh, ReLDI, and the broader Serbian NLP community for datasets and evaluation resources.
- EuroHPC and the Leonardo consortium for compute access.
See also
- `permitt/galton-modernbertic-base` - 149M parameter variant
- BalkanBench leaderboard - live evaluation across BCMS encoders
- Build-in-public series on LinkedIn - posts #0-#9 covering training data, tokenizer, distributed training, debugging, and results
- Medium release post - long-form write-up of the model, the data pipeline, and lessons on data quality vs data quantity (link active at release)
- All links in one place - a single entry point for the LinkedIn series and related material