---
library_name: transformers
license: apache-2.0
language:
- tr
- en
tags:
- fill-mask
- masked-lm
- long-context
- modernbert
- turkish
- code
- mathematics
pipeline_tag: fill-mask
inference: false
datasets:
- HuggingFaceFW/fineweb-2
- selimfirat/bilkent-turkish-writings-dataset
---
# TabiBERT
[![arXiv](https://img.shields.io/badge/arXiv-2512.23065-b31b1b.svg)](https://arxiv.org/abs/2512.23065)
[![GitHub](https://img.shields.io/badge/GitHub-TabiBERT-181717?logo=github)](https://github.com/boun-tabi-LMG/Tabibert)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
[![Benchmark](https://img.shields.io/badge/Benchmark-TabiBench-purple)](https://huggingface.co/collections/boun-tabilab/tabibench)
<!-- [![Hugging Face](https://img.shields.io/badge/🤗-Model-yellow.svg)](https://huggingface.co/boun-tabilab/TabiBERT) -->
<img src="https://cdn-uploads.huggingface.co/production/uploads/679ce16128d0a203d8dc15ca/pTl7fMa1WLQDxMfohm9eT.png" alt="TabiBERT" width="500"/>
📄 **Paper**: [TabiBERT: A Large-Scale ModernBERT Foundation Model and A Unified Benchmark for Turkish](https://arxiv.org/abs/2512.23065)
💻 **Code**: https://github.com/boun-tabi-LMG/TabiBERT
## Table of Contents
1. [Model Summary](#model-summary)
2. [Usage](#usage)
3. [Pre-training Data](#pre-training-data)
4. [Training](#training)
5. [Evaluation](#evaluation)
6. [Limitations](#limitations)
7. [License](#license)
8. [Citation](#citation)
---
## Model Summary
**TabiBERT** is a modernized encoder-only Transformer model (BERT-style) based on the [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) architecture.
TabiBERT is pre-trained on **86 billion tokens** of diverse data covering Turkish, English, code, and mathematics, with a native context length of up to 8,192 tokens.
TabiBERT inherits ModernBERT’s architectural improvements, such as:
- **Rotary Positional Embeddings (RoPE)** for long-context support.
- **Local-Global Alternating Attention** for efficiency on long inputs.
- **Unpadding and Flash Attention** for efficient inference.
This makes TabiBERT particularly suitable for:
- **Turkish NLP tasks** (classification, QA, retrieval, NLI, etc.).
- **Multilingual text understanding** (Turkish-English).
- **Code retrieval and representation learning.**
- **Mathematical and symbolic reasoning.**
- **Long-context understanding** such as document classification, retrieval, and semantic search.
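The architectural settings listed above are exposed through the model configuration. A minimal inspection sketch, assuming the standard ModernBERT config attribute names in `transformers` (the exact values depend on the published checkpoint):
```py
from transformers import AutoConfig

# Load only the configuration; no model weights are downloaded.
config = AutoConfig.from_pretrained("boun-tabilab/TabiBERT")

print("Native context length:", config.max_position_embeddings)
print("Global attention every N layers:", config.global_attn_every_n_layers)
print("Local attention window:", config.local_attention)
print("RoPE theta (global / local):", config.global_rope_theta, config.local_rope_theta)
```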
*TabiBERT is built by [TABILAB](https://tabilab.cmpe.bogazici.edu.tr/) with the support of [VNGRS](https://vngrs.com/).*
---
## Usage
You can use TabiBERT directly with the `transformers` library (v4.48.0+):
```bash
pip install -U "transformers>=4.48.0"
```
Since TabiBERT is a Masked Language Model (MLM), you can use the `fill-mask` pipeline or load it via `AutoModelForMaskedLM`.
**⚠️ If your GPU supports it, we recommend using TabiBERT with Flash Attention 2 to reach the highest efficiency. To do so, install Flash Attention as follows, then use the model as normal:**
```bash
pip install flash-attn
```
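With Flash Attention 2 installed, `transformers` can pick it up automatically; it can also be requested explicitly via the standard `attn_implementation` argument. A short sketch (requires a supported GPU; the bfloat16 dtype choice is illustrative):
```py
import torch
from transformers import AutoModelForMaskedLM

# Explicitly request Flash Attention 2 and load in bfloat16 for efficiency.
model = AutoModelForMaskedLM.from_pretrained(
    "boun-tabilab/TabiBERT",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
).to("cuda")
```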
Example usage with `AutoModelForMaskedLM`:
```py
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
text = "[MASK] Sistemi'ndeki en büyük gezegen Jüpiter'dir."
inputs = tokenizer(text, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

# Locate the [MASK] position and take the highest-scoring token at that position.
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_id = outputs.logits[0, masked_index].argmax(dim=-1)
print("Predicted token:", tokenizer.decode(predicted_id))
# Predicted token: Güneş
```
Example with `pipeline`:
```py
from transformers import pipeline
pipe = pipeline("fill-mask", model="boun-tabilab/TabiBERT")
print(pipe("[MASK], Türkiye Cumhuriyeti'nin başkentidir.")[0]['sequence'])
# Ankara, Türkiye Cumhuriyeti'nin başkentidir.
```
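Since retrieval and semantic search are among the target use cases listed in the model summary, the encoder can also be used to produce sentence embeddings. A minimal mean-pooling sketch (the pooling strategy and example sentences are illustrative choices, not a prescribed recipe):
```py
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

sentences = [
    "Ankara Türkiye'nin başkentidir.",
    "İstanbul Türkiye'nin en kalabalık şehridir.",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden_dim)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two sentences.
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"Cosine similarity: {similarity.item():.3f}")
```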
## Pre-training Data
TabiBERT has been **pre-trained on 86 billion tokens** of diverse data, primarily:
- A large-scale **Turkish corpus** covering literature, news, social media, Wikipedia, and academic texts.
- **English text**, **code with English commentary**, and **math problems in English**, together making up about **13% non-Turkish** tokens.
## Training
* **Architecture**: Encoder-only, Pre-Norm Transformer with GeGLU activations.
* **Sequence Length**: Pre-trained up to 1,024 tokens, then extended to 8,192 tokens.
* **Data**: 86 billion tokens from a union corpus (Turkish; plus English, code with English commentary, and math in English; ~13% non-Turkish).
* **Optimizer**: StableAdamW with trapezoidal LR scheduling and 1-sqrt decay.
* **Hardware**: Trained on 8x H100 GPUs.
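For reference, the "1-sqrt" decay lowers the learning rate during the final phase of the trapezoidal schedule roughly as follows (a sketch of the standard 1-sqrt cooldown formulation; the exact training configuration may differ):

$$\eta(t) = \eta_{\text{max}}\left(1 - \sqrt{\frac{t - t_{\text{decay start}}}{T - t_{\text{decay start}}}}\right),$$

where $T$ is the total number of training steps and $t_{\text{decay start}}$ marks the start of the decay phase.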
---
## Evaluation
TabiBERT was comprehensively evaluated on **TabiBench**, a benchmark consisting of **28 datasets** spanning **8 task categories**.
The model achieves state-of-the-art performance among Turkish models, with a total average score of **77.58**,
surpassing the previous best Turkish model by **1.62 points**.
### Key Highlights
- **State-of-the-art performance**: TabiBERT outperforms all monolingual Turkish baselines across the evaluation suite
- **Largest improvement in QA**: Achieves an F1 score of **69.71**, outperforming the next best Turkish model by **9.55 points** (16% relative improvement)
- **Leading performance in 5 out of 8 task categories**: Including code retrieval and information retrieval
- **Strong long-context capabilities**: Native support for up to 8,192 tokens, providing advantages on longer sequences
### Benchmark: TabiBench
TabiBench is a comprehensive benchmark specifically designed for Turkish NLP, consisting of 28 datasets across 8 task types.
The benchmark includes both existing Turkish NLP datasets and newly created/translated datasets for code retrieval and academic domain tasks.
**Benchmark Collection**: [TabiBench on HuggingFace](https://huggingface.co/collections/boun-tabilab/tabibench)
### Overall Evaluation Results
**Comparison of downstream task performance across all evaluated models.**
For each column, the highest score among the models is shown in **bold**.
The evaluation metric used for each task type is also displayed in the column headers.
| Model | # of params<br/>(M) | Text Clf<br/>(F1) | Token Clf<br/>(F1) | STS<br/>(Pearson) | NLI<br/>(F1) | QA<br/>(F1) | Academic<br/>(F1) | Retrieval<br/>(NDCG@10) | Code Retrieval<br/>(NDCG@10) | Total Avg<br/>(tabibench) |
|-------|---------------------|-------------------|-------------------|-------------------|--------------|-------------|-------------------|-------------------------|----------------------------|---------------------------|
| TurkishBERTweet | 163 | 79.71 | 92.02 | 75.86 | 79.10 | 38.13 | 63.12 | 68.40 | 43.49 | 67.48 |
| YTU-BERT | 111 | **84.25** | 93.60 | 84.68 | 84.16 | 31.50 | 71.78 | 74.29 | 53.80 | 72.26 |
| BERTurk | 110 | 83.42 | **93.67** | **85.33** | 84.33 | 60.16 | 71.40 | 74.84 | 54.54 | 75.96 |
| **TabiBERT** | 149 | 83.44 | 93.42 | 84.74 | **84.51** | **69.71** | **72.44** | **75.44** | **56.95** | **77.58** |
### Evaluation Methodology
Systematic hyperparameter tuning was performed for all model-task pairs with the following search space:
| Parameter | Values |
|-----------|--------|
| Learning Rate | 5e-6, 1e-5, 2e-5, 3e-5 |
| Weight Decay | 1e-5, 1e-6 |
| Batch Size | 16, 32 |
| Epochs | Up to 10, with early stopping |
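For reference, a single point from this grid can be run with the standard `transformers` Trainer. This is a hedged sketch rather than the exact training script used for the paper; the tiny inline dataset is purely illustrative and should be replaced with a TabiBench task:
```py
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Tiny illustrative dataset; replace with a TabiBench text-classification task.
data = Dataset.from_dict({
    "text": ["Harika bir film.", "Berbat bir deneyimdi.", "Çok beğendim.", "Hiç sevmedim."],
    "label": [1, 0, 1, 0],
})
data = data.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)
split = data.train_test_split(test_size=0.5, seed=42)

# One combination from the search grid above; the full sweep loops over all of them.
args = TrainingArguments(
    output_dir="tabibert-clf",
    learning_rate=2e-5,
    weight_decay=1e-5,
    per_device_train_batch_size=16,
    num_train_epochs=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,  # early stopping below monitors eval loss; the paper reports F1
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    processing_class=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```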
For each task category, a single score is reported by computing a **weighted average** across all datasets, where each dataset's weight is proportional to its test set size.
This ensures that larger, more representative datasets have corresponding influence on overall results (test set sizes range from 150 to 35,000 examples).
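Concretely, each category score is a test-size-weighted mean of the per-dataset scores; a small sketch with hypothetical numbers:
```py
# Weighted average of per-dataset scores within one task category.
# Scores and test-set sizes below are hypothetical, for illustration only.
scores = [82.0, 78.5, 90.1]        # per-dataset scores (e.g. F1)
test_sizes = [5000, 1200, 150]     # corresponding test-set sizes

weights = [n / sum(test_sizes) for n in test_sizes]
category_score = sum(w * s for w, s in zip(weights, scores))
print(f"Category score: {category_score:.2f}")
```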
### Comparison with Baselines
TabiBERT was compared against three established Turkish BERT models:
- **BERTurk**: A widely used Turkish monolingual encoder, pre-trained on web text and specialized corpora
- **YTU-BERT**: An uncased Turkish BERT, pre-trained on a large Turkish corpus (web, Wikipedia, books)
- **TurkishBERTweet**: An uncased Turkish BERT, pre-trained at large scale on social media text
**Result**: TabiBERT outperforms all monolingual Turkish models with a total average score of **77.58**, surpassing BERTurk (previous best) by **1.62 points**.
### Reproducibility
All evaluation datasets are publicly available on HuggingFace, under the [TabiBench](https://huggingface.co/collections/boun-tabilab/tabibench) collection to facilitate future research and comparisons.
---
## Limitations
* TabiBERT was trained mainly on **Turkish**, with additional **English, code, and math** data. Its performance on English may be limited relative to Turkish, and it may underperform on other languages.
* As with any large-scale model, it may inherit **biases** from training data.
* While capable of handling up to **8k tokens**, inference on very long sequences may be slower.
* The model is still under evaluation; we recommend validating results before deploying it in critical applications.
---
## License
TabiBERT model weights and training codebase are released under the **Apache 2.0** license.
## Citation
If you use TabiBERT in your project, please cite:
```bibtex
@misc{turker2025tabibert,
  title={TabiBERT: A Large-Scale ModernBERT Foundation Model and Unified Benchmarking Framework for Turkish},
  author={Melikşah Türker and Asude Ebrar Kızıloğlu and Onur Güngör and Susan Üsküdarlı},
  year={2025},
  eprint={2512.23065},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.23065},
}
```