|
|
--- |
|
|
library_name: transformers |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- tr |
|
|
- en |
|
|
tags: |
|
|
- fill-mask |
|
|
- masked-lm |
|
|
- long-context |
|
|
- modernbert |
|
|
- turkish |
|
|
- code |
|
|
- mathematics |
|
|
pipeline_tag: fill-mask |
|
|
inference: false |
|
|
datasets: |
|
|
- HuggingFaceFW/fineweb-2 |
|
|
- selimfirat/bilkent-turkish-writings-dataset |
|
|
--- |
|
|
|
|
|
# TabiBERT |
|
|
|
|
|
[arXiv](https://arxiv.org/abs/2512.23065)

[GitHub](https://github.com/boun-tabi-LMG/TabiBERT)

[License: Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

[TabiBench Collection](https://huggingface.co/collections/boun-tabilab/tabibench)
|
|
<!-- [](https://huggingface.co/boun-tabilab/TabiBERT) --> |
|
|
|
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/679ce16128d0a203d8dc15ca/pTl7fMa1WLQDxMfohm9eT.png" alt="TabiBERT" width="500"/> |
|
|
|
|
|
📄 **Paper**: [TabiBERT: A Large-Scale ModernBERT Foundation Model and A Unified Benchmark for Turkish](https://arxiv.org/abs/2512.23065)
|
|
|
|
|
💻 **Code**: https://github.com/boun-tabi-LMG/TabiBERT |
|
|
|
|
|
## Table of Contents |
|
|
1. [Model Summary](#model-summary) |
|
|
2. [Usage](#usage) |
|
|
3. [Evaluation](#evaluation) |
|
|
4. [Limitations](#limitations) |
|
|
5. [Training](#training) |
|
|
6. [License](#license) |
|
|
7. [Citation](#citation) |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Summary |
|
|
|
|
|
**TabiBERT** is a modernized encoder-only Transformer model (BERT-style) based on the [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) architecture. |
|
|
TabiBERT is pre-trained for **1 trillion tokens** on a diverse corpus spanning Turkish, English, code, and math, with a native context length of up to 8,192 tokens.
|
|
|
|
|
TabiBERT inherits ModernBERT’s architectural improvements, such as: |
|
|
|
|
|
- **Rotary Positional Embeddings (RoPE)** for long-context support. |
|
|
- **Local-Global Alternating Attention** for efficiency on long inputs. |
|
|
- **Unpadding and Flash Attention** for efficient inference. |
|
|
|
|
|
This makes TabiBERT particularly suitable for: |
|
|
|
|
|
- **Turkish NLP tasks** (classification, QA, retrieval, NLI, etc.). |
|
|
- **Multilingual text understanding** (Turkish-English). |
|
|
- **Code retrieval and representation learning.** |
|
|
- **Mathematical and symbolic reasoning.** |
|
|
- **Long-context understanding** such as document classification, retrieval, and semantic search (see the config sketch below).
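
The architectural details above (alternating local/global attention and the 8,192-token context) are exposed through the model config. A minimal sketch for inspecting them, assuming the standard ModernBERT config fields in `transformers`; the exact field names and values are not guaranteed by this card:

```py
from transformers import AutoConfig

# Load the released config and inspect the long-context / attention settings.
# Field names follow the ModernBERT config in `transformers`; treat them as
# assumptions rather than guarantees of this checkpoint.
config = AutoConfig.from_pretrained("boun-tabilab/TabiBERT")

print(config.model_type)                                    # e.g. "modernbert"
print(config.max_position_embeddings)                       # native context length (8,192)
print(getattr(config, "global_attn_every_n_layers", None))  # how often a global-attention layer occurs
print(getattr(config, "local_attention", None))             # local attention window size
```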
|
|
|
|
|
*TabiBERT is built by [TABILAB](https://tabilab.cmpe.bogazici.edu.tr/) with the support of [VNGRS](https://vngrs.com/).* |
|
|
|
|
|
--- |
|
|
|
|
|
## Usage |
|
|
|
|
|
You can use TabiBERT directly with the `transformers` library (v4.48.0+): |
|
|
|
|
|
```bash |
|
|
pip install -U "transformers>=4.48.0"
|
|
``` |
|
|
|
|
|
Since TabiBERT is a Masked Language Model (MLM), you can use the `fill-mask` pipeline or load it via `AutoModelForMaskedLM`. |
|
|
|
|
|
**⚠️ If your GPU supports it, we recommend using TabiBERT with Flash Attention 2 for maximum efficiency. To do so, install Flash Attention as follows, then use the model as normal:**
|
|
|
|
|
```bash |
|
|
pip install flash-attn |
|
|
``` |
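
A minimal sketch of requesting Flash Attention 2 explicitly at load time, with a fallback to the default attention implementation. The `attn_implementation` argument is standard in recent `transformers` releases; the `bfloat16` dtype is an assumption (Flash Attention requires half precision):

```py
import torch
from transformers import AutoModelForMaskedLM

model_id = "boun-tabilab/TabiBERT"

if torch.cuda.is_available():
    # Flash Attention 2 needs a supported GPU, the flash-attn package, and fp16/bf16 weights.
    model = AutoModelForMaskedLM.from_pretrained(
        model_id,
        attn_implementation="flash_attention_2",
        torch_dtype=torch.bfloat16,
    ).to("cuda")
else:
    # Fall back to the default attention implementation on CPU.
    model = AutoModelForMaskedLM.from_pretrained(model_id)
```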
|
|
|
|
|
Example usage with `AutoModelForMaskedLM`: |
|
|
```py |
|
|
from transformers import AutoTokenizer, AutoModelForMaskedLM |
|
|
import torch |
|
|
|
|
|
model_id = "boun-tabilab/TabiBERT" |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
|
model = AutoModelForMaskedLM.from_pretrained(model_id) |
|
|
|
|
|
device = "cuda" if torch.cuda.is_available() else "cpu" |
|
|
model = model.to(device) |
|
|
|
|
|
text = "[MASK] Sistemi'ndeki en büyük gezegen Jüpiter'dir." |
|
|
inputs = tokenizer(text, return_tensors="pt").to(device) |
|
|
with torch.no_grad():
    outputs = model(**inputs)
|
|
|
|
|
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id) |
|
|
predicted_id = outputs.logits[0, masked_index].argmax(dim=-1)
|
|
print("Predicted token:", tokenizer.decode(predicted_id)) |
|
|
# Predicted token: Güneş |
|
|
``` |
|
|
|
|
|
Example with `pipeline`: |
|
|
```py |
|
|
from transformers import pipeline |
|
|
|
|
|
pipe = pipeline("fill-mask", model="boun-tabilab/TabiBERT") |
|
|
|
|
|
print(pipe("[MASK], Türkiye Cumhuriyeti'nin başkentidir.")[0]['sequence']) |
|
|
# Ankara, Türkiye Cumhuriyeti'nin başkentidir. |
|
|
``` |
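
For the retrieval and representation-learning use cases mentioned above, one common approach (not prescribed by this card) is to mean-pool the encoder's hidden states into sentence embeddings; a minimal sketch:

```py
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

sentences = ["Ankara Türkiye'nin başkentidir.", "İstanbul Türkiye'nin en kalabalık şehridir."]
batch = tokenizer(sentences, padding=True, truncation=True, max_length=8192, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_size)

# Mean-pool over non-padding tokens to obtain one vector per sentence.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # torch.Size([2, hidden_size])
```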
|
|
|
|
|
## Pre-training Data |
|
|
|
|
|
TabiBERT has been pre-trained on a corpus of **86 billion tokens** of diverse data, primarily:

- A large-scale **Turkish corpus** covering literature, news, social media, Wikipedia, and academic texts.
- **English text**, **code with English commentary**, and **math problems in English**, which together make up about **13% non-Turkish** tokens.
|
|
|
|
|
|
|
|
## Training |
|
|
|
|
|
* **Architecture**: Encoder-only, Pre-Norm Transformer with GeGLU activations. |
|
|
* **Sequence Length**: Pre-trained with sequences up to 1,024 tokens, then extended to 8,192 tokens.
|
|
* **Data**: 86 billion tokens from a union corpus (Turkish; plus English, code with English commentary, and math in English; ~13% non-Turkish). |
|
|
* **Optimizer**: StableAdamW with a trapezoidal learning-rate schedule and 1-sqrt decay (see the schedule sketch after this list).
|
|
* **Hardware**: Trained on 8x H100 GPUs. |
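
A minimal sketch of a trapezoidal (warmup-stable-decay) schedule with 1-sqrt decay, as described above; the phase lengths and peak learning rate below are placeholders, not the values used in training:

```py
import math

def trapezoidal_lr(step: int, total_steps: int, warmup_steps: int,
                   decay_steps: int, peak_lr: float) -> float:
    """Linear warmup, constant plateau, then 1-sqrt decay to zero."""
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # 1-sqrt decay: the rate falls off as 1 - sqrt(progress through the decay phase).
    progress = (step - decay_start) / max(1, decay_steps)
    return peak_lr * (1.0 - math.sqrt(min(1.0, progress)))

# Example with placeholder values (not the actual training hyperparameters).
print(trapezoidal_lr(step=90_000, total_steps=100_000,
                     warmup_steps=2_000, decay_steps=20_000, peak_lr=8e-4))
```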
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
TabiBERT was comprehensively evaluated on **TabiBench**, a benchmark consisting of **28 datasets** spanning **8 task categories**. |
|
|
The model achieves state-of-the-art performance among Turkish models, with a total average score of **77.58**, |
|
|
surpassing the previous best Turkish model by **1.62 points**. |
|
|
|
|
|
### Key Highlights |
|
|
|
|
|
- **State-of-the-art performance**: TabiBERT outperforms all monolingual Turkish baselines across the evaluation suite |
|
|
- **Largest improvement in QA**: Achieves an F1 score of **69.71**, outperforming the next best Turkish model by **9.55 points** (16% relative improvement) |
|
|
- **Leading performance in 5 out of 8 task categories**: Including code retrieval and information retrieval |
|
|
- **Strong long-context capabilities**: Native support for up to 8,192 tokens, providing advantages on longer sequences |
|
|
|
|
|
### Benchmark: TabiBench |
|
|
|
|
|
TabiBench is a comprehensive benchmark specifically designed for Turkish NLP, consisting of 28 datasets across 8 task types. |
|
|
The benchmark includes both existing Turkish NLP datasets and newly created/translated datasets for code retrieval and academic domain tasks. |
|
|
|
|
|
**Benchmark Collection**: [TabiBench on HuggingFace](https://huggingface.co/collections/boun-tabilab/tabibench) |
|
|
|
|
|
### Overall Evaluation Results |
|
|
|
|
|
**Comparison of downstream task performance across all evaluated models.** |
|
|
|
|
|
For each column, the highest score among the models is shown in **bold**. |
|
|
The evaluation metric used for each task type is also displayed in the column headers. |
|
|
|
|
|
| Model | # of params<br/>(M) | Text Clf<br/>(F1) | Token Clf<br/>(F1) | STS<br/>(Pearson) | NLI<br/>(F1) | QA<br/>(F1) | Academic<br/>(F1) | Retrieval<br/>(NDCG@10) | Code Retrieval<br/>(NDCG@10) | Total Avg<br/>(TabiBench) |
|
|
|-------|---------------------|-------------------|-------------------|-------------------|--------------|-------------|-------------------|-------------------------|----------------------------|---------------------------| |
|
|
| TurkishBERTweet | 163 | 79.71 | 92.02 | 75.86 | 79.10 | 38.13 | 63.12 | 68.40 | 43.49 | 67.48 | |
|
|
| YTU-BERT | 111 | **84.25** | 93.60 | 84.68 | 84.16 | 31.50 | 71.78 | 74.29 | 53.80 | 72.26 | |
|
|
| BERTurk | 110 | 83.42 | **93.67** | **85.33** | 84.33 | 60.16 | 71.40 | 74.84 | 54.54 | 75.96 | |
|
|
| **TabiBERT** | 149 | 83.44 | 93.42 | 84.74 | **84.51** | **69.71** | **72.44** | **75.44** | **56.95** | **77.58** | |
|
|
|
|
|
### Evaluation Methodology |
|
|
|
|
|
|
|
|
Systematic hyperparameter tuning was performed for all model-task pairs over the following search space (a sketch enumerating the grid follows the table):
|
|
| Parameter | Values | |
|
|
|-----------|--------| |
|
|
| Learning Rate | 5e-6, 1e-5, 2e-5, 3e-5 | |
|
|
| Weight Decay | 1e-5, 1e-6 | |
|
|
| Batch Size | 16, 32 | |
|
|
| Epochs | Up to 10, with early stopping | |
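
A sketch enumerating this grid, as referenced above; the dictionary keys are illustrative only, and the actual fine-tuning script and early-stopping criterion are not specified in this card:

```py
from itertools import product

learning_rates = [5e-6, 1e-5, 2e-5, 3e-5]
weight_decays = [1e-5, 1e-6]
batch_sizes = [16, 32]

# 4 x 2 x 2 = 16 configurations per model-task pair; each run trains for up to
# 10 epochs with early stopping, per the table above.
configs = [
    {"learning_rate": lr, "weight_decay": wd, "batch_size": bs, "max_epochs": 10}
    for lr, wd, bs in product(learning_rates, weight_decays, batch_sizes)
]
print(len(configs))  # 16
```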
|
|
|
|
|
|
|
|
For each task category, a single score is reported by computing a **weighted average** across all datasets, where each dataset's weight is proportional to its test set size. |
|
|
This ensures that larger, more representative datasets have corresponding influence on overall results (test set sizes range from 150 to 35,000 examples). |
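
A minimal sketch of this weighting scheme with hypothetical dataset sizes and scores (for illustration only; these are not TabiBench numbers):

```py
# Hypothetical per-dataset scores and test-set sizes within one task category.
results = {
    "dataset_a": {"test_size": 1_500, "score": 82.0},
    "dataset_b": {"test_size": 12_000, "score": 74.5},
    "dataset_c": {"test_size": 300, "score": 90.1},
}

# Each dataset's weight is proportional to its test-set size.
total_size = sum(r["test_size"] for r in results.values())
category_score = sum(r["score"] * r["test_size"] for r in results.values()) / total_size
print(round(category_score, 2))  # 75.65
```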
|
|
|
|
|
### Comparison with Baselines |
|
|
|
|
|
TabiBERT was compared against three established Turkish BERT models: |
|
|
- **BERTurk**: Widely used Turkish monolingual encoder, pre-trained on Web and special corpora |
|
|
- **YTU-BERT**: Uncased Turkish BERT, pre-trained on a large Turkish corpus (Web, Wikipedia, books)
|
|
- **TurkishBERTweet**: Uncased Turkish BERT, pre-trained at large scale on social media text
|
|
|
|
|
**Result**: TabiBERT outperforms all monolingual Turkish models with a total average score of **77.58**, surpassing BERTurk (previous best) by **1.62 points**. |
|
|
|
|
|
### Reproducibility |
|
|
|
|
|
All evaluation datasets are publicly available on HuggingFace under the [TabiBench](https://huggingface.co/collections/boun-tabilab/tabibench) collection to facilitate future research and comparisons.
|
|
|
|
|
--- |
|
|
|
|
|
## Limitations |
|
|
|
|
|
* TabiBERT was trained mainly on **Turkish**, with additional **English, code, and math** data. Its performance on English may be limited relative to Turkish, and it may underperform on other languages.
|
|
* As with any large-scale model, it may inherit **biases** from training data. |
|
|
* While the model can handle sequences of up to **8k tokens**, inference on very long inputs may be slower.
|
|
* The model is still under evaluation; we recommend validating results before deployment in critical applications.
|
|
|
|
|
--- |
|
|
|
|
|
## License |
|
|
|
|
|
TabiBERT model weights and training codebase are released under the **Apache 2.0** license. |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use TabiBERT in your project, please cite: |
|
|
|
|
|
```bibtex
|
|
@misc{Türker2025Tabibert, |
|
|
title={TabiBERT: A Large-Scale ModernBERT Foundation Model and Unified Benchmarking Framework for Turkish}, |
|
|
author={Melikşah Türker and Asude Ebrar Kızıloğlu and Onur Güngör and Susan Üsküdarlı}, |
|
|
year={2025}, |
|
|
eprint={2512.23065}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.CL}, |
|
|
url={https://arxiv.org/abs/2512.23065}, |
|
|
} |
|
|
``` |