---
library_name: transformers
license: apache-2.0
language:
- tr
- en
tags:
- fill-mask
- masked-lm
- long-context
- modernbert
- turkish
- code
- mathematics
pipeline_tag: fill-mask
inference: false
datasets:
- HuggingFaceFW/fineweb-2
- selimfirat/bilkent-turkish-writings-dataset
---

# TabiBERT

[![arXiv](https://img.shields.io/badge/arXiv-2512.23065-b31b1b.svg)](https://arxiv.org/abs/2512.23065)
[![GitHub](https://img.shields.io/badge/GitHub-TabiBERT-181717?logo=github)](https://github.com/boun-tabi-LMG/Tabibert)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
[![Benchmark](https://img.shields.io/badge/Benchmark-TabiBench-purple)](https://huggingface.co/collections/boun-tabilab/tabibench)

📄 **Paper**: [TabiBERT: A Large-Scale ModernBERT Foundation Model and A Unified Benchmark for Turkish](https://arxiv.org/abs/2512.23065)

💻 **Code**: https://github.com/boun-tabi-LMG/TabiBERT

## Table of Contents

1. [Model Summary](#model-summary)
2. [Usage](#usage)
3. [Pre-training Data](#pre-training-data)
4. [Training](#training)
5. [Evaluation](#evaluation)
6. [Limitations](#limitations)
7. [License](#license)
8. [Citation](#citation)

---

## Model Summary

**TabiBERT** is a modernized encoder-only Transformer model (BERT-style) based on the [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) architecture.

TabiBERT is pre-trained on **1 trillion tokens** of a diverse dataset covering Turkish, English, code, and math, with a native context length of up to 8,192 tokens.

TabiBERT inherits ModernBERT’s architectural improvements, such as:

- **Rotary Positional Embeddings (RoPE)** for long-context support.
- **Local-Global Alternating Attention** for efficiency on long inputs.
- **Unpadding and Flash Attention** for efficient inference.

This makes TabiBERT particularly suitable for:

- **Turkish NLP tasks** (classification, QA, retrieval, NLI, etc.).
- **Multilingual text understanding** (Turkish-English).
- **Code retrieval and representation learning.**
- **Mathematical and symbolic reasoning.**
- **Long-context understanding** such as document classification, retrieval, and semantic search.

*TabiBERT is built by [TABILAB](https://tabilab.cmpe.bogazici.edu.tr/) with the support of [VNGRS](https://vngrs.com/).*

---

## Usage

You can use TabiBERT directly with the `transformers` library (v4.48.0+). Quote the requirement so the shell does not treat `>=` as an output redirection:

```bash
pip install -U "transformers>=4.48.0"
```

Since TabiBERT is a Masked Language Model (MLM), you can use the `fill-mask` pipeline or load it via `AutoModelForMaskedLM`.

**⚠️ If your GPU supports it, we recommend using ModernBERT with Flash Attention 2 to reach the highest efficiency. To do so, install Flash Attention as follows, then use the model as normal:**

```bash
pip install flash-attn
```

Example usage with `AutoModelForMaskedLM`:

```py
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Run on the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

text = "[MASK] Sistemi'ndeki en büyük gezegen Jüpiter'dir."
inputs = tokenizer(text, return_tensors="pt").to(device)

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

# Locate the [MASK] position and take the highest-scoring token there
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_id = outputs.logits[0, masked_index].argmax(dim=-1)
print("Predicted token:", tokenizer.decode(predicted_id))
# Predicted token: Güneş
```

Example with `pipeline`:

```py
from transformers import pipeline

pipe = pipeline("fill-mask", model="boun-tabilab/TabiBERT")
# Print the top-ranked completion of the masked sentence
print(pipe("[MASK], Türkiye Cumhuriyeti'nin başkentidir.")[0]["sequence"])
# Ankara, Türkiye Cumhuriyeti'nin başkentidir.
```

## Pre-training Data

TabiBERT has been **pre-trained on 86 billion tokens** of diverse data, primarily:

- A large-scale **Turkish corpus** covering literature, news, social media, Wikipedia, and academic texts.
- **English text**, **code with English commentary**, and **math problems in English**, together making up about **13% non-Turkish** tokens.

## Training

* **Architecture**: Encoder-only, Pre-Norm Transformer with GeGLU activations.
* **Sequence Length**: Pre-trained at up to 1,024 tokens, then extended to 8,192 tokens.
* **Data**: 86 billion tokens from a union corpus (Turkish; plus English, code with English commentary, and math in English; ~13% non-Turkish).
* **Optimizer**: StableAdamW with trapezoidal LR scheduling and 1-sqrt decay.
* **Hardware**: Trained on 8x H100 GPUs.

---

## Evaluation

TabiBERT was evaluated on **TabiBench**, a benchmark consisting of **28 datasets** spanning **8 task categories**. The model achieves state-of-the-art performance among Turkish models, with a total average score of **77.58**, surpassing the previous best Turkish model by **1.62 points**.

### Key Highlights

- **State-of-the-art performance**: TabiBERT outperforms all monolingual Turkish baselines on the TabiBench total average
- **Largest improvement in QA**: Achieves an F1 score of **69.71**, outperforming the next best Turkish model by **9.55 points** (16% relative improvement)
- **Leading performance in 5 out of 8 task categories**: Including code retrieval and information retrieval
- **Strong long-context capabilities**: Native support for up to 8,192 tokens, providing advantages on longer sequences

### Benchmark: TabiBench

TabiBench is a benchmark designed specifically for Turkish NLP, consisting of 28 datasets across 8 task types. It includes both existing Turkish NLP datasets and newly created or translated datasets for code retrieval and academic domain tasks.

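
The Retrieval and Code Retrieval categories are scored with NDCG@10. As a rough illustration of that metric (not the benchmark's own evaluation code), the sketch below assumes linear gain with a log2 rank discount, and assumes the input list holds the graded relevance of every candidate document in ranked order. The helper names `dcg_at_k` and `ndcg_at_k` are hypothetical.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k items (linear gain, assumed)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical query: the single relevant document is ranked second.
print(round(ndcg_at_k([0, 1, 0]), 4))
```

A ranking that already places documents in descending relevance order scores 1.0, and a query with no relevant candidates scores 0.0 under this convention.
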
**Benchmark Collection**: [TabiBench on HuggingFace](https://huggingface.co/collections/boun-tabilab/tabibench)

### Overall Evaluation Results

**Comparison of downstream task performance across all evaluated models.** For each column, the highest score among the models is shown in **bold**. The evaluation metric used for each task type is given in the column headers.

| Model | # of params (M) | Text Clf (F1) | Token Clf (F1) | STS (Pearson) | NLI (F1) | QA (F1) | Academic (F1) | Retrieval (NDCG@10) | Code Retrieval (NDCG@10) | Total Avg (TabiBench) |
|-------|-----------------|---------------|----------------|---------------|----------|---------|---------------|---------------------|--------------------------|-----------------------|
| TurkishBERTweet | 163 | 79.71 | 92.02 | 75.86 | 79.10 | 38.13 | 63.12 | 68.40 | 43.49 | 67.48 |
| YTU-BERT | 111 | **84.25** | 93.60 | 84.68 | 84.16 | 31.50 | 71.78 | 74.29 | 53.80 | 72.26 |
| BERTurk | 110 | 83.42 | **93.67** | **85.33** | 84.33 | 60.16 | 71.40 | 74.84 | 54.54 | 75.96 |
| **TabiBERT** | 149 | 83.44 | 93.42 | 84.74 | **84.51** | **69.71** | **72.44** | **75.44** | **56.95** | **77.58** |

### Evaluation Methodology

Systematic hyperparameter tuning was performed for all model-task pairs with the following search space:

| Parameter | Values |
|-----------|--------|
| Learning Rate | 5e-6, 1e-5, 2e-5, 3e-5 |
| Weight Decay | 1e-5, 1e-6 |
| Batch Size | 16, 32 |
| Epochs | Up to 10, with early stopping |

For each task category, a single score is reported by computing a **weighted average** across all datasets, where each dataset's weight is proportional to its test set size. This gives larger, more representative datasets a correspondingly larger influence on the overall results (test set sizes range from 150 to 35,000 examples).

### Comparison with Baselines

TabiBERT was compared against three established Turkish BERT models:

- **BERTurk**: Widely used Turkish monolingual encoder, pre-trained on Web and special corpora
- **YTU-BERT**: Uncased Turkish BERT, pre-trained on a large Turkish corpus (Web, Wikipedia, books)
- **TurkishBERTweet**: Uncased Turkish BERT, pre-trained at large scale for social media

**Result**: TabiBERT outperforms all monolingual Turkish models with a total average score of **77.58**, surpassing BERTurk (previous best) by **1.62 points**.

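
The arithmetic behind these figures can be sketched in a few lines. The `weighted_average` helper below is a hypothetical illustration of the test-set-size weighting described under Evaluation Methodology, run on made-up dataset scores. The second half recomputes the margins over BERTurk from the category scores in the results table; those totals happen to be consistent with an unweighted mean of the eight category scores, which is an observation from the numbers rather than something the methodology states.

```python
def weighted_average(scores, sizes):
    """Average per-dataset scores, weighting each by its test set size."""
    if not scores or len(scores) != len(sizes):
        raise ValueError("scores and sizes must be non-empty and equally long")
    return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)


# Hypothetical category with three datasets (illustrative numbers only).
hypothetical_scores = [80.0, 90.0, 70.0]
hypothetical_sizes = [150, 35_000, 1_000]
category_score = weighted_average(hypothetical_scores, hypothetical_sizes)

# Category scores copied from the results table, in column order:
# Text Clf, Token Clf, STS, NLI, QA, Academic, Retrieval, Code Retrieval.
table = {
    "BERTurk": [83.42, 93.67, 85.33, 84.33, 60.16, 71.40, 74.84, 54.54],
    "TabiBERT": [83.44, 93.42, 84.74, 84.51, 69.71, 72.44, 75.44, 56.95],
}
total_avg = {name: sum(row) / len(row) for name, row in table.items()}
margin = total_avg["TabiBERT"] - total_avg["BERTurk"]

qa_index = 4  # position of the QA column in each row
qa_gain = table["TabiBERT"][qa_index] - table["BERTurk"][qa_index]
qa_relative = qa_gain / table["BERTurk"][qa_index]

print(f"hypothetical category score: {category_score:.2f}")
print(f"total average margin: {margin:.2f} points")
print(f"QA gain: {qa_gain:.2f} points ({qa_relative:.0%} relative)")
```

Because the largest hypothetical dataset dominates the weights, the category score lands close to that dataset's score, which is the behavior the weighting is meant to produce.
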
### Reproducibility

All evaluation datasets are publicly available on HuggingFace under the [TabiBench](https://huggingface.co/collections/boun-tabilab/tabibench) collection to facilitate future research and comparisons.

---

## Limitations

* TabiBERT was trained mainly on **Turkish**, with additional **English, code, and math**. Its performance on English may be **limited** relative to Turkish, and it may underperform on other languages.
* As with any large-scale model, it may inherit **biases** from its training data.
* While the model can handle up to **8k tokens**, inference on very long sequences may be slower.
* The model is still under evaluation; validate results before deploying it in critical applications.

---

## License

TabiBERT model weights and training codebase are released under the **Apache 2.0** license.

## Citation

If you use TabiBERT in your project, please cite:

```
@misc{Türker2025Tabibert,
  title={TabiBERT: A Large-Scale ModernBERT Foundation Model and Unified Benchmarking Framework for Turkish},
  author={Melikşah Türker and Asude Ebrar Kızıloğlu and Onur Güngör and Susan Üsküdarlı},
  year={2025},
  eprint={2512.23065},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.23065},
}
```