---
library_name: transformers
license: apache-2.0
language:
- tr
- en
tags:
- fill-mask
- masked-lm
- long-context
- modernbert
- turkish
- code
- mathematics
pipeline_tag: fill-mask
inference: false
datasets:
- HuggingFaceFW/fineweb-2
- selimfirat/bilkent-turkish-writings-dataset
---
# TabiBERT
[![arXiv](https://img.shields.io/badge/arXiv-2512.23065-b31b1b.svg)](https://arxiv.org/abs/2512.23065)
[![GitHub](https://img.shields.io/badge/GitHub-TabiBERT-181717?logo=github)](https://github.com/boun-tabi-LMG/Tabibert)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
[![Benchmark](https://img.shields.io/badge/Benchmark-TabiBench-purple)](https://huggingface.co/collections/boun-tabilab/tabibench)
<!-- [![Hugging Face](https://img.shields.io/badge/🤗-Model-yellow.svg)](https://huggingface.co/boun-tabilab/TabiBERT) -->
<img src="https://cdn-uploads.huggingface.co/production/uploads/679ce16128d0a203d8dc15ca/pTl7fMa1WLQDxMfohm9eT.png" alt="TabiBERT" width="500"/>
📄 **Paper**: [TabiBERT: A Large-Scale ModernBERT Foundation Model and A Unified Benchmark for Turkish](https://arxiv.org/abs/2512.23065)
💻 **Code**: https://github.com/boun-tabi-LMG/TabiBERT
## Table of Contents
1. [Model Summary](#model-summary)
2. [Usage](#usage)
3. [Pre-training Data](#pre-training-data)
4. [Training](#training)
5. [Evaluation](#evaluation)
6. [Limitations](#limitations)
7. [License](#license)
8. [Citation](#citation)
---
## Model Summary
**TabiBERT** is a modernized encoder-only Transformer model (BERT-style) based on the [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) architecture.
TabiBERT is pre-trained on **86 billion tokens** of diverse data covering Turkish, English, code, and mathematics, with a native context length of up to 8,192 tokens.
TabiBERT inherits ModernBERT’s architectural improvements, such as:
- **Rotary Positional Embeddings (RoPE)** for long-context support.
- **Local-Global Alternating Attention** for efficiency on long inputs.
- **Unpadding and Flash Attention** for efficient inference.
This makes TabiBERT particularly suitable for:
- **Turkish NLP tasks** (classification, QA, retrieval, NLI, etc.).
- **Multilingual text understanding** (Turkish-English).
- **Code retrieval and representation learning.**
- **Mathematical and symbolic reasoning.**
- **Long-context understanding** such as document classification, retrieval, and semantic search.
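The architectural settings listed above are exposed through the model configuration. A minimal inspection sketch, assuming the standard ModernBERT config attribute names in `transformers` (the exact values depend on the published checkpoint):
```py
from transformers import AutoConfig

# Load only the configuration; no model weights are downloaded.
config = AutoConfig.from_pretrained("boun-tabilab/TabiBERT")

print("Native context length:", config.max_position_embeddings)
print("Global attention every N layers:", config.global_attn_every_n_layers)
print("Local attention window:", config.local_attention)
print("RoPE theta (global / local):", config.global_rope_theta, config.local_rope_theta)
```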
*TabiBERT is built by [TABILAB](https://tabilab.cmpe.bogazici.edu.tr/) with the support of [VNGRS](https://vngrs.com/).*
---
## Usage
You can use TabiBERT directly with the `transformers` library (v4.48.0+):
```bash
pip install -U "transformers>=4.48.0"
```
Since TabiBERT is a Masked Language Model (MLM), you can use the `fill-mask` pipeline or load it via `AutoModelForMaskedLM`.
**⚠️ If your GPU supports it, we recommend using TabiBERT with Flash Attention 2 to reach the highest efficiency. To do so, install Flash Attention as follows, then use the model as normal:**
```bash
pip install flash-attn
```
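With Flash Attention 2 installed, `transformers` can pick it up automatically; it can also be requested explicitly via the standard `attn_implementation` argument. A short sketch (requires a supported GPU; the bfloat16 dtype choice is illustrative):
```py
import torch
from transformers import AutoModelForMaskedLM

# Explicitly request Flash Attention 2 and load in bfloat16 for efficiency.
model = AutoModelForMaskedLM.from_pretrained(
    "boun-tabilab/TabiBERT",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
).to("cuda")
```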
Example usage with `AutoModelForMaskedLM`:
```py
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
text = "[MASK] Sistemi'ndeki en büyük gezegen Jüpiter'dir."
inputs = tokenizer(text, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

# Locate the [MASK] position and take the highest-scoring token at that position.
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_id = outputs.logits[0, masked_index].argmax(dim=-1)
print("Predicted token:", tokenizer.decode(predicted_id))
# Predicted token: Güneş
```
Example with `pipeline`:
```py
from transformers import pipeline
pipe = pipeline("fill-mask", model="boun-tabilab/TabiBERT")
print(pipe("[MASK], Türkiye Cumhuriyeti'nin başkentidir.")[0]['sequence'])
# Ankara, Türkiye Cumhuriyeti'nin başkentidir.
```
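Since retrieval and semantic search are among the target use cases listed in the model summary, the encoder can also be used to produce sentence embeddings. A minimal mean-pooling sketch (the pooling strategy and example sentences are illustrative choices, not a prescribed recipe):
```py
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

sentences = [
    "Ankara Türkiye'nin başkentidir.",
    "İstanbul Türkiye'nin en kalabalık şehridir.",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden_dim)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two sentences.
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"Cosine similarity: {similarity.item():.3f}")
```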
## Pre-training Data
TabiBERT has been **pre-trained on 86 billion tokens** of diverse data, primarily:
- A large-scale **Turkish corpus** covering literature, news, social media, Wikipedia, and academic texts.
- **English text**, **code with English commentary**, and **math problems in English**, together making up about **13% non-Turkish** tokens.
## Training
* **Architecture**: Encoder-only, Pre-Norm Transformer with GeGLU activations.
* **Sequence Length**: Pre-trained up to 1,024 tokens, then extended to 8,192 tokens.
* **Data**: 86 billion tokens from a union corpus (Turkish; plus English, code with English commentary, and math in English; ~13% non-Turkish).
* **Optimizer**: StableAdamW with trapezoidal LR scheduling and 1-sqrt decay.
* **Hardware**: Trained on 8x H100 GPUs.
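For reference, the "1-sqrt" decay lowers the learning rate during the final phase of the trapezoidal schedule roughly as follows (a sketch of the standard 1-sqrt cooldown formulation; the exact training configuration may differ):

$$\eta(t) = \eta_{\text{max}}\left(1 - \sqrt{\frac{t - t_{\text{decay start}}}{T - t_{\text{decay start}}}}\right),$$

where $T$ is the total number of training steps and $t_{\text{decay start}}$ marks the start of the decay phase.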
---
## Evaluation
TabiBERT was comprehensively evaluated on **TabiBench**, a benchmark consisting of **28 datasets** spanning **8 task categories**.
The model achieves state-of-the-art performance among Turkish models, with a total average score of **77.58**,
surpassing the previous best Turkish model by **1.62 points**.
### Key Highlights
- **State-of-the-art performance**: TabiBERT outperforms all monolingual Turkish baselines across the evaluation suite
- **Largest improvement in QA**: Achieves an F1 score of **69.71**, outperforming the next best Turkish model by **9.55 points** (16% relative improvement)
- **Leading performance in 5 out of 8 task categories**: Including code retrieval and information retrieval
- **Strong long-context capabilities**: Native support for up to 8,192 tokens, providing advantages on longer sequences
### Benchmark: TabiBench
TabiBench is a comprehensive benchmark specifically designed for Turkish NLP, consisting of 28 datasets across 8 task types.
The benchmark includes both existing Turkish NLP datasets and newly created/translated datasets for code retrieval and academic domain tasks.
**Benchmark Collection**: [TabiBench on HuggingFace](https://huggingface.co/collections/boun-tabilab/tabibench)
### Overall Evaluation Results
**Comparison of downstream task performance across all evaluated models.**
For each column, the highest score among the models is shown in **bold**.
The evaluation metric used for each task type is also displayed in the column headers.
| Model | # of params<br/>(M) | Text Clf<br/>(F1) | Token Clf<br/>(F1) | STS<br/>(Pearson) | NLI<br/>(F1) | QA<br/>(F1) | Academic<br/>(F1) | Retrieval<br/>(NDCG@10) | Code Retrieval<br/>(NDCG@10) | Total Avg<br/>(tabibench) |
|-------|---------------------|-------------------|-------------------|-------------------|--------------|-------------|-------------------|-------------------------|----------------------------|---------------------------|
| TurkishBERTweet | 163 | 79.71 | 92.02 | 75.86 | 79.10 | 38.13 | 63.12 | 68.40 | 43.49 | 67.48 |
| YTU-BERT | 111 | **84.25** | 93.60 | 84.68 | 84.16 | 31.50 | 71.78 | 74.29 | 53.80 | 72.26 |
| BERTurk | 110 | 83.42 | **93.67** | **85.33** | 84.33 | 60.16 | 71.40 | 74.84 | 54.54 | 75.96 |
| **TabiBERT** | 149 | 83.44 | 93.42 | 84.74 | **84.51** | **69.71** | **72.44** | **75.44** | **56.95** | **77.58** |
### Evaluation Methodology
Systematic hyperparameter tuning was performed for all model-task pairs with the following search space:
| Parameter | Values |
|-----------|--------|
| Learning Rate | 5e-6, 1e-5, 2e-5, 3e-5 |
| Weight Decay | 1e-5, 1e-6 |
| Batch Size | 16, 32 |
| Epochs | Up to 10, with early stopping |
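For reference, a single point from this grid can be run with the standard `transformers` Trainer. This is a hedged sketch rather than the exact training script used for the paper; the tiny inline dataset is purely illustrative and should be replaced with a TabiBench task:
```py
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Tiny illustrative dataset; replace with a TabiBench text-classification task.
data = Dataset.from_dict({
    "text": ["Harika bir film.", "Berbat bir deneyimdi.", "Çok beğendim.", "Hiç sevmedim."],
    "label": [1, 0, 1, 0],
})
data = data.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)
split = data.train_test_split(test_size=0.5, seed=42)

# One combination from the search grid above; the full sweep loops over all of them.
args = TrainingArguments(
    output_dir="tabibert-clf",
    learning_rate=2e-5,
    weight_decay=1e-5,
    per_device_train_batch_size=16,
    num_train_epochs=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,  # early stopping below monitors eval loss; the paper reports F1
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    processing_class=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```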
For each task category, a single score is reported by computing a **weighted average** across all datasets, where each dataset's weight is proportional to its test set size.
This ensures that larger, more representative datasets have corresponding influence on overall results (test set sizes range from 150 to 35,000 examples).
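Concretely, each category score is a test-size-weighted mean of the per-dataset scores; a small sketch with hypothetical numbers:
```py
# Weighted average of per-dataset scores within one task category.
# Scores and test-set sizes below are hypothetical, for illustration only.
scores = [82.0, 78.5, 90.1]        # per-dataset scores (e.g. F1)
test_sizes = [5000, 1200, 150]     # corresponding test-set sizes

weights = [n / sum(test_sizes) for n in test_sizes]
category_score = sum(w * s for w, s in zip(weights, scores))
print(f"Category score: {category_score:.2f}")
```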
### Comparison with Baselines
TabiBERT was compared against three established Turkish BERT models:
- **BERTurk**: A widely used Turkish monolingual encoder, pre-trained on web text and specialized corpora
- **YTU-BERT**: An uncased Turkish BERT, pre-trained on a large Turkish corpus (web, Wikipedia, books)
- **TurkishBERTweet**: An uncased Turkish BERT, pre-trained at large scale on social media text
**Result**: TabiBERT outperforms all monolingual Turkish models with a total average score of **77.58**, surpassing BERTurk (previous best) by **1.62 points**.
### Reproducibility
All evaluation datasets are publicly available on HuggingFace, under the [TabiBench](https://huggingface.co/collections/boun-tabilab/tabibench) collection to facilitate future research and comparisons.
---
## Limitations
* TabiBERT was trained mainly on **Turkish**, with additional **English, code, and math** data. Its performance on English may be limited relative to Turkish, and it may underperform on other languages.
* As with any large-scale model, it may inherit **biases** from training data.
* While capable of handling up to **8k tokens**, inference on very long sequences may be slower.
* The model is still under evaluation; we recommend validating results before deploying it in critical applications.
---
## License
TabiBERT model weights and training codebase are released under the **Apache 2.0** license.
## Citation
If you use TabiBERT in your project, please cite:
```bibtex
@misc{turker2025tabibert,
  title={TabiBERT: A Large-Scale ModernBERT Foundation Model and Unified Benchmarking Framework for Turkish},
  author={Melikşah Türker and Asude Ebrar Kızıloğlu and Onur Güngör and Susan Üsküdarlı},
  year={2025},
  eprint={2512.23065},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.23065},
}
```