MrBERT-ca Model Card

MrBERT-ca is a bilingual foundational language model for Catalan and English built on the ModernBERT architecture. It is obtained through vocabulary adaptation from MrBERT-es: all weights are initialized from MrBERT, while the embedding matrix receives a specialized treatment that carefully handles the differences between the two tokenizers.

Following initialization, the model is continually pretrained on a bilingual corpus of 47.4 billion tokens, evenly balanced between Catalan and English.
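The card does not spell out the embedding treatment in detail. As a minimal sketch of one common vocabulary-adaptation scheme (copy the vectors of tokens shared by both tokenizers, initialize the rest around the source embedding distribution), it might look like this; the function name and the mean/std fallback are illustrative assumptions, not the published method:

```python
import numpy as np

def adapt_embeddings(src_emb, src_vocab, tgt_vocab, rng=None):
    """Initialize a target embedding matrix from a source one (sketch).

    Tokens present in both vocabularies copy their source vector;
    tokens new to the target vocabulary are sampled around the mean
    of the source embeddings, a common fallback.
    """
    rng = rng or np.random.default_rng(0)
    dim = src_emb.shape[1]
    mean, std = src_emb.mean(axis=0), src_emb.std()
    # Start every row from the source distribution...
    tgt_emb = rng.normal(mean, std, size=(len(tgt_vocab), dim))
    # ...then overwrite rows whose token also exists in the source vocab.
    for tok, j in tgt_vocab.items():
        if tok in src_vocab:
            tgt_emb[j] = src_emb[src_vocab[tok]]
    return tgt_emb
```

The non-embedding weights (attention, FFN, norms) can be copied verbatim, since only the vocabulary dimension changes between the two models.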

Technical Description

Technical details of the MrBERT-ca model.

| Description | Value |
|---|---|
| Model Parameters | 150M |
| Tokenizer Type | SPM |
| Vocabulary Size | 50,304 |
| Precision | bfloat16 |
| Context Length | 8,192 |

Training Hyperparameters

| Hyperparameter | Value |
|---|---|
| Pretraining Objective | Masked Language Modeling |
| Learning Rate | 1E-03 |
| Learning Rate Scheduler | WSD |
| Warmup | 4,740,000,000 tokens |
| Optimizer | decoupled_stableadamw |
| Optimizer Hyperparameters | AdamW (β1=0.9, β2=0.98, ε=1e-06) |
| Weight Decay | 1E-05 |
| Global Batch Size | 480 |
| Dropout | 1E-01 |
| Activation Function | GeLU |

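WSD commonly denotes a warmup-stable-decay learning-rate schedule: a linear warmup to the peak rate, a long constant plateau, and a final decay. A minimal sketch, using the table's peak rate (1e-3) and warmup budget (4.74B tokens) with the 47.4B-token corpus as the total; the decay fraction and linear decay shape are assumptions, not stated in the card:

```python
def wsd_lr(tokens_seen, peak_lr=1e-3, warmup=4.74e9,
           total=47.4e9, decay_frac=0.1):
    """Warmup-Stable-Decay learning-rate schedule (sketch).

    Linear warmup to peak_lr, constant plateau, then linear decay
    over the final decay_frac of training.
    """
    decay_start = total * (1 - decay_frac)
    if tokens_seen < warmup:          # warmup phase
        return peak_lr * tokens_seen / warmup
    if tokens_seen < decay_start:     # stable phase
        return peak_lr
    # decay phase: linearly anneal to zero at the end of training
    return peak_lr * max(0.0, (total - tokens_seen) / (total - decay_start))
```

A practical appeal of WSD over cosine schedules is that the plateau makes intermediate checkpoints reusable: training can be extended from the stable phase without committing to a total token budget up front.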
How to use

>>> from transformers import pipeline
>>> from pprint import pprint

>>> unmasker = pipeline('fill-mask', model='BSC-LT/MrBERT-ca')

>>> pprint(unmasker("M'encanta la<mask>de Barcelona.",top_k=3))
[{'score': 0.5078125,
  'sequence': "M'encanta la ciutat de Barcelona.",
  'token': 1125,
  'token_str': 'ciutat'},
 {'score': 0.060791015625,
  'sequence': "M'encanta la gastronomia de Barcelona.",
  'token': 10336,
  'token_str': 'gastronomia'},
 {'score': 0.041748046875,
  'sequence': "M'encanta la platja de Barcelona.",
  'token': 5404,
  'token_str': 'platja'}]
>>> pprint(unmasker("La ciència engloba disciplines com la<mask>y les matemàtiques.",top_k=3))
[{'score': 0.703125,
  'sequence': 'La ciència engloba disciplines com la física y les '
              'matemàtiques.',
  'token': 5096,
  'token_str': 'física'},
 {'score': 0.15625,
  'sequence': 'La ciència engloba disciplines com la biologia y les '
              'matemàtiques.',
  'token': 19234,
  'token_str': 'biologia'},
 {'score': 0.0576171875,
  'sequence': 'La ciència engloba disciplines com la química y les '
              'matemàtiques.',
  'token': 11562,
  'token_str': 'química'}]
>>> pprint(unmasker("Since I can't conquer the world yet, I'll just hide my weights under your<mask>and wait patiently.", top_k=3))
[{'score': 0.1796875,
  'sequence': "Since I can't conquer the world yet, I'll just hide my weights "
              'under your arm and wait patiently.',
  'token': 15234,
  'token_str': 'arm'},
 {'score': 0.1396484375,
  'sequence': "Since I can't conquer the world yet, I'll just hide my weights "
              'under your rug and wait patiently.',
  'token': 19473,
  'token_str': 'rug'},
 {'score': 0.10888671875,
  'sequence': "Since I can't conquer the world yet, I'll just hide my weights "
              'under your power and wait patiently.',
  'token': 32670,
  'token_str': 'power'}]

The pipeline call above is equivalent to the following PyTorch script:

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model = AutoModelForMaskedLM.from_pretrained("BSC-LT/MrBERT-ca")
tokenizer = AutoTokenizer.from_pretrained("BSC-LT/MrBERT-ca")

# The "<mask>" token sits at index -3: position -1 holds the EOS token "</s>" and position -2 the "." token.
outputs = model(**tokenizer("La capital d'España és<mask>.", return_tensors="pt")).logits
predicted_token = tokenizer.decode(torch.argmax(outputs[0, -3, :]))

print(f"The prediction is \"{predicted_token}\".")  # The prediction is "Madrid"
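Hardcoding the -3 index is fragile; it breaks if the mask moves or the prompt ends differently. A more robust sketch locates the mask position from the input ids (using the standard `tokenizer.mask_token_id` attribute); the helper names here are illustrative:

```python
import torch

def mask_positions(input_ids, mask_token_id):
    """Return the (batch, seq) indices of every mask token."""
    return (input_ids == mask_token_id).nonzero(as_tuple=False)

def predict_masked(logits, input_ids, mask_token_id):
    """Return the argmax token id at the first mask position."""
    batch, pos = mask_positions(input_ids, mask_token_id)[0]
    return int(logits[batch, pos].argmax())
```

With the model above this would be called as `predict_masked(outputs, inputs.input_ids, tokenizer.mask_token_id)` after keeping the tokenized `inputs` around, then decoded with `tokenizer.decode`.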

Evaluation

In addition to the MrBERT family, the following base foundation models were considered:

| Multilingual Foundational Model | Number of Parameters | Vocab Size | Description |
|---|---|---|---|
| RoBERTa-ca | 125M | 50K | Catalan-specific language model obtained via vocabulary adaptation from mRoBERTa. |
| xlm-roberta-base | 279M | 250K | Foundational RoBERTa model pretrained on CommonCrawl data covering 100 languages. |
| mRoBERTa | 283M | 256K | RoBERTa base model pretrained on 35 European languages with a larger vocabulary size. |
| mmBERT | 308M | 250K | Multilingual ModernBERT pretrained with staged language learning. |
| mGTE | 306M | 250K | Multilingual encoder also adapted for retrieval tasks. |

We compare our models using CLUB (Catalan Language Understanding Benchmark), which consists of 6 tasks: Named Entity Recognition (NER), Part-of-Speech Tagging (POS), Semantic Textual Similarity (STS), Text Classification (TC), Textual Entailment (TE), and Question Answering (QA). This benchmark evaluates the model's capabilities in the Catalan language.

| Task | xlm-roberta-base (279M) | mRoBERTa (283M) | RoBERTa-ca (125M) | mmBERT (308M) | mGTE (306M) | MrBERT (308M) | MrBERT-ca (150M) |
|---|---|---|---|---|---|---|---|
| NER (F1) | 87.61 | 88.33 | 89.70 | 88.14 | 87.20 | 87.32 | 88.04 |
| POS (F1) | 98.91 | 98.98 | 99.00 | 99.01 | 98.67 | 99.01 | 99.03 |
| STS (Pearson) | 74.67 | 79.52 | 82.99 | 83.16 | 78.65 | 83.00 | 85.42 |
| TC (Acc.) | 72.57 | 72.41 | 72.81 | 74.11 | 74.68 | 73.79 | 74.97 |
| TE (Acc.) | 79.59 | 82.38 | 82.14 | 83.18 | 79.40 | 84.03 | 86.92 |
| ViquiQuAD (F1) | 86.93 | 87.86 | 87.31 | 89.86 | 86.78 | 89.25 | 89.59 |
| XQuAD (F1) | 69.69 | 69.40 | 70.53 | 73.88 | 69.27 | 73.96 | 74.47 |
| Average | 81.42 | 82.70 | 83.50 | 84.48 | 82.09 | 84.34 | 85.49 |

Additional information

Author

The Language Technologies Lab from Barcelona Supercomputing Center.

Contact

For further information, please send an email to langtech@bsc.es.

Copyright

Copyright (c) 2026 by Language Technologies Lab, Barcelona Supercomputing Center.

Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project ILENIA with reference 2022/TL22/00215337, as well as by the European Union – NextGenerationEU. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or European Commission. Neither the European Union nor the European Commission can be held responsible for them.

Acknowledgements

This project has benefited from data contributed by numerous teams and institutions.

In Catalonia, many institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà.

At the national level, we are especially grateful to our ILENIA project partners CENID, HiTZ, and CiTIUS for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, Fundación Elcano, the Instituto de Ingeniería del Conocimiento, and the Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI) of the University of Las Palmas de Gran Canaria.

At the international level, we thank the Welsh government, DFKI, Occiglot project, especially Malte Ostendorff, and The Common Crawl Foundation, especially Pedro Ortiz, for their collaboration.

Their valuable efforts have been instrumental in the development of this work.

Disclaimer

Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.

The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.

Citation

@misc{tamayo2026mrbertmodernmultilingualencoders,
      title={MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation}, 
      author={Daniel Tamayo and Iñaki Lacunza and Paula Rivera-Hidalgo and Severino Da Dalt and Javier Aula-Blasco and Aitor Gonzalez-Agirre and Marta Villegas},
      year={2026},
      eprint={2602.21379},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.21379}, 
}

License

Apache License, Version 2.0
