MC^2XLMR-large

Github Repo

We continually pretrain XLM-RoBERTa-large on MC^2, a multilingual corpus of minority languages in China. The model supports Tibetan, Uyghur, Kazakh (in the Kazakh Arabic script), and Mongolian (in the traditional Mongolian script).
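A minimal sketch of querying the model for masked-token predictions with the Hugging Face `transformers` library. It assumes the published checkpoint includes a masked-language-modeling head and keeps the standard XLM-RoBERTa `<mask>` token; neither is confirmed on this page. The helper names `mask_word` and `fill_mask` are hypothetical and used only for illustration.

```python
MODEL_ID = "pkupie/mc2-xlmr-large"  # repository id as shown on this model page
MASK_TOKEN = "<mask>"  # assumed: the standard XLM-RoBERTa mask token


def mask_word(text: str, word: str, mask_token: str = MASK_TOKEN) -> str:
    """Replace the first occurrence of `word` in `text` with the mask token."""
    if word not in text:
        raise ValueError(f"{word!r} not found in text")
    return text.replace(word, mask_token, 1)


def fill_mask(masked_text: str, top_k: int = 5):
    """Return the top-k predictions for the single masked position.

    Downloads the checkpoint on first use. Assumes the checkpoint
    ships with a masked-language-modeling head.
    """
    # Imported lazily so mask_word works without transformers installed.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=MODEL_ID)
    return unmasker(masked_text, top_k=top_k)


# Example (requires network access and the transformers package):
#   masked = mask_word(sentence, target_word)
#   for candidate in fill_mask(masked):
#       print(candidate["token_str"], candidate["score"])
```

Each returned candidate is a dictionary that includes the predicted token string and its score. For downstream tasks, the same repository id can instead be passed to the relevant `AutoModel` class for fine-tuning.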

See details in the paper.

We have also released another model trained on MC^2: MC^2Llama-13B.

Citation

@article{zhang2024mc,
  title={MC$^2$: Towards Transparent and Culturally-Aware NLP for Minority Languages in China},
  author={Zhang, Chen and Tao, Mingxu and Huang, Quzhe and Lin, Jiuheng and Chen, Zhibin and Feng, Yansong},
  journal={arXiv preprint arXiv:2311.08348},
  year={2024}
}


Paper: MC^2: A Multilingual Corpus of Minority Languages in China (arXiv:2311.08348, published Nov 14, 2023)