How to use pkupie/mc2-xlmr-large with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("fill-mask", model="pkupie/mc2-xlmr-large")

# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("pkupie/mc2-xlmr-large")
model = AutoModelForMaskedLM.from_pretrained("pkupie/mc2-xlmr-large")

We continually pretrain XLM-RoBERTa-large on MC^2, a corpus covering Tibetan, Uyghur, Kazakh (in the Kazakh Arabic script), and Mongolian (in the traditional Mongolian script).
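As a minimal sketch of how the fill-mask pipeline above might be queried, the helper below collects the top candidates for a single masked position. The names `top_predictions` and `demo` are hypothetical, and the English sentence is only a placeholder assumption; substitute text in one of the supported languages. The helper assumes the input contains exactly one mask token.

```python
from typing import Callable, List, Tuple


def top_predictions(fill_mask: Callable, text: str, k: int = 5) -> List[Tuple[str, float]]:
    """Return (token, score) pairs for the single mask token in `text`.

    Assumes `text` contains exactly one mask token; for such input the
    fill-mask pipeline returns a flat list of dicts that include the
    keys "token_str" and "score".
    """
    results = fill_mask(text, top_k=k)
    return [(r["token_str"], float(r["score"])) for r in results]


def demo() -> None:
    """Download the model and print predictions for a placeholder sentence."""
    from transformers import pipeline

    pipe = pipeline("fill-mask", model="pkupie/mc2-xlmr-large")
    # Read the mask token from the tokenizer instead of hard-coding it.
    mask = pipe.tokenizer.mask_token
    # Placeholder input (an assumption, not from the model card).
    text = f"Beijing is the capital of {mask}."
    for token, score in top_predictions(pipe, text):
        print(f"{score:.3f}\t{token}")
```

Calling `demo()` downloads the model weights on first use and prints one candidate per line, with its score first.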
See details in the paper.
We have also released another model trained on MC^2: MC^2Llama-13B.
@article{zhang2024mc,
title={MC$^2$: Towards Transparent and Culturally-Aware NLP for Minority Languages in China},
author={Zhang, Chen and Tao, Mingxu and Huang, Quzhe and Lin, Jiuheng and Chen, Zhibin and Feng, Yansong},
journal={arXiv preprint arXiv:2311.08348},
year={2024}
}