---
language:
- bo
library_name: transformers
tags:
- tokenizer
- sentencepiece
- tibetan
- unigram
license: apache-2.0
---
# BoSentencePiece - Tibetan SentencePiece Tokenizer

A SentencePiece tokenizer trained on Tibetan text using the Unigram language model algorithm.
## Model Details

| Parameter | Value |
|-----------|-------|
| **Model Type** | Unigram |
| **Vocabulary Size** | 20,000 |
| **Character Coverage** | 100% |
| **Max Token Length** | 16 |
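
The parameters above correspond to standard SentencePiece trainer options. The following is a minimal sketch of how a comparable model could be trained, not the authors' actual training command, which this card does not give. The function names, `corpus.txt`, and `bo_sp` are hypothetical, and the special-token IDs are assumed to match the Special Tokens table in this card.

```python
def build_trainer_args(input_path: str, model_prefix: str) -> dict:
    """Return keyword arguments for sentencepiece.SentencePieceTrainer.train.

    The values mirror the Model Details table; the mapping to trainer
    options is an assumption, not the documented training recipe.
    """
    return {
        "input": input_path,              # plain-text corpus, one sentence per line
        "model_prefix": model_prefix,     # writes <prefix>.model and <prefix>.vocab
        "model_type": "unigram",          # Model Type
        "vocab_size": 20000,              # Vocabulary Size
        "character_coverage": 1.0,        # Character Coverage (100%)
        "max_sentencepiece_length": 16,   # Max Token Length
        "unk_id": 0,
        "bos_id": 1,
        "eos_id": 2,
        "pad_id": 3,                      # SentencePiece disables <pad> unless an ID is set
    }


def train(input_path: str, model_prefix: str) -> None:
    """Train a model with the arguments above (requires the sentencepiece package)."""
    import sentencepiece as spm  # imported lazily so the helper above has no dependencies

    spm.SentencePieceTrainer.train(**build_trainer_args(input_path, model_prefix))
```

Calling `train("corpus.txt", "bo_sp")` would write `bo_sp.model` and `bo_sp.vocab` to the working directory, provided the corpus is large enough to support a 20,000-piece vocabulary.
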
## Special Tokens

| Token | ID | Description |
|-------|-----|-------------|
| `<unk>` | 0 | Unknown token |
| `<s>` | 1 | Beginning of sequence |
| `</s>` | 2 | End of sequence |
| `<pad>` | 3 | Padding token |
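
To illustrate how these IDs are typically used, the sketch below wraps token ID sequences in `<s>` and `</s>` and pads a batch to equal length with `<pad>`. It is plain Python with no dependency on the tokenizer. The helper names `add_bos_eos` and `pad_batch` are hypothetical, and the input IDs in the usage note are made up for illustration.

```python
from typing import List, Tuple

# Special token IDs as listed in the table above
UNK_ID, BOS_ID, EOS_ID, PAD_ID = 0, 1, 2, 3


def add_bos_eos(ids: List[int]) -> List[int]:
    """Wrap a sequence of token IDs in <s> ... </s>."""
    return [BOS_ID, *ids, EOS_ID]


def pad_batch(batch: List[List[int]]) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the longest one.

    Returns the padded IDs and an attention mask (1 = real token, 0 = padding).
    """
    max_len = max((len(seq) for seq in batch), default=0)
    padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return padded, mask
```

For example, `pad_batch([add_bos_eos([10, 11, 12]), add_bos_eos([20])])` pads the second sequence with two `<pad>` IDs so that both rows have length 5.
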
## Usage

### With Transformers

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("openpecha/BoSentencePiece")

text = "བོད་སྐད་ཀྱི་ཚིག་གྲུབ་འདི་ཡིན།"

# Split the text into subword pieces
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode the text to token IDs
encoded = tokenizer.encode(text)
print(encoded)

# Decode the token IDs back to text
decoded = tokenizer.decode(encoded)
print(decoded)
```
### With SentencePiece Directly

```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

# Download the model file from the Hugging Face Hub
model_path = hf_hub_download("openpecha/BoSentencePiece", "spiece.model")

# Load the SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load(model_path)

text = "བོད་སྐད་ཀྱི་ཚིག་གྲུབ་འདི་ཡིན།"

# Split the text into subword pieces
tokens = sp.encode_as_pieces(text)
print(tokens)
```
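
SentencePiece marks whitespace inside pieces with the meta symbol `▁` (U+2581), so a list of pieces can be turned back into text without the model. The sketch below shows that reconstruction in plain Python. It assumes the default SentencePiece behaviour of prepending a dummy `▁` to the input, and the helper name `pieces_to_text` is hypothetical. The piece boundaries in the usage note are illustrative only, not actual output of this model.

```python
from typing import List

WORD_BOUNDARY = "\u2581"  # the SentencePiece whitespace meta symbol "▁"


def pieces_to_text(pieces: List[str]) -> str:
    """Join SentencePiece pieces and restore spaces from the meta symbol."""
    text = "".join(pieces).replace(WORD_BOUNDARY, " ")
    # Drop the single leading space that comes from the dummy prefix, if present
    return text[1:] if text.startswith(" ") else text
```

For instance, `pieces_to_text(["▁བོད་", "སྐད་", "ཀྱི་"])` returns `"བོད་སྐད་ཀྱི་"`. With a loaded model, `sp.decode_pieces(tokens)` performs the equivalent step.
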
## License

Apache 2.0