
	---
	language:
	- bo
	library_name: transformers
	tags:
	- tokenizer
	- sentencepiece
	- tibetan
	- unigram
	license: apache-2.0
	---

	# BoSentencePiece - Tibetan SentencePiece Tokenizer

	A SentencePiece tokenizer trained on Tibetan text using the Unigram language model algorithm.

	## Model Details

	| Parameter | Value |
	|-----------|-------|
	| Model Type | Unigram |
	| Vocabulary Size | 20,000 |
	| Character Coverage | 100% |
	| Max Token Length | 16 |
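
The values in this table correspond to standard SentencePiece trainer options. The following is a minimal sketch of how a comparable model could be trained, not the authors' actual training command, which this card does not state. The corpus path and model prefix are hypothetical placeholders, and the special-token IDs are taken from the Special Tokens table.

```python
def trainer_options(corpus_path, model_prefix):
    """Map the model card's parameters to SentencePiece trainer keyword arguments.

    corpus_path and model_prefix are hypothetical; the card does not name
    the training corpus or the output prefix that was actually used.
    """
    return {
        "input": corpus_path,             # plain-text corpus, one sentence per line
        "model_prefix": model_prefix,     # writes <prefix>.model and <prefix>.vocab
        "model_type": "unigram",          # Model Type
        "vocab_size": 20000,              # Vocabulary Size
        "character_coverage": 1.0,        # Character Coverage of 100%
        "max_sentencepiece_length": 16,   # Max Token Length
        # Special-token IDs as listed under "Special Tokens"
        "unk_id": 0,
        "bos_id": 1,
        "eos_id": 2,
        "pad_id": 3,
    }


def train(corpus_path, model_prefix):
    """Train a model with the options above and return the model file path."""
    import sentencepiece as spm  # imported lazily so the options can be inspected without it

    spm.SentencePieceTrainer.train(**trainer_options(corpus_path, model_prefix))
    return f"{model_prefix}.model"
```

Calling `train("corpus.txt", "bo_sp")` would write `bo_sp.model` and `bo_sp.vocab` to the working directory, assuming `corpus.txt` exists and is large enough to support a 20,000-piece vocabulary.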

	## Special Tokens

	| Token | ID | Description |
	|-------|-----|-------------|
	| `<unk>` | 0 | Unknown token |
	| `<s>` | 1 | Beginning of sequence |
	| `</s>` | 2 | End of sequence |
	| `<pad>` | 3 | Padding token |
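
To confirm that a downloaded model file agrees with this table, the IDs can be looked up by piece. The helper below is a sketch: `special_token_mismatches` and `EXPECTED_SPECIAL_TOKENS` are illustrative names, and the only call it relies on is `piece_to_id`, which `sentencepiece.SentencePieceProcessor` provides.

```python
# Expected piece-to-ID mapping, copied from the Special Tokens table.
EXPECTED_SPECIAL_TOKENS = {"<unk>": 0, "<s>": 1, "</s>": 2, "<pad>": 3}


def special_token_mismatches(processor, expected=EXPECTED_SPECIAL_TOKENS):
    """Return {piece: (expected_id, actual_id)} for every piece whose ID differs.

    `processor` is any object with a piece_to_id(piece) method, such as a
    loaded sentencepiece.SentencePieceProcessor.
    """
    mismatches = {}
    for piece, expected_id in expected.items():
        actual_id = processor.piece_to_id(piece)
        if actual_id != expected_id:
            mismatches[piece] = (expected_id, actual_id)
    return mismatches
```

With a processor `sp` loaded as in the SentencePiece example under Usage, `special_token_mismatches(sp)` should return an empty dictionary if the model file matches the table.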

	## Usage

	### With Transformers

	```python
	from transformers import AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("openpecha/BoSentencePiece")

	text = "བོད་སྐད་ཀྱི་ཚིག་གྲུབ་འདི་ཡིན།"
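	# Split the text into subword pieces
	tokens = tokenizer.tokenize(text)
	print(tokens)

	# Encode the text to token IDs
	encoded = tokenizer.encode(text)
	print(encoded)

	# Decode the IDs back to text, dropping any special tokens added during encoding
	decoded = tokenizer.decode(encoded, skip_special_tokens=True)
	print(decoded)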
	```

	### With SentencePiece Directly

	```python
	from huggingface_hub import hf_hub_download
	import sentencepiece as spm

	# Download the model file
	model_path = hf_hub_download("openpecha/BoSentencePiece", "spiece.model")

	# Load the model into a SentencePiece processor
	sp = spm.SentencePieceProcessor(model_file=model_path)

	text = "བོད་སྐད་ཀྱི་ཚིག་གྲུབ་འདི་ཡིན།"
	tokens = sp.encode_as_pieces(text)
	print(tokens)
	```
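
The example above prints subword pieces. For model input you usually need integer IDs, and it is worth checking that decoding restores the original string. The sketch below uses the `encode(..., out_type=int)` and `decode` methods of `SentencePieceProcessor`; `round_trip` is an illustrative helper name. SentencePiece normalizes text before tokenizing, so an exact round trip is not guaranteed for every input.

```python
def round_trip(processor, text):
    """Encode text to IDs, decode it back, and report whether it survived intact.

    `processor` is any object exposing encode(text, out_type=int) and
    decode(ids), such as a loaded sentencepiece.SentencePieceProcessor.
    """
    ids = processor.encode(text, out_type=int)  # list of integer token IDs
    restored = processor.decode(ids)            # detokenized string
    return ids, restored, restored == text
```

With the processor loaded above, `ids, restored, ok = round_trip(sp, text)` gives the ID sequence plus a flag that is `False` whenever normalization changed the text.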

	## License

	Apache 2.0