TheOpenMachine committed on
Commit a0c093f · verified · 1 Parent(s): 37646f7

Create README.md

**UnitronX** is a 32k byte-level BPE tokenizer optimized for English, multilingual text (ru/ar/de/es/fr/it/cs/hr/sr), and code.
It enforces safe merge boundaries (script changes, ZWJ, letter↔digit), preserves code identifiers, and replaces
URLs, emails, paths, hashes, UUIDs, handles, and hashtags with placeholder tokens.
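The placeholder idea can be illustrated with a minimal sketch. The token names (`<URL>`, `<EMAIL>`, `<UUID>`, `<HASHTAG>`), the regular expressions, and the `apply_placeholders` helper are all assumptions made for illustration; this README does not specify the tokenizer's actual special tokens or matching rules.

```python
import re

# Hypothetical placeholder names and deliberately simple patterns; the
# tokenizer's real special tokens and rules are not given in this README.
PATTERNS = [
    ("<URL>", re.compile(r"https?://\S+")),
    ("<EMAIL>", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("<UUID>", re.compile(
        r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b")),
    ("<HASHTAG>", re.compile(r"(?<!\w)#\w+")),
]


def apply_placeholders(text):
    """Replace each matched span with its placeholder, in list order."""
    for token, pattern in PATTERNS:
        text = pattern.sub(token, text)
    return text


if __name__ == "__main__":
    print(apply_placeholders("mail bob@example.com or see https://example.com/docs #nlp"))
```

Collapsing such high-entropy spans into a single token before BPE keeps them from being split into many rare subword pieces. URLs are matched first so that an address embedded in a URL is not rewritten separately.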

---
language:
- en
- zh
- ja
- ko
- ru
- ar
- de
- fr
- es
- it
- cs
- hr
- sr
license: mit
library_name: tokenizers
tags:
- tokenizer
- bpe
- byte-level
- multilingual
- code
- unitronx
---

# UnitronX


## Files
- `tokenizer.json`, `merges.txt`, `vocab.json`
- `tokenizer_config.json`, `special_tokens_map.json`
- `meta.json`
- *(optional)* `unitronx.tiktoken.json` (tiktoken-compatible)
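A quick way to sanity-check a local clone against the file list above is sketched below. The helper names `missing_files` and `vocab_size` are hypothetical, and the sketch assumes `vocab.json` is a flat token-to-id JSON object, as is conventional for GPT-2-style BPE tokenizers; the README itself does not describe the file's layout.

```python
import json
from pathlib import Path

# Non-optional files named in the list above.
REQUIRED = [
    "tokenizer.json",
    "merges.txt",
    "vocab.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "meta.json",
]


def missing_files(repo_dir):
    """Return the required file names that are absent from repo_dir."""
    root = Path(repo_dir)
    return [name for name in REQUIRED if not (root / name).is_file()]


def vocab_size(repo_dir):
    """Count entries in vocab.json (assumed to be a token -> id mapping)."""
    with open(Path(repo_dir) / "vocab.json", encoding="utf-8") as f:
        return len(json.load(f))
```

For example, `missing_files(".")` run inside the clone should return an empty list. The count from `vocab_size` may differ slightly from the nominal 32k if special tokens are stored outside `vocab.json`.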

## Load with Transformers

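```python
from transformers import AutoTokenizer

# Replace the placeholder with the actual Hub repo id.
tok = AutoTokenizer.from_pretrained("YOUR_ORG_OR_USER/REPO_NAME")
# Encode a sample mixing a contraction, a hyphenated word, and a code identifier.
print(tok.encode("don't split hyphen-words or fooBar123_id in code!"))
```
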
Files changed (1)
  1. README.md +18 -0
README.md ADDED
@@ -0,0 +1,18 @@
+ ---
+ license: mit
+ language:
+ - ro
+ - hr
+ - en
+ - de
+ - es
+ - sr
+ - zh
+ - ja
+ - ko
+ - fr
+ - is
+ - cs
+ tags:
+ - Tokenizer
+ ---