TheOpenMachine committed
Commit 111acaf · verified · 1 Parent(s): a0c093f

Update README.md

Files changed (1): README.md (+29 -15)
@@ -1,18 +1,32 @@
  ---
  license: mit
- language:
- - ro
- - hr
- - en
- - de
- - es
- - sr
- - zh
- - ja
- - ko
- - fr
- - is
- - cs
  tags:
- - Tokenizer
- ---
  ---
+ # Model Card for Model ID
+
  license: mit
+ library_name: tokenizers
  tags:
+ - tokenizer
+ - bpe
+ - byte-level
+ - multilingual
+ - code
+ - unitronx
+ ---
+
+ # {DISPLAY_NAME}
+
+ **UnitronX** is a 32k byte-level BPE tokenizer optimized for English, multilingual (ru/ar/de/es/fr/it/cs/hr/sr), and code.
+ It enforces safe merge boundaries (script changes, ZWJ, letter↔digit), preserves code identifiers, and uses
+ placeholder tokens for URLs/emails/paths/hashes/UUIDs/handles/hashtags.
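The boundary and placeholder rules described in the README text above can be illustrated with a small standard-library sketch. Everything here is an assumption made for illustration: the README does not list the real placeholder token names or patterns, so `<URL>`, `<EMAIL>`, `<UUID>`, `<HASHTAG>`, `<HANDLE>`, `apply_placeholders`, `char_class`, and `merge_allowed` are hypothetical, and the check works on characters rather than on the bytes a byte-level BPE actually merges. It also assumes that ZWJ is treated as a hard boundary.

```python
import re
import unicodedata

# Hypothetical placeholder names and patterns; the README does not specify them.
PLACEHOLDER_PATTERNS = [
    ("<URL>", re.compile(r"https?://\S+")),
    ("<EMAIL>", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("<UUID>", re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
        r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b")),
    ("<HASHTAG>", re.compile(r"(?<!\w)#\w+")),
    ("<HANDLE>", re.compile(r"(?<!\w)@\w+")),
]

ZWJ = "\u200d"  # zero-width joiner


def apply_placeholders(text: str) -> str:
    """Replace each matched span with its placeholder, in list order."""
    for token, pattern in PLACEHOLDER_PATTERNS:
        text = pattern.sub(token, text)
    return text


def char_class(ch: str) -> str:
    """Coarse class: 'digit', a script-like name for letters, or 'other'."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        # The first word of the Unicode name approximates the script
        # (LATIN, CYRILLIC, ARABIC, ...).
        return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "other"


def merge_allowed(left: str, right: str) -> bool:
    """Reject a merge whose junction crosses one of the listed boundaries.

    Both arguments are assumed to be non-empty strings.
    """
    a, b = left[-1], right[0]
    if ZWJ in (a, b):
        return False  # assumption: ZWJ always acts as a boundary
    ca, cb = char_class(a), char_class(b)
    if "other" in (ca, cb):
        return True  # punctuation and whitespace are out of scope here
    return ca == cb  # blocks letter-digit and cross-script junctions
```

Under these assumptions, `merge_allowed("foo", "Bar")` is `True`, while `merge_allowed("Bar", "123")` is `False` because the junction crosses a letter-to-digit boundary. The placeholder pass runs URLs and emails before handles so that the `@` inside an address is not picked up as a handle.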
+
+ ## Files
+ - `tokenizer.json`, `merges.txt`, `vocab.json`
+ - `tokenizer_config.json`, `special_tokens_map.json`
+ - `meta.json`
+ - *(optional)* `unitronx.tiktoken.json` (tiktoken-compatible)
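For readers who want to inspect the listed files without any tokenizer library, the sketch below reads `vocab.json` and `merges.txt` from a local copy of the repository. It assumes the common GPT-2-style layout (a JSON object mapping token strings to integer ids, and one space-separated merge pair per line with an optional `#version` header); the README does not confirm this layout, and `load_bpe_files` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_bpe_files(directory: str):
    """Read vocab.json and merges.txt from a local directory.

    Assumes the GPT-2-style layout; this is not confirmed by the README.
    """
    root = Path(directory)
    vocab = json.loads((root / "vocab.json").read_text(encoding="utf-8"))
    merges = []
    for line in (root / "merges.txt").read_text(encoding="utf-8").splitlines():
        if not line or line.startswith("#version"):
            continue  # skip blank lines and the optional header line
        left, right = line.split(" ")
        merges.append((left, right))
    return vocab, merges
```

If the layout assumption holds, `len(vocab)` gives the vocabulary size and the order of `merges` gives the merge priority, earliest first.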
+
+ ## Load with Transformers
+
+ ```python
+ from transformers import AutoTokenizer
+ tok = AutoTokenizer.from_pretrained("UnitronX-Tokenizer-32k-v1")
+ print(tok.encode("don't split hyphen-words or fooBar123_id in code!"))