Tokenization is Killing our Multilingual LLM Dream

Community Article Published March 15, 2026

You're training an LLM for your language and you did everything right. You spent weeks, maybe months, carefully curating the data. You cleaned it, deduplicated it, filtered aggressively. You picked a reasonable architecture, tuned the learning rate, watched the loss curve behave exactly as expected. The norm looks good. The eval perplexity is respectable. You run some inference and the model is, frankly, crap. It hallucinates structure. It loses track of morphology mid-sentence. It fails on inputs that are trivially easy for a native speaker. You check the data again. You check the architecture again. The problem is somewhere you haven't looked yet.

It's the front door. It was always the front door.

In 2023, I trained the first version of Sawalni.ma, a language model for Moroccan Arabic and Amazigh. I spent nights curating data while model iterations trained, and none of them matched the quality of that data. English models, meanwhile, worked great. The experience set me on a years-long investigation. Building Wikilangs shortly after (1800+ NLP models across 340+ Wikipedia languages), I watched the same pattern repeat without exception: every language that struggled, struggled first at the token boundary.

Put simply, when you have a good representation, you don't need as much data. Tokenization is the tax that low-resource languages cannot afford to pay, and they're being charged it on every token, in every layer, for every variant their speakers write.

Unpacking Tokenization

LLMs are big balls of numbers that take numbers as input and produce other numbers as output. This is easy for images, since pixels are numbers at the end of the day. But how can an LLM manipulate text?

Tokenization is the step that converts raw text into numbers a language model actually works with, its atomic pieces of meaning, its legos. That's how you get tokens, the units you are billed by when using a commercial LLM provider.

When you type in a prompt, the text is cut into strips (tokens) by a pair of scissors (the tokenizer) before the model ever reads it - and you didn't choose where to cut. If the cuts happen to land on meaningful units, whole words, recognizable morphemes, or syllables, the model can reconstruct meaning quickly. If the cuts are arbitrary, it's doing extra work just to figure out what it's looking at before it can begin to understand it. Everything that follows is about what happens when those cuts are systematically bad.
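To make "where the cuts land" concrete, here is a toy longest-match segmenter. This is not a real BPE implementation, and both vocabularies are made up for illustration; the point is only that the same word gets meaningful or arbitrary pieces depending purely on which strings happen to be in the vocabulary:

```python
def greedy_tokenize(text, vocab):
    """Toy longest-match segmenter: repeatedly take the longest vocabulary
    entry that prefixes the remaining text (single characters always match)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or len(piece) == 1:
                tokens.append(piece)
                i = j
                break
    return tokens

morpheme_vocab  = {"un", "happi", "ness"}   # cuts land on meaningful units
arbitrary_vocab = {"unh", "appin", "ess"}   # cuts land wherever

print(greedy_tokenize("unhappiness", morpheme_vocab))   # ['un', 'happi', 'ness']
print(greedy_tokenize("unhappiness", arbitrary_vocab))  # ['unh', 'appin', 'ess']
```

Both segmentations are equally "valid" to the machinery; only one hands the model pieces it can reason with.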

Good and Bad Tokenizations

The LLM produces the answer to your prompt by generating one token at a time. In this case, think of tokens as the keys on a keyboard the model is allowed to press. It has a fixed set and a limited budget for how many it can press per response. Here's where it gets interesting: after each token, the LLM chooses what to write next. When a word is cut in a non-meaningful way, the LLM is presented with options that make no sense, options it shouldn't have been offered in the first place.

Tokenization can be constructive and produce useful segmentation, as in the example below.

Constructive generalization

But it can also do the opposite and generate non-existent words.

Destructive generalization

But tokens aren't just input and output units. They determine the legos the model uses to build meaning internally too. Every concept it holds, every relationship it reasons about, every pattern it recognizes is assembled from those same bricks. If your legos are weirdly shaped, if they don't map cleanly onto the things you're trying to build, what you construct will fit together awkwardly and break in unexpected places. The structure looks roughly right from a distance. The details are wrong.

Lego hack

What is this monstrosity?

The Obvious Fix That Doesn't Scale

At this point you might think: "just" make a tokenizer for your language and fit it to your language's quirks. And you should; it's much better than using an English vocabulary or something even less meaningful. Wikilangs makes such tokenizers available for 340+ Wikipedia languages precisely because the alternative is worse.

But a custom tokenizer is a local optimization, not a solution. It reduces fertility. It partially improves boundary coherence. It does nothing for the variant recovery problem, since your tokenizer was trained on whatever clean text you had, not on the full distribution of how real speakers actually write and all the typos and variations that come with it.

More fundamentally, it destroys cross-lingual alignment: the moment you train a language-specific vocabulary, you diverge from the shared embedding space that makes multilingual transfer possible. Every token you add is a token the model has never seen in relation to the rest of its knowledge, so you have to train it from scratch until the model can integrate it into its internal processing.

The field has tried vocabulary expansion, token merging, bilingual tokenizer training, and script-specific sub-tokenizers. Each helps one language in isolation. None of them compose well. The dream of a single model that natively handles Arabic, Darija, Amazigh, Yoruba, and Khasi remains structurally blocked at the tokenization layer.

Looking at the numbers: say you add 4,000 tokens per language, and that's on the extremely low end. Over 340 languages, that means more than 1.3 million tokens the LLM has to handle. Such a vocabulary size is impractical. It inflates the model immensely (your 4B model becomes a 20B one with no gain in performance or output quality) and makes each generation step extremely slow, because the output softmax has to normalize over the entire vocabulary.
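A back-of-the-envelope sketch of that blow-up (the hidden size of 4096 is my assumption, typical for models in the 4B-8B range; the per-language token count is the deliberately low estimate above):

```python
# Illustrative arithmetic only: the cost of per-language vocabulary patches.
languages = 340
tokens_per_language = 4_000   # the deliberately low estimate from above
hidden_size = 4_096           # assumed embedding dimension

vocab = languages * tokens_per_language
embed_params = vocab * hidden_size    # input embedding matrix
lm_head_params = vocab * hidden_size  # output projection, if untied

print(f"vocabulary: {vocab:,} tokens")                         # 1,360,000
print(f"embedding tables alone: "
      f"{(embed_params + lm_head_params) / 1e9:.1f}B params")  # ~11.1B
```

And every generation step still has to compute a softmax over all of those logits.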

You simply cannot patch your way to true multilingualism one vocabulary at a time.

What the Metrics Measure

Compression ratio and fertility are the two numbers everyone reaches for when evaluating a tokenizer. They're easy to compute, easy to compare, and easy to misread.

Fertility is defined as the average number of tokens per word: a word split into two tokens has a fertility of two. In general you want this number to go down. For morphologically rich languages (those using prefixes, suffixes, or more complex structures), a somewhat higher fertility can be desirable, as long as the tokens map to inflection points in the language rather than letter chunks with no meaning of their own. Either way, higher fertility means the model does more work per word, runs slower, and has more chances to make mistakes.

Compression ratio on the other hand counts the ratio of bytes of text per token. Since tokens operate effectively as a lookup table, tokens that map to longer text result in better compression.

Like fertility, a better compression ratio means the model spends less effort on the same amount of text, which translates into higher output speed.

Like many summary statistics, these can mislead: two very different tokenizers can produce identical fertility and compression ratios. Fertility only tells you how long the sequence is; it says nothing about whether the cuts are meaningful. Compression ratio, for its part, is pure pattern matching, completely unconstrained by the language's morphology, so tokenizers with the same numbers can diverge in ways that are destructive to the LLM.

Tokenizer                  Tokens
❌ Bad                     ev · lerd · en
✅ Good (language-aware)   ev · ler · den

In Turkish: "evlerden" (= "from the houses") = ev (house) + ler (plural) + den (ablative)

The bad tokenizer destroys both plural and case information resulting in worse language modeling.
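And yet the standard metrics cannot tell these two apart. A minimal check (pure Python; the two segmentations are hard-coded from the table above):

```python
def fertility(segmented_words):
    """Average number of tokens per word."""
    return sum(len(toks) for toks in segmented_words) / len(segmented_words)

def compression_ratio(text, tokens):
    """Bytes of raw text covered per token (higher is better)."""
    return len(text.encode("utf-8")) / len(tokens)

word = "evlerden"               # Turkish: "from the houses"
good = ["ev", "ler", "den"]     # morpheme-aligned cuts
bad  = ["ev", "lerd", "en"]     # meaning-destroying cuts

print(fertility([good]) == fertility([bad]))                          # True
print(compression_ratio(word, good) == compression_ratio(word, bad))  # True
```

By both numbers the two tokenizers are identical; only one of them preserves the plural and the case marker.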

A 2025 ICML study across 70 languages confirmed the gap: morphological alignment does not explain much variance in model performance when measured by fertility alone, and oversegmentation can actually inflate apparent alignment scores. Other supporting metrics such as STRR (the share of words kept as single tokens) do nothing to help here.

Better proxies exist: MorphBPE's Morphological Consistency F1 and Morphological Edit Distance are both predictive of training convergence speed in ways fertility simply isn't. But the noisiness of data in the wild (typos and the like) means that a purely morphological approach is not always possible.


What the Tokenizer Breaks, the Model Must Fix

You might have used Gemma, Qwen or ChatGPT, and they do speak other languages, so you might want to challenge the claims I've laid out.

The truth is there's an alternative path to language-specific tokenizers. Gemma pioneered a massive 256k-token vocabulary and Qwen followed suit in its latest 3.5 edition. Because so many languages share the same writing script (the Latin, Arabic, and Cyrillic alphabets are each used to write completely different things), the tokenizer reflects this by producing "universal" tokens: tokens that can compress text from any language, but whose boundaries have no particular meaning for any given one.

But even when tokenization boundaries are not meaningful, the model, under the pressure of gradient descent, still has to find a way to mimic the training data and generate plausible responses. It's the middle layers that bear this cost. And those middle layers are not surplus capacity sitting idle: they're simultaneously responsible for syntactic composition, semantic integration, reasoning and, you know, doing the task you prompted the LLM for. The tokenizer sets the morphological reconstruction bill; the middle layers pay it out of a shared budget.

The cost you're paying is a much dumber model. I know because developers come to Sawalni with this exact problem: "My agent works great in French or English. But if a user asks in Darija, it crashes in quality and becomes unusable." This is the cost frontier LLMs with the appearance of multilingualism must pay: less intelligence, because half of the brain is too busy making sense of nonsensical pieces of text.

"But it works well in my language!" - for all things equal (model size, training data, compute) it would have worked better. The LLM is carrying deadweight holding it back on every token it generates.

Another piece of evidence comes from the "Tokenization Falling Short" paper (EMNLP 2024), which shows that scale partially recovers the gap introduced by bad tokenization. If scale buys back performance, then smaller models are using raw parameter budget as a substitute for clean input. That means you're not getting a 7B model's worth of reasoning; you're getting a 7B model spending a meaningful fraction of its capacity on reconstructing what should have been in the tokens to begin with.


Competing Directions: When Tokens Mean Too Many Things at Once

There are more than 500 languages from 20+ linguistic families written in the Latin alphabet, under very different assumptions and with overlapping but ultimately very different semantics, phonetics, everything-ics. So when a model trains on multilingual data with a bad tokenizer (is there even a good one?), each token's embedding has to capture a meaning that reflects every context in which it was encountered.

A token that appears in too many morphologically distinct contexts doesn't converge to a clean embedding, it accumulates competing gradient updates, one for each context it appears in across training. It needs to serve too many directions simultaneously.

Anthropic's superposition research makes the mechanism precise: when a model must represent more features than it has embedding dimensions, it encodes multiple features per direction, with interference as the price of admission. This is exactly what happens when BPE segments morphologically rich words by frequency rather than meaning: the same token fragment recurs across unrelated contexts, and its embedding is pulled toward all of them at once.
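A toy geometric picture of that interference (this is only the packing argument, not Anthropic's actual experimental setup): squeeze three feature directions into two dimensions and no pair can be orthogonal, so reading out one feature always picks up a contribution from the others.

```python
import math

# Three "features" spread as evenly as possible in a 2-D embedding space.
features = [(math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
            for k in range(3)]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

# With enough dimensions these dot products would all be 0 (orthogonal).
# Squeezed into 2-D, every probe for one feature leaks -0.5 of the others.
for i in range(3):
    for j in range(i + 1, 3):
        print(f"feature {i} · feature {j} = {dot(features[i], features[j]):+.2f}")
```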

The middle layers are then where this interference resolves. The polysemanticity emergence literature shows that competing feature directions converge specifically in middle-layer processing, meaning yet more of the shared capacity budget is consumed not on reasoning but on disambiguation the tokenizer manufactured in the first place. The "BPE Gets Picky" paper (EMNLP 2024) names this directly: standard BPE creates under-trained tokens by over-allocating vocabulary to high-frequency but semantically hollow units, degrading embedding parameter utilization overall.

To be precise about what's established vs. open: no single study has drawn a clean empirical line from boundary misalignment → elevated embedding directional variance → measurable reasoning degradation at fixed parameter count. That line is implied by the convergence of these findings. Closing it empirically is one of the specific questions this work is pursuing.


Typos, Diacritics, and the Brittleness Cascade

The fastest way to see what's wrong with discrete tokenization is to slightly corrupt the input. Interestingly, this problem is not unique to low-resource languages and impacts everyone.

Let's crank up some code and look closer. A minimal sketch, assuming a GPT-2-style BPE tokenizer loaded via transformers (any BPE tokenizer shows the same effect):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

variants = [
    "tell me",     # base
    "Tell me",     # capitalization
    "tell  me",    # double space
    "tllm e",      # transposition typo
    "tellme",      # omission typo
    "teell me",    # repetition typo
    "tell mé",     # diacritic
]

base_ids = set(tok.encode(variants[0]))
for v in variants:
    ids = set(tok.encode(v))
    jaccard = len(base_ids & ids) / len(base_ids | ids)
    print(f"{v!r:20} Jaccard: {jaccard:.2f}   {tok.tokenize(v)}")

# Result:
# 
# 'tell me'            Jaccard: 1.00   ['tell', 'Ġme']
# 'Tell me'            Jaccard: 0.33   ['Tell', 'Ġme']
# 'tell  me'           Jaccard: 0.67   ['tell', 'Ġ', 'Ġme']
# 'tllm e'             Jaccard: 0.00   ['t', 'll', 'm', 'Ġe']
# 'tellme'             Jaccard: 0.33   ['tell', 'me']
# 'teell me'           Jaccard: 0.25   ['te', 'ell', 'Ġme']
# 'tell mé'            Jaccard: 0.33   ['tell', 'Ġmé']

The Jaccard column tells part of the story. 'tllm e' shares literally nothing with 'tell me': no overlap in tokens, no structural relationship, no shared gradient history. And in 'tell mé', the one word that actually changed maps to Ġmé, a token with no relation whatsoever to Ġme. A human reads every variant as identical intent in under 100 milliseconds. As far as the model is concerned, these are foreign sequences.

If the model learns to recover some of these typos, it doesn't happen at the token embedding level. These are token embeddings, and they should not be this far apart. In an ideal world, the similarity would be close to 1:

Token similarity across typos in English

The issue is, predictably, not much better in other languages. Here it is in Moroccan Arabic:

Token similarity across typos in Moroccan Arabic

These are inconveniences for high-resource languages, which have seen enough variant co-occurrences to embed them nearby in representation space. For low-resource languages, those correspondences were never learned because the data to build them never existed. Khasi's ï and ñ are stripped or replaced at rates of 18–50% in model outputs and those characters are not decorative, they are meaning-bearing. A leading space creates a completely different token identity: ▁tell != tell. In agglutinative languages, this interacts destructively with prefix and suffix morphology at every word boundary.
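The Unicode normalization trap is easy to reproduce with the standard library alone: the same visible glyph can be encoded as one codepoint (NFC) or as a base letter plus a combining mark (NFD), and a byte- or token-level model sees two unrelated sequences.

```python
import unicodedata

nfc = "m\u00e9"                          # "mé" with precomposed é (U+00E9)
nfd = unicodedata.normalize("NFD", nfc)  # "mé" as e + combining accent (U+0301)

print(nfc == nfd)                # False: same glyph, different codepoints
print(list(nfc.encode("utf-8"))) # [109, 195, 169]
print(list(nfd.encode("utf-8"))) # [109, 101, 204, 129]
```

Any tokenizer keyed on exact byte sequences assigns these two renderings of "mé" to different tokens unless normalization is applied first, and normalization conventions vary wildly across data sources.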

Let's recap: when the model encounters a corrupted or variant-form input for a low-resource language, it has to do three things in its middle layers serially: reconstruct the intended characters from sub-token fragments, recover the morphological structure from those fragments, and map back to the semantic concept. In a well trained model with a good tokenizer, this is the role of the embedding layer, leaving the later middle layers to do useful work.

For a high-resource language with a well-fitted tokenizer, this chain barely activates since token embeddings are immediately usable. For a low-resource language with a misaligned tokenizer, it fires on nearly every token.


The Four Compounding Taxes

Low-resource languages don't just have less data. They carry a multiplicative penalty stack, where each problem amplifies the next.

Tax 1: Fertility overhead. More tokens per word. Shorter effective context window. Higher attention compute per sentence. And weird generalizations that do not map well to the language.

Tax 2: Morphological incoherence. Boundaries don't respect morphemes. The model spends middle-layer depth reconstructing what the tokenizer destroyed, instead of doing the task it was asked to do.

Tax 3: No variant recovery. Insufficient data to learn orthographic correspondences. Every typo, diacritic variant, normalization mismatch, and case variation is a cold start, completely unrelated sequences in embedding space, forever.

Tax 4: Capacity spillover. Taxes 1–3 consume context positions, layer depth, and embedding dimensions. What remains for actual reasoning is systematically smaller than what a high-resource language gets from an equivalent model.

It's a runaway effect. The less data a language has, the worse its tokenization quality. The worse its tokenization, the more data needed to compensate. The standard prescription, collect more data, presupposes a fixed tokenization overhead that low-resource languages are never positioned to pay off. You cannot data-scale your way out of a broken input pipeline. The tax compounds on the language that can least afford it.

The Second Deepseek Moment

But it's not a lost cause.

DeepSeek's OCR results demonstrated that feeding text as a rendered image to a vision encoder outperforms feeding the same text as tokens for character-level tasks. Practitioners independently rediscovered this as a "hack" with VLMs, literally screenshotting text before passing it to multimodal models to sidestep tokenization artifacts.

Why does it work? Vision encoders define a continuous latent space. A slightly shifted edge is still an edge. A slightly different pixel is still part of the same gradient. The representation absorbs variation by design, the structural opposite of what text tokenization does. There is no discrete lookup nor out-of-vocabulary problem. There is no Unicode normalization trap. There is just a continuous signal with smooth geometry.

Which raises a question critical for multilingualism: what would it mean to give a language model the same kind of perceptual front-end that vision models already take for granted? What if text, just a sequence of bytes at its most primitive level, could be consumed as a continuous signal rather than a lookup in a discrete symbol table? What if we could get rid of tokenization altogether?

The most robust tokenizer in production today might be a JPEG encoder.

Where do we go from here?

Tokenization-free architectures are gaining serious traction. ByT5, byte-level models, Meta FAIR's Large Concept Model operating in a concept embedding space rather than at the token level. These are genuine advances. But they require training from scratch, trade sequence efficiency for robustness, and are not deployable as improvements to the models that already serve the 340 languages Wikilangs covers.

What doesn't exist yet is a continuous pre-tokenization layer, a component that sits between raw text and the LLM's attention and MLP layers, mapping the brittle discrete token space into a smooth representation space where orthographic variants, diacritics, normalization forms, and morphological fragments collapse to nearby regions before the model ever sees them.
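To see why such a layer is plausible, note that even a crude character-level view already degrades gracefully where token identity collapses. Here character-bigram Jaccard stands in for a learned contrastive encoder (which it emphatically is not); compare with the token-level Jaccard scores from the experiment above.

```python
def char_bigrams(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}

def char_sim(a, b):
    """Jaccard similarity over character bigrams."""
    A, B = char_bigrams(a), char_bigrams(b)
    return len(A & B) / len(A | B)

# Token-level Jaccard was 0.33 for the diacritic variant and 0.00 for the
# transposition; at the character level the variants stay measurably close.
print(f"{char_sim('tell me', 'tell mé'):.2f}")   # 0.71
print(f"{char_sim('tell me', 'tllm e'):.2f}")    # 0.10
```

A learned continuous encoder could push these similarities far closer to 1; the point is that the character-level signal for "these are the same word" exists, and discrete token identity simply discards it.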

"Tokenization Falling Short" (EMNLP 2024) explicitly names perturbation-invariant tokenization strategies as future work. That future work is what this post is calling for.

Several concrete empirical questions follow and remain open:

  • Does degree of morphological misalignment at the token boundary predict, at fixed parameter count, measurable degradation in downstream reasoning tasks? Scale-sensitivity evidence implies yes. No study has controlled for this directly.

  • Can a continuous pre-tokenization layer trained contrastively to embed orthographic variants and morphological fragments to nearby regions recover the performance gap without retraining the LLM itself?

  • Does such a layer generalize across language families, or does it require per-family inductive biases? Wikilangs provides evaluation infrastructure across 340 languages to test this at scale. Sawalni is the proving ground for Moroccan languages specifically.

Tokenization is not a solved problem. It is the one structural barrier that compounds every other disadvantage low-resource languages carry. It's a leaky bucket, draining your resources no matter how far you scale your LLMs.

If you work on multilingual representations, input encoding architectures, or low-resource NLP, these are concrete open experiments. Reach out and let's figure it out.

Resources and references:

Post by Omar Kamali, an AI researcher from Berlin, founder of Sawalni.ma and focused on multilingualism and cultural alignment.

https://omarkamali.com

https://x.com/omarkamali
