wikiwonka

community
Activity Feed


omarkamali
posted an update 1 day ago
We got Qwen 3.5 to count Rs in Strawberry correctly! 🚨

Building on Sawtone, we’ve been testing a different way to feed language into an LLM to build the next generation of multilingual AI.

The usual setup gives the model tokenized text and asks it to perform various linguistic tasks. That works surprisingly well, until it doesn’t. Accents disappear. Words get mangled. Internal structure gets blurred away. And the cost of that gets higher once you move into multilingual and lower-resource settings.

So we tried adding a second path.

In addition to the normal text input, the model also receives Sawtone: a byte-level word representation that preserves how a word is written, how it sounds, and how it is structured.
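
To make the idea of a byte-level view concrete, here is a minimal Python sketch. It is not Sawtone's actual representation, which the post does not describe in detail; `byte_view` and `count_letter` are hypothetical helper names used only for illustration. It shows that plain UTF-8 bytes keep diacritics and exact spelling available for inspection:

```python
import unicodedata


def byte_view(word: str) -> list[int]:
    """Return the UTF-8 bytes of a word after NFC normalization.

    Illustrative only: this is plain UTF-8, not the Sawtone encoding.
    """
    return list(unicodedata.normalize("NFC", word).encode("utf-8"))


def count_letter(word: str, letter: str) -> int:
    """Count occurrences of a letter, ignoring case, at the character level."""
    return word.casefold().count(letter.casefold())


print(count_letter("Strawberry", "r"))  # 3
print(byte_view("café"))  # [99, 97, 102, 195, 169]
```

The accented character survives as its own two-byte sequence (195, 169), so nothing about the written form is lost before the model sees it.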

Same LLM. Better interface.

In this proof of concept with Qwen 3.5 0.8B, that pushed our eval from 64% to 88%. The gains showed up exactly where tokenized models usually get shaky: diacritics, character order, exact spelling, and other form-sensitive behavior.

Sawtone itself is tokenizer-free, byte-level, and pre-trained across 507 languages.

Still early, but promising!

omarkamali
posted an update 9 days ago
🌐 LID Benchmark update:

• 10 Regional Leaderboards
• 17 LID models (+7 new, incl. non-fastText based)
• 449 languages in total (200+ additional)
• Fixed: F1 macro reporting error
• Normalized language codes for more accurate results

The dataset is also updated, now with individual model predictions to reproduce and validate our findings.
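
For readers who want to recheck scores from the per-model predictions, here is a minimal sketch of macro-averaged F1. It is an assumption about how such a score is typically computed, not the benchmark's own code; `macro_f1` is a hypothetical helper, and averaging over the union of gold and predicted labels is a design choice made here:

```python
from collections import defaultdict


def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-label F1 over all labels seen in gold or predictions."""
    tp: dict[str, int] = defaultdict(int)
    fp: dict[str, int] = defaultdict(int)
    fn: dict[str, int] = defaultdict(int)
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted label gets a false positive
            fn[gold] += 1  # gold label gets a false negative
    labels = set(y_true) | set(y_pred)
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with ISO 639-3 codes (made-up predictions, not benchmark data).
gold = ["ary", "ary", "arb", "eng"]
pred = ["ary", "arb", "arb", "eng"]
print(round(macro_f1(gold, pred), 4))  # 0.7778
```

Because every language counts equally, a low-resource language with few samples moves the score as much as English does, which is why language code normalization matters for this metric.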

omneity-labs/lid-benchmark
omarkamali
posted an update 19 days ago
Omneity Labs LID Benchmark is live 🔥

- 8 Evals
- 10 Models (GlotLID, OpenLID, our own Gherbal and others)
- 200+ Languages
- One Leaderboard To Rule Them All!

Come find your language and which LID model supports it best in this space 👇

omneity-labs/lid-benchmark
omarkamali
posted an update 20 days ago
I just might have cracked tokenizer-free LLMs. No vocab, no softmax.

I'm training a 22M params LLM rn to test this "thing" and it's able to formulate coherent sentences 🤯

Bear in mind, this is a completely new, tokenizer-free LLM architecture with built-in language universality.

Check the explainer video to understand what's happening. Feedback welcome on this approach!

omarkamali
posted an update about 1 month ago
You're probably training on outdated Wikipedia data right now and don't know it. 💡

In June last year, a friend from the Moroccan Wikipedia community slid into my DMs: "Are you using the current version? The official dataset is severely outdated. We added so many articles nowhere to be found on HuggingFace."

He was right. I was running a 2023 snapshot. In 2025. The official Wikipedia dataset, the one hundreds of labs and researchers grab by default without a second thought, was frozen in time.
• For English, that's 700,000 missing articles.
• For Moroccan Arabic, 30% of the language's entire Wikipedia.
• For 31 other languages, there was literally no text corpus at all until recently.

I could've shrugged and moved on. Instead I spent the next months building a monthly automated pipeline for 340+ languages, on my personal laptop, nearly killing it several times in the process (100% disk, frozen screen, the works).

Nous Research trained Hermes 4 on it. INRIA cited it. It's now three years ahead of what most people are training on.

Here's the full story of how I built Wikipedia Monthly 👇

https://omarkamali.com/blog/wikipedia-monthly-pipeline
omarkamali
posted an update 3 months ago
New year, new dataset 🚀

I just released omarkamali/wikipedia-labels, with all the structural labels and namespaces from Wikipedia in 300+ languages. A gift for the data preprocessors and cleaners among us.

Happy new year 2026 everyone! 🎆
omarkamali
posted an update 4 months ago
Picomon v0.2.0 released! 💫

- Supports all of AMD, Nvidia and Apple Silicon 🧑‍🧑‍🧒‍🧒
- Beautiful TUI with themes (who said monitoring should be boring?) 💅
- Shareable Rig Cards! Boast to friends, family and foes alike 🫨

Get it now! Run uvx picomon, or pip install picomon and then picomon.
omarkamali
posted an update 4 months ago
Hello picomon! AMD GPU Monitoring made easy

Just run uvx picomon and behold:
┌──────────────────────────────────────────┐  ┌──────────────────────────────────────────┐
│ GPU 0  GFX  42%  UMC  21%                │  │ GPU 1  GFX  78%  UMC  66%                │
│ PWR 135/250W (54%)  VRAM 10.0/16.0GB 62% │  │ PWR 210/250W (84%)  VRAM 14.5/16.0GB 90% │
│                                          │  │                                          │
│ GFX ▁▂▂▃▄▄▅▆▆▇█▇▆▅▄▃▂▁                   │  │ GFX ▂▃▄▅▆▇██▇▆▅▄▂▂▃▅▆                    │
│ PWR ▁▁▂▂▃▄▄▅▆▇██▇▆▅▄▂▁                   │  │ PWR ▂▂▃▄▅▆▇██▇▆▅▄▃▂▂▃                    │
│ VRM ▁▁▂▂▃▄▄▅▆▇███▇▆▅▄▂                   │  │ VRM ▂▃▄▅▆▆▇███▇▆▅▄▃▂▂                    │
└──────────────────────────────────────────┘  └──────────────────────────────────────────┘


Repo at https://github.com/omarkamali/picomon
Or pypi at https://pypi.org/project/picomon
omarkamali
posted an update 4 months ago
Exciting updates to the Wikipedia Monthly dataset for November! 🚀

・ Fixed a bug to remove infobox leftovers and other wiki markers such as __TOC__
・ New Python package https://pypi.org/project/wikisets: a dataset builder with efficient sampling so you can combine the languages you want seamlessly for any date (ideal for pretraining data but works for any purpose)
・ Moved the pipeline to a large server. Much higher costs but with better reliability and predictability (let me know if you'd like to sponsor this!).
・ Dataset sizes are unfortunately missing for this month due to shenanigans with the migration, but should be back in December's update.
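
As a rough illustration of the marker cleanup mentioned in the first item, here is a minimal sketch that strips MediaWiki behavior switches such as __TOC__ from article text. This is an assumption about one way to do it, not the pipeline's actual code; `strip_magic_words` is a hypothetical helper, and infobox removal is not covered:

```python
import re

# Behavior switches are written as double-underscore, uppercase word,
# double-underscore, e.g. __TOC__ or __NOTOC__.
MAGIC_WORD = re.compile(r"__[A-Z]+__")


def strip_magic_words(text: str) -> str:
    """Remove behavior switches and tidy the whitespace they leave behind."""
    cleaned = MAGIC_WORD.sub("", text)
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned)  # collapse doubled spaces
    return cleaned.strip()


print(strip_magic_words("Intro __TOC__ text"))  # Intro text
```

Lowercase identifiers such as snake__case are left alone because the pattern only matches uppercase words.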

Check out the dataset:
omarkamali/wikipedia-monthly
omarkamali
posted an update 6 months ago
Another month, another Wikipedia Monthly release! 🎃

Highlights of October's edition:
· 🗣️ 341 languages
· 📚 64.7M articles (+2.5%)
· 📦 89.4GB of data (+3.3%)

We are now sampling a random subset of each language with a reservoir sampling method to produce splits 1000, 5000, and 10000 in addition to the existing train split that contains all the data.

Now you can load the English (or your favorite language) subset in seconds:
from datasets import load_dataset
dataset = load_dataset("omarkamali/wikipedia-monthly", "latest.en", split="10000")
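
For context on how fixed-size splits like these can be drawn in a single pass, here is a minimal sketch of classic reservoir sampling (Algorithm R). It is an illustration of the general technique, not the pipeline's actual implementation; `reservoir_sample` and its `seed` parameter are hypothetical names:

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Draw k items uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(100_000), 1000, seed=0)
print(len(sample))  # 1000
```

The appeal for a dump-processing pipeline is that memory stays bounded by k regardless of how many articles a language has.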

Happy data engineering! 🧰

omarkamali/wikipedia-monthly
omarkamali
posted an update 7 months ago
**Wikipedia Monthly's September edition is now live 🎉**

Highlights of this edition:
· 🗣️ 341 languages
· 📚 63.1M articles
· 📦 86.5GB of data

This update also solves upload issues in the August edition where some languages had missing parts. Happy data engineering!

omarkamali/wikipedia-monthly