---
title: README
emoji: π
colorFrom: purple
colorTo: pink
sdk: static
pinned: false
---
# 🍷 FineData

This is the home of the 🍷 **FineData** team, a branch of the 🤗 **Hugging Face** [Science Team](https://hf.co/science) that releases large-scale pre-training datasets to accelerate open LLM development.

- **[🍷 FineWeb](https://huggingface.co/collections/HuggingFaceFW/fineweb-662458592d61edba3d2f245d)**: a 15T-token English dataset for LLM pre-training. See the [blog post](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) and [paper](https://arxiv.org/abs/2406.17557).
- **[📚 FineWeb-Edu](https://huggingface.co/collections/HuggingFaceFW/fineweb-edu-6659c3f3d399d0e1d648adfd)**: a filtered subset of the most educational content from FineWeb.
- **[🥂 FineWeb2](https://huggingface.co/collections/HuggingFaceFW/fineweb2-6755657a481dae41e8fbba4d)**: an extension of FineWeb to over 1000 languages. See the [paper](https://arxiv.org/abs/2506.20920).
- **[📄 FinePDFs](https://huggingface.co/collections/HuggingFaceFW/finepdfs-68bd02d20928419c1dc12296)**: 3T tokens of text extracted from PDFs sourced from the Web.
- **[📚 FineWiki](https://huggingface.co/collections/HuggingFaceFW/finewiki-68f6615c6bb86563dcd5e846)**: an updated, better-extracted version of Wikipedia in 300+ languages.
- **[📚 FinePDFs-Edu](https://huggingface.co/datasets/HuggingFaceFW/finepdfs-edu)**: 350B+ highly educational tokens filtered from 📄 FinePDFs.