Corey Hubbard 7/9/25

The 20TB Multilingual LLM Data Revolution | Scale to 1000+ Languages with One Pipeline

Unlock the full potential of state-of-the-art multilingual LLMs with FineWeb2, a groundbreaking 20-terabyte (5-billion-document) pre-training dataset. It is generated by a curation pipeline that automatically adapts to support any language, overcoming the inherent difficulty of tailoring filtering and deduplication to a large number of languages. Using Common Crawl snapshots, FineWeb2 has been scaled to over 1000 languages. Models trained on it outperform models trained on prior datasets for non-English corpora, and a principled approach to rebalancing the data provides an additional performance uplift. The dataset, pipeline, training, and evaluation codebases have all been released, so you can access them today!

Corey Hubbard 6/25/25

AI: Less Magic, More Machine (Seriously, It's No Wizard!)

Thinking AI is magic? Think again! While it's brilliant at distinguishing puppies from muffins, it still struggles with sarcasm and can't quite write its own articles (yet!). Discover how humans are still crucial for truly smart AI. Your brain's still the boss!


Glassbury AI

Established 2024
