A multilingual web dataset focused on language and accessibility.
LangCrux is a dataset of 120,000 websites across 12 countries that use non-Latin scripts. It shows how language appears on the web, both in what users read and in what screen readers process.
It helps answer:
- What languages are used in visible web content?
- Are accessibility labels written in the same language as the page?
- How often do sites mix native language content with English?
Most existing datasets tell you how popular a site is. LangCrux tells you what language it speaks.
| Field | Description |
|---|---|
| Websites | 120,000 total |
| Countries | 12 |
| Scripts | Non-Latin (e.g. Arabic, Devanagari) |
| Source | Chrome UX Report (CrUX) |
| Language filter | Sites with 50%+ content in target language |
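
The 50% language filter can be approximated by measuring how much of a page's text is written in the target script. The sketch below is an assumption, not the pipeline LangCrux actually uses: `script_share` and `passes_language_filter` are hypothetical helpers, and matching on Unicode character names detects a script rather than a language (Cyrillic, for example, is shared by several languages).

```python
import unicodedata


def script_share(text: str, script_prefix: str) -> float:
    """Fraction of alphabetic characters whose Unicode name starts with script_prefix.

    script_prefix is a Unicode name prefix such as "CYRILLIC", "ARABIC", or "THAI".
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_script = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith(script_prefix)
    )
    return in_script / len(letters)


def passes_language_filter(text: str, script_prefix: str, threshold: float = 0.5) -> bool:
    """Hypothetical filter: keep a page when at least `threshold` of its letters match."""
    return script_share(text, script_prefix) >= threshold


# Three Cyrillic letters and three Latin letters give a share of exactly one half.
print(script_share("мир abc", "CYRILLIC"))
print(passes_language_filter("мир abc", "CYRILLIC"))
```

A language identification model may be a better fit than script counting for real pages; the threshold comparison would stay the same either way.
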
| Country | Language | Speakers (M) | Sites |
|---|---|---|---|
| 🇨🇳 China | Mandarin | 1,200 | 10,000 |
| 🇮🇳 India | Hindi | 609 | 10,000 |
| 🇩🇿 Algeria | Arabic | 335 | 10,000 |
| 🇧🇩 Bangladesh | Bangla | 284 | 10,000 |
| 🇷🇺 Russia | Russian | 253 | 10,000 |
| 🇯🇵 Japan | Japanese | 126 | 10,000 |
| 🇪🇬 Egypt | Arabic | 119 | 10,000 |
| 🇭🇰 Hong Kong | Cantonese | 85.5 | 10,000 |
| 🇰🇷 South Korea | Korean | 82 | 10,000 |
| 🇹🇭 Thailand | Thai | 71 | 10,000 |
| 🇬🇷 Greece | Greek | 13.5 | 10,000 |
| 🇮🇱 Israel | Hebrew | 9 | 10,000 |
👉 Interactive viewer:
https://anonymous.4open.science/w/LangCrux-F68F/
- Each dot = one website
- Click to see URL, metadata, and language stats
- Compare visible text vs. accessibility tags

Filter options in the viewer:

- Country
- Native Language %: how much of the site is in the native language
- Exact Match Only: keep only sites where visible and accessibility text are in the same language
- Data Sampling: control sample size (e.g., 100% = full dataset)
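
The filter options listed above can be reproduced offline on the downloaded data. The following is a minimal sketch under stated assumptions: the `Site` record and its field names (`visible_lang`, `a11y_lang`, `native_pct`) are hypothetical and may not match the dataset's actual schema, and Native Language % is treated here as a minimum threshold.

```python
import random
from dataclasses import dataclass


@dataclass
class Site:
    # Hypothetical record layout; the real dataset columns may differ.
    url: str
    country: str
    visible_lang: str   # dominant language of the visible text
    a11y_lang: str      # dominant language of the accessibility text
    native_pct: float   # share of content in the native language, 0-100


def apply_filters(sites, country=None, min_native_pct=0.0,
                  exact_match_only=False, sample_pct=100.0, seed=0):
    """Apply country, native-language, exact-match, and sampling filters."""
    kept = [
        s for s in sites
        if (country is None or s.country == country)
        and s.native_pct >= min_native_pct
        and (not exact_match_only or s.visible_lang == s.a11y_lang)
    ]
    if sample_pct < 100.0:
        # Seeded sampling so that repeated runs return the same subset.
        k = int(len(kept) * sample_pct / 100.0)
        kept = random.Random(seed).sample(kept, k)
    return kept


sites = [
    Site("https://a.example", "Japan", "ja", "ja", 90.0),
    Site("https://b.example", "Japan", "ja", "en", 80.0),
    Site("https://c.example", "Greece", "el", "el", 60.0),
]
matching = apply_filters(sites, country="Japan", exact_match_only=True)
print([s.url for s in matching])
```
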
⚠ Some websites may include adult content or be inaccessible from some regions.
LangCrux includes Kizuki, a Lighthouse extension that adds checks for language consistency in:
- `alt`
- `aria-label`
- form labels
It helps find mismatches between what’s shown and what assistive tools describe.
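
To illustrate the kind of mismatch such a check looks for, here is a rough Python sketch. It is not Kizuki's implementation (Kizuki runs inside Lighthouse): the class and function names are hypothetical, only `alt` and `aria-label` attributes are inspected, and the dominant Unicode script stands in for real language detection.

```python
import unicodedata
from collections import Counter
from html.parser import HTMLParser


class A11yTextExtractor(HTMLParser):
    """Collects visible text and alt / aria-label values separately."""

    def __init__(self):
        super().__init__()
        self.visible = []
        self.a11y = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        for name, value in attrs:
            if name in ("alt", "aria-label") and value:
                self.a11y.append(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.visible.append(data.strip())


def dominant_script(text):
    """Most common script among letters, taken from the first word of the Unicode name."""
    counts = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0] for ch in text if ch.isalpha()
    )
    return counts.most_common(1)[0][0] if counts else None


def find_mismatches(html):
    """Return accessibility strings whose script differs from the page's visible text."""
    parser = A11yTextExtractor()
    parser.feed(html)
    page_script = dominant_script(" ".join(parser.visible))
    if page_script is None:
        return []
    return [
        label for label in parser.a11y
        if dominant_script(label) not in (None, page_script)
    ]


page = (
    '<p>Привет, мир! Это пример страницы.</p>'
    '<img src="a.png" alt="A cat">'
    '<button aria-label="Закрыть">x</button>'
)
print(find_mismatches(page))
```

In this example the page text is Cyrillic, so the English `alt` value is reported while the Russian `aria-label` is not.
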
Want to help?
You can:
- Suggest new language-country pairs
- Improve language detection logic
- Report bugs in the viewer or audit scripts
- Help with documentation
Open a pull request or issue to get started.
License: Apache 2.0
For questions or feedback, email:
langcrux@protonmail.com
