Description
Unfortunately, I am not able to complete all the steps for a pull request (sorry!), but here is a suggested dataset, including links to sources and background information, and a cleaning script.
The original data is released as CC-BY 4.0 and should be cited as:
Hammarström, Harald & Forkel, Robert & Haspelmath, Martin & Bank, Sebastian. 2025. Glottolog 5.2.1. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at https://glottolog.org/)
Glottolog is the most comprehensive language database in linguistics and contains information (names, genealogy, geographical information, endangerment status, etc.) on over 8,000 languages of the world.
- This dataset has not already been used in TidyTuesday.
- The dataset will (probably) be less than 20MB when saved as a tidy CSV.
- I can imagine a data visualization related to this dataset.
- title: The Languages of the World
- article: An example of the dataset being used, such as a blog post or a README about the dataset.
  - title: "Glottolog: A Free, Online, Comprehensive Bibliography of the World’s Languages", but more relevant information at https://glottolog.org and https://glottolog.org/langdoc/status
  - url: The link to the article. https://pure.mpg.de/rest/items/item_2354764/component/file_2354763/content
- data_source:
  - title: Glottolog 5.2
  - url: https://github.com/glottolog/glottolog-cldf/tree/master/cldf, also https://zenodo.org/records/15525265
- images: One or more images related to the dataset. For each image, provide:
  - file: A url to download the image, or an attached file.
  - alt: Text that can serve as a replacement for the image for those who cannot see the image (whether through visual impairment or because the image does not load).
- cleaning_script: A script to fetch and clean the data, resulting in one or more data.frames (or equivalent structures) that can be saved as CSVs.
# Download raw values and keep only the endangerment status (AES) rows
endangered_status <-
  readr::read_csv("https://raw.githubusercontent.com/glottolog/glottolog-cldf/refs/heads/master/cldf/values.csv") |>
  dplyr::filter(Parameter_ID == "aes") |>
  dplyr::select(Language_ID, Value, Code_ID) |>
  dplyr::rename(
    id = Language_ID,
    status_code = Value,
    status_label = Code_ID
  ) |>
  # Strip the "aes-" prefix and replace every underscore with a space
  dplyr::mutate(status_label = stringr::str_replace_all(stringr::str_remove(status_label, "^aes-"), "_", " "))

# Download language and family data
fam_lgs <-
  readr::read_csv("https://raw.githubusercontent.com/glottolog/glottolog-cldf/refs/heads/master/cldf/languages.csv")

# Filter and clean language family data
families <-
  fam_lgs |>
  dplyr::filter(Level == "family") |>
  dplyr::select(ID, Name) |>
  dplyr::rename(Family = Name) |>
  dplyr::rename_with(stringr::str_to_lower, dplyr::everything())

# Filter and clean language data
languages <-
  fam_lgs |>
  dplyr::filter(Level == "language") |>
  dplyr::select(ID, Name, Macroarea, Latitude, Longitude, ISO639P3code, Countries, Is_Isolate, Family_ID) |>
  dplyr::rename_with(stringr::str_to_lower, dplyr::everything())
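
The label clean-up in the endangerment step can be checked offline on a few sample values. This is only a sketch: the `code_ids` vector is hypothetical and assumes that `Code_ID` values follow an `aes-<label>` pattern with underscores between words, which is what the script's prefix removal and underscore replacement imply.

```r
# Hypothetical Code_ID values, assumed to follow the "aes-<label>" pattern
code_ids <- c("aes-not_endangered", "aes-shifting", "aes-nearly_extinct")

# Drop the "aes-" prefix, then turn every underscore into a space
status_labels <- stringr::str_replace_all(
  stringr::str_remove(code_ids, "^aes-"),
  "_", " "
)

print(status_labels)
```

`str_replace_all()` is used here so that a label with more than one underscore would also be cleaned completely.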
- data_dictionary: A description of each column in the dataset, including the column name, the data type of the column, and a description of the column.
languages
| variable | class | description |
|---|---|---|
| id | character | Unique identifier for language |
| name | character | Language name |
| macroarea | character | General geographic area in which the language is found |
| latitude | double | Latitude of language location (as point) |
| longitude | double | Longitude of language location (as point) |
| iso639p3code | character | ISO 639-3 identifier of language (if available) |
| countries | character | Countries in which language is used (separated by ";") |
| is_isolate | logical | Whether language is an isolate (i.e. has no known relatives) |
| family_id | character | Unique identifier of family that the language is part of (if not isolate) |
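
Because `countries` packs several values into one string, it may help to see how it could be unpacked into one row per language and country. The sketch below uses made-up rows (`languages_demo`, with invented ids, names, and country codes) rather than the real data, and assumes `tidyr` 1.3.0 or later for `separate_longer_delim()`.

```r
# Toy rows; the ids, names, and country codes are made up for illustration
languages_demo <- data.frame(
  id = c("aaaa1111", "bbbb2222"),
  name = c("Language A", "Language B"),
  countries = c("NO;SE", "FI"),
  stringsAsFactors = FALSE
)

# Split the ";"-separated string so each language-country pair gets its own row
language_countries <- tidyr::separate_longer_delim(
  languages_demo,
  countries,
  delim = ";"
)

print(language_countries)
```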
families
| variable | class | description |
|---|---|---|
| id | character | Unique identifier for language family |
| family | character | Language family name |
endangered_status
| variable | class | description |
|---|---|---|
| id | character | Unique identifier for language |
| status_code | character | Code of the agglomerated endangerment status (1–6) |
| status_label | character | Descriptive label of endangerment category |
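
The three tables are linked by their identifier columns: `family_id` in `languages` points to `id` in `families`, and `id` in `endangered_status` matches `id` in `languages`. A minimal sketch of combining them, using made-up rows rather than the real data, could look like this. All `*_demo` objects and their values are hypothetical, and the family name column is assumed to be called `family`, as produced by the cleaning script's rename.

```r
# Toy stand-ins for the three tables; all values are invented for illustration
languages_demo <- data.frame(
  id = c("aaaa1111", "bbbb2222"),
  name = c("Language A", "Language B"),
  is_isolate = c(FALSE, TRUE),
  family_id = c("ffff0000", NA),
  stringsAsFactors = FALSE
)
families_demo <- data.frame(
  id = "ffff0000",
  family = "Family X",
  stringsAsFactors = FALSE
)
endangered_demo <- data.frame(
  id = c("aaaa1111", "bbbb2222"),
  status_code = c("1", "6"),
  status_label = c("not endangered", "extinct"),
  stringsAsFactors = FALSE
)

# Attach the family name via family_id, then the endangerment status via id
languages_full <- languages_demo |>
  dplyr::left_join(families_demo, by = c("family_id" = "id")) |>
  dplyr::left_join(endangered_demo, by = "id")

print(languages_full)
```

Left joins keep every language, so isolates simply end up with a missing `family` value instead of being dropped.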