Glottolog database of the world's languages #908

@borstell

Unfortunately I'm not able to complete all the steps for a pull request (sorry!), but here is a suggested dataset, including links to the relevant sources and information, and a cleaning script.

The original data is released as CC-BY 4.0 and should be cited as:

Hammarström, Harald & Forkel, Robert & Haspelmath, Martin & Bank, Sebastian. 2025. Glottolog 5.2.1. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at https://glottolog.org/)

Glottolog is the most comprehensive language database in linguistics, and contains information (names, genealogy, geographical information, endangerment status, etc.) on over 8,000 languages of the world.

  • This dataset has not already been used in TidyTuesday.

  • The dataset will (probably) be less than 20MB when saved as a tidy CSV.

  • I can imagine a data visualization related to this dataset.

  • title: The Languages of the World

  • article: An example of the dataset being used, such as a blog post or a README about the dataset.

  • data_source:

  • images: One or more images related to the dataset. For each image, provide:

    • file: A url to download the image, or an attached file.
    • alt: Text that can serve as a replacement for the image for those who cannot see the image (whether through visual impairment or because the image does not load).
  • cleaning_script: A script to fetch and clean the data, resulting in one or more data.frames (or equivalent structures) that can be saved as CSVs.

# Download raw values and keep only the endangerment status ("aes") rows
endangered_status <- 
  readr::read_csv("https://raw.githubusercontent.com/glottolog/glottolog-cldf/refs/heads/master/cldf/values.csv") |> 
  dplyr::filter(Parameter_ID == "aes") |> 
  dplyr::select(Language_ID, Value, Code_ID) |> 
  dplyr::rename(id = Language_ID,
                status_code = Value,
                status_label = Code_ID) |> 
  # Strip the "aes-" prefix and replace every underscore with a space
  dplyr::mutate(status_label = stringr::str_replace_all(stringr::str_remove(status_label, "^aes-"), "_", " "))

# Download language and family data
fam_lgs <- 
  readr::read_csv("https://raw.githubusercontent.com/glottolog/glottolog-cldf/refs/heads/master/cldf/languages.csv")

# Filter and clean language family data
families <- 
  fam_lgs |> 
  dplyr::filter(Level == "family") |> 
  dplyr::select(ID, Name) |> 
  dplyr::rename(Family = Name) |> 
  dplyr::rename_with(stringr::str_to_lower, dplyr::everything())

# Filter and clean language data
languages <- 
  fam_lgs |> 
  dplyr::filter(Level == "language") |> 
  dplyr::select(ID, Name, Macroarea, Latitude, Longitude, ISO639P3code, Countries, Is_Isolate, Family_ID) |> 
  dplyr::rename_with(stringr::str_to_lower, dplyr::everything()) 
  • data_dictionary: A description of each column in the dataset, including the column name, the data type of the column, and a description of the column.

languages

| variable | class | description |
|:---|:---|:---|
| id | character | Unique identifier for language |
| name | character | Language name |
| macroarea | character | General geographic area in which the language is found |
| latitude | double | Latitude of language location (as point) |
| longitude | double | Longitude of language location (as point) |
| iso639p3code | character | ISO 639-3 identifier of language (if available) |
| countries | character | Countries in which language is used (separated by ";") |
| is_isolate | logical | Whether language is an isolate (i.e. has no known relatives) |
| family_id | character | Unique identifier of family that the language is part of (if not isolate) |

families

| variable | class | description |
|:---|:---|:---|
| id | character | Unique identifier for language family |
| family | character | Language family name |

endangered_status

| variable | class | description |
|:---|:---|:---|
| id | character | Unique identifier for language |
| status_code | character | Code of the agglomerated endangerment status (1–6) |
| status_label | character | Descriptive label of endangerment category |
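
The three tables share keys (`id` in `languages` and `endangered_status`, and `family_id` in `languages` pointing to `id` in `families`), so they can be combined for plotting. The following is only a minimal sketch of one way to do that. The rows are made-up placeholders (all IDs, names, and country codes are hypothetical) so that the snippet runs without downloading anything, and `tidyr::separate_longer_delim()` assumes tidyr 1.3.0 or later.

```r
# Toy stand-ins for the three cleaned tables (hypothetical rows, not real Glottolog data)
languages <- tibble::tibble(
  id = c("lang0001", "lang0002", "lang0003"),
  name = c("Language A", "Language B", "Language C"),
  countries = c("AA;BB", "AA", NA),
  is_isolate = c(FALSE, FALSE, TRUE),
  family_id = c("fam0001", "fam0001", NA)
)
families <- tibble::tibble(id = "fam0001", family = "Family X")
endangered_status <- tibble::tibble(
  id = c("lang0001", "lang0002", "lang0003"),
  status_code = c("1", "3", "6"),
  status_label = c("not endangered", "shifting", "extinct")
)

# Attach the family name and the endangerment status to each language
languages_full <-
  languages |>
  dplyr::left_join(families, by = c("family_id" = "id")) |>
  dplyr::left_join(endangered_status, by = "id")

# Split the ";"-separated countries column into one row per language-country pair
languages_by_country <-
  languages_full |>
  tidyr::separate_longer_delim(countries, delim = ";")
```

Left joins keep isolates in the result, with `NA` in the `family` column, which is probably what most visualizations would want.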
