Glottolog database of the world's languages

Unfortunately I'm not able to do all the steps for a pull request (sorry!), but here is a suggested dataset, including links to sources and background information, plus a cleaning script.

The original data is released as CC-BY 4.0 and should be cited as:

> Hammarström, Harald & Forkel, Robert & Haspelmath, Martin & Bank, Sebastian. 2025. Glottolog 5.2.1. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at https://glottolog.org/)

Glottolog is the most comprehensive language database in linguistics, and contains information (names, genealogy, geographical information, endangerment status, etc.) on over 8,000 languages of the world.

- [x] This dataset has not already been used in TidyTuesday.
- [x] The dataset will (probably) be less than 20MB when saved as a tidy CSV.
- [x] I can imagine a data visualization related to this dataset.

- [x] **title:** The Languages of the World
- [x] **article:** An example of the dataset being used, such as a blog post or a README about the dataset.
  - [x] **title:** "_Glottolog: A Free, Online, Comprehensive Bibliography of the World’s Languages_"; more relevant information is available at <https://glottolog.org> and <https://glottolog.org/langdoc/status>
  - [x] **url:** The link to the article. <https://pure.mpg.de/rest/items/item_2354764/component/file_2354763/content> 
- [x] **data_source:** 
  - [x] **title:** Glottolog 5.2
  - [x] **url:** <https://github.com/glottolog/glottolog-cldf/tree/master/cldf>, also <https://zenodo.org/records/15525265>
- [ ] **images:** One or more images related to the dataset. For each image, provide:
  - [ ] **file:** A url to download the image, or an attached file.
  - [ ] **alt:** Text that can serve as a replacement for the image for those who cannot see the image (whether through visual impairment or because the image does not load).


- [x] **cleaning_script:** A script to fetch and clean the data, resulting in one or more data.frames (or equivalent structures) that can be saved as CSVs.

```r
# Download raw values and keep only the endangerment status ("aes") rows
endangered_status <- 
  readr::read_csv("https://raw.githubusercontent.com/glottolog/glottolog-cldf/refs/heads/master/cldf/values.csv") |> 
  dplyr::filter(Parameter_ID == "aes") |> 
  dplyr::select(Language_ID, Value, Code_ID) |> 
  dplyr::rename(id = Language_ID,
                status_code = Value,
                status_label = Code_ID) |> 
  # Strip the "aes-" prefix, then replace every underscore with a space
  dplyr::mutate(status_label = stringr::str_replace_all(stringr::str_remove(status_label, "^aes-"), "_", " "))

# Download language and family data
fam_lgs <- 
  readr::read_csv("https://raw.githubusercontent.com/glottolog/glottolog-cldf/refs/heads/master/cldf/languages.csv")

# Filter and clean language family data
families <- 
  fam_lgs |> 
  dplyr::filter(Level == "family") |> 
  dplyr::select(ID, Name) |> 
  dplyr::rename(Family = Name) |> 
  dplyr::rename_with(stringr::str_to_lower, dplyr::everything())

# Filter and clean language data
languages <- 
  fam_lgs |> 
  dplyr::filter(Level == "language") |> 
  dplyr::select(ID, Name, Macroarea, Latitude, Longitude, ISO639P3code, Countries, Is_Isolate, Family_ID) |> 
  dplyr::rename_with(stringr::str_to_lower, dplyr::everything()) 

```
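
The three tables produced by the script share keys (`id` and `family_id`), so they can be combined into one table per language. A minimal sketch of that join, using small toy tibbles in place of the downloaded data; every ID, name, and status value below is a made-up placeholder, and `languages_full` is a hypothetical name:

```r
# Toy stand-ins for the tables built by the cleaning script.
# All rows are made-up placeholders, not real Glottolog entries.
languages <- tibble::tibble(
  id = c("lang0001", "lang0002", "lang0003"),
  name = c("Language A", "Language B", "Language C"),
  is_isolate = c(FALSE, FALSE, TRUE),
  family_id = c("fam0001", "fam0001", NA)
)
families <- tibble::tibble(
  id = "fam0001",
  family = "Family X"
)
endangered_status <- tibble::tibble(
  id = c("lang0001", "lang0003"),
  status_code = c("1", "6"),
  status_label = c("not endangered", "extinct")
)

# Attach the family name and the endangerment status to each language.
# left_join() keeps every language and fills NA where nothing matches.
languages_full <- 
  languages |> 
  dplyr::left_join(families, by = c("family_id" = "id")) |> 
  dplyr::left_join(endangered_status, by = "id")

# Save as a tidy CSV (written to a temporary file here)
out_file <- tempfile(fileext = ".csv")
readr::write_csv(languages_full, out_file)
```

Isolates have no `family_id`, so their `family` stays `NA` after the join, and languages without an endangerment entry get `NA` in both status columns.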

- [ ] **data_dictionary:** A description of each column in the dataset, including the column name, the data type of the column, and a description of the column.

`languages`
|variable            |class     |description         |
|:-------------------|:---------|:-------------------|
|id            |character     |Unique identifier for language         |
|name |character|Language name|
|macroarea|character|General geographic area in which the language is found|
|latitude|double|Latitude of language location (as point)|
|longitude|double|Longitude of language location (as point)|
|iso639p3code|character|ISO 639-3 identifier of language (if available)|
|countries|character|Countries in which language is used (separated by ";")|
|is_isolate|logical|Whether language is an isolate (i.e. has no known relatives)|
|family_id|character|Unique identifier of family that the language is part of (if not isolate)|
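
Because `countries` packs several values into one string, it may help to split it into one row per language and country before counting or mapping. A rough sketch, assuming tidyr 1.3.0 or later for `separate_longer_delim()`; the rows and country codes are made-up placeholders and `by_country` is a hypothetical name:

```r
# Toy rows with placeholder IDs and country codes (not real data)
languages <- tibble::tibble(
  id = c("lang0001", "lang0002"),
  name = c("Language A", "Language B"),
  countries = c("AA;BB", "CC")
)

# Split the ";"-separated string so each row holds a single country
by_country <- 
  languages |> 
  tidyr::separate_longer_delim(countries, delim = ";")
```

Here the first language is listed for two countries, so it appears in two rows of `by_country`.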

`families`

|variable            |class     |description         |
|:-------------------|:---------|:-------------------|
|id            |character     |Unique identifier for language family         |
|family |character|Language family name|

`endangered_status`
|variable            |class     |description         |
|:-------------------|:---------|:-------------------|
|id            |character     |Unique identifier for language         |
|status_code |character|Code of the agglomerated endangerment status (1–6)|
|status_label|character|Descriptive label of endangerment category|
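
Since `status_code` is stored as character, sorting or plotting by severity probably calls for converting it to a number first. A small sketch of counting languages per status; the rows are made-up placeholders, the pairing of codes and labels is only illustrative, and `status_counts` and `n_languages` are hypothetical names:

```r
# Toy rows (placeholders, not real Glottolog entries)
endangered_status <- tibble::tibble(
  id = c("lang0001", "lang0002", "lang0003"),
  status_code = c("1", "6", "1"),
  status_label = c("not endangered", "extinct", "not endangered")
)

# Convert the code to integer so it sorts numerically, then count per status
status_counts <- 
  endangered_status |> 
  dplyr::mutate(status_code = as.integer(status_code)) |> 
  dplyr::count(status_code, status_label, name = "n_languages") |> 
  dplyr::arrange(status_code)
```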
