GTAA Phylogenetic Diversity Analysis

This repository accompanies a paper that investigates archival bias patterns in collections from the Dutch National Archives using Faith's Phylogenetic Diversity (PD) metrics applied to the GTAA (General Thesaurus for Audiovisual Archives) vocabulary. The project quantifies institutional bias by assessing how well various archival subcollections cover the conceptual space defined by the GTAA vocabulary. By borrowing phylogenetic diversity metrics from biodiversity research, the analysis measures conceptual diversity and identifies gaps in archival preservation.

Methodology

Based on Faith's Phylogenetic Diversity and Chao1 unseen diversity estimation, the analysis calculates three key ratios:

Coverage Ratio: collection_pd / gtaa_total_pd
- What fraction of total possible conceptual space does this collection cover?
Completeness Ratio: collection_pd / (collection_pd + unseen_pd)
- Within the conceptual territory this collection covers, how thoroughly is it documented?
- High completeness = few missing subjects that should be there
- Low completeness = many related concepts are missing → incomplete preservation
Efficiency Ratio: coverage_ratio / log(collection_size)
- How efficient is the collection at covering conceptual space?
- High efficiency = small specialized collections with conceptual breadth
- Low efficiency = massive collections with repetitive content

Project Structure

GTAA_PD/
├── README.md                          # This file
├── archival_bias_detection.ipynb      # Main analysis notebook
├── data/
│   ├── external/
│   │   └── gtaa_ontology.csv          # GTAA vocabulary data
│   └── processed/                     # Processed data in Parquet format
│   │   └── photos_archives.parquet    # NA photo metadata in parquet format
├── results/                           # Analysis outputs
└── src/                               # Source code modules
    ├── archival_bias_detection.py     # Main analysis class (includes ontology analysis)
    ├── data_processing.py             # Data preprocessing utilities
    ├── faith_pd.py                    # Faith's PD implementation
    ├── graph_builder.py               # GTAA graph construction
    ├── test_unseenpd.py               # Testing utilities
    └── unseen_pd.py                   # Unseen diversity estimation

Installation using UV

curl -LsSf https://astral.sh/uv/install.sh | sh
navigate to project directory
add data to data. You can skip the creation of the parquet file with metadata from the raw JSONS and just add the photo_archive.parquet file to data/processed and the gtaa_ontology.csv to data/external
uv sync
uv run jupyter notebook archival_bias_detection.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GTAA Phylogenetic Diversity Analysis

Methodology

Project Structure

Installation using UV

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

GTAA Phylogenetic Diversity Analysis

Methodology

Project Structure

Installation using UV