This repository accompanies a paper that investigates archival bias patterns in collections from the Dutch National Archives using Faith's Phylogenetic Diversity (PD) metrics applied to the GTAA (General Thesaurus for Audiovisual Archives) vocabulary. The project quantifies institutional bias by assessing how well various archival subcollections cover the conceptual space defined by the GTAA vocabulary. By borrowing phylogenetic diversity metrics from biodiversity research, the analysis measures conceptual diversity and identifies gaps in archival preservation.
Based on Faith's Phylogenetic Diversity and Chao1 unseen diversity estimation, the analysis calculates three key ratios:
-
Coverage Ratio:
collection_pd / gtaa_total_pd- What fraction of total possible conceptual space does this collection cover?
-
Completeness Ratio:
collection_pd / (collection_pd + unseen_pd)- Within the conceptual territory this collection covers, how thoroughly is it documented?
- High completeness = few missing subjects that should be there
- Low completeness = many related concepts are missing → incomplete preservation
-
Efficiency Ratio:
coverage_ratio / log(collection_size)- How efficient is the collection at covering conceptual space?
- High efficiency = small specialized collections with conceptual breadth
- Low efficiency = massive collections with repetitive content
GTAA_PD/
├── README.md # This file
├── archival_bias_detection.ipynb # Main analysis notebook
├── data/
│ ├── external/
│ │ └── gtaa_ontology.csv # GTAA vocabulary data
│ └── processed/ # Processed data in Parquet format
│ │ └── photos_archives.parquet # NA photo metadata in parquet format
├── results/ # Analysis outputs
└── src/ # Source code modules
├── archival_bias_detection.py # Main analysis class (includes ontology analysis)
├── data_processing.py # Data preprocessing utilities
├── faith_pd.py # Faith's PD implementation
├── graph_builder.py # GTAA graph construction
├── test_unseenpd.py # Testing utilities
└── unseen_pd.py # Unseen diversity estimation
curl -LsSf https://astral.sh/uv/install.sh | sh- navigate to project directory
- add data to
data. You can skip the creation of the parquet file with metadata from the raw JSONS and just add thephoto_archive.parquetfile to data/processed and thegtaa_ontology.csvtodata/external uv syncuv run jupyter notebook archival_bias_detection.ipynb