
Registry of data portals, catalogs, data repositories including data catalogs dataset and catalog description standard


dataportals-registry

Registry of data portals, catalogs, data repositories, and more.

This is a transitional repository used to create a registry of all existing open data portals and repositories.

This is the first pillar of the open search engine project. Other pillars include:

  • registry of all catalogs (this one)
  • datasets raw metadata database
  • unified dataset search index and search engine
  • datasets backup and file cache

Please take a look at the project mind map to see its goals and structure.

What kinds of data catalogs are collected?

This registry includes descriptions of the following kinds of data catalogs:

  • Open data portals
  • Geoportals
  • Scientific data repositories
  • Indicators catalogs
  • Microdata catalogs
  • Machine learning catalogs
  • Data search engines
  • API Catalogs
  • Data marketplaces
  • Other

Inspiration

This project is inspired by the Re3Data and FAIRsharing projects. The key difference is its focus on open data as a broad topic, not just open research data.

The final version of this repository will be reorganized as a database with a publicly available open API and bulk data dumps.

How this repository is organized

Warning: this is a temporary description and subject to change

Entities

Data catalog descriptions are YAML files in the data/entities folder. Files are organized into country/territory folders, and inside each country folder there are type folders such as scientific, opendata, microdata, geo, search, marketplace, and other.
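The layout above maps a catalog's country, type, and id to a predictable file location. A minimal sketch, assuming a `.yaml` extension and the folder names listed above (the `entity_path` helper itself is illustrative, not part of the repo):

```python
from pathlib import Path

def entity_path(country: str, catalog_type: str, catalog_id: str,
                root: str = "data/entities") -> Path:
    # Mirrors the described folder layout: data/entities/<COUNTRY>/<type>/<id>.yaml.
    # The ".yaml" extension is an assumption for illustration.
    return Path(root) / country / catalog_type / f"{catalog_id}.yaml"

print(entity_path("US", "opendata", "catalogdatagov").as_posix())
# → data/entities/US/opendata/catalogdatagov.yaml
```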

Example

Data.gov YAML file

access_mode:
- open
api: true
api_status: active
catalog_type: Open data portal
content_types:
- dataset
coverage:
- location:
    country:
      id: US
      name: United States
    level: 1
endpoints:
- type: ckanapi
  url: https://catalog.data.gov/api/3
export_standard: CKAN API
id: catalogdatagov
identifiers:
- id: wikidata
  url: https://www.wikidata.org/wiki/Q5227102
  value: Q5227102
- id: re3data
  url: https://www.re3data.org/repository/r3d100010078
  value: r3d100010078
- id: fairsharing
  url: https://fairsharing.org/FAIRsharing.6069e1
  value: FAIRsharing.6069e1
langs:
- EN
link: https://catalog.data.gov
name: Data.gov
owner:
  location:
    country:
      id: US
      name: United States
    level: 1
  name: U.S. General Services Administration
  type: Central government
software: CKAN
status: active
tags:
- government
- has_api

Datasets and code

Datasets are kept in the data/datasets folder; right now it is the catalogs.jsonl file generated by the builder.py script in the scripts folder.

Run python builder.py build in the scripts folder to regenerate the catalogs.jsonl file from the YAML files.
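The serialization step of such a build can be sketched as follows: each parsed YAML record becomes one JSON object per line. This is a sketch of the JSONL output format only, not the actual builder.py logic; the sample records are condensed from the YAML example above (the second one is hypothetical):

```python
import json

def build_jsonl(entities):
    # One compact JSON object per line, as in catalogs.jsonl.
    # sort_keys makes output stable across runs for cleaner diffs.
    return "\n".join(json.dumps(e, ensure_ascii=False, sort_keys=True)
                     for e in entities)

records = [
    {"id": "catalogdatagov", "name": "Data.gov", "software": "CKAN"},
    {"id": "examplegeo", "name": "Example Geoportal", "software": "GeoNetwork"},
]
print(build_jsonl(records))
```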

Data exports

Latest snapshot (2026-02-21):

  • data/datasets/catalogs.jsonl (+ .zst): 13,877 catalog records
  • data/datasets/software.jsonl (+ .zst): 136 software/platform definitions
  • data/datasets/scheduled.jsonl (+ .zst): scheduled sources to crawl
  • data/datasets/full.jsonl (+ .zst): 13,877 combined entities + scheduled records
  • data/datasets/full.parquet, data/datasets/datasets.duckdb: analytics-friendly exports
  • data/datasets/bytype/, data/datasets/bysoftware/: sliced JSONL exports by catalog type or platform

All .zst files can be decompressed with unzstd file.zst (zstd), and DuckDB exports can be queried directly with duckdb or Python's duckdb package.
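Once decompressed, the JSONL exports can be consumed line by line. A minimal stdlib-only reader (illustrative; the sample ids are made up for the demo):

```python
import json

def iter_jsonl(lines):
    # Yield one parsed record per non-empty line of a JSONL export
    # (e.g. catalogs.jsonl after decompressing with unzstd).
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

sample = [
    '{"id": "catalogdatagov", "status": "active"}',
    '',
    '{"id": "examplegeo", "status": "active"}',
]
for rec in iter_jsonl(sample):
    print(rec["id"])
```

In practice you would pass an open file object instead of the `sample` list, since file objects iterate line by line.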

Discovery

How to find catalogs in this registry:

By geography

  • Entity YAMLs live under data/entities/COUNTRY_CODE/ (e.g. US, FR, BR).
  • Use Federal/ for federal-level catalogs and subregion codes for states/regions (e.g. US-CA, US-VA, BR-SP).
  • One YAML per catalog; filename is the catalog id.

By catalog type

  • Under each country (or scheduled/), type folders: opendata/, geo/, scientific/, microdata/, indicators/, ml/, search/, api/, marketplace/, other/.

From export artifacts

  • catalogs.jsonl / full.jsonl: line-delimited JSON (entities only, or entities + scheduled).
  • full.parquet, data/datasets/datasets.duckdb: for analytics; query with DuckDB or pandas.
  • data/datasets/bytype/ and data/datasets/bysoftware/: pre-sliced JSONL by catalog type or software platform.

Example DuckDB query (all CKAN catalogs in the US from the full export). The built DuckDB store normalizes nested fields to JSON strings, so filter on the software and coverage string columns:

SELECT id, name, link
FROM catalogs
WHERE software LIKE '%"id":"ckan"%'
  AND coverage LIKE '%"id":"US"%';
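The same filter can be applied in plain Python over full.jsonl, without DuckDB. The field shapes here (software as a plain string, coverage as a list of location blocks) are assumptions based on the YAML example above, not a guaranteed schema:

```python
import json

def us_ckan_catalogs(jsonl_lines):
    # Plain-Python counterpart of the DuckDB query: keep records whose
    # software is CKAN and whose coverage includes the US.
    for line in jsonl_lines:
        rec = json.loads(line)
        countries = [c.get("location", {}).get("country", {}).get("id")
                     for c in rec.get("coverage", [])]
        if rec.get("software") == "CKAN" and "US" in countries:
            yield rec["id"], rec["name"], rec["link"]

sample = json.dumps({
    "id": "catalogdatagov", "name": "Data.gov",
    "link": "https://catalog.data.gov", "software": "CKAN",
    "coverage": [{"location": {"country": {"id": "US", "name": "United States"},
                               "level": 1}}],
})
print(list(us_ckan_catalogs([sample])))
```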

Data Quality and Validation

The repository includes tools for analyzing and validating data quality:

  • Duplicate Detection: Scripts to identify duplicate UIDs and IDs across all records
  • Schema Validation: Validation against JSON schemas in data/schemes/
  • Data Quality Reports: Analysis reports written to the dataquality/ directory

To run data quality analysis:

python scripts/builder.py analyze-quality

Reports are written to dataquality/ (e.g. full_report.txt, primary_priority.jsonl, and per-country/per-priority breakouts).
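The core of duplicate detection is counting each identifier across all records and reporting those seen more than once. A minimal sketch (not the actual analyze-quality implementation; the sample ids are hypothetical):

```python
from collections import Counter

def duplicate_ids(records, key="id"):
    # Count each identifier and return those appearing more than once,
    # sorted for stable report output.
    counts = Counter(r[key] for r in records if key in r)
    return sorted(v for v, n in counts.items() if n > 1)

catalogs = [{"id": "a1"}, {"id": "b2"}, {"id": "a1"}]
print(duplicate_ids(catalogs))  # → ['a1']
```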

Re3data Enrichment

Catalogs with re3data identifiers can be automatically enriched with metadata from re3data.org. The enrichment adds a _re3data field containing keywords, content types, contact information, persistent identifiers, software information, and more.

To enrich catalogs:

# Preview enrichment (dry run)
python scripts/re3data_enrichment.py enrich --dry-run

# Apply enrichment
python scripts/re3data_enrichment.py enrich
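The merge step performed by the enrichment can be sketched as attaching the fetched metadata under the _re3data key without touching existing fields. This shows only the merge, not the fetching from re3data.org, and the sample metadata dict is illustrative:

```python
def attach_re3data(record, re3data_meta):
    # Copy the record and attach fetched metadata under _re3data,
    # leaving all existing fields untouched.
    enriched = dict(record)
    enriched["_re3data"] = re3data_meta
    return enriched

cat = {"id": "catalogdatagov", "name": "Data.gov"}
meta = {"keywords": ["government", "open data"]}
print(attach_re3data(cat, meta)["_re3data"]["keywords"])
```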

See devdocs/re3data_enrichment.md for detailed documentation.

CKAN Ecosystem Synchronization

CKAN websites can be automatically discovered and synchronized from the official CKAN ecosystem dataset. The sync script fetches CKAN site metadata, checks for duplicates, and adds missing sites to the registry with enriched metadata.

To synchronize CKAN sites:

# Preview sync (dry run) - see what would be added
python scripts/sync_ckan_ecosystem.py --dry-run

# Sync and add to scheduled directory (default)
python scripts/sync_ckan_ecosystem.py

# Sync and add directly to entities directory
python scripts/sync_ckan_ecosystem.py --entities

# Customize delay between requests (seconds)
python scripts/sync_ckan_ecosystem.py --delay 2.0

# Disable web scraping enrichment
python scripts/sync_ckan_ecosystem.py --no-enrich

The script automatically:

  • Fetches CKAN sites from ecosystem.ckan.org via CKAN API
  • Detects duplicates by URL/domain matching
  • Enriches metadata from both the dataset and web scraping
  • Adds missing sites using existing registry infrastructure
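The duplicate check by URL/domain matching can be sketched as normalizing both sides to a bare host name before comparing. This is an assumption about the approach, not the sync script's actual matching rules:

```python
from urllib.parse import urlparse

def normalize_domain(url):
    # Reduce a URL to a comparable host name: drop the scheme,
    # a leading "www.", and any path or query string.
    host = urlparse(url.strip()).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def is_known(candidate_url, known_urls):
    # True if the candidate's domain already appears in the registry.
    known = {normalize_domain(u) for u in known_urls}
    return normalize_domain(candidate_url) in known

print(is_known("https://www.catalog.data.gov/dataset",
               {"https://catalog.data.gov"}))  # → True
```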

See devdocs/ckan_ecosystem_sync.md for detailed documentation.

The data quality analysis also generates comprehensive reports on:

  • Duplicate UIDs and IDs
  • Missing required fields
  • Filename mismatches
  • Empty files and parsing errors

See devdocs/duplicates_and_errors_report.md for detailed findings. Generated reports and per-country breakouts are stored in dataquality/ alongside a summary data_quality_report.txt. Helper scripts (scripts/fix_*_issues.py) can be used to apply automated fixes based on the reported priorities.

How to contribute?

If you find a mistake or have an additional data catalog to add, please open a pull request or file an issue.

Data sources

The following data sources are used:

License

Source code is licensed under the MIT license. Data is licensed under the CC-BY 4.0 license.
