
Registry of data portals, catalogs, data repositories including data catalogs dataset and catalog description standard


dataportals-registry

Registry of data portals, catalogs, data repositories, and more.

This is a transitional repository used to create a registry of all existing open data portals and repositories.

This is the first pillar of the open search engine project. Other pillars include:

  • registry of all catalogs (this one)
  • datasets raw metadata database
  • unified dataset search index and search engine
  • datasets backup and file cache

Please take a look at the project mind map to see its goals and structure.

What kinds of data catalogs are collected?

This registry includes descriptions of the following kinds of data catalogs:

  • Open data portals
  • Geoportals
  • Scientific data repositories
  • Indicators catalogs
  • Microdata catalogs
  • Machine learning catalogs
  • Data search engines
  • API Catalogs
  • Data marketplaces
  • Other

Inspiration

This project is inspired by the Re3Data and FAIRsharing projects. The key difference is its focus on open data as a broad topic, not just open research data.

The final version of this repository will be reorganized as a database with a publicly available open API and bulk data dumps.

How this repository is organized

Warning: this is a temporary description and subject to change

Entities

Data catalog descriptions are YAML files in the data/entities folder. Files are organized into country/territory folders, and inside each country folder there are type folders such as scientific, opendata, microdata, geo, search, marketplace, and other.
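The layout above maps a catalog's country, type, and id to a predictable file location. A minimal sketch, assuming a `.yaml` extension and the folder names listed above (the `entity_path` helper itself is illustrative, not part of the repo):

```python
from pathlib import Path

def entity_path(country: str, catalog_type: str, catalog_id: str,
                root: str = "data/entities") -> Path:
    # Mirrors the described folder layout: data/entities/<COUNTRY>/<type>/<id>.yaml.
    # The ".yaml" extension is an assumption for illustration.
    return Path(root) / country / catalog_type / f"{catalog_id}.yaml"

print(entity_path("US", "opendata", "catalogdatagov").as_posix())
# → data/entities/US/opendata/catalogdatagov.yaml
```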

Example

Data.gov YAML file

access_mode:
- open
api: true
api_status: active
catalog_type: Open data portal
content_types:
- dataset
coverage:
- location:
    country:
      id: US
      name: United States
    level: 1
endpoints:
- type: ckanapi
  url: https://catalog.data.gov/api/3
export_standard: CKAN API
id: catalogdatagov
identifiers:
- id: wikidata
  url: https://www.wikidata.org/wiki/Q5227102
  value: Q5227102
- id: re3data
  url: https://www.re3data.org/repository/r3d100010078
  value: r3d100010078
- id: fairsharing
  url: https://fairsharing.org/FAIRsharing.6069e1
  value: FAIRsharing.6069e1
langs:
- EN
link: https://catalog.data.gov
name: Data.gov
owner:
  location:
    country:
      id: US
      name: United States
    level: 1
  name: U.S. General Services Administration
  type: Central government
software: CKAN
status: active
tags:
- government
- has_api

Datasets and code

Datasets are kept in the data/datasets folder; right now it is the catalogs.jsonl file generated by the builder.py script in the scripts folder.

Run python builder.py build in the scripts folder to regenerate the catalogs.jsonl file from the YAML files.
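The serialization step of such a build can be sketched as follows: each parsed YAML record becomes one JSON object per line. This is a sketch of the JSONL output format only, not the actual builder.py logic; the sample records are condensed from the YAML example above (the second one is hypothetical):

```python
import json

def build_jsonl(entities):
    # One compact JSON object per line, as in catalogs.jsonl.
    # sort_keys makes output stable across runs for cleaner diffs.
    return "\n".join(json.dumps(e, ensure_ascii=False, sort_keys=True)
                     for e in entities)

records = [
    {"id": "catalogdatagov", "name": "Data.gov", "software": "CKAN"},
    {"id": "examplegeo", "name": "Example Geoportal", "software": "GeoNetwork"},
]
print(build_jsonl(records))
```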

Data exports

Latest snapshot (2026-02-21):

  • data/datasets/catalogs.jsonl (+ .zst): 13,877 catalog records
  • data/datasets/software.jsonl (+ .zst): 136 software/platform definitions
  • data/datasets/scheduled.jsonl (+ .zst): scheduled sources to crawl
  • data/datasets/full.jsonl (+ .zst): 13,877 combined entities + scheduled records
  • data/datasets/full.parquet, data/datasets/datasets.duckdb: analytics-friendly exports
  • data/datasets/bytype/, data/datasets/bysoftware/: sliced JSONL exports by catalog type or platform

All .zst files can be decompressed with unzstd file.zst (zstd), and DuckDB exports can be queried directly with duckdb or Python's duckdb package.
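Once decompressed, the JSONL exports can be consumed line by line. A minimal stdlib-only reader (illustrative; the sample ids are made up for the demo):

```python
import json

def iter_jsonl(lines):
    # Yield one parsed record per non-empty line of a JSONL export
    # (e.g. catalogs.jsonl after decompressing with unzstd).
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

sample = [
    '{"id": "catalogdatagov", "status": "active"}',
    '',
    '{"id": "examplegeo", "status": "active"}',
]
for rec in iter_jsonl(sample):
    print(rec["id"])
```

In practice you would pass an open file object instead of the `sample` list, since file objects iterate line by line.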

Discovery

How to find catalogs in this registry:

By geography

  • Entity YAMLs live under data/entities/COUNTRY_CODE/ (e.g. US, FR, BR).
  • Use Federal/ for federal-level catalogs and subregion codes for states/regions (e.g. US-CA, US-VA, BR-SP).
  • One YAML per catalog; filename is the catalog id.

By catalog type

  • Under each country (or scheduled/), type folders: opendata/, geo/, scientific/, microdata/, indicators/, ml/, search/, api/, marketplace/, other/.

From export artifacts

  • catalogs.jsonl / full.jsonl: line-delimited JSON (entities only, or entities + scheduled).
  • full.parquet, data/datasets/datasets.duckdb: for analytics; query with DuckDB or pandas.
  • data/datasets/bytype/ and data/datasets/bysoftware/: pre-sliced JSONL by catalog type or software platform.

Example DuckDB query (all CKAN catalogs in the US from the full export). The built DuckDB store normalizes nested fields to JSON strings, so filter on the software and coverage string columns:

SELECT id, name, link
FROM catalogs
WHERE software LIKE '%"id":"ckan"%'
  AND coverage LIKE '%"id":"US"%';
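The same filter can be applied in plain Python over full.jsonl, without DuckDB. The field shapes here (software as a plain string, coverage as a list of location blocks) are assumptions based on the YAML example above, not a guaranteed schema:

```python
import json

def us_ckan_catalogs(jsonl_lines):
    # Plain-Python counterpart of the DuckDB query: keep records whose
    # software is CKAN and whose coverage includes the US.
    for line in jsonl_lines:
        rec = json.loads(line)
        countries = [c.get("location", {}).get("country", {}).get("id")
                     for c in rec.get("coverage", [])]
        if rec.get("software") == "CKAN" and "US" in countries:
            yield rec["id"], rec["name"], rec["link"]

sample = json.dumps({
    "id": "catalogdatagov", "name": "Data.gov",
    "link": "https://catalog.data.gov", "software": "CKAN",
    "coverage": [{"location": {"country": {"id": "US", "name": "United States"},
                               "level": 1}}],
})
print(list(us_ckan_catalogs([sample])))
```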

Data Quality and Validation

The repository includes tools for analyzing and validating data quality:

  • Duplicate Detection: Scripts to identify duplicate UIDs and IDs across all records
  • Schema Validation: Validation against JSON schemas in data/schemes/
  • Data Quality Reports: Analysis reports written to the dataquality/ directory

To run data quality analysis:

python scripts/builder.py analyze-quality

Reports are written to dataquality/ (e.g. full_report.txt, primary_priority.jsonl, and per-country/per-priority breakouts).
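The core of duplicate detection is counting each identifier across all records and reporting those seen more than once. A minimal sketch (not the actual analyze-quality implementation; the sample ids are hypothetical):

```python
from collections import Counter

def duplicate_ids(records, key="id"):
    # Count each identifier and return those appearing more than once,
    # sorted for stable report output.
    counts = Counter(r[key] for r in records if key in r)
    return sorted(v for v, n in counts.items() if n > 1)

catalogs = [{"id": "a1"}, {"id": "b2"}, {"id": "a1"}]
print(duplicate_ids(catalogs))  # → ['a1']
```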

Re3data Enrichment

Catalogs with re3data identifiers can be automatically enriched with metadata from re3data.org. The enrichment adds a _re3data field containing keywords, content types, contact information, persistent identifiers, software information, and more.

To enrich catalogs:

# Preview enrichment (dry run)
python scripts/re3data_enrichment.py enrich --dry-run

# Apply enrichment
python scripts/re3data_enrichment.py enrich
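The merge step performed by the enrichment can be sketched as attaching the fetched metadata under the _re3data key without touching existing fields. This shows only the merge, not the fetching from re3data.org, and the sample metadata dict is illustrative:

```python
def attach_re3data(record, re3data_meta):
    # Copy the record and attach fetched metadata under _re3data,
    # leaving all existing fields untouched.
    enriched = dict(record)
    enriched["_re3data"] = re3data_meta
    return enriched

cat = {"id": "catalogdatagov", "name": "Data.gov"}
meta = {"keywords": ["government", "open data"]}
print(attach_re3data(cat, meta)["_re3data"]["keywords"])
```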

See devdocs/re3data_enrichment.md for detailed documentation.

CKAN Ecosystem Synchronization

CKAN websites can be automatically discovered and synchronized from the official CKAN ecosystem dataset. The sync script fetches CKAN site metadata, checks for duplicates, and adds missing sites to the registry with enriched metadata.

To synchronize CKAN sites:

# Preview sync (dry run) - see what would be added
python scripts/sync_ckan_ecosystem.py --dry-run

# Sync and add to scheduled directory (default)
python scripts/sync_ckan_ecosystem.py

# Sync and add directly to entities directory
python scripts/sync_ckan_ecosystem.py --entities

# Customize delay between requests (seconds)
python scripts/sync_ckan_ecosystem.py --delay 2.0

# Disable web scraping enrichment
python scripts/sync_ckan_ecosystem.py --no-enrich

The script automatically:

  • Fetches CKAN sites from ecosystem.ckan.org via CKAN API
  • Detects duplicates by URL/domain matching
  • Enriches metadata from both the dataset and web scraping
  • Adds missing sites using existing registry infrastructure
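The duplicate check by URL/domain matching can be sketched as normalizing both sides to a bare host name before comparing. This is an assumption about the approach, not the sync script's actual matching rules:

```python
from urllib.parse import urlparse

def normalize_domain(url):
    # Reduce a URL to a comparable host name: drop the scheme,
    # a leading "www.", and any path or query string.
    host = urlparse(url.strip()).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def is_known(candidate_url, known_urls):
    # True if the candidate's domain already appears in the registry.
    known = {normalize_domain(u) for u in known_urls}
    return normalize_domain(candidate_url) in known

print(is_known("https://www.catalog.data.gov/dataset",
               {"https://catalog.data.gov"}))  # → True
```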

See devdocs/ckan_ecosystem_sync.md for detailed documentation.

The data quality analysis also generates comprehensive reports on:

  • Duplicate UIDs and IDs
  • Missing required fields
  • Filename mismatches
  • Empty files and parsing errors

See devdocs/duplicates_and_errors_report.md for detailed findings. Generated reports and per-country breakouts are stored in dataquality/ alongside a summary data_quality_report.txt. Helper scripts (scripts/fix_*_issues.py) can be used to apply automated fixes based on the reported priorities.

How to contribute?

If you find a mistake or have an additional data catalog to add, please open a pull request or file an issue.

Data sources

The following data sources are used:

License

Source code is licensed under the MIT license. Data is licensed under the CC-BY 4.0 license.
