These instructions are for AI assistants working in this project.
Always open @/openspec/AGENTS.md when the request:
- Mentions planning or proposals (words like proposal, spec, change, plan)
- Introduces new capabilities, breaking changes, architecture shifts, or big performance/security work
- Sounds ambiguous and you need the authoritative spec before coding
Use @/openspec/AGENTS.md to learn:
- How to create and apply change proposals
- Spec format and conventions
- Project structure and guidelines
Keep this managed block so 'openspec update' can refresh the instructions.
The dataportals-registry is a comprehensive registry of data portals, catalogs, data repositories, and related data infrastructure. It serves as the first pillar of the open search engine project, aiming to create a unified discovery system for open data across the globe.
The registry collects and maintains structured metadata about:
- Open data portals
- Geoportals
- Scientific data repositories
- Indicators catalogs
- Microdata catalogs
- Machine learning catalogs
- Data search engines
- API Catalogs
- Data marketplaces
- Other data infrastructure
As of February 2026, the registry contains 13,877 catalog entries from countries worldwide, stored as individual YAML files and exported as JSONL, Parquet, and DuckDB formats.
| Component | Technology |
|---|---|
| Language | Python 3.9-3.12 |
| Data Storage | YAML files (individual catalog entries) |
| Export Formats | JSONL, Parquet, DuckDB |
| Compression | zstandard (zstd) |
| CLI Framework | typer |
| Schema Validation | Cerberus, pydantic |
| Testing | pytest with coverage (pytest-cov) |
| Data Analysis | pandas, DuckDB |
| HTTP Client | requests |
| YAML Processing | PyYAML |
| Terminal UI | rich (progress bars, tables) |
| Web Scraping | beautifulsoup4 |
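As a minimal sketch, the JSONL exports listed above can be read with the standard library alone. The `load_jsonl` helper here is illustrative (not a function from `scripts/builder.py`), and the sample record is written to a throwaway file; in the repo you would point it at `data/datasets/catalogs.jsonl` instead:

```python
import json
import tempfile
from pathlib import Path

def load_jsonl(path):
    """Yield one dict per line of a JSONL file, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)

# Illustrative usage with a temporary file standing in for the real export.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as fh:
    fh.write('{"id": "catalogdatagov", "status": "active"}\n')
    sample = fh.name

records = list(load_jsonl(sample))
active = [r for r in records if r.get("status") == "active"]
print(len(active))  # → 1
```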
dataportals-registry/
├── data/
│ ├── entities/ # Verified catalog entries (YAML)
│ │ ├── US/ # Country code folders
│ │ │ ├── Federal/ # Federal-level catalogs
│ │ │ ├── US-CA/ # Subregion (state) catalogs
│ │ │ └── ...
│ │ └── ... # 195+ countries/territories
│ ├── scheduled/ # Unverified/scheduled entries
│ ├── software/ # Software/platform definitions (YAML)
│ ├── schemes/ # JSON schemas for validation
│ │ ├── catalog.json # Main catalog schema
│ │ └── software.json # Software schema
│ ├── datasets/ # Generated exports
│ │ ├── catalogs.jsonl # Main catalog export
│ │ ├── software.jsonl # Software export
│ │ ├── full.jsonl # Combined entities + scheduled
│ │ ├── *.zst # Compressed versions
│ │ ├── datasets.duckdb # DuckDB database
│ │ └── full.parquet # Parquet format
│ └── reference/ # Reference data and vocabularies
├── scripts/ # Python automation scripts
│ ├── builder.py # Main build/validation CLI
│ ├── constants.py # Constants and mappings
│ ├── re3data_enrichment.py # Re3Data integration
│ ├── sync_ckan_ecosystem.py # CKAN ecosystem sync
│ ├── fix_*_issues.py # Data quality fix scripts
│ └── ...
├── tests/ # pytest test suite
│ ├── test_builder.py
│ ├── test_yaml.py
│ └── ...
├── dataquality/ # Quality analysis outputs
│ ├── full_report.txt # Human-readable quality report
│ ├── primary_priority.jsonl # Machine-readable issues
│ ├── countries/ # Per-country breakdowns
│ └── priorities/ # By priority level
├── openspec/ # OpenSpec for spec-driven dev
│ ├── AGENTS.md # OpenSpec instructions
│ ├── project.md # Project conventions
│ ├── specs/ # Current capability specs
│ └── changes/ # Proposed changes
├── devdocs/ # Development documentation
│ ├── quality-fix-workflow.md
│ ├── ckan_ecosystem_sync.md
│ └── scheduled-to-entities.md
├── requirements.txt # Python dependencies
├── pytest.ini # pytest configuration
├── CONTRIBUTING.md # Contribution guidelines
└── README.md # Project overview
Entities are organized hierarchically:
data/entities/
├── {COUNTRY_CODE}/ # ISO country code (US, GB, FR, etc.)
│ ├── Federal/ # Federal/national level
│ ├── {SUBREGION_CODE}/ # State/province codes (US-CA, GB-SCT)
│ │ └── {CATALOG_TYPE}/ # Type subdirectory
│ │ └── {id}.yaml # Individual catalog entry
│ └── {CATALOG_TYPE}/
│ └── {id}.yaml
| Type | Subdirectory | Description |
|---|---|---|
| Open data portal | opendata/ | Default, government open data |
| Geoportal | geo/ | Geographic/spatial data portals |
| Scientific data repository | scientific/ | Research data repositories |
| Indicators catalog | indicators/ | Statistical indicators |
| Microdata catalog | microdata/ | Survey/microdata catalogs |
| Machine learning catalog | ml/ | ML datasets and models |
| Data search engine | search/ | Dataset search engines |
| API Catalog | api/ | API directories |
| Data marketplace | marketplace/ | Commercial data markets |
| Metadata catalog | metadata/ | Metadata registries |
| Other | other/ | Uncategorized |
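The type-to-subdirectory mapping above can be sketched in Python as a plain dict. Note this is a hypothetical illustration: the authoritative mappings live in `scripts/constants.py`, and the `entry_path` helper is not a function from the codebase:

```python
# Hypothetical sketch of the type→subdirectory mapping from the table above;
# the authoritative mapping lives in scripts/constants.py.
CATALOG_TYPE_DIRS = {
    "Open data portal": "opendata",
    "Geoportal": "geo",
    "Scientific data repository": "scientific",
    "Indicators catalog": "indicators",
    "Microdata catalog": "microdata",
    "Machine learning catalog": "ml",
    "Data search engine": "search",
    "API Catalog": "api",
    "Data marketplace": "marketplace",
    "Metadata catalog": "metadata",
    "Other": "other",
}

def entry_path(country, catalog_type, entry_id, subregion=None):
    """Build the expected YAML path for an entry under data/entities/."""
    parts = ["data", "entities", country]
    if subregion:
        parts.append(subregion)  # e.g. US-CA for a state-level catalog
    parts.append(CATALOG_TYPE_DIRS.get(catalog_type, "other"))
    parts.append(f"{entry_id}.yaml")
    return "/".join(parts)

print(entry_path("US", "Open data portal", "catalogdatagov"))
# → data/entities/US/opendata/catalogdatagov.yaml
```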
Every catalog entry MUST include these fields:
id: catalogdatagov # Unique ID (matches filename)
uid: cdi00001616 # Unique identifier (cdi######## format)
name: The Home of the U.S. Government Open Data # Display name
link: https://catalog.data.gov # URL to catalog
catalog_type: Open data portal # One of the allowed types
access_mode: # List of access modes
- open
status: active # active, inactive, or scheduled
software: # Software platform info
id: ckan
name: CKAN
owner: # Owner organization
name: GSA Technology Transformation Services
type: Central government # Owner type
location: # Geographic location
country:
id: US
name: United States
coverage: # Geographic coverage
- location:
country:
id: US
name: United States
  level: 20               # Geographic level

Filename rules:
- Filename must match the `id` field exactly
- Use only lowercase letters and numbers
- Remove special characters (dots, dashes, underscores)
- Example: `https://catalog.data.gov` → id `catalogdatagov` → `catalogdatagov.yaml`
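The naming rules above can be sketched as a small helper. The `url_to_id` function is hypothetical (the project may derive ids differently), but it reproduces the documented example:

```python
import re
from urllib.parse import urlparse

def url_to_id(url):
    """Derive a catalog id from a URL: lowercase letters and digits only,
    with dots, dashes, underscores, and other punctuation removed."""
    parsed = urlparse(url)
    return re.sub(r"[^a-z0-9]", "", (parsed.netloc + parsed.path).lower())

print(url_to_id("https://catalog.data.gov"))  # → catalogdatagov
```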
Schema is defined in data/schemes/catalog.json using Cerberus format. Key validation rules:
- `access_mode`: must be a list of "open" or "restricted"
- `catalog_type`: must be one of the allowed types (see table above)
- `status`: must be "active", "inactive", or "scheduled"
- `software`: must have `id` and `name` subfields
- `owner`: must have `name`, `type`, and `location` with country info
- `uid`: format `cdi########` for entities, `temp########` for scheduled
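The uid formats above can be checked with a simple regex. This is a sketch, not the project's validator; the eight-digit width is inferred from the `cdi00001616` example earlier in this document:

```python
import re

# Sketch of the uid formats described above: cdi######## for verified
# entities, temp######## for scheduled entries (8 digits assumed from
# the cdi00001616 example).
UID_RE = re.compile(r"^(cdi|temp)\d{8}$")

def valid_uid(uid):
    return bool(UID_RE.match(uid))

print(valid_uid("cdi00001616"))   # → True
print(valid_uid("temp00000042"))  # → True
print(valid_uid("cdi123"))        # → False
```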
# Install dependencies
pip install -r requirements.txt
# Build all datasets (from YAML to JSONL/DuckDB)
python scripts/builder.py build
# Validate all YAML files against schema (options allow validating a single entry after adding or editing one)
python scripts/builder.py validate-yaml
# Run full test suite with coverage
pytest
# Run specific test file
pytest tests/test_builder.py -v
# Assign UIDs to new entries (run after adding entries)
python scripts/builder.py assign
# Analyze data quality
python scripts/builder.py analyze-quality
# Generate quality control metrics report
python scripts/builder.py quality-control

Method 1: Using CLI (recommended)
python scripts/builder.py add-single \
--url "https://example.com/data" \
--software "ckan" \
--catalog-type "Open data portal" \
--name "Example Data Portal" \
--country "US" \
  --scheduled

Method 2: Manual YAML creation
- Create file in correct location: `data/entities/{COUNTRY}/{TYPE}/{id}.yaml`
- Ensure `id` field matches filename
- Run `python scripts/builder.py assign` to generate UID
- Run `python scripts/builder.py validate-yaml` to verify
# 1. Generate quality report
python scripts/builder.py analyze-quality
# 2. Review reports
cat dataquality/full_report.txt
# 3. Apply fixes (choose one method)
# Method A: Priority-based scripts
python scripts/fix_critical_issues.py
python scripts/fix_important_issues.py
# Method B: Generate Cursor commands
python scripts/generate_cursor_commands.py
# Then use scripts/update_all_issues.sh
# 4. Validate fixes
python scripts/builder.py validate-yaml
# 5. Re-run quality check
python scripts/builder.py analyze-quality

- Follow PEP 8 style guidelines
- Use meaningful variable and function names
- Add docstrings to functions and classes
- Keep functions focused and small
- Use type hints where appropriate
- Import ordering: standard library, third-party, local
- Use 2 spaces for indentation (no tabs)
- Use consistent formatting
- Keep lines under 100 characters when possible
- Use quotes for strings with special characters
- Use lists for multiple values
- Filename must match the `id` field
- Write clear, descriptive commit messages
- Start with a verb: "Add", "Fix", "Update", "Remove"
- Make atomic commits (one logical change per commit)
- Reference issue numbers when applicable:
"Add example catalog (fixes #123)"
Configured in pytest.ini:
- Test files: `test_*.py` pattern
- Coverage for `scripts/` directory
- Markers: `unit`, `integration`, `slow`
- Reports: terminal, HTML, XML
# Run all tests with coverage
pytest
# Run with verbose output
pytest -v
# Run specific test class
pytest tests/test_builder.py::TestLoadJsonl -v
# Run only unit tests
pytest -m unit
# Run without coverage (faster)
pytest --no-cov

- Place tests in `tests/` directory
- Use pytest fixtures from `conftest.py`
- Test both valid and invalid cases
- Mock external API calls
- Test file I/O with temporary directories
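The last two guidelines can be sketched with the standard library alone. The `fetch_status` helper is hypothetical, standing in for any code that makes HTTP calls (in real code `http_get` would be `requests.get`):

```python
import tempfile
from pathlib import Path
from unittest import mock

def fetch_status(url, http_get):
    # Hypothetical helper under test; http_get stands in for requests.get.
    return http_get(url, timeout=10).status_code

def test_fetch_status_mocks_network():
    # Mock the HTTP call instead of hitting the network.
    fake_get = mock.Mock(return_value=mock.Mock(status_code=200))
    assert fetch_status("https://example.com", fake_get) == 200
    fake_get.assert_called_once_with("https://example.com", timeout=10)

def test_file_io_uses_tempdir():
    # Write and read inside a temporary directory, never the real data tree.
    with tempfile.TemporaryDirectory() as tmp:
        entry = Path(tmp) / "example.yaml"
        entry.write_text("id: example\n", encoding="utf-8")
        assert entry.read_text(encoding="utf-8").startswith("id:")

test_fetch_status_mocks_network()
test_file_io_uses_tempdir()
```

In the real suite these would live under `tests/` and be collected by pytest rather than called directly.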
CLI commands available:
| Command | Description |
|---|---|
| `build` | Build JSONL datasets from YAML files |
| `validate-yaml` | Validate all YAML files against schema |
| `validate` | Validate JSONL against schema |
| `assign` | Assign UIDs to entries missing them |
| `add-single` | Add a single catalog via CLI |
| `add-list` | Add catalogs from a list file |
| `analyze-quality` | Run data quality analysis |
| `quality-control` | Generate quality metrics report |
| `export` | Export to CSV format |
| `stats` | Generate statistics tables |
| `report` | Report incomplete data |
| Script | Purpose |
|---|---|
| `re3data_enrichment.py` | Enrich entries with Re3Data metadata |
| `sync_ckan_ecosystem.py` | Sync with CKAN ecosystem dataset |
| `enrich.py` | General enrichment utilities |
| `enrich_soft.py` | Software detection enrichment |
| Script | Priority / Focus |
|---|---|
| `fix_critical_issues.py` | CRITICAL |
| `fix_important_issues.py` | IMPORTANT |
| `fix_medium_issues.py` | MEDIUM |
| `fix_low_issues.py` | LOW |
| `fix_all_issues.py` | All priorities |
| `fix_duplicate_tags.py` | Tag duplicates |
| `fix_tag_hygiene.py` | Tag quality |
| `fix_software_id.py` | Software ID fixes |
| `fix_api_status_mismatch.py` | API status fixes |
| Script | Purpose |
|---|---|
| `promote_scheduled.py` | Promote scheduled entries to entities |
| `remove_scheduled_duplicates.py` | Remove scheduled records that exist in entities |
Enriches scientific repositories with metadata from re3data.org:
# Preview enrichment
python scripts/re3data_enrichment.py enrich --dry-run
# Apply enrichment
python scripts/re3data_enrichment.py enrich

Discovers and adds CKAN sites from ecosystem.ckan.org:
# Preview sync
python scripts/sync_ckan_ecosystem.py --dry-run
# Sync and add to scheduled
python scripts/sync_ckan_ecosystem.py
# Sync and add to entities (verified)
python scripts/sync_ckan_ecosystem.py --entities

This project uses OpenSpec for spec-driven development of new features and breaking changes.
Create a proposal for:
- New features or capabilities
- Breaking changes (API, schema)
- Architecture or pattern changes
- Performance optimizations that change behavior
- Security pattern updates
Work directly for:
- Bug fixes (restore intended behavior)
- Typos, formatting, comments
- Adding/editing single catalog entries
- Dependency updates (non-breaking)
- Tests for existing behavior
# List active changes
openspec list
# List specifications
openspec list --specs
# Show change details
openspec show <change-id>
# Validate change
openspec validate <change-id> --strict
# Archive completed change
openspec archive <change-id> --yes

openspec/
├── AGENTS.md # This file - OpenSpec instructions
├── project.md # Project conventions and context
├── specs/ # Current capability specs
│ └── [capability]/
│ ├── spec.md # Requirements and scenarios
│ └── design.md # Technical patterns
└── changes/ # Proposed changes
├── [change-id]/
│ ├── proposal.md # Why and what
│ ├── tasks.md # Implementation checklist
│ ├── design.md # Technical decisions (optional)
│ └── specs/ # Delta specs
│ └── [capability]/
│ └── spec.md
└── archive/ # Completed changes
See openspec/AGENTS.md for full OpenSpec instructions.
- Check if catalog already exists in `data/entities/` or `data/scheduled/`
- Use CLI to add: `python scripts/builder.py add-single --url ... --scheduled`
- Or create YAML manually in correct location
- Run `python scripts/builder.py assign` to generate UID
- Run `python scripts/builder.py validate-yaml` to verify
- Run `pytest` to ensure tests pass
- Run `python scripts/builder.py analyze-quality`
- Review `dataquality/full_report.txt` and `dataquality/primary_priority.jsonl`
- Apply fixes using appropriate script or manual editing
- Validate: `python scripts/builder.py validate-yaml`
- Re-run quality analysis to confirm fixes
- Read `data/schemes/catalog.json` for current schema
- If breaking change: create an OpenSpec proposal first
- Modify schema in `data/schemes/catalog.json`
- Update validation logic in `scripts/builder.py` if needed
- Run `python scripts/builder.py validate-yaml` to test
- Run `pytest` to ensure tests pass
- Check existing software in `data/software/`
- Create new YAML file with software metadata
- Update `scripts/constants.py` if needed for mappings
- Run `python scripts/builder.py build` to regenerate
- Validate and test
- No secrets in code: Do not commit API keys, passwords, or tokens
- URL validation: All URLs are validated for proper format (scheme + netloc)
- Input sanitization: YAML parsing uses safe loading
- HTTP requests: Use requests library with proper timeout and SSL verification
- File permissions: YAML files should be world-readable (0644)
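The URL-format rule above (scheme + netloc) can be sketched with the standard library. The `looks_like_url` name is illustrative; the registry's actual validator may be stricter:

```python
from urllib.parse import urlparse

def looks_like_url(value):
    """Minimal sketch of the scheme+netloc check described above."""
    parsed = urlparse(value)
    return bool(parsed.scheme) and bool(parsed.netloc)

print(looks_like_url("https://catalog.data.gov"))  # → True
print(looks_like_url("not a url"))                 # → False
```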
- Code: MIT License
- Data: CC-BY 4.0 License
- README.md - Project overview and data sources
- CONTRIBUTING.md - Full contribution guidelines
- openspec/project.md - Project conventions
- openspec/AGENTS.md - OpenSpec instructions
- devdocs/quality-fix-workflow.md - Quality fix procedures