@@ -8,7 +8,7 @@ Data pipeline workflows for cataloging metadata from CKAN instances and extensio
 
 ## Architecture
 
-Two independent pipelines, each with 4-stage workflows:
+Three independent pipelines:
 
 ### Extensions Pipeline (`extensions-workflow/`)
 Collects GitHub metrics for CKAN extensions:
@@ -24,8 +24,20 @@ Collects statistics from live CKAN instances:
 3. `3updateSitesCatalog.py` - Update CKAN site metadata
 4. `datapump.py` - Append snapshots to datastore
 
+### YAML Metadata Pipeline (`yaml-workflow/`)
+Fetches `ckan_ecosystem.yaml` from extension repositories and updates catalog metadata:
+1. `update_from_yaml.py` - Interactive/CI script that:
+   - Accepts extension names, catalog URLs, or `all` to process every extension
+   - Fetches extension details from CKAN catalog to find the GitHub URL
+   - Downloads and parses `ckan_ecosystem.yaml` from the repo (tries `main` then `master` branch)
+   - Maps YAML fields (title, notes, tags, ckan_version, publisher, license, etc.) to CKAN package fields
+   - Updates the catalog via `package_patch` API
+   - Supports non-interactive mode via stdin for CI (reads piped input, auto-confirms)
+
 ### Data Flow
-Both pipelines follow: Discovery -> API Collection -> Catalog Sync -> Datastore Append
+Extensions and Sites pipelines follow: Discovery -> API Collection -> Catalog Sync -> Datastore Append
+
+YAML pipeline follows: Extension Lookup -> GitHub YAML Fetch -> Field Mapping -> Catalog Sync
 
 All scripts target `https://ecosystem.ckan.org` as the CKAN base URL.
 
@@ -40,6 +52,13 @@ GITHUB_TOKEN=your-token python 2refresh.py
 CKAN_API_KEY=your-key python 3updateCatalog.py
 CKAN_API_KEY=your-key python datapump.py
 
+# YAML metadata pipeline
+pip install -r requirements.txt
+cd yaml-workflow
+CKAN_API_KEY=your-key python update_from_yaml.py  # interactive mode
+echo "ckanext-spatial" | CKAN_API_KEY=your-key python update_from_yaml.py  # CI mode
+echo "all" | CKAN_API_KEY=your-key AUTO_CONFIRM=true python update_from_yaml.py  # all extensions
+
 # Sites pipeline
 pip install -r sites-workflow/requirements.txt
 cd sites-workflow
@@ -53,6 +72,7 @@ CKAN_API_KEY=your-key python datapump.py
 
 - `GITHUB_TOKEN` - GitHub Personal Access Token (extensions pipeline)
 - `CKAN_API_KEY` - CKAN API key with write permissions
+- `AUTO_CONFIRM` - Set to `true` to skip interactive confirmation prompts (yaml pipeline)
 
 ## GitHub Actions
 
@@ -64,5 +84,6 @@ Secrets required: `GH_METADATA_TOKEN`, `CKAN_API_KEY`
 ## Key Dependencies
 
 - `pandas` - CSV/DataFrame operations
-- `requests` - HTTP API calls
+- `cloudscraper` - HTTP API calls with Cloudflare bypass
 - `PyGithub` - GitHub API wrapper (extensions only)
+- `PyYAML` - YAML parsing (yaml pipeline)
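
The YAML pipeline steps described in this diff (try `main` then `master`, map YAML fields to CKAN package fields, send the result to `package_patch`) could look roughly like the sketch below. It is not taken from `update_from_yaml.py`: the function names, the `FIELD_MAP` table, and the `license` to `license_id` mapping are assumptions made for illustration, and the sketch works on an already-parsed YAML dict so that it needs no network access or third-party packages.

```python
# Hypothetical mapping from ckan_ecosystem.yaml keys to CKAN package fields.
# The README lists the field names but not the exact mapping, so this table
# (including license -> license_id) is an assumption.
FIELD_MAP = {
    "title": "title",
    "notes": "notes",
    "ckan_version": "ckan_version",
    "publisher": "publisher",
    "license": "license_id",
}


def candidate_yaml_urls(github_url, branches=("main", "master")):
    """Return raw-file URLs for ckan_ecosystem.yaml, in the order to try them."""
    path = github_url.rstrip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    # Keep only the "owner/repo" part that follows the GitHub host.
    owner_repo = path.split("github.com/", 1)[1]
    return [
        f"https://raw.githubusercontent.com/{owner_repo}/{branch}/ckan_ecosystem.yaml"
        for branch in branches
    ]


def build_patch(package_id, parsed_yaml):
    """Build a package_patch payload from a parsed ckan_ecosystem.yaml dict."""
    patch = {"id": package_id}
    for yaml_key, ckan_key in FIELD_MAP.items():
        value = parsed_yaml.get(yaml_key)
        # Skip missing or empty values so existing catalog data is not blanked.
        if value not in (None, "", []):
            patch[ckan_key] = value
    tags = parsed_yaml.get("tags") or []
    if tags:
        # CKAN represents tags as a list of {"name": ...} dictionaries.
        patch["tags"] = [{"name": str(tag)} for tag in tags]
    return patch


if __name__ == "__main__":
    print(candidate_yaml_urls("https://github.com/ckan/ckanext-spatial"))
    print(build_patch("ckanext-spatial", {"title": "Spatial", "tags": ["geo"]}))
```

Because `package_patch` only changes the fields present in the payload, leaving empty values out of the dict keeps whatever the catalog already holds for those fields.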