Commit 4f5e8e9

a5dur and claude committed
Add YAML metadata pipeline for updating catalog from ckan_ecosystem.yaml

New workflow that fetches ckan_ecosystem.yaml from extension repositories and updates the CKAN ecosystem catalog with declared metadata (title, tags, CKAN version compatibility, publisher, license, etc.). Supports interactive mode, piped stdin for CI, and processing all extensions at once.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 2956551 commit 4f5e8e9
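
The commit message mentions interactive mode, piped stdin for CI, and an `all` target. The source of `update_from_yaml.py` is not shown on this page, so the following is only a minimal sketch of how such input handling could look. The function names `read_targets` and `should_auto_confirm`, the prompt text, and the rule that piped input implies auto-confirmation are assumptions for illustration.

```python
import io
import sys
from typing import List, Mapping, TextIO


def read_targets(stream: TextIO, interactive: bool) -> List[str]:
    """Return extension names, catalog URLs, or 'all' as a list of tokens."""
    if interactive:
        # Hypothetical prompt; the real script's wording is not shown.
        raw = input("Extension name, catalog URL, or 'all': ")
    else:
        # Piped input (CI): read everything that was sent on stdin.
        raw = stream.read()
    return raw.split()


def should_auto_confirm(interactive: bool, env: Mapping[str, str]) -> bool:
    """Assumed rule: confirm automatically for piped input or AUTO_CONFIRM=true."""
    return (not interactive) or env.get("AUTO_CONFIRM", "").lower() == "true"


def detect_interactive() -> bool:
    """stdin is a terminal when nothing is piped into the process."""
    return sys.stdin.isatty()


# Simulate `echo "ckanext-spatial" | python update_from_yaml.py`.
demo = io.StringIO("ckanext-spatial\n")
print(read_targets(demo, interactive=False))  # ['ckanext-spatial']
```

Passing the stream and the environment as arguments keeps both helpers testable without a real terminal or real environment variables.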

3 files changed: +455 -3 lines changed

CLAUDE.md

Lines changed: 24 additions & 3 deletions
@@ -8,7 +8,7 @@ Data pipeline workflows for cataloging metadata from CKAN instances and extensio
 
 ## Architecture
 
-Two independent pipelines, each with 4-stage workflows:
+Three independent pipelines:
 
 ### Extensions Pipeline (`extensions-workflow/`)
 Collects GitHub metrics for CKAN extensions:
@@ -24,8 +24,20 @@ Collects statistics from live CKAN instances:
 3. `3updateSitesCatalog.py` - Update CKAN site metadata
 4. `datapump.py` - Append snapshots to datastore
 
+### YAML Metadata Pipeline (`yaml-workflow/`)
+Fetches `ckan_ecosystem.yaml` from extension repositories and updates catalog metadata:
+1. `update_from_yaml.py` - Interactive/CI script that:
+- Accepts extension names, catalog URLs, or `all` to process every extension
+- Fetches extension details from CKAN catalog to find the GitHub URL
+- Downloads and parses `ckan_ecosystem.yaml` from the repo (tries `main` then `master` branch)
+- Maps YAML fields (title, notes, tags, ckan_version, publisher, license, etc.) to CKAN package fields
+- Updates the catalog via `package_patch` API
+- Supports non-interactive mode via stdin for CI (reads piped input, auto-confirms)
+
 ### Data Flow
-Both pipelines follow: Discovery -> API Collection -> Catalog Sync -> Datastore Append
+Extensions and Sites pipelines follow: Discovery -> API Collection -> Catalog Sync -> Datastore Append
+
+YAML pipeline follows: Extension Lookup -> GitHub YAML Fetch -> Field Mapping -> Catalog Sync
 
 All scripts target `https://ecosystem.ckan.org` as the CKAN base URL.
 
@@ -40,6 +52,13 @@ GITHUB_TOKEN=your-token python 2refresh.py
 CKAN_API_KEY=your-key python 3updateCatalog.py
 CKAN_API_KEY=your-key python datapump.py
 
+# YAML metadata pipeline
+pip install -r requirements.txt
+cd yaml-workflow
+CKAN_API_KEY=your-key python update_from_yaml.py # interactive mode
+echo "ckanext-spatial" | CKAN_API_KEY=your-key python update_from_yaml.py # CI mode
+echo "all" | CKAN_API_KEY=your-key AUTO_CONFIRM=true python update_from_yaml.py # all extensions
+
 # Sites pipeline
 pip install -r sites-workflow/requirements.txt
 cd sites-workflow
@@ -53,6 +72,7 @@ CKAN_API_KEY=your-key python datapump.py
 
 - `GITHUB_TOKEN` - GitHub Personal Access Token (extensions pipeline)
 - `CKAN_API_KEY` - CKAN API key with write permissions
+- `AUTO_CONFIRM` - Set to `true` to skip interactive confirmation prompts (yaml pipeline)
 
 ## GitHub Actions
 
@@ -64,5 +84,6 @@ Secrets required: `GH_METADATA_TOKEN`, `CKAN_API_KEY`
 ## Key Dependencies
 
 - `pandas` - CSV/DataFrame operations
-- `requests` - HTTP API calls
+- `cloudscraper` - HTTP API calls with Cloudflare bypass
 - `PyGithub` - GitHub API wrapper (extensions only)
+- `PyYAML` - YAML parsing (yaml pipeline)
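
The CLAUDE.md changes describe a flow of GitHub YAML Fetch -> Field Mapping -> Catalog Sync, with the fetch trying `main` then `master`. The sketch below illustrates the first two steps and the shape of a `package_patch` payload using only the standard library. It is not the code from `update_from_yaml.py`, which this page does not show. The helper names, the assumption that the YAML file sits at the repository root, and the exact target field for each key (for example `license` to `license_id`) are hypothetical.

```python
from typing import Any, Dict, List
from urllib.parse import urlparse

YAML_FILENAME = "ckan_ecosystem.yaml"
BRANCHES = ("main", "master")  # order stated in CLAUDE.md


def candidate_yaml_urls(repo_url: str) -> List[str]:
    """Build raw-file URLs to try, one per branch, in fallback order.

    Assumes a URL of the form https://github.com/<owner>/<repo> and that
    the YAML file lives at the repository root.
    """
    path = urlparse(repo_url).path.strip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    owner, repo = path.split("/")[:2]
    return [
        f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{YAML_FILENAME}"
        for branch in BRANCHES
    ]


def map_yaml_to_package(meta: Dict[str, Any]) -> Dict[str, Any]:
    """Translate YAML keys into CKAN package fields (hypothetical mapping)."""
    fields: Dict[str, Any] = {}
    for key in ("title", "notes"):
        if meta.get(key):
            fields[key] = meta[key]
    if meta.get("tags"):
        # CKAN represents tags as a list of {"name": ...} dicts.
        fields["tags"] = [{"name": str(tag)} for tag in meta["tags"]]
    if meta.get("license"):
        fields["license_id"] = meta["license"]
    # Assumed: these keys are stored under the same name in the catalog schema.
    for key in ("ckan_version", "publisher"):
        if key in meta:
            fields[key] = meta[key]
    return fields


def build_patch_payload(package_id: str, meta: Dict[str, Any]) -> Dict[str, Any]:
    """package_patch needs the package id plus only the fields to change."""
    return {"id": package_id, **map_yaml_to_package(meta)}


for url in candidate_yaml_urls("https://github.com/ckan/ckanext-spatial"):
    print(url)
```

A caller would request each candidate URL in turn, stop at the first successful response, and POST the resulting payload to the catalog's `/api/3/action/package_patch` endpoint with the API key in the `Authorization` header. Because `package_patch` updates only the supplied fields, keys missing from the YAML leave the existing catalog values untouched.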

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -2,3 +2,4 @@ pandas>=1.3.0
 cloudscraper>=1.2.71
 PyGithub>=1.55.0
 python-dateutil>=2.8.0
+PyYAML>=6.0
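
The new `PyYAML` dependency is what parses `ckan_ecosystem.yaml`. As a rough illustration, the snippet below loads a small document with `yaml.safe_load`. The sample keys are borrowed from the field names listed in CLAUDE.md, but the real schema of the file is not shown here, so the sample content and the `parse_metadata` helper are assumptions.

```python
import yaml  # provided by the PyYAML package

# Hypothetical sample; the actual ckan_ecosystem.yaml schema may differ.
SAMPLE = """\
title: Example Extension
tags:
  - example
  - demo
ckan_version:
  - "2.10"
  - "2.11"
license: MIT
"""


def parse_metadata(text: str) -> dict:
    """Parse YAML text and require a mapping at the top level."""
    data = yaml.safe_load(text)
    if data is None:
        # An empty file parses to None; treat it as "no metadata".
        return {}
    if not isinstance(data, dict):
        raise ValueError("expected a mapping at the top level")
    return data


meta = parse_metadata(SAMPLE)
print(meta["title"], meta["ckan_version"])
```

Version strings are quoted in the sample because an unquoted `2.10` is read as the float `2.1`, which would silently drop the trailing zero. `safe_load` is used rather than `load` since the file comes from third-party repositories.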
