# Dataset API

## Overview
4- The Dataset API provides a unified, lineage-aware, metadata-rich interface to datasets stored across heterogeneous backends (PostgreSQL, S3, filesystem). It exposes a DCAT-AP 3.0.0–compatible catalogue, OpenLineage-enriched metadata, and a controlled dataset exposure policy.
5-
6- ## Features
7- - DCAT-AP 3.0.0 catalogue (`/catalogue`)
8- - Detailed dataset metadata (`/dataset/<id>/metadata`)
9- - Dataset schema (`/dataset/<id>/schema`)
10- - Query API using SQL-like syntax (`/dataset/<id>/query`)
11- - Strong governance through controlled dataset exposure (`expose: true/false`)
12- - OpenLineage integration and provenance tracking
13- - YAML-based catalogue import/export
14- - CLI tools for validation, extraction, import, and migration
15- - Backend-agnostic support for PostgreSQL, S3, and filesystem data
16-
17- ## Architecture Summary
18- - **Catalogue layer** stored in PostgreSQL schema (`settings.catalogue_schema`)
19- - **DatasetEntry** model capturing backend config, tags, lineage, licensing, and metadata
20- - **DCAT builders** generate catalogue and dataset JSON-LD outputs
21- - **OpenLineage extractor** fetches metadata via Marquez and exports YAML
22- - **Importer CLI** loads YAML → validates via Pydantic → imports into catalogue DB
23- - **Exposure semantics** ensure only selected datasets are queryable
24- - **Alembic migrations** support async engines and schema scoping
25-
26- ## CLI Commands
27- ### Export OpenLineage to YAML
28- ```
29- dataset export openlineage --ns prod -o data/ --expose
30- ```
4+ The Dataset API provides a **secure, lineage-aware, metadata-rich interface** to heterogeneous datasets (PostgreSQL, object storage, filesystem).
5+ It exposes a **DCAT-AP 3.0.0–compatible catalogue**, a **governed SQL query interface**, and **OpenLineage-integrated provenance**, designed to support Digital Twins and analytical applications.
32- ### Import catalogue from YAML
33- ```
34- dataset import catalogue -i data/*.yaml --api-url http://localhost:8000
35- ```
7+ This README gives a **high-level orientation**.
8+ Detailed concepts and workflows are documented in `docs/*.md` (see links below).
37- ### Validate catalogue file(s)
38- ```
39- dataset validate catalogue -i data/*.yaml --strict
40- ```
10+ ---
42- ### Alembic migrations
43- ```
44- uv run alembic upgrade head
45- uv run alembic revision --autogenerate -m "update"
12+ ## Core Capabilities
13+
14+ - **Dataset catalogue (DCAT-AP 3.0.0)**
15+   Public catalogue endpoint exposing datasets and distributions.
16+
17+ - **Governed query API**
18+   SQL `SELECT` queries over exposed datasets with:
19+     - strict SQL validation
20+     - server-side pagination & limits
21+     - dataset-level access control (auth + OPA)
22+
23+ - **Strong governance & disclosure model**
24+   - `open`, `internal`, `restricted` access levels
25+   - JWT-based authentication
26+   - OPA policy evaluation
27+
28+ - **Lineage & provenance**
29+   - OpenLineage ingestion (Marquez)
30+   - Namespace-based dataset grouping
31+   - Provenance surfaced in catalogue & metadata
32+
33+ - **Schema & metadata introspection**
34+   - JSON Schema (2020-12) generated from physical tables
35+   - Column-level metadata for UI and clients
36+
37+ - **CLI-driven lifecycle**
38+   - Export lineage → YAML
39+   - Validate catalogue definitions
40+   - Import & reconcile catalogue state
41+   - Automatic cleanup of stale datasets
42+
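The governance bullets above can be read as a pre-check that runs before any query is executed. The following is a minimal sketch, not the project's implementation: the `Dataset` class, `may_query`, `clamp_limit`, and the `MAX_PAGE_SIZE` value are hypothetical names, and the meaning assigned to each access level (`open` needs no login, `internal` needs a valid JWT, `restricted` additionally needs a positive OPA decision) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical server-side cap; the real limit is not stated in this README.
MAX_PAGE_SIZE = 1000


@dataclass(frozen=True)
class Dataset:
    """Illustrative stand-in for a catalogue entry (not the real model)."""
    dataset_id: str
    expose: bool        # exposure flag from the catalogue definition
    access_level: str   # "open", "internal" or "restricted"


def may_query(dataset: Dataset, authenticated: bool, policy_allows: bool) -> bool:
    """Decide whether a query against `dataset` may proceed.

    `authenticated` stands for a successfully verified JWT and
    `policy_allows` for the outcome of an OPA evaluation; both are
    passed in as plain booleans to keep the sketch self-contained.
    """
    if not dataset.expose:
        return False  # unexposed datasets are never queryable
    if dataset.access_level == "open":
        return True
    if dataset.access_level == "internal":
        return authenticated
    if dataset.access_level == "restricted":
        return authenticated and policy_allows
    return False  # unknown access level: deny by default


def clamp_limit(requested: Optional[int], default: int = 100) -> int:
    """Apply a server-side limit regardless of what the client asked for."""
    if requested is None:
        return default
    return max(1, min(requested, MAX_PAGE_SIZE))
```

Denying unknown access levels by default keeps a typo in a catalogue definition from silently opening a dataset.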
43+ ---
44+
45+ ## API Surface (High-Level)
46+
47+ | Area | Description |
48+ | ---- | ----------- |
49+ | `/catalogue` | DCAT-AP catalogue (exposed datasets only) |
50+ | `/catalogue/{dataset_id}/schema` | JSON Schema of dataset |
51+ | `/query` | Governed SQL query endpoint |
52+ | `/admin/catalogue` | Catalogue import (CLI-only) |
53+ | `/health` | Health check |
54+
55+ Detailed endpoint semantics are described in the docs.
56+
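To make "exposed datasets only" concrete, here is a rough sketch of how a catalogue document could be assembled from catalogue entries. It is not the project's DCAT builder: `build_catalogue` and the entry keys (`id`, `title`, `expose`) are hypothetical, and the output is a heavily reduced JSON-LD shape that omits most properties DCAT-AP 3.0.0 requires.

```python
import json
from typing import Any

# Namespace prefixes for the DCAT and Dublin Core Terms vocabularies.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def build_catalogue(entries: list[dict[str, Any]],
                    title: str = "Example catalogue") -> dict[str, Any]:
    """Return a minimal JSON-LD catalogue listing only exposed entries."""
    datasets = [
        {
            "@type": "dcat:Dataset",
            "dct:identifier": entry["id"],
            "dct:title": entry["title"],
        }
        for entry in entries
        if entry.get("expose", False)  # entries without the flag stay hidden
    ]
    return {
        "@context": CONTEXT,
        "@type": "dcat:Catalog",
        "dct:title": title,
        "dcat:dataset": datasets,
    }


if __name__ == "__main__":
    # Hypothetical entries, for illustration only.
    sample = [
        {"id": "sensors", "title": "Sensor readings", "expose": True},
        {"id": "staging", "title": "Staging table", "expose": False},
    ]
    print(json.dumps(build_catalogue(sample), indent=2))
```

Treating a missing `expose` key as `False` means a dataset has to be opted in explicitly before it shows up in the public catalogue.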
57+ ---
58+
59+ ## CLI Overview
60+
61+ The CLI is the **primary control plane** for the Dataset API.
62+
63+ ``` bash
64+ dataset-cli --help
4665```
48- ## Backends
49- - **postgres** – SQL tables
50- - **s3** – raw objects with optional public URL
51- - **fs** – direct file-based datasets
52-
53- ## Lineage Support
54- The system stores structured lineage metadata from OpenLineage, including:
55- - namespace
56- - sourceName
57- - timestamps
58- - lifecycle state
59- - facets
60- - tags
61-
62- Pydantic models allow flexible ingestion (`extra="allow"`).
63-
64- ## DCAT-AP Compliance
65- Each dataset includes:
66- - identifiers, titles, descriptions
67- - keywords, themes
68- - publisher, rights holder, license
69- - language & spatial coverage
70- - distributions (API access and raw file access)
71- - provenance (`prov:wasDerivedFrom`) using lineage information
72-
73- ## Development
74- ### Dump all Python source files into a single file:
75- A provided tool gathers all `dataset/**/*.py` into one bundle while ignoring `__pycache__`.
76-
77- ### Supported Python tooling:
78- - `uv` package runner
79- - Typer CLI
67+ Main commands:
68+ - `export openlineage` – extract lineage from Marquez
69+ - `import catalogue` – validate & import dataset catalogue
70+ - `validate catalogue` – schema validation only
71+ - `ontology` – ontology fetch, analysis, tree generation
72+
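The export → validate → import sequence can be scripted. The sketch below only builds the argument lists and does not run anything. The flags (`--ns`, `-o`, `--expose`, `-i`, `--strict`, `--api-url`) are copied from the examples in the earlier version of this README, and pairing them with the `dataset-cli` executable is an assumption, so check `dataset-cli --help` before relying on them. `lifecycle_commands` is a hypothetical helper name.

```python
import shlex

# Executable name as shown in the CLI overview; the flags below come from
# the previous README revision and may not match the current CLI.
CLI = "dataset-cli"


def lifecycle_commands(namespace: str, out_dir: str,
                       yaml_files: list[str], api_url: str) -> list[list[str]]:
    """Build argv lists for export -> validate -> import, in that order."""
    return [
        # 1. Extract lineage from Marquez into YAML files.
        [CLI, "export", "openlineage", "--ns", namespace, "-o", out_dir, "--expose"],
        # 2. Validate the generated catalogue definitions.
        [CLI, "validate", "catalogue", "-i", *yaml_files, "--strict"],
        # 3. Import the validated definitions through the API.
        [CLI, "import", "catalogue", "-i", *yaml_files, "--api-url", api_url],
    ]


if __name__ == "__main__":
    commands = lifecycle_commands(
        "prod", "data/", ["data/example.yaml"], "http://localhost:8000"
    )
    for argv in commands:
        # Print a shell-safe rendering; pass argv to subprocess.run(argv,
        # check=True) instead to actually execute each step.
        print(shlex.join(argv))
```

Passing explicit file names instead of a `data/*.yaml` pattern avoids depending on shell glob expansion, which does not happen when commands are launched from `subprocess` without a shell.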
73+ ---
74+
75+ ## Documentation
76+
77+ Additional documentation is available:
78+
79+ - [Architecture overview](docs/architecture.md)
80+ - [Catalogue management](docs/catalogue-management.md)
81+ - [CLI operations](docs/cli-operations.md)
82+ - [Governance and security](docs/governance-security.md)
83+ - [Query engine](docs/query-engine.md)
84+
85+ ---
86+
87+ ## Development & Contribution
88+
89+ - Python ≥ 3.11
90+ - Async SQLAlchemy
- Pydantic v2
81- - SQLAlchemy (async)
82- - Alembic (async migrations)
83- - httpx (async HTTP calls)
92+ - FastAPI + httpx
93+ - sqlglot-based SQL validation
94+
95+ Before opening a PR:
96+ - validate all YAML definitions
97+ - add tests for new API behavior
98+ - include migrations for schema changes
99+ - keep docs in sync with API behavior
100+
101+ ---
85- ## License
103+ ## License

Copyright >=2025 Spindox Labs
@@ -97,17 +115,3 @@ distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
100-
101-
102- ## Contributing
103- Ensure:
104- - All YAML definitions validate via the CLI
105- - No API endpoint accepts unvalidated data
106- - Alembic migrations are generated for schema changes
107-
108- PRs should include tests for:
109- - catalogue import
110- - DCAT output
111- - lineage extraction
112- - backend resolution
113-