Commit b744e11 (parent f8f4819)

add docs
6 files changed: +850 -89 lines
README.md (93 additions & 89 deletions)

# Dataset API

## Overview

The Dataset API provides a **secure, lineage-aware, metadata-rich interface** to heterogeneous datasets (PostgreSQL, object storage, filesystem).
It exposes a **DCAT-AP 3.0.0–compatible catalogue**, a **governed SQL query interface**, and **OpenLineage-integrated provenance**, designed to support Digital Twins and analytical applications.

This README gives a **high-level orientation**.
Detailed concepts and workflows are documented in `docs/*.md` (see links below).

---

## Core Capabilities

- **Dataset catalogue (DCAT-AP 3.0.0)**
  Public catalogue endpoint exposing datasets and distributions.

- **Governed query API**
  SQL `SELECT` queries over exposed datasets with:
  - strict SQL validation
  - server-side pagination & limits
  - dataset-level access control (auth + OPA)

- **Strong governance & disclosure model**
  - `open`, `internal`, `restricted` access levels
  - JWT-based authentication
  - OPA policy evaluation

- **Lineage & provenance**
  - OpenLineage ingestion (Marquez)
  - Namespace-based dataset grouping
  - Provenance surfaced in catalogue & metadata

- **Schema & metadata introspection**
  - JSON Schema (2020-12) generated from physical tables
  - Column-level metadata for UI and clients

- **CLI-driven lifecycle**
  - Export lineage → YAML
  - Validate catalogue definitions
  - Import & reconcile catalogue state
  - Automatic cleanup of stale datasets

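The three access levels compose into a simple allow/deny decision. A minimal sketch in plain Python (names and rules here are illustrative assumptions; in the real service the decision is delegated to OPA policies):

```python
from dataclasses import dataclass

# Illustrative sketch of the disclosure model described above; the actual
# decision is made by OPA, and these names are assumptions.

ACCESS_LEVELS = ("open", "internal", "restricted")

@dataclass
class Caller:
    authenticated: bool           # a valid JWT was presented
    entitlements: frozenset[str]  # dataset ids the caller may query

def may_query(caller: Caller, dataset_id: str, access_level: str) -> bool:
    """Return True if the caller may query the dataset."""
    if access_level not in ACCESS_LEVELS:
        raise ValueError(f"unknown access level: {access_level}")
    if access_level == "open":
        return True                  # public datasets need no auth
    if access_level == "internal":
        return caller.authenticated  # any authenticated principal
    # restricted: explicit per-dataset entitlement required
    return caller.authenticated and dataset_id in caller.entitlements

anon = Caller(authenticated=False, entitlements=frozenset())
analyst = Caller(authenticated=True, entitlements=frozenset({"gold.sales"}))

print(may_query(anon, "gold.sales", "open"))           # True
print(may_query(anon, "gold.sales", "internal"))       # False
print(may_query(analyst, "gold.sales", "restricted"))  # True
```
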
---
## API Surface (High-Level)

| Area | Description |
|------|-------------|
| `/catalogue` | DCAT-AP catalogue (exposed datasets only) |
| `/catalogue/{dataset_id}/schema` | JSON Schema of dataset |
| `/query` | Governed SQL query endpoint |
| `/admin/catalogue` | Catalogue import (CLI-only) |
| `/health` | Health check |

Detailed endpoint semantics are described in the docs.

---
## CLI Overview

The CLI is the **primary control plane** for the Dataset API.

```bash
dataset-cli --help
```

Main commands:

- `export openlineage` – extract lineage from Marquez
- `import catalogue` – validate & import dataset catalogue
- `validate catalogue` – schema validation only
- `ontology` – ontology fetch, analysis, tree generation

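For concreteness, invocations might look like the following. These flags come from an earlier revision of this README and may have drifted; treat them as illustrative and check `dataset-cli --help` for the current syntax:

```bash
# Export lineage-derived dataset candidates from Marquez to YAML
dataset-cli export openlineage --ns prod -o data/

# Validate catalogue definitions before importing
dataset-cli validate catalogue -i data/*.yaml --strict

# Import (create/update) catalogue entries via the admin endpoint
dataset-cli import catalogue -i data/*.yaml --api-url http://localhost:8000
```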
---
## Documentation

Additional documentation is available:

- [Architecture overview](docs/architecture.md)
- [Catalogue management](docs/catalogue-management.md)
- [CLI operations](docs/cli-operations.md)
- [Governance and security](docs/governance-security.md)
- [Query engine](docs/query-engine.md)

---
## Development & Contribution

- Python ≥ 3.11
- Async SQLAlchemy
- Pydantic v2
- FastAPI + httpx
- sqlglot-based SQL validation

Before opening a PR:

- validate all YAML definitions
- add tests for new API behavior
- include migrations for schema changes
- keep docs in sync with API behavior

---

## License

Copyright >=2025 Spindox Labs

[...]

distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
docs/architecture.md (155 additions & 0 deletions)

# Architecture & Core Concepts

This document explains **what the Dataset API is**, the **core domain model**, and how the major subsystems fit together.

---

## What the Dataset API is

The Dataset API is a **governed, read-only data access layer** that:

- publishes a **dataset catalogue** (DCAT-AP compatible)
- exposes a **restricted SQL query interface** over *catalogued* datasets
- provides **schema and metadata introspection** for clients and UIs
- integrates with **OpenLineage** to keep provenance and trust up to date

The API is not an ingestion tool. Pipelines produce data; the Dataset API governs and serves it.

---
## System Context

### Actors

- **Producers (pipelines)**: create/refresh physical tables and emit lineage (OpenLineage)
- **Operators**: manage catalogue definitions via CLI and ensure OPA/policy config is correct
- **Consumers (apps/DTs/BI)**: discover datasets, fetch schemas, run governed queries

### External Dependencies

- **Physical storage**: PostgreSQL (tables, views)
- **Authorization**: OPA (policy decision point)
- **Lineage backend**: Marquez (OpenLineage ingestion/query)
- **Identity provider**: issues JWTs (users + service accounts)

---

## High-level Architecture

```
+------------------------+
| Pipelines (ETL/dbt/..) |
+-----------+------------+
            |
            | OpenLineage events
            v
      +-----------+
      |  Marquez  |
      +-----+-----+
            |
            | export lineage + metadata
            v
+---------+       +---------------------+       +--------------------+
| Clients |  -->  |     Dataset API     | <-->  |    OPA (Policy)    |
| (apps)  |       |                     |       |     allow/deny     |
+---------+       | - Catalogue         |       +--------------------+
                  | - Query Engine      |
                  | - Schema API        |
                  | - Metadata API      |
                  +----------+----------+
                             |
                             v
                     +---------------+
                     |  PostgreSQL   |
                     | tables/views  |
                     +---------------+
```

---

## Core Domain Model

### Dataset

A **dataset** is a governed contract over a physical data asset.

**Dataset identity**

- `dataset_id` (stable string; often namespace-qualified)

**Dataset governance**

- `access_level`: `open` | `internal` | `restricted`
- ownership / stewardship fields
- classification, tags, retention hints

**Dataset physical mapping**

- resolved storage reference (e.g., Postgres table/view)
- schema and column metadata derived from reflection

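The dataset contract above can be pictured as a Pydantic v2 model. This is a sketch, not the service's actual schema (the field names are assumptions); `extra="allow"` mirrors the flexible ingestion the project uses for lineage metadata:

```python
from typing import Literal, Optional
from pydantic import BaseModel, ConfigDict

# Illustrative model of the dataset contract; field names are assumptions.

class ColumnMeta(BaseModel):
    name: str
    type: str                        # storage-level type, e.g. "integer"
    description: Optional[str] = None

class DatasetEntry(BaseModel):
    # extra="allow" keeps unknown metadata instead of rejecting it
    model_config = ConfigDict(extra="allow")

    dataset_id: str                  # stable, namespace-qualified identifier
    access_level: Literal["open", "internal", "restricted"]
    table: str                       # resolved Postgres table/view
    tags: list[str] = []
    columns: list[ColumnMeta] = []

entry = DatasetEntry(
    dataset_id="gold.sales",
    access_level="restricted",
    table="gold.sales",
    tags=["finance"],
    owner="data-platform",           # extra field, tolerated by the config
)
print(entry.dataset_id, entry.access_level)
```
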
### Namespace

Namespaces are a first-class taxonomy for lifecycle and intent:

- `raw`: ingestion/staging
- `silver`: enriched internal
- `gold`: curated/exposed

Namespaces drive:

- catalogue selection filters
- policy rules (e.g., only `gold` exposed externally)
- operational grouping

### Distribution

In DCAT terms, a dataset can expose one or more **distributions** (e.g., SQL endpoint, files, API resource).
In practice:

- the API exposes a **query distribution**
- optional documentation and external references may be included

---

## Read-only Contract

Consumers cannot mutate data or catalogue state. Mutations happen only via:

- data pipelines (tables/views)
- CLI-managed catalogue imports
- admin endpoints used by the CLI

This guarantees:

- reproducibility
- auditability
- consistent governance enforcement

---

## Catalogue vs Storage Reality

The catalogue is **validated against storage**.

Expected behaviors:

- if a dataset points to a missing table/view, it should be marked invalid and/or removed during cleanup
- imports reconcile desired state (YAML) vs actual DB objects
- schema endpoints reflect what exists in storage today

The catalogue **never creates** physical data.

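Because schema endpoints reflect current storage, a schema response can be derived on the fly from reflected column metadata. A sketch of such a derivation targeting JSON Schema 2020-12 (the type mapping and field names are illustrative assumptions, not the service's actual implementation):

```python
# Maps reflected SQL column types to JSON Schema types; illustrative only.
SQL_TO_JSON = {
    "integer": "integer",
    "bigint": "integer",
    "numeric": "number",
    "text": "string",
    "varchar": "string",
    "boolean": "boolean",
    "timestamp": "string",   # serialized as ISO 8601 strings
}

def to_json_schema(dataset_id: str, columns: list[dict]) -> dict:
    """Build a JSON Schema (2020-12) object describing one dataset row."""
    props = {
        c["name"]: {"type": SQL_TO_JSON.get(c["type"], "string")}
        for c in columns
    }
    return {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "$id": f"urn:dataset:{dataset_id}",
        "type": "object",
        "properties": props,
        # non-nullable columns become required fields
        "required": [c["name"] for c in columns if not c.get("nullable", True)],
    }

schema = to_json_schema("gold.sales", [
    {"name": "region", "type": "text", "nullable": False},
    {"name": "amount", "type": "numeric"},
])
print(schema["properties"]["amount"])   # {'type': 'number'}
```
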
---
## Lifecycle of a Dataset (Conceptual)

1. Pipeline creates/refreshes physical table/view
2. Lineage is emitted to Marquez (OpenLineage)
3. Operator exports lineage-derived candidates (CLI)
4. Operator curates YAML (titles, descriptions, access levels, tags, docs)
5. CLI imports catalogue (create/update)
6. API exposes dataset in catalogue (if allowed)
7. Consumers query datasets under governance
8. Cleanup removes stale entries when physical assets disappear

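The curated YAML of step 4 might look roughly like this. The field names are illustrative, not the authoritative import schema; validate with the CLI before importing:

```yaml
# Illustrative catalogue entry - field names are assumptions
dataset_id: gold.sales
title: Sales (curated)
description: Curated sales facts, refreshed nightly by the ETL pipeline.
access_level: restricted
namespace: gold
tags: [finance, sales]
backend:
  type: postgres
  table: gold.sales
lineage:
  source: marquez        # provenance surfaced in the catalogue
```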
---
148+
149+
## Data Integrity & Guardrails
150+
151+
- SQL must be validated (AST-based, allowlisted)
152+
- dataset references must resolve to catalogued assets
153+
- access is policy-controlled (OPA)
154+
- limits and pagination protect the system from unbounded workloads
155+
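The first guardrail can be illustrated with a deliberately simplified stand-in. The real service validates the parsed AST (sqlglot); this stdlib sketch only shows the allowlist idea and would misfire on edge cases (e.g., a column literally named `update`) that AST-based validation handles correctly:

```python
import re

# Simplified stand-in for AST-based (sqlglot) validation: allow a single
# SELECT (or WITH ... SELECT) statement and reject mutating keywords.
FORBIDDEN = {
    "insert", "update", "delete", "drop", "alter", "create",
    "grant", "truncate", "merge", "copy", "call", "execute",
}

def check_query(sql: str) -> None:
    """Raise ValueError unless sql looks like a single SELECT statement."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:
        raise ValueError("multiple statements are not allowed")
    if not stripped.lower().startswith(("select", "with")):
        raise ValueError("only SELECT queries are allowed")
    tokens = set(re.findall(r"[a-z_]+", stripped.lower()))
    if tokens & FORBIDDEN:
        raise ValueError(f"forbidden keyword(s): {tokens & FORBIDDEN}")

check_query("SELECT region, SUM(amount) FROM gold.sales GROUP BY region")  # ok
try:
    check_query("DROP TABLE gold.sales")
except ValueError as exc:
    print("rejected:", exc)
```
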
