Commit b744e11 (parent f8f4819)

add docs
6 files changed: +850 -89 lines
README.md (93 additions & 89 deletions)

# Dataset API

## Overview

The Dataset API provides a **secure, lineage-aware, metadata-rich interface** to heterogeneous datasets (PostgreSQL, object storage, filesystem).
It exposes a **DCAT-AP 3.0.0–compatible catalogue**, a **governed SQL query interface**, and **OpenLineage-integrated provenance**, designed to support Digital Twins and analytical applications.

This README gives a **high-level orientation**.
Detailed concepts and workflows are documented in `docs/*.md` (see links below).

---

## Core Capabilities

- **Dataset catalogue (DCAT-AP 3.0.0)**
  Public catalogue endpoint exposing datasets and distributions.

- **Governed query API**
  SQL `SELECT` queries over exposed datasets with:
  - strict SQL validation
  - server-side pagination & limits
  - dataset-level access control (auth + OPA)

- **Strong governance & disclosure model**
  - `open`, `internal`, `restricted` access levels
  - JWT-based authentication
  - OPA policy evaluation

- **Lineage & provenance**
  - OpenLineage ingestion (Marquez)
  - Namespace-based dataset grouping
  - Provenance surfaced in catalogue & metadata

- **Schema & metadata introspection**
  - JSON Schema (2020-12) generated from physical tables
  - Column-level metadata for UI and clients

- **CLI-driven lifecycle**
  - Export lineage → YAML
  - Validate catalogue definitions
  - Import & reconcile catalogue state
  - Automatic cleanup of stale datasets

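The three access levels compose into a simple allow/deny decision. A minimal sketch in plain Python (names and rules here are illustrative assumptions; in the real service the decision is delegated to OPA policies):

```python
from dataclasses import dataclass

# Illustrative sketch of the disclosure model described above; the actual
# decision is made by OPA, and these names are assumptions.

ACCESS_LEVELS = ("open", "internal", "restricted")

@dataclass
class Caller:
    authenticated: bool           # a valid JWT was presented
    entitlements: frozenset[str]  # dataset ids the caller may query

def may_query(caller: Caller, dataset_id: str, access_level: str) -> bool:
    """Return True if the caller may query the dataset."""
    if access_level not in ACCESS_LEVELS:
        raise ValueError(f"unknown access level: {access_level}")
    if access_level == "open":
        return True                  # public datasets need no auth
    if access_level == "internal":
        return caller.authenticated  # any authenticated principal
    # restricted: explicit per-dataset entitlement required
    return caller.authenticated and dataset_id in caller.entitlements

anon = Caller(authenticated=False, entitlements=frozenset())
analyst = Caller(authenticated=True, entitlements=frozenset({"gold.sales"}))

print(may_query(anon, "gold.sales", "open"))           # True
print(may_query(anon, "gold.sales", "internal"))       # False
print(may_query(analyst, "gold.sales", "restricted"))  # True
```
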
---
## API Surface (High-Level)

| Area | Description |
|------|-------------|
| `/catalogue` | DCAT-AP catalogue (exposed datasets only) |
| `/catalogue/{dataset_id}/schema` | JSON Schema of dataset |
| `/query` | Governed SQL query endpoint |
| `/admin/catalogue` | Catalogue import (CLI-only) |
| `/health` | Health check |

Detailed endpoint semantics are described in the docs.

---
## CLI Overview

The CLI is the **primary control plane** for the Dataset API.

```bash
dataset-cli --help
```

Main commands:

- `export openlineage` – extract lineage from Marquez
- `import catalogue` – validate & import dataset catalogue
- `validate catalogue` – schema validation only
- `ontology` – ontology fetch, analysis, tree generation

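For concreteness, invocations might look like the following. These flags come from an earlier revision of this README and may have drifted; treat them as illustrative and check `dataset-cli --help` for the current syntax:

```bash
# Export lineage-derived dataset candidates from Marquez to YAML
dataset-cli export openlineage --ns prod -o data/

# Validate catalogue definitions before importing
dataset-cli validate catalogue -i data/*.yaml --strict

# Import (create/update) catalogue entries via the admin endpoint
dataset-cli import catalogue -i data/*.yaml --api-url http://localhost:8000
```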
---
## Documentation

Additional documentation is available:

- [Architecture overview](docs/architecture.md)
- [Catalogue management](docs/catalogue-management.md)
- [CLI operations](docs/cli-operations.md)
- [Governance and security](docs/governance-security.md)
- [Query engine](docs/query-engine.md)

---
## Development & Contribution

- Python ≥ 3.11
- Async SQLAlchemy
- Pydantic v2
- FastAPI + httpx
- sqlglot-based SQL validation

Before opening a PR:

- validate all YAML definitions
- add tests for new API behavior
- include migrations for schema changes
- keep docs in sync with API behavior

---

## License

Copyright >=2025 Spindox Labs

[...]

distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
docs/architecture.md (155 additions & 0 deletions)

# Architecture & Core Concepts

This document explains **what the Dataset API is**, the **core domain model**, and how the major subsystems fit together.

---

## What the Dataset API is

The Dataset API is a **governed, read-only data access layer** that:

- publishes a **dataset catalogue** (DCAT-AP compatible)
- exposes a **restricted SQL query interface** over *catalogued* datasets
- provides **schema and metadata introspection** for clients and UIs
- integrates with **OpenLineage** to keep provenance and trust up to date

The API is not an ingestion tool. Pipelines produce data; the Dataset API governs and serves it.

---
## System Context

### Actors

- **Producers (pipelines)**: create/refresh physical tables and emit lineage (OpenLineage)
- **Operators**: manage catalogue definitions via CLI and ensure OPA/policy config is correct
- **Consumers (apps/DTs/BI)**: discover datasets, fetch schemas, run governed queries

### External Dependencies

- **Physical storage**: PostgreSQL (tables, views)
- **Authorization**: OPA (policy decision point)
- **Lineage backend**: Marquez (OpenLineage ingestion/query)
- **Identity provider**: issues JWTs (users + service accounts)

---

## High-level Architecture

```
+------------------------+
| Pipelines (ETL/dbt/..) |
+-----------+------------+
            |
            | OpenLineage events
            v
      +-----------+
      |  Marquez  |
      +-----+-----+
            |
            | export lineage + metadata
            v
+---------+       +---------------------+       +--------------------+
| Clients |  -->  |     Dataset API     | <-->  |    OPA (Policy)    |
| (apps)  |       |                     |       |     allow/deny     |
+---------+       | - Catalogue         |       +--------------------+
                  | - Query Engine      |
                  | - Schema API        |
                  | - Metadata API      |
                  +----------+----------+
                             |
                             v
                     +---------------+
                     |  PostgreSQL   |
                     | tables/views  |
                     +---------------+
```

---

## Core Domain Model

### Dataset

A **dataset** is a governed contract over a physical data asset.

**Dataset identity**

- `dataset_id` (stable string; often namespace-qualified)

**Dataset governance**

- `access_level`: `open` | `internal` | `restricted`
- ownership / stewardship fields
- classification, tags, retention hints

**Dataset physical mapping**

- resolved storage reference (e.g., Postgres table/view)
- schema and column metadata derived from reflection

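The dataset contract above can be pictured as a Pydantic v2 model. This is a sketch, not the service's actual schema (the field names are assumptions); `extra="allow"` mirrors the flexible ingestion the project uses for lineage metadata:

```python
from typing import Literal, Optional
from pydantic import BaseModel, ConfigDict

# Illustrative model of the dataset contract; field names are assumptions.

class ColumnMeta(BaseModel):
    name: str
    type: str                        # storage-level type, e.g. "integer"
    description: Optional[str] = None

class DatasetEntry(BaseModel):
    # extra="allow" keeps unknown metadata instead of rejecting it
    model_config = ConfigDict(extra="allow")

    dataset_id: str                  # stable, namespace-qualified identifier
    access_level: Literal["open", "internal", "restricted"]
    table: str                       # resolved Postgres table/view
    tags: list[str] = []
    columns: list[ColumnMeta] = []

entry = DatasetEntry(
    dataset_id="gold.sales",
    access_level="restricted",
    table="gold.sales",
    tags=["finance"],
    owner="data-platform",           # extra field, tolerated by the config
)
print(entry.dataset_id, entry.access_level)
```
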
### Namespace

Namespaces are a first-class taxonomy for lifecycle and intent:

- `raw`: ingestion/staging
- `silver`: enriched internal
- `gold`: curated/exposed

Namespaces drive:

- catalogue selection filters
- policy rules (e.g., only `gold` exposed externally)
- operational grouping

### Distribution

In DCAT terms, a dataset can expose one or more **distributions** (e.g., SQL endpoint, files, API resource).
In practice:

- the API exposes a **query distribution**
- optional documentation and external references may be included

---

## Read-only Contract

Consumers cannot mutate data or catalogue state. Mutations happen only via:

- data pipelines (tables/views)
- CLI-managed catalogue imports
- admin endpoints used by the CLI

This guarantees:

- reproducibility
- auditability
- consistent governance enforcement

---

## Catalogue vs Storage Reality

The catalogue is **validated against storage**.

Expected behaviors:

- if a dataset points to a missing table/view, it should be marked invalid and/or removed during cleanup
- imports reconcile desired state (YAML) vs actual DB objects
- schema endpoints reflect what exists in storage today

The catalogue **never creates** physical data.

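Because schema endpoints reflect current storage, a schema response can be derived on the fly from reflected column metadata. A sketch of such a derivation targeting JSON Schema 2020-12 (the type mapping and field names are illustrative assumptions, not the service's actual implementation):

```python
# Maps reflected SQL column types to JSON Schema types; illustrative only.
SQL_TO_JSON = {
    "integer": "integer",
    "bigint": "integer",
    "numeric": "number",
    "text": "string",
    "varchar": "string",
    "boolean": "boolean",
    "timestamp": "string",   # serialized as ISO 8601 strings
}

def to_json_schema(dataset_id: str, columns: list[dict]) -> dict:
    """Build a JSON Schema (2020-12) object describing one dataset row."""
    props = {
        c["name"]: {"type": SQL_TO_JSON.get(c["type"], "string")}
        for c in columns
    }
    return {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "$id": f"urn:dataset:{dataset_id}",
        "type": "object",
        "properties": props,
        # non-nullable columns become required fields
        "required": [c["name"] for c in columns if not c.get("nullable", True)],
    }

schema = to_json_schema("gold.sales", [
    {"name": "region", "type": "text", "nullable": False},
    {"name": "amount", "type": "numeric"},
])
print(schema["properties"]["amount"])   # {'type': 'number'}
```
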
---
## Lifecycle of a Dataset (Conceptual)

1. Pipeline creates/refreshes physical table/view
2. Lineage is emitted to Marquez (OpenLineage)
3. Operator exports lineage-derived candidates (CLI)
4. Operator curates YAML (titles, descriptions, access levels, tags, docs)
5. CLI imports catalogue (create/update)
6. API exposes dataset in catalogue (if allowed)
7. Consumers query datasets under governance
8. Cleanup removes stale entries when physical assets disappear

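The curated YAML of step 4 might look roughly like this. The field names are illustrative, not the authoritative import schema; validate with the CLI before importing:

```yaml
# Illustrative catalogue entry - field names are assumptions
dataset_id: gold.sales
title: Sales (curated)
description: Curated sales facts, refreshed nightly by the ETL pipeline.
access_level: restricted
namespace: gold
tags: [finance, sales]
backend:
  type: postgres
  table: gold.sales
lineage:
  source: marquez        # provenance surfaced in the catalogue
```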
---
148+
149+
## Data Integrity & Guardrails
150+
151+
- SQL must be validated (AST-based, allowlisted)
152+
- dataset references must resolve to catalogued assets
153+
- access is policy-controlled (OPA)
154+
- limits and pagination protect the system from unbounded workloads
155+
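The first guardrail can be illustrated with a deliberately simplified stand-in. The real service validates the parsed AST (sqlglot); this stdlib sketch only shows the allowlist idea and would misfire on edge cases (e.g., a column literally named `update`) that AST-based validation handles correctly:

```python
import re

# Simplified stand-in for AST-based (sqlglot) validation: allow a single
# SELECT (or WITH ... SELECT) statement and reject mutating keywords.
FORBIDDEN = {
    "insert", "update", "delete", "drop", "alter", "create",
    "grant", "truncate", "merge", "copy", "call", "execute",
}

def check_query(sql: str) -> None:
    """Raise ValueError unless sql looks like a single SELECT statement."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:
        raise ValueError("multiple statements are not allowed")
    if not stripped.lower().startswith(("select", "with")):
        raise ValueError("only SELECT queries are allowed")
    tokens = set(re.findall(r"[a-z_]+", stripped.lower()))
    if tokens & FORBIDDEN:
        raise ValueError(f"forbidden keyword(s): {tokens & FORBIDDEN}")

check_query("SELECT region, SUM(amount) FROM gold.sales GROUP BY region")  # ok
try:
    check_query("DROP TABLE gold.sales")
except ValueError as exc:
    print("rejected:", exc)
```
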
