
Commit 95c4c1b

committed
bump
1 parent 0ea0a17 commit 95c4c1b

2 files changed: +202 −1 lines changed


CATALOG_METADATA_FOR_LLMS.md

Lines changed: 201 additions & 0 deletions
**Catalog Metadata Recommendations for LLM Semantic Layer**

Purpose
- Provide recommended metadata items and annotations to add to datasets, columns, and related catalog entities so LLMs can accurately discover, interpret, and ground answers using the catalog data.

How to read this document
- For each metadata item: **What** the item is, **How to create/store/maintain** it, and **How an LLM will use it**.
1) Dataset-level metadata

- Description
  - What: Concise human-readable overview of dataset purpose, scope, and typical use-cases (2–4 sentences). Link to a detailed README when available.
  - Create/Store/Maintain: Store as a catalog field (`description`). Keep in a repo or metastore README and surface it in the catalog UI. Update via CI when the schema or owners change; include a last-updated timestamp and version.
  - LLM usage: Provides context for retrieval and prompt grounding; used to choose relevant datasets and to craft natural-language explanations.

- Owner & contacts
  - What: Dataset owner, steward, and on-call contacts; GitHub/Slack/email links.
  - Create/Store/Maintain: Store as structured fields (`owner`, `steward`, `contact`). Sync with the org directory or Git metadata; validate via periodic checks.
  - LLM usage: When answers require clarification or approval, LLMs can surface contacts and generate recommended outreach text.

- Sensitivity / Classification / PII tag
  - What: One or more classification labels (e.g., `public`, `internal`, `confidential`, `restricted`, `pii`), plus GDPR/CCPA flags.
  - Create/Store/Maintain: Use a controlled vocabulary; store as structured tags. Apply automated PII detectors plus manual review. Enforce policy on ingest and propagate tags to column level.
  - LLM usage: Filter retrieval and enforce redaction/response constraints; avoid exposing sensitive values; add safety warnings to generated content.
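To make the automated-detector step concrete, here is a minimal sketch of a regex-based PII tagger over sampled column values. The pattern set and function name are illustrative, not part of any catalog API; a real pipeline would combine this with ML detectors and steward review.

```python
import re

# Illustrative seed patterns only; not an exhaustive production PII detector.
PII_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\+?\d[\d\s\-]{7,}\d"),
}

def suggest_pii_tags(sample_values):
    """Suggest PII tags for a column from a redacted sample of its values."""
    tags = set()
    for value in sample_values:
        for tag, pattern in PII_PATTERNS.items():
            if pattern.search(str(value)):
                tags.add(tag)
    return sorted(tags)
```

Suggested tags would then be queued for manual review rather than applied directly.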
- Freshness & cadence
  - What: `last_updated`, `update_frequency` (daily/hourly/real-time), `data_latency` (how stale the data typically is).
  - Create/Store/Maintain: Populate from ingestion pipelines; update automatically in ETL. Include source commit/manifest versions for reproducibility.
  - LLM usage: Prefer fresher sources for time-sensitive answers; indicate confidence based on the age of the data.

- Row count & high-level statistics
  - What: Approximate `row_count`, record size, table size, partition layout summary.
  - Create/Store/Maintain: Compute during ingestion or via scheduled jobs; update the metrics store.
  - LLM usage: Assess reliability and representativeness; help prioritize datasets for retrieval.

- Business domain & canonical terms
  - What: Business domain name (e.g., `billing`, `customer-360`) and canonical dataset tags mapped to an ontology or glossary.
  - Create/Store/Maintain: Maintain a central business glossary; link datasets to canonical terms via IDs.
  - LLM usage: Map user queries to domain-specific datasets; supports disambiguation and slot filling during query generation.

- Lineage & provenance
  - What: Source systems, upstream datasets, transformations, jobs, timestamps, and commit hashes.
  - Create/Store/Maintain: Capture automatically in ETL frameworks (e.g., as part of job metadata). Store as a structured lineage graph or DAG references.
  - LLM usage: Provide provenance evidence, justify answers, and enable traceability and grounding.

- Sample rows / schema examples
  - What: A small anonymized sample (10–50 rows) or schema-typed examples demonstrating typical values.
  - Create/Store/Maintain: Generate from the dataset with redaction for PII; store as a reversible or irreversible sampled snapshot depending on policy.
  - LLM usage: Help the model understand value formats and craft better queries and extraction prompts.

2) Column-level metadata

- Column description
  - What: Natural-language description of what the column represents and its business interpretation.
  - Create/Store/Maintain: Add as `description` on column definitions; source from data producers and data stewards. Keep it small and precise.
  - LLM usage: Crucial for mapping natural-language attributes to schema fields when generating queries or answering questions.

- Semantic type / ontology mapping
  - What: Semantic tag (e.g., `email`, `currency`, `timestamp`, `country`, `user_id`) and a link to the canonical ontology term.
  - Create/Store/Maintain: Standardize tags (controlled vocabulary) and map via a schema/registry. Use detectors to suggest mappings; require steward approval.
  - LLM usage: Improves normalization, unit-aware reasoning, and safe handling (PII awareness). Helps in entity linking and canonicalization.
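A semantic-type detector can be sketched as a majority-vote over known value patterns. The patterns, threshold, and function name below are assumptions for illustration; suggestions would still require steward approval as described above.

```python
import re

# Illustrative patterns for a few semantic types; a real registry would be larger.
SEMANTIC_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "timestamp": re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"),
    "country": re.compile(r"^[A-Z]{2}$"),
}

def suggest_semantic_type(values, threshold=0.8):
    """Suggest a semantic tag when most non-null sampled values match a pattern."""
    values = [str(v) for v in values if v is not None]
    if not values:
        return None
    for tag, pattern in SEMANTIC_PATTERNS.items():
        matches = sum(1 for v in values if pattern.match(v))
        if matches / len(values) >= threshold:
            return tag
    return None
```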
- Units & format
  - What: Units (`USD`, `meters`), timezone for timestamps, expected format (ISO date, RFC 3339), regex examples.
  - Create/Store/Maintain: Maintain as metadata fields; validate in ETL and during schema checks.
  - LLM usage: Enables correct conversions, comparisons, and localized formatting in answers.

- Value examples / top values
  - What: 5–10 typical or most frequent values, plus common error patterns.
  - Create/Store/Maintain: Compute periodically; keep derived stats (top-k values, `distinct_count`, `null_fraction`).
  - LLM usage: Supports disambiguation and common-case assumptions, and helps avoid hallucinating unexpected values.
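The derived stats named above (top-k values, `distinct_count`, `null_fraction`) can be computed from a sample with a few lines of standard-library Python. This is a sketch; production profilers would use approximate sketches over full partitions.

```python
from collections import Counter

def column_value_stats(values, top_k=5):
    """Derive top-k values, distinct count, and null fraction for one column sample."""
    total = len(values)
    nulls = sum(1 for v in values if v is None)
    counts = Counter(v for v in values if v is not None)
    return {
        "top_values": [v for v, _ in counts.most_common(top_k)],
        "distinct_count": len(counts),
        "null_fraction": nulls / total if total else 0.0,
    }
```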
- Cardinality & distinct count
  - What: High-cardinality flag, distinct counts, null ratio, and uniqueness (candidate primary key).
  - Create/Store/Maintain: Compute in metrics jobs; update with each data refresh.
  - LLM usage: Guide join strategy, entity resolution, and explainability of joins.

- Referential links (foreign keys)
  - What: Declared relationships to other dataset columns (FK -> primary key reference).
  - Create/Store/Maintain: Detect via profiling or declare in the schema; validate periodically.
  - LLM usage: Assist query planning, join recommendations, and preservation of referential integrity when constructing SQL.
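Profiling-based FK detection is often a containment check: nearly all non-null child values should appear in the parent key column. The function name and threshold below are illustrative; declared schema constraints should always take precedence over this heuristic.

```python
def fk_candidate(child_values, parent_values, min_containment=0.95):
    """Flag a foreign-key candidate via value containment of child in parent."""
    child = {v for v in child_values if v is not None}
    if not child:
        return False
    parent = set(parent_values)
    containment = len(child & parent) / len(child)
    return containment >= min_containment
```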
- Derived / computed flag & transform expression
  - What: Whether the column is derived; store the transformation SQL/logic or a pointer to the transformation job.
  - Create/Store/Maintain: Capture in ETL metadata and the code repository; version the expression.
  - LLM usage: Explain the derivation and enable tracing of how reported values were produced.

3) File/manifest/partition-level metadata

- Storage format & location
  - What: File type (Parquet/CSV/ORC), bucket/path, partitioning scheme and partition keys.
  - Create/Store/Maintain: Store in manifest metadata and the file manifest; keep hashes and sizes per file.
  - LLM usage: Answer storage/availability questions; determine efficient access patterns and cost estimates.

- Partition statistics
  - What: Per-partition row counts, min/max values for partition keys, and freshness per partition.
  - Create/Store/Maintain: Aggregated by ingestion jobs; indexed by partition.
  - LLM usage: Helps narrow retrieval to relevant partitions for RAG pipelines and limit data scanned.
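An ingestion job might emit per-partition stats like the sketch below. The record shape and the `partition_key`/`ts_field` parameter names are assumptions for illustration, not a specific manifest format.

```python
def partition_stats(rows, partition_key, ts_field):
    """Aggregate per-partition row counts and min/max timestamps."""
    stats = {}
    for row in rows:
        key = row[partition_key]
        entry = stats.setdefault(key, {"row_count": 0, "min_ts": None, "max_ts": None})
        entry["row_count"] += 1
        ts = row[ts_field]
        # Track the freshness window covered by each partition.
        entry["min_ts"] = ts if entry["min_ts"] is None else min(entry["min_ts"], ts)
        entry["max_ts"] = ts if entry["max_ts"] is None else max(entry["max_ts"], ts)
    return stats
```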
4) Catalog-level artifacts useful to LLMs

- Business glossary & term definitions
  - What: Canonical business definitions, synonyms, and mappings to schema elements.
  - Create/Store/Maintain: Central glossary service or JSON-LD store; tie terms to dataset/column IDs.
  - LLM usage: Disambiguate user language to schema; provide more accurate slot filling and entity resolution.

- Embeddings & vector indexes
  - What: Semantic embeddings for dataset descriptions, column descriptions, sample rows, and business terms.
  - Create/Store/Maintain: Generate with a chosen encoder; store vectors in a vector DB with pointers to canonical IDs and timestamps. Recompute on description or sample updates.
  - LLM usage: Retrieval-augmented generation (RAG): find the most relevant schema pieces, example rows, and docs for prompts.
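The retrieval step reduces to nearest-neighbour search over the stored vectors, keyed by canonical IDs. The toy index below is a stand-in for a vector DB, and the vectors are placeholders for real encoder output.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=2):
    """Return the k canonical IDs whose stored vectors best match the query."""
    ranked = sorted(index, key=lambda i: cosine(query_vec, index[i]), reverse=True)
    return ranked[:k]
```

For example, a query vector close to a dataset-description embedding would rank that dataset's ID first, and the ID then resolves back to descriptions and sample rows for the prompt.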
- Index of FAQs / usage examples / canned queries
  - What: Curated list of example queries, typical SQL snippets, common pitfalls, and recommended joins.
  - Create/Store/Maintain: Curated by data stewards; surfaced via the dataset README and catalog UI.
  - LLM usage: Use examples as few-shot context to improve generated queries and recommended actions.

- Data quality rules & test results
  - What: Rules (uniqueness, ranges, not-null) and the latest validation outcomes with severity.
  - Create/Store/Maintain: Store test definitions and results in the metadata store; gate CI on failing tests.
  - LLM usage: Adjust confidence, add caveats to answers, and suggest remediation steps.
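A minimal sketch of declarative rule evaluation, producing the kind of outcome records a catalog might store. The rule dictionary shape is an assumption for illustration; frameworks like Great Expectations or dbt tests provide richer equivalents.

```python
def run_quality_rules(rows, rules):
    """Evaluate simple not-null and range rules, returning outcomes with severity."""
    results = []
    for rule in rules:
        col = rule["column"]
        if rule["type"] == "not_null":
            failed = sum(1 for r in rows if r.get(col) is None)
        elif rule["type"] == "range":
            failed = sum(
                1 for r in rows
                if r.get(col) is not None
                and not (rule["min"] <= r[col] <= rule["max"])
            )
        else:
            raise ValueError(f"unknown rule type: {rule['type']}")
        results.append({
            "rule": rule["type"],
            "column": col,
            "passed": failed == 0,
            "failed_rows": failed,
            "severity": rule.get("severity", "warn"),
        })
    return results
```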
5) Formats and storage recommendations

- Use structured, machine-readable metadata (JSON / JSON-LD)
  - Why: LLMs and downstream services can easily parse and ingest structured fields; JSON-LD helps link to ontologies.

- Use a single source-of-truth metastore
  - Why: Consistency for ingestion, tooling, and LLM retrieval. Options: an existing metastore (Hive/Glue/BigQuery/Firestore), a data catalog solution (OpenMetadata, Apache Atlas), or a small dedicated metadata DB linked to the catalog.

- Versioning and immutability
  - Why: Keep historical context available for provenance and reproducibility. Store `schema_version`, `metadata_version`, and stable dataset identifiers.

- Vector store for embeddings
  - Why: Fast semantic retrieval. Keep vector metadata pointing back to canonical IDs; store the embedding model name and timestamp.

- Controlled vocabularies and schemas
  - Why: Predictable semantics for LLM prompts. Define enumerations for classification, semantic types, sensitivity, and update frequency.

6) Ingestion & maintenance guidance

- Automated extraction
  - Create jobs that generate or update: descriptions (seeded, then curated), statistics, sample rows (PII redacted), tests, and lineage.

- Review workflow
  - Suggest changes automatically (profiling/detectors), but require human steward approval for semantic fields (descriptions, sensitivity, canonical mappings).

- Frequency & triggers
  - Update statistics and embeddings on refresh cycles or schema changes; update descriptions and lineage when ETL code or owners change.

- Audit & access control
  - Maintain audit logs of metadata changes and who made them. Enforce RBAC over who can change sensitivity or ownership.

7) How LLMs will consume and use these items

- Retrieval & grounding
  - LLMs use dataset and column descriptions, embeddings, and sample rows to select relevant data and ground responses in factual sources.

- Query generation (SQL/filters)
  - Column semantics, units, and sample values let the LLM map natural-language predicates into typed SQL fragments, reducing malformed queries.
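The mapping step can be sketched as follows. The column registry, synonym table, and `to_sql_predicate` helper are hypothetical names invented for this example, not part of any catalog API.

```python
# Hypothetical column metadata and glossary synonyms, as a catalog might supply them.
COLUMNS = {
    "amount": {"semantic_type": "currency", "unit": "USD"},
    "created_at": {"semantic_type": "timestamp", "unit": None},
}
SYNONYMS = {"price": "amount", "cost": "amount", "created": "created_at"}

def to_sql_predicate(attribute, op, value):
    """Resolve a natural-language attribute to a typed SQL predicate fragment."""
    column = SYNONYMS.get(attribute, attribute)
    if column not in COLUMNS:
        raise KeyError(f"unknown attribute: {attribute}")
    if COLUMNS[column]["semantic_type"] == "timestamp":
        return f"{column} {op} TIMESTAMP '{value}'"
    return f"{column} {op} {value}"
```

Because the semantic type drives the literal's SQL type, "created after 2026-01-01" becomes a timestamp comparison rather than a string comparison.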
- Safety & policy enforcement
  - Use sensitivity tags and PII flags to block or redact results and to insert safety disclaimers in generated output.

- Explainability & provenance
  - Lineage and transform expressions allow the LLM to explain how values were derived and link back to data sources.

- Confidence estimation
  - Use freshness, data quality results, and row counts to set answer confidence and surface caveats to the user.
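One way to combine those signals is a weighted score; the weights, decay constant, and saturation point below are illustrative choices, not a standard formula.

```python
def answer_confidence(age_days, dq_pass_rate, row_count):
    """Combine freshness, data-quality pass rate, and volume into a 0-1 score."""
    freshness = 1.0 / (1.0 + age_days / 30.0)   # decays as data goes stale
    volume = min(row_count / 1000.0, 1.0)        # saturates at 1k rows
    score = 0.4 * freshness + 0.4 * dq_pass_rate + 0.2 * volume
    return round(score, 3)
```

A low score would prompt the LLM to attach caveats ("data last updated 90 days ago") rather than suppress the answer outright.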
- Disambiguation & entity linking
  - Use the business glossary and ontology mappings so the LLM can resolve ambiguous user terms to canonical schema elements.

8) Implementation checklist & examples

- Minimum viable metadata to onboard a dataset
  - `description`, `owner`, `sensitivity`, `last_updated`, column `description`, column `semantic_type`, and `schema`.

- Recommended full set (for production LLM usage)
  - All dataset-level items in section 1, all column-level items in section 2, embeddings, glossary links, lineage, DQ rules, and sample rows.

- Example JSON snippet (dataset metadata)

      {
        "dataset_id": "billing.transactions.v1",
        "description": "Transaction-level events for billing.",
        "owner": "team-billing@example.com",
        "sensitivity": "internal",
        "last_updated": "2026-01-10T12:00:00Z",
        "row_count": 12345678,
        "tags": ["billing", "payments"],
        "glossary_terms": ["invoice", "chargeback"]
      }
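A record like this can be gated on ingest with a simple required-fields check covering the dataset-level portion of the minimum viable set. The field list and function name are illustrative; a real deployment would use a formal JSON Schema instead.

```python
# Dataset-level subset of the minimum viable metadata (illustrative).
REQUIRED_DATASET_FIELDS = {
    "dataset_id", "description", "owner", "sensitivity", "last_updated",
}

def missing_fields(metadata):
    """Return the required dataset-level fields absent from a metadata record."""
    return sorted(REQUIRED_DATASET_FIELDS - metadata.keys())
```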
9) Security, privacy, and governance notes

- Redaction and synthetic samples
  - Never store raw PII in sample rows unless explicitly approved; use redaction or synthetic replacements.

- Embedding privacy
  - Beware that embedding models can memorize inputs; strip raw PII before embedding and record embedding provenance.

- Policy enforcement
  - Enforce access control at retrieval time using sensitivity tags; the LLM layer must check authorization before exposing content.

10) Next steps & suggested rollout

- Phase 1 (MVP): Add `description`, `owner`, `sensitivity`, `last_updated`, column `description`, column `semantic_type`, and compute basic stats and sample rows (redacted).
- Phase 2: Add lineage capture, DQ rules, vector embeddings for descriptions and samples, and top-value statistics.
- Phase 3: Full integration with glossary/ontology, automated detectors, and UI surfaces for steward approval.

Contact / Review
- Please review and indicate which items to implement first. For implementation, I can scaffold extractor jobs, a JSON schema for metadata, or a small metastore adapter for this repository.

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "opteryx-catalog"
-version = "0.4.15"
+version = "0.4.16"
 description = "Opteryx Cloud Catalog"
 readme = { file = "README.md", content-type = "text/markdown" }
 authors = [
