Context
In the discussion on metadata PR #192, we realized that the LinkML schema (isamples_core.yaml) and the PQG parquet representations serve different purposes:
| Serialization | Purpose | How Relationships Work |
|---|---|---|
| JSON/JSON-LD (LinkML) | Document exchange | Nesting provides implicit linkage |
| PQG Narrow (parquet) | Graph queries | Explicit edge rows with s/p/o |
| PQG Wide (parquet) | Analytical queries | p__* columns with row_id arrays |
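To make the first two rows of the table concrete, here is a minimal sketch of how a nested document's implicit linkage could be turned into explicit edge rows. The property names (`produced_by`, `sample_location`), the type names other than GeospatialCoordLocation, the graph name, and the `flatten` helper are all hypothetical, and `o` is treated as a single target row_id for simplicity; this is not the actual PQG loader.

```python
from itertools import count


def flatten(doc, nodes, edges, ids, graph="example"):
    """Turn a nested document into node rows plus explicit s/p/o/n edge rows."""
    row_id = next(ids)
    node = {"row_id": row_id, "otype": doc["otype"]}
    nodes.append(node)
    for key, value in doc.items():
        if key == "otype":
            continue
        children = value if isinstance(value, list) else [value]
        if all(isinstance(child, dict) for child in children):
            # Nested object(s): the nesting becomes one explicit edge per child.
            for child in children:
                child_id = flatten(child, nodes, edges, ids, graph)
                edges.append({"s": row_id, "p": key, "o": child_id, "n": graph})
        else:
            # Scalar (or list of scalars): stays a plain property of the node.
            node[key] = value
    return row_id


# Hypothetical nested JSON document.
sample = {
    "otype": "MaterialSampleRecord",
    "label": "Sample A",
    "produced_by": {
        "otype": "SamplingEvent",
        "sample_location": {
            "otype": "GeospatialCoordLocation",
            "latitude": 1.5,
            "longitude": 2.5,
        },
    },
}

nodes, edges = [], []
flatten(sample, nodes, edges, count(1))
```

In the JSON document the location is reachable only by walking the nesting; after flattening, the same link is an edge row that can be queried directly.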
The LinkML schema is the conceptual model and JSON serialization spec. But PQG parquet has additional structural requirements that aren't formally documented:
- Every entity row has `row_id` (internal identifier)
- Edge rows have `s`, `p`, `o`, `n` columns
- Wide format has `p__*` columns containing arrays of target row_ids
- All nodes (including GeospatialCoordLocation, SampleRelation) get PIDs in parquet even though they're optional in the JSON schema
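A rough sketch of how these conventions relate, using plain Python dicts in place of parquet rows. The `_edge_` marker value, the `pid` column name, the predicate names, and the `to_wide` helper are assumptions made for illustration, and `o` is again treated as a single target row_id.

```python
from collections import defaultdict

EDGE = "_edge_"  # assumed otype value marking edge rows

# Narrow format: node rows and edge rows share one table.
narrow = [
    {"row_id": 1, "otype": "MaterialSampleRecord", "pid": "ex:sample-1"},
    {"row_id": 2, "otype": "SamplingEvent", "pid": "ex:event-1"},
    {"row_id": 3, "otype": "GeospatialCoordLocation", "pid": "ex:loc-1"},
    {"row_id": 4, "otype": EDGE, "s": 1, "p": "produced_by", "o": 2, "n": "ex"},
    {"row_id": 5, "otype": EDGE, "s": 2, "p": "sample_location", "o": 3, "n": "ex"},
]


def to_wide(rows):
    """Fold edge rows into p__<predicate> arrays on their subject rows."""
    targets = defaultdict(lambda: defaultdict(list))
    for row in rows:
        if row["otype"] == EDGE:
            targets[row["s"]][f"p__{row['p']}"].append(row["o"])
    return [
        {**row, **targets.get(row["row_id"], {})}
        for row in rows
        if row["otype"] != EDGE
    ]


wide = to_wide(narrow)
by_id = {row["row_id"]: row for row in wide}

# Traversal: sample -> event -> location, following row_id arrays.
event = by_id[wide[0]["p__produced_by"][0]]
location = by_id[event["p__sample_location"][0]]
```

The traversal at the end only works because the location has its own row, which is the reason every node needs an identifier in parquet even when the JSON schema leaves it optional. In a real parquet file, rows without a given relationship would presumably hold a null in that `p__*` column rather than omit it.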
Question
Should we create a formal PQG Parquet Schema specification that documents these parquet-specific conventions?
Preliminary Research: Parquet Schema Standards
What exists
- Parquet's built-in schema: Files are self-describing (column names, types in footer), but this only describes structure, not semantics.
- Frictionless Data Package: Most mature standard for dataset metadata:
  - Table Schema: columns with types, constraints, descriptions
  - Parquet support being added in v2
  - Tension: Parquet already has embedded schema (potential duplication)
- Data Catalogs: AWS Glue, Apache Iceberg, Delta Lake have their own metadata layers, but these are platform-specific.
- No dominant standard for "parquet + semantics": Most approaches are ad-hoc (README files, companion JSON/YAML).
Possible approach for PQG
| Layer | Format | What It Describes |
|---|---|---|
| Conceptual | LinkML YAML | Entity types, properties, relationships |
| JSON serialization | JSON Schema (from LinkML) | Document structure |
| Parquet serialization | Custom spec doc (+ Frictionless Table Schema?) | Column meanings, PQG conventions |
Proposal
Create a PQG_SPECIFICATION.md document in this repo that:
- References the LinkML conceptual model (don't duplicate it)
- Documents PQG-specific conventions:
  - Narrow format: `row_id`, `otype`, `s`/`p`/`o`/`n` for edges
  - Wide format: `p__*` columns, relationship to narrow
  - Why all entities get PIDs in parquet (graph traversal) even if optional in JSON
- Optionally includes a Frictionless Table Schema for machine-readable column definitions
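For a sense of what the optional machine-readable part might look like, here is a sketch of a Table Schema descriptor for the narrow format, built as a Python dict so it can be dumped to a companion JSON file. The descriptor keys (`fields`, `name`, `type`, `description`, `constraints`, `primaryKey`) follow the Frictionless Table Schema layout; the column types, descriptions, and the `missing_required` helper are assumptions, not agreed definitions.

```python
import json

# Assumed column types and descriptions for the narrow format.
NARROW_SCHEMA = {
    "fields": [
        {
            "name": "row_id",
            "type": "integer",
            "description": "Internal row identifier",
            "constraints": {"required": True, "unique": True},
        },
        {
            "name": "otype",
            "type": "string",
            "description": "Entity type, or the marker for edge rows",
            "constraints": {"required": True},
        },
        {"name": "s", "type": "integer", "description": "Edge subject (row_id)"},
        {"name": "p", "type": "string", "description": "Edge predicate"},
        {"name": "o", "type": "integer", "description": "Edge object (row_id)"},
        {"name": "n", "type": "string", "description": "Graph name"},
    ],
    "primaryKey": "row_id",
}


def missing_required(schema, row):
    """Return names of required fields that are absent or null in a row."""
    return [
        field["name"]
        for field in schema["fields"]
        if field.get("constraints", {}).get("required")
        and row.get(field["name"]) is None
    ]


# Serialized form that could sit next to the parquet file.
descriptor_json = json.dumps(NARROW_SCHEMA, indent=2)
```

A descriptor like this repeats the column names and types that parquet already stores in its footer, which is the duplication tension noted above; what it adds is the `description` and `constraints` layer.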
Questions for Discussion
- Is a formal spec worth the maintenance overhead?
- Should we use Frictionless Table Schema, or just markdown documentation?
- What level of detail is needed? (Column definitions only, or also query patterns?)