Define formal PQG Parquet Schema specification #16

@rdhyee


Context

In discussion on metadata PR #192, we realized that the LinkML schema (isamples_core.yaml) and PQG parquet representations serve different purposes:

| Serialization | Purpose | How Relationships Work |
| --- | --- | --- |
| JSON/JSON-LD (LinkML) | Document exchange | Nesting provides implicit linkage |
| PQG Narrow (parquet) | Graph queries | Explicit edge rows with s/p/o |
| PQG Wide (parquet) | Analytical queries | p__* columns with row_id arrays |

The LinkML schema is the conceptual model and JSON serialization spec. But PQG parquet has additional structural requirements that aren't formally documented:

  • Every entity row has row_id (internal identifier)
  • Edge rows have s, p, o, n columns
  • Wide format has p__* columns containing arrays of target row_ids
  • All nodes (including GeospatialCoordLocation, SampleRelation) get PIDs in parquet even though they're optional in the JSON schema
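
To make the conventions above concrete, here is a minimal sketch in plain Python (no parquet library) of how narrow edge rows might collapse into wide p__* columns. Everything beyond the column names row_id, otype, s, p, o, n and the p__ prefix is an assumption for illustration: the "_edge_" marker, the pid column name, the predicate names, the example identifiers, and the choice to store a single target row_id in o are all hypothetical, and n is left empty because its meaning is not defined in this issue.

```python
EDGE_OTYPE = "_edge_"  # assumed marker for edge rows; not confirmed by this issue

# Hypothetical narrow-format rows: entity rows and edge rows share one table.
narrow_rows = [
    {"row_id": 1, "otype": "MaterialSampleRecord", "pid": "ark:/example/sample-1"},
    {"row_id": 2, "otype": "SamplingEvent", "pid": "ark:/example/event-1"},
    {"row_id": 3, "otype": "GeospatialCoordLocation", "pid": "ark:/example/loc-1"},
    # Edge rows: s = source row_id, p = predicate, o = target row_id (assumed scalar).
    {"row_id": 4, "otype": EDGE_OTYPE, "s": 1, "p": "produced_by", "o": 2, "n": None},
    {"row_id": 5, "otype": EDGE_OTYPE, "s": 2, "p": "sample_location", "o": 3, "n": None},
]


def narrow_to_wide(rows):
    """Fold edge rows into p__<predicate> arrays on their source entity rows."""
    # Copy entity rows, keyed by row_id, so the input is not mutated.
    entities = {r["row_id"]: dict(r) for r in rows if r["otype"] != EDGE_OTYPE}
    for edge in (r for r in rows if r["otype"] == EDGE_OTYPE):
        column = "p__" + edge["p"]
        entities[edge["s"]].setdefault(column, []).append(edge["o"])
    return list(entities.values())


wide_rows = narrow_to_wide(narrow_rows)
for row in wide_rows:
    print(row)
```

In this sketch the GeospatialCoordLocation row needs its own row_id (and, per the last bullet, a PID) purely so that an edge can point at it, which is the graph-traversal reason the parquet form is stricter than the JSON schema.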

Question

Should we create a formal PQG Parquet Schema specification that documents these parquet-specific conventions?

Preliminary Research: Parquet Schema Standards

What exists

  1. Parquet's built-in schema - Files are self-describing (column names, types in footer), but this only describes structure, not semantics.

  2. Frictionless Data Package - Most mature standard for dataset metadata:

    • Table Schema - columns with types, constraints, descriptions
    • Parquet support being added in v2
    • Tension: Parquet already has embedded schema (potential duplication)
  3. Data Catalogs - AWS Glue, Apache Iceberg, Delta Lake have their own metadata layers, but these are platform-specific.

  4. No dominant standard for "parquet + semantics" - Most approaches are ad-hoc (README files, companion JSON/YAML).

Possible approach for PQG

| Layer | Format | What It Describes |
| --- | --- | --- |
| Conceptual | LinkML YAML | Entity types, properties, relationships |
| JSON serialization | JSON Schema (from LinkML) | Document structure |
| Parquet serialization | Custom spec doc (+ Frictionless Table Schema?) | Column meanings, PQG conventions |

Proposal

Create a PQG_SPECIFICATION.md document in this repo that:

  1. References the LinkML conceptual model (don't duplicate it)
  2. Documents PQG-specific conventions:
    • Narrow format: row_id, otype, s/p/o/n for edges
    • Wide format: p__* columns, relationship to narrow
    • Why all entities get PIDs in parquet (graph traversal) even if optional in JSON
  3. Optionally includes a Frictionless Table Schema for machine-readable column definitions
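
As a rough idea of what item 3 could look like, the sketch below builds a Frictionless-style Table Schema descriptor for the narrow format as a Python dict and prints it as JSON. The fields, type, description, constraints, and primaryKey keys follow the Table Schema descriptor layout; the column types, the descriptions, and the missing_required helper are assumptions for discussion, not an agreed definition.

```python
import json

# Hypothetical Table Schema descriptor for the PQG narrow format.
# Column types and descriptions are guesses to be settled in the spec.
pqg_narrow_schema = {
    "fields": [
        {"name": "row_id", "type": "integer",
         "description": "Internal identifier, present on every row",
         "constraints": {"required": True}},
        {"name": "otype", "type": "string",
         "description": "Entity type of the row",
         "constraints": {"required": True}},
        {"name": "s", "type": "integer",
         "description": "Edge rows only: row_id of the subject"},
        {"name": "p", "type": "string",
         "description": "Edge rows only: predicate"},
        {"name": "o", "type": "integer",
         "description": "Edge rows only: row_id of the object (assumed scalar)"},
        {"name": "n", "type": "string",
         "description": "Edge rows only: meaning to be defined in the spec"},
    ],
    "primaryKey": "row_id",
}


def missing_required(schema, row):
    """Return names of required fields that are absent or None in a row."""
    return [
        field["name"]
        for field in schema["fields"]
        if field.get("constraints", {}).get("required")
        and row.get(field["name"]) is None
    ]


print(json.dumps(pqg_narrow_schema, indent=2))
```

A descriptor like this would sit next to the markdown spec rather than replace it: the prose explains why the conventions exist, while the JSON gives tools something to check against.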

Questions for Discussion

  1. Is a formal spec worth the maintenance overhead?
  2. Should we use Frictionless Table Schema, or just markdown documentation?
  3. What level of detail is needed? (Column definitions only, or also query patterns?)

cc @smrgeoinfo @datadavev
