Expand data domain with 6 new validated types #10

@oborchers

Summary

The data domain currently has only 3 types (KafkaTopicName, SqlIdentifier, TableIdentifier). This issue proposes 6 new types to bring it to 9 — all vendor-agnostic data engineering formats backed by formal specs or widely-adopted standards.

Proposed Types

1. JsonPointer (Pattern B — Annotated)

  • Spec: RFC 6901
  • Domain: data/
  • Format: Empty string or sequences of / followed by reference tokens. Escape rules: ~0 for ~, ~1 for /
  • Regex: ^(/([^~/]|~0|~1)*)*$
  • Usage: Central to JSON Patch (RFC 6902), OpenAPI $ref resolution, JSON Schema. Escape rules are a common source of bugs.
  • Examples: "", "/foo", "/foo/0", "/a~1b" (refers to key a/b), "/m~0n" (refers to key m~n)
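
A minimal stdlib-only sketch of the check and of the RFC 6901 unescaping order. The helper names `is_json_pointer` and `decode_tokens` are hypothetical and not part of pydantypes; the Pydantic `Annotated` wiring is deliberately left out.

```python
import re

# Regex from the proposal; fullmatch() is used instead of ^...$ so that a
# trailing newline is not silently accepted.
JSON_POINTER_RE = re.compile(r"(/([^~/]|~0|~1)*)*")


def is_json_pointer(value: str) -> bool:
    """Return True if value is a syntactically valid JSON Pointer."""
    return JSON_POINTER_RE.fullmatch(value) is not None


def decode_tokens(pointer: str) -> list[str]:
    """Split a pointer into unescaped reference tokens."""
    if not is_json_pointer(pointer):
        raise ValueError(f"invalid JSON Pointer: {pointer!r}")
    # RFC 6901 requires replacing ~1 before ~0, so that "~01" decodes
    # to "~1" rather than "/".
    return [
        token.replace("~1", "/").replace("~0", "~")
        for token in pointer.split("/")[1:]
    ]
```

Decoding in the opposite order (`~0` first) is the classic escape bug: `"/~01"` would wrongly resolve to the key `/`.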

2. HivePartitionPath (Pattern A — str subclass)

  • Spec: Apache Hive partitioning convention
  • Domain: data/
  • Format: key=value/key=value/... where keys are valid identifiers
  • Regex: ^([a-zA-Z_][a-zA-Z0-9_]*=[^/]+)(/[a-zA-Z_][a-zA-Z0-9_]*=[^/]+)*$
  • Parsed properties: .partitions, a dict[str, str] of partition key-value pairs
  • Usage: Ubiquitous in data lake architectures — Spark, Databricks, AWS Athena, Presto/Trino, Delta Lake, Iceberg. Data engineers working with these systems write such paths routinely, and malformed partitions are a common source of pipeline failures.
  • Examples: "year=2024", "year=2024/month=01/day=15", "region=us-east-1/dt=2024-01-15"
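
As a rough illustration of the Pattern A shape, here is a plain `str` subclass with the proposed `.partitions` property. It uses only the standard library; the Pydantic integration described in ARCHITECTURE.md is omitted, so this is a sketch and not the pydantypes implementation.

```python
import re

# Regex from the proposal, applied with fullmatch() instead of ^...$.
_HIVE_RE = re.compile(
    r"([a-zA-Z_][a-zA-Z0-9_]*=[^/]+)(/[a-zA-Z_][a-zA-Z0-9_]*=[^/]+)*"
)


class HivePartitionPath(str):
    """A validated key=value/key=value/... partition path."""

    def __new__(cls, value: str) -> "HivePartitionPath":
        if _HIVE_RE.fullmatch(value) is None:
            raise ValueError(f"invalid Hive partition path: {value!r}")
        return super().__new__(cls, value)

    @property
    def partitions(self) -> dict[str, str]:
        # Split on the first "=" only, because the regex allows "=" in values.
        return dict(segment.split("=", 1) for segment in self.split("/"))
```

For example, `HivePartitionPath("region=us-east-1/dt=2024-01-15").partitions` yields a dict with the keys `region` and `dt`, while a trailing slash or a key starting with a digit raises `ValueError`.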

3. Doi (Pattern A — str subclass)

  • Spec: ISO 26324, Crossref regex
  • Domain: data/
  • Format: 10.NNNN/suffix where NNNN is a registrant code (4+ digits)
  • Regex: ^10\.\d{4,9}/[-._;()/:A-Za-z0-9]+$ (covers 99.3% of all DOIs per Crossref)
  • Parsed properties: .prefix (registrant code), .suffix
  • Usage: Universal in scientific publishing, dataset registries (Zenodo, DataCite, Figshare), ML model cards, and data catalog metadata. Increasingly common in ML experiment tracking.
  • Examples: "10.1000/xyz123", "10.1038/nature12373", "10.5281/zenodo.1234567"
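
A hedged sketch of how the parsed properties might look. It assumes `.prefix` returns the whole `10.NNNN` part and `.suffix` everything after the first slash; the proposal does not pin this down, so treat both as assumptions. `re.ASCII` is also an addition here, to keep `\d` from matching non-ASCII digits.

```python
import re

# Crossref-style pattern from the proposal. re.ASCII restricts \d to 0-9
# (an assumption of this sketch, not stated in the proposal).
_DOI_RE = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+", re.ASCII)


class Doi(str):
    """A validated DOI such as 10.5281/zenodo.1234567."""

    def __new__(cls, value: str) -> "Doi":
        if _DOI_RE.fullmatch(value) is None:
            raise ValueError(f"invalid DOI: {value!r}")
        return super().__new__(cls, value)

    @property
    def prefix(self) -> str:
        # Assumed meaning: everything before the first "/", e.g. "10.5281".
        return self.partition("/")[0]

    @property
    def suffix(self) -> str:
        # Everything after the first "/"; the suffix itself may contain "/".
        return self.partition("/")[2]
```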

4. AvroFullName (Pattern A — str subclass)

  • Spec: Apache Avro 1.11.1 Specification
  • Domain: data/
  • Format: Fully qualified name for Avro records/enums/fixed types. namespace.name where each component matches [A-Za-z_][A-Za-z0-9_]*
  • Regex: ^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*$
  • Parsed properties: .namespace (everything before last dot, or None), .name (last component)
  • Usage: Every Kafka + Schema Registry deployment (Confluent, Redpanda). Also applies to Protobuf fully qualified names (same format). Frequently misconfigured.
  • Examples: "User", "com.example.UserEvent", "io.confluent.kafka.AvroMessage"
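
The namespace/name split described above could be sketched as follows, using only the standard library. This is an illustrative `str` subclass, not the actual pydantypes code, and it leaves out the Pydantic hooks.

```python
import re
from typing import Optional

# Regex from the proposal: dot-separated identifiers.
_AVRO_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*")


class AvroFullName(str):
    """A validated Avro fullname such as com.example.UserEvent."""

    def __new__(cls, value: str) -> "AvroFullName":
        if _AVRO_RE.fullmatch(value) is None:
            raise ValueError(f"invalid Avro fullname: {value!r}")
        return super().__new__(cls, value)

    @property
    def namespace(self) -> Optional[str]:
        # Everything before the last dot, or None when there is no dot.
        head, sep, _ = self.rpartition(".")
        return head if sep else None

    @property
    def name(self) -> str:
        # The last dot-separated component.
        return self.rpartition(".")[2]
```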

5. SchemaRegistrySubject (Pattern A — str subclass)

  • Spec: Confluent Schema Registry — Subject Name Strategy
  • Domain: data/
  • Format: Default TopicNameStrategy: <topic>-key or <topic>-value
  • Regex: ^[a-zA-Z0-9._-]+-(?:key|value)$
  • Parsed properties: .topic (topic name), .record_type ("key" or "value")
  • Usage: Every Kafka + Schema Registry deployment. Natural companion to existing KafkaTopicName. The TopicNameStrategy is the default and dominant naming pattern.
  • Examples: "user-events-value", "order.created-key", "payments-value"
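
A small sketch of the TopicNameStrategy parsing, again stdlib-only and not the pydantypes implementation. The split uses the last hyphen, so topics that themselves contain hyphens (or even end in `-key`) are handled.

```python
import re

# Regex from the proposal: <topic>-key or <topic>-value.
_SUBJECT_RE = re.compile(r"[a-zA-Z0-9._-]+-(?:key|value)")


class SchemaRegistrySubject(str):
    """A validated subject under the default TopicNameStrategy."""

    def __new__(cls, value: str) -> "SchemaRegistrySubject":
        if _SUBJECT_RE.fullmatch(value) is None:
            raise ValueError(f"invalid Schema Registry subject: {value!r}")
        return super().__new__(cls, value)

    @property
    def topic(self) -> str:
        # Everything before the final "-key" / "-value" marker.
        return self.rpartition("-")[0]

    @property
    def record_type(self) -> str:
        # "key" or "value", guaranteed by the regex above.
        return self.rpartition("-")[2]
```

Subjects produced by RecordNameStrategy or TopicRecordNameStrategy do not follow this shape and would be rejected by such a type, which matches the proposal's scope of the default strategy only.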

6. ElasticsearchIndexName (Pattern B — Annotated)

  • Spec: Elasticsearch index naming restrictions
  • Domain: data/
  • Validation rules:
    • Lowercase only
    • Max 255 bytes
    • Cannot start with -, _, or +
    • Cannot be . or ..
    • Forbidden chars: \, /, *, ?, ", <, >, |, #, ,, :, space (the colon is rejected since Elasticsearch 7.0)
  • Usage: Every ELK/OpenSearch deployment. Index name validation has several non-obvious rules and is easy to get wrong. Multiple Pydantic+ES projects (esorm, pydastic) reinvent this validation.
  • Examples: "logs-2024.01.15", "user-events", "metrics-prod"
  • Note: Elasticsearch is vendor-agnostic (open-source, runs anywhere). Same rules apply to AWS OpenSearch.
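
Since these rules do not reduce to one readable regex, a validator function is a natural fit. The sketch below is stdlib-only; the function name `validate_index_name` is hypothetical, rejecting the empty string is an assumption not listed above, and the forbidden set includes `:` because Elasticsearch 7.0+ rejects it. In a Pattern B type this function would presumably be attached through `Annotated`, which is not shown here.

```python
# Characters Elasticsearch rejects in index names (":" since 7.0).
_FORBIDDEN = frozenset('\\/*?"<>|#, :')


def validate_index_name(name: str) -> str:
    """Return name unchanged if it is a valid index name, else raise ValueError."""
    if not name:
        # Assumption of this sketch: empty names are rejected.
        raise ValueError("index name must not be empty")
    if name in (".", ".."):
        raise ValueError("index name must not be '.' or '..'")
    if name != name.lower():
        raise ValueError("index name must be lowercase")
    if name[0] in "-_+":
        raise ValueError("index name must not start with '-', '_' or '+'")
    # The limit is in bytes, so multi-byte characters count more than once.
    if len(name.encode("utf-8")) > 255:
        raise ValueError("index name must be at most 255 bytes")
    bad = _FORBIDDEN.intersection(name)
    if bad:
        raise ValueError(f"index name contains forbidden characters: {sorted(bad)}")
    return name
```

Measuring the length on the UTF-8 encoding rather than with `len(name)` is the detail most often missed: a 128-character name made of two-byte characters already exceeds the limit.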

What goes where

All 6 types go into data/ — they are vendor-agnostic data engineering formats, not tied to any specific cloud provider.

DataUri (RFC 2397) was also considered but belongs in web/ — tracked separately if pursued.

Verification

  • None of these overlap with Pydantic core, pydantic-extra-types, or schwifty
  • All are regex/parsing-only — no external service calls
  • All have formal specs or widely-adopted standards behind them
  • All follow existing pydantypes patterns (Pattern A or B per ARCHITECTURE.md)

Implementation order

Suggested file structure:

  • src/pydantypes/data/json.py: JsonPointer
  • src/pydantypes/data/hive.py: HivePartitionPath
  • src/pydantypes/data/doi.py: Doi
  • src/pydantypes/data/avro.py: AvroFullName
  • src/pydantypes/data/schema_registry.py: SchemaRegistrySubject
  • src/pydantypes/data/elasticsearch.py: ElasticsearchIndexName
