Summary
The `data` domain currently has only 3 types (`KafkaTopicName`, `SqlIdentifier`, `TableIdentifier`). This issue proposes 6 new types to bring it to 9. All are vendor-agnostic data engineering formats backed by formal specs or widely adopted standards.
Proposed Types
1. JsonPointer (Pattern B — Annotated)
- Spec: RFC 6901
- Domain: `data/`
- Format: Empty string or a sequence of `/` followed by reference tokens. Escape rules: `~0` for `~`, `~1` for `/`
- Regex: `^(/([^~/]|~0|~1)*)*$`
- Usage: Central to JSON Patch (RFC 6902), OpenAPI `$ref` resolution, and JSON Schema. Escape rules are a common source of bugs.
- Examples: `""`, `"/foo"`, `"/foo/0"`, `"/a~1b"` (refers to key `a/b`), `"/m~0n"` (refers to key `m~n`)
2. HivePartitionPath (Pattern A — str subclass)
- Spec: Apache Hive partitioning convention
- Domain: `data/`
- Format: `key=value/key=value/...` where keys are valid identifiers
- Regex: `^([a-zA-Z_][a-zA-Z0-9_]*=[^/]+)(/[a-zA-Z_][a-zA-Z0-9_]*=[^/]+)*$`
- Parsed properties: `.partitions` → `dict[str, str]` of partition key-value pairs
- Usage: Ubiquitous in data lake architectures: Spark, Databricks, AWS Athena, Presto/Trino, Delta Lake, Iceberg. Every data engineer writes these daily. Malformed partitions are a common source of pipeline failures.
- Examples: `"year=2024"`, `"year=2024/month=01/day=15"`, `"region=us-east-1/dt=2024-01-15"`
3. Doi (Pattern A — str subclass)
- Spec: ISO 26324, Crossref regex
- Domain: `data/`
- Format: `10.NNNN/suffix` where `NNNN` is a registrant code (4+ digits)
- Regex: `^10\.\d{4,9}/[-._;()/:A-Za-z0-9]+$` (covers 99.3% of all DOIs per Crossref)
- Parsed properties: `.prefix` (registrant code), `.suffix`
- Usage: Universal in scientific publishing, dataset registries (Zenodo, DataCite, Figshare), ML model cards, and data catalog metadata. Increasingly common in ML experiment tracking.
- Examples: `"10.1000/xyz123"`, `"10.1038/nature12373"`, `"10.5281/zenodo.1234567"`
4. AvroFullName (Pattern A — str subclass)
- Spec: Apache Avro 1.11.1 Specification
- Domain: `data/`
- Format: Fully qualified name for Avro records/enums/fixed types. `namespace.name` where each component matches `[A-Za-z_][A-Za-z0-9_]*`
- Regex: `^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*$`
- Parsed properties: `.namespace` (everything before the last dot, or `None`), `.name` (last component)
- Usage: Every Kafka + Schema Registry deployment (Confluent, Redpanda). Also applies to Protobuf fully qualified names (same format). Frequently misconfigured.
- Examples: `"User"`, `"com.example.UserEvent"`, `"io.confluent.kafka.AvroMessage"`
5. SchemaRegistrySubject (Pattern A — str subclass)
- Spec: Confluent Schema Registry, Subject Name Strategy
- Domain: `data/`
- Format: Default TopicNameStrategy: `<topic>-key` or `<topic>-value`
- Regex: `^[a-zA-Z0-9._-]+-(?:key|value)$`
- Parsed properties: `.topic` (topic name), `.record_type` (`"key"` or `"value"`)
- Usage: Every Kafka + Schema Registry deployment. Natural companion to the existing `KafkaTopicName`. The TopicNameStrategy is the default and dominant naming pattern.
- Examples: `"user-events-value"`, `"order.created-key"`, `"payments-value"`
6. ElasticsearchIndexName (Pattern B — Annotated)
- Spec: Elasticsearch index naming restrictions
- Domain: `data/`
- Validation rules:
  - Lowercase only
  - Max 255 bytes
  - Cannot start with `-`, `_`, or `+`
  - Cannot be `.` or `..`
  - Forbidden chars: `\`, `/`, `*`, `?`, `"`, `<`, `>`, `|`, `#`, `,`, space
- Usage: Every ELK/OpenSearch deployment. Index name validation is surprisingly nuanced and most people get it wrong. Multiple Pydantic+ES projects (esorm, pydastic) reinvent this validation.
- Examples: `"logs-2024.01.15"`, `"user-events"`, `"metrics-prod"`
- Note: Elasticsearch is vendor-agnostic (open-source, runs anywhere). Same rules apply to AWS OpenSearch.
What goes where
All 6 types go into `data/`: they are vendor-agnostic data engineering formats, not tied to any specific cloud provider.
`DataUri` (RFC 2397) was also considered but belongs in `web/`; it will be tracked separately if pursued.
Verification
- None of these overlap with Pydantic core, pydantic-extra-types, or schwifty
- All are regex/parsing-only — no external service calls
- All have formal specs or widely-adopted standards behind them
- All follow existing pydantypes patterns (Pattern A or B per ARCHITECTURE.md)
Implementation order
Suggested file structure:
- `src/pydantypes/data/json.py`: `JsonPointer`
- `src/pydantypes/data/hive.py`: `HivePartitionPath`
- `src/pydantypes/data/doi.py`: `Doi`
- `src/pydantypes/data/avro.py`: `AvroFullName`
- `src/pydantypes/data/schema_registry.py`: `SchemaRegistrySubject`
- `src/pydantypes/data/elasticsearch.py`: `ElasticsearchIndexName`