This document codifies the conventions, patterns, and design decisions of the pydantypes codebase. All contributors (human and AI) should follow these conventions when adding or modifying types.
Every type in pydantypes uses one of four patterns. The choice depends on what the type needs to do.
Pattern A (str subclass): use when the validated string has extractable components (e.g., bucket + key from an S3 URI, partition + service + region from an ARN).
```python
# Source: https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-bucket-intro.html
class S3Uri(CloudStorageUri):
    """An S3 URI like s3://bucket/key with parsed properties."""

    _pattern: ClassVar[re.Pattern[str]] = re.compile(
        r"^s3://([a-z0-9][a-z0-9.\-]{1,61}[a-z0-9])(/(.*))?$"
    )

    def __new__(cls, value: str) -> S3Uri:
        """Create and validate a new S3Uri instance."""
        m = cls._pattern.match(value)
        if not m:
            raise PydanticCustomError("s3_uri", "Invalid S3 URI: {value}", {"value": value})
        bucket = m.group(1)
        _validate_s3_bucket_name(bucket)
        key = m.group(3) or ""
        if len(key.encode("utf-8")) > 1024:
            raise PydanticCustomError(
                "s3_uri",
                "Invalid S3 URI: key must be <= 1024 bytes. Got: {value}",
                {"value": value},
            )
        if _S3_KEY_CONTROL_CHAR_RE.search(key):
            raise PydanticCustomError(
                "s3_uri",
                "Invalid S3 URI: key must not contain control characters. Got: {value}",
                {"value": value},
            )
        instance = str.__new__(cls, value)
        instance.bucket = bucket
        instance.key = key
        return instance

    @classmethod
    def _validate(cls, value: str) -> S3Uri:
        """Validate a string as an S3 URI."""
        return cls(value)

    @classmethod
    def __get_pydantic_core_schema__(
        cls, source_type: Any, handler: GetCoreSchemaHandler
    ) -> CoreSchema:
        """Return the Pydantic core schema for S3Uri."""
        return _str_type_core_schema(cls, source_type, handler)

    @classmethod
    def __get_pydantic_json_schema__(
        cls, _core_schema: CoreSchema, handler: GetJsonSchemaHandler
    ) -> JsonSchemaValue:
        """Return the JSON schema for S3Uri."""
        return {
            "type": "string",
            "format": "s3-uri",
            "pattern": cls._pattern.pattern,
            "description": "An S3 URI in the format s3://bucket/key",
            "examples": ["s3://my-bucket/path/to/file.csv"],
            "title": "S3Uri",
        }
```

Key rules:
- Cloud storage URI types inherit from `CloudStorageUri`; other Pattern A types inherit from `str`
- Regex lives on the class as `_pattern: ClassVar[re.Pattern[str]]`
- Parsed properties are set as instance attributes in `__new__`
- `_validate` is a one-liner that delegates to `__new__`
- `__get_pydantic_core_schema__` delegates to `_str_type_core_schema` from `_internal.py`
- `__get_pydantic_json_schema__` returns a dict with `type`, `format`, `pattern`, `description`, `examples`, `title`
- All four methods (`__new__`, `_validate`, `__get_pydantic_core_schema__`, `__get_pydantic_json_schema__`) have one-line docstrings
- `# Source:` comment on the line directly above the class definition
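The mechanics behind these rules — a `str` subclass whose `__new__` both validates and attaches parsed attributes — can be sketched with the standard library alone. This is a hypothetical `BucketUri` type with a made-up `demo://` scheme, and a plain `ValueError` standing in for `PydanticCustomError`:

```python
from __future__ import annotations

import re
from typing import ClassVar


class BucketUri(str):
    """Hypothetical Pattern A type: a str subclass with parsed properties."""

    _pattern: ClassVar[re.Pattern[str]] = re.compile(r"^demo://([a-z0-9-]+)(/(.*))?$")

    def __new__(cls, value: str) -> BucketUri:
        """Create and validate a new BucketUri instance."""
        m = cls._pattern.match(value)
        if not m:
            raise ValueError(f"Invalid bucket URI: {value!r}")
        instance = str.__new__(cls, value)
        # Parsed components become plain instance attributes.
        instance.bucket = m.group(1)
        instance.key = m.group(3) or ""
        return instance


uri = BucketUri("demo://my-bucket/path/to/file.csv")
assert uri == "demo://my-bucket/path/to/file.csv"  # still behaves as a str
assert uri.bucket == "my-bucket"
assert uri.key == "path/to/file.csv"
```

Because `str` subclasses carry a `__dict__` by default, attribute assignment in `__new__` works without extra machinery.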
Pattern B (Annotated): use when the type only needs validation without parsed properties — just accept/reject.
```python
_EC2_INSTANCE_ID_RE = re.compile(r"^i-[0-9a-f]{8,17}$")


def _validate_ec2_instance_id(v: str) -> str:
    """Validate an EC2 instance ID format."""
    if not _EC2_INSTANCE_ID_RE.match(v):
        raise PydanticCustomError(
            "ec2_instance_id",
            "Invalid EC2 Instance ID: {value}",
            {"value": v},
        )
    return v


# Source: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/resources.html
Ec2InstanceId = Annotated[
    str,
    AfterValidator(_validate_ec2_instance_id),
    WithJsonSchema(
        {
            "type": "string",
            "pattern": r"^i-[0-9a-f]{8,17}$",
            "description": "An AWS EC2 instance ID",
            "examples": ["i-1234567890abcdef0"],
            "title": "Ec2InstanceId",
        }
    ),
]
```

Key rules:
- Regex lives at module level as `_UPPER_SNAKE_CASE_RE = re.compile(...)`
- Validator function named `_validate_<type_name>` with a one-line docstring
- `# Source:` comment directly above the `Annotated[...]` assignment
- `WithJsonSchema` dict includes `type`, `pattern`, `description`, `examples`, `title`
- Optional: `minLength`, `maxLength` when applicable
Pattern C (StrEnum): use for fixed sets of valid values like regions or zones.
```python
import sys

if sys.version_info >= (3, 11):
    from enum import StrEnum
else:
    from enum import Enum

    class StrEnum(str, Enum):
        pass


# Source: https://docs.aws.amazon.com/global-infrastructure/latest/regions/aws-regions.html
class Region(StrEnum):
    """AWS region identifiers."""

    US_EAST_1 = "us-east-1"
    US_EAST_2 = "us-east-2"
    # ...
```

Key rules:
- Python 3.10 compatibility shim at top of file
- `# Source:` comment on the line directly above the class definition (never inside the docstring)
- Member names are UPPER_SNAKE_CASE
- Member values are the actual string identifiers
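Because StrEnum members are real strings, lookup and comparison behave identically on both sides of the shim. A quick sanity check with a hypothetical `Zone` enum (note: `str()` of a member differs between the 3.10 shim and the real 3.11 `StrEnum`, so equality and value lookup are the portable operations):

```python
import sys

if sys.version_info >= (3, 11):
    from enum import StrEnum
else:
    from enum import Enum

    class StrEnum(str, Enum):  # 3.10 compatibility shim
        pass


class Zone(StrEnum):
    """Hypothetical availability-zone identifiers."""

    US_EAST_1A = "us-east-1a"
    US_EAST_1B = "us-east-1b"


assert Zone.US_EAST_1A == "us-east-1a"        # members compare equal to plain strings
assert Zone("us-east-1a") is Zone.US_EAST_1A  # value lookup returns the member
assert [z.value for z in Zone] == ["us-east-1a", "us-east-1b"]
```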
Pattern D (LabelEnum): use for AI/ML classification labels that need metadata, deprecation, and alias resolution.
```python
from pydantypes.ai.labels import Label, LabelEnum


class Sentiment(LabelEnum):
    """Sentiment classification labels."""

    POSITIVE = Label("positive", description="Expresses approval or satisfaction")
    NEGATIVE = Label("negative", description="Expresses disapproval or frustration")
    OLD_LABEL = Label("old", deprecated=True, successor="POSITIVE", aliases=["legacy"])
```

Key rules:
- Extends the `LabelEnum` base class from `pydantypes.ai.labels`
- Members are `Label(...)` objects or plain strings
- Lifecycle states: active, deprecated (warns), retired (rejects)
- Alias resolution has priority over retired values
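The alias-before-retired ordering can be illustrated with a stdlib-only sketch — not the real `LabelEnum` implementation — using the `enum` `_missing_` hook and hypothetical module-level alias/retired tables (deprecation warnings are omitted for brevity):

```python
import enum

_ALIASES = {"legacy": "positive"}  # hypothetical: alias -> canonical value
_RETIRED = {"old"}                 # hypothetical: values that must be rejected


class Sentiment(str, enum.Enum):
    """Sketch of LabelEnum-style lifecycle handling."""

    POSITIVE = "positive"
    NEGATIVE = "negative"

    @classmethod
    def _missing_(cls, value):
        """Resolve unknown values: aliases first, then reject retired ones."""
        # Alias resolution runs BEFORE the retired check, so an alias
        # always wins even if the raw value would otherwise be rejected.
        target = _ALIASES.get(value)
        if target is not None:
            return cls(target)
        if value in _RETIRED:
            raise ValueError(f"label {value!r} is retired")
        return None  # fall through to the normal "no such member" error


assert Sentiment("legacy") is Sentiment.POSITIVE  # alias resolves to a member
```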
Where a regex lives depends on the pattern:
| Pattern | Where | Example |
|---|---|---|
| A (str subclass) | `ClassVar` on the class | `_pattern: ClassVar[re.Pattern[str]] = re.compile(...)` |
| B (Annotated) | Module level | `_EC2_INSTANCE_ID_RE = re.compile(...)` |
Module-level regexes use the naming convention `_UPPER_SNAKE_CASE_RE`. The `_RE` suffix is mandatory — `_PATTERN` is prohibited.
Pydantic v2 uses the Rust regex crate internally for `Field(pattern=...)` constraints — faster, with linear-time guarantees. We use Python's `re.compile` instead for two reasons:

- Pattern A types need capture groups. Our `__new__` methods extract parsed properties (`.bucket`, `.key`, etc.) from match groups. Pydantic's Rust regex only does accept/reject — it cannot return match groups to Python.
- Custom error messages. `AfterValidator` + `PydanticCustomError` gives us control over error codes and messages. Pydantic's built-in `pattern` constraint produces generic errors.
This is not a performance concern. Our regexes are simple (anchored, no backtracking risk), and the Python function call overhead already dominates over the regex match itself.
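The capture-group point is easy to see with a minimal pattern (a simplified S3-style regex for illustration, not the production one):

```python
import re

# Simplified; the real S3 pattern also enforces bucket naming rules.
pattern = re.compile(r"^s3://([^/]+)(?:/(.*))?$")

m = pattern.match("s3://my-bucket/path/to/file.csv")
assert m is not None
# An accept/reject engine answers only "valid?"; Python match groups also
# hand back the components that __new__ stores as .bucket and .key.
assert m.group(1) == "my-bucket"
assert m.group(2) == "path/to/file.csv"
```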
Every type must have a `# Source:` comment on the line directly above its definition, linking to the official documentation. This applies uniformly to all patterns — no source URLs inside docstrings.
```python
# Pattern A (str subclass):
# Source: https://docs.aws.amazon.com/...
class S3Uri(str):
    """An S3 URI like s3://bucket/key with parsed properties."""


# Pattern B (Annotated):
# Source: https://docs.aws.amazon.com/...
Ec2InstanceId = Annotated[...]


# Pattern C (StrEnum):
# Source: https://docs.aws.amazon.com/...
class Region(StrEnum):
    """AWS region identifiers."""
```

Rules:
- URL must be verified to load and be the correct reference for the type
- Prefer official provider docs (AWS, Azure, GCP), RFCs, or specification documents
- Never put source URLs inside docstrings
Every module, class, and function has a docstring. No exceptions.
- Module: `"""AWS storage types."""` — short, descriptive
- Pattern A class: `"""An S3 URI like s3://bucket/key with parsed properties."""` — one-liner
- Pattern A methods: every `__new__`, `_validate`, `__get_pydantic_core_schema__`, and `__get_pydantic_json_schema__` must have a one-line docstring. Templates:
  - `__new__`: `"""Create and validate a new {ClassName} instance."""`
  - `_validate`: `"""Validate a string as a {description}."""`
  - `__get_pydantic_core_schema__`: `"""Return the Pydantic core schema for {ClassName}."""`
  - `__get_pydantic_json_schema__`: `"""Return the JSON schema for {ClassName}."""`
- Pattern B validator: every `_validate_*` function must have a one-line docstring: `"""Validate a {type description} format."""`
- Pattern C class: `"""AWS region identifiers."""` — one-liner
- `__init__.py`: `"""AWS cloud resource types."""` — descriptive module docstring
- Test module: `"""Tests for AWS storage types."""`
`src/pydantypes/cloud/<provider>/<domain>.py`
Each file groups related types by domain (storage, compute, network, etc.). Within a file:
- `from __future__ import annotations` first
- Standard library imports
- Third-party imports (pydantic)
- Local imports (`from pydantypes._internal import _str_type_core_schema`)
- Module-level regexes (Pattern B)
- Validator functions (Pattern B)
- Type aliases (Pattern B) or classes (Pattern A)
"""AWS cloud resource types."""
from pydantypes.cloud.aws.arn import Arn, IamRoleArn, SnsTopicArn
from pydantypes.cloud.aws.compute import AmiId, Ec2InstanceId, EcsClusterName
__all__ = [
"AccountId",
"AmiId",
"Arn",
# ... alphabetically sorted
]Rules:
- Explicit imports, no star imports
- `__all__` is alphabetically sorted
- Every public type is listed in `__all__`
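One way to enforce the sorted-`__all__` rule mechanically — a hypothetical test helper, not an existing pydantypes test:

```python
import importlib


def assert_all_sorted(module_name: str) -> None:
    """Assert that a module exports an alphabetically sorted __all__."""
    mod = importlib.import_module(module_name)
    exported = list(mod.__all__)
    assert exported == sorted(exported), f"{module_name}.__all__ is not sorted"
```

A test file could then call e.g. `assert_all_sorted("pydantypes.cloud.aws")` once per package.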
All validation errors use `PydanticCustomError`:
```python
raise PydanticCustomError(
    "s3_bucket_name",                   # snake_case error code
    "Invalid S3 bucket name: {value}",  # template message
    {"value": v},                       # context dict
)
```

- Error codes are snake_case, named after the type
- Messages include a `{value}` placeholder for the rejected input
- Specific constraint violations get descriptive messages: `"Invalid S3 bucket name: must not contain consecutive dots. Got: {value}"`
- Always use exception chaining when wrapping: `raise Error(...) from e`
`format` identifiers are kebab-case. Provider-scoped types take a provider prefix; generic types use the bare name:

```
<provider>-<type-name>  # "aws-arn", "gcs-uri", "gcp-project-id"
<type-name>             # "jwt", "mime-type", "docker-image-ref"
```
Pattern A returns from `__get_pydantic_json_schema__`:

```python
{"type": "string", "format": "...", "pattern": "...", "description": "...", "examples": [...], "title": "..."}
```

Pattern B uses `WithJsonSchema(...)`:

```python
{"type": "string", "pattern": "...", "description": "...", "examples": [...], "title": "...", "minLength": N, "maxLength": N}
```

Pattern A includes `format`. Pattern B omits it (the `title` serves as identifier). Both always include `pattern`, `description`, `examples`, and `title`.
`_internal.py` exports a single function:

```python
def _str_type_core_schema(cls, source_type, handler) -> CoreSchema: ...
```

This builds the Pydantic core schema for all Pattern A types. It handles:

- JSON deserialization → validates via `cls._validate`
- Python instantiation → passes through existing instances of `cls`, otherwise validates
- Serialization → plain `str(v)`

All Pattern A types delegate to this in `__get_pydantic_core_schema__`.
Mirror source: `src/pydantypes/cloud/aws/storage.py` → `tests/cloud/aws/test_storage.py`

One test file per source file. Module docstring: `"""Tests for AWS storage types."""`

All tests use flat parametrized functions — class-based test grouping (`TestFooValid`, `TestFooInvalid`) is prohibited.
Each type gets a simple wrapper model:

```python
class S3UriModel(BaseModel):
    uri: S3Uri
```

Every type must have these tests:
- Valid values — parametrized, assert roundtrip `str(model.field) == value`
- Invalid values — parametrized, assert `ValidationError`
- Serialization — `model.model_dump()["field"] == value`
- JSON schema — verify `type`, `format` (Pattern A) or `pattern` (Pattern B)
- Existing instance identity — `TypeName(value)` passed to model yields the same object (`is`)
- Parsed properties (Pattern A only) — verify each extracted attribute
Test functions follow this naming scheme:

```
test_valid_<type_name>
test_invalid_<type_name>
test_<type_name>_properties
test_<type_name>_serialization
test_<type_name>_json_schema
test_<type_name>_existing_instance
```
```python
@pytest.mark.parametrize("value", ["valid-1", "valid-2"])
def test_valid_ec2_instance_id(value: str) -> None:
    m = Ec2Model(instance=value)
    assert m.instance == value


@pytest.mark.parametrize("value", ["bad-1", "bad-2", ""])
def test_invalid_ec2_instance_id(value: str) -> None:
    with pytest.raises(ValidationError):
        Ec2Model(instance=value)
```

- Target: Python 3.10+
- Use `from __future__ import annotations` in every file
- Use modern syntax: `str | None`, `list[str]`, `dict[str, Any]`
- StrEnum compatibility shim for Python 3.10 (inline in files that need it)
- No lazy imports — everything at file top
When a URI type contains a component that has its own standalone type, the URI validator delegates to the component's validator. This avoids duplicating rules.
```python
# S3Uri.__new__ delegates bucket validation:
bucket = m.group(1)
_validate_s3_bucket_name(bucket)  # reuses S3BucketName's validator

# GcsUri.__new__ delegates bucket validation:
bucket = m.group(1)
_validate_gcs_bucket_name(bucket)  # reuses GcsBucketName's validator
```

All cloud storage URI types (`S3Uri`, `GcsUri`, `BlobStorageUri`) inherit from `CloudStorageUri` (`cloud/_base.py`), which provides:
- Unified interface — `.bucket` and `.key` across all providers
- Path helpers — `.name`, `.suffix`, `.stem`, `.parent_key`, `.suffixes`, `.parts`
- Heuristic helpers — `.is_file`, `.is_folder` (based on key naming conventions)
Path helpers delegate to `PurePosixPath(self.key)` internally.
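Because of that delegation, the helper semantics match `pathlib` exactly. A quick check of what a typical key yields (the comments map each `pathlib` property to the assumed helper name; `.parent_key` is the one mapping inferred from the list above):

```python
from pathlib import PurePosixPath

key = "data/2024/report.tar.gz"
p = PurePosixPath(key)

assert p.name == "report.tar.gz"      # → .name
assert p.suffix == ".gz"              # → .suffix
assert p.suffixes == [".tar", ".gz"]  # → .suffixes
assert p.stem == "report.tar"         # → .stem
assert str(p.parent) == "data/2024"   # → .parent_key (as a string)
assert p.parts == ("data", "2024", "report.tar.gz")  # → .parts
```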
| Concept | Unified name | Provider-specific aliases |
|---|---|---|
| Bucket/container | `.bucket` | Azure: `.container` |
| Object path | `.key` | Azure: `.blob_path` |
`S3Uri` and `GcsUri` use only the unified names. `BlobStorageUri` exposes both the unified names (from the base class) and Azure-specific aliases (`.account_name`, `.container`, `.blob_path`).

Subclasses must set `bucket` and `key` as instance attributes in `__new__`. The base class does not define `__new__` — each provider has its own regex and validation. Pydantic integration (`__get_pydantic_core_schema__`, `__get_pydantic_json_schema__`) remains on each subclass.
Every type in pydantypes uses `str` as its base — even when the underlying format is a well-known type like UUID, integer, or date. Azure `SubscriptionId` and `TenantId` are UUIDs, but they are `Annotated[str, ...]`, not `Annotated[UUID, ...]`.
Why:
- Consistency — every type behaves the same way: you put a string in, you get a string out. No surprises where one type returns a `UUID` object and another returns `str`.
- Serialization — `str` round-trips cleanly through JSON, YAML, TOML, environment variables, and CLI args without custom serializers.
- Semantic layer, not type conversion — pydantypes validates and constrains identifiers. It does not convert them into richer Python objects. That is a different concern.
If a user needs a `uuid.UUID` object, they can call `uuid.UUID(model.subscription_id)` themselves. Our job is to validate the format, not change the runtime type.
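That conversion is a one-liner at the call site (illustrative UUID value, not a real subscription ID):

```python
import uuid

subscription_id = "6f9619ff-8b86-d011-b42d-00c04fc964ff"  # already validated as a str
as_uuid = uuid.UUID(subscription_id)

assert isinstance(as_uuid, uuid.UUID)
assert str(as_uuid) == subscription_id  # canonical lowercase form round-trips
```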
pydantypes fills gaps — it never reimplements types already provided by Pydantic core or pydantic-extra-types. Before adding any new type, verify it is not covered below.
Provided by Pydantic core:

- URLs: `AnyUrl`, `AnyHttpUrl`, `HttpUrl`, `AnyWebsocketUrl`, `WebsocketUrl`, `FileUrl`, `FtpUrl`
- DSNs: `PostgresDsn`, `CockroachDsn`, `MySQLDsn`, `MariaDBDsn`, `RedisDsn`, `MongoDsn`, `ClickHouseDsn`, `SnowflakeDsn`, `AmqpDsn`, `KafkaDsn`, `NatsDsn`
- Email: `EmailStr`, `NameEmail`
- IP: `IPvAnyAddress`, `IPvAnyInterface`, `IPvAnyNetwork`
- UUIDs: `UUID1`, `UUID3`, `UUID4`, `UUID5`, `UUID6`, `UUID7`, `UUID8`
- Paths: `FilePath`, `DirectoryPath`, `NewPath`
- Secrets: `SecretStr`, `SecretBytes`
- Encoding: `Base64Bytes`, `Base64Str`, `Base64UrlBytes`, `Base64UrlStr`
- Other: `ByteSize`, `ImportString`, `Json`, `PaymentCardNumber` (deprecated)
- schwifty — `IBAN`, `BIC` (banking identifiers with native `__get_pydantic_core_schema__` support)
Provided by pydantic-extra-types:

- `Color`, `RGBA` (hex, RGB, HSL, named colors)
- `Coordinate`, `Latitude`, `Longitude`
- `CountryAlpha2`, `CountryAlpha3`, `CountryNumericCode`, `CountryShortName`
- `CronStr` (cron expressions via `cron-converter`)
- `DomainStr` (basic domain name validation)
- `epoch.Number`, `epoch.Integer` (datetime from unix timestamp)
- `ISO4217`, `Currency` (currency codes)
- `ISBN`
- `ISO_15924` (script codes)
- `LanguageAlpha2`, `LanguageName`, `ISO639_3`, `ISO639_5`
- `MacAddress`
- `MimeType` (IANA whitelist lookup — distinct from pydantypes' RFC 6838 format validator)
- `MongoObjectId`
- `PaymentCardNumber`, `PaymentCardBrand`
- `PhoneNumber`
- `ABARoutingNumber`
- `S3Path` (basic S3 path — distinct from pydantypes' `S3Uri` with full property extraction)
- `SemanticVersion`, `ManifestVersion`
- `TimeZoneName`
- `ULID`
- `UUID6`, `UUID7`, `UUID8`
- Path types: `ExistingPath`, `ResolvedFilePath`, `ResolvedDirectoryPath`, `ResolvedNewPath`
- Pendulum datetime types (`DateTime`, `Date`, `Time`, `Duration`)