Description
Overview
This issue outlines the implementation plan for adding field-level metadata support to Python Fory (pyfory), enabling fine-grained control over serialization behavior per field. This follows the pattern established by Rust (#[fory(...)] attributes) and Go (fory:"..." struct tags) implementations.
Related Issues:
- GitHub Issue [Python] Create field descriptor for dataclass field metadata #3002 (Python field descriptor)
- GitHub Issue [Java] create @ForyField annotation to provide extra meta for perf/space optimization #3000 (Java @ForyField annotation)
- [C++] Field metadata for optimized serialization #3003
- [Rust] Add #[fory()] field attributes for optimization metadata #3004
- RoadMap for 1.0 #1017
Design Goals
- Python-Idiomatic API: Leverage dataclasses.field() with Fory-specific metadata
- Explicit Over Implicit: Users explicitly control nullable/ref per field
- Zero Breaking Changes: Existing code without field metadata works unchanged
- Performance First: Pre-compute field info at registration time, optimize JIT codegen
- Cross-Language Compatibility: Support TAG_ID encoding per xlang spec
Key Design Decisions
Default Values
| Option | Default | Notes |
|---|---|---|
| id | required | Must specify: -1 for field name, >=0 for tag ID |
| nullable | False | No null flag written; exception: Optional[T] is always True |
| ref | False | No reference tracking; must explicitly enable with ref=True |
| ignore | False | Field is serialized |
Nullable Rules
- Non-Optional fields: nullable=False by default (no null flag, saves 1 byte)
- Optional[T] fields: nullable=True is required; setting nullable=False raises ValueError
- Explicit override: Use nullable=True for non-Optional fields that may be None
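The nullable rules above can be sketched as a small resolver. This is only an illustration, not pyfory code: `is_optional` and `resolve_nullable` are hypothetical names (the proposal itself reuses `is_optional_type()` from pyfory/type.py), and the sketch assumes `None` is used as a "not specified" sentinel so that an omitted flag can be told apart from an explicit `nullable=False`, which the proposal does not spell out. It only recognizes `typing.Optional`/`typing.Union` hints.

```python
from typing import Optional, Union, get_args, get_origin


def is_optional(type_hint) -> bool:
    """Hypothetical stand-in: True for Optional[T], i.e. a Union containing NoneType."""
    return get_origin(type_hint) is Union and type(None) in get_args(type_hint)


def resolve_nullable(field_name: str, type_hint, nullable: Optional[bool] = None) -> bool:
    """Return the effective nullable flag for one field.

    Assumption: nullable=None means "not specified by the user".
    """
    if is_optional(type_hint):
        if nullable is False:
            # Explicitly contradicting the type hint is rejected.
            raise ValueError(
                f"Field '{field_name}' is Optional[T] but nullable=False."
            )
        return True  # Optional[T] always writes a null flag
    # Non-Optional fields: False unless the user opts in explicitly.
    return bool(nullable)


print(resolve_nullable("age", int))               # non-Optional, flag omitted
print(resolve_nullable("email", Optional[str]))   # Optional, flag omitted
print(resolve_nullable("nickname", str, True))    # explicit override
```

A tri-state flag is one possible way to reconcile "automatically True for Optional[T]" with "nullable=False raises ValueError"; other designs may be equally valid.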
Ref Tracking Rules
- All fields: ref=False by default (no IdentityMap overhead)
- Explicit enable: Use ref=True for fields with circular/shared references
- Global override: If Fory(ref_tracking=False), ALL fields use ref=False regardless of field setting
- Hash computation: Uses field-level ref setting only (stable, independent of Fory config)
- Serializer generation: Combines field-level ref with global config (global False overrides field True)
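The split between the hash-time and runtime ref flags described above can be illustrated with a few lines of Python. `RefFlags` and `resolve_ref` are hypothetical names used only for this sketch; the only logic taken from the proposal is that the fingerprint sees the field-level value, while serialization uses `field.ref AND fory.ref_tracking`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefFlags:
    hash_ref: bool     # value that goes into the struct fingerprint
    runtime_ref: bool  # value the generated serializer actually uses


def resolve_ref(field_ref: bool, global_ref_tracking: bool) -> RefFlags:
    """Combine a field-level ref setting with the global Fory config."""
    return RefFlags(
        hash_ref=field_ref,                          # independent of global config
        runtime_ref=field_ref and global_ref_tracking,  # global False wins
    )


# A field declared with ref=True under two different global configurations:
print(resolve_ref(True, global_ref_tracking=True))
print(resolve_ref(True, global_ref_tracking=False))
```

In both calls `hash_ref` stays the same, which is what keeps the fingerprint stable across Fory instances.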
API Design
Core API: field() Function
from dataclasses import MISSING
from typing import Any, Callable, Mapping, Optional

def field(
    id: int,  # Tag ID (required): -1 = use field name, >=0 = use numeric tag ID
    *,
    # Fory-specific options
    nullable: bool = False,  # Whether null flag is written (default: False, auto-True for Optional[T])
    ref: bool = False,       # Whether ref tracking enabled (default: False, must explicitly enable)
    ignore: bool = False,    # Whether to ignore this field during serialization
    # Standard dataclass.field() options (passthrough)
    default: Any = MISSING,
    default_factory: Callable[[], Any] = MISSING,
    init: bool = True,
    repr: bool = True,
    hash: Optional[bool] = None,
    compare: bool = True,
    metadata: Optional[Mapping[str, Any]] = None,
    **kwargs,  # Forward any additional args to dataclasses.field()
) -> Any:
    """
    Create a field with Fory-specific serialization metadata.

    This wraps dataclasses.field() and stores Fory configuration in field.metadata.

    Args:
        id: Field tag ID (required positional parameter).
            - -1: Use field name with meta string encoding
            - >=0: Use numeric tag ID (more compact, stable across renames)
            Must be unique within the class (except -1).
            Required to force explicit choice about schema evolution strategy.
        nullable: Whether to write null flag for this field.
            - False (default): Skip null flag, field cannot be None
            - True: Write null flag (1 byte overhead), field can be None
            Note: For Optional[T] fields, nullable is automatically True
            regardless of this parameter.
        ref: Whether to enable reference tracking for this field.
            - False (default): No tracking, skip IdentityMap overhead
            - True: Track references (handles circular refs, shared objects)
            Note: Must be explicitly set to True when needed. Not inherited
            from Fory instance's ref_tracking config.
        ignore: Whether to ignore this field during serialization.
            - True: Field is excluded from serialization
            - False (default): Field is serialized
        default, default_factory, init, repr, hash, compare, metadata:
            Standard dataclass.field() parameters, passed through.

    Returns:
        A dataclass field descriptor with Fory metadata attached.

    Example:
        @dataclass
        class User:
            # Compact encoding with tag ID 0, non-nullable
            name: str = pyfory.field(0)
            # Tag ID 1, explicitly nullable
            email: Optional[str] = pyfory.field(1, nullable=True)
            # Tag ID 2, enable ref tracking (forward reference is quoted)
            friends: List["User"] = pyfory.field(2, ref=True, default_factory=list)
            # Use field name encoding (id=-1), ignore this field
            _cache: dict = pyfory.field(-1, ignore=True, default_factory=dict)
    """

ForyFieldMeta Data Class
from dataclasses import dataclass

from pyfory.type import is_optional_type

@dataclass(frozen=True)
class ForyFieldMeta:
    """Parsed Fory field metadata extracted from field.metadata."""
    id: int                 # Required: -1 = use field name, >=0 = use tag ID
    nullable: bool = False  # Whether null flag is written
    ref: bool = False       # Whether ref tracking is enabled
    ignore: bool = False    # Whether to ignore this field

    def uses_tag_id(self) -> bool:
        """Returns True if this field uses tag ID encoding."""
        return self.id >= 0

    def validate_nullable(self, field_name: str, type_hint: type) -> None:
        """
        Validate nullable setting against type hint.

        Raises:
            ValueError: If Optional[T] field has nullable=False
        """
        if is_optional_type(type_hint) and not self.nullable:
            raise ValueError(
                f"Field '{field_name}' is Optional[T] but nullable=False. "
                f"Optional fields must have nullable=True (or omit the parameter)."
            )

    def effective_nullable(self, type_hint: type) -> bool:
        """
        Returns effective nullable value.

        Rules:
        - Optional[T] fields must have nullable=True (validated separately)
        - Other fields use the configured nullable value (default: False)
        """
        if is_optional_type(type_hint):
            return True  # Already validated that nullable=True
        return self.nullable

    def effective_ref(self) -> bool:
        """Returns ref tracking value (no inheritance from global config)."""
        return self.ref

Metadata Storage
Fory metadata is stored in field.metadata["__fory__"]:
FORY_FIELD_METADATA_KEY = "__fory__"

def field(...) -> Any:
    # Build Fory metadata
    fory_meta = ForyFieldMeta(
        id=id,
        nullable=nullable,
        ref=ref,
        ignore=ignore,
    )
    # Merge with user-provided metadata
    combined_metadata = dict(metadata) if metadata else {}
    combined_metadata[FORY_FIELD_METADATA_KEY] = fory_meta
    # Create dataclass field with combined metadata
    return dataclasses.field(
        default=default,
        default_factory=default_factory,
        init=init,
        repr=repr,
        hash=hash,
        compare=compare,
        metadata=combined_metadata,
        **kwargs,  # Forward any additional args
    )

Type Utilities
Reuse existing utilities from pyfory/type.py:
from pyfory.type import is_optional_type, unwrap_optional

# is_optional_type(type_) -> bool
#   Check if type is Optional[T] or Union[T, None]
# unwrap_optional(type_, field_nullable=False) -> tuple[type, bool]
#   Unwrap Optional[T] to (T, True) or return (type_, False)

No new utilities are needed: type.py already provides these functions.
Implementation Architecture
Component Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ User Code │
│ │
│ @dataclass │
│ class User: │
│ name: str = pyfory.field(0) │
│ email: Optional[str] = pyfory.field(1, nullable=True) │
│ age: int32 = pyfory.field(2) │
│ _cache: dict = pyfory.field(-1, ignore=True, default_factory=dict) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ Type Registration (Fory.register()) │
│ │
│ 1. Extract field metadata from dataclass fields │
│ 2. Validate tag IDs are unique │
│ 3. Compute effective nullable/ref per field │
│ 4. Filter out ignored fields │
│ 5. Build FieldInfo list with pre-computed flags │
│ 6. Compute struct fingerprint (includes field metadata) │
│ 7. Create DataClassSerializer with field metadata │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ DataClassSerializer │
│ │
│ Fields: │
│ - _schema_field_infos: List[FieldInfo] # From type hints (for hash) │
│ - _runtime_field_infos: List[FieldInfo] # With global config applied │
│ - _hash: int32 # Schema fingerprint hash │
│ │
│ JIT Codegen: │
│ - Generate write/read methods with field-specific logic │
│ - Skip null flag for non-nullable fields │
│ - Skip ref tracking for ref=False fields │
│ - Use tag ID encoding when id >= 0 │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
FieldInfo Data Structure
@dataclass
class FieldInfo:
    """Pre-computed field information for serialization."""
    # Identity
    name: str        # Field name (snake_case)
    index: int       # Field index in dataclass
    type_hint: type  # Type annotation
    # Fory metadata (from pyfory.field()) - used for hash computation
    tag_id: int      # Required: -1 or >=0
    nullable: bool   # Effective nullable flag (considers Optional[T])
    ref: bool        # Field-level ref setting (for hash computation)
    ignore: bool     # Whether to ignore (always False in this list)
    # Runtime flags (combines field metadata with global Fory config)
    runtime_ref_tracking: bool  # Actual ref tracking: field.ref AND fory.ref_tracking
    # Derived info
    type_id: int            # Fory TypeId
    serializer: Serializer  # Field serializer
    unwrapped_type: type    # Type with Optional unwrapped
    # Pre-computed flags for codegen (based on runtime flags)
    needs_null_flag: bool     # Whether to write null flag
    needs_ref_tracking: bool  # Whether to track references (= runtime_ref_tracking)
    uses_tag_id: bool         # Whether to use tag ID encoding

Key distinction:
- ref: Field-level setting from pyfory.field(), used for hash computation (stable)
- runtime_ref_tracking: field.ref AND fory.ref_tracking, used for actual serialization
Implementation Steps
Phase 1: Core API and Metadata Extraction
Files to modify/create:
- python/pyfory/field.py (NEW) - field() function and ForyFieldMeta
- python/pyfory/__init__.py - Export field
Tasks:
- Create field() function wrapping dataclasses.field()
- Create ForyFieldMeta dataclass for parsed metadata
- Create extract_field_meta() helper to read metadata from fields
- Create validate_field_metas() to check:
  - Tag ID uniqueness (no duplicate IDs >= 0)
  - Nullable consistency (Optional[T] with nullable=False raises ValueError)
- Reuse is_optional_type() and unwrap_optional() from pyfory/type.py
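As a rough sketch of the Phase 1 helpers, the snippet below stores options under the `__fory__` metadata key and then reads them back with the standard `dataclasses.fields()` API. It is not the proposed implementation: `fory_field` is a hypothetical stand-in for `pyfory.field()` that stores a plain dict instead of `ForyFieldMeta`, and the bodies of `extract_field_meta()` and `validate_field_metas()` are assumptions that cover only the tag ID uniqueness check.

```python
import dataclasses
from dataclasses import dataclass
from typing import Any, Dict, Optional

FORY_FIELD_METADATA_KEY = "__fory__"  # key named in the proposal


def fory_field(id: int, *, nullable=False, ref=False, ignore=False, **kwargs) -> Any:
    """Hypothetical stand-in for pyfory.field(); stores options as a dict."""
    meta = {"id": id, "nullable": nullable, "ref": ref, "ignore": ignore}
    return dataclasses.field(metadata={FORY_FIELD_METADATA_KEY: meta}, **kwargs)


def extract_field_meta(f: dataclasses.Field) -> Optional[dict]:
    """Return the Fory metadata of a dataclass field, or None if absent."""
    return f.metadata.get(FORY_FIELD_METADATA_KEY)


def validate_field_metas(cls) -> Dict[str, dict]:
    """Collect metadata per field name and reject duplicate tag IDs >= 0."""
    seen: Dict[int, str] = {}  # tag ID -> first field name using it
    metas: Dict[str, dict] = {}
    for f in dataclasses.fields(cls):
        meta = extract_field_meta(f)
        if meta is None:
            continue  # field declared without Fory metadata
        tag_id = meta["id"]
        if tag_id >= 0:
            if tag_id in seen:
                raise ValueError(
                    f"Duplicate tag ID {tag_id} on fields "
                    f"'{seen[tag_id]}' and '{f.name}'"
                )
            seen[tag_id] = f.name
        metas[f.name] = meta
    return metas


@dataclass
class User:
    name: str = fory_field(0)
    email: Optional[str] = fory_field(1, nullable=True)
    cache: dict = fory_field(-1, ignore=True, default_factory=dict)


print(validate_field_metas(User)["email"])
```

Fields that use `id=-1` are skipped by the uniqueness check, matching the "unique within the class (except -1)" rule.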
Phase 2: Serializer Integration
Files to modify:
- python/pyfory/struct.py - DataClassSerializer
Tasks:
- Modify DataClassSerializer.__init__() to extract field metadata
- Create FieldInfo dataclass for pre-computed field info
- Update _get_field_names() to filter out ignored fields
- Update _nullable_fields computation to use field metadata
- Add _ref_tracking_fields dict for per-field ref control
Phase 3: Fingerprint Computation
Files to modify:
- python/pyfory/struct.py - compute_struct_fingerprint()
Tasks:
- Update fingerprint format to include field metadata: <field_id_or_name>,<type_id>,<ref>,<nullable>;
- Use tag ID as sort key when id >= 0, otherwise use field name
- Include ref flag in fingerprint (from field annotation, not runtime config)
- Update compute_struct_meta() to pass field metadata
Phase 4: JIT Codegen Updates
Files to modify:
- python/pyfory/struct.py - _gen_write_method(), _gen_read_method(), etc.
- python/pyfory/codegen.py - Code generation utilities
Tasks:
- Update write codegen to skip null flag when nullable=False
- Update write codegen to skip ref tracking when ref=False
- Update read codegen accordingly
- Update xwrite/xread methods for xlang mode
- Optimize primitive field serialization for non-nullable fields
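To make the codegen idea concrete, here is a toy generator that emits a per-class write function and omits the null flag for non-nullable fields. It is a simplified illustration under stated assumptions, not pyfory's `_gen_write_method()`: the names `gen_write_source` and `compile_write` are hypothetical, and a plain Python list stands in for the real binary buffer.

```python
from types import SimpleNamespace


def gen_write_source(fields):
    """Build source for write(buf, obj); fields is a list of (name, nullable)."""
    lines = ["def write(buf, obj):"]
    for name, nullable in fields:
        if nullable:
            # Nullable field: emit a null flag, then the value if present.
            lines += [
                f"    value = obj.{name}",
                "    buf.append(value is None)  # null flag",
                "    if value is not None:",
                "        buf.append(value)",
            ]
        else:
            # Non-nullable field: the flag is never generated at all.
            lines.append(f"    buf.append(obj.{name})  # no null flag")
    lines.append("    return buf")
    return "\n".join(lines)


def compile_write(fields):
    """Compile the generated source once and return the write function."""
    namespace = {}
    exec(gen_write_source(fields), namespace)
    return namespace["write"]


write = compile_write([("name", False), ("email", True)])
print(gen_write_source([("name", False), ("email", True)]))
print(write([], SimpleNamespace(name="Alice", email=None)))
```

Because the decision is made while generating the source, the resulting function contains no per-call branching on field metadata.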
Phase 5: TypeDef Encoding (Compatible Mode)
Files to modify:
- python/pyfory/meta/typedef_encoder.py
- python/pyfory/meta/typedef_decoder.py
Tasks:
- Update TypeDef encoding to use tag ID when available
- Write field header with TAG_ID encoding (2 bits = 0b11)
- Write tag ID as varint instead of field name
- Include nullable and ref flags in field header
- Update decoder to handle TAG_ID encoded fields
Phase 6: Cython Integration
Files to modify:
- python/pyfory/serialization.pyx
Tasks:
- Add FieldInfo handling in Cython serialization code
- Optimize field metadata access for Cython
- Update Cython read/write paths for field metadata
Phase 7: Testing
Files to create:
- python/pyfory/tests/test_field_meta.py (NEW)
- Update python/pyfory/tests/xlang_test_main.py
Test cases:
- Basic pyfory.field() usage with all options
- Tag ID uniqueness validation
- Type-based nullable inference
- Ref tracking per field
- Ignored fields not serialized
- Fingerprint computation with field metadata
- Cross-language compatibility with Java/Rust/Go
- Schema evolution with tag IDs
- Mixed fields (some with metadata, some without)
Fingerprint Format
The struct fingerprint format (matching Java/Rust/Go):
<field_id_or_name>,<type_id>,<ref>,<nullable>;
Components:
- field_id_or_name: Tag ID as string (e.g., "0", "1") if id >= 0, otherwise snake_case field name
- type_id: Fory TypeId as decimal string (e.g., "4" for INT32)
- ref: "1" if ref=True in field annotation, "0" otherwise (NOT affected by global Fory config)
- nullable: "1" if null flag is written, "0" otherwise
Important: The fingerprint uses field-level ref setting only, independent of Fory(ref_tracking=...).
This ensures the hash is stable across different Fory instances with different configurations.
Example fingerprints:
# With tag IDs:
0,4,0,0;1,12,0,1;2,0,0,1;
# With field names:
age,4,0,0;email,12,0,1;name,9,0,0;
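The fingerprint format above can be sketched as a small string builder. This is an assumption-laden illustration rather than the actual `compute_struct_fingerprint()`: `FieldSpec` and `compute_fingerprint` are hypothetical names, the type ID 12 is a placeholder (only "4" for INT32 comes from the text), and the ordering of tag-ID fields before name-keyed fields in a mixed struct is a guess that the proposal does not specify.

```python
from typing import List, NamedTuple


class FieldSpec(NamedTuple):
    name: str       # snake_case field name
    tag_id: int     # -1 = use field name, >=0 = use tag ID
    type_id: int    # Fory TypeId (placeholder values in this sketch)
    ref: bool       # field-level ref setting, not the runtime flag
    nullable: bool  # whether a null flag is written


def compute_fingerprint(fields: List[FieldSpec]) -> str:
    """Build '<field_id_or_name>,<type_id>,<ref>,<nullable>;' per field."""

    def sort_key(f: FieldSpec):
        # Assumption: tag-ID fields sort numerically and precede named fields.
        return (0, f.tag_id, "") if f.tag_id >= 0 else (1, 0, f.name)

    parts = []
    for f in sorted(fields, key=sort_key):
        key = str(f.tag_id) if f.tag_id >= 0 else f.name
        parts.append(f"{key},{f.type_id},{int(f.ref)},{int(f.nullable)};")
    return "".join(parts)


by_tag = [
    FieldSpec("email", 1, 12, False, True),
    FieldSpec("age", 0, 4, False, False),
]
by_name = [
    FieldSpec("name", -1, 12, False, False),
    FieldSpec("age", -1, 4, False, False),
]
print(compute_fingerprint(by_tag))   # 0,4,0,0;1,12,0,1;
print(compute_fingerprint(by_name))  # age,4,0,0;name,12,0,0;
```

Note that renaming a field with a tag ID leaves its fingerprint entry unchanged, while renaming a name-keyed field changes it.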
Hash computation:
fingerprint = compute_struct_fingerprint(fields)
hash_bytes = fingerprint.encode("utf-8")
full_hash = murmurhash3_x64_128(hash_bytes, seed=47)
type_hash_32 = int32(full_hash & 0xFFFFFFFF)

TypeDef Field Header Encoding
Per xlang spec, field header is 8 bits:
2 bits encoding + 4 bits size + 1 bit nullable + 1 bit ref_tracking
TAG_ID encoding (when id >= 0):
| 2 bits encoding (0b11) | 4 bits tag_id | 1 bit nullable | 1 bit ref_tracking |
When tag ID > 15, write additional varint for (tag_id - 15).
Field name encoding (when id < 0):
| 2 bits encoding (0b00-10) | 4 bits name_size | 1 bit nullable | 1 bit ref_tracking |
| meta string encoded field name |
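The 8-bit header layout can be illustrated with plain bit shifts. This sketch assumes the fields are packed most-significant-bit first in the order they are listed above (encoding, size or tag ID, nullable, ref_tracking); the xlang spec remains the authority on the actual bit order. `pack_field_header` and `unpack_field_header` are hypothetical helper names, and the varint extension for large tag IDs is deliberately left out.

```python
ENCODING_TAG_ID = 0b11  # 2-bit encoding value for TAG_ID, per the layout above


def pack_field_header(encoding: int, size: int, nullable: bool, ref: bool) -> int:
    """Pack the four header fields into one byte (assumed MSB-first order)."""
    if not 0 <= encoding <= 0b11:
        raise ValueError("encoding must fit in 2 bits")
    if not 0 <= size <= 0b1111:
        raise ValueError("size/tag_id must fit in 4 bits; larger values need the varint extension")
    return (encoding << 6) | (size << 2) | (int(nullable) << 1) | int(ref)


def unpack_field_header(byte: int):
    """Inverse of pack_field_header: (encoding, size, nullable, ref)."""
    return (
        (byte >> 6) & 0b11,
        (byte >> 2) & 0b1111,
        bool((byte >> 1) & 1),
        bool(byte & 1),
    )


# Tag ID 2, nullable, no ref tracking
header = pack_field_header(ENCODING_TAG_ID, 2, nullable=True, ref=False)
print(f"{header:#010b}")            # 0b11001010
print(unpack_field_header(header))  # (3, 2, True, False)
```

With this layout a tag-ID field costs a single header byte for small IDs, whereas a name-encoded field also carries the meta-string-encoded name.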
API Examples
Basic Usage
from dataclasses import dataclass
from typing import Optional, List

import pyfory
from pyfory import int32

@dataclass
class User:
    # Compact tag ID encoding, non-nullable (saves 1 byte)
    id: int32 = pyfory.field(0)
    # Tag ID 1, non-nullable string
    name: str = pyfory.field(1)
    # Tag ID 2, nullable (required for Optional)
    email: Optional[str] = pyfory.field(2, nullable=True)
    # Tag ID 3, enable ref tracking for circular references
    friends: List["User"] = pyfory.field(3, ref=True, default_factory=list)
    # Use field name encoding (-1), ignore this field (not serialized)
    _cache: dict = pyfory.field(-1, ignore=True, default_factory=dict)

# Usage
fory = pyfory.Fory(ref_tracking=True)
fory.register(User)
user = User(id=1, name="Alice", email="alice@example.com")
data = fory.serialize(user)
restored = fory.deserialize(data)

Mixed Fields (Gradual Adoption)
from pyfory import field, int32

@dataclass
class MixedStruct:
    # New-style with field metadata (tag IDs)
    id: int32 = field(0)
    name: str = field(1)
    # New-style with field name encoding (id=-1)
    description: Optional[str] = field(-1, nullable=True)
    count: int32 = field(-1)

Schema Evolution
# Version 1
@dataclass
class ConfigV1:
    timeout: int32 = pyfory.field(0)
    retries: int32 = pyfory.field(1)

# Version 2 - Added new field, renamed existing
@dataclass
class ConfigV2:
    timeout_ms: int32 = pyfory.field(0)   # Same tag ID, different name OK
    max_retries: int32 = pyfory.field(1)  # Same tag ID, different name OK
    enabled: bool = pyfory.field(2)       # New field with new tag ID

Performance Considerations
- Pre-computation: All field metadata is computed once at registration time
- JIT Codegen: Generated methods include field-specific optimizations
- Skip null flags: Non-nullable primitives save 1 byte per field
- Skip ref tracking: Fields with ref=False skip IdentityMap lookups
- Tag ID encoding: Numeric IDs are more compact than field name strings