[Python] Create field descriptor for dataclass field metadata #3002

@chaokunyang

Overview

This issue outlines the implementation plan for adding field-level metadata support to Python Fory (pyfory), giving fine-grained control over serialization behavior per field. It follows the pattern established by the Rust (#[fory(...)] attributes) and Go (fory:"..." struct tags) implementations.

Related Issues:

Design Goals

  1. Python-Idiomatic API: Leverage dataclasses.field() with Fory-specific metadata
  2. Explicit Over Implicit: Users explicitly control nullable/ref per field
  3. Zero Breaking Changes: Existing code without field metadata works unchanged
  4. Performance First: Pre-compute field info at registration time, optimize JIT codegen
  5. Cross-Language Compatibility: Support TAG_ID encoding per xlang spec

Key Design Decisions

Default Values

| Option   | Default  | Notes                                                         |
|----------|----------|---------------------------------------------------------------|
| id       | required | Must specify: -1 for field name, >=0 for tag ID               |
| nullable | False    | No null flag written; exception: Optional[T] is always True   |
| ref      | False    | No reference tracking; must explicitly enable with ref=True   |
| ignore   | False    | Field is serialized                                           |

Nullable Rules

  1. Non-Optional fields: nullable=False by default (no null flag, saves 1 byte)
  2. Optional[T] fields: nullable=True is required; setting nullable=False raises ValueError
  3. Explicit override: Use nullable=True for non-Optional fields that may be None
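
The three rules can be read as one resolution step per field. The sketch below is only an illustration of that reading, not pyfory code: `resolve_nullable` and `_is_optional` are hypothetical names, the helper recognizes only `typing.Optional`/`typing.Union` hints, and rule 2 is applied literally (an Optional field declared with nullable=False is rejected).

```python
from typing import Optional, Union, get_args, get_origin


def _is_optional(type_hint) -> bool:
    """Hypothetical stand-in: True for Optional[T] / Union[T, None]."""
    return get_origin(type_hint) is Union and type(None) in get_args(type_hint)


def resolve_nullable(field_name: str, type_hint, nullable: bool) -> bool:
    """Apply the nullable rules to one field and return the effective flag."""
    if _is_optional(type_hint):
        if not nullable:
            # Rule 2: Optional[T] must be declared nullable.
            raise ValueError(f"Field '{field_name}' is Optional[T] but nullable=False")
        return True
    # Rules 1 and 3: non-Optional fields use the declared value (default False).
    return nullable


print(resolve_nullable("name", str, False))            # False: no null flag
print(resolve_nullable("nick", str, True))             # True: explicit override
print(resolve_nullable("email", Optional[str], True))  # True
```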

Ref Tracking Rules

  1. All fields: ref=False by default (no IdentityMap overhead)
  2. Explicit enable: Use ref=True for fields with circular/shared references
  3. Global override: If Fory(ref_tracking=False), ALL fields use ref=False regardless of field setting
  4. Hash computation: Uses field-level ref setting only (stable, independent of Fory config)
  5. Serializer generation: Combines field-level ref with global config (global False overrides field True)
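
A small truth table may make the split between rules 3 to 5 easier to see. The function names here are hypothetical and exist only for this sketch; the point is that the fingerprint reads the field-level flag alone, while the serializer combines it with the global setting.

```python
def runtime_ref_tracking(field_ref: bool, global_ref_tracking: bool) -> bool:
    """Rules 3 and 5: track only if both the field and the Fory instance enable it."""
    return field_ref and global_ref_tracking


def fingerprint_ref_flag(field_ref: bool) -> str:
    """Rule 4: the hash sees only the field-level setting."""
    return "1" if field_ref else "0"


# Print every combination: field ref, global ref, runtime tracking, hash flag.
for field_ref in (False, True):
    for global_ref in (False, True):
        print(
            field_ref,
            global_ref,
            runtime_ref_tracking(field_ref, global_ref),
            fingerprint_ref_flag(field_ref),
        )
```

Only the (True, True) row ends up tracking references at runtime, yet both rows with field ref=True contribute "1" to the fingerprint.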

API Design

Core API: field() Function

from dataclasses import MISSING
from typing import Any, Callable, Mapping, Optional

def field(
    id: int,                 # Tag ID (required): -1 = use field name, >=0 = use numeric tag ID
    *,
    # Fory-specific options
    nullable: bool = False,  # Whether null flag is written (default: False, auto-True for Optional[T])
    ref: bool = False,       # Whether ref tracking enabled (default: False, must explicitly enable)
    ignore: bool = False,    # Whether to ignore this field during serialization

    # Standard dataclass.field() options (passthrough)
    default: Any = MISSING,
    default_factory: Callable[[], Any] = MISSING,
    init: bool = True,
    repr: bool = True,
    hash: Optional[bool] = None,
    compare: bool = True,
    metadata: Optional[Mapping[str, Any]] = None,
    **kwargs,              # Forward any additional args to dataclasses.field()
) -> Any:
    """
    Create a field with Fory-specific serialization metadata.

    This wraps dataclasses.field() and stores Fory configuration in field.metadata.

    Args:
        id: Field tag ID (required positional parameter).
            - -1: Use field name with meta string encoding
            - >=0: Use numeric tag ID (more compact, stable across renames)
            Must be unique within the class (except -1).
            Required to force explicit choice about schema evolution strategy.

        nullable: Whether to write null flag for this field.
            - False (default): Skip null flag, field cannot be None
            - True: Write null flag (1 byte overhead), field can be None
            Note: For Optional[T] fields, nullable is automatically True
            regardless of this parameter.

        ref: Whether to enable reference tracking for this field.
            - False (default): No tracking, skip IdentityMap overhead
            - True: Track references (handles circular refs, shared objects)
            Note: Must be explicitly set to True when needed. Not inherited
            from Fory instance's ref_tracking config.

        ignore: Whether to ignore this field during serialization.
            - True: Field is excluded from serialization
            - False (default): Field is serialized

        default, default_factory, init, repr, hash, compare, metadata:
            Standard dataclass.field() parameters, passed through.

    Returns:
        A dataclass field descriptor with Fory metadata attached.

    Example:
        @dataclass
        class User:
            # Compact encoding with tag ID 0, non-nullable
            name: str = pyfory.field(0)

            # Tag ID 1, explicitly nullable
            email: Optional[str] = pyfory.field(1, nullable=True)

            # Tag ID 2, enable ref tracking
            friends: List["User"] = pyfory.field(2, ref=True, default_factory=list)

            # Use field name encoding (id=-1), ignore this field
            _cache: dict = pyfory.field(-1, ignore=True, default_factory=dict)
    """

ForyFieldMeta Data Class

from dataclasses import dataclass

@dataclass(frozen=True)
class ForyFieldMeta:
    """Parsed Fory field metadata extracted from field.metadata."""

    id: int                           # Required: -1 = use field name, >=0 = use tag ID
    nullable: bool = False            # Whether null flag is written
    ref: bool = False                 # Whether ref tracking is enabled
    ignore: bool = False              # Whether to ignore this field

    def uses_tag_id(self) -> bool:
        """Returns True if this field uses tag ID encoding."""
        return self.id >= 0

    def validate_nullable(self, field_name: str, type_hint: type) -> None:
        """
        Validate nullable setting against type hint.

        Raises:
            ValueError: If Optional[T] field has nullable=False
        """
        if is_optional_type(type_hint) and not self.nullable:
            raise ValueError(
                f"Field '{field_name}' is Optional[T] but nullable=False. "
                f"Optional fields must have nullable=True (or omit the parameter)."
            )

    def effective_nullable(self, type_hint: type) -> bool:
        """
        Returns effective nullable value.

        Rules:
        - Optional[T] fields must have nullable=True (validated separately)
        - Other fields use the configured nullable value (default: False)
        """
        if is_optional_type(type_hint):
            return True  # Already validated that nullable=True
        return self.nullable

    def effective_ref(self) -> bool:
        """Returns ref tracking value (no inheritance from global config)."""
        return self.ref

Metadata Storage

Fory metadata is stored in field.metadata["__fory__"]:

import dataclasses

FORY_FIELD_METADATA_KEY = "__fory__"

def field(...) -> Any:
    # Build Fory metadata
    fory_meta = ForyFieldMeta(
        id=id,
        nullable=nullable,
        ref=ref,
        ignore=ignore,
    )

    # Merge with user-provided metadata
    combined_metadata = dict(metadata) if metadata else {}
    combined_metadata[FORY_FIELD_METADATA_KEY] = fory_meta

    # Create dataclass field with combined metadata
    return dataclasses.field(
        default=default,
        default_factory=default_factory,
        init=init,
        repr=repr,
        hash=hash,
        compare=compare,
        metadata=combined_metadata,
        **kwargs,  # Forward any additional args
    )
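
To see the storage scheme end to end, here is a self-contained sketch that runs without pyfory. `fory_field` is a hypothetical stand-in for the proposed `pyfory.field()`, and it stores the options in a plain dict rather than a `ForyFieldMeta` instance to keep the example short. It shows that user-supplied metadata survives next to the `__fory__` entry and that both can be read back through `dataclasses.fields()`.

```python
import dataclasses
from dataclasses import dataclass
from typing import Optional

FORY_FIELD_METADATA_KEY = "__fory__"


def fory_field(id: int, *, nullable: bool = False, ref: bool = False,
               ignore: bool = False, metadata=None, **kwargs):
    """Hypothetical stand-in for pyfory.field(); options are kept in a plain dict."""
    combined = dict(metadata) if metadata else {}
    combined[FORY_FIELD_METADATA_KEY] = {
        "id": id, "nullable": nullable, "ref": ref, "ignore": ignore,
    }
    # default, default_factory, init, ... are forwarded untouched.
    return dataclasses.field(metadata=combined, **kwargs)


@dataclass
class User:
    name: str = fory_field(0, default="")
    email: Optional[str] = fory_field(
        1, nullable=True, default=None, metadata={"doc": "contact address"}
    )


# Read the metadata back the way a serializer would at registration time.
for f in dataclasses.fields(User):
    print(f.name, f.metadata[FORY_FIELD_METADATA_KEY], f.metadata.get("doc"))
```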

Type Utilities

Reuse existing utilities from pyfory/type.py:

from pyfory.type import is_optional_type, unwrap_optional

# is_optional_type(type_) -> bool
#   Check if type is Optional[T] or Union[T, None]

# unwrap_optional(type_, field_nullable=False) -> tuple[type, bool]
#   Unwrap Optional[T] to (T, True) or return (type_, False)

No new utilities needed - type.py already provides these functions.

Implementation Architecture

Component Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                           User Code                                         │
│                                                                             │
│  @dataclass                                                                 │
│  class User:                                                                │
│      name: str = pyfory.field(0)                                            │
│      email: Optional[str] = pyfory.field(1, nullable=True)                  │
│      age: int32 = pyfory.field(2)                                           │
│      _cache: dict = pyfory.field(-1, ignore=True, default_factory=dict)     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                      Type Registration (Fory.register())                    │
│                                                                             │
│  1. Extract field metadata from dataclass fields                            │
│  2. Validate tag IDs are unique                                             │
│  3. Compute effective nullable/ref per field                                │
│  4. Filter out ignored fields                                               │
│  5. Build FieldInfo list with pre-computed flags                            │
│  6. Compute struct fingerprint (includes field metadata)                    │
│  7. Create DataClassSerializer with field metadata                          │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         DataClassSerializer                                 │
│                                                                             │
│  Fields:                                                                    │
│  - _schema_field_infos: List[FieldInfo]  # From type hints (for hash)      │
│  - _runtime_field_infos: List[FieldInfo] # With global config applied      │
│  - _hash: int32                          # Schema fingerprint hash         │
│                                                                             │
│  JIT Codegen:                                                               │
│  - Generate write/read methods with field-specific logic                    │
│  - Skip null flag for non-nullable fields                                   │
│  - Skip ref tracking for ref=False fields                                   │
│  - Use tag ID encoding when id >= 0                                         │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

FieldInfo Data Structure

@dataclass
class FieldInfo:
    """Pre-computed field information for serialization."""

    # Identity
    name: str                        # Field name (snake_case)
    index: int                       # Field index in dataclass
    type_hint: type                  # Type annotation

    # Fory metadata (from pyfory.field()) - used for hash computation
    tag_id: int                      # Required: -1 or >=0
    nullable: bool                   # Effective nullable flag (considers Optional[T])
    ref: bool                        # Field-level ref setting (for hash computation)
    ignore: bool                     # Whether to ignore (always False in this list)

    # Runtime flags (combines field metadata with global Fory config)
    runtime_ref_tracking: bool       # Actual ref tracking: field.ref AND fory.ref_tracking

    # Derived info
    type_id: int                     # Fory TypeId
    serializer: Serializer           # Field serializer
    unwrapped_type: type             # Type with Optional unwrapped

    # Pre-computed flags for codegen (based on runtime flags)
    needs_null_flag: bool            # Whether to write null flag
    needs_ref_tracking: bool         # Whether to track references (= runtime_ref_tracking)
    uses_tag_id: bool                # Whether to use tag ID encoding

Key distinction:

  • ref: Field-level setting from pyfory.field(), used for hash computation (stable)
  • runtime_ref_tracking: field.ref AND fory.ref_tracking, used for actual serialization

Implementation Steps

Phase 1: Core API and Metadata Extraction

Files to modify/create:

  • python/pyfory/field.py (NEW) - field() function and ForyFieldMeta
  • python/pyfory/__init__.py - Export field

Tasks:

  1. Create field() function wrapping dataclasses.field()
  2. Create ForyFieldMeta dataclass for parsed metadata
  3. Create extract_field_meta() helper to read metadata from fields
  4. Create validate_field_metas() to check:
    • Tag ID uniqueness (no duplicate IDs >= 0)
    • Nullable consistency (Optional[T] with nullable=False raises ValueError)
  5. Reuse is_optional_type() and unwrap_optional() from pyfory/type.py
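
The tag ID uniqueness check in task 4 could look roughly like the following. This is a sketch under assumptions: `validate_unique_tag_ids` is a hypothetical name, fields are passed as (name, tag_id) pairs instead of parsed metadata objects, and any negative ID is treated as name-encoded and therefore exempt.

```python
from typing import Dict, Iterable, Tuple


def validate_unique_tag_ids(class_name: str, fields: Iterable[Tuple[str, int]]) -> None:
    """Raise ValueError if two fields share a tag ID >= 0; negative IDs may repeat."""
    seen: Dict[int, str] = {}
    for field_name, tag_id in fields:
        if tag_id < 0:
            continue  # name-encoded fields are exempt from the check
        if tag_id in seen:
            raise ValueError(
                f"{class_name}: tag ID {tag_id} is used by both "
                f"'{seen[tag_id]}' and '{field_name}'"
            )
        seen[tag_id] = field_name


# Passes: distinct tag IDs, and -1 used twice.
validate_unique_tag_ids("User", [("name", 0), ("email", 1), ("_cache", -1), ("note", -1)])

# Fails: tag ID 0 appears twice.
try:
    validate_unique_tag_ids("User", [("name", 0), ("email", 0)])
except ValueError as exc:
    print(exc)
```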

Phase 2: Serializer Integration

Files to modify:

  • python/pyfory/struct.py - DataClassSerializer

Tasks:

  1. Modify DataClassSerializer.__init__() to extract field metadata
  2. Create FieldInfo dataclass for pre-computed field info
  3. Update _get_field_names() to filter out ignored fields
  4. Update _nullable_fields computation to use field metadata
  5. Add _ref_tracking_fields dict for per-field ref control

Phase 3: Fingerprint Computation

Files to modify:

  • python/pyfory/struct.py - compute_struct_fingerprint()

Tasks:

  1. Update fingerprint format to include field metadata:
    <field_id_or_name>,<type_id>,<ref>,<nullable>;
    
  2. Use tag ID as sort key when id >= 0, otherwise use field name
  3. Include ref flag in fingerprint (from field annotation, not runtime config)
  4. Update compute_struct_meta() to pass field metadata

Phase 4: JIT Codegen Updates

Files to modify:

  • python/pyfory/struct.py - _gen_write_method(), _gen_read_method(), etc.
  • python/pyfory/codegen.py - Code generation utilities

Tasks:

  1. Update write codegen to skip null flag when nullable=False
  2. Update write codegen to skip ref tracking when ref=False
  3. Update read codegen accordingly
  4. Update xwrite/xread methods for xlang mode
  5. Optimize primitive field serialization for non-nullable fields
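
The idea behind tasks 1 and 2 is that the nullable decision moves from call time to generation time. The toy generator below illustrates that for the null flag only. It is not pyfory's codegen: `gen_write_source`, the list used as a buffer, and the "NULL"/"NOT_NULL" markers are all placeholders invented for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


def gen_write_source(fields: List[Tuple[str, bool]]) -> str:
    """Build the source of write(buf, obj) from (field_name, nullable) pairs."""
    lines = ["def write(buf, obj):"]
    for name, nullable in fields:
        if nullable:
            # Nullable: emit a flag before the value, or the null marker alone.
            lines += [
                f"    value = obj.{name}",
                "    if value is None:",
                "        buf.append('NULL')",
                "    else:",
                "        buf.append('NOT_NULL')",
                "        buf.append(value)",
            ]
        else:
            # Non-nullable: no flag and no branch in the generated code.
            lines.append(f"    buf.append(obj.{name})")
    lines.append("    return buf")
    return "\n".join(lines)


@dataclass
class User:
    name: str
    email: Optional[str] = None


source = gen_write_source([("name", False), ("email", True)])
namespace: dict = {}
exec(source, namespace)  # compile the generated function
write = namespace["write"]

buf: list = []
write(buf, User("Alice"))
print(buf)  # ['Alice', 'NULL']
```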

Phase 5: TypeDef Encoding (Compatible Mode)

Files to modify:

  • python/pyfory/meta/typedef_encoder.py
  • python/pyfory/meta/typedef_decoder.py

Tasks:

  1. Update TypeDef encoding to use tag ID when available
  2. Write field header with TAG_ID encoding (2 bits = 0b11)
  3. Write tag ID as varint instead of field name
  4. Include nullable and ref flags in field header
  5. Update decoder to handle TAG_ID encoded fields

Phase 6: Cython Integration

Files to modify:

  • python/pyfory/serialization.pyx

Tasks:

  1. Add FieldInfo handling in Cython serialization code
  2. Optimize field metadata access for Cython
  3. Update Cython read/write paths for field metadata

Phase 7: Testing

Files to create:

  • python/pyfory/tests/test_field_meta.py (NEW)
  • Update python/pyfory/tests/xlang_test_main.py

Test cases:

  1. Basic pyfory.field() usage with all options
  2. Tag ID uniqueness validation
  3. Type-based nullable inference
  4. Ref tracking per field
  5. Ignored fields not serialized
  6. Fingerprint computation with field metadata
  7. Cross-language compatibility with Java/Rust/Go
  8. Schema evolution with tag IDs
  9. Mixed fields (some with metadata, some without)

Fingerprint Format

The struct fingerprint format (matching Java/Rust/Go):

<field_id_or_name>,<type_id>,<ref>,<nullable>;

Components:

  • field_id_or_name: Tag ID as string (e.g., "0", "1") if id >= 0, otherwise snake_case field name
  • type_id: Fory TypeId as decimal string (e.g., "4" for INT32)
  • ref: "1" if ref=True in field annotation, "0" otherwise (NOT affected by global Fory config)
  • nullable: "1" if null flag is written, "0" otherwise

Important: The fingerprint uses field-level ref setting only, independent of Fory(ref_tracking=...).
This ensures the hash is stable across different Fory instances with different configurations.

Example fingerprints:

# With tag IDs:
0,4,0,0;1,12,0,1;2,0,0,1;

# With field names:
age,4,0,0;email,12,0,1;name,9,0,0;
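
As a rough sketch of how such a string could be assembled: `FieldSpec` and `compute_fingerprint` are hypothetical names, the type IDs are copied from the field-name example above, and the relative order of tagged versus name-encoded fields in a mixed struct is an assumption of this sketch (tagged first).

```python
from typing import List, NamedTuple


class FieldSpec(NamedTuple):
    name: str       # snake_case field name
    tag_id: int     # -1 or >= 0
    type_id: int    # Fory TypeId
    ref: bool       # field-level ref setting
    nullable: bool  # effective nullable flag


def compute_fingerprint(fields: List[FieldSpec]) -> str:
    def sort_key(f: FieldSpec):
        # Assumption: tagged fields sort numerically and precede named fields.
        return (0, f.tag_id, "") if f.tag_id >= 0 else (1, 0, f.name)

    parts = []
    for f in sorted(fields, key=sort_key):
        ident = str(f.tag_id) if f.tag_id >= 0 else f.name
        parts.append(f"{ident},{f.type_id},{int(f.ref)},{int(f.nullable)};")
    return "".join(parts)


named = [
    FieldSpec("name", -1, 9, False, False),
    FieldSpec("email", -1, 12, False, True),
    FieldSpec("age", -1, 4, False, False),
]
print(compute_fingerprint(named))  # age,4,0,0;email,12,0,1;name,9,0,0;
```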

Hash computation:

fingerprint = compute_struct_fingerprint(fields)
hash_bytes = fingerprint.encode("utf-8")
full_hash = murmurhash3_x64_128(hash_bytes, seed=47)
type_hash_32 = int32(full_hash & 0xFFFFFFFF)
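
The final line narrows a 128-bit value to a signed 32-bit integer. A plain-Python sketch of that step follows, assuming `int32(...)` means two's-complement reinterpretation of the low 32 bits. `to_int32` is a hypothetical helper and the inputs are made-up numbers, not real MurmurHash3 output.

```python
def to_int32(full_hash: int) -> int:
    """Keep the low 32 bits and reinterpret them as a signed 32-bit integer."""
    low = full_hash & 0xFFFFFFFF
    return low - (1 << 32) if low >= (1 << 31) else low


# Made-up 128-bit values for illustration only.
print(to_int32(0x0123456789ABCDEF_0000000012345678))  # 305419896
print(to_int32(0x0123456789ABCDEF_00000000FFFFFFFE))  # -2
```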

TypeDef Field Header Encoding

Per xlang spec, field header is 8 bits:

2 bits encoding + 4 bits size + 1 bit nullable + 1 bit ref_tracking

TAG_ID encoding (when id >= 0):

| 2 bits encoding (0b11) | 4 bits tag_id | 1 bit nullable | 1 bit ref_tracking |

When tag ID > 15, write additional varint for (tag_id - 15).

Field name encoding (when id < 0):

| 2 bits encoding (0b00-10) | 4 bits name_size | 1 bit nullable | 1 bit ref_tracking |
| meta string encoded field name |
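
The TAG_ID layout can be sketched as simple bit packing. Two assumptions are made here and should be checked against the xlang spec: the fields are laid out left to right starting from the most significant bit, as drawn in the tables above, and only tag IDs below 15 are handled so the sketch stays clear of the varint overflow case. `pack_tag_id_header` and `unpack_header` are hypothetical names.

```python
TAG_ID_ENCODING = 0b11


def pack_tag_id_header(tag_id: int, nullable: bool, ref_tracking: bool) -> int:
    """Pack a TAG_ID field header into one byte (inline tag IDs only)."""
    if not 0 <= tag_id < 15:
        raise ValueError("this sketch covers only tag IDs that fit inline")
    return (
        (TAG_ID_ENCODING << 6)      # 2 bits encoding
        | (tag_id << 2)             # 4 bits tag ID
        | (int(nullable) << 1)      # 1 bit nullable
        | int(ref_tracking)         # 1 bit ref_tracking
    )


def unpack_header(header: int):
    """Split a header byte into (encoding, size_or_tag, nullable, ref_tracking)."""
    return header >> 6, (header >> 2) & 0xF, bool(header & 0b10), bool(header & 0b01)


header = pack_tag_id_header(3, nullable=True, ref_tracking=False)
print(f"{header:08b}")        # 11001110
print(unpack_header(header))  # (3, 3, True, False)
```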

API Examples

Basic Usage

from dataclasses import dataclass
from typing import Optional, List
import pyfory
from pyfory import int32

@dataclass
class User:
    # Compact tag ID encoding, non-nullable (saves 1 byte)
    id: int32 = pyfory.field(0)

    # Tag ID 1, non-nullable string
    name: str = pyfory.field(1)

    # Tag ID 2, nullable (required for Optional)
    email: Optional[str] = pyfory.field(2, nullable=True)

    # Tag ID 3, enable ref tracking for circular references
    friends: List["User"] = pyfory.field(3, ref=True, default_factory=list)

    # Use field name encoding (-1), ignore this field (not serialized)
    _cache: dict = pyfory.field(-1, ignore=True, default_factory=dict)

# Usage
fory = pyfory.Fory(ref_tracking=True)
fory.register(User)

user = User(id=1, name="Alice", email="alice@example.com")
data = fory.serialize(user)
restored = fory.deserialize(data)

Mixed Fields (Gradual Adoption)

@dataclass
class MixedStruct:
    # New-style with field metadata (tag IDs)
    id: int32 = pyfory.field(0)
    name: str = pyfory.field(1)

    # New-style with field name encoding (id=-1)
    description: Optional[str] = pyfory.field(-1, nullable=True)
    count: int32 = pyfory.field(-1)

Schema Evolution

# Version 1
@dataclass
class ConfigV1:
    timeout: int32 = pyfory.field(0)
    retries: int32 = pyfory.field(1)

# Version 2 - Added new field, renamed existing
@dataclass
class ConfigV2:
    timeout_ms: int32 = pyfory.field(0)  # Same tag ID, different name OK
    max_retries: int32 = pyfory.field(1) # Same tag ID, different name OK
    enabled: bool = pyfory.field(2)      # New field with new tag ID
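
Why a rename is safe under tag IDs can be shown without pyfory at all. In the sketch below the "wire format" is just a dict keyed by tag ID, `tagged`, `to_tagged_values` and `from_tagged_values` are hypothetical helpers, and the default on `enabled` is an assumption added so that a V1 payload can populate a V2 object.

```python
import dataclasses
from dataclasses import dataclass


def tagged(tag_id: int, **kwargs):
    """Hypothetical helper: stores the tag ID in plain dataclass metadata."""
    return dataclasses.field(metadata={"tag": tag_id}, **kwargs)


@dataclass
class ConfigV1:
    timeout: int = tagged(0)
    retries: int = tagged(1)


@dataclass
class ConfigV2:
    timeout_ms: int = tagged(0)
    max_retries: int = tagged(1)
    enabled: bool = tagged(2, default=True)  # default is an assumption of this sketch


def to_tagged_values(obj) -> dict:
    """Writer side: key every value by its tag ID instead of its name."""
    return {f.metadata["tag"]: getattr(obj, f.name) for f in dataclasses.fields(obj)}


def from_tagged_values(cls, values: dict):
    """Reader side: look fields up by tag ID; absent tags fall back to defaults."""
    kwargs = {
        f.name: values[f.metadata["tag"]]
        for f in dataclasses.fields(cls)
        if f.metadata["tag"] in values
    }
    return cls(**kwargs)


payload = to_tagged_values(ConfigV1(timeout=3000, retries=5))
print(from_tagged_values(ConfigV2, payload))
# ConfigV2(timeout_ms=3000, max_retries=5, enabled=True)
```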

Performance Considerations

  1. Pre-computation: All field metadata is computed once at registration time
  2. JIT Codegen: Generated methods include field-specific optimizations
  3. Skip null flags: Non-nullable primitives save 1 byte per field
  4. Skip ref tracking: Fields with ref=False skip IdentityMap lookups
  5. Tag ID encoding: Numeric IDs are more compact than field name strings
