Skip to content

Latest commit

 

History

History
468 lines (340 loc) · 15.3 KB

File metadata and controls

468 lines (340 loc) · 15.3 KB

Style Guide

This document is the authoritative reference for code style, naming, type annotations, import patterns, and design principles in DataDesigner. It is extracted from the project's coding standards and enforced by ruff (>=0.14.10).

For architectural invariants and project identity, see AGENTS.md. For development workflow and testing, see DEVELOPMENT.md.


General Formatting

  • Line length: Maximum 120 characters per line
  • Quote style: Always use double quotes (") for strings
  • Indentation: Use 4 spaces (never tabs)
  • String formatting: Prefer f-strings. Avoid .format() and % formatting.
  • Target version: Python 3.10+

License Headers

All Python files must include the NVIDIA SPDX license header:

# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

Use make update-license-headers to add headers to all files automatically.

Future Annotations

Include from __future__ import annotations at the top of every Python source file (after the license header) for deferred type evaluation.

Comments

Only insert comments when code is especially important to understand. For basic code blocks, comments aren't necessary. We want readable code without vacuous comments.

Docstrings

Use Google style docstrings (Args:, Returns:, Raises:).

  • Public API classes and functions get docstrings. Use a one-liner for simple functions; add Google sections for anything with non-obvious parameters or behavior.
  • Private helpers (_-prefixed) don't need docstrings unless the logic is non-obvious.
  • Don't restate the signature — the docstring should explain why or what, not repeat the parameter names and types that are already in the annotation.
  • Pydantic config classes use Attributes: and Inherited Attributes: sections to document fields.
  • Module docstrings are optional — use a one-liner after the license header when the module's purpose isn't obvious from its name.
# Good - Google style with sections
def compile_config(config: DataDesignerConfig, provider: ResourceProvider) -> DataDesignerConfig:
    """Compile a raw config into an executable form.

    Resolves seed columns, adds internal IDs, and validates the result.

    Args:
        config: The user-provided configuration to compile.
        provider: Resource provider for seed dataset resolution.

    Returns:
        The compiled configuration ready for execution.

    Raises:
        ConfigValidationError: If the configuration is invalid after compilation.
    """

# Good - one-liner for simple functions
def get_column_names(config: DataDesignerConfig) -> list[str]:
    """Return the names of all columns in the config."""

# Bad - restates the signature
def get_column_names(config: DataDesignerConfig) -> list[str]:
    """Get column names from a DataDesignerConfig and return them as a list of strings."""

Type Annotations

Type annotations are REQUIRED for all code in this project. This is strictly enforced for code quality and maintainability. Modern type syntax is enforced by ruff rules UP006, UP007, and UP045.

  • ALWAYS add type annotations to all functions, methods, and class attributes (including tests)
  • Use primitive types when possible: list not List, dict not Dict, set not Set, tuple not Tuple (enforced by UP006)
  • Use modern union syntax with | for optional and union types:
    • str | None not Optional[str] (enforced by UP045)
    • int | str not Union[int, str] (enforced by UP007)
  • Only import from typing when absolutely necessary for complex generic types
  • For Pydantic models, use field-level type annotations
# Good
def process_items(items: list[str], max_count: int | None = None) -> dict[str, int]:
    return {item: len(item) for item in items}

# Avoid - missing type annotations
def process_items(items, max_count=None):
    return {item: len(item) for item in items}

# Avoid - old-style typing
from typing import List, Dict, Optional
def process_items(items: List[str], max_count: Optional[int] = None) -> Dict[str, int]:
    return {item: len(item) for item in items}

Import Style

  • ALWAYS use absolute imports, never relative imports (enforced by TID)
  • Place imports at module level, not inside functions (exception: unavoidable for performance reasons)
  • Import sorting is handled by ruff's isort — imports should be grouped and sorted:
    1. Standard library imports
    2. Third-party imports (use lazy_heavy_imports for heavy libraries)
    3. First-party imports (data_designer)
  • Use standard import conventions (enforced by ICN)
# Good
from data_designer.config.config_builder import DataDesignerConfigBuilder

# Bad - relative import (will cause linter errors)
from .config_builder import DataDesignerConfigBuilder

# Good - imports at module level
from pathlib import Path

def process_file(filename: str) -> None:
    path = Path(filename)

# Bad - import inside function
def process_file(filename: str) -> None:
    from pathlib import Path
    path = Path(filename)

Lazy Loading and TYPE_CHECKING

This project uses lazy loading for heavy third-party dependencies to optimize import performance.

Heavy third-party libraries (>100ms import cost) should be lazy-loaded via lazy_heavy_imports.py:

# Don't import directly
import pandas as pd
import numpy as np

# Use lazy loading with IDE support
from typing import TYPE_CHECKING
from data_designer.lazy_heavy_imports import pd, np

if TYPE_CHECKING:
    import pandas as pd
    import numpy as np

This pattern provides:

  • Runtime lazy loading (fast startup)
  • Full IDE support (autocomplete, type hints)
  • Type checker validation

See lazy_heavy_imports.py for the current list of lazy-loaded libraries.

Adding New Heavy Dependencies

If you add a new dependency with significant import cost (>100ms):

  1. Add to lazy_heavy_imports.py:

    _LAZY_IMPORTS = {
        # ... existing entries ...
        "your_lib": "your_library_name",
    }
  2. Update imports across codebase:

    from typing import TYPE_CHECKING
    from data_designer.lazy_heavy_imports import your_lib
    
    if TYPE_CHECKING:
        import your_library_name as your_lib
  3. Verify with performance test:

    make perf-import CLEAN=1

TYPE_CHECKING Rules

TYPE_CHECKING blocks defer imports that are only needed for type hints, preventing circular dependencies and reducing import time.

DO put in TYPE_CHECKING:

  • Internal data_designer imports used only in type hints
  • Imports that would cause circular dependencies
  • Full imports of lazy-loaded libraries for IDE support (e.g., import pandas as pd in addition to runtime from data_designer.lazy_heavy_imports import pd)

DON'T put in TYPE_CHECKING:

  • Standard library imports (Path, Any, Callable, Literal, TypeAlias, etc.)
  • Pydantic model types used in field definitions (needed at runtime for validation)
  • Types used in discriminated unions (Pydantic needs them at runtime)
  • Any import used at runtime (instantiation, method calls, base classes, etc.)

Examples:

# CORRECT - Lazy-loaded library with IDE support
from typing import TYPE_CHECKING
from data_designer.lazy_heavy_imports import pd

if TYPE_CHECKING:
    import pandas as pd

def load_data(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

# CORRECT - Standard library NOT in TYPE_CHECKING
from pathlib import Path
from typing import Any

def process_file(path: Path) -> Any:
    return path.read_text()

# CORRECT - Internal type-only import
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from data_designer.engine.models.facade import ModelFacade

def get_model(model: ModelFacade) -> str:
    return model.name

# INCORRECT - Pydantic field type in TYPE_CHECKING
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from data_designer.config.models import ModelConfig  # Wrong!

class MyConfig(BaseModel):
    model: ModelConfig  # Pydantic needs this at runtime!

# CORRECT - Pydantic field type at runtime
from data_designer.config.models import ModelConfig

class MyConfig(BaseModel):
    model: ModelConfig

Naming Conventions (PEP 8)

  • Functions and variables: snake_case
  • Classes: PascalCase
  • Constants: UPPER_SNAKE_CASE
  • Private attributes: prefix with single underscore _private_var
  • Function and method names must start with an action verb: e.g. get_value_from not value_from, coerce_to_int not to_int, extract_usage not usage
# Good
class DatasetGenerator:
    MAX_RETRIES = 3

    def __init__(self) -> None:
        self._cache: dict[str, str] = {}

    def generate_dataset(self, config: dict[str, str]) -> list[dict[str, str]]:
        pass

# Bad
class dataset_generator:  # Should be PascalCase
    maxRetries = 3        # Should be UPPER_SNAKE_CASE

    def GenerateDataset(self, Config):  # Should be snake_case
        pass

Code Organization

  • Public before private: Public functions/methods appear before private ones in modules and classes
  • Class method order: __init__ and other dunder methods first, then properties, then public methods, then private helpers. Group related method types together (e.g., all @staticmethods in one block, all @classmethods in one block).
  • Prefer public over private for testability: Use public functions (no _ prefix) for helpers that benefit from direct testing
  • Avoid nested functions: Define helpers at module level or as private methods on the class. Nested functions hide logic, make testing harder, and complicate stack traces. The only acceptable use is closures that genuinely need to capture local state.
  • Section comments in larger modules: Use # --- separators to delineate logical groups (e.g. image parsing, usage extraction, generic accessors)

Pydantic Models and Dataclasses

Pydantic for config, validation, serialization, and schema generation. Dataclasses for simple data containers that don't need any of that.

Pydantic Models

  • Config models inherit ConfigBase (from data_designer.config.base), which sets shared defaults: extra="forbid", use_enum_values=True, arbitrary_types_allowed=True.
  • Use Field() when you need constraints (ge, le, gt), descriptions, default_factory, discriminators, or schema control (exclude, SkipJsonSchema). Use bare defaults for simple flags and strings.
  • Specify validator mode explicitly (mode="before" or mode="after"). Name validators with descriptive verbs: validate_* for checks, normalize_* for canonicalization, inject_* for pre-parse dict shaping.
# Good - bare defaults for simple fields, Field() for constraints
class RunConfig(ConfigBase):
    disable_early_shutdown: bool = False
    shutdown_error_rate: float = Field(default=0.5, ge=0.0, le=1.0)
    buffer_size: int = Field(default=1000, gt=0)

    @model_validator(mode="after")
    def normalize_shutdown_settings(self) -> Self:
        if self.disable_early_shutdown:
            self.shutdown_error_rate = 1.0
        return self

Dataclasses

Use @dataclass for runtime data containers in the engine, CLI, and internal tooling — DTOs, concurrency primitives, task metadata. Prefer frozen=True, slots=True for immutable value types.

@dataclass(frozen=True, slots=True)
class Usage:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int

When to Choose

Need Use
Validation, serialization, JSON schema Pydantic (ConfigBase or BaseModel)
Typed struct with no validation @dataclass
Immutable value object @dataclass(frozen=True, slots=True)
Dict-shaped data (e.g., trace JSON) TypedDict

Design Principles

DRY

  • Extract shared logic into pure helper functions rather than duplicating across similar call sites
  • Rule of thumb: tolerate duplication until the third occurrence, then extract

KISS

  • Prefer flat, obvious code over clever abstractions — two similar lines is better than a premature helper
  • When in doubt between DRY and KISS, favor readability over deduplication

YAGNI

  • Don't add parameters, config, or abstraction layers for hypothetical future use cases
  • Don't generalize until the third caller appears

SOLID

  • Wrap third-party exceptions at module boundaries — callers depend on canonical error types, not leaked internals
  • Use Protocol for contracts between layers
  • One function, one job — separate logic from I/O

Error Handling

  • Prefer specific exception types over bare except. Never catch Exception or BaseException without re-raising.
  • Wrap third-party exceptions at module boundaries into canonical data_designer error types (see data_designer.errors, data_designer.interface.errors).
  • Don't use exceptions for control flow — check conditions explicitly instead.
  • Re-raise with context so the original traceback is preserved:
# Good
try:
    response = client.chat(messages)
except httpx.HTTPStatusError as exc:
    raise ModelClientError(f"LLM request failed: {exc.response.status_code}") from exc

# Bad - swallows the original traceback
except httpx.HTTPStatusError as exc:
    raise ModelClientError("LLM request failed")

Common Pitfalls to Avoid

  1. Mutable default arguments:

    # Bad
    def add_item(item: str, items: list[str] = []) -> list[str]:
        items.append(item)
        return items
    
    # Good
    def add_item(item: str, items: list[str] | None = None) -> list[str]:
        if items is None:
            items = []
        items.append(item)
        return items
  2. Unused imports and variables:

    # Bad
    from pathlib import Path
    from typing import Any  # Not used
    
    def process() -> None:
        pass
    
    # Good
    from pathlib import Path
    
    def process() -> None:
        pass
  3. Simplify code where possible (SIM rules; not yet enforced by CI but code should comply):

    # Bad
    if condition:
        return True
    else:
        return False
    
    # Good
    return condition
  4. Use comprehensions properly:

    # Bad
    list([x for x in items])  # Unnecessary list() call
    
    # Good
    [x for x in items]
  5. Proper return statements:

    # Bad - unnecessary else after return
    def get_value(condition: bool) -> str:
        if condition:
            return "yes"
        else:
            return "no"
    
    # Good
    def get_value(condition: bool) -> str:
        if condition:
            return "yes"
        return "no"

Active Linter Rules

The following ruff linter rules are currently enabled (see pyproject.toml):

  • W: pycodestyle warnings
  • F: pyflakes (unused imports, undefined names)
  • I: isort (import sorting)
  • ICN: flake8-import-conventions (standard import names)
  • PIE: flake8-pie (miscellaneous lints)
  • TID: flake8-tidy-imports (bans relative imports)
  • UP006: List[A] -> list[A]
  • UP007: Union[A, B] -> A | B
  • UP045: Optional[A] -> A | None

Note: Additional rules (E, N, ANN, B, C4, DTZ, RET, SIM, PTH) are commented out but may be enabled in the future. Write code that would pass these checks for future-proofing.