Style Guide

This document is the authoritative reference for code style, naming, type annotations, import patterns, and design principles in DataDesigner. It is extracted from the project's coding standards and enforced by ruff (>=0.14.10).

For architectural invariants and project identity, see AGENTS.md. For development workflow and testing, see DEVELOPMENT.md.

General Formatting

Line length: Maximum 120 characters per line
Quote style: Always use double quotes (") for strings
Indentation: Use 4 spaces (never tabs)
String formatting: Prefer f-strings. Avoid .format() and % formatting.
Target version: Python 3.10+
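
As a small illustration of the string-formatting preference, the sketch below builds the same message three ways. The function name and the message text are hypothetical and exist only for this example.

```python
def format_progress(name: str, done: int, total: int) -> str:
    """Return a one-line progress message for a named task."""
    # Preferred: an f-string keeps each value next to the text that describes it.
    return f"{name}: {done}/{total} records ({done / total:.0%})"


# Avoid: the same message written with .format() and with % formatting.
legacy_format = "{}: {}/{} records ({:.0%})".format("seed", 5, 20, 5 / 20)
legacy_percent = "%s: %d/%d records (%d%%)" % ("seed", 5, 20, 100 * 5 // 20)

print(format_progress("seed", 5, 20))  # seed: 5/20 records (25%)
```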

License Headers

All Python files must include the NVIDIA SPDX license header:

# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

Use make update-license-headers to add headers to all files automatically.

Future Annotations

Include from __future__ import annotations at the top of every Python source file (after the license header) for deferred type evaluation.
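
A minimal sketch of what deferred evaluation allows, assuming a hypothetical TreeNode class: annotations may name a class that is defined later in the module, or the class currently being defined, because they are stored as strings instead of being evaluated when the function is created.

```python
from __future__ import annotations


def build_root(name: str) -> TreeNode:
    """Create a root node; TreeNode is defined further down in the module."""
    return TreeNode(name=name)


class TreeNode:
    """Hypothetical node type that refers to itself in its own annotations."""

    def __init__(self, name: str, parent: TreeNode | None = None) -> None:
        self.name = name
        self.parent = parent


# The annotation is kept as the string "TreeNode" and is not evaluated eagerly.
print(build_root.__annotations__["return"])  # TreeNode
```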

Comments

Add comments only where the intent or reasoning behind the code is not obvious from reading it. Routine code blocks don't need them. We want readable code without vacuous comments.

Docstrings

Use Google style docstrings (Args:, Returns:, Raises:).

Public API classes and functions get docstrings. Use a one-liner for simple functions; add Google sections for anything with non-obvious parameters or behavior.
Private helpers (_-prefixed) don't need docstrings unless the logic is non-obvious.
Don't restate the signature — the docstring should explain why or what, not repeat the parameter names and types that are already in the annotation.
Pydantic config classes use Attributes: and Inherited Attributes: sections to document fields.
Module docstrings are optional — use a one-liner after the license header when the module's purpose isn't obvious from its name.

# Good - Google style with sections
def compile_config(config: DataDesignerConfig, provider: ResourceProvider) -> DataDesignerConfig:
    """Compile a raw config into an executable form.

    Resolves seed columns, adds internal IDs, and validates the result.

    Args:
        config: The user-provided configuration to compile.
        provider: Resource provider for seed dataset resolution.

    Returns:
        The compiled configuration ready for execution.

    Raises:
        ConfigValidationError: If the configuration is invalid after compilation.
    """

# Good - one-liner for simple functions
def get_column_names(config: DataDesignerConfig) -> list[str]:
    """Return the names of all columns in the config."""

# Bad - restates the signature
def get_column_names(config: DataDesignerConfig) -> list[str]:
    """Get column names from a DataDesignerConfig and return them as a list of strings."""

Type Annotations

Type annotations are REQUIRED for all code in this project. This is strictly enforced for code quality and maintainability. Modern type syntax is enforced by ruff rules UP006, UP007, and UP045.

ALWAYS add type annotations to all functions, methods, and class attributes (including tests)
Use built-in generic types instead of their typing aliases: list not List, dict not Dict, set not Set, tuple not Tuple (enforced by UP006)
Use modern union syntax with | for optional and union types:
- str | None not Optional[str] (enforced by UP045)
- int | str not Union[int, str] (enforced by UP007)
Only import from typing when absolutely necessary for complex generic types
For Pydantic models, use field-level type annotations

# Good
def process_items(items: list[str], max_count: int | None = None) -> dict[str, int]:
    return {item: len(item) for item in items}

# Avoid - missing type annotations
def process_items(items, max_count=None):
    return {item: len(item) for item in items}

# Avoid - old-style typing
from typing import List, Dict, Optional
def process_items(items: List[str], max_count: Optional[int] = None) -> Dict[str, int]:
    return {item: len(item) for item in items}

Import Style

ALWAYS use absolute imports, never relative imports (enforced by TID)
Place imports at module level, not inside functions (exception: a function-level import that is unavoidable for performance reasons)
Import sorting is handled by ruff's isort rules (I). Imports should be grouped and sorted:
1. Standard library imports
2. Third-party imports (use lazy_heavy_imports for heavy libraries)
3. First-party imports (data_designer)
Use standard import conventions (enforced by ICN)

# Good
from data_designer.config.config_builder import DataDesignerConfigBuilder

# Bad - relative import (will cause linter errors)
from .config_builder import DataDesignerConfigBuilder

# Good - imports at module level
from pathlib import Path

def process_file(filename: str) -> None:
    path = Path(filename)

# Bad - import inside function
def process_file(filename: str) -> None:
    from pathlib import Path
    path = Path(filename)

Lazy Loading and TYPE_CHECKING

This project uses lazy loading for heavy third-party dependencies to optimize import performance.

Heavy third-party libraries (>100ms import cost) should be lazy-loaded via lazy_heavy_imports.py:

# Don't import directly
import pandas as pd
import numpy as np

# Use lazy loading with IDE support
from typing import TYPE_CHECKING
from data_designer.lazy_heavy_imports import pd, np

if TYPE_CHECKING:
    import pandas as pd
    import numpy as np

This pattern provides:

Runtime lazy loading (fast startup)
Full IDE support (autocomplete, type hints)
Type checker validation

See lazy_heavy_imports.py for the current list of lazy-loaded libraries.
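
For intuition, one common way to get this kind of deferred loading is a module-level `__getattr__` (PEP 562) that imports a library the first time its alias is accessed. The sketch below is an assumption about the general technique, not the actual contents of lazy_heavy_imports.py. The `lazy_demo` module, the aliases, and `load_lazy_module` are hypothetical, and standard-library modules stand in for heavy third-party packages so the example runs anywhere.

```python
import importlib
import types

# Hypothetical alias -> module-name mapping. Standard-library modules stand in
# for heavy third-party packages here.
_LAZY_IMPORTS: dict[str, str] = {
    "js": "json",
    "dec": "decimal",
}


def load_lazy_module(name: str) -> types.ModuleType:
    """Import the module behind an alias the first time the alias is accessed."""
    if name not in _LAZY_IMPORTS:
        raise AttributeError(f"module 'lazy_demo' has no attribute {name!r}")
    # importlib caches modules in sys.modules, so repeated access stays cheap.
    return importlib.import_module(_LAZY_IMPORTS[name])


# Build a throwaway module whose failed attribute lookups fall back to
# load_lazy_module, which is how a module-level __getattr__ behaves.
lazy_demo = types.ModuleType("lazy_demo")
lazy_demo.__getattr__ = load_lazy_module

print(lazy_demo.js.dumps({"rows": 3}))  # {"rows": 3}
```

Nothing is imported when `lazy_demo` is created. The import cost is paid on the first attribute access.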

Adding New Heavy Dependencies

If you add a new dependency with significant import cost (>100ms):

Add to lazy_heavy_imports.py:

_LAZY_IMPORTS = {
    # ... existing entries ...
    "your_lib": "your_library_name",
}

Update imports across codebase:

from typing import TYPE_CHECKING
from data_designer.lazy_heavy_imports import your_lib

if TYPE_CHECKING:
    import your_library_name as your_lib

Verify with performance test:
```
make perf-import CLEAN=1
```
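
To get a rough idea of whether a library crosses the 100ms threshold before going through these steps, a quick estimate can be sketched as below. This is an independent approximation and not what `make perf-import` runs. Both function names are hypothetical. It times a fresh interpreter with and without the import, so results vary between runs and machines. CPython's `-X importtime` option gives a finer breakdown.

```python
import subprocess
import sys
import time


def measure_command_ms(code: str) -> float:
    """Return wall-clock milliseconds for running code in a fresh interpreter."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", code], check=True)
    return (time.perf_counter() - start) * 1000.0


def estimate_import_cost_ms(module_name: str) -> float:
    """Estimate import cost as (startup + import) minus bare interpreter startup."""
    baseline_ms = measure_command_ms("pass")
    with_import_ms = measure_command_ms(f"import {module_name}")
    # Timing noise can make the difference slightly negative, so clamp at zero.
    return max(with_import_ms - baseline_ms, 0.0)


print(f"json: ~{estimate_import_cost_ms('json'):.0f} ms")
```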

TYPE_CHECKING Rules

TYPE_CHECKING blocks defer imports that are only needed for type hints, preventing circular dependencies and reducing import time.

DO put in TYPE_CHECKING:

Internal data_designer imports used only in type hints
Imports that would cause circular dependencies
Full imports of lazy-loaded libraries for IDE support (e.g., import pandas as pd in addition to runtime from data_designer.lazy_heavy_imports import pd)

DON'T put in TYPE_CHECKING:

Standard library imports (Path, Any, Callable, Literal, TypeAlias, etc.)
Pydantic model types used in field definitions (needed at runtime for validation)
Types used in discriminated unions (Pydantic needs them at runtime)
Any import used at runtime (instantiation, method calls, base classes, etc.)

Examples:

# CORRECT - Lazy-loaded library with IDE support
from typing import TYPE_CHECKING
from data_designer.lazy_heavy_imports import pd

if TYPE_CHECKING:
    import pandas as pd

def load_data(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

# CORRECT - Standard library NOT in TYPE_CHECKING
from pathlib import Path
from typing import Any

def process_file(path: Path) -> Any:
    return path.read_text()

# CORRECT - Internal type-only import
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from data_designer.engine.models.facade import ModelFacade

def get_model(model: ModelFacade) -> str:
    return model.name

# INCORRECT - Pydantic field type in TYPE_CHECKING
from typing import TYPE_CHECKING

from pydantic import BaseModel

if TYPE_CHECKING:
    from data_designer.config.models import ModelConfig  # Wrong!

class MyConfig(BaseModel):
    model: ModelConfig  # Pydantic needs this at runtime!

# CORRECT - Pydantic field type at runtime
from pydantic import BaseModel

from data_designer.config.models import ModelConfig

class MyConfig(BaseModel):
    model: ModelConfig

Naming Conventions (PEP 8)

Functions and variables: snake_case
Classes: PascalCase
Constants: UPPER_SNAKE_CASE
Private attributes: prefix with single underscore _private_var
Function and method names must start with an action verb: e.g. get_value_from not value_from, coerce_to_int not to_int, extract_usage not usage

# Good
class DatasetGenerator:
    MAX_RETRIES = 3

    def __init__(self) -> None:
        self._cache: dict[str, str] = {}

    def generate_dataset(self, config: dict[str, str]) -> list[dict[str, str]]:
        pass

# Bad
class dataset_generator:  # Should be PascalCase
    maxRetries = 3        # Should be UPPER_SNAKE_CASE

    def GenerateDataset(self, Config):  # Should be snake_case
        pass
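
The action-verb rule can be illustrated with the three names mentioned above. Only the names come from this guide. The signatures and bodies are hypothetical and written just to make the example runnable.

```python
def get_value_from(record: dict[str, str], key: str, default: str = "") -> str:
    """Return the value stored under key, or default when the key is absent."""
    return record.get(key, default)


def coerce_to_int(text: str, default: int = 0) -> int:
    """Convert decimal text such as "42" or "-7" to an int, else return default."""
    digits = text[1:] if text.startswith("-") else text
    return int(text) if digits.isdecimal() else default


def extract_usage(response: dict[str, dict[str, int]]) -> int:
    """Return the total token count from the "usage" entry of a response."""
    usage = response.get("usage", {})
    return usage.get("prompt_tokens", 0) + usage.get("completion_tokens", 0)


print(coerce_to_int("-7"), coerce_to_int("n/a"))  # -7 0
```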

Code Organization

Public before private: Public functions/methods appear before private ones in modules and classes
Class method order: __init__ and other dunder methods first, then properties, then public methods, then private helpers. Group related method types together (e.g., all @staticmethods in one block, all @classmethods in one block).
Prefer public over private for testability: Use public functions (no _ prefix) for helpers that benefit from direct testing
Avoid nested functions: Define helpers at module level or as private methods on the class. Nested functions hide logic, make testing harder, and complicate stack traces. The only acceptable use is closures that genuinely need to capture local state.
Section comments in larger modules: Use # --- separators to delineate logical groups (e.g. image parsing, usage extraction, generic accessors)
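
A compact sketch of these rules applied together, using a hypothetical RecordBuffer class: section separators, a public module-level helper that can be tested directly, and class members ordered as dunder methods, properties, public methods, then private helpers.

```python
# --- record validation ---


def validate_record(record: dict[str, str]) -> bool:
    """Return True when the record has a non-empty "id" field."""
    return bool(record.get("id"))


# --- buffering ---


class RecordBuffer:
    """Hypothetical bounded buffer used to illustrate member ordering."""

    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        self._records: list[dict[str, str]] = []

    def __len__(self) -> int:
        return len(self._records)

    @property
    def is_full(self) -> bool:
        return len(self._records) >= self._capacity

    def add_record(self, record: dict[str, str]) -> bool:
        """Store the record and return True, or return False if it is rejected."""
        if self.is_full or not validate_record(record):
            return False
        self._records.append(self._normalize_record(record))
        return True

    def _normalize_record(self, record: dict[str, str]) -> dict[str, str]:
        # Private helper kept as a method instead of a function nested in add_record.
        return {key.lower(): value.strip() for key, value in record.items()}
```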

Pydantic Models and Dataclasses

Use Pydantic for config, validation, serialization, and schema generation. Use dataclasses for simple data containers that don't need any of that.

Pydantic Models

Config models inherit ConfigBase (from data_designer.config.base), which sets shared defaults: extra="forbid", use_enum_values=True, arbitrary_types_allowed=True.
Use Field() when you need constraints (ge, le, gt), descriptions, default_factory, discriminators, or schema control (exclude, SkipJsonSchema). Use bare defaults for simple flags and strings.
Specify validator mode explicitly (mode="before" or mode="after"). Name validators with descriptive verbs: validate_* for checks, normalize_* for canonicalization, inject_* for pre-parse dict shaping.

# Good - bare defaults for simple fields, Field() for constraints
from pydantic import Field, model_validator
from typing_extensions import Self  # typing.Self needs Python 3.11+

from data_designer.config.base import ConfigBase

class RunConfig(ConfigBase):
    disable_early_shutdown: bool = False
    shutdown_error_rate: float = Field(default=0.5, ge=0.0, le=1.0)
    buffer_size: int = Field(default=1000, gt=0)

    @model_validator(mode="after")
    def normalize_shutdown_settings(self) -> Self:
        if self.disable_early_shutdown:
            self.shutdown_error_rate = 1.0
        return self

Dataclasses

Use @dataclass for runtime data containers in the engine, CLI, and internal tooling — DTOs, concurrency primitives, task metadata. Prefer frozen=True, slots=True for immutable value types.

from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class Usage:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int

When to Choose

| Need | Use |
| --- | --- |
| Validation, serialization, JSON schema | Pydantic (`ConfigBase` or `BaseModel`) |
| Typed struct with no validation | `@dataclass` |
| Immutable value object | `@dataclass(frozen=True, slots=True)` |
| Dict-shaped data (e.g., trace JSON) | `TypedDict` |
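
For the TypedDict row, a minimal sketch might look like the following. The TraceEvent shape and the `sum_durations` helper are hypothetical. The point is that the values stay plain dicts at runtime while type checkers still know the keys and value types.

```python
from typing import TypedDict


class TraceEvent(TypedDict):
    """Hypothetical shape of one entry in a trace JSON payload."""

    name: str
    duration_ms: float


def sum_durations(events: list[TraceEvent]) -> float:
    """Return the total duration in milliseconds across all events."""
    return sum(event["duration_ms"] for event in events)


events: list[TraceEvent] = [
    {"name": "load", "duration_ms": 12.5},
    {"name": "generate", "duration_ms": 30.0},
]
print(sum_durations(events))  # 42.5
```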

Design Principles

DRY

Extract shared logic into pure helper functions rather than duplicating across similar call sites
Rule of thumb: tolerate duplication until the third occurrence, then extract

KISS

Prefer flat, obvious code over clever abstractions: two similar lines are better than a premature helper
When in doubt between DRY and KISS, favor readability over deduplication

YAGNI

Don't add parameters, config, or abstraction layers for hypothetical future use cases
Don't generalize until the third caller appears

SOLID

Wrap third-party exceptions at module boundaries — callers depend on canonical error types, not leaked internals
Use Protocol for contracts between layers
One function, one job — separate logic from I/O
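
The last two points can be sketched together. Here a Protocol describes what an upper layer needs from a lower one, and the counting logic is kept separate from the code that performs the read. All names (TextSource, InMemorySource, and both functions) are hypothetical.

```python
from typing import Protocol


class TextSource(Protocol):
    """Contract a lower layer must satisfy; no inheritance is required."""

    def read_text(self) -> str: ...


class InMemorySource:
    """Hypothetical implementation that satisfies TextSource structurally."""

    def __init__(self, text: str) -> None:
        self._text = text

    def read_text(self) -> str:
        return self._text


def count_nonblank_lines(text: str) -> int:
    """Pure logic: no I/O, so it can be tested with plain strings."""
    return sum(1 for line in text.splitlines() if line.strip())


def count_source_lines(source: TextSource) -> int:
    """I/O boundary: read from the source, then delegate to the pure helper."""
    return count_nonblank_lines(source.read_text())


print(count_source_lines(InMemorySource("a\n\nb\n")))  # 2
```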

Error Handling

Prefer specific exception types over bare except. Never catch Exception or BaseException without re-raising.
Wrap third-party exceptions at module boundaries into canonical data_designer error types (see data_designer.errors, data_designer.interface.errors).
Don't use exceptions for control flow — check conditions explicitly instead.
Re-raise with context so the original traceback is preserved:

# Good
try:
    response = client.chat(messages)
except httpx.HTTPStatusError as exc:
    raise ModelClientError(f"LLM request failed: {exc.response.status_code}") from exc

# Bad - drops the status code and does not chain the original exception explicitly
try:
    response = client.chat(messages)
except httpx.HTTPStatusError as exc:
    raise ModelClientError("LLM request failed")
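
The same wrapping pattern in a self-contained form, as a sketch: the standard-library json module stands in for a third-party client, and ConfigParseError is a hypothetical canonical error type, not one defined in data_designer.errors. Raising with `from exc` keeps the original exception reachable through `__cause__`.

```python
import json


class ConfigParseError(Exception):
    """Hypothetical canonical error type exposed to callers of this module."""


def parse_config_text(text: str) -> dict[str, object]:
    """Parse JSON config text, wrapping decoder errors in ConfigParseError."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ConfigParseError(f"Invalid config JSON at line {exc.lineno}: {exc.msg}") from exc
    # Explicit check instead of relying on a later failure to signal a bad shape.
    if not isinstance(parsed, dict):
        raise ConfigParseError("Config must be a JSON object")
    return parsed


try:
    parse_config_text("{not json}")
except ConfigParseError as error:
    # The original decoder error stays attached as the explicit cause.
    print(type(error.__cause__).__name__)  # JSONDecodeError
```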

Common Pitfalls to Avoid

Mutable default arguments:

# Bad
def add_item(item: str, items: list[str] = []) -> list[str]:
    items.append(item)
    return items

# Good
def add_item(item: str, items: list[str] | None = None) -> list[str]:
    if items is None:
        items = []
    items.append(item)
    return items

Unused imports and variables:

# Bad
from pathlib import Path
from typing import Any  # Not used

def process() -> None:
    pass

# Good
from pathlib import Path

def process() -> None:
    pass

Simplify code where possible (SIM rules; not yet enforced by CI but code should comply):

# Bad
if condition:
    return True
else:
    return False

# Good
return condition

Use comprehensions properly:

# Bad
list([x for x in items])  # Unnecessary list() call

# Good
[x for x in items]

Proper return statements:

# Bad - unnecessary else after return
def get_value(condition: bool) -> str:
    if condition:
        return "yes"
    else:
        return "no"

# Good
def get_value(condition: bool) -> str:
    if condition:
        return "yes"
    return "no"

Active Linter Rules

The following ruff linter rules are currently enabled (see pyproject.toml):

W: pycodestyle warnings
F: pyflakes (unused imports, undefined names)
I: isort (import sorting)
ICN: flake8-import-conventions (standard import names)
PIE: flake8-pie (miscellaneous lints)
TID: flake8-tidy-imports (bans relative imports)
UP006: List[A] -> list[A]
UP007: Union[A, B] -> A | B
UP045: Optional[A] -> A | None

Note: Additional rules (E, N, ANN, B, C4, DTZ, RET, SIM, PTH) are commented out but may be enabled in the future. Write code that would pass these checks for future-proofing.
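
As a rough illustration of code that would already satisfy a few of the not-yet-enabled rule families (PTH, DTZ, C4), consider the sketch below. The function names are hypothetical, and the mapping from each function to a rule family reflects how those families are generally understood rather than this project's configuration.

```python
from datetime import datetime, timezone
from pathlib import Path


def build_output_path(base_dir: Path, run_name: str) -> Path:
    """Join path segments with pathlib instead of os.path (PTH)."""
    return base_dir / "runs" / f"{run_name}.json"


def get_utc_timestamp() -> str:
    """Return a timezone-aware ISO 8601 timestamp (DTZ)."""
    return datetime.now(tz=timezone.utc).isoformat()


def collect_even_squares(values: list[int]) -> list[int]:
    """Build the list with a comprehension instead of list(generator) (C4)."""
    return [value * value for value in values if value % 2 == 0]


print(collect_even_squares([1, 2, 3, 4]))  # [4, 16]
```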
