This document is the authoritative reference for code style, naming, type annotations, import patterns, and design principles in DataDesigner. It is extracted from the project's coding standards and enforced by ruff (>=0.14.10).
For architectural invariants and project identity, see AGENTS.md. For development workflow and testing, see DEVELOPMENT.md.
- Line length: Maximum 120 characters per line
- Quote style: Always use double quotes (`"`) for strings
- Indentation: Use 4 spaces (never tabs)
- String formatting: Prefer f-strings. Avoid `.format()` and `%` formatting.
- Target version: Python 3.10+
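A quick sketch of the string-formatting rule (the variable name is arbitrary):

```python
name = "DataDesigner"

# Preferred: f-string
greeting = f"Welcome to {name}"

# Avoid: .format() and % formatting
greeting_fmt = "Welcome to {}".format(name)
greeting_pct = "Welcome to %s" % name
```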
All Python files must include the NVIDIA SPDX license header:
```python
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
```

Use `make update-license-headers` to add headers to all files automatically.
Include `from __future__ import annotations` at the top of every Python source file (after the license header) for deferred type evaluation.
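A minimal sketch of why the future import matters: with deferred evaluation, annotations are stored as strings, so a name can be referenced in a hint before it is defined (the `Record` type below is illustrative):

```python
from __future__ import annotations


def first_record(records: list[Record]) -> Record | None:
    # Record is referenced before its definition; deferred evaluation makes this legal
    return records[0] if records else None


class Record:
    pass
```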
Only insert comments when code is especially important to understand. For basic code blocks, comments aren't necessary. We want readable code without vacuous comments.
Use Google style docstrings (Args:, Returns:, Raises:).
- Public API classes and functions get docstrings. Use a one-liner for simple functions; add Google sections for anything with non-obvious parameters or behavior.
- Private helpers (`_`-prefixed) don't need docstrings unless the logic is non-obvious.
- Don't restate the signature — the docstring should explain why or what, not repeat the parameter names and types that are already in the annotation.
- Pydantic config classes use `Attributes:` and `Inherited Attributes:` sections to document fields.
- Module docstrings are optional — use a one-liner after the license header when the module's purpose isn't obvious from its name.
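A sketch of the `Attributes:` / `Inherited Attributes:` docstring shape for a config class (a plain class stands in for `ConfigBase`, which isn't importable here; the field names are hypothetical):

```python
class SamplerParams:
    """Parameters controlling sampling behavior.

    Attributes:
        temperature: Sampling temperature; higher values increase diversity.
        top_p: Nucleus sampling probability threshold.

    Inherited Attributes:
        name: Unique name identifying this config block.
    """

    temperature: float = 0.8
    top_p: float = 0.95
```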
```python
# Good - Google style with sections
def compile_config(config: DataDesignerConfig, provider: ResourceProvider) -> DataDesignerConfig:
    """Compile a raw config into an executable form.

    Resolves seed columns, adds internal IDs, and validates the result.

    Args:
        config: The user-provided configuration to compile.
        provider: Resource provider for seed dataset resolution.

    Returns:
        The compiled configuration ready for execution.

    Raises:
        ConfigValidationError: If the configuration is invalid after compilation.
    """


# Good - one-liner for simple functions
def get_column_names(config: DataDesignerConfig) -> list[str]:
    """Return the names of all columns in the config."""


# Bad - restates the signature
def get_column_names(config: DataDesignerConfig) -> list[str]:
    """Get column names from a DataDesignerConfig and return them as a list of strings."""
```

Type annotations are REQUIRED for all code in this project. This is strictly enforced for code quality and maintainability. Modern type syntax is enforced by ruff rules UP006, UP007, and UP045.
- ALWAYS add type annotations to all functions, methods, and class attributes (including tests)
- Use primitive types when possible: `list` not `List`, `dict` not `Dict`, `set` not `Set`, `tuple` not `Tuple` (enforced by `UP006`)
- Use modern union syntax with `|` for optional and union types: `str | None` not `Optional[str]` (enforced by `UP045`); `int | str` not `Union[int, str]` (enforced by `UP007`)
- Only import from `typing` when absolutely necessary for complex generic types
- For Pydantic models, use field-level type annotations
```python
# Good
def process_items(items: list[str], max_count: int | None = None) -> dict[str, int]:
    return {item: len(item) for item in items}


# Avoid - missing type annotations
def process_items(items, max_count=None):
    return {item: len(item) for item in items}


# Avoid - old-style typing
from typing import List, Dict, Optional

def process_items(items: List[str], max_count: Optional[int] = None) -> Dict[str, int]:
    return {item: len(item) for item in items}
```

- ALWAYS use absolute imports, never relative imports (enforced by `TID`)
- Place imports at module level, not inside functions (exception: unavoidable for performance reasons)
- Import sorting is handled by `ruff`'s `isort` — imports should be grouped and sorted:
  - Standard library imports
  - Third-party imports (use `lazy_heavy_imports` for heavy libraries)
  - First-party imports (`data_designer`)
- Use standard import conventions (enforced by `ICN`)
```python
# Good
from data_designer.config.config_builder import DataDesignerConfigBuilder

# Bad - relative import (will cause linter errors)
from .config_builder import DataDesignerConfigBuilder


# Good - imports at module level
from pathlib import Path

def process_file(filename: str) -> None:
    path = Path(filename)


# Bad - import inside function
def process_file(filename: str) -> None:
    from pathlib import Path

    path = Path(filename)
```

This project uses lazy loading for heavy third-party dependencies to optimize import performance.
Heavy third-party libraries (>100ms import cost) should be lazy-loaded via lazy_heavy_imports.py:
```python
# Don't import directly
import pandas as pd
import numpy as np

# Use lazy loading with IDE support
from typing import TYPE_CHECKING

from data_designer.lazy_heavy_imports import pd, np

if TYPE_CHECKING:
    import pandas as pd
    import numpy as np
```

This pattern provides:
- Runtime lazy loading (fast startup)
- Full IDE support (autocomplete, type hints)
- Type checker validation
See lazy_heavy_imports.py for the current list of lazy-loaded libraries.
If you add a new dependency with significant import cost (>100ms):
1. Add to `lazy_heavy_imports.py`:

   ```python
   _LAZY_IMPORTS = {
       # ... existing entries ...
       "your_lib": "your_library_name",
   }
   ```

2. Update imports across the codebase:

   ```python
   from typing import TYPE_CHECKING

   from data_designer.lazy_heavy_imports import your_lib

   if TYPE_CHECKING:
       import your_library_name as your_lib
   ```

3. Verify with the performance test:

   ```shell
   make perf-import CLEAN=1
   ```
TYPE_CHECKING blocks defer imports that are only needed for type hints, preventing circular dependencies and reducing import time.
DO put in TYPE_CHECKING:
- Internal `data_designer` imports used only in type hints
- Imports that would cause circular dependencies
- Full imports of lazy-loaded libraries for IDE support (e.g., `import pandas as pd` in addition to runtime `from data_designer.lazy_heavy_imports import pd`)
DON'T put in TYPE_CHECKING:
- Standard library imports (`Path`, `Any`, `Callable`, `Literal`, `TypeAlias`, etc.)
- Pydantic model types used in field definitions (needed at runtime for validation)
- Types used in discriminated unions (Pydantic needs them at runtime)
- Any import used at runtime (instantiation, method calls, base classes, etc.)
Examples:
```python
# CORRECT - Lazy-loaded library with IDE support
from typing import TYPE_CHECKING

from data_designer.lazy_heavy_imports import pd

if TYPE_CHECKING:
    import pandas as pd

def load_data(path: str) -> pd.DataFrame:
    return pd.read_csv(path)


# CORRECT - Standard library NOT in TYPE_CHECKING
from pathlib import Path
from typing import Any

def process_file(path: Path) -> Any:
    return path.read_text()


# CORRECT - Internal type-only import
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from data_designer.engine.models.facade import ModelFacade

def get_model(model: ModelFacade) -> str:
    return model.name


# INCORRECT - Pydantic field type in TYPE_CHECKING
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from data_designer.config.models import ModelConfig  # Wrong!

class MyConfig(BaseModel):
    model: ModelConfig  # Pydantic needs this at runtime!


# CORRECT - Pydantic field type at runtime
from data_designer.config.models import ModelConfig

class MyConfig(BaseModel):
    model: ModelConfig
```

- Functions and variables: `snake_case`
- Classes: `PascalCase`
- Constants: `UPPER_SNAKE_CASE`
- Private attributes: prefix with a single underscore, e.g. `_private_var`
- Function and method names must start with an action verb: e.g. `get_value_from` not `value_from`, `coerce_to_int` not `to_int`, `extract_usage` not `usage`
```python
# Good
class DatasetGenerator:
    MAX_RETRIES = 3

    def __init__(self) -> None:
        self._cache: dict[str, str] = {}

    def generate_dataset(self, config: dict[str, str]) -> list[dict[str, str]]:
        pass


# Bad
class dataset_generator:  # Should be PascalCase
    maxRetries = 3  # Should be UPPER_SNAKE_CASE

    def GenerateDataset(self, Config):  # Should be snake_case
        pass
```

- Public before private: Public functions/methods appear before private ones in modules and classes
- Class method order: `__init__` and other dunder methods first, then properties, then public methods, then private helpers. Group related method types together (e.g., all `@staticmethod`s in one block, all `@classmethod`s in one block).
- Prefer public over private for testability: Use public functions (no `_` prefix) for helpers that benefit from direct testing
- Avoid nested functions: Define helpers at module level or as private methods on the class. Nested functions hide logic, make testing harder, and complicate stack traces. The only acceptable use is closures that genuinely need to capture local state.
- Section comments in larger modules: Use `# ---` separators to delineate logical groups (e.g. image parsing, usage extraction, generic accessors)
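The ordering rules above can be sketched in a single class (the names below are illustrative, not taken from the codebase):

```python
class UsageTracker:
    """Track per-model token usage."""

    # Dunder methods first
    def __init__(self) -> None:
        self._totals: dict[str, int] = {}

    # Properties next
    @property
    def model_names(self) -> list[str]:
        return sorted(self._totals)

    # Public methods before private helpers
    def record_usage(self, model: str, tokens: int) -> None:
        self._totals[model] = self._get_total_for(model) + tokens

    # Private helpers last
    def _get_total_for(self, model: str) -> int:
        return self._totals.get(model, 0)
```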
Pydantic for config, validation, serialization, and schema generation. Dataclasses for simple data containers that don't need any of that.
- Config models inherit `ConfigBase` (from `data_designer.config.base`), which sets shared defaults: `extra="forbid"`, `use_enum_values=True`, `arbitrary_types_allowed=True`.
- Use `Field()` when you need constraints (`ge`, `le`, `gt`), descriptions, `default_factory`, discriminators, or schema control (`exclude`, `SkipJsonSchema`). Use bare defaults for simple flags and strings.
- Specify validator `mode` explicitly (`mode="before"` or `mode="after"`). Name validators with descriptive verbs: `validate_*` for checks, `normalize_*` for canonicalization, `inject_*` for pre-parse dict shaping.
```python
# Good - bare defaults for simple fields, Field() for constraints
class RunConfig(ConfigBase):
    disable_early_shutdown: bool = False
    shutdown_error_rate: float = Field(default=0.5, ge=0.0, le=1.0)
    buffer_size: int = Field(default=1000, gt=0)

    @model_validator(mode="after")
    def normalize_shutdown_settings(self) -> Self:
        if self.disable_early_shutdown:
            self.shutdown_error_rate = 1.0
        return self
```

Use `@dataclass` for runtime data containers in the engine, CLI, and internal tooling — DTOs, concurrency primitives, task metadata. Prefer `frozen=True, slots=True` for immutable value types.
```python
@dataclass(frozen=True, slots=True)
class Usage:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
```

| Need | Use |
|---|---|
| Validation, serialization, JSON schema | Pydantic (`ConfigBase` or `BaseModel`) |
| Typed struct with no validation | `@dataclass` |
| Immutable value object | `@dataclass(frozen=True, slots=True)` |
| Dict-shaped data (e.g., trace JSON) | `TypedDict` |
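For the `TypedDict` row, a minimal sketch (the trace-event shape below is hypothetical):

```python
from typing import TypedDict


class TraceEvent(TypedDict):
    step: str
    duration_ms: float


# Plain dict at runtime; the type checker verifies keys and value types
event: TraceEvent = {"step": "generate_column", "duration_ms": 12.5}
```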
DRY
- Extract shared logic into pure helper functions rather than duplicating across similar call sites
- Rule of thumb: tolerate duplication until the third occurrence, then extract
KISS
- Prefer flat, obvious code over clever abstractions — two similar lines is better than a premature helper
- When in doubt between DRY and KISS, favor readability over deduplication
YAGNI
- Don't add parameters, config, or abstraction layers for hypothetical future use cases
- Don't generalize until the third caller appears
SOLID
- Wrap third-party exceptions at module boundaries — callers depend on canonical error types, not leaked internals
- Use `Protocol` for contracts between layers
- One function, one job — separate logic from I/O
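A sketch of a `Protocol`-based contract between layers (the interface and classes below are hypothetical, not actual DataDesigner types):

```python
from typing import Protocol


class ChatClient(Protocol):
    def chat(self, messages: list[str]) -> str: ...


class EchoClient:
    """Satisfies ChatClient structurally, with no inheritance required."""

    def chat(self, messages: list[str]) -> str:
        return messages[-1]


def run_prompt(client: ChatClient, prompt: str) -> str:
    # Depends only on the Protocol, not on any concrete client class
    return client.chat([prompt])
```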
- Prefer specific exception types over bare `except`. Never catch `Exception` or `BaseException` without re-raising.
- Wrap third-party exceptions at module boundaries into canonical `data_designer` error types (see `data_designer.errors`, `data_designer.interface.errors`).
- Don't use exceptions for control flow — check conditions explicitly instead.
- Re-raise with context so the original traceback is preserved:
```python
# Good
try:
    response = client.chat(messages)
except httpx.HTTPStatusError as exc:
    raise ModelClientError(f"LLM request failed: {exc.response.status_code}") from exc

# Bad - swallows the original traceback
try:
    response = client.chat(messages)
except httpx.HTTPStatusError as exc:
    raise ModelClientError("LLM request failed")
```
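The "no exceptions for control flow" rule in practice, as a minimal sketch (the config shape is illustrative):

```python
# Bad - uses KeyError as control flow
def get_buffer_size(config: dict[str, int]) -> int:
    try:
        return config["buffer_size"]
    except KeyError:
        return 1000


# Good - explicit check instead of catching an exception
def get_buffer_size(config: dict[str, int]) -> int:
    return config.get("buffer_size", 1000)
```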
- Mutable default arguments:

  ```python
  # Bad
  def add_item(item: str, items: list[str] = []) -> list[str]:
      items.append(item)
      return items

  # Good
  def add_item(item: str, items: list[str] | None = None) -> list[str]:
      if items is None:
          items = []
      items.append(item)
      return items
  ```

- Unused imports and variables:

  ```python
  # Bad
  from pathlib import Path
  from typing import Any  # Not used

  def process() -> None:
      pass

  # Good
  from pathlib import Path

  def process() -> None:
      pass
  ```

- Simplify code where possible (`SIM` rules; not yet enforced by CI but code should comply):

  ```python
  # Bad
  if condition:
      return True
  else:
      return False

  # Good
  return condition
  ```

- Use comprehensions properly:

  ```python
  # Bad
  list([x for x in items])  # Unnecessary list() call

  # Good
  [x for x in items]
  ```

- Proper return statements:

  ```python
  # Bad - unnecessary else after return
  def get_value(condition: bool) -> str:
      if condition:
          return "yes"
      else:
          return "no"

  # Good
  def get_value(condition: bool) -> str:
      if condition:
          return "yes"
      return "no"
  ```
The following ruff linter rules are currently enabled (see pyproject.toml):
- `W`: pycodestyle warnings
- `F`: pyflakes (unused imports, undefined names)
- `I`: isort (import sorting)
- `ICN`: flake8-import-conventions (standard import names)
- `PIE`: flake8-pie (miscellaneous lints)
- `TID`: flake8-tidy-imports (bans relative imports)
- `UP006`: `List[A]` -> `list[A]`
- `UP007`: `Union[A, B]` -> `A | B`
- `UP045`: `Optional[A]` -> `A | None`
Note: Additional rules (`E`, `N`, `ANN`, `B`, `C4`, `DTZ`, `RET`, `SIM`, `PTH`) are commented out but may be enabled in the future. Write code that would pass these checks for future-proofing.