CLAUDE.md

This file provides guidance to Claude Code (claude.ai/claude-code) when working with this repository.

Project Overview

GoodData Export is a Python library for exporting GoodData workspace metadata to SQLite databases and CSV files. It fetches metrics, dashboards, visualizations, and LDM (Logical Data Model) information from the GoodData API and stores them locally for analysis.

The library supports two modes:

  1. API mode (default): Fetches data from GoodData API
  2. Local mode: Processes local layout.json files without API calls (useful for tagging workflows on feature branches)
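
A sketch of how the two modes might be invoked (export_all_metadata and ExportConfig are real names from the architecture below, but the exact signatures here are assumptions):

import json

from gooddata_export.config import ExportConfig
from gooddata_export.export import export_all_metadata

config = ExportConfig()  # assumed constructor; reads .env.gdcloud values

# API mode (default): fetch everything from the GoodData API
export_all_metadata(config)

# Local mode: process a local layout.json without any API calls
with open("layout.json") as f:
    layout = json.load(f)
export_all_metadata(config, layout_json=layout)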

This is a public package. When making changes:

  1. Bump the version in pyproject.toml (single source of truth):

    version = "1.0.0"  # Increment appropriately

    Note: __version__ in __init__.py is derived automatically via importlib.metadata (see the sketch after this list)

  2. Update CHANGELOG.md with the changes (follow Keep a Changelog format)

  3. A git tag vX.Y.Z is auto-created when the PR merges to main (via create_tag.yml workflow). To preview locally: python scripts/create_tag.py --dry-run
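
For reference, the derivation mentioned in step 1 typically looks like this in __init__.py (a standard pattern; the distribution name "gooddata-export" is an assumption):

from importlib.metadata import version

# pyproject.toml stays the single source of truth; the version is read
# from the installed package metadata at import time.
__version__ = version("gooddata-export")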

Security Considerations

This is a public package. Before committing:

  • Never commit .env* files (already in .gitignore)
  • Never include real API tokens, workspace IDs, or customer data in code/tests
  • Use mock data or placeholders in examples and tests
  • Review diffs for accidentally exposed credentials or PII

Key Commands

# Full workflow: export + enrichment
make run              # or: make export-enrich

# Export only (skip post-processing)
make export

# Enrichment only (on existing database)
make enrich
make enrich DB=output/db/custom.db

# Run with CLI directly
gooddata-export export
gooddata-export enrich --db-path output/db/gooddata_export.db

# Run tests
pytest

# Format code
make ruff-format

Architecture

Core Components

gooddata_export/
├── __init__.py          # Public API exports
├── config.py            # ExportConfig class, environment loading
├── constants.py         # Shared constants (DEFAULT_DB_NAME, worker limits)
├── common.py            # API client utilities (get_api_client, create_api_session)
├── db.py                # SQLite database utilities
├── post_export.py       # Post-processing orchestration, topological sort
├── export/              # Export module (orchestration, fetching, writing)
│   ├── __init__.py      # Main orchestration (export_all_metadata)
│   ├── fetch.py         # Data fetching functions (API calls)
│   ├── writers.py       # Database/CSV writer functions (export_*)
│   └── utils.py         # Export utilities (write_to_csv, execute_with_retry)
├── process/             # Data processing modules
│   ├── __init__.py      # Exports all process functions
│   ├── entities.py      # Entity processing (metrics, dashboards, visualizations) + fetch_child_workspaces
│   ├── layout.py        # Layout API fetching (fetch_ldm, fetch_analytics_model, fetch_users_and_user_groups)
│   ├── dashboard_traversal.py  # Dashboard widget/visualization extraction
│   ├── rich_text.py     # Rich text extraction from dashboards
│   └── common.py        # Shared utilities (sort_tags)
└── sql/                 # SQL scripts for post-export processing
    ├── post_export_config.yaml  # YAML configuration for all SQL operations
    ├── tables/          # Table creation scripts (metrics_references, etc.)
    ├── views/           # Analytical views (v_metrics_*, v_*_tags, etc.)
    ├── updates/         # Table modification scripts (duplicate detection)
    └── procedures/      # Parameterized views for API automation

scripts/
└── create_tag.py        # Auto-create git tag from pyproject.toml version (CI + manual)

Data Flow

  1. Export Phase (export/ and process/ modules)

    • API mode: Fetches from the analyticsModel endpoint (parent and child workspaces)
    • Local mode: Uses the provided layout_json directly (no API calls)
    • All data is processed in layout format (flat structure with obj["title"])
    • Stores results in SQLite tables: metrics, visualizations, dashboards, ldm_*, etc.
  2. Post-Export Phase (post_export.py)

    • Loads sql/post_export_config.yaml
    • Topologically sorts operations by dependencies
    • Executes tables → views → procedures → updates in order
    • Python populate functions run for tables needing regex (e.g., metrics_references)

Key Tables

| Table | Description |
| --- | --- |
| metrics | Metric definitions with MAQL formulas |
| visualizations | Visualization configurations |
| dashboards | Dashboard definitions |
| metrics_references | All metric references from MAQL: metrics, attributes, labels, facts (Python populates) |
| metrics_ancestry | Transitive metric-to-metric ancestry (recursive CTE) |
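
As a rough sketch of the recursive-CTE approach behind metrics_ancestry (the column names here are assumptions, not the actual schema):

import sqlite3

ANCESTRY_SQL = """
WITH RECURSIVE ancestry(metric_id, ancestor_id) AS (
    -- Direct metric-to-metric references seed the recursion
    SELECT metric_id, referenced_id
    FROM metrics_references
    WHERE referenced_type = 'metric'
    UNION
    -- Each pass walks one level further up the dependency chain;
    -- UNION deduplicates, so the recursion terminates
    SELECT a.metric_id, r.referenced_id
    FROM ancestry a
    JOIN metrics_references r
      ON r.metric_id = a.ancestor_id AND r.referenced_type = 'metric'
)
SELECT * FROM ancestry;
"""

with sqlite3.connect("output/db/gooddata_export.db") as conn:
    rows = conn.execute(ANCESTRY_SQL).fetchall()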

Key Views

| View | Purpose |
| --- | --- |
| v_metrics_relationships | Direct metric references with titles |
| v_metrics_relationships_ancestry | Full ancestry with titles/tags |
| v_metrics_relationships_root | Root metrics (no outgoing dependencies) |
| v_*_tags | Unnested tags for each entity type |
| v_*_usage | Usage tracking views |

Label Reference Validation

Both metrics and visualizations validate label references against both ldm_labels and ldm_columns (type='attribute').

Data model:

Attribute: id="region"           <- in ldm_columns (type='attribute')
  └── Label: id="region.name"    <- in ldm_labels only
  └── Label: id="region.code"    <- in ldm_labels only

Attribute: id="date.month"       <- in ldm_columns (type='attribute')
  └── Label: id="date.month"     <- in ldm_labels (shares attribute ID)

Label IDs can be:

  • Specific label IDs like region.name → only in ldm_labels
  • Attribute IDs like date.month where default label shares the ID → in ldm_columns
  • Date granularities like process_date.day → only in ldm_columns

Validation logic (same for both):

LEFT JOIN ldm_labels ll ON referenced_id = ll.id
LEFT JOIN ldm_columns lc ON referenced_id = lc.id AND lc.type = 'attribute'
WHERE ll.id IS NULL AND lc.id IS NULL  -- Invalid only if not in EITHER

This ensures any valid label reference is accepted regardless of whether it's a specific label ID or an attribute ID used as default label.
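
Put together as a runnable check, a sketch (metrics_references and its columns are assumptions beyond the JOIN logic shown above):

import sqlite3

# Label references that resolve to neither a specific label ID nor an
# attribute ID acting as the default label are invalid.
INVALID_LABELS_SQL = """
SELECT mr.metric_id, mr.referenced_id
FROM metrics_references mr
LEFT JOIN ldm_labels ll ON mr.referenced_id = ll.id
LEFT JOIN ldm_columns lc ON mr.referenced_id = lc.id AND lc.type = 'attribute'
WHERE mr.referenced_type = 'label'
  AND ll.id IS NULL AND lc.id IS NULL
"""

with sqlite3.connect("output/db/gooddata_export.db") as conn:
    invalid = conn.execute(INVALID_LABELS_SQL).fetchall()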

Configuration

Environment Variables

Create .env.gdcloud:

BASE_URL=https://your-instance.gooddata.com
WORKSPACE_ID=your_workspace_id
BEARER_TOKEN=your_api_token
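
These are consumed by ExportConfig in config.py; a minimal sketch of equivalent loading, assuming python-dotenv:

import os

from dotenv import load_dotenv

load_dotenv(".env.gdcloud")

base_url = os.environ["BASE_URL"]
workspace_id = os.environ["WORKSPACE_ID"]
bearer_token = os.environ["BEARER_TOKEN"]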

Post-Export YAML Structure

sql/post_export_config.yaml defines:

  • tables: Created tables (some with python_populate for Python processing)
  • views: Read-only analytical views
  • procedures: Parameterized views (base_url/workspace_id from dictionary_metadata CTE, only bearer_token substituted)
  • updates: Table modifications with required_columns

Each entry has:

  • sql_file: Path to SQL file
  • dependencies: List of items that must run first
  • category: Grouping (tagging/usage/deduplication/procedures)
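
To make the shape concrete, a sketch of reading the config directly (the real loader is load_post_export_config in post_export.py; the path is an assumption):

import yaml

with open("gooddata_export/sql/post_export_config.yaml") as f:
    config = yaml.safe_load(f)

for section in ("tables", "views", "procedures", "updates"):
    for name, entry in config.get(section, {}).items():
        sql_file = entry["sql_file"]          # path relative to sql/
        deps = entry.get("dependencies", [])  # items that must run first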

Common Patterns

Adding a New View

  1. Create SQL file in sql/views/v_your_view.sql
  2. Add to sql/post_export_config.yaml:
views:
  v_your_view:
    sql_file: views/v_your_view.sql
    description: What this view does
    category: usage
    dependencies: []  # or list dependencies

Adding a Table with Python Processing

  1. Create SQL file in sql/tables/your_table.sql (structure only)
  2. Add Python function in post_export.py
  3. Register in PYTHON_POPULATE_FUNCTIONS dict
  4. Add to YAML with python_populate: your_function_name
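
A sketch of steps 2-3 (the function signature, the metrics columns, and the MAQL regex are illustrative assumptions):

import re
import sqlite3

# Hypothetical pattern for metric references in MAQL, e.g. {metric/revenue}
METRIC_REF_RE = re.compile(r"\{metric/([^}]+)\}")

def populate_your_table(conn: sqlite3.Connection) -> None:
    """Extract metric references from MAQL: regex work SQL can't do."""
    rows = conn.execute("SELECT id, maql FROM metrics").fetchall()
    for metric_id, maql in rows:
        for ref in METRIC_REF_RE.findall(maql or ""):
            conn.execute(
                "INSERT INTO your_table (metric_id, referenced_id) VALUES (?, ?)",
                (metric_id, ref),
            )

# Step 3: register it so python_populate: populate_your_table resolves
PYTHON_POPULATE_FUNCTIONS = {"populate_your_table": populate_your_table}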

Adding an Update

Updates modify existing tables during post-export processing.

  1. Create SQL file in sql/updates/your_update.sql
  2. Add to sql/post_export_config.yaml:
updates:
  your_update:
    sql_file: updates/your_update.sql
    description: What this update does
    category: usage
    table: target_table_name
    dependencies: []
    required_columns:
      new_column: INTEGER DEFAULT 0  # Columns to add if missing
  3. Important: Include the {parent_workspace_filter} placeholder in WHERE clauses:
-- Pattern 1: When you have no other conditions
UPDATE metrics
SET some_column = value
WHERE 1=1 {parent_workspace_filter};

-- Pattern 2: When you have existing conditions
UPDATE metrics
SET some_column = value
WHERE is_valid IS NULL {parent_workspace_filter};

This placeholder is replaced at runtime:

  • Multi-workspace exports: AND workspace_id = 'parent_ws_id' (only updates parent workspace)
  • Single-workspace exports: empty string (updates all rows)
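
The substitution itself is a simple string replacement; a sketch (the function name is illustrative; post_export.py owns the real logic):

def apply_parent_workspace_filter(sql: str, parent_workspace_id: str | None) -> str:
    """Fill the {parent_workspace_filter} placeholder before execution."""
    if parent_workspace_id:  # multi-workspace export: restrict to the parent
        replacement = f"AND workspace_id = '{parent_workspace_id}'"
    else:  # single-workspace export: the placeholder collapses to nothing
        replacement = ""
    return sql.replace("{parent_workspace_filter}", replacement)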

Dependency Management

  • Dependencies are resolved via topological sort (Kahn's algorithm)
  • Circular dependencies will raise ValueError
  • Items without dependencies execute in alphabetical order
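
For reference, a minimal sketch of this ordering (the real implementation lives in post_export.py and may differ in detail):

def topo_sort(dependencies: dict[str, list[str]]) -> list[str]:
    """Kahn's algorithm with an alphabetical tie-break; cycles raise ValueError."""
    indegree = {name: len(deps) for name, deps in dependencies.items()}
    dependents: dict[str, list[str]] = {name: [] for name in dependencies}
    for name, deps in dependencies.items():
        for dep in deps:  # assumes every dependency is itself a config entry
            dependents[dep].append(name)
    ready = sorted(name for name, degree in indegree.items() if degree == 0)
    order: list[str] = []
    while ready:
        name = ready.pop(0)  # alphabetical among currently-ready items
        order.append(name)
        for child in dependents[name]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
        ready.sort()
    if len(order) != len(dependencies):
        raise ValueError("Circular dependency detected")
    return order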

Export Function Interface

All export_* functions in export/writers.py share the same signature for uniform orchestration:

def export_something(all_workspace_data, export_dir, config, db_name):

This allows export/__init__.py to call them in a loop:

for export_func in export_functions:
    export_func(all_workspace_data, export_dir, config, db_path)

Important: Some functions don't use every parameter (e.g., export_dashboards_permissions doesn't use config). Prefix unused parameters with an underscore (_config) and note in the docstring why they are kept. Don't remove unused parameters; that would break the uniform interface.
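
For example, a sketch following the convention (the body is elided; only the signature matters):

def export_dashboards_permissions(all_workspace_data, export_dir, _config, db_name):
    """Export dashboard permission rows.

    _config is unused here but kept so every export_* function shares the
    uniform (all_workspace_data, export_dir, config, db_name) signature.
    """
    ...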

Testing Changes

# Test imports work
python3 -c "from gooddata_export.post_export import load_post_export_config; print(load_post_export_config())"

# Run enrichment on existing DB to test SQL changes
make enrich

# Full export + enrich
make export-enrich

Code Style

Python Formatting (Ruff)

Python files must be formatted and linted with Ruff after changes:

make ruff-format
# or directly: ruff check --fix . && ruff format .

Type Hints (Modern Syntax)

This project targets Python 3.14+. Use built-in generics and the | union syntax; List, Dict, Optional, and Union from typing are never needed:

def process(items: list[str], config: dict[str, int] | None = None) -> set[str]: ...
def get_class() -> type[MyClass]: ...
def fetch(id: str | int) -> tuple[str, bool]: ...

Only import from typing: Any, Never, TypeVar, TYPE_CHECKING, Protocol, Literal, TypedDict

Exception syntax: Python 3.14 allows except A, B: without parentheses (equivalent to except (A, B):). Both forms are valid.

Avoiding Over-Engineering

Don't consolidate every repeated pattern. Small, simple duplications (2-3 lines appearing a few times) are often clearer than adding another abstraction layer. Consolidate when the pattern is complex (5+ lines), appears in many places (5+), or requires consistent behavior that might need updating.

Logging (Use logger Instead of print)

Use Python's logging module instead of print() for all output:

import logging

logger = logging.getLogger(__name__)

# Use logger methods instead of print()
logger.info("Processing workspace %s", workspace_id)    # Not: print(f"Processing workspace {workspace_id}")
logger.warning("Could not fetch data: %s", error)       # Not: print(f"Warning: Could not fetch data: {error}")
logger.debug("Debug info: %s", details)                 # For debug-only output

Benefits:

  • Consistent output format across the codebase
  • Log levels allow filtering (INFO, WARNING, DEBUG, ERROR)
  • Easier to redirect output to files or external logging systems
  • %s formatting (not f-strings) defers interpolation until the message is actually logged

SQL Style

  • SQL files use DROP ... IF EXISTS then CREATE
  • SQL comments explain purpose at top of file
  • Table naming convention: Use plural form for grouping
    • Main tables: dashboards, metrics, visualizations
    • Junction tables: dashboards_visualizations, dashboards_metrics, dashboards_permissions
  • View naming convention: v_{table_plural}_{suffix} - views are grouped by table name
    • v_dashboards_tags (dashboards group)
    • v_metrics_tags, v_metrics_usage, v_metrics_relationships (metrics group)
    • v_visualizations_tags, v_visualizations_usage (visualizations group)