
GoodData Export

A Python library for exporting GoodData workspace metadata to SQLite databases and CSV files.

Features

  • Multiple Export Formats: Export to SQLite, CSV, or both
  • Multi-Workspace Support: Process parent and child workspaces in parallel
  • Local Layout JSON Support: Process local JSON-based layout files without API calls
  • Flexible Configuration: Configure via Python API or environment variables
  • Post-Processing: Automatic duplicate detection and relationship analysis
  • Rich Text Extraction: Optional extraction of metrics/insights from dashboard rich text widgets
  • Standalone: Zero Flask dependencies - pure Python library
  • Clean CSV Exports: Automatically clears CSV directory before each export to prevent stale data

Installation

From Git

# With uv
uv pip install git+https://github.com/vondravl/gooddata-export.git

# With pip
pip install git+https://github.com/vondravl/gooddata-export.git

From source (local development)

git clone https://github.com/vondravl/gooddata-export.git
cd gooddata-export

# With uv (recommended)
uv sync

# With pip
pip install -e ".[dev]"

Quick Start

Command Line Interface

  1. Create a .env.gdcloud configuration file:
BASE_URL=https://your-instance.gooddata.com
WORKSPACE_ID=your_workspace_id
BEARER_TOKEN=your_api_token
  2. Run the export:
# Basic export (both SQLite and CSV)
gooddata-export export

# Export only SQLite (fastest)
gooddata-export export --format sqlite

# Export with child workspaces
gooddata-export export --include-child-workspaces --max-workers 10

# Custom directories
gooddata-export export --db-dir my_databases --csv-dir my_csvs

# Enable debug mode
gooddata-export export --debug

# Run enrichment on existing database
gooddata-export enrich --db-path output/db/gooddata_export.db

# Get help
gooddata-export --help

Python API

from gooddata_export import export_metadata

result = export_metadata(
    base_url="https://your-instance.gooddata.com",
    workspace_id="your_workspace_id",
    bearer_token="your_api_token"
)

print(f"Database created at: {result['db_path']}")  # output/db/gooddata_export.db
print(f"CSV files in: {result['csv_dir']}")  # output/metadata_csv/
print(f"Processed {result['workspace_count']} workspace(s)")

Using Environment Variables (Python API)

Create a .env.gdcloud file:

BASE_URL=https://your-instance.gooddata.com
WORKSPACE_ID=your_workspace_id
BEARER_TOKEN=your_api_token

Then in Python:

from gooddata_export.config import ExportConfig
from gooddata_export.export import export_all_metadata

# Load config from .env files
config = ExportConfig(load_from_env=True)

result = export_all_metadata(
    config=config,
    output_dir="output"
)
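For illustration, a .env-style file of this shape can be parsed with a few lines of standard-library Python. This is a minimal sketch of the KEY=VALUE convention shown above, not the library's own loader (ExportConfig may handle quoting, comments, or multiple files differently):

```python
def parse_env_file(text: str) -> dict[str, str]:
    """Minimal KEY=VALUE parser for a .env-style file (illustrative only)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and comment lines; keep only KEY=VALUE pairs
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values

sample = "BASE_URL=https://your-instance.gooddata.com\nWORKSPACE_ID=your_workspace_id"
print(parse_env_file(sample)["WORKSPACE_ID"])  # your_workspace_id
```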

CLI Options

Commands

  • gooddata-export export - Export metadata from GoodData
  • gooddata-export enrich - Run post-export enrichment on existing database

Connection Options

  • --base-url URL - GoodData API base URL (overrides .env.gdcloud)
  • --workspace-id ID - Workspace ID to export (overrides .env.gdcloud)
  • --bearer-token TOKEN - API authentication token (overrides .env.gdcloud)

Export Configuration

  • --db-dir DIR - Directory for SQLite database files (default: output/db)
  • --csv-dir DIR - Directory for CSV export files (default: output/metadata_csv)
  • --format {sqlite,csv} - Export format(s): sqlite, csv, or both (default: both)
  • --db-name FILENAME - Custom SQLite database filename (default: gooddata_export.db)

Child Workspace Options

  • --include-child-workspaces - Include child workspaces in export
  • --child-workspace-data-types {metrics,dashboards,visualizations,filter_contexts} - Data types to fetch from children
  • --max-workers N - Maximum parallel workers (default: 5)

Feature Flags

  • --enable-rich-text-extraction - Enable extraction from rich text widgets
  • --skip-post-export - Skip post-export SQL processing (duplicate detection)
  • --debug - Enable debug logging

Examples

# SQLite only (fastest)
gooddata-export export --format sqlite --skip-post-export

# CSV only
gooddata-export export --format csv

# Multi-workspace with specific data types
gooddata-export export --include-child-workspaces --child-workspace-data-types dashboards visualizations --max-workers 15

# Override config with command-line args
gooddata-export export --workspace-id prod_workspace --db-dir exports/prod/db --debug

Usage Examples

SQLite-Only Export (Fastest)

For maximum speed, export only to SQLite and skip post-processing:

from gooddata_export import export_metadata

result = export_metadata(
    base_url="https://your-instance.gooddata.com",
    workspace_id="your_workspace_id",
    bearer_token="your_token",
    export_formats=["sqlite"],  # SQLite only
    run_post_export=False       # Skip duplicate detection
)

This is ideal for:

  • Programmatic access to metadata
  • Custom post-processing pipelines
  • Integration with other tools

Multi-Workspace Export

Export from a parent workspace and all its children:

result = export_metadata(
    base_url="https://your-instance.gooddata.com",
    workspace_id="parent_workspace_id",
    bearer_token="your_token",
    include_child_workspaces=True,
    child_workspace_data_types=["dashboards", "visualizations"],
    max_parallel_workspaces=5  # Process 5 workspaces at once (default)
)
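Conceptually, a worker limit like max_parallel_workspaces caps how many child workspaces are fetched concurrently. The sketch below illustrates that pattern with the standard library; fetch_workspace is a hypothetical stand-in, not the library's internal function:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_workspace(workspace_id: str) -> dict:
    """Hypothetical stand-in for a per-workspace API fetch."""
    return {"id": workspace_id, "dashboards": []}

workspace_ids = ["child_a", "child_b", "child_c"]

# At most 5 fetches run at once; pool.map preserves input order
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fetch_workspace, workspace_ids))

print([r["id"] for r in results])  # ['child_a', 'child_b', 'child_c']
```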

Local Layout JSON Export (No API Calls)

Process local layout files without connecting to GoodData API. This is useful for:

  • Tagging workflows on feature branches before changes are deployed
  • Offline analysis of exported layout files
  • CI/CD pipelines without API access

import json
from gooddata_export import export_metadata

# Load layout from file (exported via gooddata-cli or API)
with open("layout.json") as f:
    layout = json.load(f)

result = export_metadata(
    base_url="https://your-instance.gooddata.com",  # Used for URL generation only
    workspace_id="my_workspace",
    layout_json=layout,  # No API calls made
    export_formats=["sqlite"],
    run_post_export=True
)

Expected layout format:

{
  "analytics": {
    "metrics": [...],
    "visualizationObjects": [...],
    "analyticalDashboards": [...],
    "filterContexts": [...],
    "dashboardPlugins": [...]
  },
  "ldm": {
    "datasets": [...],
    ...
  }
}

Note: When using layout_json, tables that would be stale (users, user_groups, user_group_members) are automatically truncated.
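A minimal layout dict with the sections shown above looks like the following. The top-level keys are taken from the "Expected layout format" example; any workspace would of course populate the lists with real objects:

```python
# Skeleton layout matching the documented structure (empty collections only)
minimal_layout = {
    "analytics": {
        "metrics": [],
        "visualizationObjects": [],
        "analyticalDashboards": [],
        "filterContexts": [],
        "dashboardPlugins": [],
    },
    "ldm": {
        "datasets": [],
    },
}

# This dict could then be passed as layout_json=minimal_layout
print(sorted(minimal_layout["analytics"]))
```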

Complete Export with All Features

result = export_metadata(
    base_url="https://your-instance.gooddata.com",
    workspace_id="your_workspace_id",
    bearer_token="your_token",
    export_formats=["sqlite", "csv"],
    enable_rich_text_extraction=True,
    run_post_export=True,
    debug=True
)

Configuration Options

Required Parameters

  • base_url: GoodData API base URL
  • workspace_id: Workspace ID to export
  • bearer_token: API authentication token (required unless layout_json is provided)

Optional Parameters

  • layout_json: Local layout data dict - when provided, skips API fetch and uses this data directly
  • export_formats: List of ["sqlite"], ["csv"], or both (default: both)
  • include_child_workspaces: Fetch data from child workspaces (default: False)
    • Note: The workspaces table is always created with child workspace list; this flag controls whether to fetch child workspace DATA (metrics, dashboards, etc.)
  • child_workspace_data_types: Data types to fetch from children (default: all)
    • Options: "metrics", "dashboards", "visualizations", "filter_contexts"
  • max_parallel_workspaces: Parallel processing limit (default: 5)
  • enable_rich_text_extraction: Extract from rich text widgets (default: False)
  • run_post_export: Run duplicate detection SQL (default: True)
  • debug: Enable debug logging (default: False)
  • db_name: Custom database path (default: output_dir/db/gooddata_export.db)

Output Structure

Note: Before each export, the CSV directory (output/metadata_csv/) is automatically cleaned to prevent stale data from mixing with new exports. Database files naturally overwrite themselves and are not cleaned, allowing you to keep workspace-specific databases from multiple exports.

SQLite Database

The SQLite database contains the following tables:

  • metrics: Metric definitions, MAQL, and metadata
  • visualizations: Visualization configurations
  • dashboards: Dashboard definitions and layouts
  • ldm_datasets: Logical data model datasets with tags
  • ldm_columns: LDM columns (attributes, facts, references) with tags
  • ldm_labels: Attribute label definitions (display forms)
  • filter_contexts: Filter context definitions
  • filter_context_fields: Individual filters within each filter context (date filters and attribute filters)
  • workspaces: Workspace information (always included; child workspaces listed when available)
  • visualizations_references: Visualization references to metrics, facts, and labels
  • dashboards_visualizations: Visualization-to-dashboard relationships
  • dashboards_metrics: Metric-to-dashboard relationships (rich text only)
  • dashboards_references: Dashboard-level references to labels, datasets, and filter contexts
  • dictionary_metadata: Export metadata (timestamp, workspace ID, etc.)
  • metrics_references: All metric references extracted from MAQL - metrics, attributes, labels, and facts (created by post-export)
  • metrics_ancestry: Full transitive metric ancestry (created by post-export)

CSV Files

When CSV export is enabled, the following files are created:

  • gooddata_metrics.csv
  • gooddata_visualizations.csv
  • gooddata_dashboards.csv
  • gooddata_ldm_datasets.csv
  • gooddata_ldm_columns.csv
  • gooddata_ldm_labels.csv
  • gooddata_filter_contexts.csv
  • gooddata_filter_context_fields.csv
  • gooddata_workspaces.csv (always included; child workspaces listed when available)
  • gooddata_visualizations_references.csv
  • gooddata_dashboards_visualizations.csv
  • gooddata_dashboards_metrics.csv (rich text only)

Post-Export Processing

When run_post_export=True (default for single workspace exports), the library runs SQL scripts to:

  1. Build metric relationships: Extracts metric-to-metric references from MAQL formulas
  2. Compute metric ancestry: Creates transitive closure of metric dependencies
  3. Detect duplicates: Identifies visualizations and metrics with identical content
  4. Track usage: Marks which metrics/visualizations are used in dashboards
  5. Create analytical views: Tag views, usage views, relationship views
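Steps 1-2 can be sketched conceptually in a few lines of Python. The library does this in SQL; the snippet below only illustrates what a transitive closure over metric references means, with hypothetical metric IDs:

```python
def metric_ancestry(edges: list[tuple[str, str]]) -> dict[str, set[str]]:
    """All transitive ancestors per metric, from (metric, referenced_metric) pairs."""
    direct = {}
    for child, parent in edges:
        direct.setdefault(child, set()).add(parent)

    def ancestors(metric, seen):
        result = set()
        for parent in direct.get(metric, ()):
            if parent not in seen:          # guard against reference cycles
                seen.add(parent)
                result.add(parent)
                result |= ancestors(parent, seen)
        return result

    return {m: ancestors(m, {m}) for m in direct}

edges = [("revenue_pct", "revenue"), ("revenue", "gross_sales")]
print({m: sorted(a) for m, a in metric_ancestry(edges).items()})
# {'revenue_pct': ['gross_sales', 'revenue'], 'revenue': ['gross_sales']}
```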

Key views created:

  • v_metrics_relationships_* - Metric dependency analysis and tag inheritance
  • v_metrics_usage, v_visualizations_usage - Usage tracking
  • v_*_tags - Unnested tag views for filtering

See USAGE_GUIDE.md for detailed post-processing documentation.

Note: Post-export processing is automatically skipped for multi-workspace exports to avoid confusion.

Performance Tuning

For Large Multi-Workspace Deployments (1000+ workspaces)

result = export_metadata(
    base_url="...",
    workspace_id="...",
    bearer_token="...",
    include_child_workspaces=True,
    child_workspace_data_types=["dashboards"],  # Fetch only dashboards
    max_parallel_workspaces=20,  # Higher parallelization
    export_formats=["sqlite"],   # Skip CSV
    run_post_export=False        # Skip post-processing
)

Expected performance: 10-20 workspaces/minute

For Smaller Deployments (<100 workspaces)

result = export_metadata(
    base_url="...",
    workspace_id="...",
    bearer_token="...",
    include_child_workspaces=True,
    child_workspace_data_types=["metrics", "dashboards", "visualizations", "filter_contexts"],
    max_parallel_workspaces=8
)

Development

Running Tests

# With uv
uv sync
uv run pytest

# With pip
pip install -e ".[dev]"
pytest

Project Structure

gooddata-export/
├── gooddata_export/           # Core library package
│   ├── __init__.py           # Main API exports
│   ├── cli/                  # Command-line interface
│   │   ├── __init__.py       # Package exports (main function)
│   │   ├── main.py           # CLI commands and argument parsing
│   │   └── prompts.py        # Interactive prompt utilities
│   ├── config.py             # Configuration handling
│   ├── constants.py          # Shared constants
│   ├── common.py             # API client utilities
│   ├── db.py                 # Database utilities
│   ├── post_export.py        # Post-processing orchestration
│   ├── export/               # Export orchestration
│   │   ├── __init__.py       # Main orchestration (export_all_metadata)
│   │   ├── fetch.py          # Data fetching functions (API calls)
│   │   ├── writers.py        # Database/CSV writer functions
│   │   └── utils.py          # Export utilities
│   ├── process/              # Data processing logic
│   │   ├── __init__.py       # Exports all process functions
│   │   ├── entities.py       # Entity processing
│   │   ├── layout.py         # Layout API fetching
│   │   ├── dashboard_traversal.py  # Dashboard widget extraction
│   │   ├── rich_text.py      # Rich text extraction
│   │   └── common.py         # Shared utilities
│   └── sql/                  # SQL scripts (auto-executed during post-export)
│       ├── procedures/       # Stored procedures and automation views
│       ├── updates/          # Data enrichment scripts (duplicates, usage analysis)
│       ├── views/            # Analytical views (dependencies, tags, usage)
│       └── *.yaml, *.md      # Execution config and documentation
├── main.py                   # Development CLI wrapper (convenience for local dev)
├── pyproject.toml            # Package configuration
├── README.md                 # This file
├── LICENSE                   # MIT License
├── USAGE_GUIDE.md            # Detailed usage examples
├── .env.gdcloud              # Configuration file (create this)
└── output/                   # Export destination (auto-created)
    ├── db/                   # SQLite databases
    └── metadata_csv/         # CSV exports

Note: The sql/ directory contains various analytical scripts that are automatically applied during post-export processing. These scripts evolve frequently as new analysis capabilities are added.

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! Please submit pull requests or open issues on GitHub.

Support

For issues and questions, please open an issue on GitHub.