Databricks Test Data Generator & Table Loader

This project lets you:

  • Drop real-world schemas and code into source_data/.
  • Use an AI skill to generate relationally consistent synthetic data into data/ and schema definitions into schema/.
  • Run a loader that uploads the generated data to a Databricks Unity Catalog Volume and creates Delta tables in a target catalog.schema.

Layout

  • source_data/: Your input material – SQL DDL, schema docs (YAML/JSON/markdown), and pipeline code used to infer tables and relationships.
  • data/: Generated sample data files (for example, one CSV per table). Place your data files here for upload to Databricks.
  • schema/: Schema definitions; by default, a Python TABLE_SCHEMAS dictionary in schema/table_schemas.py. Used as a reference for table structure (note: for CSV files, COPY INTO infers column types from the data).
  • src/: Python sources for config loading, Databricks access, upload, and table creation.
  • config/: Configuration files (see config.example.yml).
  • logs/: Execution logs are written here with timestamps (e.g., databricks_loader_20240121_143022.log).
  • .claude/data-generation/SKILL.md: The AI skill definition describing how to infer schemas and generate test data.
  • prompt-example.md: A suggested prompt for invoking the skill with this project layout.

Workflow

  1. Prepare source data

    • Put your SQL DDL, schema YAML/JSON, and/or pipeline code into files under source_data/.
  2. Call the AI skill to generate schemas & data

    • Open .claude/data-generation/SKILL.md to see what the skill does.
    • Use prompt-example.md as a starting prompt when talking to the AI.
    • Ask the AI to:
      • Produce a TABLE_SCHEMAS Python dictionary for schema/table_schemas.py.
      • Produce CSV snippets per table that you will save under data/ (for example, data/customers.csv).
    • Copy the generated TABLE_SCHEMAS block into schema/table_schemas.py.
    • Save the CSV snippets into files under data/.
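
    The exact contents are produced by the skill; purely as a rough illustration (the column-spec shape and the sample rows below are assumptions, not the project's actual contract), the two artifacts might look like:

    # schema/table_schemas.py -- illustrative sketch only
    TABLE_SCHEMAS = {
        "customers": {
            "customer_id": "BIGINT",
            "name": "STRING",
            "signup_date": "DATE",
        },
        "orders": {
            "order_id": "BIGINT",
            "customer_id": "BIGINT",  # references customers.customer_id
            "amount": "DECIMAL(10,2)",
        },
    }

    # data/customers.csv -- hypothetical sample rows
    customer_id,name,signup_date
    1,Alice Example,2024-01-05
    2,Bob Example,2024-02-17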
  3. Configure Databricks connection

    • Copy the example config and adjust values:
    cp config/config.example.yml config/config.yml
    • Edit config/config.yml and set:
      • databricks.host: Your workspace URL (https://<your-workspace>.cloud.databricks.com).
      • databricks.token: A PAT with permissions to:
        • CREATE CATALOG (if catalogs don't exist)
        • CREATE SCHEMA (if schemas don't exist)
        • CREATE VOLUME (if volumes don't exist)
        • USE CATALOG / USE SCHEMA on the target catalog & schema
        • WRITE VOLUME on the target volume
        • CAN USE on the SQL warehouse
      • databricks.warehouse_id or databricks.http_path:
        • You can provide the warehouse ID directly, or
        • Paste the HTTP path from the SQL warehouse connection details (for example, /sql/1.0/warehouses/<id>). The project will extract and clean the underlying warehouse ID automatically.
      • data.local_data_dir: Typically ./data.
      • data.volume_base_path: UC volume base path, e.g. /Volumes/main/raw/my_volume. The catalog, schema, and volume will be automatically created if they don't exist.
      • data.catalog / data.schema: Where tables will be created. The catalog and schema will be automatically created if they don't exist.
      • data.file_format / data.format_options: Format and options for reading files (e.g. CSV with header and inferSchema).
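
    Putting those keys together, a filled-in config/config.yml might look roughly like the sketch below (placeholder values; config/config.example.yml is the authoritative shape, so treat this only as an illustration of the fields listed above):

    databricks:
      host: https://<your-workspace>.cloud.databricks.com
      token: <personal-access-token>
      # Provide either of these; an http_path is reduced to its warehouse ID.
      warehouse_id: <warehouse-id>
      # http_path: /sql/1.0/warehouses/<warehouse-id>

    data:
      local_data_dir: ./data
      volume_base_path: /Volumes/main/raw/my_volume
      catalog: main
      schema: raw
      file_format: CSV
      format_options:
        header: "true"
        inferSchema: "true"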
  4. Set up Python environment and install dependencies

    # Create a virtual environment (recommended)
    python3 -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    
    # Install dependencies
    pip install -r requirements.txt
  5. Run the loader

    python -m src.main

    The script will automatically:

    • Create catalogs and schemas if they don't exist:
      • The catalog and schema specified in data.catalog and data.schema (for tables)
      • The catalog and schema parsed from data.volume_base_path (for the volume)
    • Create the volume if it doesn't exist (parsed from data.volume_base_path)

    Then, for each file under data/, the script will:

    • Upload it to the configured UC volume path, preserving the relative folder layout.
    • Derive a table name from the file name (cleaned: lowercased, non-alphanumerics → _, extension removed).
    • Create a Delta table and load data using COPY INTO.
    • Note on schemas: If schema/table_schemas.py provides explicit schemas, they are used as a reference. However, when loading CSV files, COPY INTO will infer column types from the data. For strict type enforcement, consider using Parquet format instead.

    Logging: All operations are logged to timestamped files in ./logs/. Progress is also displayed on the console.

    Table creation: Tables are created as managed Delta tables. If a table already exists, it will be dropped and recreated to ensure consistency.
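
    For orientation, the per-file flow above is roughly equivalent to the following hand-written sketch. This is not the project's code, which reads everything from config/config.yml; it assumes the databricks-sdk Files API for the upload and the databricks-sql-connector for the SQL statements, with placeholder names throughout.

    # Illustrative sketch of one file's journey; placeholders throughout.
    from databricks import sql
    from databricks.sdk import WorkspaceClient

    HOST = "https://<your-workspace>.cloud.databricks.com"
    TOKEN = "<personal-access-token>"
    HTTP_PATH = "/sql/1.0/warehouses/<warehouse-id>"
    VOLUME_BASE = "/Volumes/main/raw/my_volume"

    # Upload the local file into the UC volume, preserving its relative path.
    w = WorkspaceClient(host=HOST, token=TOKEN)
    with open("data/customers.csv", "rb") as f:
        w.files.upload(f"{VOLUME_BASE}/customers.csv", f, overwrite=True)

    # Create the target objects, then load the file with COPY INTO.
    with sql.connect(server_hostname=HOST.removeprefix("https://"),
                     http_path=HTTP_PATH, access_token=TOKEN) as conn:
        with conn.cursor() as cur:
            cur.execute("CREATE CATALOG IF NOT EXISTS main")
            cur.execute("CREATE SCHEMA IF NOT EXISTS main.raw")
            cur.execute("CREATE VOLUME IF NOT EXISTS main.raw.my_volume")
            # Drop and recreate so reruns stay consistent.
            cur.execute("DROP TABLE IF EXISTS main.raw.customers")
            cur.execute("CREATE TABLE main.raw.customers")
            cur.execute(f"""
                COPY INTO main.raw.customers
                FROM '{VOLUME_BASE}/customers.csv'
                FILEFORMAT = CSV
                FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
                COPY_OPTIONS ('mergeSchema' = 'true')
            """)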

Notes

  • Schema inference: When using CSV files, COPY INTO infers column types from the data. Explicit schemas in schema/table_schemas.py serve as documentation but don't override inferred types. For strict type control, use Parquet format instead.

  • File formats: Supported formats include CSV, JSON, and Parquet. Configure the format and options in config/config.yml.

  • Volume storage: Files are uploaded to a Unity Catalog volume. The volume path structure is preserved (subdirectories in data/ become subdirectories in the volume).

  • Table naming: Table names are derived from file names by:

    • Removing the file extension
    • Converting to lowercase
    • Replacing non-alphanumeric characters with underscores
    • Example: customer-orders.csv → customer_orders
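
    A minimal sketch of that rule (the actual helper in src/ may differ in details such as how runs of separators collapse):

    import re
    from pathlib import Path

    def derive_table_name(file_name: str) -> str:
        """Drop the extension, lowercase, and turn non-alphanumerics into '_'."""
        stem = Path(file_name).stem
        return re.sub(r"[^0-9a-z]+", "_", stem.lower())

    assert derive_table_name("customer-orders.csv") == "customer_orders"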
