Databricks Test Data Generator & Table Loader

This project lets you:

  • Drop real-world schemas and code into source_data/.
  • Use an AI skill to generate relationally consistent synthetic data into data/ and schema definitions into schema/.
  • Run a loader that uploads the generated data to a Databricks Unity Catalog Volume and creates Delta tables in a target catalog.schema.

Layout

  • source_data/: Your input material – SQL DDL, schema docs (YAML/JSON/markdown), and pipeline code used to infer tables and relationships.
  • data/: Generated sample data files (for example, one CSV per table). Place your data files here for upload to Databricks.
  • schema/: Schema definitions; by default, a Python TABLE_SCHEMAS dictionary in schema/table_schemas.py. Used as a reference for table structure (note: for CSV files, COPY INTO infers column types from the data).
  • src/: Python sources for config loading, Databricks access, upload, and table creation.
  • config/: Configuration files (see config.example.yml).
  • logs/: Execution logs are written here with timestamps (e.g., databricks_loader_20240121_143022.log).
  • .claude/data-generation/SKILL.md: The AI skill definition describing how to infer schemas and generate test data.
  • prompt-example.md: A suggested prompt for invoking the skill with this project layout.

Workflow

  1. Prepare source data

    • Put your SQL DDL, schema YAML/JSON, and/or pipeline code into files under source_data/.
  2. Call the AI skill to generate schemas & data

    • Open .claude/data-generation/SKILL.md to see what the skill does.
    • Use prompt-example.md as a starting prompt when talking to the AI.
    • Ask the AI to:
      • Produce a TABLE_SCHEMAS Python dictionary for schema/table_schemas.py.
      • Produce CSV snippets per table that you will save under data/ (for example, data/customers.csv).
    • Copy the generated TABLE_SCHEMAS block into schema/table_schemas.py.
    • Save the CSV snippets into files under data/.
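
    The exact contents are produced by the skill; purely as a rough illustration (the column-spec shape and the sample rows below are assumptions, not the project's actual contract), the two artifacts might look like:

    # schema/table_schemas.py -- illustrative sketch only
    TABLE_SCHEMAS = {
        "customers": {
            "customer_id": "BIGINT",
            "name": "STRING",
            "signup_date": "DATE",
        },
        "orders": {
            "order_id": "BIGINT",
            "customer_id": "BIGINT",  # references customers.customer_id
            "amount": "DECIMAL(10,2)",
        },
    }

    # data/customers.csv -- hypothetical sample rows
    customer_id,name,signup_date
    1,Alice Example,2024-01-05
    2,Bob Example,2024-02-17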
  3. Configure Databricks connection

    • Copy the example config and adjust values:
    cp config/config.example.yml config/config.yml
    • Edit config/config.yml and set:
      • databricks.host: Your workspace URL (https://<your-workspace>.cloud.databricks.com).
      • databricks.token: A PAT with permissions to:
        • CREATE CATALOG (if catalogs don't exist)
        • CREATE SCHEMA (if schemas don't exist)
        • CREATE VOLUME (if volumes don't exist)
        • USE CATALOG / USE SCHEMA on the target catalog & schema
        • WRITE VOLUME on the target volume
        • CAN USE on the SQL warehouse
      • databricks.warehouse_id or databricks.http_path:
        • You can provide the warehouse ID directly, or
        • Paste the HTTP path from the SQL warehouse connection details (for example, /sql/1.0/warehouses/<id>). The project will extract and clean the underlying warehouse ID automatically.
      • data.local_data_dir: Typically ./data.
      • data.volume_base_path: UC volume base path, e.g. /Volumes/main/raw/my_volume. The catalog, schema, and volume will be automatically created if they don't exist.
      • data.catalog / data.schema: Where tables will be created. The catalog and schema will be automatically created if they don't exist.
      • data.file_format / data.format_options: Format and options for reading files (e.g. CSV with header and inferSchema).
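
    Putting those keys together, a filled-in config/config.yml might look roughly like the sketch below (placeholder values; config/config.example.yml is the authoritative shape, so treat this only as an illustration of the fields listed above):

    databricks:
      host: https://<your-workspace>.cloud.databricks.com
      token: <personal-access-token>
      # Provide either of these; an http_path is reduced to its warehouse ID.
      warehouse_id: <warehouse-id>
      # http_path: /sql/1.0/warehouses/<warehouse-id>

    data:
      local_data_dir: ./data
      volume_base_path: /Volumes/main/raw/my_volume
      catalog: main
      schema: raw
      file_format: CSV
      format_options:
        header: "true"
        inferSchema: "true"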
  4. Set up Python environment and install dependencies

    # Create a virtual environment (recommended)
    python3 -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    
    # Install dependencies
    pip install -r requirements.txt
  5. Run the loader

    python -m src.main

    The script will automatically:

    • Create catalogs and schemas if they don't exist:
      • The catalog and schema specified in data.catalog and data.schema (for tables)
      • The catalog and schema parsed from data.volume_base_path (for the volume)
    • Create the volume if it doesn't exist (parsed from data.volume_base_path)

    Then, for each file under data/, the script will:

    • Upload it to the configured UC volume path, preserving the relative folder layout.
    • Derive a table name from the file name (cleaned: lowercased, non-alphanumerics → _, extension removed).
    • Create a Delta table and load data using COPY INTO.
    • Note on schemas: If schema/table_schemas.py provides explicit schemas, they are used as a reference. However, when loading CSV files, COPY INTO will infer column types from the data. For strict type enforcement, consider using Parquet format instead.

    Logging: All operations are logged to timestamped files in ./logs/. Progress is also displayed on the console.

    Table creation: Tables are created as managed Delta tables. If a table already exists, it will be dropped and recreated to ensure consistency.
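
    For orientation, the per-file flow above is roughly equivalent to the following hand-written sketch. This is not the project's code, which reads everything from config/config.yml; it assumes the databricks-sdk Files API for the upload and the databricks-sql-connector for the SQL statements, with placeholder names throughout.

    # Illustrative sketch of one file's journey; placeholders throughout.
    from databricks import sql
    from databricks.sdk import WorkspaceClient

    HOST = "https://<your-workspace>.cloud.databricks.com"
    TOKEN = "<personal-access-token>"
    HTTP_PATH = "/sql/1.0/warehouses/<warehouse-id>"
    VOLUME_BASE = "/Volumes/main/raw/my_volume"

    # Upload the local file into the UC volume, preserving its relative path.
    w = WorkspaceClient(host=HOST, token=TOKEN)
    with open("data/customers.csv", "rb") as f:
        w.files.upload(f"{VOLUME_BASE}/customers.csv", f, overwrite=True)

    # Create the target objects, then load the file with COPY INTO.
    with sql.connect(server_hostname=HOST.removeprefix("https://"),
                     http_path=HTTP_PATH, access_token=TOKEN) as conn:
        with conn.cursor() as cur:
            cur.execute("CREATE CATALOG IF NOT EXISTS main")
            cur.execute("CREATE SCHEMA IF NOT EXISTS main.raw")
            cur.execute("CREATE VOLUME IF NOT EXISTS main.raw.my_volume")
            # Drop and recreate so reruns stay consistent.
            cur.execute("DROP TABLE IF EXISTS main.raw.customers")
            cur.execute("CREATE TABLE main.raw.customers")
            cur.execute(f"""
                COPY INTO main.raw.customers
                FROM '{VOLUME_BASE}/customers.csv'
                FILEFORMAT = CSV
                FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
                COPY_OPTIONS ('mergeSchema' = 'true')
            """)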

Notes

  • Schema inference: When using CSV files, COPY INTO infers column types from the data. Explicit schemas in schema/table_schemas.py serve as documentation but don't override inferred types. For strict type control, use Parquet format instead.

  • File formats: Supported formats include CSV, JSON, and Parquet. Configure the format and options in config/config.yml.

  • Volume storage: Files are uploaded to a Unity Catalog volume. The volume path structure is preserved (subdirectories in data/ become subdirectories in the volume).

  • Table naming: Table names are derived from file names by:

    • Removing the file extension
    • Converting to lowercase
    • Replacing non-alphanumeric characters with underscores
    • Example: customer-orders.csv → customer_orders
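
    A minimal sketch of that rule (the actual helper in src/ may differ in details such as how runs of separators collapse):

    import re
    from pathlib import Path

    def derive_table_name(file_name: str) -> str:
        """Drop the extension, lowercase, and turn non-alphanumerics into '_'."""
        stem = Path(file_name).stem
        return re.sub(r"[^0-9a-z]+", "_", stem.lower())

    assert derive_table_name("customer-orders.csv") == "customer_orders"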
