This project lets you:

- Drop real-world schemas and code into `source_data/`.
- Use an AI skill to generate relationally consistent synthetic data into `data/` and schema definitions into `schema/`.
- Run a loader that uploads the generated data to a Databricks Unity Catalog Volume and creates Delta tables in a target `catalog.schema`.
- `source_data/`: Your input material – SQL DDL, schema docs (YAML/JSON/Markdown), and pipeline code used to infer tables and relationships.
- `data/`: Generated sample data files (for example, one CSV per table). Place your data files here for upload to Databricks.
- `schema/`: Schema definitions; by default, a Python `TABLE_SCHEMAS` dictionary in `schema/table_schemas.py` (sketched below). Used as reference for table structure (note: CSV files will have types inferred by `COPY INTO`).
- `src/`: Python sources for config loading, Databricks access, upload, and table creation.
- `config/`: Configuration files (see `config.example.yml`).
- `logs/`: Execution logs are written here with timestamps (e.g., `databricks_loader_20240121_143022.log`).
- `.claude/data-generation/SKILL.md`: The AI skill definition describing how to infer schemas and generate test data.
- `prompt-example.md`: A suggested prompt for invoking the skill with this project layout.
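For orientation, here is a minimal sketch of what `schema/table_schemas.py` might contain. The exact structure the loader expects is defined by the project, and the table and column names below are hypothetical:

```python
# schema/table_schemas.py -- illustrative sketch only; the exact structure
# expected by the loader is defined by the project, and these table/column
# names are hypothetical.
TABLE_SCHEMAS = {
    "customers": {
        "customer_id": "BIGINT",
        "email": "STRING",
        "signup_date": "DATE",
    },
    "orders": {
        "order_id": "BIGINT",
        "customer_id": "BIGINT",        # references customers.customer_id
        "order_total": "DECIMAL(10,2)",
        "created_at": "TIMESTAMP",
    },
}
```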
- **Prepare source data**
  - Put your SQL DDL, schema YAML/JSON, and/or pipeline code into files under `source_data/`.
- **Call the AI skill to generate schemas & data**
  - Open `.claude/data-generation/SKILL.md` to see what the skill does.
  - Use `prompt-example.md` as a starting prompt when talking to the AI.
  - Ask the AI to:
    - Produce a `TABLE_SCHEMAS` Python dictionary for `schema/table_schemas.py`.
    - Produce CSV snippets per table that you will save under `data/` (for example, `data/customers.csv`).
  - Copy the generated `TABLE_SCHEMAS` block into `schema/table_schemas.py`.
  - Save the CSV snippets into files under `data/`, as illustrated below.
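For example, a generated `data/customers.csv` could start like this (hypothetical rows that match the `TABLE_SCHEMAS` sketch above):

```csv
customer_id,email,signup_date
1,alice@example.com,2024-01-05
2,bob@example.com,2024-01-12
3,carol@example.com,2024-02-03
```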
- **Configure Databricks connection**
  - Copy the example config and adjust the values:

    ```bash
    cp config/config.example.yml config/config.yml
    ```

  - Edit `config/config.yml` and set:
    - `databricks.host`: Your workspace URL (`https://<your-workspace>.cloud.databricks.com`).
    - `databricks.token`: A PAT with permissions to:
      - CREATE CATALOG (if catalogs don't exist)
      - CREATE SCHEMA (if schemas don't exist)
      - CREATE VOLUME (if volumes don't exist)
      - USE CATALOG / USE SCHEMA on the target catalog & schema
      - WRITE VOLUME on the target volume
      - CAN USE on the SQL warehouse
    - `databricks.warehouse_id` or `databricks.http_path`:
      - You can provide the warehouse ID directly, or
      - Paste the HTTP path from the SQL warehouse connection details (for example, `/sql/1.0/warehouses/<id>`). The project will extract and clean the underlying warehouse ID automatically (see the sketch after this list).
    - `data.local_data_dir`: Typically `./data`.
    - `data.volume_base_path`: UC volume base path, e.g. `/Volumes/main/raw/my_volume`. The catalog, schema, and volume will be created automatically if they don't exist.
    - `data.catalog` / `data.schema`: Where tables will be created. The catalog and schema will be created automatically if they don't exist.
    - `data.file_format` / `data.format_options`: Format and options for reading files (e.g. CSV with `header` and `inferSchema`).
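A minimal sketch of how the warehouse ID could be pulled out of a pasted HTTP path; the helper name is hypothetical, and the project's actual logic lives under `src/`:

```python
import re

def warehouse_id_from_http_path(http_path: str) -> str:
    """Extract the warehouse ID from an HTTP path such as
    '/sql/1.0/warehouses/abc123def456'.
    (Hypothetical helper; the project's real implementation lives under src/.)"""
    match = re.search(r"/warehouses/([^/?#\s]+)", http_path.strip())
    if not match:
        raise ValueError(f"Could not find a warehouse ID in {http_path!r}")
    return match.group(1)

# Example: a full HTTP path and a bare ID both resolve to the same value.
print(warehouse_id_from_http_path("/sql/1.0/warehouses/abc123def456"))
```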
- **Set up Python environment and install dependencies**

  ```bash
  # Create a virtual environment (recommended)
  python3 -m venv .venv
  source .venv/bin/activate   # On Windows: .venv\Scripts\activate

  # Install dependencies
  pip install -r requirements.txt
  ```
- **Run the loader**

  ```bash
  python -m src.main
  ```
The script will automatically:

- Create catalogs and schemas if they don't exist:
  - The catalog and schema specified in `data.catalog` and `data.schema` (for tables)
  - The catalog and schema parsed from `data.volume_base_path` (for the volume)
- Create the volume if it doesn't exist (parsed from `data.volume_base_path`); see the sketch below.
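A minimal sketch of that bootstrap step, assuming the `databricks-sdk` package; the actual implementation lives under `src/` and may differ:

```python
# Illustrative sketch only, assuming the databricks-sdk package;
# the project's real logic lives under src/ and may differ.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import VolumeType

w = WorkspaceClient(host="https://<your-workspace>.cloud.databricks.com", token="<pat>")

def ensure_catalog_schema_volume(catalog: str, schema: str, volume: str) -> None:
    """Create the catalog, schema, and volume if they are missing."""
    for create in (
        lambda: w.catalogs.create(name=catalog),
        lambda: w.schemas.create(name=schema, catalog_name=catalog),
        lambda: w.volumes.create(
            catalog_name=catalog,
            schema_name=schema,
            name=volume,
            volume_type=VolumeType.MANAGED,
        ),
    ):
        try:
            create()
        except Exception:
            # Assume the object already exists; production code would
            # inspect the error instead of swallowing everything.
            pass

# /Volumes/main/raw/my_volume -> catalog=main, schema=raw, volume=my_volume
ensure_catalog_schema_volume("main", "raw", "my_volume")
```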
Then, for each file under `data/`, the script will:

- Upload it to the configured UC volume path, preserving the relative folder layout.
- Derive a table name from the file name (cleaned: lowercased, non-alphanumerics → `_`, extension removed).
- Create a Delta table and load data using `COPY INTO`, roughly as sketched below.

Note on schemas: If `schema/table_schemas.py` provides explicit schemas, they are used as reference. However, when loading CSV files, `COPY INTO` will infer column types from the data. For strict type enforcement, consider using Parquet format instead.

Logging: All operations are logged to timestamped files in `./logs/`. Progress is also displayed on the console.

Table creation: Tables are created as managed Delta tables. If a table already exists, it will be dropped and recreated to ensure consistency.
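A minimal sketch of the per-file upload and load, again assuming the `databricks-sdk` package and the same client setup as the previous sketch; the SQL text is illustrative, not the project's exact implementation:

```python
# Illustrative per-file flow, assuming the databricks-sdk package;
# not the project's exact implementation (which lives under src/).
from pathlib import Path
from databricks.sdk import WorkspaceClient

w = WorkspaceClient(host="https://<your-workspace>.cloud.databricks.com", token="<pat>")

def upload_and_load(local_file: Path, table: str, volume_base_path: str,
                    catalog: str, schema: str, warehouse_id: str) -> None:
    # 1. Upload to the UC volume, preserving the layout relative to ./data.
    relative = local_file.relative_to("data").as_posix()
    volume_path = f"{volume_base_path}/{relative}"
    with local_file.open("rb") as fh:
        w.files.upload(volume_path, fh, overwrite=True)

    # 2. Drop and recreate the Delta table, then load the file with COPY INTO.
    full_name = f"{catalog}.{schema}.{table}"
    statements = [
        f"DROP TABLE IF EXISTS {full_name}",
        f"CREATE TABLE {full_name}",  # columnless Delta table; COPY INTO supplies the schema
        f"""COPY INTO {full_name}
            FROM '{volume_path}'
            FILEFORMAT = CSV
            FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
            COPY_OPTIONS ('mergeSchema' = 'true')""",
    ]
    for statement in statements:
        w.statement_execution.execute_statement(
            statement=statement, warehouse_id=warehouse_id
        )

# Example:
# upload_and_load(Path("data/customers.csv"), "customers",
#                 "/Volumes/main/raw/my_volume", "main", "raw", "<warehouse-id>")
```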
- **Schema inference**: When using CSV files, `COPY INTO` infers column types from the data. Explicit schemas in `schema/table_schemas.py` serve as documentation but don't override inferred types. For strict type control, use Parquet format instead.
- **File formats**: Supported formats include CSV, JSON, and Parquet. Configure the format and options in `config/config.yml`.
- **Volume storage**: Files are uploaded to a Unity Catalog volume. The volume path structure is preserved (subdirectories in `data/` become subdirectories in the volume).
- **Table naming**: Table names are derived from file names by:
  - Removing the file extension
  - Converting to lowercase
  - Replacing non-alphanumeric characters with underscores
  - Example: `customer-orders.csv` → `customer_orders` (see the sketch below)
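A minimal sketch of that cleaning rule; the function name is hypothetical, and the project's actual implementation lives under `src/`:

```python
import re
from pathlib import Path

def clean_table_name(file_name: str) -> str:
    """Derive a table name from a file name: drop the extension, lowercase,
    and replace runs of non-alphanumeric characters with underscores.
    (Hypothetical helper; the project's real logic lives under src/.)"""
    stem = Path(file_name).stem.lower()
    return re.sub(r"[^0-9a-z]+", "_", stem).strip("_")

assert clean_table_name("customer-orders.csv") == "customer_orders"
assert clean_table_name("2024 Q1 Sales.CSV") == "2024_q1_sales"
```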