Independent educational resource; not endorsed by Databricks, Inc. "Databricks" and "Delta Lake" are trademarks of their respective owners.
Follow me on LinkedIn for more Databricks projects and tips. Extra material: dataengineer.wiki
Modern lakehouse performance hinges on layout and file hygiene. This project is a guided lab that lets you iteratively apply and observe core Delta Lake optimization levers:
- Physical partitioning
- Z-Ordering
- Manual compaction (OPTIMIZE)
- Auto Optimize (optimizeWrites + autoCompact)
- Liquid Clustering
- VACUUM lifecycle hygiene
You will generate a synthetic 50M‑row sales dataset, capture baseline query metrics, then layer techniques - measuring their impact (files scanned, data read, scan time) via the Spark UI and table metadata.
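The notebook contains the real generator, but a minimal sketch of what the baseline data can look like is shown below (column names, value distributions, and the fully qualified table name are illustrative assumptions, not the notebook's actual code):

```python
# Minimal sketch of a synthetic sales dataset; runs on a Databricks cluster
# where `spark` is predefined. Column and table names are assumptions.
from pyspark.sql import functions as F

NUM_ROWS = 50_000_000  # the lab's 50M-row target

sales = (
    spark.range(NUM_ROWS)
    # Low-cardinality categorical column, useful for partitioning/clustering demos
    .withColumn("country", F.expr("element_at(array('US','DE','IN','BR','JP'), cast(id % 5 as int) + 1)"))
    # A date column spread across one year
    .withColumn("order_date", F.expr("date_add(date'2023-01-01', cast(id % 365 as int))"))
    # A numeric measure
    .withColumn("amount", F.round(F.rand(seed=42) * 500, 2))
)

# Write the unoptimized baseline table (hypothetical fully qualified name).
(
    sales.write.format("delta")
    .mode("overwrite")
    .saveAsTable("delta_optimization_project.lab.sales_raw")
)
```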
All instructions and code live in the notebook: project.ipynb. Open it first; proceed cell by cell.
By completing the lab you will be able to:
- Choose between partitioning, Z-Ordering, and Liquid Clustering based on data shape & query patterns
- Diagnose small-file and data-skipping issues using DESCRIBE DETAIL + the Spark UI (see the sketch after this list)
- Apply manual compaction and contrast it with auto compaction
- Understand retention safety around VACUUM
- Build a lightweight empirical metrics log to justify optimization choices
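As a hedged sketch of the diagnosis and metrics-log ideas above, assuming placeholder fully qualified names (the notebook's registry holds the real ones):

```python
# Pull file-level statistics from DESCRIBE DETAIL so different layouts can be
# compared side by side. The table names below are placeholders.
def table_stats(fq_name: str) -> dict:
    detail = spark.sql(f"DESCRIBE DETAIL {fq_name}").collect()[0]
    return {
        "table": fq_name,
        "numFiles": detail["numFiles"],
        "sizeInBytes": detail["sizeInBytes"],
        "avgFileMB": round(detail["sizeInBytes"] / max(detail["numFiles"], 1) / 1024**2, 1),
    }

# Example: compare the fragmented table with the baseline table.
for key in ("sales_to_compact", "sales_raw"):
    print(table_stats(f"delta_optimization_project.lab.{key}"))
```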
A sequence of Delta tables representing successive optimization strategies:
Logical Role | Table Key (see registry) | Technique Illustrated |
---|---|---|
Baseline raw | sales_raw | Many small files, unoptimized |
Country-partitioned | sales_partitioned | Low-cardinality partitioning |
Z-Ordered copy | sales_raw_zorder | Multi-column data skipping |
Fragmented (pre-compaction) | sales_to_compact | Small file proliferation |
Auto Optimize enabled | sales_auto_compact | Automatic write sizing + async compaction |
Liquid Clustered | sales_liquid_clustered | Adaptive clustering |
A single helper registry centralizes fully qualified names for reproducibility.
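A minimal sketch of such a registry, assuming the catalog name from the cleanup snippet at the end of this README and an illustrative schema name:

```python
# Central registry of fully qualified table names. The schema name is an
# illustrative assumption; project.ipynb defines the real values.
CATALOG = "delta_optimization_project"
SCHEMA = "lab"

TABLES = {
    key: f"{CATALOG}.{SCHEMA}.{key}"
    for key in (
        "sales_raw",
        "sales_partitioned",
        "sales_raw_zorder",
        "sales_to_compact",
        "sales_auto_compact",
        "sales_liquid_clustered",
    )
}
```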
- Databricks (Community / Free or higher tier). Some VACUUM retention behaviors differ on Free Edition (cannot disable retention safety).
- Spark SQL + PySpark (no external data sources required)
- Delta Lake tables stored in a user-created catalog & schema (created automatically if permitted)
- Basic Spark & Delta Lake familiarity (DataFrames, SQL, catalog objects)
- Comfort reading Spark UI (scan details, tasks, input size)
- Python (for minor helper code)
1. Create a Databricks Account
   - Sign up for a Databricks Free Edition account if you don't already have one.
   - Familiarize yourself with the workspace, clusters, and notebook interface.
2. Import this repository to Databricks
   - In Databricks, open the Workspace sidebar, go to the "Repos" section, and click "Add Repo".
   - Alternatively, go to your personal folder, click "Create", and select "Git folder".
   - Paste the GitHub URL for this repository.
   - Authenticate with GitHub if prompted, and select the main branch.
   - The repo will appear as a folder in your workspace, allowing you to edit, run notebooks, and manage files directly from Databricks.
   - For more details, see the official Databricks documentation: Repos in Databricks.
3. Open project.ipynb.
4. Execute cells sequentially (pick the serverless cluster). The notebook is idempotent: data generation is skipped if the base table already exists.
5. After each optimization action, open the Spark UI (SQL / DataFrame tab) and record metrics.
Try adding:
- Date-based partition layer vs country: compare scan metrics.
- Programmatic metrics capture notebook that stores results into a Delta table for plotting trends (a starting point is sketched after this list).
- Incremental data growth simulation + periodic Z-Order refresh policy.
- Photon vs non-Photon runtime comparison (CPU cost vs performance).
- Streaming ingestion (Auto Loader) to stress clustering adaptiveness.
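A possible starting point for the programmatic metrics capture idea, assuming a hypothetical metrics table name and placeholder fully qualified names:

```python
# Append one row of layout statistics per table per run, so trends can be
# plotted over time. The metrics table name and columns are assumptions.
from pyspark.sql import functions as F

def log_metrics(fq_name: str,
                metrics_table: str = "delta_optimization_project.lab.optimization_metrics"):
    detail = spark.sql(f"DESCRIBE DETAIL {fq_name}").select("numFiles", "sizeInBytes")
    (
        detail
        .withColumn("table_name", F.lit(fq_name))
        .withColumn("captured_at", F.current_timestamp())
        .write.format("delta")
        .mode("append")               # creates the metrics table on first run
        .saveAsTable(metrics_table)
    )

log_metrics("delta_optimization_project.lab.sales_raw")
```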
I want this to be maximally useful for learners. After running the notebook, please consider opening a Discussion or Issue with:
- Were the notebook instructions clear at each step? Where did you pause or re-read?
- Which optimization concept remained fuzzy, and what supporting visual or explanation would help?
- Would a short video walkthrough add value, or do you prefer self-discovery?
Happy to credit you as a contributor if you provide actionable feedback.
Symptom | Possible Cause | Action |
---|---|---|
Baseline table regenerates unexpectedly | Catalog/schema context lost | Re-run config cell to USE CATALOG + USE SCHEMA |
Z-Order command not found | Wrong runtime / missing Delta extras | Ensure cluster has Delta support (DBR 11+ recommended) |
VACUUM removed no files | Retention window not elapsed | Check history timestamps; wait or (demo only) lower retention (not on Free) |
Minimal scan improvement after Z-Order | Predicate low selectivity | Test narrower filters; ensure chosen columns are in WHERE |
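For quick reference, the Z-Order and VACUUM commands those rows refer to look roughly like this (table and column names are illustrative; substitute the registry's fully qualified names and the columns you actually filter on):

```python
# Illustrative only: Z-Order on the columns that appear in your WHERE clauses.
spark.sql("""
    OPTIMIZE delta_optimization_project.lab.sales_raw_zorder
    ZORDER BY (country, order_date)
""")

# DRY RUN previews which files VACUUM would delete without removing anything;
# 168 hours matches the default 7-day retention window.
spark.sql("""
    VACUUM delta_optimization_project.lab.sales_raw RETAIN 168 HOURS DRY RUN
""")
```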
Run the final cleanup cell (commented by default) to drop the entire catalog if you want to fully reset.
```sql
-- Optional
-- DROP CATALOG IF EXISTS delta_optimization_project CASCADE;
```
Open project.ipynb now and start with the configuration cell. Record metrics; experimentation beats theory. Enjoy!