Handling Data Skew in dbt Models

A Synthetic, Reproducible Case Study

Overview

This repository demonstrates how severe data skew in analytical dbt models can lead to performance degradation and unstable executions, and how isolating skewed keys into separate execution paths can significantly improve reliability.

This case study is based on a real production issue, recreated using fully synthetic data so the optimisation pattern can be shared publicly without exposing proprietary or sensitive information.

Background and Problem Statement

Single aggregation path where a dominant key overwhelms the aggregation stage, leading to unbalanced execution.

In analytical workloads, it is common for a small subset of keys to dominate a dataset.

In this case:

A revenue aggregation model grouped movement-level data by location and date
A small number of locations accounted for a disproportionately large share of records
This caused data skew during aggregation

Observed issues included:

Long execution times
Unstable dbt runs
Pressure to scale compute instead of addressing the root cause

The original model was logically correct but did not account for skewed keys.

Baseline Approach (Pre-Optimisation)

The baseline implementation aggregates all locations together in a single execution path.

Baseline model (`revenue_base.sql`)

select
    location_id,
    activity_date,
    count(*) as movement_count,
    sum(revenue_amount) as total_revenue
from stg_synthetic_movement
group by
    location_id,
    activity_date

Why this approach breaks down

High-frequency keys dominate execution
Workload distribution becomes unbalanced
Aggregation stages become bottlenecks

OPTIMISATION STRATEGY

Optimisation Strategy

Separating skewed and non-skewed keys into dedicated execution paths to keep execution balanced and predictable.

Rather than scaling infrastructure, the optimisation focuses on changing the execution pattern.

Core principles

Identify high-frequency (skewed) keys
Isolate skewed keys into a dedicated execution path
Process non-skewed data separately
Recombine results using UNION ALL

Implementation Details

Synthetic Staging Layer

Seed data is staged using standard dbt practices.

Staging model

models/staging/stg_synthetic_movement.sql

Skewed Execution Path

Model

models/marts/revenue/revenue_skewed.sql

This model processes only the high-frequency location:

where location_id = 'SK1'

Non-Skewed Execution Path

Model

models/marts/revenue/revenue_non_skewed.sql

This model processes all other locations efficiently:

where location_id != 'SK1'

Final Optimised Model

Model

models/marts/revenue/revenue_optimized.sql

Recombines both execution paths:

select * from revenue_skewed
union all
select * from revenue_non_skewed

📈 Engineering Impact & Production Results

This framework was developed to resolve a critical failure in a tier-1 customer data model. By implementing execution-aware model design, I achieved the following verified results:

Reliability & Recovery: Resolved persistent warehouse execution failures (OOM/Timeouts). The model transitioned from daily instability and failure to 100% successful execution.
Performance Engineering: Reduced end-to-end processing time by 66%, bringing the execution window down from 60+ minutes to just 20 minutes.
Operational Cost Efficiency: Significantly reduced cluster compute spillage to remote storage by isolating skewed keys. This optimization directly lowered the compute costs, optimizing the department’s OpEx.
SLA Restoration: Restored the Data SLA for downstream executive stakeholders, ensuring mission-critical customer metrics were available for business decision-making 40 minutes earlier than previously possible.
Scalability: Execution paths now remain balanced, avoiding single-stage bottlenecks even as upstream data volumes increase, ensuring the pipeline is "future-proofed" for company growth.

Performance Evidence (Synthetic)

To provide reproducible evidence without using any proprietary data, I ran the models locally using dbt-duckdb against synthetic seed data designed to include a heavily skewed key.

Baseline run (build stg_synthetic_movement + revenue_base): 0.59s
Optimised run (build stg_synthetic_movement + revenue_skewed + revenue_non_skewed + revenue_optimized): 0.53s

The optimised approach builds additional intermediate models (separate skewed and non-skewed aggregations) before recombining results. Despite the extra build steps, the overall run completed slightly faster on this synthetic dataset and demonstrates an execution-aware strategy that isolates skewed keys into a dedicated path.

Run Evidence

Baseline run

Optimised run

Data Privacy and Anonymisation

All datasets are fully synthetic
No real customer, financial, or operational data is included
Column names, values, and distributions are anonymised
The optimisation logic remains representative of real-world systems

Project Structure

models/
  staging/
    stg_synthetic_movement.sql
  marts/
    revenue/
      revenue_base.sql
      revenue_skewed.sql
      revenue_non_skewed.sql
      revenue_optimized.sql
seeds/
  movement_non_skewed.csv
  movement_skewed.csv

KEY TAKEAWAYS

Data skew is an execution problem, not just a scaling issue
Isolating skewed keys is a simple and effective optimisation pattern
dbt models benefit from execution-aware design

Name		Name	Last commit message	Last commit date
Latest commit History 56 Commits
analyses		analyses
macros		macros
models		models
seeds		seeds
snapshots		snapshots
tests		tests
tools		tools
.gitignore		.gitignore
README.md		README.md
dbt_project.yml		dbt_project.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Handling Data Skew in dbt Models

A Synthetic, Reproducible Case Study

Overview

Background and Problem Statement

Baseline Approach (Pre-Optimisation)

Baseline model (`revenue_base.sql`)

Why this approach breaks down

OPTIMISATION STRATEGY

Optimisation Strategy

Core principles

Implementation Details

Synthetic Staging Layer

Skewed Execution Path

Non-Skewed Execution Path

Final Optimised Model

📈 Engineering Impact & Production Results

Performance Evidence (Synthetic)

Run Evidence

Data Privacy and Anonymisation

Project Structure

KEY TAKEAWAYS

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Handling Data Skew in dbt Models

A Synthetic, Reproducible Case Study

Overview

Background and Problem Statement

Baseline Approach (Pre-Optimisation)

Baseline model (revenue_base.sql)

Why this approach breaks down

OPTIMISATION STRATEGY

Optimisation Strategy

Core principles

Implementation Details

Synthetic Staging Layer

Skewed Execution Path

Non-Skewed Execution Path

Final Optimised Model

📈 Engineering Impact & Production Results

Performance Evidence (Synthetic)

Run Evidence

Data Privacy and Anonymisation

Project Structure

KEY TAKEAWAYS

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Baseline model (`revenue_base.sql`)

Packages