A comprehensive dbt project for building a modern ecommerce data warehouse with multi-source integration, advanced analytics, and data quality monitoring.
- Overview
- Data Sources
- Architecture
- Quick Start
- Installation
- Data Loading
- Running the Project
- Testing
- Data Quality Monitoring
- Documentation
- Looker LookML Analytics Layer
- Project Structure
- Models Overview
- Business Use Cases
- Contributing
The Ra Ecommerce Data Warehouse v2 is a comprehensive analytics solution that combines a dbt data transformation project with a complete Looker LookML implementation. It integrates data from multiple ecommerce and marketing platforms to provide enterprise-grade business intelligence and analytics capabilities. The warehouse follows best practices for data modeling, includes extensive data quality monitoring, and provides pre-built analytics for common ecommerce use cases.
- Multi-Source Integration: Shopify, Google Analytics 4, Google/Facebook/Pinterest Ads, Klaviyo, Instagram Business
- Layered Architecture: Staging → Integration → Warehouse with clear separation of concerns
- Looker LookML Layer: 16 views, 12 explores, and 3 production-ready dashboards
- Data Quality Monitoring: Comprehensive pipeline health tracking and data quality metrics
- Attribution Analysis: Multi-touch customer journey attribution across all touchpoints
- Performance Analytics: Campaign performance, customer segmentation, product analytics
- SCD Type 2: Slowly Changing Dimensions for customers and products
- Comprehensive Testing: 100+ data tests for data integrity and business rules
- Documentation: Full column-level documentation and business context
| Source | Purpose | Tables | Status |
|---|---|---|---|
| Shopify | Ecommerce transactions | Orders, Customers, Products, Order Lines | ✅ Active |
| Google Analytics 4 | Website behavior | Events, Sessions | ✅ Active |
| Google Ads | Paid search advertising | Campaigns, Ad Groups, Keywords, Ads | ✅ Active |
| Facebook Ads | Social media advertising | Campaigns, Ad Sets, Ads | ✅ Active |
| Pinterest Ads | Visual advertising | Campaigns, Ad Groups | ✅ Active |
| Klaviyo | Email marketing | Campaigns, Events, People | ✅ Active |
| Instagram Business | Social media content | Posts, Media Insights | ✅ Active |
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Raw Sources │ │ Staging (stg) │ │ Integration(int)│
│ │───▶│ │───▶│ │
│ • Shopify │ │ • Cleaned │ │ • Business │
│ • GA4 │ │ • Standardized │ │ Logic │
│ • Ad Platforms │ │ • Validated │ │ • Calculations │
│ • Email/Social │ │ │ │ • Joins │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│
▼
┌─────────────────┐
│ Warehouse (wh) │
│ │
│ • Dimensions │
│ • Facts │
│ • Aggregations │
│ • BI Ready │
└─────────────────┘
- Staging: Raw data cleaning, standardization, and basic validation
- Integration: Business logic application, cross-source joins, and metric calculations
- Warehouse: Dimensional modeling with facts and dimensions optimized for analytics
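To make the staging layer's role concrete, a staging model is typically a thin rename-and-cast layer over a single raw table. The sketch below is illustrative only; the file, source, and column names are assumptions rather than the project's actual code:

-- Hypothetical staging model sketch (file and column names are assumptions)
-- models/staging/stg_shopify_ecommerce/stg_shopify_orders.sql
with source as (
    select * from {{ source('shopify_raw', 'order') }}
),
renamed as (
    select
        id                            as order_id,
        customer_id,
        cast(total_price as numeric)  as order_total,
        created_at                    as order_created_at
    from source
)
select * from renamed

Integration models then join and enrich these cleaned tables, and the warehouse layer reshapes them into facts and dimensions.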
- dbt Core 1.10+
- BigQuery project with billing enabled
- Python 3.8+
- Access to source data (or use provided seed data)
git clone <repository-url>
cd ra_warehouse_ecommerce_v2
# Create and activate virtual environment
python -m venv dbt-ecomm-env
source dbt-ecomm-env/bin/activate # On Windows: dbt-ecomm-env\Scripts\activate
# Install dependencies
pip install dbt-core dbt-bigquery
Create ~/.dbt/profiles.yml:
ra_dw_ecommerce:
target: dev
outputs:
dev:
type: bigquery
method: service-account
project: your-bigquery-project-id
dataset: analytics_ecommerce
keyfile: path/to/your/service-account.json
threads: 4
timeout_seconds: 300
location: us-central1 # or your preferred location
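With the profile in place, you can confirm that dbt can load the project and reach BigQuery before running anything:
# Verify profile configuration and warehouse connectivity
dbt debug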
# Load all seed data (demo datasets)
dbt seed
# Or load specific sources
dbt seed --select "shopify_demo"
dbt seed --select "ga4_demo"
dbt seed --select "ad_platforms"
# Install dbt packages
dbt deps
# Run all models
dbt run
# Run tests
dbt test
# Generate documentation
dbt docs generate
dbt docs serve
The project includes comprehensive seed data for development and testing:
# Load all seed data
dbt seed
# Check what seeds are available
dbt ls --resource-type seed
# Load specific source seeds
dbt seed --select "tag:shopify"
dbt seed --select "tag:ga4"
dbt seed --select "tag:marketing"
- Shopify: 2,500+ orders, 500+ customers, 50+ products
- GA4: 10K+ events across 1K+ sessions
- Ad Platforms: Campaign performance data for Google/Facebook/Pinterest Ads
- Klaviyo: Email campaigns and engagement events
- Instagram: Social media posts and engagement metrics
For production use, replace seed sources with your actual data sources:
- Update sources.yml files in each staging folder
- Replace {{ source() }} references with your actual table names
- Configure your data pipeline to land data in the expected schemas
# Example: models/staging/stg_shopify_ecommerce/sources.yml
sources:
- name: shopify_raw
description: Shopify production data
schema: fivetran_shopify # Your actual schema
tables:
- name: order
- name: customer
- name: product
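If you also want dbt's source freshness checks (surfaced as Source Freshness in the docs site), the same source definition can carry a freshness block. This is a hedged sketch; the loaded_at_field name is an assumption about how your raw tables are timestamped:
# Example only: source freshness configuration (loaded_at_field is an assumption)
sources:
  - name: shopify_raw
    schema: fivetran_shopify
    loaded_at_field: _fivetran_synced
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: order
Freshness results are then reported by dbt source freshness.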
# Run all models
dbt run
# Run specific layer
dbt run --select "staging"
dbt run --select "integration"
dbt run --select "warehouse"
# Run specific source models
dbt run --select "+stg_shopify+"
dbt run --select "+int_customers+"
# Run with fail-fast
dbt run --fail-fast
# Run in parallel
dbt run --threads 8
# Run only modified models and downstream dependencies
dbt run --select "state:modified+"
# Run models that failed in last run
dbt retry
# 1. Load/refresh seed data
dbt seed
# 2. Run staging models
dbt run --select "staging"
# 3. Run integration models
dbt run --select "integration"
# 4. Run warehouse models
dbt run --select "warehouse"
# 5. Run tests
dbt test
# 6. Check data quality
dbt run --select "wh_fact_data_quality"
The project includes comprehensive testing at all layers:
- Schema Tests: Uniqueness, not-null, relationships, accepted values
- Data Tests: Custom SQL tests for business logic validation
- Quality Tests: Data completeness, consistency, and integrity checks
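For reference, generic schema tests are declared in YAML alongside the model. The excerpt below is a hedged sketch; the column names are assumptions and may not match the project's actual schema.yml files:
# Illustrative schema.yml excerpt (column names are assumptions)
version: 2
models:
  - name: fact_orders
    columns:
      - name: order_line_key
        tests:
          - unique
          - not_null
      - name: customer_key
        tests:
          - relationships:
              to: ref('wh_dim_customers')
              field: customer_key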
# Run all tests
dbt test
# Run tests for specific layers
dbt test --select "staging"
dbt test --select "integration"
dbt test --select "warehouse"
# Run specific test types
dbt test --select "test_type:generic"
dbt test --select "test_type:singular"
# Run tests for specific model
dbt test --select "fact_orders"
# Store test results
dbt test --store-failures
- Referential Integrity: Ensures all foreign keys have valid references
- SCD Type 2 Integrity: Validates slowly changing dimension logic
- Attribution Weights: Ensures customer journey attribution sums to 100%
- Data Quality Metrics: Validates pipeline health scores are within ranges
- Business Rules: Customer segments, product performance tiers, etc.
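As an illustration of the singular-test style, a business rule such as the attribution-weight check can be written as a query that returns violating rows (the test passes when no rows are returned). The column names below are assumptions, not the project's actual test:

-- Hypothetical singular test: tests/assert_attribution_weights_sum_to_100_pct.sql
-- Flags journeys whose attribution weights do not sum to 100% (column names assumed)
-- Adjust the 1.0 threshold if weights are stored as percentages (i.e. summing to 100)
select
    customer_id,
    sum(attribution_weight) as total_weight
from {{ ref('fact_customer_journey') }}
group by customer_id
having abs(sum(attribution_weight) - 1.0) > 0.001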
The warehouse includes comprehensive data quality monitoring through the fact_data_quality table:
-- Check overall pipeline health
SELECT
data_source,
overall_pipeline_health_score,
data_quality_rating,
pipeline_efficiency_rating
FROM fact_data_quality
ORDER BY overall_pipeline_health_score DESC;
- Data Flow Percentages: Source → Staging → Integration → Warehouse
- Test Pass Rates: By layer and overall
- Data Completeness: Percentage of expected data present
- Quality Scores: 0-100 scoring across all dimensions
- Pipeline Efficiency: Performance and optimization ratings
-- Identify data quality issues
SELECT data_source
FROM fact_data_quality
WHERE overall_pipeline_health_score < 85
OR data_quality_rating = 'Poor'
OR pipeline_efficiency_rating = 'Poor';
# Update data quality metrics
dbt run --select "int_data_pipeline_metadata"
dbt run --select "wh_fact_data_quality"
# Check quality test results
dbt test --select "fact_data_quality"
# Generate docs with model descriptions and lineage
dbt docs generate
# Serve documentation locally
dbt docs serve --port 8080
- Model Descriptions: Business purpose and context for each model
- Column Documentation: Detailed descriptions, data types, and constraints
- Data Lineage: Visual representation of model dependencies
- Test Results: Pass/fail status for all tests
- Source Freshness: Data recency checks
See docs/user_guide.md for:
- Business use cases and KPIs
- Pre-built SQL queries for common analytics
- Visualization specifications
- Dashboard creation guidance
The project includes a complete Looker LookML implementation that provides a self-service analytics layer on top of the data warehouse.
- 16 View Files: Comprehensive coverage of all fact and dimension tables
- 12 Explores: Pre-configured analysis paths for different business use cases
- 3 Production-Ready Dashboards: Executive, Sales, and Data Quality monitoring
- Advanced Analytics: Multi-touch attribution, cohort analysis, and predictive metrics
Purpose: C-level business performance monitoring with real-time KPIs and trends
Key Features:
- 6 primary KPIs: Revenue, Orders, AOV, Customers, CAC, ROAS
- Revenue trend analysis with period comparisons
- Channel performance breakdown
- Marketing efficiency tracking (spend vs revenue)
- Website conversion funnel visualization
Purpose: Deep-dive sales analysis for operations and product teams
Key Features:
- Sales performance with YoY comparisons
- Product performance rankings
- Customer segmentation analysis
- Geographic distribution maps
- Order status and fulfillment tracking
Purpose: Real-time data pipeline health monitoring and quality assurance
Key Features:
- Overall test pass rate (97.8%)
- Pipeline health scores by data source
- Data flow efficiency metrics
- Error and warning tracking
- Source data volume monitoring
lookml/
├── models/
│ └── ecommerce_demo.model.lkml # Main model with explores
├── views/
│ ├── dim_*.view.lkml # Dimension views
│ └── fact_*.view.lkml # Fact table views
└── dashboards/
├── executive_overview.dashboard.lookml
├── sales_orders_analytics.dashboard.lookml
└── data_quality_monitoring.dashboard.lookml
- Deploy LookML Project: clone the lookml directory to your Looker instance and configure the connection to BigQuery as 'ra_dw_prod'
- Access Dashboards:
  - Navigate to Looker → Dashboards
  - Select from Executive Overview, Sales Analytics, or Data Quality
  - Use date filters to adjust reporting periods
- Explore Data:
  - Use any of the 12 pre-built explores
  - Build custom reports and visualizations
  - Schedule automated delivery of dashboards
For detailed LookML documentation, see docs/lookml-project-documentation.md.
ra_warehouse_ecommerce_v2/
├── dbt_project.yml # Project configuration
├── packages.yml # dbt package dependencies
├── README.md # This file
├── docs/
│ ├── user_guide.md # Business user documentation
│ ├── lookml-project-documentation.md # LookML implementation guide
│ └── img/ # Dashboard screenshots
├── lookml/ # Looker LookML project
│ ├── models/
│ │ └── ecommerce_demo.model.lkml
│ ├── views/
│ │ ├── dim_*.view.lkml # Dimension views
│ │ └── fact_*.view.lkml # Fact table views
│ └── dashboards/
│ ├── executive_overview.dashboard.lookml
│ ├── sales_orders_analytics.dashboard.lookml
│ └── data_quality_monitoring.dashboard.lookml
├── seeds/ # Demo/reference data
│ ├── shopify/ # Shopify seed data
│ ├── ga4/ # GA4 seed data
│ ├── google_ads/ # Google Ads seed data
│ ├── facebook_ads/ # Facebook Ads seed data
│ └── klaviyo/ # Email marketing seed data
├── models/
│ ├── staging/ # Data cleaning and standardization
│ │ ├── stg_shopify_ecommerce/
│ │ ├── stg_ga4_events/
│ │ ├── stg_google_ads/
│ │ ├── stg_facebook_ads/
│ │ ├── stg_pinterest_ads/
│ │ ├── stg_instagram_business/
│ │ └── stg_klaviyo_emails/
│ ├── integration/ # Business logic and cross-source joins
│ │ ├── int_customers.sql
│ │ ├── int_orders.sql
│ │ ├── int_products.sql
│ │ ├── int_sessions.sql
│ │ ├── int_campaigns.sql
│ │ ├── int_email_events.sql
│ │ └── int_customer_journey.sql
│ └── warehouse/ # Dimensional model (facts & dimensions)
│ ├── wh_dim_customers.sql
│ ├── wh_dim_products.sql
│ ├── wh_dim_date.sql
│ ├── fact_orders.sql
│ ├── fact_sessions.sql
│ ├── fact_customer_journey.sql
│ ├── fact_marketing_performance.sql
│ ├── fact_email_marketing.sql
│ └── fact_data_quality.sql
└── tests/ # Custom SQL tests
Clean and standardize raw data with basic validation:
- Shopify: Orders, customers, products, order lines
- GA4: Page views, purchases, add-to-cart, session starts
- Ad Platforms: Campaigns, ad groups, ads with unified metrics
- Klaviyo: Email campaigns, events, person profiles
- Instagram: Social posts with engagement metrics
Apply business logic and create cross-source relationships:
- int_customers: Customer metrics, segmentation, LTV calculations
- int_orders: Order aggregations, sequence analysis, classifications
- int_products: Product performance, inventory, revenue analysis
- int_sessions: Website session analysis with conversion indicators
- int_campaigns: Multi-platform campaign performance
- int_customer_journey: Cross-channel attribution and journey analysis
- int_email_events: Email engagement with customer context
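To make the integration layer concrete, a model like int_customers typically rolls order history up to one row per customer. The sketch below is illustrative only; the upstream column names are assumptions, not the actual int_customers.sql:

-- Illustrative integration-layer customer rollup (column names are assumptions)
with orders as (
    select * from {{ ref('int_orders') }}
)
select
    customer_id,
    count(*)              as lifetime_orders,
    sum(order_total)      as lifetime_revenue,
    min(order_created_at) as first_order_at,
    max(order_created_at) as most_recent_order_at
from orders
group by customer_id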
Dimensional model optimized for analytics:
- wh_dim_customers: SCD Type 2 customer dimension with segments
- wh_dim_products: SCD Type 2 product dimension with performance metrics
- wh_dim_date: Complete date dimension with fiscal calendar
- wh_dim_channels_enhanced: Marketing channel classifications
- wh_dim_email_campaigns: Email campaign details and categorization
- fact_orders: Order line items with full context and metrics
- fact_sessions: Website sessions with engagement and conversion data
- fact_events: Individual website events with context
- fact_customer_journey: Multi-touch attribution with weights
- fact_marketing_performance: Cross-platform advertising metrics
- fact_email_marketing: Email campaign performance and engagement
- fact_social_posts: Social media content performance
- fact_data_quality: Pipeline health and data quality monitoring
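To show how these tables are meant to be queried together, a typical star join against the warehouse might look like the following. The surrogate key, flag, and measure names are assumptions based on common conventions and may differ from the actual column names:

-- Illustrative star-schema query (key and column names are assumptions)
SELECT
    d.date_day,
    c.customer_segment,
    SUM(f.order_total)         AS revenue,
    COUNT(DISTINCT f.order_id) AS orders
FROM fact_orders f
JOIN wh_dim_customers c
  ON f.customer_key = c.customer_key
 AND c.is_current = TRUE   -- assumes an SCD Type 2 current-row flag
JOIN wh_dim_date d
  ON f.order_date_key = d.date_key
GROUP BY 1, 2
ORDER BY 1;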
The warehouse supports 9 comprehensive business use cases:
- Executive Overview: High-level KPIs and business performance
- Sales & Orders: Revenue tracking and order analysis
- Marketing & Attribution: Cross-channel performance and ROI
- Website & User Engagement: Session analysis and conversion funnels
- Customer Insights & Segments: RFM analysis and customer lifetime value
- Product & Inventory: Product performance and stock management
- Email & Campaign Performance: Email marketing effectiveness
- Social Content Performance: Social media engagement analysis
- Data Quality Monitoring: Pipeline health and data completeness
See docs/user_guide.md for detailed queries and visualization specifications.
- dbt_project.yml: Project settings, model configurations, variables
- packages.yml: External dbt package dependencies
- sources.yml: Source data definitions and freshness checks
- schema.yml: Model and column documentation with tests
# BigQuery project
export DBT_PROJECT_ID="your-project-id"
# Development vs production datasets
export DBT_DATASET_SUFFIX="_dev" # For development
# Data freshness settings
export DBT_SOURCE_FRESHNESS_WARN_AFTER="12 hours"
export DBT_SOURCE_FRESHNESS_ERROR_AFTER="24 hours"
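One hedged sketch of how such variables might be consumed inside the project, using dbt's env_var() function; exactly where this project reads them is an assumption:
# Illustrative only: reading environment variables with env_var()
sources:
  - name: shopify_raw
    database: "{{ env_var('DBT_PROJECT_ID', 'your-project-id') }}"
    schema: fivetran_shopify
    tables:
      - name: order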
- Staging: Tables (for performance with seed data)
- Integration: Tables (for complex transformations)
- Warehouse: Tables (for BI tool performance)
- Large Facts: Incremental (for production scale)
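A minimal sketch of how those defaults could be expressed in dbt_project.yml, assuming the standard per-folder configuration pattern (the project key and folder names below are assumptions):
# Illustrative dbt_project.yml excerpt (project key and folder names are assumptions)
models:
  ra_dw_ecommerce:
    staging:
      +materialized: table
    integration:
      +materialized: table
    warehouse:
      +materialized: table
Large fact models can then override the default with +materialized: incremental in their own config blocks for production scale.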
- Branch: Create a feature branch from main
- Develop: Make changes following project conventions
- Test: Run dbt test and ensure all tests pass
- Document: Update model documentation and README if needed
- Quality: Run dbt run --select fact_data_quality to check pipeline health
- PR: Submit a pull request with a clear description
- Follow dbt best practices for model organization
- Include comprehensive column-level documentation
- Add appropriate tests for all models
- Use consistent naming conventions
- Include business context in model descriptions
- All models must have schema tests
- New business logic requires custom SQL tests
- Maintain >95% test pass rate
- Update data quality monitoring when adding new sources
- Documentation: Run dbt docs serve for full model documentation
- Issues: Check the data quality tables for pipeline health
- Performance: Use dbt run --threads <n> to adjust parallelism
- Debugging: Use dbt --log-level debug for detailed logging
Copyright © 2025 Rittman Analytics. All rights reserved.
This project is licensed under the MIT License - see the LICENSE file for details.