A collection of modern Spark Declarative Pipeline (SDP) implementations demonstrating different data processing paradigms. This repository showcases both open source Spark Declarative Pipelines for analytics workloads and Spark Declarative Pipelines on Databricks (formerly called LDP) for streaming data processing.
```
etl-pipelines/
├── README.md                     # This overview document
├── CLAUDE.md                     # Claude Code configuration
└── src/py/
    ├── sdp/                      # OSS Spark Declarative Pipelines examples
    │   ├── README.md             # Comprehensive SDP documentation
    │   ├── daily_orders/         # E-commerce analytics pipeline
    │   ├── oil_rigs/             # Industrial IoT monitoring pipeline
    │   └── utils/                # Shared data generation utilities
    ├── lsdp/                     # Lakeflow Spark Declarative Pipelines (on Databricks)
    │   └── music_analytics/      # Million Song Dataset analytics pipeline
    │       ├── README.md         # Music analytics documentation
    │       ├── images/           # Pipeline visualization assets
    │       └── transformations/  # SDP transformation definitions
    └── generators/               # Cross-framework data generators
```
The OSS Spark Declarative Pipelines examples are perfect for batch analytics and data science workloads using PySpark.
```
# Navigate to SDP examples
cd src/py/sdp

# Install dependencies with UV
uv sync

# Run Daily Orders e-commerce pipeline
python main.py daily-orders

# Run Oil Rigs sensor monitoring pipeline
python main.py oil-rigs
```

The Lakeflow examples are ideal for streaming data processing with medallion architecture and data quality validation on the Databricks Platform.
```
# Navigate to Music Analytics SDP example
cd src/py/lsdp/music_analytics

# Deploy pipeline to Databricks workspace
# Pipeline processes Million Song Dataset with medallion architecture
# See README.md for detailed implementation overview
```

Daily Orders pipeline:
- Framework: Spark Declarative Pipelines
- Data: Synthetic e-commerce orders with 20+ product categories
- Features: Order lifecycle management, sales tax calculations, business analytics (see the sketch after this list)
- Storage: Local Spark warehouse with Parquet files
- Scale: Development/testing workloads
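
To give a flavor of the declarative style, here is a minimal sketch of a daily-orders aggregation with a sales tax calculation. The table name `orders_raw`, the column names, and the tax rate are hypothetical rather than the pipeline's actual schema; the `@dp.table` decorator follows the API referenced in the key concepts below.

```python
from pyspark import pipelines as dp
from pyspark.sql import functions as F

TAX_RATE = 0.0825  # hypothetical flat sales tax rate

@dp.table(comment="Daily order totals with sales tax applied")
def daily_orders_summary():
    # `orders_raw` is a hypothetical upstream dataset; SDP resolves the
    # dependency automatically from this table reference. `spark` is
    # provided by the pipeline runtime.
    return (
        spark.read.table("orders_raw")
        .withColumn("total_with_tax", F.col("amount") * (1 + TAX_RATE))
        .groupBy(F.to_date("order_ts").alias("order_date"))
        .agg(
            F.count("*").alias("order_count"),
            F.sum("total_with_tax").alias("revenue_with_tax"),
        )
    )
```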
Oil Rigs pipeline:
- Framework: Spark Declarative Pipelines
- Data: IoT sensor data from Texas oil fields (temperature, pressure, water level)
- Features: Multi-location monitoring, statistical analysis, interactive visualizations (see the aggregation sketch after this list)
- Storage: Local Spark warehouse with time-series data
- Scale: Sensor analytics and operational monitoring
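
To illustrate the kind of statistical summary this pipeline produces, here is a minimal, self-contained PySpark sketch that computes per-rig statistics. The readings and column names are hypothetical stand-ins for the data produced by the shared generation utilities.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("oil-rig-stats").getOrCreate()

# Hypothetical sensor readings; the real pipeline generates these with the
# shared utilities under src/py/sdp/utils/.
readings = spark.createDataFrame(
    [("rig-01", 98.6, 2150.0, 12.3),
     ("rig-01", 101.2, 2188.0, 12.1),
     ("rig-02", 87.4, 1990.5, 15.8)],
    ["rig_id", "temperature", "pressure", "water_level"],
)

# Per-rig summary statistics for operational monitoring
stats = readings.groupBy("rig_id").agg(
    F.avg("temperature").alias("avg_temperature"),
    F.stddev("pressure").alias("pressure_stddev"),
    F.min("water_level").alias("min_water_level"),
    F.max("water_level").alias("max_water_level"),
)
stats.show()
```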
Music Analytics pipeline:
- Framework: Spark Declarative Pipelines
- Data: Million Song Dataset with 20 fields of artist, song, and audio features
- Architecture: Medallion pattern with specialized silver tables and comprehensive gold analytics
- Silver Layer: Domain-focused tables (`songs_metadata_silver`, `songs_audio_features_silver`) with comprehensive data quality validation
- Gold Layer: 9 advanced analytics tables across temporal, artist, and musical analysis dimensions
- Analytics: Artist discography analysis, temporal trends, musical characteristics, tempo/time signature patterns, comprehensive artist profiles
- Storage: Delta tables with Unity Catalog integration and automatic data lineage
- Scale: Production-ready streaming data processing with Auto Loader and extensive data quality rules (see the ingestion sketch after this list)
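
As a sketch of the streaming ingestion mentioned above, the following shows a bronze table fed by Databricks Auto Loader (the `cloudFiles` stream source). The landing path and file format are hypothetical; `@dp.table` follows the decorator API this repository references.

```python
from pyspark import pipelines as dp

# Hypothetical landing zone for raw Million Song Dataset files
RAW_PATH = "/Volumes/main/music/raw_songs/"

@dp.table(comment="Raw song records ingested incrementally with Auto Loader")
def songs_bronze():
    # `spark` is provided by the pipeline runtime on Databricks.
    return (
        spark.readStream
        .format("cloudFiles")                        # Auto Loader
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.inferColumnTypes", "true")
        .load(RAW_PATH)
    )
```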
- PySpark 4.1.0.preview2: Latest Spark features with Python API
- Spark Declarative Pipelines: Building data processing pipelines with materialized views and streaming tables
- UV Package Manager: Modern Python dependency management
- Faker: Realistic synthetic data generation (see the generator sketch after this list)
- Plotly: Interactive data visualizations
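
To show the synthetic-data pattern behind the shared utilities, here is a minimal sketch of order generation with Faker. The field names and categories are hypothetical rather than the repository's actual schema.

```python
import random
from faker import Faker

fake = Faker()

# Hypothetical product categories; the real utilities define 20+ of these.
CATEGORIES = ["electronics", "books", "apparel", "home", "toys"]

def generate_order() -> dict:
    """Produce one synthetic e-commerce order record."""
    return {
        "order_id": fake.uuid4(),
        "customer_name": fake.name(),
        "category": random.choice(CATEGORIES),
        "amount": round(random.uniform(5.0, 500.0), 2),
        "order_ts": fake.date_time_this_year().isoformat(),
    }

if __name__ == "__main__":
    for _ in range(3):
        print(generate_order())
```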
- Declarative Pipelines: SDP framework built around Python decorators such as `@dp.table` (see the sketch after this list)
- Medallion Architecture: Bronze/Silver/Gold data layers with specialized silver tables for domain separation
- Materialized Views: Efficient data transformation caching and automatic dependency resolution
- Data Quality Framework: Comprehensive validation rules via `@dp.expect` decorators covering tempo, duration, and metadata
- Advanced Analytics: Multi-dimensional gold layer tables combining temporal, artist, and musical analysis
- Shared Utilities: Reusable data generation components across frameworks
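
A minimal sketch tying these concepts together: a silver table reading from a bronze dataset with quality rules attached via `@dp.expect`. All table and column names are hypothetical; the decorators follow the `@dp.table`/`@dp.expect` API referenced above.

```python
from pyspark import pipelines as dp
from pyspark.sql import functions as F

@dp.table(comment="Songs promoted from bronze to silver after validation")
@dp.expect("valid_tempo", "tempo > 0")         # violations are surfaced in pipeline metrics
@dp.expect("has_title", "title IS NOT NULL")
def songs_silver():
    # `songs_bronze` is a hypothetical upstream table; SDP builds the
    # dependency graph automatically from this reference.
    return (
        spark.read.table("songs_bronze")
        .select("song_id", "title", "artist_name", "tempo", "duration")
        .filter(F.col("duration") > 0)
    )
```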
```
# Environment setup
cd src/py/sdp && uv sync

# Run pipelines
python main.py daily-orders    # E-commerce analytics
python main.py oil-rigs        # IoT sensor monitoring

# Test utilities
uv run sdp-test-orders         # Test order generation
uv run sdp-test-oil-sensors    # Test sensor data generation

# Development commands
uv run pytest                  # Run tests
uv run black .                 # Format code
uv run flake8 .                # Lint code
```

```
# Navigate to Music Analytics pipeline
cd src/py/lsdp/music_analytics

# View pipeline documentation and architecture
cat README.md

# Deploy to Databricks workspace (requires Databricks environment)
# See transformations/sdp_musical_pipeline.py for implementation
```

This repository demonstrates:
- Framework Comparison: OSS SDP vs. Lakeflow SDP on Databricks for different use cases and data processing paradigms
- Data Generation: Realistic synthetic data creation patterns with Faker library
- Pipeline Architecture: Declarative transformations, medallion architecture, and specialized table design
- Quality Engineering: Comprehensive data validation with `@dp.expect` rules and monitoring strategies
- Advanced Analytics: Multi-dimensional analysis combining temporal trends, artist profiling, and musical characteristics
- Modern Tooling: UV package management, Unity Catalog, Auto Loader, and latest Spark features
- Production Patterns: Streaming ingestion, environment management, and scalable deployment workflows
- SDP README.md: Comprehensive Spark Declarative Pipelines guide
- Music Analytics SDP README: Million Song Dataset Spark Declarative Pipelines implementation
- CLAUDE.md: Claude Code configuration for repository navigation
- Python 3.11+: Required for all frameworks
- UV Package Manager: Modern dependency management
- Java 11+: Required by PySpark (handled automatically)
- Databricks Workspace: Required for SDP pipelines on Databricks
```
# Install UV package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone repository
git clone <repository-url>
cd etl-pipelines

# Setup SDP environment
cd src/py/sdp && uv sync

# Verify installation
uv run python -c "import pyspark; print('PySpark version:', pyspark.__version__)"
```

- Centralized Utilities: Shared data generation functions
- Clear Separation: Framework-specific implementations
- Configuration Management: Environment-specific settings
- Comprehensive Testing: Unit tests and validation scripts
- Quality First: Built-in data validation and monitoring
- Scalable Patterns: From development to production
- Modern Tooling: Latest framework features and best practices
- Documentation: Comprehensive guides and examples
- Modularity: Reusable components and transformations
- Observability: Metrics, logging, and monitoring
- Flexibility: Support for both batch and streaming workloads
- Maintainability: Clear structure and comprehensive documentation
This repository provides practical examples of modern data engineering patterns, suitable for learning and development. Each framework demonstrates different strengths and use cases in the data processing ecosystem.