A comprehensive repository for PySpark jobs, utilities, and notebooks.
```bash
# Clone the repository
git clone <your-repo-url>
cd PyDataEngineerTemplate

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install as an editable package
pip install -e .

# Run tests
make test

# Start Jupyter
make jupyter
```
The repository is organized into the following top-level directories:

- `src/` - Source code for all PySpark jobs and utilities
- `notebooks/` - Jupyter notebooks for exploration and tutorials
- `tests/` - Comprehensive test suite
- `docs/` - Documentation and examples
- `assets/` - Sample data and schemas
- `templates/` - Reusable templates
```text
.
├── .gitignore
├── README.md            # Overview, setup instructions, examples
├── requirements.txt     # Python dependencies
├── setup.py             # For installing as a package
├── Makefile             # Common commands (test, lint, build)
├── .env.example         # Environment variables template
├── pyproject.toml       # Modern Python project config (alternative to setup.py)
│
├── tests/
│   ├── unit/            # Unit tests for small functions
│   ├── integration/     # Spark/cluster tests
│   └── fixtures/        # Test data & reusable objects
│
├── docs/
│   ├── api/             # API reference (auto-generated)
│   ├── tutorials/       # How-to guides
│   └── examples/        # Example scripts with explanations
│
├── assets/
│   ├── data/
│   │   ├── sample/      # Sample datasets
│   │   └── schemas/     # JSON/Avro/Parquet schema files
│   └── images/          # Diagrams, charts, visual assets
│
├── src/
│   ├── __init__.py
│   ├── schemas/         # Data schemas for jobs/utilities
│   ├── utilities/
│   │   ├── transformations/  # Data transformation snippets
│   │   ├── io/               # I/O operation snippets
│   │   ├── ml/               # ML utility snippets
│   │   ├── spark_utils/      # Spark-specific utilities
│   │   └── common/           # General utility snippets
│   ├── configs/         # YAML/JSON configs for jobs
│   ├── orchestration/
│   │   ├── airflow/     # Airflow DAGs
│   │   ├── prefect/     # Prefect flows (if used)
│   │   └── schedulers/  # Custom scheduling logic
│   ├── pipeline/        # Pipeline job Python files
│   └── notebooks/       # All Jupyter notebooks
│       ├── exploratory/ # Data exploration notebooks
│       ├── tutorials/   # Educational notebooks
│       ├── examples/    # Standalone runnable notebooks
│       └── archived/    # Old/deprecated notebooks
│
└── templates/           # Project/job templates
    ├── notebook_templates/  # Pre-built Jupyter templates
    ├── job_templates/       # Spark job templates
    └── config_templates/    # Sample configs
```
At the repository root:

- `.gitignore` - Git ignore patterns for Python, Jupyter, and data files
- `README.md` - Project overview, setup instructions, and usage examples
- `requirements.txt` - Python package dependencies
- `setup.py` - Package installation configuration (traditional approach)
- `pyproject.toml` - Modern Python project configuration (PEP 518)
- `Makefile` - Common development commands (test, lint, build, deploy)
- `.env.example` - Template for environment variables
Inside `tests/`:

- `unit/` - Fast, isolated unit tests for individual functions and classes
- `integration/` - End-to-end tests including Spark jobs and data pipelines
- `fixtures/` - Shared test data, mock objects, and reusable test components (see the fixture sketch below)
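For example, a shared SparkSession fixture could live in the fixtures package. This is a minimal sketch assuming pytest and a `conftest.py`; the fixture name and scope are illustrative, not prescribed by this template:

```python
# Hypothetical tests/fixtures/conftest.py: a shared SparkSession for tests
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    """One local SparkSession reused across the whole test run."""
    session = (
        SparkSession.builder
        .master("local[2]")   # small local cluster for fast tests
        .appName("unit-tests")
        .getOrCreate()
    )
    yield session
    session.stop()
```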
Inside `docs/`:

- `api/` - Auto-generated API documentation from docstrings
- `tutorials/` - Step-by-step guides and how-to documentation
- `examples/` - Complete example scripts with detailed explanations
Inside `assets/`:

- `data/sample/` - Sample datasets for testing and examples
- `data/schemas/` - JSON Schema, Avro, and Parquet schema definitions (see the loading sketch below)
- `images/` - Architecture diagrams, charts, and visual documentation
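Schema files stored here can be loaded back into Spark types at runtime. A minimal sketch, assuming the schema was serialized with `StructType.jsonValue()`; the `events.json` file name is hypothetical:

```python
# Load a Spark schema previously serialized to JSON in assets/data/schemas/
import json

from pyspark.sql.types import StructType

with open("assets/data/schemas/events.json") as f:  # hypothetical file
    events_schema = StructType.fromJson(json.load(f))

# The schema can then enforce types on read, e.g.:
# df = spark.read.schema(events_schema).json("assets/data/sample/events/")
```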
Inside `src/`:

- `schemas/` - Data schema definitions for validation and documentation
- `utilities/` - Reusable utility modules organized by function:
  - `transformations/` - Data transformation functions and classes
  - `io/` - Input/output operations (CSV, Parquet, databases, APIs)
  - `ml/` - Machine learning utilities and model helpers
  - `spark_utils/` - Spark session management and configuration helpers (see the sketch below)
  - `common/` - General-purpose utilities (logging, configuration, etc.)
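To illustrate, the `spark_utils` session helper used in the usage example at the end of this README might look like the following. This is a minimal sketch, not the template's actual implementation; the `master` default and config value are assumptions:

```python
# Hypothetical src/utilities/spark_utils sketch
from pyspark.sql import SparkSession


def create_spark_session(app_name: str, master: str = "local[*]") -> SparkSession:
    """Build (or reuse) a SparkSession with a few sensible defaults."""
    return (
        SparkSession.builder
        .appName(app_name)
        .master(master)
        .config("spark.sql.session.timeZone", "UTC")  # assumed default
        .getOrCreate()
    )
```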
- `configs/` - YAML/JSON configuration files for different environments
- `orchestration/` - Workflow orchestration code:
  - `airflow/` - Apache Airflow DAG definitions (a minimal DAG sketch follows below)
  - `prefect/` - Prefect workflow definitions (if used)
  - `schedulers/` - Custom scheduling and orchestration logic
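For instance, a DAG in `airflow/` that submits one of the pipeline jobs could look like this sketch; the DAG id, schedule, and job path are all assumptions:

```python
# Hypothetical src/orchestration/airflow/daily_pipeline.py sketch
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_job = BashOperator(
        task_id="run_pipeline_job",
        bash_command="spark-submit src/pipeline/example_job.py",  # hypothetical job
    )
```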
- `pipeline/` - Main pipeline and job Python files
- `notebooks/` - Jupyter notebooks organized by purpose:
  - `exploratory/` - Data exploration and analysis notebooks
  - `tutorials/` - Educational notebooks explaining concepts
  - `examples/` - Standalone runnable example notebooks
  - `archived/` - Deprecated or old notebooks kept for reference
Inside `templates/`:

- `notebook_templates/` - Standardized Jupyter notebook templates
- `job_templates/` - Spark job and pipeline templates (a skeleton sketch follows below)
- `config_templates/` - Sample configuration files for different scenarios
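As a rough idea of what a job template in `job_templates/` might contain, here is a sketch with placeholder paths; the actual templates shipped with the project may differ:

```python
# Hypothetical Spark job template skeleton (templates/job_templates/)
from pyspark.sql import SparkSession


def main() -> None:
    spark = SparkSession.builder.appName("REPLACE_ME").getOrCreate()
    try:
        df = spark.read.parquet("REPLACE_ME/input")  # placeholder input path
        # ... transformations go here ...
        df.write.mode("overwrite").parquet("REPLACE_ME/output")  # placeholder output path
    finally:
        spark.stop()


if __name__ == "__main__":
    main()
```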
This structure is supported by a comprehensive build system (the `Makefile` and `build.py`) with the following capabilities:
- Code Validation: Syntax checking, linting, and type checking
- Testing: Unit and integration test execution with coverage reporting
- Documentation: Auto-generation of API docs and tutorial processing
- Asset Management: Schema validation and data file packaging
- Notebook Processing: Validation, execution, and format conversion
- Code Formatting: Black, isort, and flake8 integration
- Security Scanning: Bandit and safety checks
- Dependency Management: Requirements validation and updates
- Configuration Validation: YAML/JSON syntax and schema checking
- Package Building: Python wheel and source distribution creation
- Docker Integration: Containerized deployment support
- Spark Job Packaging: Cluster-ready job packaging
- Asset Bundling: Complete project artifact creation
```bash
# Full project build
python build.py full
make build

# Test specific directories
make test-unit           # tests/unit/
make test-integration    # tests/integration/

# Process specific components
python build.py assets   # assets/ directory
python build.py docs     # docs/ directory
python build.py package  # src/ packaging
```
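The `build.py` entry point presumably dispatches these subcommands through a small CLI. A minimal sketch, where only the subcommand names come from the examples above and everything else is assumed:

```python
# Hypothetical build.py dispatch sketch
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(description="Project build tasks")
    parser.add_argument(
        "task",
        choices=["full", "assets", "docs", "package"],
        help="Which build step to run",
    )
    args = parser.parse_args()
    print(f"Running build task: {args.task}")  # real build steps would go here


if __name__ == "__main__":
    main()
```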
This structure provides a scalable foundation for data engineering projects with clear separation of concerns and comprehensive tooling support.
Example usage:

```python
from src.utilities.spark_utils import create_spark_session
from src.utilities.transformations import clean_data

# Create Spark session
spark = create_spark_session("MyApp")

# Load and clean data
df = spark.read.parquet("data/input.parquet")
clean_df = clean_data(df)
```
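The exact behavior of `clean_data` is project-specific, but a typical implementation might look like the following sketch; the choice of cleaning steps (trimming strings, deduplicating, dropping all-null rows) is an assumption:

```python
# Hypothetical src/utilities/transformations/clean_data sketch
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def clean_data(df: DataFrame) -> DataFrame:
    """Trim string columns, drop duplicates, and remove fully-null rows."""
    string_cols = [
        field.name
        for field in df.schema.fields
        if field.dataType.simpleString() == "string"
    ]
    for col in string_cols:
        df = df.withColumn(col, F.trim(F.col(col)))
    return df.dropDuplicates().dropna(how="all")
```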