A comprehensive data engineering repository template supporting ETL, ML, and GenAI workflows.
- 🔄 ETL Pipelines: Data extraction, transformation, and loading
- 🤖 Machine Learning: Model training and inference pipelines
- 🧠 Generative AI: LLM integration, RAG systems, and AI agents
- 📊 Analytics: Jupyter notebooks for exploration and reporting
- 🛠 Development Tools: Testing, linting, and CI/CD ready
- 📦 Modular Design: Extensible structure for any data project
# Clone the repository
git clone <repository-url>
cd PyDataEngineerTemplate
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt
# Install pre-commit hooks
pre-commit install
# Run tests
make test
# Run linting
make lint
# See all available commands
make help
PyDataEngineerTemplate/
├── src/ # Source code
│ ├── jobs/ # ETL/ML/GenAI jobs
│ ├── utilities/ # Reusable utilities
│ ├── configs/ # Configuration management
│ └── workflows/ # Orchestration (Airflow, etc.)
├── tests/ # Test suite
├── assets/ # Data, schemas, and resources
├── docs/ # Documentation
└── templates/ # Code templates
python -m src.jobs.sample_etl_job
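For orientation, here is a minimal sketch of the extract/transform/load shape a job under src/jobs/ could follow. It is not the contents of sample_etl_job. The function names, the id and amount columns, and the JSON Lines output are all assumptions made for illustration.

```python
"""Hypothetical minimal ETL job; names and columns are illustrative only."""
import csv
import io
import json


def extract(source):
    """Read CSV rows from a file-like object into a list of dicts."""
    return list(csv.DictReader(source))


def transform(rows):
    """Drop rows without an id and cast id/amount to numeric types."""
    cleaned = []
    for row in rows:
        if not row.get("id"):
            continue  # skip records that cannot be keyed
        cleaned.append({"id": int(row["id"]), "amount": float(row["amount"])})
    return cleaned


def load(rows, sink):
    """Write rows as JSON Lines to a file-like object; return the row count."""
    for row in rows:
        sink.write(json.dumps(row) + "\n")
    return len(rows)


def run(source, sink):
    """Chain the three stages and return how many rows were loaded."""
    return load(transform(extract(source)), sink)


if __name__ == "__main__":
    # In-memory sample data (assumed), so the sketch runs without any files.
    sample = io.StringIO("id,amount\n1,9.50\n,3.00\n2,12.25\n")
    out = io.StringIO()
    print(run(sample, out), "rows loaded")
    print(out.getvalue(), end="")
```

Keeping each stage a plain function that takes and returns data makes the stages easy to unit test in isolation.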
jupyter lab src/notebooks/
# All tests
pytest
# Unit tests only
pytest tests/unit/
# Integration tests only
pytest tests/integration/
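As a rough illustration of what a file under tests/unit/ might contain, the sketch below uses pytest's convention of collecting functions named test_* and plain assert statements. The normalize_column_name helper is hypothetical and is defined inline only to keep the example self-contained. In a real project it would live somewhere under src/ and be imported.

```python
# Hypothetical tests/unit/test_normalize.py; the helper below is an assumption.


def normalize_column_name(name):
    """Lowercase a column name and join its words with underscores."""
    return "_".join(name.strip().lower().split())


def test_lowercases_and_joins_words():
    assert normalize_column_name("Order Date") == "order_date"


def test_collapses_extra_whitespace():
    assert normalize_column_name("  Total   Amount ") == "total_amount"
```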
See STRUCTURE_GUIDE.md for detailed guidelines on extending this repository structure for your specific use cases.
Copy .env.example to .env and update it with your settings:
cp .env.example .env
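Once .env exists, application code needs a way to read it. The sketch below is one stdlib-only way to do that, assuming the file holds simple KEY=VALUE lines. The parse_env and load_env helpers and the keys in the sample are hypothetical, and this template may already load settings differently (for example, through a dedicated library).

```python
import os


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip surrounding whitespace and optional quotes from the value.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def load_env(path=".env"):
    """Copy settings into os.environ without overriding existing values."""
    with open(path, encoding="utf-8") as handle:
        for key, value in parse_env(handle.read()).items():
            os.environ.setdefault(key, value)


if __name__ == "__main__":
    # Sample content with assumed keys; a real .env will differ.
    example = '# database\nDB_HOST=localhost\nDB_PORT=5432\nAPI_KEY="changeme"\n'
    print(parse_env(example))
```

Using setdefault means variables already set in the shell or by CI take precedence over the file.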
- Fork the repository
- Create a feature branch
- Make your changes
- Run tests and linting
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.