Data science project template



A modern template for data science projects with all the necessary tools for experiment, development, testing, and deployment. From notebooks to production.

✨📚✨ Documentation: https://joserzapata.github.io/data-science-project-template/

Source Code: https://github.com/JoseRZapata/data-science-project-template


Features


Table of Contents

📁 Creating a New Project

👍 Recommendations

It is highly recommended to use a manager for Python versions, dependencies, and virtual environments.

This project uses uv, an extremely fast tool that replaces pip, pip-tools, pipx, Poetry, pyenv, twine, virtualenv, and more.

🌟 Check how to setup your environment: https://joserzapata.github.io/data-science-project-template/local_setup/

🍪🥇 Via Cruft - (recommended)

# Install cruft in an isolated environment using uv

uv tool install cruft

# Or install with pip

pip install --user cruft # Install `cruft` on your PATH for easy access

# Create a new project from the template
cruft create https://github.com/JoseRZapata/data-science-project-template

Then, inside the project folder, initialize git and the uv environment using Make:

make init_git
make install_env
source .venv/bin/activate

🍪 Via Cookiecutter

uv tool install cookiecutter # Install cookiecutter in an isolated environment

# Or install with pip

pip install --user cookiecutter # Install `cookiecutter` on your PATH for easy access
cookiecutter gh:JoseRZapata/data-science-project-template

Note: Cookiecutter uses gh: as shorthand for https://github.com/

🔗 Linking an Existing Project

If the project was originally created with Cookiecutter, you must first use Cruft to link it with the original template:

cruft link https://github.com/JoseRZapata/data-science-project-template

Then (or if the project was created with Cruft in the first place), pull in the latest template changes:

cruft update

🗃️ Project structure

Folder structure for data science projects:

.
├── .code_quality
│   ├── mypy.ini                        # mypy configuration
│   └── ruff.toml                       # ruff configuration
├── .github                             # github configuration
│   ├── actions
│   │   └── python-poetry-env
│   │       └── action.yml              # github action to setup python environment
│   ├── dependabot.md                   # dependabot configuration to update dependencies
│   ├── pull_request_template.md        # template for pull requests
│   └── workflows                       # github actions workflows
│       ├── ci.yml                      # run continuous integration (tests, pre-commit, etc.)
│       ├── dependency_review.yml       # review dependencies
│       ├── docs.yml                    # build documentation (mkdocs)
│       └── pre-commit_autoupdate.yml   # update pre-commit hooks
├── .vscode                             # vscode configuration
│   ├── extensions.json                 # list of recommended extensions
│   ├── launch.json                     # vscode launch configuration
│   └── settings.json                   # vscode settings
├── conf                                # folder configuration files
│   └── config.yaml                     # main configuration file
├── data
│   ├── 01_raw                          # raw immutable data
│   ├── 02_intermediate                 # typed data
│   ├── 03_primary                      # domain model data
│   ├── 04_feature                      # model features
│   ├── 05_model_input                  # often called 'master tables'
│   ├── 06_models                       # serialized models
│   ├── 07_model_output                 # data generated by model runs
│   ├── 08_reporting                    # reports, results, etc
│   └── README.md                       # description of the data structure
├── docs                                # documentation for your project
│   └── index.md                        # documentation homepage
├── models                              # store final models
├── notebooks
│   ├── 1-data                          # data extraction and cleaning
│   ├── 2-exploration                   # exploratory data analysis (EDA)
│   ├── 3-analysis                      # statistical analysis, hypothesis testing
│   ├── 4-feat_eng                      # feature engineering (creation, selection, transformation)
│   ├── 5-models                        # model training, evaluation, and hyperparameter tuning
│   ├── 6-interpretation                # model interpretation
│   ├── 7-deploy                        # model packaging, deployment strategies
│   ├── 8-reports                       # storytelling, summaries, and analysis conclusions
│   ├── notebook_template.ipynb         # template for notebooks
│   └── README.md                       # information about the notebooks
├── src                                 # source code for use in this project
│   ├── README.md                       # description of src structure
│   ├── tmp_mock.py                     # example python file
│   ├── data                            # data extraction, validation, processing, transformation
│   ├── model                           # model training, evaluation, validation, export
│   ├── inference                       # model prediction, serving, monitoring
│   └── pipelines                       # orchestration of pipelines
│       ├── feature_pipeline            # transforms raw data into features and labels
│       ├── training_pipeline           # transforms features and labels into a model
│       └── inference_pipeline          # takes features and a trained model for predictions
├── tests                               # test code for your project
│   ├── test_mock.py                    # example test file
│   ├── data                            # tests for data module
│   ├── model                           # tests for model module
│   ├── inference                       # tests for inference module
│   └── pipelines                       # tests for pipelines module
├── .editorconfig                       # editor configuration
├── .gitignore                          # files to ignore in git
├── .pre-commit-config.yaml             # configuration for pre-commit hooks
├── codecov.yml                         # configuration for codecov
├── Makefile                            # useful commands to setup environment, run tests, etc.
├── mkdocs.yml                          # configuration for mkdocs documentation
├── pyproject.toml                      # dependencies and configuration project file
├── uv.lock                             # locked dependencies
└── README.md                           # description of your project    

✨ Features and Tools

🚀 Project Standardization and Automation

🔨 Developer Workflow Automation

🌱 Conditionally Rendered Python Package or Project Boilerplate

🔧 Maintainability

🏷️ Type Checking and Data Validation

  • Static type-checking with Mypy
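As a minimal illustration (the function below is hypothetical, not part of the template), mypy uses type annotations like these to catch type errors before the code ever runs:

```python
# Hypothetical example (not from the template): a fully annotated
# function that mypy can verify statically.
def mean(values: list[float]) -> float:
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


print(mean([1.0, 2.0, 3.0]))  # 2.0
# mean("abc")  # mypy would reject this call: incompatible argument type
```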

✅ 🧪 Testing/Coverage
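As a sketch of how tests look in this setup (hypothetical function and test, mirroring the role of tests/test_mock.py), pytest discovers plain functions named `test_*` and runs their assertions:

```python
# Hypothetical example of a pytest-style test; the template ships
# tests/test_mock.py as a starting point.
def add(a: int, b: int) -> int:
    """Add two integers."""
    return a + b


def test_add() -> None:
    # pytest collects this function automatically and reports
    # any failing assertion
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
```

Running `make test` executes such tests with pytest and collects coverage.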

🚨 Linting

🔍 Code quality
🎨 Code formatting

👷 CI/CD

Automatic Dependency updates
Dependency Review in PR
  • Dependency Review with dependency-review-action: this action scans your pull requests for dependency changes and raises an error if any vulnerabilities or invalid licenses are introduced.
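A minimal workflow using this action might look like the following sketch (action version tags are assumptions; the template's own workflow lives in .github/workflows/dependency_review.yml):

```yaml
# Hypothetical minimal workflow; the template ships its own version in
# .github/workflows/dependency_review.yml
name: Dependency Review
on: pull_request

permissions:
  contents: read

jobs:
  dependency-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/dependency-review-action@v4
```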
Pre-commit automatic updates
  • Automatic updates with GitHub Actions workflow .github/workflows/pre-commit_autoupdate.yml

🔒 Security

🔏 Static Application Security Testing (SAST)

⌨️ Accessibility

🔨 Automation tool (Makefile)

Makefile to automate environment setup, dependency installation, test execution, and more. In a terminal, type `make` to see the available commands:

Target                Description
-------------------   ----------------------------------------------------
check                 Run code quality tools with pre-commit hooks.
docs_test             Test if documentation can be built without warnings or errors
docs_view             Build and serve the documentation
init_env              Install dependencies with uv and activate env
init_git              Initialize git repository
install_data_libs     Install pandas, scikit-learn, Jupyter, seaborn
pre-commit_update     Update pre-commit hooks
test                  Test the code with pytest and coverage

📝 Project Documentation

🗃️ Templates

Good practices


References

