Overview
We would like to communicate the idea of test-driven data processing and data science. Maybe we can start with one or two meetups to get feedback and then move on to a conference talk.
Idea
Data science workflows typically start with a pipeline of data transformations, usually for preprocessing and feature engineering, often followed by steps for model training, selection, and application.
In the software development lifecycle, continuous integration and test-driven development already receive a lot of attention. They improve code quality and allow new developers to get started with the code quickly and make changes to it without the fear of breaking existing functionality.
When working with real data in real-world data science use cases, you will run into data quality problems right from the start, and user-defined transformations may introduce further problems or errors. Why not also apply the ideas of continuous integration and test-driven development to the data science workflow?
When you then apply an existing transformation to a new version of your data source, automated checks tell you whether the following steps (such as feature engineering) are still valid. This is the basic principle of failing fast: avoiding bugs that are only discovered at a much later stage (e.g. as a model that performs badly).
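A minimal sketch of what such an automated check could look like, assuming pandas and pytest as the stack (the transformation, column names, and bucket boundaries are made-up examples, not a fixed choice):

```python
import pandas as pd


def add_age_bucket(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical feature-engineering step: bucket the 'age' column."""
    out = df.copy()
    out["age_bucket"] = pd.cut(
        out["age"], bins=[0, 18, 65, 120], labels=["minor", "adult", "senior"]
    )
    return out


def test_add_age_bucket_fails_fast_on_broken_data():
    df = pd.DataFrame({"age": [5, 30, 80]})
    result = add_age_bucket(df)
    # Row count must be preserved and the derived feature must be present,
    # so a broken new data version is caught before model training.
    assert len(result) == len(df)
    assert result["age_bucket"].tolist() == ["minor", "adult", "senior"]
```

Running this test in CI against every new data version or code change is exactly the fail-fast feedback loop described above.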
Key Points
- Data sources, code and tests should be versioned
- Data loading and transformations should be reproducible and covered by automated tests (see the sketch below)
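One way the second point could look in practice, again a sketch with made-up file layout, column names, and value ranges:

```python
import pandas as pd

# Schema we expect from the data source; any new version must satisfy it.
EXPECTED_COLUMNS = {"user_id", "age", "signup_date"}


def load_users(path: str) -> pd.DataFrame:
    """Reproducible loading step that fails fast on a bad data version."""
    df = pd.read_csv(path, parse_dates=["signup_date"])
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"data source is missing columns: {missing}")
    if not df["age"].between(0, 120).all():
        raise ValueError("implausible values in 'age' column")
    return df
```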