Data Science 2.0 #56

@FRosner

Description

Overview

We would like to communicate the idea of test-driven data processing and data science. Maybe we can start with one or two meetups to get feedback and then move on to a conference.

Idea

Typical data science workflows start with a pipeline of data transformations, usually for preprocessing and feature engineering. This is often followed by steps for model training, selection, and application.

In the software development lifecycle, continuous integration and test-driven development already receive a lot of attention. They improve code quality and allow new developers to quickly get started with the code and make changes to it without fear of breaking existing functionality.

When working with real data in real-world data science use cases, you will encounter data quality problems from the beginning. The user-defined transformations may also introduce problems or errors. Why don't we apply the ideas of continuous integration and test-driven development to the data science workflow as well?
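To make this concrete, here is a minimal sketch of what a test for a user-defined transformation could look like, using pandas and pytest. The function and column names (add_age_bucket, age) are hypothetical and only serve as an illustration:

```python
import pandas as pd


def add_age_bucket(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical user-defined transformation: derive a categorical age bucket."""
    out = df.copy()
    out["age_bucket"] = pd.cut(
        out["age"],
        bins=[0, 18, 65, 120],
        labels=["minor", "adult", "senior"],
    )
    return out


def test_add_age_bucket_assigns_expected_labels():
    df = pd.DataFrame({"age": [10, 30, 70]})
    result = add_age_bucket(df)
    assert list(result["age_bucket"]) == ["minor", "adult", "senior"]


def test_add_age_bucket_preserves_row_count():
    df = pd.DataFrame({"age": [10, 30, 70]})
    assert len(add_age_bucket(df)) == len(df)
```

Running such tests with pytest on every commit means a change to the transformation is verified against fixed example data, just like any other piece of software.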

When you then apply an existing transformation to a new version of your data source, the automated checks will tell you whether the following steps (such as feature engineering) are still valid. This is the basic principle of failing fast: avoiding bugs that would otherwise be discovered at a much later stage (e.g. as a model that performs badly).
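A sketch of such a fail-fast check (again, all column names and bounds are illustrative assumptions, not from this issue): the loader refuses a new data version that violates the assumptions the downstream feature engineering relies on.

```python
import pandas as pd

# Assumptions about the data source; names and bounds are illustrative.
EXPECTED_COLUMNS = {"user_id", "age", "signup_date"}


def load_users(path: str) -> pd.DataFrame:
    """Load the data source, failing fast on violated assumptions."""
    df = pd.read_csv(path)
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    if df["user_id"].isna().any():
        raise ValueError("user_id must never be null")
    # between() is False for NaN, so null ages also trip this check
    if not df["age"].between(0, 120).all():
        raise ValueError("age outside the plausible range [0, 120]")
    return df
```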

Key Points

  • Data sources, code, and tests should be versioned (see the checksum sketch below)
  • Data loading and transformation should be reproducible and covered by automated tests
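For the first point, one lightweight way to version a data source is to pin a checksum of the file the tests were written against, so a silently changed input fails the build. A sketch, where the path and digest are placeholders:

```python
import hashlib

# Placeholder digest; replace with the checksum of the pinned data version.
PINNED_SHA256 = "0" * 64
DATA_PATH = "data/users.csv"  # hypothetical path


def sha256_of(path: str) -> str:
    """Stream the file through SHA-256 so large files stay memory-friendly."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def test_data_source_matches_pinned_version():
    assert sha256_of(DATA_PATH) == PINNED_SHA256
```

Dedicated tools (e.g. dataset version control systems) can do this more thoroughly, but even a pinned checksum makes the "which data was this pipeline tested against?" question answerable.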
