Data Science 2.0 #56

@FRosner

Description

Overview

We would like to communicate the idea of test-driven data processing and data science. Maybe we can start with one or two meetups to get feedback and then move on to a conference.

Idea

Typical data science workflows start with a pipeline of data transformations, usually for preprocessing and feature engineering. This is often followed by steps for model training, selection, and application.

In the software development lifecycle, continuous integration and test-driven development already receive a lot of attention. They improve code quality and allow new developers to quickly get started with the code and make changes to it without fear of breaking existing functionality.

When working with real data in real-world data science use cases, you will encounter data quality problems from the beginning. The user-defined transformations may also introduce problems or errors. Why don't we apply the ideas of continuous integration and test-driven development to the data science workflow as well?
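To make this concrete, here is a minimal sketch of what a test for a user-defined transformation could look like, using pandas and pytest. The function and column names (add_age_bucket, age) are hypothetical and only serve as an illustration:

```python
import pandas as pd


def add_age_bucket(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical user-defined transformation: derive a categorical age bucket."""
    out = df.copy()
    out["age_bucket"] = pd.cut(
        out["age"],
        bins=[0, 18, 65, 120],
        labels=["minor", "adult", "senior"],
    )
    return out


def test_add_age_bucket_assigns_expected_labels():
    df = pd.DataFrame({"age": [10, 30, 70]})
    result = add_age_bucket(df)
    assert list(result["age_bucket"]) == ["minor", "adult", "senior"]


def test_add_age_bucket_preserves_row_count():
    df = pd.DataFrame({"age": [10, 30, 70]})
    assert len(add_age_bucket(df)) == len(df)
```

Running such tests with pytest on every commit means a change to the transformation is verified against fixed example data, just like any other piece of software.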

When you then apply an existing transformation to a new version of your data source, the automated checks will tell you whether the following steps (such as feature engineering) are still valid. This is the basic principle of failing fast: avoiding bugs that would otherwise be discovered at a much later stage (e.g. as a model that performs badly).
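A sketch of such a fail-fast check (again, all column names and bounds are illustrative assumptions, not from this issue): the loader refuses a new data version that violates the assumptions the downstream feature engineering relies on.

```python
import pandas as pd

# Assumptions about the data source; names and bounds are illustrative.
EXPECTED_COLUMNS = {"user_id", "age", "signup_date"}


def load_users(path: str) -> pd.DataFrame:
    """Load the data source, failing fast on violated assumptions."""
    df = pd.read_csv(path)
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    if df["user_id"].isna().any():
        raise ValueError("user_id must never be null")
    # between() is False for NaN, so null ages also trip this check
    if not df["age"].between(0, 120).all():
        raise ValueError("age outside the plausible range [0, 120]")
    return df
```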

Key Points

  • Data sources, code, and tests should be versioned (see the checksum sketch below)
  • Data loading and transformation should be reproducible and covered by automated tests
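For the first point, one lightweight way to version a data source is to pin a checksum of the file the tests were written against, so a silently changed input fails the build. A sketch, where the path and digest are placeholders:

```python
import hashlib

# Placeholder digest; replace with the checksum of the pinned data version.
PINNED_SHA256 = "0" * 64
DATA_PATH = "data/users.csv"  # hypothetical path


def sha256_of(path: str) -> str:
    """Stream the file through SHA-256 so large files stay memory-friendly."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def test_data_source_matches_pinned_version():
    assert sha256_of(DATA_PATH) == PINNED_SHA256
```

Dedicated tools (e.g. dataset version control systems) can do this more thoroughly, but even a pinned checksum makes the "which data was this pipeline tested against?" question answerable.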
