EasyData Talks and Tutorials
Conda environments can be fantastic for managing your data science dependencies. They can also be fragile, conflict-riddled, disk-filling monsters. Wouldn't it be great if we could easily maintain, delete, and reproduce these environments on a project-by-project basis? We can, and all it takes is a little Makefile magic.
At our shop, we had a problem: our conda environments were a mess. Most of us kept one or two monolithic environments around per Python version (conda activate data_science_37, anyone?), but these environments quickly became fragile and unmaintainable. Upgrading packages was near-impossible because of version conflicts with other installed packages. Switching machines was a nightmare, as we were never really sure which packages a particular application required. We couldn't easily fix environments, and we couldn't delete them. We didn't know how to recreate them, so we had no easy way to share them. We were stuck.
In desperation, we started scripting our conda environment creation. Since we were already using make for our data pipelines, we stashed the creation code there, forced ourselves to create a unique conda environment for each git repo, and checked the whole thing in with the rest of the codebase.
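The earliest versions looked something like the minimal sketch below. The project name, target names, and file layout are illustrative assumptions, not the talk's exact code:

```make
# A minimal sketch of a per-repo environment Makefile: one conda
# environment per git repo, built from a checked-in environment.yml.
# PROJECT_NAME is an assumed placeholder; recipes must be tab-indented.
PROJECT_NAME := my_project

.PHONY: create_environment delete_environment

# Create the project's conda environment from the checked-in environment.yml
create_environment:
	conda env create --name $(PROJECT_NAME) --file environment.yml

# Remove the project's conda environment entirely
delete_environment:
	conda env remove --name $(PROJECT_NAME)
```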
Over time, we tweaked these Makefile targets to work around some long-standing limitations of our conda setups. We added lockfiles and self-documenting targets. We found reliable ways to mix pip and conda (in the odd cases where it was needed) and started making heavy use of editable Python modules in our workflow. It worked out better than we ever imagined: our work became reproducible, portable, and better documented.
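As a rough sketch of where that leads (not the talk's exact implementation), here is an update target that regenerates a lockfile from the solved environment using conda's built-in conda env export, plus the widely used grep/awk idiom for a self-documenting make help. Names are again illustrative:

```make
# A sketch, assuming the same PROJECT_NAME placeholder as above; the
# lockfile name and target names are assumptions for illustration.
# Note: pip-only and editable ("-e .") dependencies can live in a pip:
# section of environment.yml, so the same update target covers them too.
PROJECT_NAME := my_project
LOCKFILE     := environment.$(PROJECT_NAME).lock.yml

.PHONY: update_environment help
.DEFAULT_GOAL := help

update_environment:  ## Re-solve the environment from environment.yml, then refresh the lockfile
	conda env update --name $(PROJECT_NAME) --file environment.yml --prune
	conda env export --name $(PROJECT_NAME) > $(LOCKFILE)

help:  ## List all documented (##-commented) targets
	@grep -E '^[a-zA-Z_-]+:.*?## ' $(MAKEFILE_LIST) | \
		awk 'BEGIN {FS = ":.*?## "}; {printf "%-20s %s\n", $$1, $$2}'
```

With targets documented by ## comments, make help prints each target next to its description, and the lockfile records the exact solved package versions so the environment can be recreated elsewhere.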
In this talk, I walk you through the challenges of creating a reproducible, maintainable data science environment using little more than conda, environment.yml, Makefiles, and git, in the hope that you too will be able to make your conda environments more manageable.
https://github.com/hackalog/make_better_defaults
Tired of wasting your time and energy re-doing work that you’ve done before? Want to reduce the hidden costs that come with collaboration? In this hands-on tutorial, we’ll uncover the overlooked parts of making your data science workflow reproducible. You’ll learn about gotchas, reproducibility bugs, and better defaults along the way.
https://github.com/acwooding/easydata-tutorial
Up your Bus Number - A Reproducible Data Science Workflow - Kjell Wooding, Amy Wooding | PyData NYC 2018
How fragile is your data science pipeline? Can you recover from a laptop crash? Can you reproduce last year's analysis? (Can your co-workers?) This tutorial will take you through the process of making your data science work reproducible: from using the right tools, to creating reproducible workflows, to patterns for testing, automating, and sharing your results.
Warning: this tutorial is based on a very old version of EasyData, so the implementation is significantly out of date.
https://github.com/hackalog/bus_number