EasyData Talks and Tutorials
Conda environments can be fantastic for managing your data science dependencies. They can also be fragile, conflict-riddled, disk-filling monsters. Wouldn't it be great if we could easily maintain, delete, and reproduce these environments on a project-by-project basis? We can, and all it takes is a little Makefile magic.
At our shop, we had a problem: our conda environments were a mess. Most of us kept one or two monolithic environments around per Python version (conda activate data_science_37, anyone?), but these environments quickly became fragile and unmaintainable. Upgrading packages was near-impossible because of version conflicts with other installed packages. Switching machines was a nightmare, as we were never really sure which packages a particular application required. We couldn't easily fix environments, and we couldn't delete them. We didn't know how to recreate them, so we had no easy way to share them. We were stuck.
In desperation, we started scripting our conda environment creation. Since we were already using make for our data pipelines, we stashed the creation code there, forced ourselves to create a unique conda environment for each git repo, and checked the whole thing in with the rest of the codebase.
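The earliest versions looked something like the minimal sketch below. The project name, target names, and file layout are illustrative assumptions, not the talk's exact code:

```make
# A minimal sketch of a per-repo environment Makefile: one conda
# environment per git repo, built from a checked-in environment.yml.
# PROJECT_NAME is an assumed placeholder; recipes must be tab-indented.
PROJECT_NAME := my_project

.PHONY: create_environment delete_environment

# Create the project's conda environment from the checked-in environment.yml
create_environment:
	conda env create --name $(PROJECT_NAME) --file environment.yml

# Remove the project's conda environment entirely
delete_environment:
	conda env remove --name $(PROJECT_NAME)
```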
Over time, we tweaked these Makefile targets to work around some long-standing limitations of our conda setups. We added lockfiles and self-documenting targets. We found reliable ways to mix pip and conda (in the odd cases where it was needed) and started making heavy use of editable Python modules in our workflow. It worked out better than we ever imagined: our work became reproducible, portable, and better documented.
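As a rough sketch of where that leads (not the talk's exact implementation), here is an update target that regenerates a lockfile from the solved environment using conda's built-in conda env export, plus the widely used grep/awk idiom for a self-documenting make help. Names are again illustrative:

```make
# A sketch, assuming the same PROJECT_NAME placeholder as above; the
# lockfile name and target names are assumptions for illustration.
# Note: pip-only and editable ("-e .") dependencies can live in a pip:
# section of environment.yml, so the same update target covers them too.
PROJECT_NAME := my_project
LOCKFILE     := environment.$(PROJECT_NAME).lock.yml

.PHONY: update_environment help
.DEFAULT_GOAL := help

update_environment:  ## Re-solve the environment from environment.yml, then refresh the lockfile
	conda env update --name $(PROJECT_NAME) --file environment.yml --prune
	conda env export --name $(PROJECT_NAME) > $(LOCKFILE)

help:  ## List all documented (##-commented) targets
	@grep -E '^[a-zA-Z_-]+:.*?## ' $(MAKEFILE_LIST) | \
		awk 'BEGIN {FS = ":.*?## "}; {printf "%-20s %s\n", $$1, $$2}'
```

With targets documented by ## comments, make help prints each target next to its description, and the lockfile records the exact solved package versions so the environment can be recreated elsewhere.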
In this talk, I walk you through the challenges of creating a reproducible, maintainable data science environment using little more than conda, environment.yml, Makefiles, and git, in the hope that you too will be able to make your conda environments more manageable.
https://github.com/hackalog/make_better_defaults
Tired of wasting your time and energy re-doing work that you’ve done before? Want to reduce the hidden costs that come with collaboration? In this hands-on tutorial, we’ll uncover the overlooked parts of making your data science workflow reproducible. You’ll learn about gotchas, reproducibility bugs, and better defaults along the way.
https://github.com/acwooding/easydata-tutorial
Up your Bus Number - A Reproducible Data Science Workflow - Kjell Wooding, Amy Wooding | PyData NYC 2018
How fragile is your data science pipeline? Can you recover from a laptop crash? Can you reproduce last year's analysis? (Can your co-workers?) This tutorial will take you through the process of making your data science work reproducible: from using the right tools, to creating reproducible workflows, to patterns for testing, automating, and sharing your results.
Warning: this tutorial is based on a very old version of EasyData, so the implementation is significantly out of date.
https://github.com/hackalog/bus_number