Getting Started with EasyData Environments
Let's say you want to use EasyData to manage your environments reproducibly. This tutorial outlines how to get up and running managing your environments this way. Here's what we'll cover:
- Install requirements
  - Installing Anaconda or Miniconda
  - Creating an easydata conda environment
- Create your repo (also called a project) based on an EasyData project template
  - Create a project using an EasyData cookiecutter template
  - Initialize the project as a git repo
- Create and explore the default environment
  - Create and explore the default project conda environment
  - Explore the default paths
- Customize your conda environment
  - Updating the environment
  - Checking your updates back into the project repo
  - Deleting and recreating the environment
- Customize your local settings (things you shouldn't check in to a repo)
  - Customize your local paths configuration
  - Customize your environment variables
  - Customizing your local config to include credentials
Install requirements
This is a setup step that you only need to do once. After it's done, you shouldn't need to repeat it, though occasionally you may need to update your requirements.
- Install Anaconda: if you don't already have Anaconda or Miniconda, you'll need to install it following the instructions for your platform (macOS/Windows/Linux)
- Open a terminal window
- Configure Anaconda/Miniconda:
  - Set your channel priority to strict:
    conda config --set channel_priority strict
  - On a JupyterHub instance, you will need to store your conda environments in your home directory so that they persist across JupyterHub sessions:
    conda config --prepend envs_dirs ~/.conda/envs  # Store environments in local dir for JupyterHub
- Install the remaining requirements:
  conda create -n easydata python=3 cookiecutter make
  conda activate easydata
  pip install ruamel.yaml

We've now created a conda environment named easydata to house these requirements; it's the environment we'll use to create EasyData projects. Once this environment exists, we won't need to create it again.
Create a project using an EasyData cookiecutter template
The best time to use an EasyData template is when you first create your project/repo. We will assume that you are starting your project from scratch.
Note: We recommend using EasyData to create every project you work on, so that there is a 1:1 correspondence between your conda environments and your projects. There are many issues with having more than one repo use the same conda environment, so whatever you do, please don't use one monolithic environment to rule them all. For more on this idea, see Tip #2 of Kjell's talk on building reproducible workflows.
- Open a terminal window
- Activate the easydata environment created above: conda activate easydata
- Navigate to the location where you'd like your project to live (without creating the project directory; that happens automagically in the next step). For example, if I want my project to be located in /home/<my-repo-name>, I would navigate to /home in this step.
- Create your project: run cookiecutter https://github.com/hackalog/easydata and fill in the prompts. Note that the repo name you enter will be the name of the directory your project lives in.
We've now created a project filled with the EasyData template files in <my-repo-name>.
Initialize the project as a git repo
We'd like to use git to keep track of changes that are made to our project. Now is the best time to initialize the git repo.
- Navigate into the project: cd <my-repo-name> (the repo name as entered into the prompts of the previous step)
- Initialize the repo and tag this branch for future EasyData updates:
  git init
  git add .
  git commit -m "initial import"
  git branch easydata  # tag for future easydata upgrades
Create and explore the default environment
The EasyData template project <my-repo-name> is populated with recommended default settings that we've developed over time and that work as a nice base for most folks creating a data-science-related project. It's a great place to start, and until you're familiar with how and why these defaults work in most cases, we don't recommend messing with them.
First off, the EasyData template comes with everything that you need to create a conda environment with the same name as your project base directory. That is, a conda environment with the name <my-repo-name>. To create this environment:
- Navigate into the project repo: e.g. cd <my-repo-name>
- Check that the conda binary is specified correctly: the output of which conda should match the CONDA_EXE line in Makefile.include. If not, update the Makefile.include file accordingly.
- Create the environment: make create_environment
- Activate the environment: conda activate <my-repo-name>
The conda environment is now ready. To see what's in it, open up the environment.yml file. In it is a list of the packages that are installed in the environment. The section near the top under - pip: lists pip-based dependencies; the rest of the dependencies in the default environment.yml are conda-based. These packages are now all installed in the virtual conda environment <my-repo-name>.
There is also a file of the form environment.<your-architecture>.lock.yml. If you open this up, you will see all of the packages that are installed in your conda environment <my-repo-name>. This is the list of all the dependencies that were installed to make the packages in environment.yml run.
We like to think of these two files as representing two different perspectives:
-
environment.yml: The packages that you want -
environment.<your-architecture>.lock.yml: The packages that you need (to run what you want)
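To make the distinction concrete, here's a hypothetical minimal environment.yml (the package names are illustrative, not the actual EasyData defaults):

```yaml
# environment.yml — the packages you *want* (names are illustrative only)
name: my-repo-name
channels:
  - conda-forge
dependencies:
  - pandas
  - pip:
    - ruamel.yaml
```

The corresponding environment.<your-architecture>.lock.yml would then pin exact versions of pandas and ruamel.yaml, plus every transitive dependency the solver pulled in to make them run (numpy, pytz, and so on), even though you never asked for those directly.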
Explore the default paths
As hardcoded paths are a notorious source of reproducibility issues, EasyData attempts to avoid path-related problems by introducing a mechanism for handling paths. These paths are part of the local configuration, and they let code refer to the project's directories by name. By default, they reflect the out-of-the-box project organization described at the bottom of the README file in the project.
The goal of the paths mechanism is to help ensure that hardcoded path data is never checked-in to the git repository.
The default paths are all relative to the catalog_path, the exact location of the local config file catalog/config.ini (don't move this file!):
[Paths]
cache_path = ${data_path}/interim/cache
data_path = ${project_path}/data
figures_path = ${output_path}/figures
interim_data_path = ${data_path}/interim
notebook_path = ${project_path}/notebooks
output_path = ${project_path}/reports
processed_data_path = ${data_path}/processed
project_path = ${catalog_path}/..
raw_data_path = ${data_path}/raw
template_path = ${project_path}/reference/templates
abfs_cache = ${interim_data_path}/abfs_cache
Note that, for chicken-and-egg reasons, catalog_path (the location of the config.ini file used to specify the paths) is not specified in this file. It is set upon module instantiation (when {{ cookiecutter.module_name }} is imported) and is write-protected.
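The ${...} substitution in config.ini follows the same style as Python's configparser with ExtendedInterpolation. As a rough sketch of how such resolution works (this is an illustration of the mechanism, not EasyData's actual implementation; the file contents and paths below are made up):

```python
from configparser import ConfigParser, ExtendedInterpolation
from pathlib import Path

# A made-up config fragment in the same style as catalog/config.ini
config_text = """
[Paths]
data_path = ${project_path}/data
raw_data_path = ${data_path}/raw
project_path = ${catalog_path}/..
"""

parser = ConfigParser(interpolation=ExtendedInterpolation())
parser.read_string(config_text)

# catalog_path is injected at load time, mirroring how EasyData sets it
# on module import rather than storing it in the file
parser["Paths"]["catalog_path"] = "/home/user/my-repo-name/catalog"

# ${...} references resolve on access; normalizing with pathlib removes the ".."
raw = Path(parser["Paths"]["raw_data_path"]).resolve()
print(raw)  # /home/user/my-repo-name/data/raw
```

Changing one value (say, project_path) changes every path derived from it, which is why editing config.ini is enough to relocate the whole tree.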
You can access any of the default paths in Python code, once paths has been imported, via the name of the path. For example, to access the data_path:
from {{ cookiecutter.module_name }} import paths
paths['data_path']

where {{ cookiecutter.module_name }} is the module name specified when answering the project prompts. If you've forgotten the module name, look it up in .easydata.yml. Typically {{ cookiecutter.module_name }} is set to src.
Exercise: Check that the paths mechanism resolves to the correct paths as laid out in the default project.
Notice that paths are automatically resolved to absolute filenames (as pathlib.Path objects) when accessed.
Recall that one of the EasyData design goals is to ensure that hardcoded paths are never checked into your git repository. To this end, paths should never be set from within notebooks or source code that is checked in to git. If you wish to modify a path on your local system, edit config.ini directly, or use Python from the command line, as shown below:
$ python -c "import {{ cookiecutter.module_name }}; {{ cookiecutter.module_name }}.paths['project_path'] = '/alternate/bigdata/path'"

When accessed from Python, you'll immediately see that the paths have all changed:
>>> for name, location in paths.items():
...     print(f"{name}: {location}")
data_path: /alternate/bigdata/path/{{ cookiecutter.repo_name }}/data
raw_data_path: /alternate/bigdata/path/{{ cookiecutter.repo_name }}/data/raw
interim_data_path: /alternate/bigdata/path/{{ cookiecutter.repo_name }}/data/interim
processed_data_path: /alternate/bigdata/path/{{ cookiecutter.repo_name }}/data/processed
project_path: /alternate/bigdata/path/{{ cookiecutter.repo_name }}

as has config.ini:
$ cat catalog/config.ini
[Paths]
data_path = ${project_path}/data
raw_data_path = ${data_path}/raw
interim_data_path = ${data_path}/interim
processed_data_path = ${data_path}/processed
project_path = /alternate/bigdata/path

If you ever need to see the raw (non-resolved) versions of the paths from within Python, use paths.data:
>>> for name, location in paths.data.items():
...     print(f"{name}: {location}")
data_path: ${project_path}/data
raw_data_path: ${data_path}/raw
interim_data_path: ${data_path}/interim
processed_data_path: ${data_path}/processed
project_path: /alternate/bigdata/path

For more information on the paths mechanism:

>>> from {{ cookiecutter.module_name }} import paths
>>> help(paths)

One of the features we love about having originally based EasyData on cookiecutter-datascience is the thoughtful project organization structure and description that it includes as part of the template README file. While we've made our own tweaks, we haven't strayed far from the original.
YOUR NEXT TASK for this stage of the quest: explore the "Project Organization" section of the (mostly) default Easydata README.md for this repo. If you look closely, you'll find the next stop on your Quest for Reproducibility.