
Getting Started with EasyData Environments

Amy Wooding edited this page Dec 31, 2022 · 25 revisions

Let's say that you want to use EasyData to manage your environments reproducibly. This tutorial outlines how to get up and running. Here's what we'll cover.

  1. Install requirements
    • Installing Anaconda or Miniconda
    • Creating an easydata conda environment
  2. Create your repo (also called a project) based on an EasyData project template
    • Create a project using an EasyData cookiecutter template
    • Initialize the project as a git repo
  3. Create and explore the default environment
    • Create and explore the default project conda environment
    • Explore the default paths
  4. Customize your conda environment
    • Updating the environment
    • Checking your updates back into the project repo
    • Deleting and recreating the environment
  5. Customize your local settings (things you shouldn't check in to a repo)
    • Customize your local paths configuration
    • Customize your environment variables
    • Customizing your local config to include credentials

1. Install Requirements

This is a setup step that you only need to do once, although occasionally it may be necessary to update your requirements.

  1. Install Anaconda: if you don't already have Anaconda or Miniconda, install it following the instructions for your platform (macOS/Windows/Linux)
  2. Open a terminal window
  3. Configure your anaconda setup
    • Set your channel priority to strict: conda config --set channel_priority strict
    • On a JupyterHub instance, store your conda environments in your home directory so that they persist across JupyterHub sessions: conda config --prepend envs_dirs ~/.conda/envs # Store environments in local dir for JupyterHub
  4. Install the remaining requirements:
conda create -n easydata python=3 cookiecutter make
conda activate easydata
pip install ruamel.yaml

We've created a conda environment named easydata that houses the remaining requirements and that we'll use to create EasyData projects. Once this environment exists as created above, we won't need to create it again.
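Before moving on, it can help to confirm that the tools installed above are actually on your PATH. The following is a minimal sketch, not part of EasyData: check_tools is a hypothetical helper name, and the list of tools to check is our assumption based on the steps in this section.

```shell
#!/bin/sh
# check_tools is a hypothetical helper (not shipped with EasyData).
# It prints one line per tool and returns non-zero if any tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Typical use, from inside the activated easydata environment:
#   check_tools conda cookiecutter make
#   conda config --show channel_priority   # should report "strict"
```

If anything is reported as missing, rerun the install commands above with the easydata environment activated.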

2. Create your EasyData repo

The best time to use an EasyData template is when you first create your project/repo. We will assume that you are starting your project from scratch.

Note: We recommend using EasyData to create every project you work with so that there is at least a 1:1 ratio between your conda environments and your projects. There are many issues with having more than one repo using the same conda environment, so whatever you do please don't use a monolithic environment to rule them all. For more on this idea see Tip #2 of Kjell's talk on building reproducible workflows.

Create a project using an EasyData cookiecutter template

  1. Open a terminal window
  2. Activate the easydata environment created above: conda activate easydata
  3. Navigate to the location where you'd like your project to be located (without creating the project directory; that happens automagically in the next step). For example, if I want my project to be located in /home/<my-repo-name>, I would navigate to /home in this step.
  4. Create your project. Run cookiecutter https://github.com/hackalog/easydata and fill in the prompts. Note that the repo name that you enter will be the name of the directory that your project lives in.

We've now created a project filled with the EasyData template files in <my-repo-name>.

Initialize the project as a git repo

We'd like to use git to keep track of changes that are made to our project. Now is the best time to initialize the git repo.

  1. Navigate into the project: cd <my-repo-name>, using the repo name you entered at the prompts in the previous step
  2. Initialize the repo:
git init
git add .
git commit -m "initial import"
git branch easydata   # tag for future easydata upgrades
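  3. Tag this branch for future EasyData updates: the git branch easydata command above creates an easydata branch at the initial commit for this purpose.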
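To double-check that the repo was initialized as described, one option is a small sanity check like the sketch below. verify_easydata_repo is a hypothetical helper name, not something EasyData provides; it relies only on standard git subcommands (rev-parse and show-ref).

```shell
#!/bin/sh
# verify_easydata_repo is a hypothetical helper (not part of EasyData).
# It checks that a directory holds a git repo with at least one commit
# and a branch named "easydata".
verify_easydata_repo() {
    repo_dir=${1:-.}
    # HEAD only resolves once the initial commit exists.
    if ! git -C "$repo_dir" rev-parse --verify --quiet HEAD >/dev/null 2>&1; then
        echo "no commits found in $repo_dir"
        return 1
    fi
    # The easydata branch is created by "git branch easydata".
    if ! git -C "$repo_dir" show-ref --verify --quiet refs/heads/easydata; then
        echo "branch 'easydata' not found in $repo_dir"
        return 1
    fi
    echo "ok: initial commit and easydata branch present"
}

# Typical use from inside the project directory:
#   verify_easydata_repo .
```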

3. Create and explore the default environment

The EasyData template project <my-repo-name> is populated with recommended default settings: defaults that we've developed over time and that work as a nice base for most folks who are creating a data science related project. It's a great place to start, and until you're familiar with how and why these defaults work in most cases, we don't recommend messing with them.

Create and explore the default project conda environment

First off, the EasyData template comes with everything that you need to create a conda environment with the same name as your project base directory. That is, a conda environment with the name <my-repo-name>. To create this environment:

  1. Navigate into the project repo: e.g. cd <my-repo-name>
  2. Check that the conda binary is specified correctly: check that the output of which conda matches the line of CONDA_EXE in Makefile.include. If not, update the Makefile.include file accordingly.
  3. Create the environment: make create_environment
  4. Activate the environment: conda activate <my-repo-name>
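The check in step 2 can be done by eye, but here is a rough sketch of how it might be scripted. conda_exe_from_makefile is a hypothetical helper, and we are assuming that the relevant line in Makefile.include has the form CONDA_EXE ?= /path/to/conda (the sketch also accepts = and :=); adjust the pattern if your file looks different.

```shell
#!/bin/sh
# conda_exe_from_makefile is a hypothetical helper (not part of EasyData).
# Assumption: the line looks like "CONDA_EXE ?= /path/to/conda"
# (plain "=" and ":=" are accepted too). Prints the first matching value.
conda_exe_from_makefile() {
    sed -n 's/^CONDA_EXE[[:space:]]*[?:]\{0,1\}=[[:space:]]*//p' "${1:-Makefile.include}" | head -n 1
}

# Typical use from the project root:
#   echo "Makefile.include: $(conda_exe_from_makefile)"
#   echo "which conda:      $(which conda)"
```

This only prints the two values side by side. A path written with ~ in Makefile.include will not match the expanded path from which conda character for character, so compare them with that in mind.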

The conda environment is now ready. To see what's in it, open up the environment.yml file, which lists the packages that are installed in the environment. A section near the top under - pip: lists the pip-based dependencies; the rest of the dependencies in the default environment.yml are conda-based. These packages are now all installed in the virtual conda environment <my-repo-name>.

There is also a file of the form environment.<your-architecture>.lock.yml. If you open this up, you will see all of the packages that are installed in your conda environment <my-repo-name>. This is the list of all the dependencies that were installed to make the packages in environment.yml run.

We like to think of these two files as representing two different perspectives:

  • environment.yml: The packages that you want
  • environment.<your-architecture>.lock.yml: The packages that you need (to run what you want)
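One way to get a feel for the difference between "want" and "need" is to list the entries under dependencies: in each file and compare how many there are. The sketch below is only an illustration: list_deps is a hypothetical helper, and it assumes both files use the usual conda layout of a top-level dependencies: key followed by "- package" items. We have not verified that layout for the lock file, so treat the result as approximate.

```shell
#!/bin/sh
# list_deps is a hypothetical helper (not part of EasyData).
# Assumption: the file has a top-level "dependencies:" key followed by
# "- item" lines, as in a typical conda environment file.
# Prints one dependency entry per line, including nested pip entries.
list_deps() {
    awk '
        /^dependencies:/ { in_deps = 1; next }   # start of the list
        /^[^ \t#-]/      { in_deps = 0 }         # any other top-level key ends it
        in_deps && /^[ \t]*- / {
            sub(/^[ \t]*- /, "")                 # strip the list marker
            print
        }
    ' "$1"
}

# Typical use from the project root:
#   list_deps environment.yml | wc -l                            # what you want
#   list_deps environment.<your-architecture>.lock.yml | wc -l   # what you need
```

Under these assumptions the lock file should list noticeably more entries than environment.yml, since it also records the packages pulled in as dependencies.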

Explore the default paths

There is another part of your compute environment that is created by default: the project paths structure.

4. Customize your conda environment

5. Customize your local settings
