- Best practices for source control on Terra, Part II: Work Environment
- Introduction and Scope
- Objectives
- Work environment: Terra workspace and GitHub
- Setting up all parts of your work environment
- Create your SSH keys on your laptop or desktop computer
- Set up your SSH key on your cloud environment’s persistent disk
- Test authentication
- Git setup
- Setup of notebook diff and cell output removal tools
- Set up a global "gitignore" file
- Clone a first GitHub repository
- Restore after cloud environment and persistent disk deletion
- What’s next?
- References
The previous document in this series, Best practices for source control on Terra, Part I: Background, provided some background on Terra and source control concepts.
This document describes best practices for source code control in Terra Workspaces for artifacts like notebooks, Python and R packages, or workflows. The goal of this solution is to enable you to manage, share and collaborate on artifacts effectively using the source code control system GitHub. In the following we use the term “source control” for brevity.
The initial focus is on source controlling notebooks and not on other artifacts like workflows. Those are discussed separately at a later point in time. Source controlling notebooks is an important use case and will have the biggest benefit for Terra users (including All of Us Workbench users).
All of Us Workbench: The All of Us workbench differs from the general Terra.bio system in a few areas. These differences are called out throughout this document so that it applies to the All of Us workbench as well.
The best practices do not discuss the management of workspace tables, reference data, samples in buckets, tables in BigQuery, or any other data - the discussion is focused on code only.
Reading this document and executing the commands provide you with the required knowledge and toolset to:
- Learn to use source control for notebooks
- Learn to prepare your Terra cloud environment and reinitialize it after its restart or its recreation
- Learn about the Terra workspace deployment architecture and understand the various storage systems involved
- Learn about several source control user journeys and the use cases that enable them
This section guides you through the initial setup of the Terra workspace cloud environment to use GitHub as the source control system. In addition, it discusses how to recover the cloud environment after the deletion of your persistent disk.
Several systems are involved in setting up your environment. These include:
- Laptop/desktop. This system is used to create and manage SSH keys. The reason for this is that it is possible to delete your Terra cloud environment completely: if you manage keys in the cloud environment – which is possible – you would have to recreate the SSH keys when the cloud environment is deleted, and you would have to update your GitHub account settings with the new keys. It is a lot easier and requires less effort to manage the SSH keys on your laptop or desktop and recover those from it as necessary.
- Persistent disk. The persistent disk holds all the artifacts that are source controlled (e.g., notebooks, workspace description, workspace metadata, workflows and packages). It is a best practice to have one subdirectory on the persistent disk for each GitHub repository that you interact with. Effectively this means that the artifacts of each workspace you work on are in a separate subdirectory. When the cloud environment is deleted and recreated, the persistent disk is unchanged and your data is preserved. The persistent disk also contains the required configuration files, which the boot disk accesses through commands that read them.
- Boot disk. While the persistent disk holds all your artifacts, the boot disk needs configuration information related to GitHub in order to execute any Git command properly. The best practice is to store all required data on the persistent disk and make it available from there to the appropriate directory location on the boot disk. If the cloud environment is recreated, you only have to make the configuration information available to the boot disk again; you do not need to recreate it, because it is stored on the persistent disk.
- GitHub. GitHub holds the artifacts that you want to put under source control as well as the public SSH key so that secure and authenticated communication is possible between the cloud environment and GitHub. You configure GitHub with the SSH keys that you generate on your laptop.
In the following you create all directories, keys, and configurations required to source control your artifacts.
Note: the following assumes that you have a GitHub account set up. If not, follow these instructions to create an account: Signing up for a new GitHub account. It might be that your organization has one set up for you already, or that it has specific requirements for how to set up a GitHub account.
In order to communicate securely with GitHub, the best strategy is to use SSH, as discussed in this document: GitHub "About SSH". Besides securing the communication, SSH makes the interaction more efficient, because you do not have to type your user name and password each time you authenticate to GitHub.
SSH is not the only approach to authentication in the context of GitHub: About authentication to GitHub; however, the alternatives are not discussed here.
On your local machine, follow the process described in this GitHub doc: Connecting to GitHub with SSH.
You may want to check for existing SSH keys. If you want to create a new key, follow these instructions: Generating a new SSH key and adding it to the ssh-agent.
At this point in your setup, you have a private and a public key. You transfer the private key to the persistent disk and add it to the ssh-agent in the cloud environment. You add the public key to your GitHub account. The instructions follow next.
The following sections assume that your private key file is named key-for-terra.
Edit the commands below before running them if your filename is different.
The next step is to copy your private ssh key from your laptop or desktop to the persistent disk.
-
Create a .ssh directory in your $HOME directory, and set its permissions so that only you can view it. In the terminal window, run:
cd; mkdir -p ~/.ssh; chmod 700 ~/.ssh
-
Upload your private key to your $HOME directory via the Jupyter UI. Open an existing notebook in the tab NOTEBOOKS (preferably in PLAYGROUND MODE). If you do not have one already, create one as at least one is required for accessing the notebooks in the local repository. Click on the Jupyter logo to open the file browser.
Click on the “folder” icon to navigate to /home/jupyter (this is the default directory when you first open this view).
From the file browser, click “Upload” to upload your private key.
Tip: On macOS, in the ‘File Open’ dialog, press cmd-shift-. to display dot files in the dialog window. This is useful to navigate to a .ssh directory on your laptop or desktop machine.
-
In the terminal window, run the following command to move your uploaded key file from $HOME to the .ssh directory, and make it readable only to you. Edit the following first if your key file has a different name:
mv key-for-terra ~/.ssh; chmod 600 ~/.ssh/key-for-terra
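The directory and key-file steps above can also be combined into one small shell function. This is a minimal sketch rather than anything provided by Terra: the helper name `install_private_key` is hypothetical, and the target directory defaults to `~/.ssh` as in the steps above.

```shell
# Hypothetical helper (not part of Terra): move an uploaded private key into
# an SSH directory and restrict permissions on both the directory and the key.
install_private_key() {
  key_src="$1"                      # uploaded key file, e.g. ~/key-for-terra
  ssh_dir="${2:-$HOME/.ssh}"        # target directory, defaults to ~/.ssh
  mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1
  mv "$key_src" "$ssh_dir/" || return 1
  chmod 600 "$ssh_dir/$(basename "$key_src")"
}

# Usage (edit the file name if yours differs):
#   install_private_key ~/key-for-terra
```

Afterwards, `ls -ld ~/.ssh ~/.ssh/key-for-terra` should list the directory with `drwx------` and the key with `-rw-------`.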
At this point the private key is available in the cloud environment. The next step is to add it to the ssh-agent.
This simplifies interactions with GitHub. Add the private ssh key to the ssh-agent as follows.
-
If you have not already done so, start the ssh-agent:
eval $(ssh-agent -s)
This returns a pid, for example, Agent pid 80
-
Add the private key to the ssh-agent. (Edit this command before running it if your key has a different name.)
ssh-add ~/.ssh/key-for-terra
-
The command prompts you for the passphrase for the private key.
-
Enter the passphrase that you used when you created the ssh key:
-
Check that the key was added to the agent by executing:
ssh-add -l
At this point the ssh-agent is set up.
If you restart your cloud environment, you will need to restart your ssh-agent and re-add your key. Open a terminal window and execute the following commands again:
eval $(ssh-agent -s)
ssh-add ~/.ssh/key-for-terra
If you haven’t done so already, add your public ssh key to your GitHub account as described in: Adding a new SSH key to your GitHub account.
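Because the agent has to be restarted after every cloud environment restart, it can be convenient to wrap the two restart commands in a function that starts a new agent only when none is reachable. The following is a sketch under assumptions: the name `ensure_agent_key` is hypothetical, the default key path follows the `key-for-terra` naming used in this document, and the check relies on `ssh-add` exiting with status 2 when it cannot contact an agent.

```shell
# Hypothetical helper (not part of Terra or OpenSSH): make sure an ssh-agent is
# reachable from this shell, then add the private key to it.
ensure_agent_key() {
  key_file="${1:-$HOME/.ssh/key-for-terra}"   # assumed default key name
  agent_status=0
  ssh-add -l >/dev/null 2>&1 || agent_status=$?
  # ssh-add exits with status 2 when it cannot contact an agent.
  if [ "$agent_status" -eq 2 ]; then
    eval "$(ssh-agent -s)" >/dev/null
  fi
  ssh-add "$key_file"                         # prompts for the passphrase
}

# Usage:
#   ensure_agent_key                      # uses ~/.ssh/key-for-terra
#   ensure_agent_key ~/.ssh/another-key   # explicit key file
```

Define the function in the terminal session itself (paste it, or source a file that contains it) instead of running it as a separate script; otherwise the agent variables set by `eval` are lost when the script exits.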
At this point, the setup of the ssh key is completed and what is left is for you to test that the authentication works.
The GitHub doc Testing your SSH connection shows how to test that the ssh connection between the cloud environment and GitHub is working.
Note: make sure that you run the test from the terminal of your cloud environment, not from a terminal on your laptop or desktop, as all GitHub interaction will take place from your cloud environment.
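As a rough illustration of that test, the sketch below runs `ssh -T git@github.com` and inspects the reply. The function name `github_auth_ok` is hypothetical, and the check assumes that GitHub's success message contains the phrase "successfully authenticated".

```shell
# Hypothetical helper: report whether GitHub accepts the key held by the ssh-agent.
github_auth_ok() {
  # ssh -T exits non-zero even when authentication succeeds, because GitHub
  # does not provide shell access, so check the message rather than the status.
  reply=$(ssh -T git@github.com 2>&1) || true
  case "$reply" in
    *"successfully authenticated"*) return 0 ;;
    *) echo "$reply" >&2; return 1 ;;
  esac
}

# Usage (run in the cloud environment terminal):
#   github_auth_ok && echo "SSH connection to GitHub works"
```

If the function prints `Permission denied (publickey)`, the key is most likely missing from the ssh-agent or the public key has not been added to your GitHub account.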
git is already installed in the default cloud environment. If you are using a custom container, you may need to install it first.
-
Set your name and email for git to use. Feel free to change the value of user.name to your real first and last name.
git config --global user.email "$OWNER_EMAIL"
git config --global user.name "$OWNER_EMAIL"
-
Test that the configuration worked. This configuration is stored in $HOME/.gitconfig.
git config --global --list
Once git is available and the configuration established, you can continue with the next set of setups.
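If you prefer your real name over the email address as `user.name`, the two settings can be applied and checked in one step. This is a sketch only: `set_git_identity` is a hypothetical helper name, and the name in the usage line is a placeholder.

```shell
# Hypothetical helper: set the global git identity and print it for confirmation.
set_git_identity() {
  git config --global user.name "$1" &&
  git config --global user.email "$2" &&
  git config --global --get-regexp '^user\.'
}

# Usage (placeholder name; OWNER_EMAIL is the variable used in the step above):
#   set_git_identity "Ada Lovelace" "$OWNER_EMAIL"
```

The last command prints every `user.*` entry from the global configuration, so a typo in either value is visible immediately.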
Note: It is possible to configure git command completion (using the tab key on the keyboard) with the following command (if this is available on your machine):
source /usr/share/bash-completion/completions/git
A discussion of different environments and alternative setups is here: How to configure git bash command line completion?.
When working with notebooks you will find two tools very helpful for your daily work:
- Jupyter diff tool. A diff tool that understands the Jupyter notebook format can display notebook diffs in a more structured way than tools that only diff the textual representation of a notebook.
- Cell output removal tool. A tool that removes cell outputs automatically when you commit changes to GitHub helps ensure that sensitive data is not made available in GitHub.
In the following you will install nbdime (notebook diffing and merging) as the diff tool, and nbstripout as the cell output removal tool.
-
Open a workspace with a running cloud environment. If you do not have a running cloud environment, start it.
-
Open a terminal window.
-
Execute the following command, which installs both tools into the cloud environment:
pip3 install --upgrade nbdime nbstripout
If you receive a suggestion similar to the following, upgrade pip right away and run the previous command again:
WARNING: You are using pip version 21.1.3; however, version 21.2.4 is available. You should consider upgrading via the '/opt/conda/bin/python3.7 -m pip install --upgrade pip' command.
Then, make sure these tools are on your PATH. You may need to run:
export PATH=$PATH:/home/jupyter/.local/bin
-
To check the successful installation, run the following:
nbdime --version
nbstripout --version
-
The pip3 install command might produce significant installation output, which might contain the following error message (depending on your environment and the precise version you install). You can safely ignore it:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.4.2 requires grpcio~=1.32.0, but you have grpcio 1.38.0 which is incompatible.
tensorflow 2.4.2 requires six~=1.15.0, but you have six 1.16.0 which is incompatible.
tensorflow 2.4.2 requires typing-extensions~=3.7.4, but you have typing-extensions 3.10.0.0 which is incompatible.
jupyterlab-git 0.11.0 requires nbdime<2.0.0,>=1.1.0, but you have nbdime 3.1.0 which is incompatible.
-
The following command turns on nbdime for all GitHub repositories:
nbdime config-git --enable --global
You might see a warning similar to the following, which you can ignore:
/opt/conda/lib/python3.7/site-packages/jupyter_server_mathjax/app.py:40: FutureWarning: The alias `_()` will be deprecated. Use `_i18n()` instead.
help=_("""The MathJax.js configuration file that is to be used."""),
-
You can set up nbstripout as a git filter in your global ~/.gitconfig like this:
nbstripout --install --global
If you later want to remove the filter, run:
nbstripout --uninstall --global
At this point the two tools are installed and enabled for all GitHub repositories.
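One way to confirm the result is to check that both executables are on your PATH and that the global git configuration contains entries for them. The sketch below is an assumption-laden check, not an official one: the helper name `check_notebook_tools` is hypothetical, and the key names `diff.jupyternotebook.command` and `filter.nbstripout.clean` are what these tools are assumed to write. Compare them against the output of `git config --global --list` if the check reports a missing key.

```shell
# Hypothetical helper: verify that nbdime and nbstripout are installed and
# registered in the global git configuration.
check_notebook_tools() {
  status=0
  for tool in nbdime nbstripout; do
    command -v "$tool" >/dev/null 2>&1 ||
      { echo "missing on PATH: $tool"; status=1; }
  done
  # Assumed key names; adjust them if your tool versions use different ones.
  for key in diff.jupyternotebook.command filter.nbstripout.clean; do
    git config --global --get "$key" >/dev/null 2>&1 ||
      { echo "not configured: $key"; status=1; }
  done
  return $status
}

# Usage:
#   check_notebook_tools && echo "notebook tools are set up"
```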
Note: when you open a notebook in edit mode you will see a UI control labeled nbdiff.
This control is not related to the nbdime tool that you just installed, and is not currently functional. In the following this is ignored and not further discussed.
For your repository, use a .gitignore file similar to the example gitignore file in this repo, to ensure that CSVs and image files are not accidentally committed to the repository.
If you are using Git from the terminal of your AoU workbench machine, a global gitignore file is
preinstalled and enabled. You can view file /home/jupyter/gitignore_global to confirm this.
On Terra, you will need to create and configure the gitignore file. You can do that as follows:
-
Upload the gitignore example file to your workspace, or open an editor in the Cloud Environment Terminal and create a new file with its contents.
-
Rename the file to ~/.gitignore_global in your home directory, e.g.:
mv gitignore ~/.gitignore_global
Tip: to list "dot files" like .gitignore_global, use the -a flag: ls -a.
-
Then, configure Git to use this file for all repositories:
git config --global core.excludesfile ~/.gitignore_global
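If you create the file in the terminal instead of uploading it, a here-document is a compact way to do so. The patterns below are illustrative assumptions covering CSV and common image files; they are not the contents of the example gitignore file, and `write_ignore_file` is a hypothetical helper name.

```shell
# Hypothetical helper: write a small global ignore file.
# The patterns are examples only; extend them to match your own data files.
write_ignore_file() {
  target="${1:-$HOME/.gitignore_global}"
  cat > "$target" <<'EOF'
# Tabular data exports
*.csv
# Images
*.png
*.jpg
*.jpeg
EOF
}

# Usage:
#   write_ignore_file
#   git config --global core.excludesfile ~/.gitignore_global
#   git check-ignore -v results.csv   # inside a repository: prints the matching rule
```

Note that the function overwrites an existing file at the target path, so skip it if you already uploaded the example file.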
After you’ve finished setup and configuration, try cloning a GitHub repo.
The next set of instructions will guide you through one example of how to use Git and GitHub. With the following commands you:
- Clone an existing repository containing several notebooks
- Open one notebook and modify it
- Use nbdime to observe a diff after modifying a notebook
- Observe that cell outputs are stripped out by nbstripout
This example gives you a first impression of the process as well as the tools involved. We’ll use this repository for the example: https://github.com/verilylifesciences/site-selection-tool/.
The first step is to clone the repository. This section assumes that you have set up the environment as described in the previous sections of this document.
-
Open a workspace with a running cloud environment. If you do not have a running cloud environment, start it.
-
Open a terminal window
-
Execute the following command. You can run the command from your $HOME directory, or if you like, you can create a subdirectory under $HOME for GitHub repos and change to that directory first.
git clone git@github.com:verilylifesciences/site-selection-tool.git
-
If you see the following error, you need to set up the ssh-agent again.
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights and the repository exists.
The commands for setting up the ssh-agent are:
eval $(ssh-agent -s)
ssh-add ~/.ssh/key-for-terra
You will have to enter your passphrase at the prompt of the last command.
-
Observe via ls that a new directory called site-selection-tool is created. Navigate into the directory:
cd site-selection-tool
ls -al
Observe a directory called .git: this indicates that this is the local version of a remote repository.
-
Execute the following Git command:
git remote -v
This shows you the location of the remote repository in GitHub:
Now you have cloned a remote repository into your cloud environment. The directory is the local repository and is linked to the remote repository. You may want to explore the directory more, especially the notebooks subdirectory.
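To follow the practice of keeping one subdirectory per repository, the clone step can be wrapped so that each repository lands in a predictable place and is not cloned twice. This is a sketch under assumptions: the helper name `clone_once` and the default parent directory `~/repos` are hypothetical choices, not Terra conventions.

```shell
# Hypothetical helper: clone a repository into <base>/<repo-name> unless a
# clone already exists there.
clone_once() {
  url="$1"                               # e.g. git@github.com:org/repo.git
  base="${2:-$HOME/repos}"               # assumed parent directory for clones
  dir="$base/$(basename "$url" .git)"    # repo name without the .git suffix
  if [ -d "$dir/.git" ]; then
    echo "already cloned: $dir"
  else
    mkdir -p "$base" && git clone "$url" "$dir"
  fi
}

# Usage:
#   clone_once git@github.com:verilylifesciences/site-selection-tool.git
```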
A public Terra workspace that contains all site-selection-tool notebooks is here: https://app.terra.bio/#workspaces/verily-metis/Site-selection-tool-for-vaccine-trial-planning.
The notebooks in the local repository are not created from the NOTEBOOKS tab, and so the Terra workspace does not know about them: they are not listed. In order to access a notebook in the local repository, follow these steps:
-
Open an existing notebook in the tab NOTEBOOKS (preferably in PLAYGROUND MODE). If you do not have one already, create one as at least one is required for accessing the notebooks in the local repository.
-
Click on the Jupyter logo to open the file browser.
-
Navigate to your clone of the repository and click in to the ‘notebooks’ subdirectory:
-
Pick a notebook and open it by clicking on its file name.
-
Add a new cell, enter an addition into that cell, and run it. For example:
At this point, you have modified a notebook in the local repository. This change is not visible in the remote repository. Any change you make in the local repository remains local until you explicitly propagate that change. (This will be further explored in Best Practices for source control on Terra, Part III: Source control for notebooks.)
Since you made a change, you might want to know what that change was. The following commands first let you know which parts of the local repository changed.
-
Navigate to the local repository:
cd site-selection-tool
-
Execute the git command:
git status
This command reports which files in the local repository have changed.
The tool nbdiff is used to show the difference between the current state and the previous version of a notebook:
-
Execute:
nbdiff notebooks/<the-notebook-you-changed>
This command shows all changes that you made. It shows your additional cell, as well as the removal of all outputs (via nbstripout). It also shows differences of the cell execution counter if those changed as well.
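When several notebooks have changed, the two steps above (find the changed files, then run nbdiff) can be combined in a small loop. This is a sketch, assuming it runs from the top-level directory of the local repository, because `git diff --name-only` prints paths relative to that directory; `diff_changed_notebooks` is a hypothetical helper name.

```shell
# Hypothetical helper: run nbdiff on every tracked notebook that git reports
# as modified. Run it from the top-level directory of the local repository.
diff_changed_notebooks() {
  git diff --name-only -- '*.ipynb' | while IFS= read -r nb; do
    echo "== $nb"
    # Redirect stdin so nbdiff cannot consume the remaining list of file names.
    nbdiff "$nb" </dev/null
  done
}

# Usage:
#   cd site-selection-tool
#   diff_changed_notebooks
```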
We recommend using nbdiff when reviewing notebook differences, as it understands the Jupyter format. The Git command git diff shows the difference in the underlying file format, not in the Jupyter notebook structure. Execute the same diff using git diff and you see the difference:
If you like, you can remove your clone of the GitHub repository by deleting its directory; this will not affect the remote repository. (If you want to keep this local repository, that is fine as well, since it contains several tutorial notebooks that might be good for you to explore.)
In the terminal, change to the parent directory of your repository clone. You should see site-selection-tool listed when you run ls.
The next command removes the site-selection-tool directory and all its contents. Be careful to not accidentally delete directories that you do not mean to delete. The safe approach would be to navigate into the directory and delete from one level down.
rm -fr site-selection-tool
All of Us Workbench. This section is relevant to you since your Cloud Environment and its persistent disk are periodically deleted by the system.
If you deleted the cloud environment as well as your persistent disk, then you have to recreate both. There are two different situations:
- No persistent disk backup. If you do not have a backup of your persistent disk, then you have to start from the very beginning of the setup: Preparing your work environment. If you already have code in a remote repository, you can restore from it: Restoring local repository from GitHub.
- Backup for persistent disk exists. In this case you can restore the persistent disk, and after you restore it, you only have to add the configuration to the boot disk: Recreation after cloud environment deletion with existing persistent disk.
Note that when you restore the persistent disk from a backup, the backup does not contain the changes you made after it was created. These might include additional Python package installations, the latest synchronization with GitHub repositories, or any other change. The best practice is to check each item (files, packages, repositories) individually and confirm that it is up to date.
This is a good opportunity to delete the cloud environment, including the persistent disk, and to recreate it from the very beginning. Doing so lets you practice the process and gives you confidence that a lost environment can be recreated.
Restoring a local repository after your persistent disk was deleted is similar to joining a collaboration with others in the sense that you do not start with an empty repository but join a repository that was created before (even if only by yourself).
This means that restoring a local repository is equivalent to Journey 4: Joining an ongoing collaboration with others.
At this point, you are ready to dive into various user journeys and use cases. See: Best practices for source control on Terra, Part III: Source control for notebooks.
If you’re not familiar with git, the GitHub documentation and “cheat sheets” like this one or this one may be useful.










