Installation

The build tools are written in Python 3. They will not work with Python 2.

The only supported way to install the tools is via Docker.

Note: If you don't already have Docker installed, see Installing Docker.

Installing official build tools releases

To install or update the build tools, just run:

$ curl -L https://git.io/fhaLg | bash

NOTE: If you have the latest set of shell aliases installed and active, you can just type:

$ update-tools  # or, update_tools

That command is equivalent to:

$ curl https://raw.githubusercontent.com/databricks-edu/build-tooling/master/docker/install.sh | bash

This command:

  • Pulls down the prebuilt Docker image (databrickseducation/build-tool:latest) from Docker Hub.
  • Updates your local Docker image, if necessary.
  • Pulls down the build tool aliases and installs them in $HOME/.build-tools-aliases.sh.
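
When the script finishes, you can verify that both pieces are in place:

$ docker images databrickseducation/build-tool   # should list the latest tag
$ ls ~/.build-tools-aliases.sh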

All you have to do is ensure that Docker is installed (see the note above) and that you have this command in your .bashrc or .zshrc:

. ~/.build-tools-aliases.sh

That aliases file defines command aliases for bdc, gendbc, master_parse, databricks, and course; those aliases invoke the corresponding commands with the Docker image.
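
To give a sense of how those aliases work, here's a minimal sketch of what one entry might look like. This is an illustration only; the actual definitions in ~/.build-tools-aliases.sh may differ:

# Hypothetical sketch -- not the actual installed alias.
# Run bdc inside the build-tool image, mounting the current directory
# so the containerized tool can read and write your files.
alias bdc='docker run -it --rm -v "$PWD:$PWD" -w "$PWD" databrickseducation/build-tool:${BUILD_TOOL_DOCKER_TAG:-latest} bdc'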

Installing snapshot releases

From time to time, we push preliminary versions of the build tools to the snapshot branch. You can install and use a snapshot version by following this procedure:

  • Run update-tools snapshot (or curl -L https://git.io/fhaLg | bash -s snapshot)

  • Switch to the snapshot version with dbe snapshot. (See below.)

  • Ensure that you're using the latest version of the build tool aliases; they respect the BUILD_TOOL_DOCKER_TAG environment variable that dbe sets. (See below.)

This procedure installs the snapshot release into a separate Docker image (databrickseducation/build-tool:snapshot). It will not conflict with the installation of the release version of the build tools; that version is always installed in databrickseducation/build-tool:latest.
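
Because the two versions live under separate image tags, they can coexist side by side; docker images will show both (output abbreviated here):

$ docker images databrickseducation/build-tool
REPOSITORY                       TAG        IMAGE ID   CREATED   SIZE
databrickseducation/build-tool   snapshot   ...        ...       ...
databrickseducation/build-tool   latest     ...        ...       ...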

To switch back to using the release version, simply run

$ dbe latest

Switching between snapshot and release versions

The .build-tools-aliases.sh file defines a command called dbe for switching between the snapshot and release versions.

$ dbe latest   # use release version
$ dbe snapshot # use snapshot version
$ dbe          # display what version you're using

This command is equivalent to setting the BUILD_TOOL_DOCKER_TAG environment variable to snapshot or latest.
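
Since that's all dbe does, you can also set the variable directly:

$ export BUILD_TOOL_DOCKER_TAG=snapshot  # same effect as: dbe snapshot
$ export BUILD_TOOL_DOCKER_TAG=latest    # same effect as: dbe latest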

Configuring databricks

You'll also want to configure the databricks command, if you haven't already done so. You don't have to install it. There's a version already installed in the Docker image, and the shell aliases define a databricks alias that invokes the Docker version.

But you do have to configure it, so that you can use course or bdc to upload and download your notebooks. You'll need a configuration section for each Databricks workspace you'll be using for notebook development.

For each such workspace, you'll have to set up authentication. The databricks command supports username and password authentication, as well as API token authentication.

However, the build tools only support API token authentication.
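
If you'd rather not edit ~/.databrickscfg by hand, the databricks CLI can write a profile for you; it prompts for the host and the API token. (The profile name azure below is just an example.)

$ databricks configure --token --profile azure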

Here's a sample ~/.databrickscfg file:

[DEFAULT]
host = https://trainers.cloud.databricks.com/
token = dapi9b1bd21f3cb79f26c5103d28d667967e
[azure]
host = https://eastus2.azuredatabricks.net
token = dapi24426c433e561579a55c7a80f0f1c9c1

Note that DEFAULT is special.

  1. If you don't specify a profile when using the tools (or the databricks command), they all assume DEFAULT.
  2. In addition, if you do specify a profile (e.g., azure), any fields missing from that section of .databrickscfg are filled in from DEFAULT.

For instance, consider this example:

[DEFAULT]
host = https://trainers.cloud.databricks.com/
token = dapi9b1bd21f3cb79f26c5103d28d667967e
[azure]
host = https://eastus2.azuredatabricks.net

If you tried to invoke, say, databricks workspace ls --profile azure /, the databricks command would use the host value from the [azure] section and the token value from the DEFAULT section (because token is missing from the [azure] section). This is probably not what you want.

Additional configuration for the build tools

You'll also want to set your Databricks home directory. Both bdc and course need to know your home directory in Databricks, for various operations. You can set this value in several ways.

  • With course, you can set it in the course configuration, by setting DB_SHARD_HOME. When using course, the course configuration overrides all other ways of setting your home.

  • You can set the DB_SHARD_HOME environment variable. This value takes precedence over the other methods below. For example:

# My home directory on all Databricks instances is /Users/[email protected]
export DB_SHARD_HOME=/Users/[email protected]
  • Set home in your .databrickscfg file. Note that only the build tools honor this value; the databricks command ignores it. Here are a couple of examples:

# My home directory is different on the default workspace than on Azure.
[DEFAULT]
host = https://trainers.cloud.databricks.com/
token = dapi9b1bd21f3cb79f26c5103d28d667967e
home = /Users/[email protected]
[azure]
host = https://eastus2.azuredatabricks.net
token = dapi24426c433e561579a55c7a80f0f1c9c1
home = /Users/[email protected]

If you have the same directory on all Databricks workspaces, you can just set it in DEFAULT:

[DEFAULT]
host = https://trainers.cloud.databricks.com/
token = dapi9b1bd21f3cb79f26c5103d28d667967e
home = /Users/[email protected]
[azure1]
host = https://eastus2.azuredatabricks.net
token = dapi24426c433e561579a55c7a80f0f1c9c1
[azure2]
host = https://westus2.azuredatabricks.net
token = dapi704d362303f3235cfcc505d6655eea6
  • Set username in your .databrickscfg. If the tools can't find a home value via any of the methods above, they'll look for a username value in the profile configuration and calculate your home directory from it.

[DEFAULT]
host = https://trainers.cloud.databricks.com/
token = dapi9b1bd21f3cb79f26c5103d28d667967e
username = [email protected]

In this case, assuming DB_SHARD_HOME isn't set in either the course configuration or the environment, the build tools will assume your home directory is /Users/[email protected].
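
When in doubt, you can inspect the first two of those settings yourself (this is just a shell-level check, not a build-tool command):

$ echo "${DB_SHARD_HOME:-(not set)}"           # the environment variable
$ grep -E '^(home|username)' ~/.databrickscfg  # the .databrickscfg fallbacks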

Cleaning up "dangling" Docker images

Over time, as you update your Docker image, you may accumulate a number of dangling (stale) Docker images. If you run docker images, you may see entries labeled <none>; these dangling images serve no purpose and can consume significant disk space.

Consider running the following command periodically to clean things up:

$ docker rmi $(docker images -f "dangling=true" -q)
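
Alternatively, Docker has a built-in subcommand that removes dangling images (it asks for confirmation first):

$ docker image prune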