4. Building your course

You can build your course using bdc directly, or you can use the higher-level course workflow tool.

Should I use course or bdc?

If you're new to building Databricks curriculum, start by using course. It's a higher-level wrapper around the lower-level bdc. You'll probably find it easier.

Once you're comfortable with building Databricks curriculum, you're free to drop down to bdc, if you want. Or, you can keep using course.

Some people prefer course. Some prefer bdc. Some prefer a mixture of both. Ultimately, the choice is yours.

Building with course

To build with course, you first have to select a course to work on. For example:

$ course workon Delta

Building locally

Then, you can build it locally with

$ course build-local

The build output will be written to $HOME/tmp/curriculum/<course-id>. For instance, suppose you have this course_info block in build.yaml:

course_info:
  name: Dummy
  title: "Dummy course"
  version: 1.1.0
  type: self-paced

Running course build-local will write the output to $HOME/tmp/curriculum/Dummy-1.1.0.
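
To confirm where the artifacts landed, just list that directory. The exact contents vary by course; the listing below is illustrative, borrowing the artifact names from the minimal example at the end of this page:

$ ls ~/tmp/curriculum/Dummy-1.1.0
CHANGELOG.html  CHANGELOG.md  CHANGELOG.pdf  StudentFiles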

Building and uploading

To build the current course and upload the output to your default Databricks workspace, you can use:

$ course build

course will build the artifacts locally (as with build-local) and then upload them to your default Databricks workspace.

Using a different build file

What if your build file isn't called build.yaml? course has an option for that.

You can either:

  • Modify your configuration so that COURSE_YAML specifies the build file name. (A sketch of this approach appears after the examples below.)

  • (Preferred) Specify the build file name on the fly, using -f or --build-file. For instance:

$ course workon Delta -f build-ilt.yaml build-local

You can even do this:

$ course workon Delta -f build-ilt.yaml build-local -f build-sp.yaml build-local

That second command:

  • selects the "Delta" course
  • builds it using its build-ilt.yaml build file, then
  • builds it again, using its build-sp.yaml build file.
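
If you'd rather use the configuration route from the first bullet above, a sketch might look like this. (It assumes COURSE_YAML can be set with course set, described below, just like any other configuration value; the build file name is illustrative.)

$ course workon Delta
$ course set COURSE_YAML=build-ilt.yaml
$ course build-local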

Uploading and downloading your source notebooks

You can upload your current course's source notebooks to your Databricks workspace using course upload. Make sure the following values are set in your configuration or environment. (Run course showconfig to see your configuration, which will pick up environment variables for values not explicitly set in the configuration.)

  • DB_PROFILE defines which workspace (or shard) to use. It refers to a section in ~/.databrickscfg, which is used by the Databricks CLI. It defaults to DEFAULT.

  • DB_SHARD_HOME defines your home directory on the Databricks workspace.

  • SOURCE defines the subfolder in DB_SHARD_HOME where you want your source notebooks to go.

(Hint: You can update your configuration on the fly with course set. For instance, course set SOURCE=:MySources.)
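
For example, a plausible setup (the user name is illustrative, and this assumes DB_PROFILE and DB_SHARD_HOME can be set with course set just like SOURCE) might be:

$ course set DB_PROFILE=DEFAULT
$ course set DB_SHARD_HOME=/Users/[email protected]
$ course set SOURCE=:Sources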

Note: You must have a properly configured .databrickscfg for uploading and downloading to work. Specifically, for each profile you intend to use, you'll need a username and password or, preferably, an API token. course and bdc delegate to the databricks command for uploading and downloading.
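
For reference, a ~/.databrickscfg along these lines works for both tools. (The host names and tokens below are placeholders, and the trainers profile is only an example of a named profile you could select via DB_PROFILE.)

[DEFAULT]
host = https://your-workspace.cloud.databricks.com
token = dapi00000000000000000000000000000000

[trainers]
host = https://trainers.cloud.databricks.com
token = dapi00000000000000000000000000000000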

Uploading

Once your configuration is correct, you can upload the source notebooks with one simple command:

$ course upload

course will find your build.yaml file and ask bdc to do the upload, using the list of notebooks in the build file.

If your course uses a different name for its build file, just add -f:

$ course -f build-sp.yaml upload

For example, let's assume:

  • SOURCE is the default value of :Sources
  • DB_SHARD_HOME is /Users/[email protected]
  • DB_PROFILE is set to the default, which happens to map to trainers.cloud.databricks.com.

Then

$ course workon ETL-Part-3 upload

will upload the source notebooks for the ETL-Part-3 self-paced course to /Users/[email protected]/:Sources/ETL-Part-3 on the trainers workspace.

NOTE: upload will bail if the target folder already exists. If that happens, just run course clean-source.
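
For example, if a previous upload left the target folder in place:

$ course clean-source
$ course upload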

Downloading

After you've edited your notebooks, you'll want to download them to build the course and check them into Git. Once again, assuming everything is configured properly, it's one command:

$ course download

Again, use -f if your build file isn't build.yaml.

Sample course workflow

Select the course:

$ course workon My-Course

Upload the source notebooks to your Databricks workspace:

$ course upload

Log into the Databricks workspace and work on the notebooks. When you're ready to build, download them:

$ course download

Then, build and upload the built artifacts:

$ course build

Repeat until done. Then, do a final download, and check everything into Git.
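
That wrap-up step might look something like this (the commit message is illustrative):

$ course download
$ git add -A
$ git commit -m "Final edits to My-Course notebooks"
$ git push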

Using course to tag the Git repo

You can use course tag to tag the Git repo with a tag constructed from the current course's name and version. For example:

$ course workon ETL-Part-1 tag
Repo /Users/me/repos/training: Created tag ETL-Part-1-1.2.9 on branch
master, pointing to commit 8f75b380852feef3b1fa5810f0e255362c699d50.

The tag is applied to the top-most commit on whatever branch is currently selected in the repo. The tag is not automatically pushed to the remote Git repo(s). To do that, use git push --tags.

This command will abort if the tag already exists.

You cannot delete an existing tag via course. Use git tag -d. For example:

$ git tag -d ETL-Part-1-1.2.9

The tag command just delegates to bdc --tag. You can use that command directly, if you prefer.

More details

For more information on course, run the following command:

$ course help

Building with bdc

To build a course with bdc, just run:

$ bdc -o /path/to/course/build.yaml

If you omit the path to the build file, bdc assumes build.yaml in the current directory.
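
In other words, if you're already in the course's directory, this is equivalent:

$ cd /path/to/course
$ bdc -o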

By default, bdc will build the course and write the output to $HOME/tmp/curriculum/<course-id>. For instance, suppose you have this course_info block in build.yaml:

course_info:
  name: Dummy
  title: "Dummy course"
  version: 1.1.0
  type: self-paced

Running bdc -o on that build.yaml will write the output to $HOME/tmp/curriculum/Dummy-1.1.0.

Uploading the results

bdc has no feature for uploading the built artifacts. However, you can do it yourself, by invoking the databricks command directly. For example:

$ databricks workspace import --format DBC --profile DEFAULT --language Python \
  ~/tmp/curriculum/My-Course-1.2.0/azure/Lessons.dbc \
  /Users/[email protected]/:Builds/My-Course-1.2.0

The --language option is necessary, but it doesn't really do anything. See databricks workspace --help for more information.
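
One way to verify the import is to list the target folder with the databricks CLI (the path is the destination from the example above):

$ databricks workspace ls /Users/[email protected]/:Builds/My-Course-1.2.0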

Uploading and downloading your source notebooks

You can upload and download directly with bdc. You just have to specify a few more things.

Uploading

An example is the easiest way to start:

$ bdc --upload /Users/[email protected]/:Sources/My-Course build.yaml

That command uploads all the notebooks listed in build.yaml to /Users/[email protected]/:Sources/My-Course on whatever workspace is defined as the default in ~/.databrickscfg.

If DB_SHARD_HOME is set, you can use a relative path. Let's assume DB_SHARD_HOME is set to /Users/[email protected]. In that case, this command is functionally identical (but easier to type):

$ bdc --upload :Sources/My-Course build.yaml

If you want to use a different .databrickscfg profile, just specify --dprofile PROFILE (or -P PROFILE).

NOTE: --upload will bail if the target folder already exists. If that happens, you can delete the target directory with the databricks command. e.g.:

$ databricks workspace rm -r /Users/[email protected]/:Sources/My-Course

WARNING: The databricks command does not look at DB_SHARD_HOME.

Downloading

Downloading is similar:

$ bdc --download :Sources/My-Course build.yaml

bdc will download all the notebooks under $DB_SHARD_HOME/:Sources/My-Course and figure out where to put them by consulting build.yaml. It will warn you about any notebooks it downloads that don't exist in build.yaml.

Sample bdc workflow

Upload the source notebooks to your Databricks workspace:

$ bdc --upload :Sources/My-Course /path/to/your/build.yaml

Log into the Databricks workspace and work on the notebooks. When you're ready to build, download them:

$ bdc --download :Sources/My-Course /path/to/your/build.yaml

Build the course:

$ bdc -o /path/to/your/build.yaml

Upload the artifacts. You can manually import the built DBCs using the Databricks UI, or you can import them via the databricks command. Here's an example of the latter:

$ databricks workspace import --format DBC --language R \
  ~/tmp/curriculum/My-Course-1.2.0/amazon/Lessons.dbc \
  /Users/[email protected]/:Builds/My-Course-1.2.0

Repeat until done. Then, do a final download, and check everything into Git.

For a complete overview of bdc usage, see the Detailed bdc Usage page.

Using bdc to tag the Git repo

You can use bdc --tag to tag the Git repo with a tag constructed from a course's name and version. For example:

$ bdc --tag /Users/me/repos/training/courses/Self-Paced/ETL-Part-1/build.yaml
Repo /Users/me/repos/training: Created tag ETL-Part-1-1.2.9 on branch
master, pointing to commit 8f75b380852feef3b1fa5810f0e255362c699d50.

The tag is applied to the top-most commit on whatever branch is currently selected in the repo. The tag is not automatically pushed to the remote Git repo(s). To do that, use git push --tags.

This command will abort if the tag already exists.

You cannot delete an existing tag via bdc. Use git tag -d. For example:

$ git tag -d ETL-Part-1-1.2.9

A minimal complete example, using course

Note that this minimal example assumes that your training repo is located at ~/repos/training.

  1. You must set DB_SHARD_HOME in ~/.bash_profile, ~/.bashrc, or (if you're using the Z shell) ~/.zshrc. For example:

    export DB_SHARD_HOME="/Users/[email protected]"
    
  2. You must configure at least [DEFAULT] in ~/.databrickscfg. The databricks command supports username and password authentication, as well as API token authentication.

    However, the build tools only support API token authentication.

    [DEFAULT]
    host = https://dbc-728c1937-2bf0.cloud.databricks.com/
    token = dapi9b1bd21f3cb79f26c5103d28d667967e
    
  3. Create the directory ~/repos/training/courses/minimal.

  4. Create the file ~/repos/training/courses/minimal/build.yaml, and fill it with the contents of the minimal build.yaml shown at the bottom of this page.

  5. To work on the "minimal" example:

    $ course workon minimal
    
  6. To build the "minimal" example:

    $ course build
    

    After this completes, you should have the directory ~/tmp/curriculum/minimal-example-1.0.0 containing

    • CHANGELOG.html
    • CHANGELOG.md
    • CHANGELOG.pdf
    • StudentFiles/Labs.dbc

    You should also have a folder Target/minimal in your Databricks workspace, containing the built course.

  7. To upload the "minimal" example:

    $ course upload
    

    After this completes, you should have a folder _Source/minimal in your Databricks workspace, containing the notebook Intro-To-DataFrames-Part-1.

Minimal build.yaml file

course_info:
 name: minimal-example
 version: 1.0.0
 type: ILT

bdc_min_version: "1.24"
master_parse_min_version: "1.18"
top_dbc_folder_name: $course_id
src_base: ../../modules

notebook_defaults:
 dest: $target_lang/$notebook_type/$basename.$target_extension
 master:
   enabled: true
   scala: false
   python: true
   answers: false
   instructor: false
   enable_templates: false

misc_files:
 - src: CHANGELOG.md
   dest: ""

notebooks:
 - src: DB-105/Intro-To-DataFrames-Part-1.scala
   dest: $target_lang/$notebook_type/$basename.$target_extension
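
If you'd rather bypass course, this same file can also be built directly with bdc (the path assumes the repo location from step 3):

$ bdc -o ~/repos/training/courses/minimal/build.yaml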