4. Building your course
You can build your course using `bdc` directly, or you can use the higher-level `course` workflow tool.
- Should I use `course` or `bdc`?
- Building with `course`
- Building with `bdc`
- A minimal complete example, using `course`
Should I use `course` or `bdc`?

If you're new to building Databricks curriculum, start by using `course`. It's a higher-level wrapper around the lower-level `bdc`. You'll probably find it easier.

Once you're comfortable with building Databricks curriculum, you're free to drop down to `bdc`, if you want. Or, you can keep using `course`.

Some people prefer `course`. Some prefer `bdc`. Some prefer a mixture of both. Ultimately, the choice is yours.
Building with `course`

To build with `course`, you first have to select a course to work on, e.g.:
$ course workon Delta
Then, you can build it locally with:
$ course build-local
The build output will be written to `$HOME/tmp/curriculum/<course-id>`.
For instance, suppose you have this `course_info` block in `build.yaml`:
course_info:
  name: Dummy
  title: "Dummy course"
  version: 1.1.0
  type: self-paced
Running `course build-local` will write the output to `$HOME/tmp/curriculum/Dummy-1.1.0`.
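For example, a complete local build of that course might look like this (a sketch; it assumes the course is registered under the illustrative name Dummy from the block above):

$ course workon Dummy
$ course build-local
$ ls ~/tmp/curriculum/Dummy-1.1.0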
To build the current course and upload the output to your default Databricks workspace, you can use:
$ course build
`course` will build the local artifacts (as with `build-local`), and it will upload those artifacts to your default Databricks workspace.
What if your build file isn't called `build.yaml`? `course` has an option for that.
You can either:

- Modify your configuration so that `COURSE_YAML` specifies the build file name.
- (Preferred) Specify the build file name on the fly, using `-f` or `--build-file`. For instance:
$ course workon Delta -f build-ilt.yaml build-local
You can even do this:
$ course workon Delta -f build-ilt.yaml build-local -f build-sp.yaml build-local
That second command:

- selects the "Delta" course,
- builds it using its `build-ilt.yaml` build file, then
- builds it again, using its `build-sp.yaml` build file.
You can upload your current course's source notebooks to your Databricks workspace using `course upload`. Make sure the following values are set in your configuration or environment. (Run `course showconfig` to see your configuration, which will pick up environment variables for values not explicitly set in the configuration.)
- `DB_PROFILE` defines which workspace (or shard) to use. It refers to a section in `~/.databrickscfg`, which is used by the Databricks CLI. It defaults to `default`.
- `DB_SHARD_HOME` defines your home directory on the Databricks workspace.
- `SOURCE` defines the subfolder in `DB_SHARD_HOME` where you want your source notebooks to go.
(Hint: You can update your configuration on the fly with `course set`. For instance: `course set SOURCE=:MySources`.)
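For example, a minimal sketch that sets all three values and then verifies the result (the path and values here are illustrative):

$ course set DB_PROFILE=DEFAULT
$ course set DB_SHARD_HOME=/Users/[email protected]
$ course set SOURCE=:Sources
$ course showconfig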
Note: You must have a properly configured `.databrickscfg` for uploading and downloading to work. Specifically, for each profile you intend to use, you'll need a username and password or, preferably, an API token. `course` and `bdc` delegate to the `databricks` command for uploading and downloading.
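For instance, a token-based profile section in `~/.databrickscfg` might look like this (the profile name, host, and token here are placeholders):

[trainers]
host = https://trainers.cloud.databricks.com
token = dapi0000000000000000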
Once your configuration is correct, you can upload the source notebooks with one simple command:
$ course upload
`course` will find your `build.yaml` file and ask `bdc` to do the upload, using the list of notebooks in the build file.
If your course uses a different name for its build file, just add `-f`:
$ course -f build-sp.yaml upload
For example, let's assume:
- `SOURCE` is the default value of `:Sources`
- `DB_SHARD_HOME` is `/Users/[email protected]`
- `DB_PROFILE` is set to the default, which happens to map to `trainers.cloud.databricks.com`.
$ course workon ETL-Part-3 upload
will upload the source notebooks for the ETL-Part-3 self-paced course to `/Users/[email protected]/:Sources/ETL-Part-3` on the trainers workspace.
NOTE: `upload` will bail if the target folder already exists. If that happens, just run `course clean-source`.
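For example, a typical recovery from that failure looks like this:

$ course clean-source
$ course upload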
After you've edited your notebooks, you'll want to download them to build the course and check them into Git. Once again, assuming everything is configured properly, it's one command:
$ course download
Again, use `-f` if your build file isn't `build.yaml`.
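For example, a hypothetical download using an alternate build file:

$ course -f build-sp.yaml download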
Putting it all together, the typical workflow looks like this. First, select the course:
$ course workon My-Course
Upload the source notebooks to your Databricks workspace:
$ course upload
Log into the Databricks workspace and work on the notebooks. When you're ready to build, download them:
$ course download
Then, build and upload the built artifacts:
$ course build
Repeat until done. Then, do a final download, and check everything into Git.
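The final wrap-up might look like this (a sketch; the commit message is illustrative):

$ course download
$ git add -A
$ git commit -m "Final notebook updates for My-Course"
$ git push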
You can use `course tag` to tag the Git repo with a tag constructed from the current course's name and version. For example:
$ course workon ETL-Part-1 tag
Repo /Users/me/repos/training: Created tag ETL-Part-1-1.2.9 on branch
master, pointing to commit 8f75b380852feef3b1fa5810f0e255362c699d50.
The tag is applied to the top-most commit on whatever branch is currently
selected in the repo. The tag is not automatically pushed to the remote
Git repo(s). To do that, use `git push --tags`.
This command will abort if the tag already exists.
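When you're ready to publish the tag, assuming your remote is named origin:

$ git push origin --tags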
You cannot delete an existing tag via `course`. Use `git tag -d`. For example:
$ git tag -d ETL-Part-1-1.2.9
The `tag` command just delegates to `bdc --tag`. You can use that command directly, if you prefer.
For more information on `course`, run the following command:
$ course help
Building with `bdc`

To build a course with `bdc`, just run:
$ bdc -o /path/to/course/build.yaml
If you omit the path to the build file, `bdc` assumes `build.yaml` in the current directory.
By default, `bdc` will build the course and write the output to `$HOME/tmp/curriculum/<course-id>`. For instance, suppose you have this `course_info` block in `build.yaml`:
course_info:
  name: Dummy
  title: "Dummy course"
  version: 1.1.0
  type: self-paced
Running `bdc -o` on that `build.yaml` will write the output to `$HOME/tmp/curriculum/Dummy-1.1.0`.
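For example, running from the directory containing that build file (the output directory name comes from the `course_info` block above):

$ bdc -o build.yaml
$ ls ~/tmp/curriculum/Dummy-1.1.0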
`bdc` has no feature for uploading the built artifacts. However, you can do it yourself, by invoking the `databricks` command directly. For example:
$ databricks workspace import --format DBC --profile DEFAULT --language Python \
~/tmp/curriculum/My-Course-1.2.0/azure/Lessons.dbc \
/Users/[email protected]/:Builds/My-Course-1.2.0
The `--language` option is necessary, but it doesn't really do anything. See `databricks workspace --help` for more information.
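To confirm the import, you can list the target folder with the `databricks` CLI (the path is the one from the example above):

$ databricks workspace ls /Users/[email protected]/:Builds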
You can upload and download directly with `bdc`. You just have to specify a few more things.
An example is the easiest way to start:
$ bdc --upload /Users/[email protected]/:Sources/My-Course build.yaml
That command uploads all the notebooks listed in `build.yaml` to `/Users/[email protected]/:Sources/My-Course` on whatever workspace is defined as the default in `~/.databrickscfg`.
If `DB_SHARD_HOME` is set, you can use a relative path. Let's assume `DB_SHARD_HOME` is set to `/Users/[email protected]`. In that case, this command is functionally identical (but easier to type):
$ bdc --upload :Sources/My-Course build.yaml
If you want to use a different `.databrickscfg` profile, just specify `--dprofile PROFILE` (or `-P PROFILE`).
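For example, a sketch that uploads via a hypothetical trainers profile:

$ bdc -P trainers --upload :Sources/My-Course build.yaml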
NOTE: `--upload` will bail if the target folder already exists. If that happens, you can delete the target directory with the `databricks` command, e.g.:
$ databricks workspace rm -r /Users/[email protected]/:Sources/My-Course
WARNING: The `databricks` command does not look at `DB_SHARD_HOME`, so you must specify the full workspace path.
Downloading is similar:
$ bdc --download :Sources/My-Course build.yaml
`bdc` will download all the notebooks under `$DB_SHARD_HOME/:Sources/My-Course` and figure out where to put them by consulting `build.yaml`. It will warn you about any notebooks it downloads that don't exist in `build.yaml`.
The suggested workflow with `bdc` mirrors the one for `course`. First, upload the source notebooks to your Databricks workspace:
$ bdc --upload :Sources/My-Course /path/to/your/build.yaml
Log into the Databricks workspace and work on the notebooks. When you're ready to build, download them:
$ bdc --download :Sources/My-Course /path/to/your/build.yaml
Build the course:
$ bdc -o /path/to/your/build.yaml
Upload the artifacts. You can manually import the built DBCs using the Databricks UI, or you can import them via the `databricks` command. Here's an example of the latter:
$ databricks workspace import --format DBC --language R \
~/tmp/curriculum/My-Course-1.2.0/amazon/Lessons.dbc \
/Users/[email protected]/:Builds/My-Course-1.2.0
Repeat until done. Then, do a final download, and check everything into Git.
For a complete overview of `bdc` usage, see the Detailed bdc Usage page.
You can use `bdc --tag` to tag the Git repo with a tag constructed from a course's name and version. For example:
$ bdc --tag /Users/me/repos/training/courses/Self-Paced/ETL-Part-1/build.yaml
Repo /Users/me/repos/training: Created tag ETL-Part-1-1.2.9 on branch
master, pointing to commit 8f75b380852feef3b1fa5810f0e255362c699d50.
The tag is applied to the top-most commit on whatever branch is currently
selected in the repo. The tag is not automatically pushed to the remote
Git repo(s). To do that, use `git push --tags`.
This command will abort if the tag already exists.
You cannot delete an existing tag via `bdc`. Use `git tag -d`. For example:
$ git tag -d ETL-Part-1-1.2.9
A minimal complete example, using `course`

Note that this minimal example assumes that your training repo is located at `~/repos/training`.
- You must set `DB_SHARD_HOME` in `~/.bash_profile`, `~/.bashrc`, or (if you're using the Z-shell) `~/.zshrc`. For example:

  export DB_SHARD_HOME="/Users/[email protected]"

- You must configure at least `[DEFAULT]` in `~/.databrickscfg`. The `databricks` command supports `username` and `password` authentication, as well as API token authentication. However, the build tools only support API token authentication.

  [DEFAULT]
  host = https://dbc-728c1937-2bf0.cloud.databricks.com/
  token = dapi9b1bd21f3cb79f26c5103d28d667967e
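After making both of the changes above, a quick sanity check might look like this (a sketch for bash; the workspace listing requires a valid token):

$ source ~/.bashrc
$ echo $DB_SHARD_HOME
$ databricks workspace ls "$DB_SHARD_HOME"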
- Create the directory `~/repos/training/courses/minimal`.
- Below is a minimal `build.yaml` file. Create the file `~/repos/training/courses/minimal/build.yaml`, and fill it with the contents below.
- To work on the "minimal" example:

  $ course workon minimal

- To build the "minimal" example:

  $ course build

  After this completes, you should have the directory `~/tmp/curriculum/minimal-example-1.0.0`, containing:

  - CHANGELOG.html
  - CHANGELOG.md
  - CHANGELOG.pdf
  - StudentFiles/Labs.dbc

  You should also have the folder `Target/minimal` in your Databricks workspace, containing the built course.

- To upload the "minimal" example:

  $ course upload

  After this completes, you should have the folder `_Source/minimal` in your Databricks workspace, containing the file `Intro-To-DataFrames-Part-1`.
course_info:
  name: minimal-example
  version: 1.0.0
  type: ILT

bdc_min_version: "1.24"
master_parse_min_version: "1.18"

top_dbc_folder_name: $course_id
src_base: ../../modules

notebook_defaults:
  dest: $target_lang/$notebook_type/$basename.$target_extension
  master:
    enabled: true
    scala: false
    python: true
    answers: false
    instructor: false
    enable_templates: false

misc_files:
  - src: CHANGELOG.md
    dest: ""

notebooks:
  - src: DB-105/Intro-To-DataFrames-Part-1.scala
    dest: $target_lang/$notebook_type/$basename.$target_extension
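Once the build completes, you can verify the local artifacts (the expected contents are listed in the steps above):

$ ls ~/tmp/curriculum/minimal-example-1.0.0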
NOTICE
- This software is copyright © 2017-2021 Databricks, Inc., and is released under the Apache License, version 2.0. See LICENSE.txt in the main repository for details.
- Databricks cannot support this software for you. We use it internally, and we have released it as open source, for use by those who are interested in building similar kinds of Databricks notebook-based curriculum. But this software does not constitute an official Databricks product, and it is subject to change without notice.