3. Notebook Markup
The master parser component parses Databricks source notebooks and, based on specialized markup embedded within the notebooks, produces (possibly multiple) output notebooks.
The master parser is automatically invoked by `bdc` whenever a notebook in the build file has a `master` section.
Notebook Processing
The tool looks for various labels, as well as language-specific tokens, within notebook cells.
By default, the master parse tool processes `%scala`, `%python`, `%r` and `%sql` cells specially. How it handles those cells is best described by example.
Suppose you've run the tool on a Python notebook (i.e., a file ending in `.py`), and the notebook also contains some `%scala` and `%r` cells.
- If you've specified `--scala` (on the command line) or passed `scala=True` (to `master_parse.process_notebooks()`), then the tool will create Scala notebooks that contain all non-code cells (Markdown cells, `%fs` and `%sh` cells, etc.) in the original, as well as any `%scala` cells. All other language code cells will be stripped from the Scala notebooks.
- If you've specified `--rproject` (on the command line) or passed `r=True` (to `master_parse.process_notebooks()`), then the tool will create R notebooks that contain all non-code cells (Markdown cells, `%fs` and `%sh` cells, etc.) in the original, as well as any `%r` cells. All other language code cells will be stripped from the R notebooks.
- If you've specified `--python` (on the command line) or passed `python=True` (to `master_parse.process_notebooks()`), then the tool will create Python notebooks that contain all non-code cells (Markdown cells, `%fs` and `%sh` cells, etc.) in the original, as well as any Python cells. Since the file ends in `.py`, Python cells are assumed to be cells with explicit `%python` magic or any non-decorated (i.e., normal) code cells.
You can modify this behavior somewhat, using the labels below.
Master parse labels are special tokens, recognized only by the master parse tool, that mark cells for special handling. Some labels make sense only in code cells. Others can be used in code cells, Markdown cells, etc.
All labels must be preceded by a comment sequence. For instance:
# TODO
// TODO
-- TODO
Labels must appear on a line by themselves. Thus, use:
%md
// SCALA_ONLY
not
%md // SCALA_ONLY
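The two rules above (a comment sequence before the label, and the label alone on its line) can be sketched with a regular expression. This is only an illustration, not the tool's actual parser: the function name `find_labels` is hypothetical, and the set of comment prefixes is assumed to be just the three shown in the examples.

```python
import re

# Comment prefixes taken from the examples above (assumption: the real tool
# may accept others).
COMMENT_PREFIXES = ("#", "//", "--")

# A label line: a comment prefix, optional blanks, then an upper-case token,
# with nothing else on the line.
_LABEL_LINE = re.compile(
    r"^\s*(?:"
    + "|".join(re.escape(prefix) for prefix in COMMENT_PREFIXES)
    + r")\s*([A-Z][A-Z_]*)\s*$"
)


def find_labels(cell_source: str) -> list[str]:
    """Return label-like tokens that sit alone on a commented line."""
    labels = []
    for line in cell_source.splitlines():
        match = _LABEL_LINE.match(line)
        if match:
            labels.append(match.group(1))
    return labels
```

Because the pattern accepts any upper-case token, a caller would still need to check the result against the list of valid labels. Labels that accept trailing text on the same line, such as an annotated `TEST`, are deliberately not covered by this sketch.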
Cells not marked with any label are handled specially, depending on the cell type:
- `%md`, `%md-sandbox`: Markdown cells appear in all output notebooks, unless suppressed, for example, with `SCALA_ONLY`, `PYTHON_ONLY`, `INSTRUCTOR_NOTE`, etc.
- `%fs` and `%sh` cells appear in all output notebooks, unless explicitly suppressed.
- Code cells only appear in the output notebook for their language, unless marked with `ALL_NOTEBOOKS`. Thus, a Scala cell only shows up in Scala notebooks, unless marked with `ALL_NOTEBOOKS`.
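The default routing described above can be summarized in a few lines of Python. This is a minimal sketch under stated assumptions, not the tool's implementation: `output_languages` is a hypothetical helper, cell types are passed as plain strings, and only the language `_ONLY` labels and `ALL_NOTEBOOKS` are considered.

```python
CODE_LANGUAGES = {"scala", "python", "r", "sql"}


def output_languages(cell_language, labels, enabled_languages):
    """Return the set of output-notebook languages that receive a cell.

    cell_language:     "scala", "python", "r" or "sql" for code cells, or a
                       non-code type such as "md", "md-sandbox", "fs", "sh".
    labels:            set of labels found in the cell, e.g. {"SCALA_ONLY"}.
    enabled_languages: languages for which output notebooks are generated.
    """
    enabled = set(enabled_languages)
    # Language restrictions such as SCALA_ONLY or PYTHON_ONLY. Other labels
    # ending in _ONLY are filtered out by the intersection.
    only = {
        label[: -len("_ONLY")].lower()
        for label in labels
        if label.endswith("_ONLY")
    } & CODE_LANGUAGES
    if only:
        return enabled & only
    # Unlabeled code cells go only to the notebook for their own language.
    if cell_language in CODE_LANGUAGES and "ALL_NOTEBOOKS" not in labels:
        return enabled & {cell_language}
    # Markdown, %fs, %sh and ALL_NOTEBOOKS cells go everywhere.
    return enabled
```

For example, `output_languages("scala", set(), ["scala", "python"])` yields `{"scala"}`, while adding `ALL_NOTEBOOKS` to the labels yields both languages.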
In a Scala code cell:
// ANSWER
// Scala answer goes here
In a markdown cell in a Python notebook:
%md
-- SCALA_ONLY
-- SQL_ONLY
This Markdown cell is in a Python notebook, but it only appears in Scala or SQL
notebooks generated by the master parse tool.
The valid labels are:
IPYTHON_ONLY
This cell type is deprecated and will be removed in a future release of this tool. Use of it will generate warnings.
Cells which need to be in IPython (or Jupyter) notebooks only. If IPython notebooks aren't being generated, these cells are stripped out.
DATABRICKS_ONLY
This cell type is deprecated and will be removed in a future release of this tool. Use of it will generate warnings.
Cells which need to be in Databricks notebooks only.
SCALA_ONLY, PYTHON_ONLY, R_ONLY, SQL_ONLY
Cells marked with one of these labels show up only when generating notebooks for the corresponding language. These are for special cells (like Markdown cells, `%fs` cells, `%sh` cells) that you want to include on a language-dependent basis. For example, if a Markdown cell is different for Scala vs. Python, then you can create two `%md` cells, with one marked `PYTHON_ONLY` and the other marked `SCALA_ONLY`.
AMAZON_ONLY, AZURE_ONLY
Cells marked with `AMAZON_ONLY` only show up when building for target profile `amazon`. Cells marked with `AZURE_ONLY` only show up when building for target profile `azure`.
See the `-tp` command line option (in Miscellaneous options, above) or the `bdc` setting `only_in_profile` (in the `bdc` Notebooks section).
TODO
Cells show up only in exercise notebooks. These cells are usually exercises the user needs to complete.
As a special case, if the entire TODO cell is commented out, the master parser will strip the first level of comments. This allows for runnable TODO cells in source notebooks. Thus, the following three TODO cells are functionally equivalent in the output notebooks:
Not runnable in source notebook:
# TODO
x = FILL_THIS_IN
Runnable in source notebook:
# TODO
# x = FILL_THIS_IN
# TODO
#x = FILL_THIS_IN
All three cells will render as follows in the Python exercises output notebook:
# TODO
x = FILL_THIS_IN
NOTES:
- When you create a runnable TODO cell, you can use at most one blank character after the leading comment. (The blank is optional.) The master parser will remove the leading comment and, optionally, one subsequent blank from every commented line except for the line with the "TODO" marker.
- Do not precede `TODO` with multiple comment characters, even in a runnable `TODO` cell. It won't work. That is, use `// TODO` or `# TODO`, not `// // TODO` or `# # TODO`. The latter won't be recognized as a proper TODO cell.
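The comment-stripping rule for runnable TODO cells can be sketched as follows. This is an illustration of the behavior described above, not the master parser's code: `uncomment_todo` is a hypothetical name, the marker line is assumed to be exactly the comment sequence, one blank, and `TODO`, and the cell is treated as fully commented out when every non-blank line other than the marker starts with the comment sequence.

```python
def uncomment_todo(cell_source: str, comment: str = "#") -> str:
    """Strip one level of comments from a fully commented-out TODO cell."""
    lines = cell_source.splitlines()
    marker = f"{comment} TODO"
    body = [ln for ln in lines if ln.strip() and ln.strip() != marker]
    # Only act when every non-blank line other than the marker is a comment.
    if not body or not all(ln.startswith(comment) for ln in body):
        return cell_source
    result = []
    for ln in lines:
        if ln.strip() == marker or not ln.startswith(comment):
            result.append(ln)  # marker line or blank line: keep as-is
            continue
        ln = ln[len(comment):]  # drop the leading comment ...
        if ln.startswith(" "):
            ln = ln[1:]  # ... and at most one following blank
        result.append(ln)
    return "\n".join(result)
```

With this sketch, a cell whose body is `# x = FILL_THIS_IN` or `#x = FILL_THIS_IN` comes out as `x = FILL_THIS_IN`, and a cell that is not commented out is returned unchanged.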
ANSWER
Cells show up only in instructor and answer notebooks.
TEST
These cells identify tests and usually follow an exercise cell. Test cells provide a means for a student to test the solution to an exercise. You can include an annotation after the word `TEST`. For example:
# TEST - Please run this cell to test your solution.
If you don't supply an annotation, the tool will add one. So, this line:
// TEST
will be emitted, in the generated notebooks, as:
// TEST - Run this cell to test your solution.
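The default-annotation behavior can be sketched like this. The helper name `normalize_test_line` and the accepted comment prefixes are assumptions made for the illustration; only the default annotation text comes from the example above.

```python
import re

DEFAULT_TEST_NOTE = "Run this cell to test your solution."


def normalize_test_line(line: str) -> str:
    """Append the default annotation to a bare TEST label line."""
    # A bare label: comment prefix, optional blanks, TEST, nothing else.
    match = re.fullmatch(r"(\s*(?:#|//|--)\s*TEST)\s*", line)
    if match:
        return f"{match.group(1)} - {DEFAULT_TEST_NOTE}"
    # Lines that already carry an annotation (or are not TEST labels)
    # pass through untouched.
    return line
```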
PRIVATE_TEST
Cells show up in instructor/answer notebooks.
VIDEO
Valid only in Markdown cells, this command is replaced with HTML for a large video button. When clicked, the button opens the specified URL in a new tab. The command takes the form `VIDEO url [title]`. `url` is the link to the video. The optional title is the video's title which, if present, appears in the button. If no title is supplied, the button will not contain a title.
INSTRUCTOR_NOTE
`INSTRUCTOR_ONLY` and `INSTRUCTOR_NOTES` are both aliases for this tag.
Valid only in Markdown cells, this command causes the cell to be copied into the instructor notebook (if instructor notebooks are being generated) and omitted from the exercises and answers notebooks. An "Instructor Note" header will automatically be added to the cell.
In addition, if consolidated instructor notes are enabled for the notebook, cells marked with `-- INSTRUCTOR_NOTE` are consolidated and copied into a single Markdown document associated with the notebook.
SOURCE_ONLY
Valid in any cell, this tag marks a cell as a source-only cell. Source-only cells are never copied to output notebooks. Source-only cells are useful for many things, such as cells with credentials that are only to be used during curriculum development.
ILT_ONLY
An `ILT_ONLY` cell is only copied to output notebooks if the course type is "ilt". See the `-ct` (`--content-type`) command line parameter.
SELF_PACED_ONLY
A `SELF_PACED_ONLY` cell is only copied to output notebooks if the course type is "self-paced". See the `-ct` (`--content-type`) command line parameter.
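The profile and course-type labels (`AMAZON_ONLY`, `AZURE_ONLY`, `ILT_ONLY`, `SELF_PACED_ONLY`) all act as simple filters, which can be sketched as one predicate. The function name `keep_cell` and its parameters are hypothetical; the profile and course-type strings are the ones quoted in the text.

```python
def keep_cell(labels, profile=None, course_type=None):
    """Decide whether a cell survives profile and course-type filtering.

    labels:      set of labels found in the cell
    profile:     "amazon", "azure", or None when no target profile is set
    course_type: "ilt" or "self-paced"
    """
    if "AMAZON_ONLY" in labels and profile != "amazon":
        return False
    if "AZURE_ONLY" in labels and profile != "azure":
        return False
    if "ILT_ONLY" in labels and course_type != "ilt":
        return False
    if "SELF_PACED_ONLY" in labels and course_type != "self-paced":
        return False
    # Cells without any of these labels are unaffected by this filter.
    return True
```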
ALL_NOTEBOOKS
The cell should be copied into all generated notebooks, regardless of language. Consider the following code in a Scala notebook:
%python
# ALL_NOTEBOOKS
x = 10
If you run the master parse tool to create Scala and Python notebooks, with instructor and student notebooks, that cell will appear in the generated Scala notebooks (instructor and answers) as well as in the generated Python notebooks (instructor and answers).
INLINE
This cell type is deprecated and will be removed in a future release of this tool. Use of it will generate warnings.
Can be used for multilanguage notebooks to force another language to be inserted. The behavior is a little counterintuitive. Here's an example.
You're processing a notebook called `foo.scala`, so the base language is Scala. The notebook has these cells somewhere inside:
%python
# INLINE
x = 10
// INLINE
val y = 100
The first cell is a Python cell that would normally be suppressed in the Scala output; it would either be written to the output Python notebook or suppressed entirely (if Python output was disabled).
However, because of the `# INLINE`, the cell is written to the output Scala notebook instead, and suppressed in the output Python notebook.
Meanwhile, the opposite happens with the second cell. Because the second cell is Scala, but is marked as `// INLINE`, it is only written to non-Scala output notebooks.
NEW_PART
Start a new part of the lab. A lab can be divided into multiple parts, with each part starting with a cell labeled `NEW_PART`. Every time the tool encounters a `NEW_PART` label, it creates a new notebook that starts with a cell that runs the previous part notebook (via `%run`), which enables students who are lagging behind to catch up.
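A rough sketch of the splitting logic follows. Several details here are assumptions, since the text does not specify them: the part-notebook naming scheme (`<base>-part<N>`), the relative `%run` path, and the fact that the labeled cell is kept verbatim are all hypothetical, as are the function names.

```python
NEW_PART_LINES = {"# NEW_PART", "// NEW_PART", "-- NEW_PART"}


def is_new_part(cell: str) -> bool:
    """True if the cell carries a NEW_PART label on a line by itself."""
    return any(line.strip() in NEW_PART_LINES for line in cell.splitlines())


def split_into_parts(cells, base_name):
    """Split cell sources into parts; return a list of (name, cells) tuples."""
    parts = []
    current = []
    for cell in cells:
        # A NEW_PART label closes the current part, unless it is empty
        # (e.g. when the very first cell is labeled).
        if is_new_part(cell) and current:
            parts.append(current)
            current = []
        current.append(cell)
    if current:
        parts.append(current)

    named = []
    for index, part in enumerate(parts, start=1):
        if index > 1:
            # Hypothetical naming: each later part first runs its predecessor.
            part = [f"%run ./{base_name}-part{index - 1}"] + part
        named.append((f"{base_name}-part{index}", part))
    return named
```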
The master parser also supports special inline tokens in Markdown cells. These tokens are replaced with images and, sometimes, markup. The four currently supported tokens are:
- `:HINT:` A hint for the student.
- `:CAUTION:` A caution or warning.
- `:BESTPRACTICE:` Indicates a best practice.
- `:SIDENOTE:` Something of note that’s not necessarily 100% pertinent to the rest of the cell.
Here's an example cell containing each token:
%md
We're talking about life here, people. This is some important stuff. Pay attention.
:HINT: Don't worry too much.
:CAUTION: Stress'll kill ya, man.
:BESTPRACTICE: Eat right, and get plenty of rest.
:SIDENOTE: No one gets out alive.
Currently, these tokens render as follows, in a `%md-sandbox` cell:
[Image: the four tokens as rendered in a notebook]
The master parser supports treating Markdown cells (`%md` and `%md-sandbox` cells) as templates. This feature is disabled by default, but it can be enabled:
- on a per-notebook basis in `build.yaml`, by setting the `enable_templates` field in the `master` section;
- via the `--templates` command line option, if you're calling the master parser from the command line; or
- via a parameter setting to the API, if you're calling the master parser programmatically.
When templates are enabled, Markdown cells are treated as Mustache templates. Using Mustache in notebook cells allows you to:
- do conditional substitution. For instance, insert this sentence if building for Azure, but use this other sentence if building for Amazon.
- do token substitution. For instance, substitute the current value of this parameter here.
See below for a brief introduction to Mustache syntax.
The master parser defines the following variables automatically:
- `amazon`: Set to "Amazon" (which also evaluates as `true` in a template), if building for Amazon. Otherwise, set to an empty string (which also evaluates as `false` in a template).
- `azure`: Set to "Azure" (which also evaluates as `true` in a template), if building for Azure. Otherwise, set to an empty string (which also evaluates as `false` in a template).
- `copyright_year`: The value of the copyright year parameter.
- `notebook_language`: The programming language of the notebook being generated (e.g., "Scala", "Python", "R", "SQL").
- `scala`: `true` if the output notebook is Scala, `false` otherwise.
- `python`: `true` if the output notebook is Python, `false` otherwise.
- `r`: `true` if the output notebook is R, `false` otherwise.
- `sql`: `true` if the output notebook is SQL, `false` otherwise.
- `self_paced`: `true` if the build is a self-paced build; `false` if it is an ILT build.
- `ilt`: `true` if the build is an ILT build; `false` if it is a self-paced build.
In addition, you can substitute any variables defined in the `bdc` build file's `variables` section.
If calling the master parser from the command line, there's a `--variable` parameter that allows you to pass additional variables.
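The automatically defined variables listed above can be pictured as a plain dictionary. The sketch below mirrors that list; the function name `default_variables` and its parameters are hypothetical and do not describe the tool's actual API.

```python
def default_variables(profile, language, course_type, copyright_year):
    """Build the automatically defined template variables listed above.

    profile:     "amazon", "azure", or None
    language:    output notebook language, e.g. "Scala" or "Python"
    course_type: "ilt" or "self-paced"
    """
    lang = language.lower()
    return {
        # Non-empty strings are truthy in a template; empty strings are falsy.
        "amazon": "Amazon" if profile == "amazon" else "",
        "azure": "Azure" if profile == "azure" else "",
        "copyright_year": copyright_year,
        "notebook_language": language,
        "scala": lang == "scala",
        "python": lang == "python",
        "r": lang == "r",
        "sql": lang == "sql",
        "self_paced": course_type == "self-paced",
        "ilt": course_type == "ilt",
    }
```

Variables from the build file or the command line could then be layered on top with an ordinary dictionary merge; which source takes precedence in the real tool is not stated here.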
The Mustache templating also provides some other convenient expansions, each of which is described here.
The parser supports a special nested block, in Markdown cells only, for revealable hints. The `{{#HINTS}}` construct introduces a block of hints (and is closed by `{{/HINTS}}`); such a construct contains one or more revealable hints and an optional answer.
This construct is best described by example. Consider the following Markdown cell:
%md
This is a pithy description of an exercise you are to perform, below.
{{#HINTS}}
{{#HINT}}Revealable hint 1.{{/HINT}}
{{#HINT}}
Revealable hint 2. Note that the source for this one
is multiple lines _and_ contains some **Markdown** to be
rendered.
{{/HINT}}
{{#ANSWER}}
Still no luck? Here's your answer:
```
df = spark.read.option("inferSchema", "true").option("header", "true").csv("dbfs:/tmp/foo.csv")
df.limit(10).show()
```
{{/ANSWER}}
{{/HINTS}}
When run through the master parser, the above renders a cell whose content is revealed one button click at a time: the first click reveals hint 1, the second click reveals hint 2, and the final click reveals the answer.
More formally:
A hints block:
- must contain at least one hint block. A hint is Markdown or HTML in between a starting `{{#HINT}}` and an ending `{{/HINT}}`.
- may contain multiple `{{#HINT}}` blocks.
- may contain an `{{#ANSWER}}` block.
`{{#HINTS}}`, `{{#HINT}}` and `{{#ANSWER}}` blocks may contain leading and trailing blank lines, to aid source readability; those lines are stripped on output.
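The formal rules for a hints block lend themselves to a small checker. The following is a sketch only: `check_hints_block` is a hypothetical helper, and the "at most one answer" check is an interpretation of "an optional answer" rather than something the text states outright.

```python
import re


def check_hints_block(markdown: str) -> list[str]:
    """Return a list of problems found in {{#HINTS}} blocks (empty if none)."""
    problems = []
    blocks = re.findall(
        r"\{\{#HINTS\}\}(.*?)\{\{/HINTS\}\}", markdown, flags=re.DOTALL
    )
    for block in blocks:
        hints = re.findall(
            r"\{\{#HINT\}\}(.*?)\{\{/HINT\}\}", block, flags=re.DOTALL
        )
        answers = re.findall(
            r"\{\{#ANSWER\}\}(.*?)\{\{/ANSWER\}\}", block, flags=re.DOTALL
        )
        if not hints:
            problems.append("hints block contains no {{#HINT}} block")
        if len(answers) > 1:  # assumption: the answer is optional but unique
            problems.append("hints block contains more than one {{#ANSWER}} block")
    return problems
```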
Mustache is a very simple template language. For full details, see the Mustache manual page. For our purposes, the two most useful constructs are conditional content and variable substitution.
Here's an example of conditional content:
{{#amazon}}
Rendered if amazon is defined.
{{/amazon}}
If the variable "amazon" has a non-empty value (or is `true`), then the string "Rendered if amazon is defined." is included in the cell. Otherwise, the entire construct is omitted.
This is Mustache's form of an if statement. There is no else statement. There's a kind of if not, however: simply replace the `#` with a `^`.
{{^amazon}}
Rendered if amazon is not defined.
{{/amazon}}
This construct also works inline:
Mount your {{#amazon}}S3 bucket{{/amazon}}{{#azure}}blob store{{/azure}}
to DBFS.
Variable substitution is quite simple: just enclose the variable's name in `{{` and `}}`. For example:
This is a {{notebook_language}} notebook.
If `notebook_language` is set to "Scala", that line will render as:
This is a Scala notebook.
For a more complete example, consider this Markdown cell:
%md
In this {{notebook_language}} notebook,
you can access your data by mounting your
{{#amazon}}
S3 bucket
{{/amazon}}
{{#azure}}
Azure blob store
{{/azure}}
to DBFS.
When generated with an Amazon profile, in a Scala output notebook, this cell would become:
%md
In this Scala notebook,
you can access your data by mounting your
S3 bucket
to DBFS.
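To make the two constructs concrete, here is a deliberately tiny renderer for the subset of Mustache used above: sections (`{{#name}}`), inverted sections (`{{^name}}`) and plain variables. It is a teaching sketch, not the templating library the master parser uses, and it omits most of Mustache (lists, HTML escaping, partials, comments). The function name `render` is an assumption.

```python
import re

_SECTION = re.compile(r"\{\{([#^])(\w+)\}\}(.*?)\{\{/\2\}\}", re.DOTALL)
_STANDALONE = re.compile(r"^[ \t]*(\{\{[#^/]\w+\}\})[ \t]*\n", re.MULTILINE)
_VARIABLE = re.compile(r"\{\{(\w+)\}\}")


def render(template: str, variables: dict) -> str:
    """Render a small subset of Mustache: sections, inverted sections, variables."""
    # A section tag alone on its line should not leave a blank line behind,
    # so drop the line break that follows it.
    text = _STANDALONE.sub(r"\1", template)

    def section(match):
        kind, name, body = match.groups()
        truthy = bool(variables.get(name))
        # "#" keeps the body when the variable is truthy, "^" when it is not.
        return body if truthy == (kind == "#") else ""

    # Repeat until nothing changes, so sections nested inside kept bodies
    # are resolved as well.
    previous = None
    while previous != text:
        previous = text
        text = _SECTION.sub(section, text)

    # Plain variable substitution; unknown names render as an empty string.
    return _VARIABLE.sub(lambda m: str(variables.get(m.group(1), "")), text)
```

Rendering the example cell above with `notebook_language` set to "Scala", `amazon` set to "Amazon" and `azure` set to an empty string reproduces the output shown for the Amazon profile.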
NOTICE
- This software is copyright © 2017-2021 Databricks, Inc., and is released under the Apache License, version 2.0. See LICENSE.txt in the main repository for details.
- Databricks cannot support this software for you. We use it internally, and we have released it as open source, for use by those who are interested in building similar kinds of Databricks notebook-based curriculum. But this software does not constitute an official Databricks product, and it is subject to change without notice.