# KEP-12238: Jupyter Notebook Components

<!-- toc -->

- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Baseline Feature For Embedded Assets](#baseline-feature-for-embedded-assets)
  - [SDK User Experience](#sdk-user-experience)
    - [Example](#example)
    - [Complex Example](#complex-example)
    - [notebook_component Decorator Arguments](#notebook_component-decorator-arguments)
    - [Behavior Notes](#behavior-notes)
  - [User Stories](#user-stories)
  - [Notes/Constraints/Caveats](#notesconstraintscaveats)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Security Considerations](#security-considerations)
  - [Test Plan](#test-plan)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Summary

This proposal introduces a first-class `@dsl.notebook_component` decorator that lets users build Kubeflow Pipelines
(KFP) components directly from Jupyter notebooks. The decorator embeds a `.ipynb` file into the component and executes
it at runtime via `nbclient`, with parameters injected as a prepended cell. This provides a simple path for
notebook-centric workflows to run in KFP without requiring separate packaging or bespoke wrappers.

## Motivation

Many users begin experimentation and development in Jupyter notebooks. Turning those notebooks into pipeline components
currently requires boilerplate: exporting to Python, writing a wrapper function, or managing custom container images. A
native notebook component:

- Reduces friction to productionize notebook code in pipelines
- Preserves the notebook as the source of truth while allowing parameterization
- Avoids extra build steps by embedding notebook content into the component

### Goals

1. Enable defining a component from a `.ipynb` notebook with a single decorator.
2. Support parameter injection into the notebook at execution time.
3. Use the existing Python executor to keep compatibility with existing input/output concepts.

### Non-Goals

1. Notebook validation or linting beyond JSON and structural checks.
2. Adding a new pipeline IR or backend executor. This reuses the Python executor.

## Proposal

### Baseline Feature For Embedded Assets

This KEP establishes a baseline, generic capability for Python-function components to embed arbitrary files or
directories directly into a lightweight component. The goal is to support cases where a component needs a small amount
of read-only assets (configs, scripts, models, notebooks, etc.) without requiring a custom image.

- Add a new decorator argument to `@dsl.component`:
  - `embedded_artifact_path: Optional[str]` — path to a file or directory on the authoring machine to embed within the
    component.
- Add a new SDK type: `dsl.EmbeddedInput[T]` — a runtime-only input annotation that resolves to an artifact instance of
  type `T` (e.g. `dsl.EmbeddedInput[dsl.Dataset]`) whose `path` points to the extracted embedded artifact root.
  - If a directory is embedded, `.path` points to the extracted directory.
  - If a single file is embedded, `.path` points to that file.

Example:

```python
@dsl.component(
    embedded_artifact_path="assets/config_dir",
)
def my_component(cfg: dsl.EmbeddedInput[dsl.Dataset], param: int):
    # cfg.path is a directory when a directory is embedded; a file path if a file is embedded
    print(cfg.path)
```

Execution model (lightweight components), sketched after this list:

- At compile time, the file/dir is archived (tar + gzip) and base64-embedded into the ephemeral module.
- At runtime, the embedded artifact is made available by extracting it to a temporary location, and the
  `EmbeddedInput[...]` parameter is injected as an artifact with `.path` pointing to the extracted file/dir in that
  temporary directory. If the extracted root is a directory, it is also added to `sys.path` for Python module
  resolution.
- `sys.path` precedence: the embedded path (extracted root or zip archive) is prepended to `sys.path` (before existing
  entries) to ensure deterministic use of embedded modules when names overlap.
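
A minimal sketch of the compile-time archive step and the runtime extraction, assuming hypothetical helper names
(`_embed_asset`, `_extract_asset`) rather than the SDK's actual internals:

```python
import base64
import io
import os
import sys
import tarfile
import tempfile


def _embed_asset(path: str) -> str:
    """Compile time (hypothetical): tar+gzip the file/dir and base64-encode it."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        tar.add(path, arcname="asset")
    return base64.b64encode(buf.getvalue()).decode("ascii")


def _extract_asset(blob: str) -> str:
    """Runtime (hypothetical): decode, extract to a temp dir, prepend to sys.path."""
    root = tempfile.mkdtemp(prefix="kfp_embedded_")
    with tarfile.open(fileobj=io.BytesIO(base64.b64decode(blob)), mode="r:gz") as tar:
        tar.extractall(root)
    extracted = os.path.join(root, "asset")
    if os.path.isdir(extracted):
        # Prepend so embedded modules win deterministically on name clashes.
        sys.path.insert(0, extracted)
    return extracted
```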

Relationship to `@dsl.notebook_component`:

- The notebook component leverages the same embedding pattern, specializing the runtime helper to execute `.ipynb`
  content via `nbclient`. If `notebook_path` is a directory, the single `.ipynb` file it contains is executed, which
  allows utility modules to be embedded alongside the notebook; pipeline compilation fails if the directory contains
  more than one `.ipynb` file (see the discovery sketch below).
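
A minimal sketch of that discovery rule, using a hypothetical `find_notebook` helper:

```python
from pathlib import Path


def find_notebook(notebook_path: str) -> Path:
    # Hypothetical helper: a file path is used directly; a directory must
    # contain exactly one .ipynb file, otherwise compilation fails.
    path = Path(notebook_path)
    if path.is_file():
        return path
    notebooks = sorted(path.glob("*.ipynb"))
    if len(notebooks) != 1:
        raise ValueError(
            f"Expected exactly one .ipynb file in {path}, found {len(notebooks)}")
    return notebooks[0]
```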

### SDK User Experience

#### Example

```python
from kfp import dsl


@dsl.notebook_component(
    notebook_path="train.ipynb",
    packages_to_install=["pandas", "scikit-learn", "nbclient"],
)
def train_from_notebook(dataset_uri: str, model: dsl.Output[dsl.Model]):
    dsl.run_notebook(
        dataset_uri=dataset_uri,
        output_model_path=model.path,
    )


@dsl.pipeline(name="nb-pipeline")
def pipeline(dataset_uri: str = "s3://bucket/dataset.csv"):
    train_from_notebook(dataset_uri=dataset_uri)
```
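
For illustration, `train.ipynb` might declare its defaults in a code cell tagged `parameters` (hypothetical contents);
per the injection semantics described under Behavior Notes, the values passed to `dsl.run_notebook` override these
defaults at execution time:

```python
# Cell in train.ipynb tagged "parameters" (hypothetical defaults)
dataset_uri = "s3://bucket/default.csv"
output_model_path = "/tmp/model"
```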

#### Complex Example

A mixed pipeline with a Python preprocessor, a notebook training step, and a notebook evaluation step:

```python
from kfp import dsl


@dsl.component(
    base_image="python:3.11-slim",
    packages_to_install=["pandas==2.2.2"],
)
def preprocess(text: str, cleaned_text: dsl.Output[dsl.Dataset]):
    """Cleans whitespace from input text and writes to cleaned_text."""
    import re

    cleaned = re.sub(r"\s+", " ", text).strip()
    with open(cleaned_text.path, "w", encoding="utf-8") as f:
        f.write(cleaned)


@dsl.notebook_component(
    notebook_path="dev-files/nb_train.ipynb",
    base_image="registry.access.redhat.com/ubi9/python-311:latest",
)
def train_model(
    cleaned_text: dsl.Input[dsl.Dataset],
    model: dsl.Output[dsl.Model],
):
    """Trains a model from cleaned text and writes model."""
    import shutil

    # Read the dataset for the notebook since it expects a string
    with open(cleaned_text.path, "r", encoding="utf-8") as f:
        cleaned_text_str = f.read()

    # Execute the embedded notebook with kwargs injected as variables
    dsl.run_notebook(cleaned_text=cleaned_text_str)

    # Translate notebook outputs into KFP outputs
    nb_model_dir = "/tmp/kfp_nb_outputs/model_dir"
    shutil.copytree(nb_model_dir, model.path, dirs_exist_ok=True)


@dsl.notebook_component(
    notebook_path="dev-files/nb_eval.ipynb",
    base_image="registry.access.redhat.com/ubi9/python-311:latest",
)
def evaluate_model(
    model: dsl.Input[dsl.Model],
    metrics_json: dsl.Output[dsl.Metrics],
):
    """Evaluates a model and writes metrics JSON output."""
    import json

    # Execute the notebook with the model artifact path
    dsl.run_notebook(model=model.path)

    # Copy notebook-generated metrics into the metrics output artifact
    with open("/tmp/kfp_nb_outputs/metrics.json", "r", encoding="utf-8") as f:
        metrics_dict = json.load(f)

    for metric_name, metric_value in metrics_dict.items():
        if isinstance(metric_value, (int, float)):
            metrics_json.log_metric(metric_name, metric_value)


@dsl.pipeline(name="three-step-nb-mix")
def pipeline(text: str = "Hello world"):
    p = preprocess(text=text).set_caching_options(False)
    t = train_model(cleaned_text=p.output).set_caching_options(False)
    evaluate_model(model=t.output).set_caching_options(False)
```

#### notebook_component Decorator Arguments

Only the differences from the standard Python-executor component are listed here:

- `notebook_path: str` – New required parameter specifying the `.ipynb` file to embed and execute.
- `packages_to_install: Optional[List[str]]` – Same as the Python executor, except that `None` (the default here)
  installs a slimmer default runtime of `nbclient>=0.10,<1`; `[]` installs nothing; and a non-empty list installs
  exactly the provided packages (see the sketch after this list).

All other decorator arguments and behaviors are identical to the Python executor.
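
A minimal sketch of that resolution rule, using a hypothetical `resolve_packages` helper:

```python
from typing import List, Optional

# Assumed default pin from this proposal.
DEFAULT_NOTEBOOK_RUNTIME = ["nbclient>=0.10,<1"]


def resolve_packages(packages_to_install: Optional[List[str]]) -> List[str]:
    # None -> slim default runtime; [] -> install nothing; list -> exactly as given.
    if packages_to_install is None:
        return list(DEFAULT_NOTEBOOK_RUNTIME)
    return list(packages_to_install)
```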

#### Behavior Notes

- The notebook JSON is compressed using gzip and base64-encoded before embedding into the ephemeral Python module used
  by the Python executor. This reduces command-line length and allows for larger notebooks.
- At runtime, `dsl.run_notebook(**kwargs)` is bound to a helper that (see the sketch after this list):
  1. Decompresses and parses the embedded notebook into memory
  2. Injects parameters following Papermill semantics:
     - If the notebook contains a code cell tagged with `parameters`, a new code cell tagged `injected-parameters` is
       inserted immediately after it to override defaults.
     - If no `parameters` cell exists, the `injected-parameters` cell is inserted at the top of the notebook.
  3. Executes via `nbclient.NotebookClient`
  4. Streams cell outputs (stdout/stderr and `text/plain` displays)
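
A minimal sketch of the injection and execution steps, assuming `nb` is a parsed `nbformat` notebook object; output
streaming is omitted:

```python
import nbformat
from nbclient import NotebookClient


def inject_and_run(nb: nbformat.NotebookNode, **params) -> nbformat.NotebookNode:
    # Render the injected-parameters cell from the caller's kwargs.
    source = "\n".join(f"{name} = {value!r}" for name, value in params.items())
    injected = nbformat.v4.new_code_cell(source=source)
    injected.metadata["tags"] = ["injected-parameters"]

    # Papermill semantics: insert after the `parameters` cell if one exists,
    # otherwise at the top of the notebook.
    index = 0
    for i, cell in enumerate(nb.cells):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            index = i + 1
            break
    nb.cells.insert(index, injected)

    NotebookClient(nb).execute()
    return nb
```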

For the baseline bundling feature:

- Files/directories are archived (tar+gz) and base64-embedded similarly; by default they are extracted at import time
  to satisfy the `EmbeddedInput[...]` contract of providing a real filesystem `.path` and to support non-Python assets.
- For Python import resolution, the embedded path (extracted root or zip archive) is prepended to `sys.path` if it is
  a directory.
- `dsl.EmbeddedInput[T]` is not part of the component interface; it is injected at runtime and provides an artifact
  with `.path` set to the extracted file/dir.

### User Stories

1. As a data scientist, I can take an existing exploratory notebook and run it as a KFP component with parameters,
   without rewriting it into a Python script.
2. As a platform user, I can standardize execution images and dependency sources while still allowing teams to embed
   notebooks into components.

### Notes/Constraints/Caveats

- Embedded content increases the size of the generated command; extremely large notebooks may hit container argument
  length limits, though gzip compression typically reduces notebook size significantly.
- Notebooks must be valid JSON and include a `cells` array; otherwise component creation fails with a clear error.
- The SDK warns when embedded artifacts or notebooks exceed 1 MB to flag potential issues. The backend has a
  configurable maximum pipeline spec size; if it is exceeded, the error recommends moving content to a container image
  or object store.

### Risks and Mitigations

- **Dependency drift/conflicts**: Installing packages at runtime can introduce variability.
  - Mitigation: Encourage providing a `base_image` with pinned dependencies or using `packages_to_install` with exact
    versions.
- **Command length/performance**: Large embedded notebooks may slow compilation or exceed limits.
  - Mitigation: Automatic gzip compression reduces notebook size; warn on large files (>1 MB original); recommend
    refactoring or pre-building images for very large notebooks.

## Design Details

### Security Considerations

This feature does not introduce additional security risks beyond those inherent to executing notebooks. It relies on
the `nbclient` package within the execution environment (installed automatically unless overridden).

### Test Plan

#### Unit Tests

- Verify `packages_to_install` behavior for `None`, `[]`, and non-empty lists.
- Ensure the helper source is generated and injected correctly, and that `dsl.run_notebook` is bound to it.
- Test notebook compression and decompression round-trip correctness (see the sketch after this list).
- Verify the large-notebook warning logic.
- Run a pipeline with embedded artifacts via KFP local.
- Run a pipeline with notebooks via KFP local.
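
A minimal sketch of the round-trip unit test, assuming the gzip+base64 encoding described under Behavior Notes:

```python
import base64
import gzip
import json


def test_notebook_compression_round_trip():
    # Hypothetical check: encoding then decoding must reproduce the notebook JSON.
    nb = {"cells": [], "nbformat": 4, "nbformat_minor": 5}
    blob = base64.b64encode(gzip.compress(json.dumps(nb).encode("utf-8")))
    assert json.loads(gzip.decompress(base64.b64decode(blob))) == nb
```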

#### Integration tests

- Execute a pipeline with two parameterized notebooks that write to an output artifact.
  - One notebook should have no `parameters` cell.
  - The other notebook should have a `parameters` cell with some overrides.
- Failure paths: invalid notebook JSON; a notebook cell raises an execution error.
- A large notebook over 1 MB.

### Graduation Criteria

N/A

## Implementation History

- Initial proposal: 2025-09-10

## Drawbacks

- Embedded notebooks can bloat the command payload and slow compilation/execution for large files, though gzip
  compression typically helps.
- Notebooks are less modular than Python modules for code reuse and testing.

## Alternatives

1. Use `@dsl.component` with manual `nbconvert` calls inside the function. This requires boilerplate and manual
   packaging of the notebook.
2. Pre-build a container image containing the notebook and its dependencies, then use `@dsl.container_component`. This
   improves reproducibility but increases operational overhead.
