# KEP-12238: Jupyter Notebook Components

<!-- toc -->

- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Baseline Feature For Embedded Assets](#baseline-feature-for-embedded-assets)
  - [SDK User Experience](#sdk-user-experience)
    - [Example](#example)
    - [Complex Example](#complex-example)
    - [notebook_component Decorator Arguments](#notebook_component-decorator-arguments)
    - [Behavior Notes](#behavior-notes)
  - [User Stories](#user-stories)
  - [Notes/Constraints/Caveats](#notesconstraintscaveats)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Security Considerations](#security-considerations)
  - [Test Plan](#test-plan)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Summary

This proposal introduces a first-class `@dsl.notebook_component` decorator that lets users build Kubeflow Pipelines
(KFP) components directly from Jupyter notebooks. The decorator embeds a `.ipynb` file into the component and executes
it at runtime via `nbclient`, with parameters injected as a prepended cell. This provides a simple path for
notebook-centric workflows to run in KFP without requiring separate packaging or bespoke wrappers.

## Motivation

Many users begin experimentation and development in Jupyter notebooks. Turning those notebooks into pipeline components
currently requires boilerplate: exporting to Python, writing a wrapper function, or managing custom container images. A
native notebook component:

- Reduces friction to productionize notebook code in pipelines
- Preserves the notebook as the source of truth while allowing parameterization
- Avoids extra build steps by embedding notebook content into the component

### Goals

1. Enable defining a component from a `.ipynb` notebook with a single decorator.
2. Support parameter injection into the notebook at execution time.
3. Use the existing Python executor to stay compatible with existing input/output concepts.

### Non-Goals

1. Notebook validation or linting beyond JSON and structural checks.
2. Adding a new pipeline IR or backend executor. This proposal reuses the Python executor.

## Proposal

### Baseline Feature For Embedded Assets

This KEP establishes a baseline, generic capability for Python-function components to embed arbitrary files or
directories directly into a lightweight component. The goal is to support cases where a component needs a small amount
of read-only assets (configs, scripts, models, notebooks, etc.) without requiring a custom image.

- Add a new decorator argument to `@dsl.component`:
  - `embedded_artifact_path: Optional[str]` — path to a file or directory on the authoring machine to embed within the
    component.
- Add a new SDK type: `dsl.EmbeddedInput[T]` — a runtime-only input annotation that resolves to an artifact instance of
  type `T` (e.g. `dsl.EmbeddedInput[dsl.Dataset]`) whose `path` points to the extracted embedded artifact root.
  - If a directory is embedded, `.path` points to the extracted directory.
  - If a single file is embedded, `.path` points to that file.

Example:

```python
@dsl.component(
    embedded_artifact_path="assets/config_dir",
)
def my_component(cfg: dsl.EmbeddedInput[dsl.Dataset], param: int):
    # cfg.path is a directory when a directory is embedded; a file path if a file is embedded
    print(cfg.path)
```

Execution model (lightweight components):

- At compile time, the file/dir is archived (tar + gzip) and base64-embedded into the ephemeral module.
- At runtime, the embedded artifact is extracted to a temporary location, and the `EmbeddedInput[...]` parameter is
  injected as an artifact whose `.path` points to the extracted file/dir. If the extracted root is a directory, it is
  also added to `sys.path` for Python module resolution.
- `sys.path` precedence: the extracted root is prepended to `sys.path` (before existing entries) to ensure
  deterministic use of embedded modules when names overlap.
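
The compile-time and runtime halves of this model can be sketched with only the standard library. Note this is an
illustrative sketch: `embed_path` and `extract_embedded` are hypothetical names, not the SDK's actual helpers.

```python
import base64
import io
import sys
import tarfile
import tempfile
from pathlib import Path


def embed_path(src: str) -> str:
    """Compile time: archive a file or directory (tar + gzip), then base64-encode it."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        tar.add(src, arcname=Path(src).name)
    return base64.b64encode(buf.getvalue()).decode("ascii")


def extract_embedded(payload: str) -> Path:
    """Runtime: decode and extract to a temp dir, prepending it to sys.path."""
    dest = Path(tempfile.mkdtemp(prefix="kfp_embedded_"))
    with tarfile.open(fileobj=io.BytesIO(base64.b64decode(payload))) as tar:
        tar.extractall(dest)
    sys.path.insert(0, str(dest))  # embedded modules win on name clashes
    return dest


# Round-trip a small asset directory to show the contract.
assets = Path(tempfile.mkdtemp()) / "assets"
assets.mkdir()
(assets / "config.txt").write_text("hello")
root = extract_embedded(embed_path(str(assets)))
recovered = (root / "assets" / "config.txt").read_text()
```

In the real executor the base64 payload would live inside the generated ephemeral module rather than being produced at
call time.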

Relationship to `@dsl.notebook_component`:

- The notebook component leverages the same embedding pattern, specializing the runtime helper to execute `.ipynb`
  content via `nbclient`. If `notebook_path` is a directory, the single `.ipynb` file in that directory is executed,
  allowing utility modules to be embedded alongside it. Pipeline compilation fails if the directory contains more than
  one `.ipynb` file.
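
The directory case could be resolved with a helper along these lines (`resolve_notebook` is a hypothetical name used
for illustration, not the SDK's implementation):

```python
from pathlib import Path


def resolve_notebook(notebook_path: str) -> Path:
    """Return the notebook file itself, or the single .ipynb inside a directory."""
    path = Path(notebook_path)
    if path.is_file():
        return path
    notebooks = sorted(path.glob("*.ipynb"))
    if len(notebooks) != 1:
        # Mirrors the compilation failure described above.
        raise ValueError(
            f"expected exactly one .ipynb in {path}, found {len(notebooks)}")
    return notebooks[0]
```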

### SDK User Experience

#### Example

```python
from kfp import dsl


@dsl.notebook_component(
    notebook_path="train.ipynb",
    packages_to_install=["pandas", "scikit-learn", "nbclient"],
)
def train_from_notebook(dataset_uri: str, model: dsl.Output[dsl.Model]):
    dsl.run_notebook(
        dataset_uri=dataset_uri,
        output_model_path=model.path,
    )


@dsl.pipeline(name="nb-pipeline")
def pipeline(dataset_uri: str = "s3://bucket/dataset.csv"):
    train_from_notebook(dataset_uri=dataset_uri)
```

#### Complex Example

A mixed pipeline with a Python preprocessor, a notebook training step, and a notebook evaluation step:

```python
from kfp import dsl


@dsl.component(
    base_image="python:3.11-slim",
    packages_to_install=["pandas==2.2.2"],
)
def preprocess(text: str, cleaned_text: dsl.Output[dsl.Dataset]):
    """Cleans whitespace from input text and writes to cleaned_text."""
    import re

    cleaned = re.sub(r"\s+", " ", text).strip()
    with open(cleaned_text.path, "w", encoding="utf-8") as f:
        f.write(cleaned)


@dsl.notebook_component(
    notebook_path="dev-files/nb_train.ipynb",
    base_image="registry.access.redhat.com/ubi9/python-311:latest",
)
def train_model(
    cleaned_text: dsl.Input[dsl.Dataset],
    model: dsl.Output[dsl.Model],
):
    """Trains a model from cleaned text and writes model."""
    import shutil

    # Read the dataset for the notebook since it expects a string
    with open(cleaned_text.path, "r", encoding="utf-8") as f:
        cleaned_text_str = f.read()

    # Execute the embedded notebook with kwargs injected as variables
    dsl.run_notebook(cleaned_text=cleaned_text_str)

    # Translate notebook outputs into KFP outputs
    nb_model_dir = "/tmp/kfp_nb_outputs/model_dir"
    shutil.copytree(nb_model_dir, model.path, dirs_exist_ok=True)


@dsl.notebook_component(
    notebook_path="dev-files/nb_eval.ipynb",
    base_image="registry.access.redhat.com/ubi9/python-311:latest",
)
def evaluate_model(
    model: dsl.Input[dsl.Model],
    metrics_json: dsl.Output[dsl.Metrics],
):
    """Evaluates a model and logs metrics."""
    import json

    # Execute the notebook with the model artifact path
    dsl.run_notebook(model=model.path)

    # Copy notebook-generated metrics into the output metrics artifact
    with open("/tmp/kfp_nb_outputs/metrics.json", "r", encoding="utf-8") as f:
        metrics_dict = json.load(f)

    for metric_name, metric_value in metrics_dict.items():
        if isinstance(metric_value, (int, float)):
            metrics_json.log_metric(metric_name, metric_value)


@dsl.pipeline(name="three-step-nb-mix")
def pipeline(text: str = "Hello world"):
    p = preprocess(text=text).set_caching_options(False)
    t = train_model(cleaned_text=p.output).set_caching_options(False)
    evaluate_model(model=t.output).set_caching_options(False)
```

#### notebook_component Decorator Arguments

Only the differences from the standard Python executor component are listed:

- `notebook_path: str` – New required parameter specifying the `.ipynb` file to embed and execute.
- `packages_to_install: Optional[List[str]]` – Same as the Python executor, except that `None` (the default here)
  installs a slim default runtime of `nbclient>=0.10,<1`; `[]` installs nothing; and a non-empty list installs exactly
  the provided packages.

All other decorator arguments and behaviors are identical to the Python executor.
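
The `packages_to_install` tri-state (`None` vs. `[]` vs. a non-empty list) can be pinned down in a few lines. This is a
sketch of the intended semantics, not the SDK's code; `resolve_packages` is a hypothetical helper name:

```python
from typing import List, Optional

# Assumed slim default runtime, per the constraint stated above.
DEFAULT_NOTEBOOK_RUNTIME = ["nbclient>=0.10,<1"]


def resolve_packages(packages_to_install: Optional[List[str]]) -> List[str]:
    """None -> slim default runtime; [] -> nothing; list -> exactly those packages."""
    if packages_to_install is None:
        return list(DEFAULT_NOTEBOOK_RUNTIME)
    return list(packages_to_install)
```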

#### Behavior Notes

- The notebook JSON is gzip-compressed and base64-encoded before being embedded into the ephemeral Python module used
  by the Python executor. This reduces command-line length and allows for larger notebooks.
- At runtime, `dsl.run_notebook(**kwargs)` is bound to a helper that:
  1. Decompresses and parses the embedded notebook into memory
  2. Injects parameters following Papermill semantics:
     - If the notebook contains a code cell tagged `parameters`, a new code cell tagged `injected-parameters` is
       inserted immediately after it to override the defaults.
     - If no `parameters` cell exists, the `injected-parameters` cell is inserted at the top of the notebook.
  3. Executes via `nbclient.NotebookClient`
  4. Streams cell outputs (stdout/stderr and `text/plain` displays)

For the baseline bundling feature:

- Files/directories are archived (tar+gz) and base64-embedded in the same way; by default they are extracted at import
  time to satisfy the `EmbeddedInput[...]` contract of providing a real filesystem `.path` and to support non-Python
  assets.
- For Python import resolution, the extracted root is prepended to `sys.path` when the embedded asset is a directory.
- `dsl.EmbeddedInput[T]` is not part of the component interface; it is injected at runtime and provides an artifact with
  `.path` set to the extracted file/dir.
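
Steps 1–2 of the helper — decompression and Papermill-style parameter injection — can be sketched over the raw notebook
JSON. This is an illustrative sketch with hypothetical function names; the execution step via `nbclient.NotebookClient`
is omitted because it requires a running kernel.

```python
import base64
import gzip
import json


def load_embedded_notebook(payload: str) -> dict:
    """Step 1: base64-decode and gunzip the embedded notebook JSON."""
    return json.loads(gzip.decompress(base64.b64decode(payload)))


def inject_parameters(nb: dict, params: dict) -> dict:
    """Step 2: insert an 'injected-parameters' code cell per Papermill semantics."""
    cell = {
        "cell_type": "code",
        "metadata": {"tags": ["injected-parameters"]},
        "execution_count": None,
        "outputs": [],
        "source": "\n".join(f"{k} = {v!r}" for k, v in params.items()),
    }
    # After the 'parameters' cell if present, otherwise at the top.
    index = 0
    for i, existing in enumerate(nb["cells"]):
        if "parameters" in existing.get("metadata", {}).get("tags", []):
            index = i + 1
            break
    nb["cells"].insert(index, cell)
    return nb
```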

### User Stories

1. As a data scientist, I can take an existing exploratory notebook and run it as a KFP component with parameters,
   without rewriting it into a Python script.
2. As a platform user, I can standardize execution images and dependency sources while still allowing teams to embed
   notebooks into components.

### Notes/Constraints/Caveats

- Embedded content increases the size of the generated command; extremely large notebooks may hit container argument
  length limits, though gzip compression typically reduces notebook size significantly.
- Notebooks must be valid JSON and include a `cells` array; otherwise component creation fails with a clear error.
- The SDK warns when embedded artifacts or notebooks exceed 1 MB to flag potential issues. The backend has a
  configurable maximum pipeline spec size; if that limit is exceeded, the error recommends moving content to a
  container image or object store.

### Risks and Mitigations

- **Dependency drift/conflicts**: Installing packages at runtime can introduce variability.
  - Mitigation: Encourage providing a `base_image` with pinned dependencies, or using `packages_to_install` with exact
    versions.
- **Command length/performance**: Large embedded notebooks may slow compilation or exceed limits.
  - Mitigation: Automatic gzip compression reduces notebook size; warn on large files (>1 MB original); recommend
    refactoring or pre-building images for very large notebooks.

## Design Details

### Security Considerations

This feature does not introduce additional security risks beyond those inherent to executing notebooks. It relies on the
`nbclient` package within the execution environment (installed automatically unless overridden).

### Test Plan

#### Unit Tests

- Verify `packages_to_install` behavior for `None`, `[]`, and non-empty lists.
- Ensure the helper source is generated and injected correctly, and that `dsl.run_notebook` is bound.
- Test notebook compression and decompression round-trip correctness.
- Large-notebook warning logic.
- KFP local execution of a pipeline with embedded artifacts.
- KFP local execution of a pipeline with notebooks.
#### Integration tests

- Execute a pipeline with two parameterized notebooks that write to an output artifact.
  - One notebook should have no `parameters` cell.
  - The other notebook should have a `parameters` cell with some overrides.
- Failure path: invalid notebook JSON; notebook cell raises execution error.
- Large notebook over 1 MB.

### Graduation Criteria

N/A

## Implementation History

- Initial proposal: 2025-09-10

## Drawbacks

- Embedded notebooks can bloat the command payload and slow compilation/execution for large files, though gzip
  compression typically helps.
- Notebooks are less modular than Python modules for code reuse and testing.

## Alternatives

1. Use `@dsl.component` with manual `nbconvert` calls inside the function. This requires boilerplate and manual
   packaging of the notebook.
2. Pre-build a container image containing the notebook and its dependencies, then use `@dsl.container_component`. This
   improves reproducibility but increases operational overhead.