Support multi-document compiled YAML (pipeline + platform spec) #100

sduvvuri1603 wants to merge 3 commits into kubeflow:main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of the changed files.
- Read compiled output with `yaml.safe_load_all` so multi-doc does not fail.
- Merge `deploymentSpec`/`root`/`components` from all docs into one dict so base image extraction sees images from both pipeline and platform spec.
- Add `_merge_ir_docs` in `kfp_compilation` and unit tests for merge behavior.
- No change to compiler write path or to `validate_base_images.py`.

Signed-off-by: sduvvuri1603 <sduvvuri@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Pull request overview
Updates KFP compilation parsing to support multi-document compiled YAML (pipeline spec + platform spec) by loading all YAML documents and merging the relevant IR sections before downstream base-image validation.
Changes:
- Switch YAML parsing from `yaml.safe_load` to `yaml.safe_load_all` in `compile_and_get_yaml()`.
- Add `_merge_ir_docs()` to merge `deploymentSpec.executors`, `root.dag.tasks`, and `components` across documents.
- Add unit tests covering `_merge_ir_docs()` behavior and integration with `extract_base_images()`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| scripts/lib/kfp_compilation.py | Loads all YAML documents and merges IR sections for downstream image extraction. |
| scripts/lib/tests/test_kfp_compilation.py | Adds unit tests validating multi-doc merge behavior and compatibility with `extract_base_images()`. |
```python
def _merge_ir_docs(docs: list[dict[str, Any]]) -> dict[str, Any]:
    """Merge multiple IR YAML docs so deploymentSpec, root, components are combined."""
    if not docs:
        return {}
    if len(docs) == 1:
        return docs[0]

    def _get(d: Any, path: list[str]) -> dict[str, Any]:
        for k in path:
            d = (d.get(k) or {}) if isinstance(d, dict) else {}
        return d if isinstance(d, dict) else {}

    def _ensure_and_merge(out: dict[str, Any], path: list[str], src: dict[str, Any]) -> None:
        cur: Any = out
        for k in path[:-1]:
            cur = cur.setdefault(k, {}) if isinstance(cur, dict) else {}
        if not path or not isinstance(cur, dict):
            return
        key = path[-1]
        target = cur.get(key)
        if not isinstance(target, dict):
            cur[key] = {}
            target = cur[key]
        target.update(src)

    out: dict[str, Any] = dict(docs[0])
    paths = [["deploymentSpec", "executors"], ["root", "dag", "tasks"], ["components"]]
```
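The diff above stops at the `paths` list. As a rough, self-contained sketch of how the full merge could behave (the trailing loop and return are my reconstruction, not the PR's verbatim code, and the function name differs to make that clear):

```python
import copy

def merge_ir_docs(docs):
    """Illustrative merge of KFP IR docs: combine deploymentSpec.executors,
    root.dag.tasks, and components from every document into a copy of doc 0."""
    dict_docs = [d for d in docs if isinstance(d, dict)]  # drop None/non-mapping docs
    if not dict_docs:
        return {}
    if len(dict_docs) == 1:
        return dict_docs[0]

    def get(d, path):
        # Walk a nested path, returning {} if any level is missing or non-dict.
        for k in path:
            d = (d.get(k) or {}) if isinstance(d, dict) else {}
        return d if isinstance(d, dict) else {}

    def ensure_and_merge(out, path, src):
        cur = out
        for k in path[:-1]:
            cur = cur.setdefault(k, {})
        if not isinstance(cur.get(path[-1]), dict):
            cur[path[-1]] = {}
        cur[path[-1]].update(src)

    out = copy.deepcopy(dict_docs[0])
    paths = [["deploymentSpec", "executors"], ["root", "dag", "tasks"], ["components"]]
    for doc in dict_docs[1:]:
        for path in paths:
            src = get(doc, path)
            if src:
                ensure_and_merge(out, path, src)
    return out

pipeline = {"deploymentSpec": {"executors": {"exec-a": {"image": "img-a"}}},
            "components": {"comp-a": {}}}
platform = {"deploymentSpec": {"executors": {"exec-b": {"image": "img-b"}}}}
merged = merge_ir_docs([pipeline, platform])
print(sorted(merged["deploymentSpec"]["executors"]))  # → ['exec-a', 'exec-b']
```

Note the later documents win on key collisions because `dict.update` overwrites; the toy inputs here are hypothetical, not real KFP output.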
_merge_ir_docs() is annotated as taking list[dict[str, Any]], but the implementation already anticipates non-dict docs (and the new tests pass a string). Also, out = dict(docs[0]) will raise TypeError if the first YAML document is None/non-mapping (safe_load_all can yield None for empty/null docs). Consider changing the parameter type to list[Any] (or Sequence[Any]) and filtering to dict docs up front (return {} if none) so the merge is robust regardless of document ordering/content.
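The None-document case mentioned here is easy to reproduce: `yaml.safe_load_all` yields `None` for empty or comment-only documents, so filtering to mappings up front (as suggested) keeps `dict(docs[0])` and the merge safe. A quick sketch:

```python
import yaml

stream = "a: 1\n---\n# comment-only document\n---\nb: 2\n"
docs = list(yaml.safe_load_all(stream))
print(docs)  # [{'a': 1}, None, {'b': 2}] — the middle document loads as None

# Keeping only mapping documents makes downstream dict operations safe:
dict_docs = [d for d in docs if isinstance(d, dict)]
print(dict_docs)  # [{'a': 1}, {'b': 2}]
```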
```diff
 with open(output_path) as f:
-    return yaml.safe_load(f)
+    docs = list(yaml.safe_load_all(f))
+    if not docs:
+        return {}
+    if len(docs) == 1:
+        return docs[0]
+    return _merge_ir_docs(docs)
```
compile_and_get_yaml() is declared to return dict[str, Any], but when there is exactly one YAML document it returns docs[0] without validating it is a dict. If the compiled YAML ever contains an empty/null document (safe_load_all yields None) or a non-mapping top-level, this will propagate a non-dict to callers and will cause extract_base_images() to raise (it requires a dict). Consider filtering/validating the loaded documents and returning {} (or raising a clearer error) when no dict document is present.
Single-document YAML is already guaranteed to be a dict because we only keep dict documents here:

```python
docs = [d for d in raw if isinstance(d, dict)]
```

So when there's one document we return `docs[0]`, which is always a dict.
Added a clear ValueError ("Expected at least one pipeline/component spec.") for the case where there is no dict document at all, instead of returning `{}`.
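A minimal sketch of that behavior (the helper name and exact message are illustrative, paraphrasing the comment above):

```python
def keep_dict_docs(raw):
    """Keep only mapping documents; raise instead of silently returning {}."""
    docs = [d for d in raw if isinstance(d, dict)]
    if not docs:
        raise ValueError("Expected at least one pipeline/component spec.")
    return docs

print(keep_dict_docs([{"pipelineInfo": {}}, None]))  # → [{'pipelineInfo': {}}]

try:
    keep_dict_docs([None, "not-a-mapping"])
except ValueError as exc:
    print(exc)  # the clear error instead of an empty dict
```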
```python
    return {}
if len(docs) == 1:
    return docs[0]
return _merge_ir_docs(docs)
```
Do we need to combine the docs? Do we need platform spec somewhere? Can we just return pipeline spec dict?
The issue was multi-doc compiled YAML causing us to miss base images when we only read the first document.
So my idea was to merge, so that `extract_base_images()` sees both the pipeline spec and the platform spec, since images can live in either doc (we incorporate both for correct validation).
Please correct me if you think the pipeline spec alone would be enough :)
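To make the trade-off concrete, here is a toy stand-in for image collection (I am only assuming `extract_base_images()` behaves roughly like this, and the document layouts below are hypothetical): reading just the first document drops any image that appears only in the platform spec.

```python
def collect_images(ir):
    """Toy stand-in for extract_base_images(): gather container images
    from deploymentSpec.executors (assumed layout)."""
    execs = ir.get("deploymentSpec", {}).get("executors", {})
    return sorted({e.get("container", {}).get("image")
                   for e in execs.values()
                   if e.get("container", {}).get("image")})

pipeline_doc = {"deploymentSpec": {"executors": {
    "exec-train": {"container": {"image": "python:3.11"}}}}}
platform_doc = {"deploymentSpec": {"executors": {
    "exec-serve": {"container": {"image": "ubi9/python-311"}}}}}

print(collect_images(pipeline_doc))  # → ['python:3.11']  (platform image missed)

merged = {"deploymentSpec": {"executors": {
    **pipeline_doc["deploymentSpec"]["executors"],
    **platform_doc["deploymentSpec"]["executors"]}}}
print(collect_images(merged))  # → ['python:3.11', 'ubi9/python-311']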
scripts/lib/kfp_compilation.py (Outdated)
```diff
 compiler_class().compile(func, output_path)
 with open(output_path) as f:
-    return yaml.safe_load(f)
+    docs = list(yaml.safe_load_all(f))
```
I think we don't need to cast it to a list; we can just use `safe_load_all` as it is and check its instance type.
Also:
- Throw an exception in case docs is nil or empty.
- Return docs if the instance is a dict.
- Else check which index of docs is the pipeline spec (by checking which doc contains the key `pipelineInfo`).
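One reading of the suggestion above, sketched roughly (the function name and error messages are hypothetical, not code from the PR):

```python
import io
import yaml

def load_pipeline_spec(stream):
    """Iterate safe_load_all lazily and return the document identified
    as the pipeline spec by the presence of the pipelineInfo key."""
    found_any = False
    for doc in yaml.safe_load_all(stream):
        found_any = True
        if isinstance(doc, dict) and "pipelineInfo" in doc:
            return doc
    if not found_any:
        raise ValueError("compiled YAML is empty")
    raise ValueError("no document contains pipelineInfo")

text = "platforms:\n  kubernetes: {}\n---\npipelineInfo:\n  name: demo\n"
spec = load_pipeline_spec(io.StringIO(text))
print(spec["pipelineInfo"]["name"])  # → demo
```

Note this returns the pipeline spec alone, which is exactly the trade-off debated above: images that live only in the platform spec document would not be seen.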
What was going wrong
When the KFP compiler produced a multi-document YAML file (pipeline spec + platform spec), we only read the first document with yaml.safe_load(f). Base images that lived only in the second document were never passed to extract_base_images(), so validation either missed those images or failed later when something expected them.
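The failure mode is easy to demonstrate: `yaml.safe_load()` rejects a multi-document stream outright (it raises a `ComposerError`), while `yaml.safe_load_all()` yields every document. A minimal repro with a made-up two-document stream:

```python
import yaml

multi_doc = "pipelineInfo:\n  name: demo\n---\nplatforms:\n  kubernetes: {}\n"

# safe_load expects exactly one document and fails on a multi-doc stream.
try:
    yaml.safe_load(multi_doc)
except yaml.YAMLError as exc:
    print("safe_load failed:", type(exc).__name__)

# safe_load_all yields each document in turn.
docs = list(yaml.safe_load_all(multi_doc))
print(len(docs))  # → 2
```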
What we changed
This PR switches to `yaml.safe_load_all(f)` and, when there are multiple documents, merges `deploymentSpec`, `root`, and `components` from all of them into a single dict. That merged dict is what we pass to `extract_base_images()`, so validation now sees images from both the pipeline spec and the platform spec. Only the read/merge logic in kfp_compilation.py changed; the compiler write path and validate_base_images.py are untouched.