feat: Add script to get all upstream images#1418

Open
mvlassis wants to merge 3 commits into main from kf-8550-gather-upstream-images

@mvlassis mvlassis commented Mar 19, 2026

Closes #1417.

This PR adds a directory under scripts with a new Python script to gather all images used in a specified release of upstream Kubeflow.

The script is heavily based on https://github.com/kubeflow/manifests/blob/0837fb9cf3ec73f51cbddff656a160cb258eaad5/tests/trivy_scan.py. I tried to introduce as few changes as possible:

  • Add a new function to temporarily clone the kubeflow/manifests repository and operate from there
  • Change the default location of the script to be in a directory where the script is called from
  • Remove the vulnerability scanning part, which is not relevant in this case
  • Add a new --skip argument to skip specific working groups. This is useful since we probably want to skip controllers that we do not deploy as charmed operators.
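
The effect of the --skip argument can be illustrated with a minimal sketch. The function name, the shape of the mapping, and the example entries are hypothetical and are not taken from the script in this PR; rejecting unknown names is a design choice of the sketch, not something the PR states.

```python
def filter_working_groups(wg_dirs, skip_list=None):
    """Return the working-group mapping without the skipped entries.

    wg_dirs is assumed to map a working-group name to the directories
    that should be scanned for images.
    """
    skip = set(skip_list or [])
    # Fail loudly on a misspelled name instead of silently scanning everything.
    unknown = skip - set(wg_dirs)
    if unknown:
        raise ValueError(f"Unknown working groups: {sorted(unknown)}")
    return {wg: path for wg, path in wg_dirs.items() if wg not in skip}


# Hypothetical example data, for illustration only.
example = {"group-a": "apps/a/upstream", "group-b": "apps/b/upstream"}
remaining = filter_working_groups(example, ["group-b"])
```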

I would probably reformat the script further, but I suggest we keep the changes as minimal as possible so that we do not deviate much from the upstream script.

I also added a README.md with instructions on how to use the script.

@mvlassis mvlassis requested review from dariofaccin, deusebio and misohu and removed request for misohu March 19, 2026 12:59

@dariofaccin dariofaccin left a comment

I left some suggestions for improvement. Note that upstream CI is utterly broken even on master, and it has probably never worked (and won't work) on any tag.

From inside this directory:

```shell
python3 extract_images.py
```

Suggested change:
- python3 extract_images.py
+ python3 get_upstream_images.py

suggestion(blocking): probably a typo, but the script name in the README doesn't match the actual script.

try:
    clone_cmd = ["git", "clone", "--depth", "1"]
    if version != "latest":
        tag = f"v{version}"

suggestion(blocking): do not hardcode the v prefix. I tried with the 26.03-rc.1 tag and the script failed:

python3 get_upstream_images.py 26.03-rc.1
Cloning repository: https://github.com/kubeflow/manifests.git
ERROR: Failed to clone repository. Details: Command '['git', 'clone', '--depth', '1', '--branch', 'v26.03-rc.1', 'https://github.com/kubeflow/manifests.git', 'manifests']' returned non-zero exit status 128.

I would also suggest not limiting the script to tags, but supporting branches as well. The difference in the code is negligible, but it makes the script more flexible.
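
A minimal sketch of this suggestion, assuming the user-supplied reference is passed to git unchanged. `git clone --branch` accepts both branch names and tags, so no prefix needs to be added. The function names and the temporary-directory handling are hypothetical and not taken from the PR.

```python
import subprocess
import tempfile

REPO_URL = "https://github.com/kubeflow/manifests.git"


def build_clone_cmd(ref, dest, repo_url=REPO_URL):
    """Build a shallow-clone command for a tag or a branch.

    The ref is used exactly as given, with no "v" prefix added.
    """
    return ["git", "clone", "--depth", "1", "--branch", ref, repo_url, dest]


def clone_ref(ref):
    """Clone the given ref into a new temporary directory and return its path.

    The caller is responsible for removing the directory afterwards.
    """
    dest = tempfile.mkdtemp(prefix="manifests-")
    subprocess.run(build_clone_cmd(ref, dest), check=True)
    return dest
```

With this shape, `clone_ref("26.03-rc.1")` and `clone_ref("master")` would go through the same code path.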


def validate_semantic_version(version):
    # Validates a semantic version string (e.g., "0.1.2" or "latest").
    regex = r"^[0-9]+\.[0-9]+\.[0-9]+$"

suggestion(blocking): this regex doesn't properly validate a semantic version. For instance, it fails to validate both 1.11.0-rc.1 and 26.03-rc.1. I put the regex into a validator, and 26.03 doesn't match either.
Starting from the suggested SemVer regex, I built one that also accepts versions with only major.minor groups, as well as the v prefix:

^v?(?P<major>0|[1-9]\d*)\.(?P<minor>|[0-9]\d*)\.?(?P<patch>|[0-9]\d*)(?:-(?P<prerelease>(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\+(?P<buildmetadata>[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$

It's fairly complex and it can be checked here.
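
As a quick check, the two patterns can be compared on the version strings mentioned in this thread. This is only a sketch: the constant and function names are hypothetical, and it tests whether a string matches, not what the named groups capture (for two-part versions such as 26.03, the groups may not hold what their names suggest).

```python
import re

# Pattern from the diff under review.
STRICT = re.compile(r"^[0-9]+\.[0-9]+\.[0-9]+$")

# Pattern proposed in this comment, copied verbatim and split across lines.
RELAXED = re.compile(
    r"^v?(?P<major>0|[1-9]\d*)\.(?P<minor>|[0-9]\d*)\.?(?P<patch>|[0-9]\d*)"
    r"(?:-(?P<prerelease>(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)"
    r"(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?"
    r"(?:\+(?P<buildmetadata>[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$"
)


def check(version):
    """Return (matches_strict, matches_relaxed) for a version string."""
    return bool(STRICT.match(version)), bool(RELAXED.match(version))


for candidate in ["1.10.0", "1.11.0-rc.1", "26.03-rc.1", "26.03", "v1.10.0", "invalid"]:
    print(candidate, check(candidate))
```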

def extract_images(version, skip_list=None):
    if skip_list is None:
        skip_list = []
    version = validate_semantic_version(version)

suggestion(blocking): I would run the semver validation before cloning the repo.
You could leverage the type keyword of the argument to perform the validation directly in the parser. If you raise an argparse.ArgumentTypeError, you also get this nice output:

usage: get_upstream_images.py [-h] [--skip [SKIP ...]] [version]
get_upstream_images.py: error: argument version: Invalid semantic version: 'invalid'
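
A minimal sketch of this approach follows. The pattern is a simplified stand-in rather than the regex proposed in this review, and the helper names are hypothetical. Because argparse also passes a string default through the type callable, the sketch accepts "latest" explicitly; that would become unnecessary if the argument were made required.

```python
import argparse
import re

# Simplified, hypothetical pattern: optional "v", major.minor, optional
# patch, optional pre-release suffix.
VERSION_RE = re.compile(r"^v?\d+\.\d+(?:\.\d+)?(?:-[0-9A-Za-z.-]+)?$")


def version_type(value):
    """argparse type callable: return the value unchanged or reject it."""
    if value == "latest" or VERSION_RE.match(value):
        return value
    raise argparse.ArgumentTypeError(f"Invalid semantic version: {value!r}")


def build_parser():
    parser = argparse.ArgumentParser(prog="get_upstream_images.py")
    parser.add_argument("version", nargs="?", type=version_type, default="latest")
    parser.add_argument("--skip", nargs="*", default=[])
    return parser
```

Since the validation happens while arguments are parsed, an invalid value makes the parser exit before any repository is cloned.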

"version",
nargs="?",
type=str,
default="latest",

suggestion(blocking): I would default to master.

Uhm, we can probably just make this argument required. The user should definitely know which images they want to fetch.

@deusebio deusebio left a comment

The script looks fine, but Dario provided sensible comments, especially to make it compatible with branches.

My only concern is that this may also require effort to keep the script in sync with upstream (or with a given branch). I'm wondering whether we could:

  1. Fetch the file from upstream (even just by getting the raw file from the given branch)
  2. Import the custom mappings (e.g. wg_dirs) from that file, either to use them directly or as a sanity check that we are still "compatible"
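
One way to sketch the two steps above without executing upstream code is to read the mapping out of the file's source text with the ast module. This assumes that wg_dirs is assigned a plain literal in the upstream file, which is not confirmed here. The helper names and the sample source are hypothetical; the file path is the one linked in the PR description.

```python
import ast

# Raw-file URL template; the ref can be a branch, a tag, or a commit.
RAW_URL = "https://raw.githubusercontent.com/kubeflow/manifests/{ref}/tests/trivy_scan.py"


def raw_url(ref):
    """Return the raw-file URL of the upstream script for a given ref."""
    return RAW_URL.format(ref=ref)


def extract_mapping(source, name="wg_dirs"):
    """Find an assignment to `name` in the source and evaluate it as a literal."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == name:
                    # literal_eval only accepts literals, so no code is executed.
                    return ast.literal_eval(node.value)
    raise LookupError(f"{name} not found in source")


# Hypothetical stand-in for the downloaded file contents.
SAMPLE = '''
wg_dirs = {
    "example-wg": "apps/example/upstream",
}
'''

upstream = extract_mapping(SAMPLE)
```

A sanity check could then compare the keys of the extracted mapping with the ones hardcoded in the local script and fail if they differ.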

Development

Successfully merging this pull request may close these issues.

Add script to gather upstream images