Skip to content

Conversation

@sameerdattav
Copy link

Fixes #221.

Summary

Adds a strict control-plane version compatibility check for the Trainer
Kubernetes backend. The check is fail-fast, user-friendly, opt-out, and
covered by unit tests.

Approach

  • Runs from KubernetesBackend.__init__ after CoreV1Api creation
  • Uses namespace from KUBEFLOW_SYSTEM_NAMESPACE (default: kubeflow)
  • Reads kubeflow-trainer-public ConfigMap (kubeflow_trainer_api_version)
  • Compares against local version from importlib.metadata
  • Raises RuntimeError on mismatch or missing control plane
  • Skippable via KUBEFLOW_TRAINER_SKIP_VERSION_CHECK

Tests

  • Added coverage for success, mismatch, missing ConfigMap/data, and skip path
  • Updated existing tests to account for verification during backend init
  • pytest passes locally

Thanks @RavinduWeerakoon for the initial approach and discussion.

Copilot AI review requested due to automatic review settings January 10, 2026 18:56
@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign electronic-waste for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions
Copy link

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
  • Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@sameerdattav sameerdattav changed the title Enforce strict backend version compatibility checks chore: Enforce strict backend version compatibility checks Jan 10, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a strict version compatibility check between the Kubeflow Trainer SDK and the Kubernetes control plane. The check runs automatically when initializing the KubernetesBackend and compares the local kubeflow-trainer-api package version against the version published in a ConfigMap by the control plane. Users can opt out via the KUBEFLOW_TRAINER_SKIP_VERSION_CHECK environment variable.

Changes:

  • Added verify_backend() method to KubernetesBackend that performs version compatibility checks
  • Added abstract verify_backend() hook to the base RuntimeBackend class
  • Comprehensive test coverage for success, mismatch, missing ConfigMap, and skip scenarios
  • Updated existing test fixtures to mock the version check behavior

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

File Description
kubeflow/trainer/backends/kubernetes/backend.py Implements version verification logic that reads control-plane version from ConfigMap and compares with local SDK version
kubeflow/trainer/backends/base.py Adds optional verify_backend() hook to the RuntimeBackend abstract class with default no-op implementation
kubeflow/trainer/backends/kubernetes/backend_test.py Adds comprehensive test coverage for version check scenarios and updates existing fixtures to mock version check dependencies
kubeflow/trainer/api/trainer_client_test.py Updates test mocks to include ConfigMap and version metadata mocking for backend initialization

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Surya Sameer Datta Vaddadi <137607947+sameerdattav@users.noreply.github.com>
version published by the control plane in a public ConfigMap.

The verification can be skipped entirely by setting the
``KUBEFLOW_TRAINER_SKIP_VERSION_CHECK`` environment variable to any
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason to introduce this env variable?
I would suggest that we always check the version for now.

Comment on lines 131 to 140
if local_version != server_version:
raise RuntimeError(
"Kubeflow Trainer SDK version mismatch detected: local "
f"kubeflow-trainer-api=={local_version}, control plane "
f"kubeflow-trainer-api=={server_version}. The control plane "
"may be outdated or mismatched with this SDK. Please upgrade "
"either the SDK or the control plane so the versions match, "
"or set KUBEFLOW_TRAINER_SKIP_VERSION_CHECK=1 to bypass "
"this verification (not recommended)."
)
Copy link
Member

@andreyvelich andreyvelich Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, let's do logger.warning() if version of control plane is incompatible or config map is missing.
We need to release the Trainer v2.2 where we add this ConfigMap.

Comment on lines 122 to 124
try:
local_version = metadata.version("kubeflow-trainer-api")
except Exception as e:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need this check, since this package is always installed.

"KUBEFLOW_TRAINER_SKIP_VERSION_CHECK=1 to skip this check."
) from e

if local_version != server_version:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess, we should do this check:

  1. For the minor version it should: server_version == local_version
  2. For the patch version it should: server_version >= local_version

Since we don't introduce new features for the patch versions.

We might also want to introduce strict dependency for minor version here
(e.g. kubeflow-trainer-api~=2.1.0): https://github.com/kubeflow/sdk/blob/7797f35dfe8dce1993e968364c17c59a6a422697/pyproject.toml#L31C4-L31C24

What do you think @kubeflow/kubeflow-sdk-team @kubeflow/kubeflow-trainer-team ?

Let's talk about it more at the Kubeflow SDK call: https://docs.google.com/document/d/1jH2WAX2ePxOfI4JuiVK9nPlesDMiyg67xzLwhpR7wTQ/edit?tab=t.0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sameerdattav We briefly talked about it at the latest call: https://youtu.be/zwPVfB8tTuY?t=2254
Initially, let's just verify that ConfigMap with the appropriate info (e.g. kubeflow_trainer_version) exists.
In the future, we can discuss how we can verify the appropriate version in TrainerClient.

Comment on lines 110 to 111
if data and isinstance(data, dict):
server_version = data.get("kubeflow_trainer_api_version")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the server_version == dev, you should skip the check.

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>
@google-oss-prow google-oss-prow bot added size/XXL and removed size/L labels Jan 30, 2026
Signed-off-by: Surya Sameer Datta Vaddadi <137607947+sameerdattav@users.noreply.github.com>
@sameerdattav
Copy link
Author

@andreyvelich, there was some issue with git and line endings, so i went ahead and opened up a new pr with a clean slate.
this implements the minimal version that you were talking about on the call.
#250

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ensure Control Plane Availability During Kubeflow SDK Client Creation

2 participants