
Conversation

@robert-bell
Contributor

What this PR does / why we need it:

This is a KEP for adding real-time progress and exposing training metrics on TrainJob.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Part of #2779


### Runtime: emitting trainer status messages

The "primary" training runtime pod will write trainer status messages to stdout in the following format:
Contributor

We should consider taking inspiration from / aligning with the metricsCollectorSpec mechanism from the Katib API.

We can prioritize support for the StdOut collector type for the first implementation; that'll still give the API enough flexibility and consistency across components.

Member
@abhijeet-dhumal Oct 29, 2025

Hey @astefanutti, please correct me here:
As described in Katib's natively supported metrics-collector approaches (pull/push based) here: https://www.kubeflow.org/docs/components/katib/user-guides/metrics-collector/
there are 3 collector types (StdOut / TensorFlowEvent / File).
So for the initial phase we can start by adding a metricsCollectorSpec.kind field to our API now, keeping simple regex parsing handled by a metrics-collector sidecar, something like this, which is an exact match for Katib's approach:

spec:
  metricsCollectorSpec:
    kind: StdOut
    source:
      filter: '([\w|-]+)=([+-]?\d+\.?\d*)'

and then we can introduce the File/TensorFlowEvent and custom metrics-collector approaches in later phases?

Contributor

@abhijeet-dhumal exactly, I suggest the KEP define the API accordingly, but not all the collector types have to be supported in the first implementation.

Member

Yeah, that's really an awesome insight. Thanks @astefanutti 🙌
We will update the proposal promptly!

Contributor Author

Ah I wasn't aware of the metricsCollectorSpec in Katib. Thanks @astefanutti for pointing it out!

Are you thinking about aligning with the Katib approach for consistency and flexibility, or about making it easier for train jobs to send metrics to Katib (e.g. the same metric collection approach can be used by both Katib and KFT)? Or possibly both or something else entirely.

If making it easier for Katib to consume the metrics, would it make sense to consider extending Katib to watch the TrainJob resource instead, possibly from a sidecar or maybe directly in the Katib control plane? The control plane approach could be particularly appealing for user experience as it would mean they would not need to modify their training code at all (provided they use a supported runtime), or depend on the Katib SDK. The train job service account would also not need any additional RBAC permissions.

If you're thinking about consistency and flexibility, do you have an example in mind where the extra flexibility would be beneficial, or where the proposed approach would be limiting? I agree there's a need for flexibility, but I wonder whether that flexibility can be moved entirely into the runtime code? This may be easier to maintain and extend to add support for other ML frameworks, and would allow us to keep the KFT controller implementation simpler. We could even package some helper utilities into a Python package (e.g. kubeflow-utils) to help users instrument their code.

Contributor
@astefanutti Oct 31, 2025

Yes, I'm thinking metrics collection should become the responsibility of the trainer / TrainJob API so these are exposed in a standard way for Katib to consume.

The Trainer would collect metrics from runtimes according to the metricsCollectorSpec pattern, and expose them in a standard way, e.g. in the TrainJob status. Then the Katib trial controller would watch the TrainJob to retrieve the metrics related to the experiment. This would allow Katib to operate without a database.

The metricsCollectorSpec pattern could be configured in the ClusterTrainingRuntime in a way that's compatible with the runtimes. For custom trainers, the SDK would be responsible for customizing it depending on the provided trainer.

Contributor Author

Thanks for the clarification @astefanutti. I've looked at Katib's metricsCollectorSpec and thought about it in more detail, and I don't think we need to or should use the same pattern here.

Rather than putting flexibility in the control plane, we're proposing to put it in the runtime pods. Each ML framework would have different instrumentation code, but all frameworks would communicate using the same "protocol" (e.g. the specific log messages). This means that we don't need any flexibility in the control plane.

The advantages of using code in the runtime are

  • much more flexibility than can be provided by metricsCollectorSpec,
  • it's much easier for data scientists to consume: configuration is all in code, rather than split across a YAML file and code,
  • the control plane can be kept simpler. In particular, I think Katib needs a sidecar to collect the metrics.

To make it easier for users to instrument their code, we can leverage the SDK to automatically instrument submitted training jobs if the framework supports it (e.g. for frameworks that have callback mechanisms).

For frameworks that don’t support callbacks, or for users that aren’t creating jobs via the sdk, we could provide a runtime utility library that implements the protocol and allow users to add the instrumentation themselves (e.g. it could be embedded in their runtime images). Or they are free to just not instrument their training loop.

Contributor

@robert-bell thanks that sounds great!
I should have expressed the requirement for flexibility rather than jumping directly to the implementation details :)

Contributor Author

I haven't included something like metricsCollectorSpec in the updated approach, but I think we have all the flexibility we need without it. @astefanutti do take a look.

}


type TrainJobTrainerStatus struct {
Contributor

So #2909 and this PR seem to conflict with each other.

In this one, we want to make TrainJob a general solution for deep learning training.

But in #2909, we are hoping to make TrainJob a general solution for HPC workloads (not just deep learning frameworks).

cc @vsoch

I'm not sure it's blocking, as this project is called TrainJob, but I do want to call out this discrepancy.

cc @andreyvelich

Contributor

I think the intention is to better support training jobs that use HPC technologies (MPI, topology and fine-grained scheduling, etc.). It's no more strange than having the MPI plugin.

Contributor Author

@kannon92 is the conflict that #2909 would allow TrainJob to be used for more general hpc/non-training workloads, but this proposal is specifically for model training workloads?

If so, I'm not sure it's a problem because, as you say, TrainJob is designed for model training. Plus, the extra status information we're proposing is optional, so users can use TrainJob for more general HPC workloads if they want.

Also just to clarify, this proposal isn't limited to just deep learning frameworks. Other training frameworks, e.g. XGBoost, LightGBM, should be able to use it too.

Member

I agree with @robert-bell, the metrics status might also be very useful for HPC jobs. We just need to figure out how to make these statuses agnostic to the job you run.

On the other hand, TrainJob's APIs are more appropriate for the training jobs (e.g. trainer, initializer, potentially exporter/checkpointer cc @rst0git).
At some point, we need to discuss whether these abstractions make sense for HPC jobs or we need to re-think something.

Contributor

These should map quite well to HPC use cases! We tend to have Figures of Merit (FOM) that are derived from different steps or stages. There are often multiple FOM. The metrics look relevant here with the exception of epoch - not everything you run in HPC is going to have an epoch, but AI/ML jobs run with MPI will.

Is there a reason to hard code metrics types (e.g., TrainMetric vs EvalMetric) versus having a common metric type, and then within that type, you could have an optional variable for the context? That would be more flexible to different kinds of applications I think, and still work well for train/eval.

Contributor

That could work! I was also thinking:

trainingStatus:
  metrics:
    - name:  loss
      value: 0.8
      type: train
    - name: auroc
      value: 0.7
      type: train

I like your design because finding a type is an O(1) lookup instead of needing to loop through an entire list (which could be large).

Contributor Author

@vsoch I've updated the schema to the metrics group approach.

Member

@robert-bell @vsoch @astefanutti Shall we consider the list instead of map as @vsoch proposed?
I understand that map gives us O(1) access, but it has some other problems in Kubernetes API design.

Additionally, it would be hard to do strategic merge patches on maps.

Maybe @tenzen-y, @kannon92, or @ahg-g have insights on why we introduced a slice for the ReplicatedJobs API in JobSet: https://github.com/kubernetes-sigs/jobset/blob/main/api/jobset/v1alpha2/jobset_types.go#L245-L248

Contributor
@kannon92 Dec 23, 2025

Kubernetes API conventions discourage the use of maps.

Kube API Linter even has a nomaps linter.

@robert-bell changed the title from "KEP-2779: Track TrainJob progress and expose training metrics" to "feat(docs): KEP-2779: Track TrainJob progress and expose training metrics" on Oct 31, 2025
@coveralls commented Oct 31, 2025

Pull Request Test Coverage Report for Build 21214073338

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 19 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.1%) to 51.385%

Files with coverage reduction: pkg/runtime/core/trainingruntime.go: 19 new missed lines, 69.94% coverage.

Totals coverage status:
  • Change from base Build 21036734925: 0.1%
  • Covered Lines: 1243
  • Relevant Lines: 2419

💛 - Coveralls

Member
@szaher left a comment

Thanks very much @robert-bell @abhijeet-dhumal
This is a great proposal, it brings a lot of value:

  • Excellent user experience: the goal is perfect. A data scientist seeing PROGRESS: 45% and ETA: 15m is a massive win.
  • No dependency on external components like Prometheus or sidecars
  • Nice SDK integration with the transformers library

but it has a few problems too:

  • Architectural anti-pattern: the control plane accesses the data plane and streams logs
  • While the footprint of the goroutines on the operator itself is minimal, it burdens the control plane by maintaining a persistent GET /api/v1/namespaces/.../pods/.../log?follow=true connection for every single active TrainJob (so scalability to 1000 jobs is questionable)
  • Broad security permissions for the training operator to access cluster-wide pod logs
  • Error-prone: if training is verbose, some datasets might have very long example lines, or there might be extremely long log messages

Another proposed solution is as follows:
Enable the KubeflowCallback to send status to the TrainJob directly via the Kubernetes API. When the operator starts a new training job, it creates a ServiceAccount and injects it into the master/primary pod, letting the callback report the status cleanly and patch or update the TrainJob directly with the required status updates.
This approach is much cleaner:

  • No log parsing
  • No goroutines
  • No impact on the kube-apiserver from fetching and streaming logs
  • Only the master/primary pod can update the status; worker pods will detect that they lack the permissions and skip updating the status
  • Let the actual train job update itself

Overall, this is excellent and I think we should push it forward.


### Goals

1. **Expose real-time progress information and training metrics through the TrainJobs CR.** For example: percentage complete, estimated time remaining, current step/epoch, total steps/epochs, eval metrics.
Member

I would keep training and eval metrics outside the status, as data scientists have dedicated tools for this like MLflow, TensorBoard, wandb, etc.
Also, it seems the proposed solution will capture only the last value from the logs, which isn't very helpful for the data scientist or ML persona, who will prefer something like TensorBoard to understand what happened during their training process.
This is unlike Katib or HPO, where we have an optimization problem and the final metrics are important.

Contributor
@astefanutti Nov 3, 2025

Also, it seems the proposed solution will capture only the last value from the logs, which isn't very helpful for the data scientist or ML persona, who will prefer something like TensorBoard to understand what happened during their training process.

I think the main target of this KEP is client integration, e.g. a CLI (e.g. kubectl), a GUI, or Katib that would consume those metrics, not the data scientist persona, and it's indeed not the goal to overlap with tools like TensorBoard.

This is unlike Katib or HPO, where we have an optimization problem and the final metrics are important.

Exactly, Katib would indeed be a candidate client to consume those metrics to carry out the optimizations.

Contributor Author

I've updated the non-goals in the second draft to try to clarify that this feature is not replacing these other tools, but is trying to provide something that works out of the box without needing any additional components installed.


### Story 2: Data Scientist / ML Engineer monitoring training jobs

As a data scientist or ML Engineer, I want to see real-time information about my training jobs so I can make decisions on whether those jobs are likely to succeed or whether intervention is required.
Member

Will the controller always stream the logs to provide real-time information?

  • How does that scale?
  • Is this design an anti-pattern, since the control plane accesses the data plane? The control plane's job is to make decisions. It includes the API server, schedulers, and your controller/operator. Its job is to "establish and enforce policy." It is not designed for high-throughput, high-volume data processing; it's designed for state reconciliation.

Contributor Author

It's a fair point that the log streaming approach is mixing the control plane with the data plane, though I wonder what the max throughput would be in practice. I'd anticipate that applications would add at most a few lines to the logs per second, meaning throughput from a single job would be ~kB/s, and throughput from 1000 active train jobs may be ~MB/s, which should not be particularly onerous on the controller. This is something that could be benchmarked, of course.

Contributor Author

The new http request based approach should hopefully address your scalability concerns.

Comment on lines 220 to 221
trainMetrics: dict[str, float] | None
evalMetrics: dict[str, float] | None
Member

Those are very generic and not well defined. What metrics are we expecting? Since we are going to develop the callback in the SDK, should this be more well defined or structured, to identify which metrics will be available?

I believe the last value of the training metrics may not be of much value to data scientists, who are more interested in how a specific metric behaves over the training period. Also, this overlaps with things like MLflow, TensorBoard, etc., which we already have a proposal to include in Kubeflow here: kubeflow/community#892

progressPercentage, estimatedRemainingSeconds, currentEpoch, etc. already give very good information about the training process, so I suggest we get rid of trainMetrics and evalMetrics in favor of MLflow

Contributor

I believe the last value of the training metrics may not be of much value to data scientists, who are more interested in how a specific metric behaves over the training period. Also, this overlaps with things like MLflow, TensorBoard, etc., which we already have a proposal to include in Kubeflow here: kubeflow/community#892

As discussed in the comment above, the main audience/target of this KEP is not the data scientist persona, and it's indeed not the goal to overlap with tools like TensorBoard, nor with the scope of experiment tracking.

The main target of this KEP is client integration, e.g. GUI or Katib that would consume those metrics.

Contributor Author

@szaher I've updated the goals/non-goals to clarify that we're not trying to replace MLFlow. Does that address your concerns?

@astefanutti
Contributor

Another proposed solution is as follows: enable the KubeflowCallback to send status to the TrainJob directly via the Kubernetes API. When the operator starts a new training job, it creates a ServiceAccount and injects it into the master/primary pod, letting the callback report the status cleanly and patch or update the TrainJob directly with the required status updates. This approach is much cleaner:

  • No log parsing
  • No goroutines
  • No impact on the kube-apiserver from fetching and streaming logs
  • Only the master/primary pod can update the status; worker pods will detect that they lack the permissions and skip updating the status
  • Let the actual train job update itself

To add to this analysis, given the TrainJob "primary" Pods run arbitrary user code and extra packages, granting these Pods privileged access to the API server would warrant hardened security sandboxing.

@robert-bell
Contributor Author

Another proposed solution is as follows: enable the KubeflowCallback to send status to the TrainJob directly via the Kubernetes API. When the operator starts a new training job, it creates a ServiceAccount and injects it into the master/primary pod, letting the callback report the status cleanly and patch or update the TrainJob directly with the required status updates. This approach is much cleaner

To add to this analysis, given the TrainJob "primary" Pods run arbitrary user code and extra packages, granting these Pods privileged access to the API server would warrant hardened security sandboxing.

Granting permissions so the train job itself can directly update the training status comes with a few more downsides:

  • The controller would also need to create a RoleBinding for the ServiceAccount, which may need some care to avoid privilege escalations (I'm not too sure of the details).
  • The controller may need to provide additional ways to configure the service account so users can, e.g., grant additional permissions to it (e.g. if the TrainJob needs to call the K8s API), or configure out-of-cluster permissions, e.g. for AWS IRSA.
  • If users are using a custom service account for their train job, they would need to manually add a role binding to allow the service account to update the train job.
  • The training runtime pods would require a Kubernetes client to be available. Whilst this could be done (e.g. installing it at the start of the job, or by using a sidecar containing the client to export the progress updates), it adds some complexity.
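
For context, the RBAC objects the first bullet alludes to would look roughly like the sketch below (illustrative only; the object and ServiceAccount names are assumptions, and the exact verb list plus any resourceNames scoping to the single TrainJob would still need to be worked out):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: trainjob-status-writer        # hypothetical name
rules:
  - apiGroups: ["trainer.kubeflow.org"]
    resources: ["trainjobs/status"]
    verbs: ["get", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: trainjob-status-writer        # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: trainjob-status-writer
subjects:
  - kind: ServiceAccount
    name: my-trainjob-sa               # the ServiceAccount injected into the primary pod (assumed)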

@andreyvelich
Member

Hi folks, thanks again for this effort! This feature should significantly improve the experience of AI workloads on Kubernetes, and has a lot of future potential for checkpointing, scheduling, etc. 🎉

@astefanutti @robert-bell Do we need to update KEP based on our conversation with @astefanutti and @tenzen-y at KubeCon, so we can review and move this forward?

cc @romanbaron @EkinKarabulut @Ronkahn21 as we discussed, this should address the problems with detecting the running status of a TrainJob mentioned by Ron in this talk: https://sched.co/28D8Z

@robert-bell
Contributor Author

Hey @andreyvelich! Thanks for chasing, and apols for the silence.

@astefanutti @robert-bell Do we need to update KEP based on our conversation with @astefanutti and @tenzen-y at KubeCon, so we can review and move this forward?

I'm working on an update to bring in things from that discussion. I'm aiming to find some time this week.

@google-oss-prow bot added the size/XL label and removed the size/L label on Dec 1, 2025
@robert-bell
Contributor Author

Hi folks, @astefanutti @andreyvelich @tenzen-y,

I've updated the proposal with a new design based on the discussions from Kubecon, and looked to incorporate some of the ideas from the first draft. The changes are put into separate commits to make it easier to see what's changed.

The main change is that I've replaced the log watching mechanism with HTTP calls. I've actually made two variants of the proposal - a push-based one (runtime sends to control plane) and a pull-based one (control plane scrapes from runtime) - because they're similar but have different trade-offs. We only need one of the approaches though, and I thought this might be the easiest way to weigh up the merits of each and help us reach consensus on how best to move forward.

The other changes I've made are:

  • replaced the trainMetrics and evalMetrics fields with a more general MetricsGroup struct.
  • clarified the goals to make it clearer that we're not trying to replace tools like MLFlow.
  • updated the other considered alternatives section with a few more details.

Please let me know if any extra info is needed to help this move forward.

// for the metric and the value is the corresponding numeric value serialized
// as a string.
// +optional
Values map[string]string `json:"values,omitempty"`
Contributor Author

@vsoch following up on the other thread, I just wanted to highlight I'm making an assumption here that we only need to support scalar metrics - we can't (directly) support non-scalar things like confusion matrices etc.

Should we be considering more structure here to let us support non-scalar metrics in the future? E.g.

values:
  loss:
    type: scalar
    value: 0.9
  confusion_matrix:
    type: ConfusionMatrix
    ... more complicated struct

I know TFX has done some work here but it looks like it quickly gets very complicated -- see tf-metadata or tfma.

If we don't consider it now, though, it may be hard to add support for it later.


Member
@andreyvelich left a comment

Thanks for the updates @robert-bell!
I left a few comments.
Let's chat more about it during today's call if you can join: https://bit.ly/2PWVCkV

"""
read -r -d '' SCRIPT << EOM\n
import os
from transformers import TrainerCallback, trainer
Member

Do we have an open issue to introduce this callback to transformers: https://huggingface.co/docs/transformers/en/main_classes/callback#callbacks?

I am curious whether we should call this callback TrainerCallback, KFTrainerCallback, or KubeflowCallback?
Can this callback be useful for other jobs like OptimizationJobs?

Contributor Author

Yeah I think having a callback in transformers would be ideal.

I'd been holding off starting a discussion there until we'd got a rough consensus here on the general approach to take, particularly the idea of introducing these framework-specific custom trainers. I'd be keen to open an issue there though.

In terms of names, I'd gone with KubeflowTrainerStatusCallback but I appreciate it's a mouthful! Just FYI TrainerCallback is the name of the base class in the transformers library so we shouldn't use that.

Member

I prefer we name it KubeflowCallback or KubeflowTrainerCallback, since we might also want to use it for checkpointing in the future (cc @kramaranya @rst0git)

Are we planning to use this callback for other things, like for HP algorithms (e.g. Hyperband) or early stopping?
I know that sometimes algorithms make decisions during Trial execution.

Contributor Author

Yeah I don't have a strong opinion on the naming. I'd be happy to use either KubeflowCallback or KubeflowTrainerCallback.

I wonder whether from a maintenance perspective we might want to have separate callbacks for each bit of functionality (status, checkpointing, early-stopping)? We could for sure wrap them up in a single "main" callback and allow users to enable/disable each callback individually.

Member
@andreyvelich Dec 14, 2025

I wonder whether from a maintenance perspective we might want to have separate callbacks for each bit of functionality (status, checkpointing, early-stopping)?

Could we ask the HuggingFace community for that? I feel that starting with a unified KubeflowCallback might be easier.
Additionally, we can use kwargs to configure callback in a way we want: https://github.com/huggingface/transformers/blob/main/src/transformers/trainer_callback.py#L332

Additionally, I know that Transformers have other integrations for HP search, for example: https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/integration_utils.py#L296C5-L296C22


I prefer we name it KubeflowCallback or KubeflowTrainerCallback, since we might also want to use it for checkpointing in the future

Sounds like a good idea.


1. **Expose real-time progress information and training metrics through the TrainJobs CR.** For example: percentage complete, estimated time remaining, current step/epoch, total steps/epochs, eval metrics.
2. **Have zero dependencies.** The user should not need to install any additional components into their cluster for the feature to work. It should work "out-of-the-box".
3. **Optional.** Users can choose to opt in to providing progress tracking, but are not required to use this feature.
Member

Are we planning to control it via Trainer config or TrainJob API?

Contributor Author

For the push-based approach, it'd be controlled by the runtime - the user would opt in by instrumenting their code. The user wouldn't need to configure anything on the Trainer or TrainJob.

For the pull-based approach, users would need to instrument their runtime and also configure their (Cluster)TrainingRuntime so the control plane knows which pod to inject the sidecar into and then scrape, so there'd be an extra step. We could set up this config for the standard ClusterTrainingRuntimes.
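
For illustration, that extra configuration step might look roughly like the fragment below, reusing the trainer.kubeflow.org/trainjob-monitoring-step label proposed later in this KEP (the exact placement of the label and the surrounding fields are assumptions):

# Hypothetical (Cluster)TrainingRuntime fragment; label placement is assumed.
spec:
  template:
    spec:
      replicatedJobs:
        - name: node
          template:
            metadata:
              labels:
                # Marks the replicated job whose pod should receive the
                # monitoring sidecar and be scraped by the control plane.
                trainer.kubeflow.org/trainjob-monitoring-step: trainer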

Contributor

For the pull-based approach, users would need to instrument their runtime and also configure their (Cluster)TrainingRuntime so the control plane knows which pod to inject the sidecar into and then scrape

This could be done in the SDK via PodTemplateOverride by each specific trainer.

Contributor Author

Yeah that'd be a nice approach and would keep the implementation clean.

We might need to update the PodTemplateOverrides implementations because some of the options can't currently be used more than once, e.g. SpecLabels doesn't merge the labels, it only keeps the last label set applied, so we might either overwrite user-applied label overrides, or the user-applied label overrides might overwrite the progress tracking ones.

Member
@andreyvelich Dec 11, 2025

For the push-based approach, it'd be controlled by the runtime - the user would opt in by instrumenting their code

Do you mean we will always serve the endpoint, and when this endpoint is called, we will update the TrainJob status?

This could be done in the SDK via PodTemplateOverride by each specific trainer.

We talked yesterday with @robert-bell; that might not always work. Since we use a single Job template for the torch-distributed runtime, we can't inject the sidecar only into the MASTER pod (e.g. Job index = 0).

Contributor Author

Do you mean we will always serve the endpoint, and when this endpoint is called, we will update the TrainJob status?

Yeah, that's right. The endpoint would always be available. It's essentially the same model as MLflow - the server is always available, and the client submits the metrics whenever it wants to update them. It's the same server that's used by all TrainJobs in the cluster.

We talked yesterday with @robert-bell; that might not always work. Since we use a single Job template for the torch-distributed runtime, we can't inject the sidecar only into the MASTER pod (e.g. Job index = 0).

Just to clarify - I think it may still be possible to make it work, e.g. maybe the torch-distributed runtime has to be changed so there are two Jobs in it - one for the master pod which gets the sidecar, and another for all the other workers. I'm concerned about the overhead this'd add, though, to creating runtimes, especially if users want to create their own runtimes.

Member

Yeah, that's right. The endpoint would always be available. It's essentially the same model as MLflow - the server is always available, and the client submits the metrics whenever it wants to update them.

I am wondering if we should make this feature optional, and enabled by default in the config: https://github.com/kubeflow/trainer/blob/master/manifests/base/manager/controller_manager_config.yaml

We can also think about feature gates flags, like we do in JobSet (cc @kannon92 @astefanutti @tenzen-y): kubernetes-sigs/jobset#1096

Member

We should introduce FGs after beta graduation regardless of this feature.
Currently, Trainer is in the alpha stage. So, all features could be considered alpha, but after beta graduation, we want to have feature stages.

Contributor Author

I'd be happy to put this behind a feature gate, using the same approach as in JobSet. It makes sense to me to add the scaffolding for FGs sooner rather than later so it's there and ready to be used by other features.

When it comes to the implementation, would folks prefer a separate PR for the FG or include it in the same PR that adds the progress tracking?

Member

I am fine to introduce FG after we implement this feature. As @tenzen-y mentioned it would make sense to introduce it before we graduate Trainer API to beta.

1. **Expose real-time progress information and training metrics through the TrainJobs CR.** For example: percentage complete, estimated time remaining, current step/epoch, total steps/epochs, eval metrics.
2. **Have zero dependencies.** The user should not need to install any additional components into their cluster for the feature to work. It should work "out-of-the-box".
3. **Optional.** Users can choose to opt in to providing progress tracking, but are not required to use this feature.
4. **Provide built-in progress tracking support for selected ML frameworks (e.g. transformers, pytorch-lightning) in the kubeflow sdk.** Data scientists should be able to use the kubeflow sdk to create training jobs using these frameworks, and have progress tracking automatically instrumented.
Member

Shall we add a goal to introduce callbacks to third-party ML libraries like HuggingFace transformers to support our instrumentation?

Contributor Author
@robert-bell Dec 11, 2025

Yes, I think this should be a goal. I'll update it in a batch with any other changes.

// Empty if the status is unknown, e.g. the job has just started
// or the job is not instrumented to report its status.
// +optional
TrainerStatus *TrainJobTrainerStatus `json:"trainerStatus,omitempty"`
Member

What do we think about TrainerProgress API, since we can introduce InitializerProgress in the future?

Contributor Author

Are you thinking of renaming TrainerStatus => TrainerProgress? My thinking was that "progress" didn't really include metrics, so status might be a better name.

We could have InitializerStatus to contain initializer progress. Wdyt?

Member

Agree, let's finalize the API design here: #2905 (comment)



#### Security considerations

The control plane endpoint will be secured with TLS and token-based authentication.
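
(The excerpt above doesn't show how the token is issued to the runtime pod. One standard pattern, sketched below purely for illustration, is a projected service-account token with a dedicated audience mounted next to the CA bundle; the audience value and volume name are assumptions, not part of the KEP:)

# Illustrative only; the KEP excerpt does not specify the token mechanism.
volumes:
  - name: trainer-status-auth
    projected:
      sources:
        - serviceAccountToken:
            audience: trainer.kubeflow.org   # assumed audience name
            expirationSeconds: 3600
            path: token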
Member

@astefanutti @robert-bell Did you get a chance to check the in-place restart KEP from @GiuseppeTT ?
https://github.com/kubernetes-sigs/jobset/tree/main/keps/467-InPlaceRestart#implementation---permissions-for-the-agent-sidecar

Wondering if we can re-use similar ideas with ValidatingAdmissionPolicy

Contributor Author

Thanks for sharing this - I wasn't aware of the ValidatingAdmissionPolicy api!

I've had a look, but I'm not sure how we could secure this completely: we could use a ValidatingAdmissionPolicy to restrict it so only the trainer status field can be updated, but I'm not sure how we can restrict it so that nothing else running in the namespace can update the field? I think we'd need the train jobs to run using custom service accounts, which is something I was trying to avoid for the reasons I outlined in this comment.

I might have misunderstood something in the ValidatingAdmissionPolicy api though.

Comment on lines +282 to +261
- configmap:
name: <train-job-name>-tls-config
items:
- key: ca.crt
path: ca.crt
Member

Can other pods in this namespace re-use this Volume while the TrainJob is running?
If yes, that might allow an attacker to re-use the certificate in other pods to call the system endpoint.

Contributor Author

This is just the public CA cert so the runtime trusts the control plane endpoint's TLS certificate. It's not an issue if other pods have access to it; they can't do anything with it.

Comment on lines 327 to 314
```python
from urllib import request
import os
import ssl

payload = ...

url = os.environ["KUBEFLOW_TRAINER_STATUS_URL"]
ca_file = os.environ["KUBEFLOW_TRAINER_STATUS_CA_CERT"]
token = open(os.environ["KUBEFLOW_TRAINER_STATUS_TOKEN"]).read()
ssl_context = ssl.create_default_context(cafile=ca_file)
req = request.Request(url, data=payload, headers={"Authorization": f"Bearer {token}"})
request.urlopen(req, context=ssl_context)
```
Member

Can this instrumentation live in the Kubeflow SDK?
Looking at other callbacks, they do the same.

Contributor Author

Do you mean in the transformers library? If so, then yes it would be great to have it there!

I'm not sure it belongs in the Kubeflow SDK though - I was deliberately trying to avoid requiring users to install the SDK into their runtime images, to avoid the extra dependency and potential confusion (e.g. a different SDK version installed client-side than runtime-side).

Member

I feel like we still need to install kubeflow as a dependency of the transformers library.
Similar to other integrations, we might need to introduce a TrainerClient().is_available() method which checks that the correct version of the control plane is installed.

We discussed with @kubeflow/kubeflow-sdk-team that we want to introduce a ConfigMap that shows the version of the control plane, and the SDK should check that it is available on the cluster.

Otherwise, instrumentation might fail due to incompatible versions between the application (PyTorch) and the control plane (Trainer controller).

Contributor Author

Could we use the presence of environment variables injected by the control plane to determine whether kft is available? That would avoid the dependency on kubeflow, and I think it would be more reliable than checking in the SDK (we need to make sure the instrumentation in the runtime environment is compatible with the control plane).

We could also inject an environment variable containing the control plane version, or a version of the progress API that's supported. The check would maybe be something like this:

def is_kubeflow_trainer_available():
    return os.environ.get("KUBEFLOW_TRAINER_STATUS_API_VERSION") == "1"

I think if we can, we should try to avoid requiring kubeflow to be installed in the runtime because of the risk of confusion for users (e.g. it'll be confusing to debug if the runtime has a different kubeflow version than what the user has locally) and the risk of dependency conflicts with the runtime code.

Member

Could we use the presence of environment variables injected by the control plane to determine whether kft is available?

But that would require you to inject the version of the control plane into the TrainJob, wouldn't it?

I think if we can we should try to avoid requiring kubeflow to be installed in the runtime because of the risk of confusion for users

But why don't we want to control instrumentation in the kubeflow SDK itself? If we don't do this, we will need to always update the Transformers library for any changes. From my understanding, if users want to use transformers with KubeflowCallback, they must install it as pip install transformers[kubeflow], which will install the kubeflow SDK in any case.

Same as for other integrations: https://github.com/huggingface/transformers/blob/main/setup.py#L248

Contributor Author

But why don't we want to control instrumentation in the kubeflow SDK itself?

As I say, my concern is about needing the sdk installed in the runtime.

I know KFP are trying to remove the need to have the kfp sdk installed at runtime - see kubeflow/pipelines#9886 and kubeflow/pipelines#12081 (comment) - and I'm conscious that we may ultimately feel similar pains to what they're feeling, especially if the intention is for the kubeflow sdk to support a broader set of kubeflow components.

That all said, I don't have a strong objection to what you're proposing. It's certainly a pragmatic and easier approach to get things off the ground.

Member

I guess the downside of not introducing this instrumentation into the Kubeflow SDK is that we have to always update the transformers library to fix/change the callback and release a newer version, right? So users will have to update the transformers package in any case.

Could we check how other callbacks solve this issue in transformers?

from transformers import TrainerCallback, trainer
from urllib import request

class KubeflowTrainerStatusCallback(TrainerCallback):
Member

Why can't this live directly in the HF Transformers library?
We can contribute our custom callback into their library.

Contributor Author

Yeah, I think we should aim to get this to live in Transformers, and do similarly in other frameworks if we can.

Comment on lines 477 to 452
* Adding a new "transformers-distributed" ClusterRuntime which will be included in the default set of cluster runtimes included in the manifests.
* Publish new docker images for the "transformers-distributed" runtime "ghcr.io/kubeflow/trainer/transformers-runtime". The docker image will include the transformers, accelerate and torch python packages.
Member

We need this just to install the transformers library, right?

Contributor Author

Yes that's what I was thinking, though I'm not sure how helpful this is given users can just install the package at runtime. I'd be happy to drop this.

@andreyvelich
Member

/ok-to-test

//
// +kubebuilder:validation:Minimum=0
// +optional
EstimatedRemainingSeconds *int64 `json:"estimatedRemainingSeconds,omitempty"`
Member

Any assumption to justify that we should use int64 instead of int32?

Contributor Author

I went with int64 to be on the safe side, but I'll happily update it to int32.

int32 gives a max remaining time of ~68 years which should be plenty big enough, but my thinking was users might see overflow for badly configured jobs - e.g. user picks 25k train steps and the first step takes 24hrs ==> estimated completion time would be 68 years.

The +kubebuilder:validation:Minimum=0 tag should mostly guard against overflow though so I'm happy to simplify it to int32.

Member

@tenzen-y Do you see any concerns with int64? I can see that core API also uses it: https://github.com/kubernetes/api/blob/master/core/v1/types.go#L4189

Member

https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#primitive-types

For integer fields, prefer int32 to int64 unless you need to represent values larger than int32. See other guidelines about limitations of int64 and language compatibility.

Member
@tenzen-y Dec 24, 2025

It's hard to imagine a situation where we need to record a parameter over 68 years.

Member

That makes sense, thanks for sharing! I am fine with int32 in that case.
TBH, even int64 won't help us to solve the problem with badly configured jobs, since we can't predict what values users will set.

// for the metric and the value is the corresponding numeric value serialized
// as a string.
// +optional
Values map[string]string `json:"values,omitempty"`
Member

We should avoid maps in CRDs as much as possible.

values: [
  {
    key: "xyz",
    value: "abc",
  },
],

Member

Or

values: [
  {
    name: "xyz",
    value: "abc",
  },
],

Contributor Author

Thanks @tenzen-y I'll update it to the name/value approach.

#### Sidecar container injection

To enable monitoring and sidecar injection, users must add the label `trainer.kubeflow.org/trainjob-monitoring-step: trainer` to one of the replicated jobs in their (Cluster)TrainingRuntime. This will cause the control plane to:
- inject a [sidecar container](https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/) into **one** of the trainer pods
Member

How to specify resource requests for the metrics collector sidecar container?

Contributor Author

Perhaps we could get away with using the same resource requests for all train jobs and read the values from the controller ConfigMap? Or we could add something to the TrainJob spec or let users override a default using annotations, but that adds more complexity.

What are your thoughts on the push-based approach instead? That approach doesn't need any config like this.

…hanism

Signed-off-by: Rob Bell <robell@redhat.com>
…ased approach

Signed-off-by: Rob Bell <robell@redhat.com>
Add a simpler, more reliable approach to collect the final status.

Signed-off-by: Rob Bell <robell@redhat.com>
@robert-bell force-pushed the kep-2779-trainjob-progress branch from 6d0a6ff to 6fca18d on December 24, 2025 15:17
@robert-bell
Contributor Author

Hey @andreyvelich @tenzen-y @astefanutti I've done another update.

Main changes since last time -

  • updated the path and added versioning to the new endpoint and added an example payload.
  • clarified what changes when the feature gate is enabled/disabled.

I've also raised a question here #2905 (comment) which could do with your input.

Please take a look, thanks.


Users can choose not to instrument their runtime, in which case no progress and metrics will be available on the TrainJob. The feature is therefore optional and opt-in.

The feature will have an associated feature gate, defaulting to "enabled". Disabling the gate will disable the http service.
Member

Shall we start with disabled by default initially? So once this feature is well tested, we can enable it by default.
I believe this is how Kubernetes manages their features. Is that correct, @kannon92?
WDYT @robert-bell @astefanutti @tenzen-y?

Also, we need to add the feature name: TrainJobProgress

FYI, here you can find implementation in JobSet: https://github.com/kubernetes-sigs/jobset/blob/main/pkg/features/features.go#L35C17-L35C32
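
To make that concrete, the gate entry might end up looking roughly like the fragment below in the controller manager config (the key name and schema are not settled in this thread; this just mirrors the Kubernetes/JobSet feature-gate style and is illustrative only):

# Hypothetical fragment of controller_manager_config.yaml; key name assumed.
featureGates:
  TrainJobProgress: false   # disabled by default initially, enabled explicitly to opt in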

Contributor Author

I'm happy to make it disabled by default initially. I'll update the text.

I think we'd previously talked about introducing the feature gates after this feature #2905 (comment). Is there another feature coming up that wants a feature gate? If not, it's probably actually slightly easier to add the feature gate scaffolding with this feature, rather than implement the feature and then retrofit it to use a feature gate later. I'm happy either way though.

Member

If not, it's probably actually slightly easier to add the feature gate scaffolding with this feature, rather than implement the feature and then retrofit it to use a feature gate later. I'm happy either way though.

Yes, I agree. If you want, we can create a separate small PR to introduce feature-gate capabilities to the Trainer config, similarly to JobSet.

Contributor Author

Yeah that makes sense. I'm happy to make that PR for adding feature gates scaffolding if you'd like?

Signed-off-by: Rob Bell <robell@redhat.com>
@robert-bell force-pushed the kep-2779-trainjob-progress branch from cdc4f34 to e7be2ad on January 16, 2026 16:35
Member
@andreyvelich left a comment

Thanks for the updates @robert-bell!
I think we should be good to move this forward, excited to see this feature!
/lgtm
/assign @tenzen-y @akshaychitneni @astefanutti

@robert-bell
Contributor Author

Awesome, thanks @andreyvelich, and thank you for your detailed feedback and help moving this forward.

@akshaychitneni
Contributor

/lgtm
Thanks @robert-bell

@astefanutti
Contributor

/lgtm

Awesome work @robert-bell @abhijeet-dhumal thanks!

@andreyvelich
Member

/retest

Member
@tenzen-y left a comment

Overall LGTM

// name is a user-defined label for the metric, e.g. "loss", "eval_accuracy".
// +kubebuilder:validation:MinLength=1
// +required
Name string `json:"name,omitempty"`
Member

Suggested change
Name string `json:"name,omitempty"`
Name string `json:"name"`

Why is omitempty even though this is a required field?

Contributor Author

It's one of the checks we've got enabled in KAL, but I don't entirely understand why it's necessary. See the KAL docs. If it's not there, KAL raises an issue.

I think in practice it doesn't make any difference because the field also has +kubebuilder:validation:MinLength=1 so the default empty value isn't valid.


// value of the metric. Values must be serialized as a string.
// +kubebuilder:validation:MinLength=1
// +required
Value string `json:"value,omitempty"`
Member

Same question.

Contributor Author

As above. 😄

Comment on lines +425 to +427
trainer.kubeflow.org/trainjob-monitoring-step: trainer
trainer.kubeflow.org/trainjob-monitoring-port: 28080
trainer.kubeflow.org/trainjob-monitoring-interval: 30s
Member

Isn't this system metadata that saves parameters for the controller?
Typically, such system metadata should be stored in annotations instead of labels.

But, I understand that this sidecar approach is just an alternative one.

Contributor Author

Ah good catch. I've pushed an update.

Signed-off-by: Rob Bell <robell@redhat.com>
@google-oss-prow bot removed the lgtm label on Jan 21, 2026
@robert-bell
Contributor Author

@andreyvelich @astefanutti @akshaychitneni @tenzen-y I pushed a minor update in response to one of the latest comments, forgetting that it would reset the lgtm labels. Apols. Can you please re-review?

Is there anything else that we need for this? The next steps for implementing it are looking good and clear from my end, and we should be able to start on the implementation mid next week assuming everyone's happy with the current plan.

@andreyvelich
Member

/lgtm
/approve
/hold in case @tenzen-y has any additional comments.

@tenzen-y
Member

@robert-bell Thank you for moving that forward! LGTM 👍
/lgtm
/approve

@tenzen-y
Member

/hold cancel

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [andreyvelich, tenzen-y]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow bot merged commit 03c7d93 into kubeflow:master on Jan 22, 2026
35 of 36 checks passed
@google-oss-prow bot added this to the v2.2 milestone on Jan 22, 2026
