feat(docs): KEP-2779: Track TrainJob progress and expose training metrics #2905
Conversation
### Runtime: emitting trainer status messages

The "primary" training runtime pod will write trainer status messages to stdout in the following format:
We should consider taking inspiration from / aligning with the metricsCollectorSpec mechanism from the Katib API.
We can prioritize support for the StdOut collector type for the first implementation, but that'll give the API enough flexibility and consistency across components.
Hey @astefanutti, please correct me if I'm wrong here:
As described in Katib's docs on its natively supported metrics collector approaches (pull/push based): https://www.kubeflow.org/docs/components/katib/user-guides/metrics-collector/
There are three collector types (StdOut, TensorFlowEvent, and file-based).
So for the initial phase we could start by adding a metricsCollectorSpec.kind field to our API now, with simple regex parsing handled by a metrics collector sidecar, something like this, which is an exact match for Katib's approach:
```yaml
spec:
  metricsCollectorSpec:
    kind: StdOut
    source:
      filter: "([\w|-]+)=([+-]?\d+\.?\d*)"
```
and then we can introduce file-based, TensorFlowEvent, and custom metrics collector approaches in later phases?
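As a rough illustration, a StdOut-style collector applying the filter above to a trainer log line might look like the sketch below. The sample log line and parsing code are assumptions for illustration, not taken from Katib's implementation:

```python
import re

# The filter pattern from the example above; each match yields a (name, value) pair.
FILTER = r"([\w|-]+)=([+-]?\d+\.?\d*)"

line = "epoch=3 loss=0.8214 eval_accuracy=0.91"  # hypothetical trainer stdout line
metrics = {name: float(value) for name, value in re.findall(FILTER, line)}
print(metrics)  # {'epoch': 3.0, 'loss': 0.8214, 'eval_accuracy': 0.91}
```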
@abhijeet-dhumal exactly. I suggest the KEP define the API accordingly, but not all the collector types have to be supported in the first implementation.
Yeah, that's a really awesome insight. Thanks @astefanutti 🙌
We will update the proposal promptly!
cc: @robert-bell
Ah I wasn't aware of the metricsCollectorSpec in Katib. Thanks @astefanutti for pointing it out!
Are you thinking about aligning with the Katib approach for consistency and flexibility, or about making it easier for train jobs to send metrics to Katib (e.g. the same metric collection approach can be used by both Katib and KFT)? Or possibly both or something else entirely.
If making it easier for Katib to consume the metrics, would it make sense to consider extending Katib to watch the TrainJob resource instead, possibly from a sidecar or maybe directly in the Katib control plane? The control plane approach could be particularly appealing for user experience as it would mean they would not need to modify their training code at all (provided they use a supported runtime), or depend on the Katib SDK. The train job service account would also not need any additional RBAC permissions.
If you're thinking about consistency and flexibility, do you have an example in mind where the extra flexibility would benefit, or where the proposed approach would be limiting? I agree there's a need for flexibility, but I wonder whether that flexibility can be moved entirely into the runtime code? This may be easier to maintain and extend to add support for other ML frameworks, and would allow us to keep the KFT controller implementation simpler. We could even package some helper utilities to a python package (e.g. kubeflow-utils) to help users instrument their code.
Yes, I'm thinking metrics collection should become the responsibility of the trainer / TrainJob API so these are exposed in a standard way for Katib to consume.
The Trainer would collect metrics from runtimes according to the metricsCollectorSpec pattern and expose them in a standard way, e.g. in the TrainJob status. The Katib trial controller would then watch the TrainJob to retrieve the metrics related to the experiment. This would allow Katib to operate without a database.
The metricsCollectorSpec pattern could be configured in the ClusterTrainingRuntime in a way that's compatible with the runtimes. For custom trainers, the SDK would be responsible for customizing it depending on the provided trainer.
Thanks for the clarification @astefanutti. I've looked at Katib's metricsCollectorSpec and thought about it in more detail, and I don't think we need to or should use the same pattern here.
Rather than putting flexibility in the control plane, we're proposing that we use the runtime pods to get flexibility. Each different ML framework would have different instrumentation code, but all frameworks would communicate using the same "protocol" (e.g. the specific log messages). This means that we don't need any flexibility in the control plane.
The advantages of using code in the runtime are:
- much more flexibility than can be provided by metricsCollectorSpec,
- it's much easier for data scientists to consume: configuration is all in code, rather than split across a YAML file and code,
- the control plane can be kept simpler. In particular, I think Katib needs a sidecar to collect the metrics.
To make it easier for users to instrument their code, we can leverage the SDK to automatically instrument training jobs submitted through it if the framework supports it (e.g. for frameworks that have callback mechanisms).
For frameworks that don't support callbacks, or for users that aren't creating jobs via the SDK, we could provide a runtime utility library that implements the protocol and lets users add the instrumentation themselves (e.g. it could be embedded in their runtime images); see the sketch after this comment. Or they are free to just not instrument their training loop.
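To make the idea of a shared "protocol" concrete, here is a minimal sketch of what such a runtime helper could look like. The message prefix, field names, and helper name are hypothetical; the KEP does not define this exact format.

```python
import json
import sys

def report_trainer_status(progress_percentage, train_metrics=None):
    """Emit one structured status line that a collector or the control plane could parse."""
    payload = {
        "progressPercentage": progress_percentage,
        "trainMetrics": train_metrics or {},
    }
    # A fixed, greppable prefix followed by JSON keeps the "protocol" framework-agnostic.
    print("KUBEFLOW_TRAINER_STATUS " + json.dumps(payload), file=sys.stdout, flush=True)

# Example: report 45% progress with the latest training loss.
report_trainer_status(45, {"loss": 0.82})
```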
@robert-bell thanks that sounds great!
I should have expressed the requirement for flexibility rather than jumping directly to the implementation details :)
I haven't included something like metricsCollectorSpec in the updated approach, but I think the updated design gives us all the flexibility we need without it. @astefanutti do take a look.
```go
}

type TrainJobTrainerStatus struct {
```
So #2909 and this PR seem to conflict with each other.
In this one, we want to make TrainJob a general solution for deep learning training.
But in #2909, we are hoping to make TrainJob a general solution for HPC workloads (not just deep learning frameworks).
cc @vsoch
I'm not sure it's blocking, as this project is called TrainJob, but I do want to call this out as a discrepancy.
I think the intention is to better support training jobs that use HPC technologies (MPI, topology-aware and fine-grained scheduling, etc.). It's no more strange than having the MPI plugin.
@kannon92 is the conflict that #2909 would allow TrainJob to be used for more general HPC/non-training workloads, but this proposal is specifically for model training workloads?
If so, I'm not sure it's a problem because, as you say, TrainJob is designed for model training. Plus, the extra status information we're proposing is optional, so users can still use TrainJob for more general HPC workloads if they want.
Also just to clarify, this proposal isn't limited to just deep learning frameworks. Other training frameworks, e.g. XGBoost, LightGBM, should be able to use it too.
I agree with @robert-bell; the metrics status might also be very useful for HPC jobs. We just need to figure out how to make these statuses agnostic to the job you run.
On the other hand, TrainJob's APIs are more appropriate for the training jobs (e.g. trainer, initializer, potentially exporter/checkpointer cc @rst0git).
At some point, we need to discuss whether these abstractions make sense for HPC jobs or we need to re-think something.
These should map quite well to HPC use cases! We tend to have Figures of Merit (FOM) that are derived from different steps or stages. There are often multiple FOM. The metrics look relevant here with the exception of epoch - not everything you run in HPC is going to have an epoch, but AI/ML jobs run with MPI will.
Is there a reason to hard-code metric types (e.g., TrainMetric vs EvalMetric) versus having a common metric type, and then within that type, an optional variable for the context? That would be more flexible for different kinds of applications, I think, and still work well for train/eval.
That could work! I was also thinking:
```yaml
trainingStatus:
  metrics:
  - name: loss
    value: 0.8
    type: train
  - name: auroc
    value: 0.7
    type: train
```
I like your design because finding a type is an O(1) lookup instead of needing to loop through an entire list (which could be large).
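As a small illustration of that trade-off, a client can still get O(1) lookups from a flat name/value/type list by indexing it once. The snippet below is an assumed client-side sketch, not part of the proposed API:

```python
# Status metrics as they might appear in the TrainJob status (illustrative values).
metrics = [
    {"name": "loss", "value": "0.8", "type": "train"},
    {"name": "auroc", "value": "0.7", "type": "train"},
    {"name": "accuracy", "value": "0.74", "type": "eval"},
]

# Index the list once so later reads are O(1) per metric.
by_type = {}
for m in metrics:
    by_type.setdefault(m["type"], {})[m["name"]] = float(m["value"])

print(by_type["train"]["loss"])  # 0.8
```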
@vsoch I've updated the schema to the metrics group approach.
@robert-bell @vsoch @astefanutti Shall we consider a list instead of a map, as @vsoch proposed?
I understand that a map gives us O(1) access, but it has some other problems in Kubernetes API design.
Additionally, it would be hard to do strategic merge patches on maps.
Maybe @tenzen-y, @kannon92, or @ahg-g have any insights into why we introduced a slice for the ReplicatedJobs API in JobSet: https://github.com/kubernetes-sigs/jobset/blob/main/api/jobset/v1alpha2/jobset_types.go#L245-L248
Kubernetes API conventions discourage the use of maps.
Kube API Linter even has a nomaps check.
Pull Request Test Coverage Report for Build 21214073338
Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls
szaher
left a comment
Thanks very much @robert-bell @abhijeet-dhumal.
This is a great proposal; it brings a lot of value:
- Excellent user experience: the goal is perfect. A data scientist seeing PROGRESS: 45% and ETA: 15m is a massive win.
- No dependency on external components like Prometheus or sidecars
- Nice SDK integration with the transformers library
But it has a few problems too:
- Architectural anti-pattern: the control plane accesses the data plane and streams logs
- While the footprint of a goroutine is minimal on the operator itself, it abuses the control plane by maintaining a constant GET request to /api/v1/namespaces/.../pods/.../log?follow=true for every single active TrainJob (so scalability to 1000 jobs is questionable)
- Broad security permissions for the training operator to access cluster-wide pod logs
- Error prone: if training is verbose, some datasets might have very long example lines, or there might be extremely long log messages
Another proposed solution is as follows:
Enable the KubeFlowCallBack to send status to the TrainJob directly via the kube API. When the operator starts a new training job, it creates a service account, injects it into the master/primary pod, and lets the callback report the status in a clean way and patch or update the TrainJob directly with the required status updates (see the sketch after this comment).
This approach is much cleaner:
- No log parsing
- No goroutines
- No impact on the kube-apiserver from fetching and streaming logs
- Only the master/primary pod can update the status; worker pods check whether they have the right permissions and skip updating the status if they don't
- Lets the train job itself update its own status
Overall, this is excellent and I think we should push it forward.
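For concreteness, below is a rough sketch of what that alternative could look like from inside the primary pod. It assumes the injected service account has RBAC for the trainjobs/status subresource; the API group/version, namespace, field names, and TrainJob name are illustrative, not taken from the KEP.

```python
from kubernetes import client, config

# Use the service account token the operator injected into the primary pod.
config.load_incluster_config()
api = client.CustomObjectsApi()

# Patch only the trainer status on the TrainJob this pod belongs to.
api.patch_namespaced_custom_object_status(
    group="trainer.kubeflow.org",
    version="v1alpha1",          # assumed API version
    namespace="team-a",          # hypothetical namespace
    plural="trainjobs",
    name="my-train-job",         # hypothetical TrainJob name
    body={"status": {"trainerStatus": {"progressPercentage": 45}}},
)
```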
### Goals

1. **Expose real-time progress information and training metrics through the TrainJobs CR.** For example: percentage complete, estimated time remaining, current step/epoch, total steps/epochs, eval metrics.
I would keep training and eval metrics outside the status, as data scientists have dedicated tools for this like MLflow, TensorBoard, wandb, etc.
Also, it seems the proposed solution will capture only the last value from the logs, which isn't too helpful for the data scientist or ML persona, who will prefer something like TensorBoard to understand what happened during their training process.
This is unlike Katib or HPO, where we have an optimization problem and the final metrics are important.
Also, it seems the proposed solution will capture only the last value from the logs which isn't too helpful for the data scientists or ML persona who will prefer something like tensorboard ...etc. for understand what happened during their training process.
I think the main target of this KEP is client integration, e.g. a CLI (e.g. kubectl), a GUI, or Katib, that would consume those metrics, not the data scientist persona, and it's indeed not the goal to overlap with tools like TensorBoard.
This is unlike Katib or HPO where we have an optimization problem and final metrics are important.
Exactly, Katib would indeed be a candidate client to consume those metrics to carry out the optimizations.
I've updated the non-goals in the second draft to try to clarify that this feature is not replacing these other tools, but is trying to provide something that works out of the box without needing any additional components installed.
### Story 2: Data Scientist / ML Engineer monitoring training jobs

As a data scientist or ML Engineer, I want to see real-time information about my training jobs so I can make decisions on whether those jobs are likely to succeed or whether intervention is required.
Will the controller always stream the logs to provide real-time information?
- How does that scale?
- Is this design an anti-pattern, since the control plane accesses the data plane? The control plane's job is to make decisions. It includes the API server, schedulers, and your controller/operator. Its job is to "establish and enforce policy." It is not designed for high-throughput, high-volume data processing; it's designed for state reconciliation.
It's a fair point that the log streaming approach is mixing control plane with data plane, though I wonder what the max throughput would be in practice? I'd anticipate that applications would add at most a few lines to the logs per second, meaning throughput from a single job would be ~kB/s, and throughput from 1000 active train jobs may be ~MB/s, which should not be particularly onerous on the controller (see the rough numbers below). This is something that could be benchmarked, of course.
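A quick back-of-envelope check of those figures, with assumed (not measured) numbers:

```python
lines_per_second = 5    # assume a fairly chatty trainer
bytes_per_line = 200    # assume ~200 bytes per log line

per_job = lines_per_second * bytes_per_line   # 1,000 B/s  ~= 1 kB/s per job
cluster = per_job * 1000                      # 1,000,000 B/s ~= 1 MB/s for 1000 active jobs
print(per_job, cluster)
```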
The new http request based approach should hopefully address your scalability concerns.
```python
trainMetrics: dict[str, float] | None
evalMetrics: dict[str, float] | None
```
Those are very generic and not well defined. What metrics are we expecting? Since we are going to develop the callback in the SDK, should this be more well defined or structured, to identify which metrics will be available?
I believe the last value of the training metrics may not be of much value to data scientists, who are more interested in how a specific metric is behaving over the training period. Also, this overlaps with things like MLflow, TensorBoard, etc., which we already have a proposal to include in Kubeflow: kubeflow/community#892
progressPercentage, estimatedRemainingSeconds, currentEpoch, etc. already give very good information about the training process, so I suggest we get rid of trainMetrics and evalMetrics in favor of MLflow.
I believe the last value of the training metrics may not be of much value to data scientists, who are more interested in how a specific metric is behaving over the training period. Also, this overlaps with things like MLflow, TensorBoard, etc., which we already have a proposal to include in Kubeflow here: kubeflow/community#892
As discussed in the comment above, the main audience / target of this KEP is not the data scientist persona, and it's indeed not the goal to overlap with tools like TensorBoard nor with the scope of experiment tracking.
The main target of this KEP is client integration, e.g. a GUI or Katib that would consume those metrics.
@szaher I've updated the goals/non-goals to clarify that we're not trying to replace MLFlow. Does that address your concerns?
To add to this analysis, given the TrainJob "primary" Pods run arbitrary user code and extra packages, granting these Pods privileged access to the API server would warrant hardened security sandboxing.
Granting permissions so the train job itself can directly update the training status comes with a few more downsides:
Hi folks, thanks again for this effort! This feature should significantly improve the experience of AI workloads on Kubernetes, and has a lot of future potential for checkpointing, scheduling, etc. 🎉
@astefanutti @robert-bell Do we need to update the KEP based on our conversation with @astefanutti and @tenzen-y at KubeCon, so we can review and move this forward?
cc @romanbaron @EkinKarabulut @Ronkahn21 As we discussed, this should address the problems with detecting the running status of a TrainJob mentioned by Ron in this talk: https://sched.co/28D8Z
Hey @andreyvelich! Thanks for chasing, and apologies for the silence.
I'm working on an update to bring in things from that discussion. I'm aiming to find some time this week.
Hi folks, @astefanutti @andreyvelich @tenzen-y, I've updated the proposal with a new design based on the discussions from KubeCon, and looked to incorporate some of the ideas from the first draft. The changes are put into separate commits to make it easier to see what's changed. The main change is that I've replaced the log watching mechanism with HTTP calls. I've actually made two variants of the proposal - a push-based one (runtime sends to control plane) and a pull-based one (control plane scrapes from runtime) - because they're similar but have different trade-offs. We only need one of the approaches though, and I thought this might be the easiest way to weigh up the merits of each and help us reach consensus on how best to move forward. The other changes I've made are:
Please let me know if any extra info is needed to help this move forward.
```go
// for the metric and the value is the corresponding numeric value serialized
// as a string.
// +optional
Values map[string]string `json:"values,omitempty"`
```
@vsoch following up on the other thread, I just wanted to highlight I'm making an assumption here that we only need to support scalar metrics - we can't (directly) support non-scalar things like confusion matrices etc.
Should we be considering more structure here to let us support non-scalar metrics in the future? E.g.
```yaml
values:
  loss:
    type: scalar
    value: 0.9
  confusion_matrix:
    type: ConfusionMatrix
    # ... more complicated struct
```
I know TFX has done some work here, but it looks like it quickly gets very complicated -- see tf-metadata or tfma.
If we don't consider it now, though, it may be hard to add support for it later.
andreyvelich
left a comment
Thanks for the updates @robert-bell!
I left a few comments.
Let's chat more about it during today's call if you can join: https://bit.ly/2PWVCkV
| """ | ||
| read -r -d '' SCRIPT << EOM\n | ||
| import os | ||
| from transformers import TrainerCallback, trainer |
Do we have an open issue to introduce this callback to transformers: https://huggingface.co/docs/transformers/en/main_classes/callback#callbacks?
I am curious whether we should call this callback TrainerCallback, KFTrainerCallback, or KubeflowCallback?
Can this callback be useful for other jobs like OptimizationJobs?
Yeah I think having a callback in transformers would be ideal.
I'd been holding off starting a discussion there until we'd got a rough consensus here on the general approach to take, particularly the idea of introducing these framework-specific custom trainers. I'd be keen to open an issue there though.
In terms of names, I'd gone with KubeflowTrainerStatusCallback but I appreciate it's a mouthful! Just FYI TrainerCallback is the name of the base class in the transformers library so we shouldn't use that.
I prefer we name it KubeflowCallback or KubeflowTrainerCallback, since we might also want to use it for checkpointing in the future (cc @kramaranya @rst0git)
Are we planning to use this callback for other things, like for HP algorithms (e.g. Hyperband) or early stopping?
I know that sometimes algorithms make decisions during Trial execution.
Yeah I don't have a strong opinion on the naming. I'd be happy to use either KubeflowCallback or KubeflowTrainerCallback.
I wonder whether from a maintenance perspective we might want to have separate callbacks for each bit of functionality (status, checkpointing, early-stopping)? We could for sure wrap them up in a single "main" callback and allow users to enable/disable each callback individually.
I wonder whether from a maintenance perspective we might want to have separate callbacks for each bit of functionality (status, checkpointing, early-stopping)?
Could we ask the HuggingFace community about that? I feel that starting with a unified KubeflowCallback might be easier.
Additionally, we can use kwargs to configure the callback in the way we want: https://github.com/huggingface/transformers/blob/main/src/transformers/trainer_callback.py#L332
Additionally, I know that Transformers has other integrations for HP search, for example: https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/integration_utils.py#L296C5-L296C22
I prefer we name it KubeflowCallback or KubeflowTrainerCallback, since we might also want to use it for checkpointing in the future
Sounds like a good idea.
1. **Expose real-time progress information and training metrics through the TrainJobs CR.** For example: percentage complete, estimated time remaining, current step/epoch, total steps/epochs, eval metrics.
2. **Have zero dependencies.** The user should not need to install any additional components into their cluster for the feature to work. It should work "out-of-the-box".
3. **Optional.** Users can choose to opt in to providing progress tracking, but are not required to use this feature.
Are we planning to control it via Trainer config or TrainJob API?
For the push-based approach, it'd be controlled by the runtime - the user would opt in by instrumenting their code. The user wouldn't need to configure anything on the Trainer or TrainJob.
For the pull-based approach, users would need to instrument their runtime and also configure their (Cluster)TrainingRuntime so the control plane knows which pod to inject the sidecar into and then scrape, so there'd be an extra step. We could set up this config for the standard ClusterTrainingRuntimes.
For the pull-based approach, users would need instrument their runtime and also configure their (Cluster)TrainingRuntime so the control plane knows which pod to inject the sidecar and then scrape
This could be done in the SDK via PodTemplateOverride by each specific trainer.
Yeah, that'd be a nice approach and would keep the implementation clean.
We might need to update the PodTemplateOverrides implementation, because some of the options can't currently be used more than once. For example, SpecLabels doesn't merge the labels; it only keeps the last label set applied, so we might either overwrite user-applied label overrides, or the user-applied label overrides might overwrite the progress-tracking ones.
For the push-based approach, it'd be controlled by the runtime - the user would opt in by instrumenting their code
Do you mean we will always serve the endpoint, and when this endpoint is called, we will update the TrainJob status?
This could be done in the SDK via PodTemplateOverride by each specific trainers.
We talked about this yesterday with @robert-bell; that might not always work. Since we use a single Job template for the torch-distributed runtime, we can't inject the sidecar only into the MASTER pod (e.g. Job index = 0).
Do you mean we will always serve the endpoint, and this endpoint is called, we will update TrainJob status?
Yeah, that's right. The endpoint would always be available. It's essentially the same model as mlflow - the server is always there available, and the client submits the metrics whenever it wants to update them. It's the same server that's used by all TrainJobs in the cluster.
We talked yesterday with @robert-bell that might not always work. Since we use single Job template for torch-distributed runtime, we can't inject sidecar only to the MASTER pod (e.g. Job index = 0).
Just to clarify - I think it may still be possible to make it work, e.g. maybe the torch-distributed runtime has to be changed so there are two Jobs in it - one for the master pod which gets the sidecar, and another for all the other workers. I'm concerned about the overhead this'd add, though, to creating runtimes, especially if users want to create their own runtimes.
Yeah, that's right. The endpoint would always be available. It's essentially the same model as mlflow - the server is always there available, and the client submits the metrics whenever it wants to update them.
I am wondering if we should make this feature optional, and enabled by default in the config: https://github.com/kubeflow/trainer/blob/master/manifests/base/manager/controller_manager_config.yaml
We can also think about feature gates flags, like we do in JobSet (cc @kannon92 @astefanutti @tenzen-y): kubernetes-sigs/jobset#1096
We should introduce FGs after beta graduation regardless of this feature.
Currently, Trainer is in the alpha stage, so all features can be considered alpha, but after beta graduation we want to have feature stages.
I'd be happy to put this behind a feature gate, using the same approach as in JobSet. It makes sense to me to add the scaffolding for FGs sooner rather than later, so it's there and ready to be used by other features.
When it comes to the implementation, would folks prefer a separate PR for the FG, or to include it in the same PR that adds the progress tracking?
I am fine with introducing the FG after we implement this feature. As @tenzen-y mentioned, it would make sense to introduce it before we graduate the Trainer API to beta.
1. **Expose real-time progress information and training metrics through the TrainJobs CR.** For example: percentage complete, estimated time remaining, current step/epoch, total steps/epochs, eval metrics.
2. **Have zero dependencies.** The user should not need to install any additional components into their cluster for the feature to work. It should work "out-of-the-box".
3. **Optional.** Users can choose to opt in to providing progress tracking, but are not required to use this feature.
4. **Provide built-in progress tracking support for selected ML frameworks (e.g. transformers, pytorch-lightning) in the kubeflow sdk.** Data scientists should be able to use the kubeflow sdk to create training jobs using these frameworks, and have progress tracking automatically instrumented.
Shall we add a goal to introduce callbacks to third-party ML libraries like HuggingFace transformers to support our instrumentation?
Yes, I think this should be a goal. I'll update it in a batch with any other changes.
```go
// Empty if the status is unknown, e.g. the job has just started
// or the job is not instrumented to report its status.
// +optional
TrainerStatus *TrainJobTrainerStatus `json:"trainerStatus,omitempty"`
```
What do we think about TrainerProgress API, since we can introduce InitializerProgress in the future?
Are you thinking of renaming TrainerStatus => TrainerProgress? My thinking was that "progress" didn't really cover metrics, so status might be a better name.
We could have InitializerStatus contain the initializer progress. Wdyt?
Agree, let's finalize the API design here: #2905 (comment)
#### Security considerations

The control plane endpoint will be secured with TLS and token-based authentication.
@astefanutti @robert-bell Did you get a chance to check the in-place restart KEP from @GiuseppeTT ?
https://github.com/kubernetes-sigs/jobset/tree/main/keps/467-InPlaceRestart#implementation---permissions-for-the-agent-sidecar
Wondering if we can re-use similar ideas with ValidatingAdmissionPolicy
Thanks for sharing this - I wasn't aware of the ValidatingAdmissionPolicy api!
I've had a look but I'm not sure how we could secure this completely: we could use a ValidatingAdmissionPolicy to restrict it so only the trainer status field can be updated, but I'm not sure how we can restrict it so that nothing else running in the namespace can update the field? I think we'd need the train jobs to run using custom service accounts which is something I was trying to avoid for the reasons I outlined in this comment.
I might have misunderstood something in the ValidatingAdmissionPolicy api though.
```yaml
- configmap:
    name: <train-job-name>-tls-config
    items:
    - key: ca.crt
      path: ca.crt
```
Can other pods in this namespace re-use this volume while the TrainJob is running?
If yes, that might allow an attacker to re-use the certificate in other pods to call the system endpoint.
This is just the public CA cert, so the runtime trusts the control plane endpoint's TLS encryption. It's not an issue if other pods have access to it; they can't do anything with it.
```python
from urllib import request
import os
import ssl

payload = ...

url = os.environ["KUBEFLOW_TRAINER_STATUS_URL"]
ca_file = os.environ["KUBEFLOW_TRAINER_STATUS_CA_CERT"]
token = open(os.environ["KUBEFLOW_TRAINER_STATUS_TOKEN"]).read()
ssl_context = ssl.create_default_context(cafile=ca_file)
req = request.Request(url, data=payload, headers={"Authorization": f"Bearer {token}"})
# Note: urllib.request.urlopen takes the SSL context via the `context` keyword.
request.urlopen(req, context=ssl_context)
```
Can this instrumentation live in the Kubeflow SDK?
Looking at other callbacks, they do the same.
Do you mean in the transformers library? If so, then yes, it would be great to have it there!
I'm not sure it belongs in the Kubeflow SDK though - I was deliberately trying to avoid requiring users to install the SDK into their runtime images, to avoid the extra dependency and potential confusion (e.g. a different SDK version being installed client side than runtime side).
I feel like we still need to install kubeflow as a dependency of the transformers library.
Similar to other integrations, we might need to introduce a TrainerClient().is_available() method which checks that the correct version of the control plane is installed.
We discussed with @kubeflow/kubeflow-sdk-team that we want to introduce a ConfigMap that shows the version of the control plane, and the SDK should check that it is available on the cluster.
Otherwise, instrumentation might fail due to incompatible versions between Application (PyTorch) and Control Plane (Trainer controller).
Could we use the presence of environment variables injected by the control plane to determine whether kft is available? That would avoid the dependency on kubeflow and I think would be more reliable than checking in the sdk (we need to make sure the instrumentation in the runtime environment is compatible with the control plane).
We could also inject an environment variable containing the control plane version, or a version of the progress API that's supported. The check would maybe be something like this:
```python
def is_kubeflow_trainer_available():
    return os.environ.get("KUBEFLOW_TRAINER_STATUS_API_VERSION") == "1"
```
I think if we can we should try to avoid requiring kubeflow to be installed in the runtime because of the risk of confusion for users (e.g. it'll be confusing to debug if the runtime has a different kubeflow version than what the user has locally) and the risk of dependency conflicts with the runtime code.
Could we use the presence of environment variables injected by the control plane to determine whether kft is available?
But that will require you to inject the version of the control plane into the TrainJob, won't it?
I think if we can we should try to avoid requiring kubeflow to be installed in the runtime because of the risk of confusion for users
But why don't we want to control the instrumentation in the kubeflow SDK itself? If we don't do this, we will have to always update the Transformers library for any changes. From my understanding, if users want to use transformers with KubeflowCallback, they must install it as pip install transformers[kubeflow], which will install the kubeflow SDK in any case.
Same as for other integrations: https://github.com/huggingface/transformers/blob/main/setup.py#L248
But why don't we want to control the instrumentation in the kubeflow SDK itself?
As I say, my concern is about needing the sdk installed in the runtime.
I know KFP are trying to remove the need to have the kfp sdk installed at runtime - see kubeflow/pipelines#9886 and kubeflow/pipelines#12081 (comment) - and I'm conscious that we may ultimately feel similar pains to what they're feeling, especially if the intention is for the kubeflow sdk to support a broader set of kubeflow components.
That all said, I don't have a strong objection to what you're proposing. It's certainly a pragmatic and easier approach to get things off the ground.
I guess the downside of not introducing this instrumentation into the Kubeflow SDK is that we would have to always update the transformers library to fix/change the callback and release a newer version, right? So users will have to update the transformers package in any case.
Could we check how other callbacks solve this issue in transformers?
```python
from transformers import TrainerCallback, trainer
from urllib import request

class KubeflowTrainerStatusCallback(transformers.TrainerCallback):
```
Why can't this live directly in the HF Transformers library?
We can contribute our custom callback to their library.
Yeah, I think we should aim to get this to live in Transformers, and do something similar in other frameworks if we can.
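For reference, here is a minimal sketch of what such a callback could look like, assuming the push-based status endpoint from this KEP. The class name, payload shape, and env var follow the current draft but may change, and the TLS/token handling shown earlier is omitted for brevity:

```python
import json
import os
from urllib import request

from transformers import TrainerCallback


class KubeflowTrainerStatusCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        url = os.environ.get("KUBEFLOW_TRAINER_STATUS_URL")
        # Only the primary process reports, and only when the endpoint is configured.
        if not url or not state.is_world_process_zero:
            return
        payload = {
            "progressPercentage": round(100 * state.global_step / max(state.max_steps, 1)),
            "trainMetrics": {k: v for k, v in (logs or {}).items() if isinstance(v, (int, float))},
        }
        req = request.Request(
            url,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        request.urlopen(req)
```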
* Adding a new "transformers-distributed" ClusterRuntime which will be included in the default set of cluster runtimes included in the manifests.
* Publish new docker images for the "transformers-distributed" runtime "ghcr.io/kubeflow/trainer/transformers-runtime". The docker image will include the transformers, accelerate and torch python packages.
We need this just to install the transformers library, right?
Yes that's what I was thinking, though I'm not sure how helpful this is given users can just install the package at runtime. I'd be happy to drop this.
/ok-to-test
```go
//
// +kubebuilder:validation:Minimum=0
// +optional
EstimatedRemainingSeconds *int64 `json:"estimatedRemainingSeconds,omitempty"`
```
Is there any assumption that justifies using int64 instead of int32?
I went with int64 to be on the safe side, but I'll happily update it to int32.
int32 gives a max remaining time of ~68 years, which should be plenty big enough, but my thinking was users might see overflow for badly configured jobs - e.g. the user picks 25k train steps and the first step takes 24 hrs, so the estimated remaining time would be ~68 years (see the quick check below).
The +kubebuilder:validation:Minimum=0 tag should mostly guard against overflow though, so I'm happy to simplify it to int32.
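A quick worked check of that example (numbers taken from the comment above):

```python
steps = 25_000
seconds_per_step = 24 * 60 * 60                   # 86,400 s (one step per day)
estimated_remaining = steps * seconds_per_step    # 2,160,000,000 s, roughly 68.5 years
print(estimated_remaining > 2**31 - 1)            # True: just above the int32 max of 2,147,483,647
```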
@tenzen-y Do you see any concerns with int64? I can see that core API also uses it: https://github.com/kubernetes/api/blob/master/core/v1/types.go#L4189
For integer fields, prefer int32 to int64 unless you need to represent values larger than int32. See other guidelines about limitations of int64 and language compatibility.
It's hard to imagine situations where we would need to record a value of over 68 years.
That makes sense, thanks for sharing! I am fine with int32 in that case.
TBH, even int64 won't help us solve the problem with badly configured jobs, since we can't predict what values users will set.
```go
// for the metric and the value is the corresponding numeric value serialized
// as a string.
// +optional
Values map[string]string `json:"values,omitempty"`
```
We should avoid maps in CRDs as much as possible.
```
values: [
  {
    key: "xyz",
    value: "abc",
  },
],
```
Or
```
values: [
  {
    name: "xyz",
    value: "abc",
  },
],
```
Thanks @tenzen-y I'll update it to the name/value approach.
#### Sidecar container injection

To enable monitoring and sidecar injection, users must add the label `trainer.kubeflow.org/trainjob-monitoring-step: trainer` to one of the replicated jobs in their (Cluster)TrainingRuntime. This will cause the control plane to:
- inject a [sidecar container](https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/) into **one** of the trainer pods
How do we specify resource requests for the metrics collector sidecar container?
Perhaps we could get away with using the same request for all train jobs and reading the values from the controller ConfigMap? Or we could add something to the TrainJob spec, or let users override a default using annotations, but that adds more complexity.
What are your thoughts on the push-based approach instead? That approach doesn't need any config like this.
Hey @andreyvelich @tenzen-y @astefanutti I've done another update. Main changes since last time -
I've also raised a question here #2905 (comment) which could do with your input. Please take a look, thanks.
Users can choose not to instrument their runtime, in which case no progress and metrics will be available on the TrainJob. The feature is therefore optional and opt-in.

The feature will have an associated feature gate, defaulting to "enabled". Disabling the gate will disable the http service.
Shall we start with disabled by default initially? So once this feature is well tested, we can enable it by default.
I believe this is how Kubernetes manages its features. Is that correct @kannon92?
WDYT @robert-bell @astefanutti @tenzen-y?
Also, we need to add the feature name: TrainJobProgress
FYI, here you can find implementation in JobSet: https://github.com/kubernetes-sigs/jobset/blob/main/pkg/features/features.go#L35C17-L35C32
I'm happy to make it disabled by default initially. I'll update the text.
I think we'd previously talked about introducing the feature gates after this feature: #2905 (comment). Is there another feature coming up that wants a feature gate? If not, it's probably actually slightly easier to add the feature gate scaffolding with this feature, rather than implement the feature and then retrofit it to use a feature gate later. I'm happy either way though.
If not, it's probably actually slightly easier to add the feature gate scaffolding with this feature, rather than implement the feature and then retrofit it to use a feature gate later. I'm happy either way though.
Yes, I agree. If you want, we can create a separate small PR to introduce feature gate capabilities to the Trainer config.
Similarly to JobSet.
Yeah that makes sense. I'm happy to make that PR for adding feature gates scaffolding if you'd like?
andreyvelich
left a comment
Thanks for the updates @robert-bell!
I think, we should be good to move this forward, excited to see this feature!
/lgtm
/assign @tenzen-y @akshaychitneni @astefanutti
Awesome, thanks @andreyvelich, and thank you for your detailed feedback and help moving this forward.
/lgtm

/lgtm Awesome work @robert-bell @abhijeet-dhumal, thanks!

/retest
tenzen-y
left a comment
Overall LGTM
```go
// name is a user-defined label for the metric, e.g. "loss", "eval_accuracy".
// +kubebuilder:validation:MinLength=1
// +required
Name string `json:"name,omitempty"`
```
Suggested change:
```diff
- Name string `json:"name,omitempty"`
+ Name string `json:"name"`
```
Why is omitempty there even though this is a required field?
It's one of the checks we've got enabled in KAL, but I don't entirely understand why it's necessary. See the KAL docs. If it's not there, KAL raises an issue.
I think in practice it doesn't make any difference because the field also has +kubebuilder:validation:MinLength=1 so the default empty value isn't valid.
Ah, that sounds good to me.
I also checked the API conventions: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#serialization-of-optionalrequired-fields
Thank you.
```go
// value of the metric. Values must be serialized as a string.
// +kubebuilder:validation:MinLength=1
// +required
Value string `json:"value,omitempty"`
```
Same question.
As above. 😄
```yaml
trainer.kubeflow.org/trainjob-monitoring-step: trainer
trainer.kubeflow.org/trainjob-monitoring-port: 28080
trainer.kubeflow.org/trainjob-monitoring-interval: 30s
```
Isn't this system metadata that stores parameters for the controller?
Typically, such system metadata should be stored in annotations instead of labels.
But I understand that this sidecar approach is just an alternative one.
Ah good catch. I've pushed an update.
@andreyvelich @astefanutti @akshaychitneni @tenzen-y I pushed a minor update in response to one of the latest comments, forgetting that it would reset the lgtm labels. Apologies. Can you please re-review? Is there anything else that we need for this? The next steps for implementing it look good and clear from my end, and we should be able to start on the implementation mid next week, assuming everyone's happy with the current plan.
/lgtm

@robert-bell Thank you for moving that forward! LGTM 👍

/hold cancel
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: andreyvelich, tenzen-y
The full list of commands accepted by this bot can be found here. The pull request process is described here.
What this PR does / why we need it:
This is a KEP for adding real-time progress and exposing training metrics on TrainJob.
Which issue(s) this PR fixes (optional, in `Fixes #<issue number>, #<issue number>, ...` format, will close the issue(s) when PR gets merged):
Part of #2779