As a data scientist or ML Engineer, I want to create a TrainJob but I do not want to have to work out how to integrate training monitoring.
## Proposal option 1: Push-based
As a first option, we propose a **push-based** approach with the following high-level design:
1. The TrainJob custom resource exposes the current training progress and metrics via a new optional field `status.trainerStatus`.
2. The trainer control plane exposes a new http service which can receive the trainer status from the trainer runtime pods.
* Adding a new "transformers-distributed" ClusterRuntime, which will be included in the default set of cluster runtimes shipped in the manifests.
* Publishing new Docker images for the "transformers-distributed" runtime, "ghcr.io/kubeflow/trainer/transformers-runtime". The image will include the transformers, accelerate and torch Python packages.
## Proposal option 2: Pull-based
As a second option to consider, we propose a **pull-based** approach with the following high-level design:
1. As in Proposal 1, the TrainJob custom resource exposes the current training progress and metrics via a new optional field `status.trainerStatus`.
2. The user instruments their trainer runtime pod(s) so that the current trainer status is written to a local file.
3. The control plane injects a sidecar container into **one** of the runtime pods. The sidecar has access to the local file through a shared volume and exposes an http server that serves the training progress and metrics from it.
4. The trainer control plane periodically scrapes the http server to fetch the progress and metrics and then updates the TrainJob custom resource.
5. When training completes, the main container terminates, but the sidecar container waits for a final scrape from the control plane before exiting. This ensures the final trainer status is collected.

As in Proposal 1, the feature is optional but available for all TrainJobs. Users opt in to the functionality by adding configuration to their TrainJob and instrumenting their runtime to write the metrics.

Also as in Proposal 1, we propose adding the same set of new custom trainers to the kubeflow-sdk to make it easier for users to instrument their runtime pods.
### Design Details
#### TrainJob CRD changes
In addition to the changes from Proposal 1, a new optional field `spec.trainer.monitoring` will be added to the TrainJobSpec with the following schema:
```go
type Trainer struct {
    // ... existing fields

    // monitoring defines configuration for monitoring the progress and metrics
    // reported by the trainer runtime.
    Monitoring *Monitoring `json:"monitoring,omitempty"`
}
```

An example of the resulting `status.trainerStatus` (the schema is shared with Proposal 1):

```yaml
estimatedRemainingTimeSummary: "9 days 5 hours" # Human-readable

# Training iterations
currentStep: 4500 # Completed 4500 steps
totalSteps: 10000 # Out of 10000 total
currentEpoch: 2   # On epoch 2
totalEpochs: 5    # Of 5 epochs

metrics:
  # Training metrics (serialized as strings)
  - type: train
    values:
      loss: "0.2347"          # Current training loss
      learning_rate: "0.0001" # Current LR
      grad_norm: "1.234"      # Gradient norm

  # Evaluation metrics (from validation set)
  - type: eval
    values:
      eval_loss: "0.2451"      # Validation loss
      eval_accuracy: "0.8912"  # Validation accuracy
      eval_perplexity: "1.277" # Model perplexity

# Timestamp of last progress update
lastUpdatedTime: "2025-01-23T10:30:45Z"
```
#### Sidecar container
A lightweight http server will be injected as a [sidecar container](https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/) into **one** of the trainer pods. Only one sidecar will be injected to simplify collecting the final status (see [collecting the final metrics](#collecting-the-final-metrics) below).

The sidecar shares an `emptyDir` volume with all other containers in the pod and reads the metrics directly from the shared file when handling each request.
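
A minimal sketch of what such a sidecar server could look like (Python for illustration; the port, URL path, and file location are assumptions, not part of the proposal):

```python
import http.server
import json
import os

# Hypothetical location of the shared status file inside the emptyDir volume.
STATUS_FILE = os.environ.get("STATUS_FILE", "/kubeflow/status/trainer-status.json")

class StatusHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Read the shared file on every request so the response is always current.
        try:
            with open(STATUS_FILE, "rb") as f:
                body = f.read()
            json.loads(body)  # sanity-check that the payload is valid JSON
            self.send_response(200)
        except (FileNotFoundError, json.JSONDecodeError):
            body = b"{}"
            self.send_response(503)  # no status written yet
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    http.server.HTTPServer(("", 8080), StatusHandler).serve_forever()
```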
The user must instrument their runtime code so that the main runtime container(s) periodically write the current training status to a file in the shared volume. The file must contain a single JSON payload with the same schema as `TrainJobTrainerStatus`, but omitting the `estimatedRemainingTimeSummary` field, which will be calculated by the control plane for consistency.

When updating the training status, the main container must replace the file contents so that the file only ever contains a single status. The replacement should be done atomically, by writing to a temporary file and then renaming it, to avoid the race condition of the http server reading a partially updated file.

The control plane will inject the path where the training status should be written into the runtime containers using an environment variable.
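
For illustration, the instrumentation side might write the status roughly as follows (a sketch; the environment variable name and default path are hypothetical):

```python
import json
import os
import tempfile

def write_trainer_status(status: dict) -> None:
    """Atomically replace the shared status file with a single JSON payload."""
    # Path injected by the control plane; the variable name here is hypothetical.
    path = os.environ.get("KUBEFLOW_TRAINER_STATUS_FILE", "/kubeflow/status/trainer-status.json")
    # Write to a temporary file in the same directory, then rename it over the
    # target. os.replace is atomic on POSIX, so the sidecar never observes a
    # partially written file.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path), suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(status, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise

write_trainer_status({"currentStep": 4500, "totalSteps": 10000,
                      "metrics": [{"type": "train", "values": {"loss": "0.2347"}}]})
```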
In addition, the control plane will:

- inject the sidecar into **one** of the trainer pods by creating a new `ReplicatedJob` with replicas=1 when creating the JobSet. This pod will be assigned the label `kubeflow-training-monitoring-pod`.
- create a new service `<train-job-name>-monitoring` in the train job namespace that points to the sidecar endpoint and uses that pod label as the selector.

While the train job is active, the control plane will scrape the metrics server as part of its reconciliation loop, using the `<train-job-name>-monitoring` service, and update the TrainJob `trainerStatus`. The reconciliation will be re-queued to automatically trigger the next scrape.
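
Conceptually, each scrape is a plain HTTP GET against the monitoring service. The sketch below illustrates the request/response shape in Python, although the control plane itself is written in Go; the port and `/status` path are assumptions:

```python
import json
import urllib.request

def scrape_trainer_status(train_job: str, namespace: str, port: int = 8080) -> dict:
    """Fetch the latest trainer status from the per-job monitoring service."""
    # The service name follows the proposal's <train-job-name>-monitoring convention;
    # the port and "/status" path are illustrative assumptions.
    url = f"http://{train_job}-monitoring.{namespace}.svc:{port}/status"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)

# e.g. scrape_trainer_status("llama-finetune", "team-ml")
```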
By default, the control plane will assume that all pods are equivalent and will select one pod arbitrarily to instrument with the sidecar. The user can override this by annotating one of the replicated jobs with `trainer.kubeflow.org/monitoring: enabled`, which will cause the control plane to select a pod from that replicated job only.
#### Collecting the final metrics
To ensure the final training status is collected after training has completed successfully, the sidecar will wait for the control plane to make a final scrape:
- when training completes, the main container terminates successfully and the kubelet sends SIGTERM to the sidecar container.
- when the sidecar receives the signal, it waits (with a timeout) for the control plane to collect the final status and then terminates.

The sidecar is injected into only **one** pod so that it can easily detect when the final training status has been collected, allowing it to terminate more quickly.
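
A rough sketch of how the sidecar could implement this wait (Python; the grace period and the way a "final scrape" is detected are assumptions):

```python
import signal
import threading

# Set when the kubelet sends SIGTERM (i.e. the main container has finished).
terminating = threading.Event()
# Set by the request handler when a scrape is served after SIGTERM was received.
final_scrape_done = threading.Event()

signal.signal(signal.SIGTERM, lambda signum, frame: terminating.set())

def after_scrape_served():
    """Call from the HTTP handler after each successful response."""
    if terminating.is_set():
        final_scrape_done.set()

def run_until_final_scrape(shutdown_server, grace_seconds: float = 60.0):
    """Keep serving until training ends, then wait for one last scrape (or time out)."""
    terminating.wait()                             # training finished
    final_scrape_done.wait(timeout=grace_seconds)  # allow the control plane one last scrape
    shutdown_server()                              # then exit so the pod can complete
```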
#### Security considerations
Securing the http server with auth and TLS can be achieved as follows:
- the control plane creates a secret in the train job namespace containing an API key and a self-signed certificate. The control plane could periodically rotate these secrets.
- the secret is accessed by the sidecar container through a volume mount.
- before scraping the sidecar container, the control plane looks up the API key and certificate authority from the secret.
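
A sketch of what enforcing this could look like in the sidecar (Python; the secret mount paths, header name, and port are assumptions):

```python
import http.server
import ssl

API_KEY_PATH = "/etc/kubeflow-monitoring/api-key"  # hypothetical secret mount path
CERT_PATH = "/etc/kubeflow-monitoring/tls.crt"     # hypothetical
KEY_PATH = "/etc/kubeflow-monitoring/tls.key"      # hypothetical

with open(API_KEY_PATH) as f:
    API_KEY = f.read().strip()

class SecureStatusHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Reject scrapes that do not present the API key from the shared secret.
        if self.headers.get("X-API-Key") != API_KEY:
            self.send_error(401, "missing or invalid API key")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(b"{}")  # the status payload would be served here

server = http.server.HTTPServer(("", 8443), SecureStatusHandler)
# Terminate TLS with the self-signed certificate provisioned by the control plane.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain(certfile=CERT_PATH, keyfile=KEY_PATH)
server.socket = ctx.wrap_socket(server.socket, server_side=True)
server.serve_forever()
```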
However, as the data exposed by the http server may not be considered particularly sensitive, it may be acceptable to expose the metrics server without auth or TLS, which would avoid a lot of complexity.
#### RBAC changes
No RBAC changes are required.
#### SDK changes: instrumenting the runtime
We propose adding the same new `TransformersTrainer` to the kubeflow-sdk as in Proposal 1, with the following differences:
- an additional parameter on the `TransformersTrainer` will allow users to specify the monitoring port.
- the `KubeflowTrainerStatusCallback` injected into the runtime code will write the metrics data to the shared file in the required format rather than making a web request.
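
For illustration, the callback might look roughly like the following sketch built on the `transformers.TrainerCallback` API (the environment variable name and default path are hypothetical; the real SDK implementation may differ). It reuses the atomic-write pattern described above:

```python
import json
import os
import tempfile

from transformers import TrainerCallback  # the runtime image already ships transformers

class KubeflowTrainerStatusCallback(TrainerCallback):
    """Writes the current trainer status to the shared file instead of POSTing it."""

    def __init__(self, status_file=None):
        # Path injected by the control plane (the variable name is hypothetical).
        self.status_file = status_file or os.environ.get(
            "KUBEFLOW_TRAINER_STATUS_FILE", "/kubeflow/status/trainer-status.json")

    def on_log(self, args, state, control, logs=None, **kwargs):
        status = {
            "currentStep": state.global_step,
            "totalSteps": state.max_steps,
            "currentEpoch": int(state.epoch or 0),
            "totalEpochs": int(args.num_train_epochs),
            "metrics": [{"type": "train",
                         "values": {k: str(v) for k, v in (logs or {}).items()}}],
        }
        self._write(status)

    def _write(self, status: dict) -> None:
        # Atomic replace: temporary file followed by a rename, as described above.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.status_file))
        with os.fdopen(fd, "w") as f:
            json.dump(status, f)
        os.replace(tmp, self.status_file)
```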
## Comparison of the two proposals

| | Proposal 1: Push-based | Proposal 2: Pull-based |
| --- | --- | --- |
| **User experience** | Simple user experience. No additional config; users opt in entirely through runtime code. Easier to debug problems. | Harder to configure, e.g. determining which pod to inject the sidecar into. Harder to debug (e.g. sidecar logs). The wait for collecting the final metrics may be frustrating for users who are interactively experimenting with train jobs. |
| **Security** | New external endpoint creates an additional threat route for the control plane, e.g. for accidental/malicious denial-of-service. | Securing the server with TLS requires putting secrets into the user namespace. The blast radius from compromised secrets is smaller. |
| **Robustness** | Users must ensure failed web requests are handled and do not terminate the training loop. | Errors are less likely to terminate the training loop. |
| **Complexity and maintenance** | Significant complexity in the endpoint auth mechanism, but the complexity is managed once in the control plane. | Significant complexity in ensuring the final status is scraped. This complexity needs to be handled differently by each training runtime. |
| **Compatibility** | Compatible with any k8s version. | Relies on sidecar containers, which require k8s 1.33+ (beta in 1.29+). |
| **Flexibility** | Highly flexible. Can support any framework that supports runtime instrumentation (e.g. using callbacks). | Highly flexible. Can support any framework that supports runtime instrumentation (e.g. using callbacks). |
| **Scalability** | Should be highly scalable to thousands of simultaneous train jobs. Each train job should call the endpoint relatively infrequently. | Highly scalable. The control plane can scrape the training status on a best-effort basis. |
## Other considered alternatives
This section describes other approaches that were evaluated and the rationale for not selecting them.
### Runtime updates the TrainJob status directly using the Kubernetes API

Cons:
- The trainer pods cannot use the default service account. The control plane would need to automatically create a service account with the required permissions for a train job, or users would need to provide a service account and ensure it has the necessary permissions.
- The trainer runtime requires a Kubernetes client to be available, meaning it must either be pre-installed in the runtime or installed/injected at runtime.
### Exposing metrics via MLFlow or Prometheus
The runtime is instrumented with an MLFlow or Prometheus client which tracks and exposes the metrics. The controller manager reads the metrics from MLFlow or Prometheus and updates the custom resource.