feat(docs): KEP-2779: Track TrainJob progress and expose training metrics #2905
Conversation
### Runtime: emitting trainer status messages

The "primary" training runtime pod will write trainer status messages to stdout in the following format:
We should consider taking inspiration from / aligning with the metricsCollectorSpec mechanism from the Katib API.
We can prioritize support for the StdOut collector type for the first implementation, but that'll give the API enough flexibility and consistency across components.
Hey @astefanutti, please correct me if I'm wrong here:
As described in Katib's docs on its natively supported metrics collector approaches (pull/push based): https://www.kubeflow.org/docs/components/katib/user-guides/metrics-collector/
There are three collector types (StdOut, TensorFlowEvent, and file-based).
So for the initial phase we could start by adding a metricsCollectorSpec.kind field to our API now, with simple regex parsing handled by a metrics collector sidecar, something like this, which is an exact match for Katib's approach:
```yaml
spec:
  metricsCollectorSpec:
    kind: StdOut
    source:
      filter: "([\w|-]+)=([+-]?\d+\.?\d*)"
```
and then we can introduce file-based, TensorFlowEvent, and custom metrics collector approaches in later phases?
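As a rough illustration, a StdOut-style collector applying the filter above to a trainer log line might look like the sketch below. The sample log line and parsing code are assumptions for illustration, not taken from Katib's implementation:

```python
import re

# The filter pattern from the example above; each match yields a (name, value) pair.
FILTER = r"([\w|-]+)=([+-]?\d+\.?\d*)"

line = "epoch=3 loss=0.8214 eval_accuracy=0.91"  # hypothetical trainer stdout line
metrics = {name: float(value) for name, value in re.findall(FILTER, line)}
print(metrics)  # {'epoch': 3.0, 'loss': 0.8214, 'eval_accuracy': 0.91}
```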
@abhijeet-dhumal exactly. I suggest the KEP define the API accordingly, but not all the collector types have to be supported in the first implementation.
Yeah, that's a really awesome insight. Thanks @astefanutti 🙌
We will update the proposal promptly!
cc: @robert-bell
Ah I wasn't aware of the metricsCollectorSpec in Katib. Thanks @astefanutti for pointing it out!
Are you thinking about aligning with the Katib approach for consistency and flexibility, or about making it easier for train jobs to send metrics to Katib (e.g. the same metric collection approach can be used by both Katib and KFT)? Or possibly both or something else entirely.
If making it easier for Katib to consume the metrics, would it make sense to consider extending Katib to watch the TrainJob resource instead, possibly from a sidecar or maybe directly in the Katib control plane? The control plane approach could be particularly appealing for user experience as it would mean they would not need to modify their training code at all (provided they use a supported runtime), or depend on the Katib SDK. The train job service account would also not need any additional RBAC permissions.
If you're thinking about consistency and flexibility, do you have an example in mind where the extra flexibility would benefit, or where the proposed approach would be limiting? I agree there's a need for flexibility, but I wonder whether that flexibility can be moved entirely into the runtime code? This may be easier to maintain and extend to add support for other ML frameworks, and would allow us to keep the KFT controller implementation simpler. We could even package some helper utilities to a python package (e.g. kubeflow-utils) to help users instrument their code.
Yes, I'm thinking metrics collection should become the responsibility of the trainer / TrainJob API so these are exposed in a standard way for Katib to consume.
The Trainer would collect metrics from runtimes according to the metricsCollectorSpec pattern and expose them in a standard way, e.g. in the TrainJob status. The Katib trial controller would then watch the TrainJob to retrieve the metrics related to the experiment. This would allow Katib to operate without a database.
The metricsCollectorSpec pattern could be configured in the ClusterTrainingRuntime in a way that's compatible with the runtimes. For custom trainers, the SDK would be responsible for customizing it depending on the provided trainer.
Thanks for the clarification @astefanutti. I've looked at Katib's metricsCollectorSpec and thought about it in more detail, and I don't think we need to or should use the same pattern here.
Rather than putting flexibility in the control plane, we're proposing that we use the runtime pods to get flexibility. Each different ML framework would have different instrumentation code, but all frameworks would communicate using the same "protocol" (e.g. the specific log messages). This means that we don't need any flexibility in the control plane.
The advantages of using code in the runtime are:
- much more flexibility than can be provided by metricsCollectorSpec,
- it's much easier for data scientists to consume: configuration is all in code, rather than split across a YAML file and code,
- the control plane can be kept simpler. In particular, I think Katib needs a sidecar to collect the metrics.
To make it easier for users to instrument their code, we can leverage the SDK to automatically instrument training jobs submitted through it if the framework supports it (e.g. for frameworks that have callback mechanisms).
For frameworks that don't support callbacks, or for users that aren't creating jobs via the SDK, we could provide a runtime utility library that implements the protocol and lets users add the instrumentation themselves (e.g. it could be embedded in their runtime images); see the sketch after this comment. Or they are free to just not instrument their training loop.
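To make the idea of a shared "protocol" concrete, here is a minimal sketch of what such a runtime helper could look like. The message prefix, field names, and helper name are hypothetical; the KEP does not define this exact format.

```python
import json
import sys

def report_trainer_status(progress_percentage, train_metrics=None):
    """Emit one structured status line that a collector or the control plane could parse."""
    payload = {
        "progressPercentage": progress_percentage,
        "trainMetrics": train_metrics or {},
    }
    # A fixed, greppable prefix followed by JSON keeps the "protocol" framework-agnostic.
    print("KUBEFLOW_TRAINER_STATUS " + json.dumps(payload), file=sys.stdout, flush=True)

# Example: report 45% progress with the latest training loss.
report_trainer_status(45, {"loss": 0.82})
```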
@robert-bell thanks that sounds great!
I should have expressed the requirement for flexibility rather than jumping directly to the implementation details :)
I haven't included something like metricsCollectorSpec in the updated approach, but I think the updated design gives us all the flexibility we need without it. @astefanutti do take a look.
```go
}

type TrainJobTrainerStatus struct {
```
So #2909 and this PR seem to conflict with each other.
In this one, we want to make TrainJob a general solution for deep learning training.
But in #2909, we are hoping to make TrainJob a general solution for HPC workloads (not just deep learning frameworks).
cc @vsoch
I'm not sure it's blocking, as this project is called TrainJob, but I do want to call this out as a discrepancy.
I think the intention is to better support training jobs that use HPC technologies (MPI, topology-aware and fine-grained scheduling, etc.). It's no more strange than having the MPI plugin.
@kannon92 is the conflict that #2909 would allow TrainJob to be used for more general HPC/non-training workloads, but this proposal is specifically for model training workloads?
If so, I'm not sure it's a problem because, as you say, TrainJob is designed for model training. Plus, the extra status information we're proposing is optional, so users can still use TrainJob for more general HPC workloads if they want.
Also just to clarify, this proposal isn't limited to just deep learning frameworks. Other training frameworks, e.g. XGBoost, LightGBM, should be able to use it too.
I agree with @robert-bell; the metrics status might also be very useful for HPC jobs. We just need to figure out how to make these statuses agnostic to the job you run.
On the other hand, TrainJob's APIs are more appropriate for the training jobs (e.g. trainer, initializer, potentially exporter/checkpointer cc @rst0git).
At some point, we need to discuss whether these abstractions make sense for HPC jobs or we need to re-think something.
These should map quite well to HPC use cases! We tend to have Figures of Merit (FOM) that are derived from different steps or stages. There are often multiple FOM. The metrics look relevant here with the exception of epoch - not everything you run in HPC is going to have an epoch, but AI/ML jobs run with MPI will.
Is there a reason to hard-code metric types (e.g., TrainMetric vs EvalMetric) versus having a common metric type, and then within that type, an optional variable for the context? That would be more flexible for different kinds of applications, I think, and still work well for train/eval.
That could work! I was also thinking:
```yaml
trainingStatus:
  metrics:
  - name: loss
    value: 0.8
    type: train
  - name: auroc
    value: 0.7
    type: train
```
I like your design because finding a type is an O(1) lookup instead of needing to loop through an entire list (which could be large).
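As a small illustration of that trade-off, a client can still get O(1) lookups from a flat name/value/type list by indexing it once. The snippet below is an assumed client-side sketch, not part of the proposed API:

```python
# Status metrics as they might appear in the TrainJob status (illustrative values).
metrics = [
    {"name": "loss", "value": "0.8", "type": "train"},
    {"name": "auroc", "value": "0.7", "type": "train"},
    {"name": "accuracy", "value": "0.74", "type": "eval"},
]

# Index the list once so later reads are O(1) per metric.
by_type = {}
for m in metrics:
    by_type.setdefault(m["type"], {})[m["name"]] = float(m["value"])

print(by_type["train"]["loss"])  # 0.8
```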
@vsoch I've updated the schema to the metrics group approach.
@robert-bell @vsoch @astefanutti Shall we consider a list instead of a map, as @vsoch proposed?
I understand that a map gives us O(1) access, but it has some other problems in Kubernetes API design.
Additionally, it would be hard to do strategic merge patches on maps.
Maybe @tenzen-y, @kannon92, or @ahg-g have any insights into why we introduced a slice for the ReplicatedJobs API in JobSet: https://github.com/kubernetes-sigs/jobset/blob/main/api/jobset/v1alpha2/jobset_types.go#L245-L248
Kubernetes API conventions discourage the use of maps.
Kube API Linter even has a nomaps check.
Pull Request Test Coverage Report for Build 21214073338
Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls
szaher
left a comment
Thanks very much @robert-bell @abhijeet-dhumal.
This is a great proposal; it brings a lot of value:
- Excellent user experience: the goal is perfect. A data scientist seeing PROGRESS: 45% and ETA: 15m is a massive win.
- No dependency on external components like Prometheus or sidecars
- Nice SDK integration with the transformers library
But it has a few problems too:
- Architectural anti-pattern: the control plane accesses the data plane and streams logs
- While the footprint of a goroutine is minimal on the operator itself, it abuses the control plane by maintaining a constant GET request to /api/v1/namespaces/.../pods/.../log?follow=true for every single active TrainJob (so scalability to 1000 jobs is questionable)
- Broad security permissions for the training operator to access cluster-wide pod logs
- Error prone: if training is verbose, some datasets might have very long example lines, or there might be extremely long log messages
Another proposed solution is as follows:
Enable the KubeFlowCallBack to send status to the TrainJob directly via the kube API. When the operator starts a new training job, it creates a service account, injects it into the master/primary pod, and lets the callback report the status in a clean way and patch or update the TrainJob directly with the required status updates (see the sketch after this comment).
This approach is much cleaner:
- No log parsing
- No goroutines
- No impact on the kube-apiserver from fetching and streaming logs
- Only the master/primary pod can update the status; worker pods check whether they have the right permissions and skip updating the status if they don't
- Lets the train job itself update its own status
Overall, this is excellent and I think we should push it forward.
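For concreteness, below is a rough sketch of what that alternative could look like from inside the primary pod. It assumes the injected service account has RBAC for the trainjobs/status subresource; the API group/version, namespace, field names, and TrainJob name are illustrative, not taken from the KEP.

```python
from kubernetes import client, config

# Use the service account token the operator injected into the primary pod.
config.load_incluster_config()
api = client.CustomObjectsApi()

# Patch only the trainer status on the TrainJob this pod belongs to.
api.patch_namespaced_custom_object_status(
    group="trainer.kubeflow.org",
    version="v1alpha1",          # assumed API version
    namespace="team-a",          # hypothetical namespace
    plural="trainjobs",
    name="my-train-job",         # hypothetical TrainJob name
    body={"status": {"trainerStatus": {"progressPercentage": 45}}},
)
```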
### Goals

1. **Expose real-time progress information and training metrics through the TrainJobs CR.** For example: percentage complete, estimated time remaining, current step/epoch, total steps/epochs, eval metrics.
I would keep training and eval metrics outside the status, as data scientists have dedicated tools for this like MLflow, TensorBoard, wandb, etc.
Also, it seems the proposed solution will capture only the last value from the logs, which isn't too helpful for the data scientist or ML persona, who will prefer something like TensorBoard to understand what happened during their training process.
This is unlike Katib or HPO, where we have an optimization problem and the final metrics are important.
Also, it seems the proposed solution will capture only the last value from the logs which isn't too helpful for the data scientists or ML persona who will prefer something like tensorboard ...etc. for understand what happened during their training process.
I think the main target of this KEP is client integration, e.g. a CLI (e.g. kubectl), a GUI, or Katib, that would consume those metrics, not the data scientist persona, and it's indeed not the goal to overlap with tools like TensorBoard.
This is unlike Katib or HPO where we have an optimization problem and final metrics are important.
Exactly, Katib would indeed be a candidate client to consume those metrics to carry out the optimizations.
I've updated the non-goals in the second draft to try to clarify that this feature is not replacing these other tools, but is trying to provide something that works out of the box without needing any additional components installed.
### Story 2: Data Scientist / ML Engineer monitoring training jobs

As a data scientist or ML Engineer, I want to see real-time information about my training jobs so I can make decisions on whether those jobs are likely to succeed or whether intervention is required.
Will the controller always stream the logs to provide real-time information?
- How does that scale?
- Is this design an anti-pattern, since the control plane accesses the data plane? The control plane's job is to make decisions. It includes the API server, schedulers, and your controller/operator. Its job is to "establish and enforce policy." It is not designed for high-throughput, high-volume data processing; it's designed for state reconciliation.
It's a fair point that the log streaming approach is mixing control plane with data plane, though I wonder what the max throughput would be in practice? I'd anticipate that applications would add at most a few lines to the logs per second, meaning throughput from a single job would be ~kB/s, and throughput from 1000 active train jobs may be ~MB/s, which should not be particularly onerous on the controller (see the rough numbers below). This is something that could be benchmarked, of course.
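A quick back-of-envelope check of those figures, with assumed (not measured) numbers:

```python
lines_per_second = 5    # assume a fairly chatty trainer
bytes_per_line = 200    # assume ~200 bytes per log line

per_job = lines_per_second * bytes_per_line   # 1,000 B/s  ~= 1 kB/s per job
cluster = per_job * 1000                      # 1,000,000 B/s ~= 1 MB/s for 1000 active jobs
print(per_job, cluster)
```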
The new http request based approach should hopefully address your scalability concerns.
```python
trainMetrics: dict[str, float] | None
evalMetrics: dict[str, float] | None
```
Those are very generic and not well defined. What metrics are we expecting? Since we are going to develop the callback in the SDK, should this be more well defined or structured, to identify which metrics will be available?
I believe the last value of the training metrics may not be of much value to data scientists, who are more interested in how a specific metric is behaving over the training period. Also, this overlaps with things like MLflow, TensorBoard, etc., which we already have a proposal to include in Kubeflow: kubeflow/community#892
progressPercentage, estimatedRemainingSeconds, currentEpoch, etc. already give very good information about the training process, so I suggest we get rid of trainMetrics and evalMetrics in favor of MLflow.
I believe the last value of the training metrics may not be of much value to data scientists, who are more interested in how a specific metric is behaving over the training period. Also, this overlaps with things like MLflow, TensorBoard, etc., which we already have a proposal to include in Kubeflow here: kubeflow/community#892
As discussed in the comment above, the main audience / target of this KEP is not the data scientist persona, and it's indeed not the goal to overlap with tools like TensorBoard nor with the scope of experiment tracking.
The main target of this KEP is client integration, e.g. a GUI or Katib that would consume those metrics.
@szaher I've updated the goals/non-goals to clarify that we're not trying to replace MLFlow. Does that address your concerns?
To add to this analysis, given the TrainJob "primary" Pods run arbitrary user code and extra packages, granting these Pods privileged access to the API server would warrant hardened security sandboxing.
Granting permissions so the train job itself can directly update the training status comes with a few more downsides:
Hi folks, thanks again for this effort! This feature should significantly improve the experience of AI workloads on Kubernetes, and has a lot of future potential for checkpointing, scheduling, etc. 🎉
@astefanutti @robert-bell Do we need to update the KEP based on our conversation with @astefanutti and @tenzen-y at KubeCon, so we can review and move this forward?
cc @romanbaron @EkinKarabulut @Ronkahn21 As we discussed, this should address the problems with detecting the running status of a TrainJob mentioned by Ron in this talk: https://sched.co/28D8Z
Hey @andreyvelich! Thanks for chasing, and apologies for the silence.
I'm working on an update to bring in things from that discussion. I'm aiming to find some time this week.
Hi folks, @astefanutti @andreyvelich @tenzen-y, I've updated the proposal with a new design based on the discussions from KubeCon, and looked to incorporate some of the ideas from the first draft. The changes are put into separate commits to make it easier to see what's changed. The main change is that I've replaced the log watching mechanism with HTTP calls. I've actually made two variants of the proposal - a push-based one (runtime sends to control plane) and a pull-based one (control plane scrapes from runtime) - because they're similar but have different trade-offs. We only need one of the approaches though, and I thought this might be the easiest way to weigh up the merits of each and help us reach consensus on how best to move forward. The other changes I've made are:
Please let me know if any extra info is needed to help this move forward.
```go
// for the metric and the value is the corresponding numeric value serialized
// as a string.
// +optional
Values map[string]string `json:"values,omitempty"`
```
@vsoch following up on the other thread, I just wanted to highlight I'm making an assumption here that we only need to support scalar metrics - we can't (directly) support non-scalar things like confusion matrices etc.
Should we be considering more structure here to let us support non-scalar metrics in the future? E.g.
```yaml
values:
  loss:
    type: scalar
    value: 0.9
  confusion_matrix:
    type: ConfusionMatrix
    # ... more complicated struct
```
I know TFX has done some work here, but it looks like it quickly gets very complicated -- see tf-metadata or tfma.
If we don't consider it now, though, it may be hard to add support for it later.
andreyvelich
left a comment
Thanks for the updates @robert-bell!
I left a few comments.
Let's chat more about it during today's call if you can join: https://bit.ly/2PWVCkV
| """ | ||
| read -r -d '' SCRIPT << EOM\n | ||
| import os | ||
| from transformers import TrainerCallback, trainer |
Do we have an open issue to introduce this callback to transformers: https://huggingface.co/docs/transformers/en/main_classes/callback#callbacks?
I am curious whether we should call this callback TrainerCallback, KFTrainerCallback, or KubeflowCallback?
Can this callback be useful for other jobs like OptimizationJobs?
Yeah I think having a callback in transformers would be ideal.
I'd been holding off starting a discussion there until we'd got a rough consensus here on the general approach to take, particularly the idea of introducing these framework-specific custom trainers. I'd be keen to open an issue there though.
In terms of names, I'd gone with KubeflowTrainerStatusCallback but I appreciate it's a mouthful! Just FYI TrainerCallback is the name of the base class in the transformers library so we shouldn't use that.
I prefer we name it KubeflowCallback or KubeflowTrainerCallback, since we might also want to use it for checkpointing in the future (cc @kramaranya @rst0git)
Are we planning to use this callback for other things, like for HP algorithms (e.g. Hyperband) or early stopping?
I know that sometimes algorithms make decisions during Trial execution.
Yeah I don't have a strong opinion on the naming. I'd be happy to use either KubeflowCallback or KubeflowTrainerCallback.
I wonder whether from a maintenance perspective we might want to have separate callbacks for each bit of functionality (status, checkpointing, early-stopping)? We could for sure wrap them up in a single "main" callback and allow users to enable/disable each callback individually.
I wonder whether from a maintenance perspective we might want to have separate callbacks for each bit of functionality (status, checkpointing, early-stopping)?
Could we ask the HuggingFace community about that? I feel that starting with a unified KubeflowCallback might be easier.
Additionally, we can use kwargs to configure the callback in the way we want: https://github.com/huggingface/transformers/blob/main/src/transformers/trainer_callback.py#L332
Additionally, I know that Transformers has other integrations for HP search, for example: https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/integration_utils.py#L296C5-L296C22
I prefer we name it KubeflowCallback or KubeflowTrainerCallback, since we might also want to use it for checkpointing in the future
Sounds like a good idea.
1. **Expose real-time progress information and training metrics through the TrainJobs CR.** For example: percentage complete, estimated time remaining, current step/epoch, total steps/epochs, eval metrics.
2. **Have zero dependencies.** The user should not need to install any additional components into their cluster for the feature to work. It should work "out-of-the-box".
3. **Optional.** Users can choose to opt in to providing progress tracking, but are not required to use this feature.
Are we planning to control it via Trainer config or TrainJob API?
For the push-based approach, it'd be controlled by the runtime - the user would opt in by instrumenting their code. The user wouldn't need to configure anything on the Trainer or TrainJob.
For the pull-based approach, users would need to instrument their runtime and also configure their (Cluster)TrainingRuntime so the control plane knows which pod to inject the sidecar into and then scrape, so there'd be an extra step. We could set up this config for the standard ClusterTrainingRuntimes.
For the pull-based approach, users would need instrument their runtime and also configure their (Cluster)TrainingRuntime so the control plane knows which pod to inject the sidecar and then scrape
This could be done in the SDK via PodTemplateOverride by each specific trainer.
Yeah, that'd be a nice approach and would keep the implementation clean.
We might need to update the PodTemplateOverrides implementation, because some of the options can't currently be used more than once. For example, SpecLabels doesn't merge the labels; it only keeps the last label set applied, so we might either overwrite user-applied label overrides, or the user-applied label overrides might overwrite the progress-tracking ones.
For the push-based approach, it'd be controlled by the runtime - the user would opt in by instrumenting their code
Do you mean we will always serve the endpoint, and when this endpoint is called, we will update the TrainJob status?
This could be done in the SDK via PodTemplateOverride by each specific trainers.
We talked about this yesterday with @robert-bell; that might not always work. Since we use a single Job template for the torch-distributed runtime, we can't inject the sidecar only into the MASTER pod (e.g. Job index = 0).
Do you mean we will always serve the endpoint, and this endpoint is called, we will update TrainJob status?
Yeah, that's right. The endpoint would always be available. It's essentially the same model as mlflow - the server is always there available, and the client submits the metrics whenever it wants to update them. It's the same server that's used by all TrainJobs in the cluster.
We talked yesterday with @robert-bell that might not always work. Since we use single Job template for torch-distributed runtime, we can't inject sidecar only to the MASTER pod (e.g. Job index = 0).
Just to clarify - I think it may still be possible to make it work, e.g. maybe the torch-distributed runtime has to be changed so there are two Jobs in it - one for the master pod which gets the sidecar, and another for all the other workers. I'm concerned about the overhead this'd add, though, to creating runtimes, especially if users want to create their own runtimes.
Yeah, that's right. The endpoint would always be available. It's essentially the same model as mlflow - the server is always there available, and the client submits the metrics whenever it wants to update them.
I am wondering if we should make this feature optional, and enabled by default in the config: https://github.com/kubeflow/trainer/blob/master/manifests/base/manager/controller_manager_config.yaml
We can also think about feature gates flags, like we do in JobSet (cc @kannon92 @astefanutti @tenzen-y): kubernetes-sigs/jobset#1096
We should introduce FGs after beta graduation regardless of this feature.
Currently, Trainer is in the alpha stage, so all features can be considered alpha, but after beta graduation we want to have feature stages.
I'd be happy to put this behind a feature gate, using the same approach as in JobSet. It makes sense to me to add the scaffolding for FGs sooner rather than later, so it's there and ready to be used by other features.
When it comes to the implementation, would folks prefer a separate PR for the FG, or to include it in the same PR that adds the progress tracking?
I am fine with introducing the FG after we implement this feature. As @tenzen-y mentioned, it would make sense to introduce it before we graduate the Trainer API to beta.
1. **Expose real-time progress information and training metrics through the TrainJobs CR.** For example: percentage complete, estimated time remaining, current step/epoch, total steps/epochs, eval metrics.
2. **Have zero dependencies.** The user should not need to install any additional components into their cluster for the feature to work. It should work "out-of-the-box".
3. **Optional.** Users can choose to opt in to providing progress tracking, but are not required to use this feature.
4. **Provide built-in progress tracking support for selected ML frameworks (e.g. transformers, pytorch-lightning) in the kubeflow sdk.** Data scientists should be able to use the kubeflow sdk to create training jobs using these frameworks, and have progress tracking automatically instrumented.
Shall we add a goal to introduce callbacks to third-party ML libraries like HuggingFace transformers to support our instrumentation?
Yes, I think this should be a goal. I'll update it in a batch with any other changes.
```go
// Empty if the status is unknown, e.g. the job has just started
// or the job is not instrumented to report its status.
// +optional
TrainerStatus *TrainJobTrainerStatus `json:"trainerStatus,omitempty"`
```
What do we think about TrainerProgress API, since we can introduce InitializerProgress in the future?
Are you thinking of renaming TrainerStatus => TrainerProgress? My thinking was that "progress" didn't really cover metrics, so status might be a better name.
We could have InitializerStatus contain the initializer progress. Wdyt?
Agree, let's finalize the API design here: #2905 (comment)
#### Security considerations

The control plane endpoint will be secured with TLS and token-based authentication.
@astefanutti @robert-bell Did you get a chance to check the in-place restart KEP from @GiuseppeTT ?
https://github.com/kubernetes-sigs/jobset/tree/main/keps/467-InPlaceRestart#implementation---permissions-for-the-agent-sidecar
Wondering if we can re-use similar ideas with ValidatingAdmissionPolicy
Thanks for sharing this - I wasn't aware of the ValidatingAdmissionPolicy api!
I've had a look but I'm not sure how we could secure this completely: we could use a ValidatingAdmissionPolicy to restrict it so only the trainer status field can be updated, but I'm not sure how we can restrict it so that nothing else running in the namespace can update the field? I think we'd need the train jobs to run using custom service accounts which is something I was trying to avoid for the reasons I outlined in this comment.
I might have misunderstood something in the ValidatingAdmissionPolicy api though.
```yaml
- configmap:
    name: <train-job-name>-tls-config
    items:
    - key: ca.crt
      path: ca.crt
```
Can other pods in this namespace re-use this volume while the TrainJob is running?
If yes, that might allow an attacker to re-use the certificate in other pods to call the system endpoint.
This is just the public CA cert, so the runtime trusts the control plane endpoint's TLS encryption. It's not an issue if other pods have access to it; they can't do anything with it.
```python
from urllib import request
import os
import ssl

payload = ...

url = os.environ["KUBEFLOW_TRAINER_STATUS_URL"]
ca_file = os.environ["KUBEFLOW_TRAINER_STATUS_CA_CERT"]
token = open(os.environ["KUBEFLOW_TRAINER_STATUS_TOKEN"]).read()
ssl_context = ssl.create_default_context(cafile=ca_file)
req = request.Request(url, data=payload, headers={"Authorization": f"Bearer {token}"})
# Note: urllib.request.urlopen takes the SSL context via the `context` keyword.
request.urlopen(req, context=ssl_context)
```
Can this instrumentation live in the Kubeflow SDK?
Looking at other callbacks, they do the same.
Do you mean in the transformers library? If so, then yes, it would be great to have it there!
I'm not sure it belongs in the Kubeflow SDK though - I was deliberately trying to avoid requiring users to install the SDK into their runtime images, to avoid the extra dependency and potential confusion (e.g. a different SDK version being installed client side than runtime side).
I feel like we still need to install kubeflow as a dependency of the transformers library.
Similar to other integrations, we might need to introduce a TrainerClient().is_available() method which checks that the correct version of the control plane is installed.
We discussed with @kubeflow/kubeflow-sdk-team that we want to introduce a ConfigMap that shows the version of the control plane, and the SDK should check that it is available on the cluster.
Otherwise, instrumentation might fail due to incompatible versions between Application (PyTorch) and Control Plane (Trainer controller).
Could we use the presence of environment variables injected by the control plane to determine whether kft is available? That would avoid the dependency on kubeflow and I think would be more reliable than checking in the sdk (we need to make sure the instrumentation in the runtime environment is compatible with the control plane).
We could also inject an environment variable containing the control plane version, or a version of the progress API that's supported. The check would maybe be something like this:
```python
def is_kubeflow_trainer_available():
    return os.environ.get("KUBEFLOW_TRAINER_STATUS_API_VERSION") == "1"
```
I think if we can we should try to avoid requiring kubeflow to be installed in the runtime because of the risk of confusion for users (e.g. it'll be confusing to debug if the runtime has a different kubeflow version than what the user has locally) and the risk of dependency conflicts with the runtime code.
Could we use the presence of environment variables injected by the control plane to determine whether kft is available?
But that will require you to inject the version of the control plane into the TrainJob, won't it?
I think if we can we should try to avoid requiring kubeflow to be installed in the runtime because of the risk of confusion for users
But why don't we want to control the instrumentation in the kubeflow SDK itself? If we don't do this, we will have to always update the Transformers library for any changes. From my understanding, if users want to use transformers with KubeflowCallback, they must install it as pip install transformers[kubeflow], which will install the kubeflow SDK in any case.
Same as for other integrations: https://github.com/huggingface/transformers/blob/main/setup.py#L248
But why don't we want to control the instrumentation in the kubeflow SDK itself?
As I say, my concern is about needing the sdk installed in the runtime.
I know KFP are trying to remove the need to have the kfp sdk installed at runtime - see kubeflow/pipelines#9886 and kubeflow/pipelines#12081 (comment) - and I'm conscious that we may ultimately feel similar pains to what they're feeling, especially if the intention is for the kubeflow sdk to support a broader set of kubeflow components.
That all said, I don't have a strong objection to what you're proposing. It's certainly a pragmatic and easier approach to get things off the ground.
I guess the downside of not introducing this instrumentation into the Kubeflow SDK is that we would have to always update the transformers library to fix/change the callback and release a newer version, right? So users will have to update the transformers package in any case.
Could we check how other callbacks solve this issue in transformers?
```python
from transformers import TrainerCallback, trainer
from urllib import request

class KubeflowTrainerStatusCallback(transformers.TrainerCallback):
```
Why can't this live directly in the HF Transformers library?
We can contribute our custom callback to their library.
Yeah, I think we should aim to get this to live in Transformers, and do something similar in other frameworks if we can.
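For reference, here is a minimal sketch of what such a callback could look like, assuming the push-based status endpoint from this KEP. The class name, payload shape, and env var follow the current draft but may change, and the TLS/token handling shown earlier is omitted for brevity:

```python
import json
import os
from urllib import request

from transformers import TrainerCallback


class KubeflowTrainerStatusCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        url = os.environ.get("KUBEFLOW_TRAINER_STATUS_URL")
        # Only the primary process reports, and only when the endpoint is configured.
        if not url or not state.is_world_process_zero:
            return
        payload = {
            "progressPercentage": round(100 * state.global_step / max(state.max_steps, 1)),
            "trainMetrics": {k: v for k, v in (logs or {}).items() if isinstance(v, (int, float))},
        }
        req = request.Request(
            url,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        request.urlopen(req)
```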
* Adding a new "transformers-distributed" ClusterRuntime which will be included in the default set of cluster runtimes included in the manifests.
* Publish new docker images for the "transformers-distributed" runtime "ghcr.io/kubeflow/trainer/transformers-runtime". The docker image will include the transformers, accelerate and torch python packages.
We need this just to install the transformers library, right?
Yes that's what I was thinking, though I'm not sure how helpful this is given users can just install the package at runtime. I'd be happy to drop this.
/ok-to-test
```go
//
// +kubebuilder:validation:Minimum=0
// +optional
EstimatedRemainingSeconds *int64 `json:"estimatedRemainingSeconds,omitempty"`
```
Is there any assumption that justifies using int64 instead of int32?
I went with int64 to be on the safe side, but I'll happily update it to int32.
int32 gives a max remaining time of ~68 years, which should be plenty big enough, but my thinking was users might see overflow for badly configured jobs - e.g. the user picks 25k train steps and the first step takes 24 hrs, so the estimated remaining time would be ~68 years (see the quick check below).
The +kubebuilder:validation:Minimum=0 tag should mostly guard against overflow though, so I'm happy to simplify it to int32.
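A quick worked check of that example (numbers taken from the comment above):

```python
steps = 25_000
seconds_per_step = 24 * 60 * 60                   # 86,400 s (one step per day)
estimated_remaining = steps * seconds_per_step    # 2,160,000,000 s, roughly 68.5 years
print(estimated_remaining > 2**31 - 1)            # True: just above the int32 max of 2,147,483,647
```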
@tenzen-y Do you see any concerns with int64? I can see that core API also uses it: https://github.com/kubernetes/api/blob/master/core/v1/types.go#L4189
For integer fields, prefer int32 to int64 unless you need to represent values larger than int32. See other guidelines about limitations of int64 and language compatibility.
It's hard to imagine situations where we would need to record a value of over 68 years.
That makes sense, thanks for sharing! I am fine with int32 in that case.
TBH, even int64 won't help us solve the problem with badly configured jobs, since we can't predict what values users will set.
```go
// for the metric and the value is the corresponding numeric value serialized
// as a string.
// +optional
Values map[string]string `json:"values,omitempty"`
```
We should avoid maps in CRDs as much as possible.
```
values: [
  {
    key: "xyz",
    value: "abc",
  },
],
```
Or
```
values: [
  {
    name: "xyz",
    value: "abc",
  },
],
```
Thanks @tenzen-y I'll update it to the name/value approach.
#### Sidecar container injection

To enable monitoring and sidecar injection, users must add the label `trainer.kubeflow.org/trainjob-monitoring-step: trainer` to one of the replicated jobs in their (Cluster)TrainingRuntime. This will cause the control plane to:
- inject a [sidecar container](https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/) into **one** of the trainer pods
How do we specify resource requests for the metrics collector sidecar container?
Perhaps we could get away with using the same request for all train jobs and reading the values from the controller ConfigMap? Or we could add something to the TrainJob spec, or let users override a default using annotations, but that adds more complexity.
What are your thoughts on the push-based approach instead? That approach doesn't need any config like this.
Hey @andreyvelich @tenzen-y @astefanutti I've done another update. Main changes since last time -
I've also raised a question here #2905 (comment) which could do with your input. Please take a look, thanks.
Users can choose not to instrument their runtime, in which case no progress and metrics will be available on the TrainJob. The feature is therefore optional and opt-in.

The feature will have an associated feature gate, defaulting to "enabled". Disabling the gate will disable the http service.
Shall we start with disabled by default initially? So once this feature is well tested, we can enable it by default.
I believe this is how Kubernetes manages its features. Is that correct @kannon92?
WDYT @robert-bell @astefanutti @tenzen-y?
Also, we need to add the feature name: TrainJobProgress
FYI, here you can find implementation in JobSet: https://github.com/kubernetes-sigs/jobset/blob/main/pkg/features/features.go#L35C17-L35C32
I'm happy to make it disabled by default initially. I'll update the text.
I think we'd previously talked about introducing the feature gates after this feature: #2905 (comment). Is there another feature coming up that wants a feature gate? If not, it's probably actually slightly easier to add the feature gate scaffolding with this feature, rather than implement the feature and then retrofit it to use a feature gate later. I'm happy either way though.
If not, it's probably actually slightly easier to add the feature gate scaffolding with this feature, rather than implement the feature and then retrofit it to use a feature gate later. I'm happy either way though.
Yes, I agree. If you want, we can create a separate small PR to introduce feature gate capabilities to the Trainer config.
Similarly to JobSet.
Yeah that makes sense. I'm happy to make that PR for adding feature gates scaffolding if you'd like?
andreyvelich
left a comment
Thanks for the updates @robert-bell!
I think, we should be good to move this forward, excited to see this feature!
/lgtm
/assign @tenzen-y @akshaychitneni @astefanutti
Awesome, thanks @andreyvelich, and thank you for your detailed feedback and help moving this forward.
/lgtm

/lgtm Awesome work @robert-bell @abhijeet-dhumal, thanks!

/retest
tenzen-y
left a comment
Overall LGTM
```go
// name is a user-defined label for the metric, e.g. "loss", "eval_accuracy".
// +kubebuilder:validation:MinLength=1
// +required
Name string `json:"name,omitempty"`
```
Suggested change:
```diff
- Name string `json:"name,omitempty"`
+ Name string `json:"name"`
```
Why is omitempty there even though this is a required field?
It's one of the checks we've got enabled in KAL, but I don't entirely understand why it's necessary. See the KAL docs. If it's not there, KAL raises an issue.
I think in practice it doesn't make any difference because the field also has +kubebuilder:validation:MinLength=1 so the default empty value isn't valid.
Ah, that sounds good to me.
I also checked the API conventions: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#serialization-of-optionalrequired-fields
Thank you.
```go
// value of the metric. Values must be serialized as a string.
// +kubebuilder:validation:MinLength=1
// +required
Value string `json:"value,omitempty"`
```
Same question.
As above. 😄
```yaml
trainer.kubeflow.org/trainjob-monitoring-step: trainer
trainer.kubeflow.org/trainjob-monitoring-port: 28080
trainer.kubeflow.org/trainjob-monitoring-interval: 30s
```
Isn't this system metadata that stores parameters for the controller?
Typically, such system metadata should be stored in annotations instead of labels.
But I understand that this sidecar approach is just an alternative one.
Ah good catch. I've pushed an update.
@andreyvelich @astefanutti @akshaychitneni @tenzen-y I pushed a minor update in response to one of the latest comments, forgetting that it would reset the lgtm labels. Apologies. Can you please re-review? Is there anything else that we need for this? The next steps for implementing it look good and clear from my end, and we should be able to start on the implementation mid next week, assuming everyone's happy with the current plan.
/lgtm

@robert-bell Thank you for moving that forward! LGTM 👍

/hold cancel
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: andreyvelich, tenzen-y
The full list of commands accepted by this bot can be found here. The pull request process is described here.
What this PR does / why we need it:
This is a KEP for adding real-time progress and exposing training metrics on TrainJob.
Which issue(s) this PR fixes (optional, in `Fixes #<issue number>, #<issue number>, ...` format, will close the issue(s) when PR gets merged):
Part of #2779