docs/proposals/2779-trainjob-progress/README.md
+53 -29 (53 additions & 29 deletions)
@@ -63,7 +63,7 @@ We propose an approach with the following high-level **push-based** design:
Users can choose not to instrument their runtime, in which case no progress and metrics will be available on the TrainJob. The feature is therefore optional and opt-in.

-The feature will have an associated feature gate, defaulting to "enabled". Disabling the gate will disable the http service.
+The feature will have an associated feature gate `TrainJobProgress`, defaulting to "disabled". Disabling the gate will disable the http service.
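The gate behaviour described above can be sketched in a few lines of Go. This is only an illustration under stated assumptions: the `featureGates` type, the `components` helper and the component names are hypothetical, and only the gate name `TrainJobProgress` and its "disabled" default come from the proposal.

```go
package main

// featureGates is a hypothetical stand-in for the controller's feature-gate
// registry. Only the gate name TrainJobProgress comes from the proposal.
type featureGates map[string]bool

// enabled reports whether a gate is switched on. A gate that was never set
// reads as false, which matches the proposed "disabled" default.
func (g featureGates) enabled(name string) bool {
	return g[name]
}

// components lists what the manager would start for a given set of gates.
// The component names are placeholders for illustration.
func components(g featureGates) []string {
	started := []string{"controllers"}
	if g.enabled("TrainJobProgress") {
		// The progress HTTP service only exists while the gate is on.
		started = append(started, "progress-http-server")
	}
	return started
}
```

With no gates set, `components(featureGates{})` returns only the controllers entry; setting `TrainJobProgress` to `true` adds the HTTP service.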
### CRD changes
@@ -79,6 +79,10 @@ type TrainJobStatus struct {
// or the job is not instrumented to report its status.
@@ -130,6 +134,8 @@ All fields (apart from lastUpdatedTime) are optional meaning that a runtime need
The design deliberately does not make any changes to the `TrainJobSpec`: the control plane does not require any configuration. Users opt in to the training status by instrumenting their runtime pods to send the training status to the control plane.

+The design may be extended in future to add equivalent progress and metric statuses for the model and data initializer components.
+
```yaml
# Sample TrainJob example with TrainerStatus implemented
```
If the feature gate is enabled, the control plane will expose a new http server endpoint where trainer pods can submit the trainer status. The http server will be added as a new port in the existing `kubeflow-trainer-controller-manager` service.

-The endpoint will be `POST: /apis/trainer.kubeflow.org/v1alpha1/namespaces/{namespace}/trainjobs/{name}/status/trainerStatus`, where `{namespace}` and `{name}` are the namespace and name of the TrainJob.
+The endpoint will be `POST: /apis/trainer.kubeflow.org/v1alpha1/namespaces/{namespace}/trainjobs/{name}/status`, where `{namespace}` and `{name}` are the namespace and name of the TrainJob.
+
+The payload for the endpoint will have the following schema:
+```go
+type ProgressStatus struct {
+	// trainerStatus provides a summary of the status of the training
+	// part of the TrainJob.
+	// Empty if the status is unknown, e.g. the job has just started
+	// or the job is not instrumented to report its status.
The payload will be the same as `TrainJobTrainerStatus`.
+The schema uses a nested structure for future extensibility (e.g. the same endpoint could be used to receive progress updates from a data initializer or model initializer).
On receiving requests to this endpoint, the control plane will validate the source of the request (see [Security considerations](#security-considerations)) and then directly update the `status.trainerStatus` field.
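A minimal sketch of how the control plane side might handle such a request, using only the Go standard library. Everything here is an assumption for illustration: the `statusStore` type stands in for patching `status.trainerStatus` on the real TrainJob object, `authorize` is a placeholder for the source validation covered under Security considerations, and the JSON key `trainerStatus` is inferred from the field name rather than confirmed by the proposal.

```go
package main

import (
	"encoding/json"
	"io"
	"net/http"
	"strings"
	"sync"
)

const statusPathPrefix = "/apis/trainer.kubeflow.org/v1alpha1/namespaces/"

// progressStatus mirrors the nested payload. The inner object stays opaque
// because its fields belong to TrainJobTrainerStatus, not to this sketch.
type progressStatus struct {
	TrainerStatus json.RawMessage `json:"trainerStatus,omitempty"`
}

// statusStore is a hypothetical in-memory stand-in for updating
// status.trainerStatus on the TrainJob.
type statusStore struct {
	mu       sync.Mutex
	statuses map[string]json.RawMessage
}

func newStatusStore() *statusStore {
	return &statusStore{statuses: map[string]json.RawMessage{}}
}

func (s *statusStore) set(key string, status json.RawMessage) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.statuses[key] = status
}

func (s *statusStore) get(key string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return string(s.statuses[key])
}

// parseStatusPath extracts {namespace} and {name} from
// .../namespaces/{namespace}/trainjobs/{name}/status.
func parseStatusPath(path string) (namespace, name string, ok bool) {
	if !strings.HasPrefix(path, statusPathPrefix) {
		return "", "", false
	}
	parts := strings.Split(strings.TrimPrefix(path, statusPathPrefix), "/")
	if len(parts) != 4 || parts[1] != "trainjobs" || parts[3] != "status" {
		return "", "", false
	}
	if parts[0] == "" || parts[2] == "" {
		return "", "", false
	}
	return parts[0], parts[2], true
}

// handleStatus applies one update and returns the HTTP status code.
func handleStatus(method, path string, body []byte, authorize func(ns, name string) bool, store *statusStore) int {
	if method != http.MethodPost {
		return http.StatusMethodNotAllowed
	}
	ns, name, ok := parseStatusPath(path)
	if !ok {
		return http.StatusNotFound
	}
	if !authorize(ns, name) {
		return http.StatusForbidden
	}
	var payload progressStatus
	if err := json.Unmarshal(body, &payload); err != nil {
		return http.StatusBadRequest
	}
	if len(payload.TrainerStatus) > 0 {
		store.set(ns+"/"+name, payload.TrainerStatus)
	}
	return http.StatusOK
}

// newHandler adapts handleStatus to net/http, capping the body at 1 MiB.
func newHandler(authorize func(ns, name string) bool, store *statusStore) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(io.LimitReader(r.Body, 1<<20))
		if err != nil {
			w.WriteHeader(http.StatusBadRequest)
			return
		}
		w.WriteHeader(handleStatus(r.Method, r.URL.Path, body, authorize, store))
	})
}
```

Keeping the request logic in `handleStatus`, separate from the `net/http` adapter, makes the validation order (method, path, source, payload) easy to exercise without starting a server.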
The control plane does not need to be highly available: the runtime can retry the status update request with some delay whilst continuing the training, or skip the update entirely.
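The retry-or-skip behaviour on the runtime side could look roughly like the following Go sketch. It is not the proposal's client: `statusURL`, `retry`, `reportStatus`, the attempt count and the delay are all hypothetical, the `trainerStatus` JSON key is inferred from the field name, and authentication is left out entirely. For simplicity every failure is retried; a real client would probably not retry client errors.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// errRejected marks a non-2xx response from the control plane.
var errRejected = errors.New("status update rejected")

// statusURL builds the endpoint URL for one TrainJob.
func statusURL(baseURL, namespace, name string) string {
	return fmt.Sprintf("%s/apis/trainer.kubeflow.org/v1alpha1/namespaces/%s/trainjobs/%s/status",
		baseURL, namespace, name)
}

// retry calls send up to attempts times, sleeping delay between failures.
// It returns the last error so the caller can log it and keep training.
func retry(attempts int, delay time.Duration, send func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = send(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return err
}

// reportStatus posts the nested payload. trainerStatus is a generic map here
// because its fields are defined by TrainJobTrainerStatus, not by this sketch.
func reportStatus(ctx context.Context, client *http.Client, url string, trainerStatus map[string]any) error {
	body, err := json.Marshal(map[string]any{"trainerStatus": trainerStatus})
	if err != nil {
		return err
	}
	// Three attempts, two seconds apart: arbitrary values for illustration.
	return retry(3, 2*time.Second, func() error {
		req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
		if err != nil {
			return err
		}
		req.Header.Set("Content-Type", "application/json")
		resp, err := client.Do(req)
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		if resp.StatusCode >= 300 {
			return fmt.Errorf("%w: %s", errRejected, resp.Status)
		}
		return nil
	})
}
```

A training loop would call `reportStatus` from a background goroutine and treat a returned error as "skip this update", so a slow or unavailable control plane never blocks training.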
An example payload is:
```
-POST /apis/trainer.kubeflow.org/v1alpha1/namespaces/{namespace}/trainjobs/{name}/status/trainerStatus HTTP/1.1
+POST /apis/trainer.kubeflow.org/v1alpha1/namespaces/{namespace}/trainjobs/{name}/status HTTP/1.1
```