Commit cdc4f34

feat(docs): update the endpoint spec and update feature gate spec
Signed-off-by: Rob Bell <[email protected]>
1 parent 3249516 commit cdc4f34

File tree

1 file changed: +53 -29 lines changed
  • docs/proposals/2779-trainjob-progress

docs/proposals/2779-trainjob-progress/README.md

Lines changed: 53 additions & 29 deletions
@@ -63,7 +63,7 @@ We propose an approach with the following high-level **push-based** design:
 
 Users can choose not to instrument their runtime, in which case no progress and metrics will be available on the TrainJob. The feature is therefore optional and opt-in.
 
-The feature will have an associated feature gate, defaulting to "enabled". Disabling the gate will disable the http service.
+The feature will have an associated feature gate `TrainJobProgress`, defaulting to "disabled". Disabling the gate will disable the http service.
 
 ### CRD changes
 
@@ -79,6 +79,10 @@ type TrainJobStatus struct {
     // or the job is not instrumented to report its status.
     // +optional
     TrainerStatus *TrainJobTrainerStatus `json:"trainerStatus,omitempty"`
+
+    // Future extension (out of scope):
+    // DataInitializerStatus *TrainJobDataInitializerStatus `json:"dataInitializerStatus,omitempty"`
+    // ModelInitializerStatus *TrainJobModelInitializerStatus `json:"modelInitializerStatus,omitempty"`
 }
 
 
@@ -130,6 +134,8 @@ All fields (apart from lastUpdatedTime) are optional meaning that a runtime need
 
 The design deliberately does not make any changes to the `TrainJobSpec`: the control plane does not require any configuration. Users opt in to the training status by instrumenting their runtime pods to send the training status to the control plane.
 
+The design may be extended in future to add equivalent progress and metric statuses for the model and data initializer components.
+
 ```yaml
 # Sample TrainJob example with TrainerStatus implemented
 
@@ -172,47 +178,65 @@ another-example Complete 100 50m
 
 If the feature gate is enabled, the control plane will expose a new http server endpoint where trainer pods can submit the trainer status. The http server will be added as a new port in the existing `kubeflow-trainer-controller-manager` service.
 
-The endpoint will be `POST: /apis/trainer.kubeflow.org/v1alpha1/namespaces/{namespace}/trainjobs/{name}/status/trainerStatus`, where `{namespace}` and `{name}` are the namespace and name of the TrainJob.
+The endpoint will be `POST: /apis/trainer.kubeflow.org/v1alpha1/namespaces/{namespace}/trainjobs/{name}/status`, where `{namespace}` and `{name}` are the namespace and name of the TrainJob.
+
+The payload for the endpoint will have the following schema:
+```go
+type ProgressStatus struct {
+    // trainerStatus provides a summary of the status of the training
+    // part of the TrainJob.
+    // Empty if the status is unknown, e.g. the job has just started
+    // or the job is not instrumented to report its status.
+    // +optional
+    TrainerStatus *TrainJobTrainerStatus `json:"trainerStatus,omitempty"`
+
+    // Future extension (out of scope):
+    // DataInitializerStatus *TrainJobDataInitializerStatus `json:"dataInitializerStatus,omitempty"`
+    // ModelInitializerStatus *TrainJobModelInitializerStatus `json:"modelInitializerStatus,omitempty"`
+}
+```
 
-The payload will be the same as `TrainJobTrainerStatus`.
+The schema uses a nested structure for future extensibility (e.g. the same endpoint could be used to receive progress updates from a data initializer or model initializer).
 
 On receiving requests to this endpoint, the control plane will validate the source of the request (see [Security considerations](#security-considerations)) and then directly update the `status.trainerStatus` field.
 
 The control plane does not need to be highly available: the runtime can retry the status update request with some delay whilst continuing the training, or skip the update entirely.
 
 An example payload is:
 ```
-POST /apis/trainer.kubeflow.org/v1alpha1/namespaces/{namespace}/trainjobs/{name}/status/trainerStatus HTTP/1.1
+POST /apis/trainer.kubeflow.org/v1alpha1/namespaces/{namespace}/trainjobs/{name}/status HTTP/1.1
 Host: kubeflow-trainer-controller-manager.kubeflow:8082
 Authorization: Bearer {jwt}
 Content-Type: application/json
 
 {
-  "progressPercentage": 45,
-  "estimatedRemainingSeconds": 795649,
-  "metrics": [
-    {
-      "name": "loss",
-      "value": "0.2347"
-    },
-    {
-      "name": "eval_loss",
-      "value": "0.2451"
-    },
-    {
-      "name": "accuracy",
-      "value": "0.9876"
-    },
-    {
-      "name": "currentEpoch",
-      "value": "2"
-    },
-    {
-      "name": "totalEpochs",
-      "value": "5"
-    }
-  ],
-  "lastUpdatedTime": "2025-01-23T10:30:45Z"
+  "trainerStatus": {
+    "progressPercentage": 45,
+    "estimatedRemainingSeconds": 795649,
+    "metrics": [
+      {
+        "name": "loss",
+        "value": "0.2347"
+      },
+      {
+        "name": "eval_loss",
+        "value": "0.2451"
+      },
+      {
+        "name": "accuracy",
+        "value": "0.9876"
+      },
+      {
+        "name": "currentEpoch",
+        "value": "2"
+      },
+      {
+        "name": "totalEpochs",
+        "value": "5"
+      }
+    ],
+    "lastUpdatedTime": "2025-01-23T10:30:45Z"
+  }
 }
 ```
 

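As a rough illustration of how an instrumented runtime might use the updated endpoint, the Go sketch below builds the nested `ProgressStatus` payload and posts it with a small retry loop, skipping the update if every attempt fails. The client-side types (with Go field types inferred from the example JSON), the helper names `postStatus` and `sendWithRetry`, the retry count and delay, the token value, and the `default`/`example` namespace and name are all assumptions made for illustration and are not part of the proposal. An `httptest` server stands in for the control plane so that the example runs on its own.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// Metric mirrors one entry of the "metrics" array in the example payload.
type Metric struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

// TrainerStatus is a hypothetical client-side mirror of TrainJobTrainerStatus.
type TrainerStatus struct {
	ProgressPercentage        *int32   `json:"progressPercentage,omitempty"`
	EstimatedRemainingSeconds *int64   `json:"estimatedRemainingSeconds,omitempty"`
	Metrics                   []Metric `json:"metrics,omitempty"`
	LastUpdatedTime           string   `json:"lastUpdatedTime"`
}

// ProgressStatus is the nested request body described in the proposal.
type ProgressStatus struct {
	TrainerStatus *TrainerStatus `json:"trainerStatus,omitempty"`
}

// errNoAttempts is returned when sendWithRetry is asked to make zero attempts.
var errNoAttempts = errors.New("no attempts made")

// postStatus sends one status update and treats any non-2xx response as an error.
func postStatus(client *http.Client, url, token string, status ProgressStatus) error {
	body, err := json.Marshal(status)
	if err != nil {
		return err
	}
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("unexpected response: %s", resp.Status)
	}
	return nil
}

// sendWithRetry calls send up to attempts times, pausing for delay between failed attempts.
func sendWithRetry(attempts int, delay time.Duration, send func() error) error {
	err := errNoAttempts
	for i := 0; i < attempts; i++ {
		if err = send(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return err
}

func main() {
	// Stand-in for the control plane: rejects the first request, accepts the rest.
	var calls atomic.Int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		if calls.Add(1) == 1 {
			w.WriteHeader(http.StatusServiceUnavailable)
		}
	}))
	defer srv.Close()
	percent, remaining := int32(45), int64(795649)
	status := ProgressStatus{TrainerStatus: &TrainerStatus{
		ProgressPercentage:        &percent,
		EstimatedRemainingSeconds: &remaining,
		Metrics:                   []Metric{{Name: "loss", Value: "0.2347"}},
		LastUpdatedTime:           time.Now().UTC().Format(time.RFC3339),
	}}
	client := &http.Client{Timeout: 2 * time.Second}
	url := srv.URL + "/apis/trainer.kubeflow.org/v1alpha1/namespaces/default/trainjobs/example/status"
	if err := sendWithRetry(3, 50*time.Millisecond, func() error {
		return postStatus(client, url, "example-token", status)
	}); err != nil {
		fmt.Println("skipping status update:", err) // training would continue regardless
	}
	fmt.Println("requests sent:", calls.Load())
}
```

Because a failed update only produces a log line, the training loop in this sketch is never blocked on the control plane, which is in line with the proposal's statement that the control plane does not need to be highly available.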