Commit 2a29452

Add design doc for the backend components
Signed-off-by: Yihong Wang <yh.wang@ibm.com>
1 parent 9bac6f9 commit 2a29452

backend/README.md

Lines changed: 191 additions & 0 deletions
@@ -0,0 +1,191 @@
1+
2+
# The Backend of LM-Eval-aaS #
3+
4+
The backend of LM-Eval-aaS provides the functionalities to handle the LM-Eval tasks
5+
received from the API server and the details of the APIs can be found [here](../api/OpenAPI.yaml).
6+
Currently, the backend can be deployed on an OpenShift/Kubernetes cluster. Its key components are:
7+
- CustomResourceDefinition: Kind: `LMEvalJob`, Group: `foundation-model-stack.github.com.github.com`, Version: `v1beta1`
8+
This CRD carries the parameters of `submit_job` API and the status fields that are used by
9+
the controller to populate the job status and results.
10+
- Controller: The controller reconciles `LMEvalJob` custom resources, creates corresponding Pods to run the lm-eval
11+
tasks, collects results when lm-eval jobs finish, and cancels the jobs when a `cancel_job` request is received. The
12+
controller also registers the admission webhooks of the `LMEvalJob` as the validator. The controller also serves
13+
a gRPC API for updating the `LMEvalJob`'s status.
14+
- Driver: A lightweight program to wrap the `lm-eval + unitxt`, run the lm-eval program, collect outputs and results,
15+
and update `LMEvalJob` status via the gRPC API in the controller. When the controller creates a pod to run the
16+
`LMEvalJob`, an init container is used to copy the driver binary into the main container. In the main container,
17+
the `Command` is the driver, and the original job's command is converted into the `Args`.
18+
19+
## High-Level Architecture ##
20+
```mermaid
21+
---
22+
title: High-Level Architecture Diagram
23+
---
24+
flowchart RL
25+
A((fa:fa-user Client))
26+
classDef client fill:#9900ff,stroke:#9900ff,stroke-width:2px
27+
A:::client --> |LM-Eval Requests| OpenShift
28+
OpenShift --> |Response| A
29+
subgraph OpenShift
30+
direction TB
31+
subgraph ingress
32+
direction LR
33+
B[Load Balancer]
34+
classDef ocingress fill:#cc6600,stroke:#ff9900,stroke-width:2px
35+
end
36+
B:::ocingress <--> C
37+
subgraph LM-Eval-aaS
38+
direction RL
39+
subgraph Deployments
40+
direction LR
41+
C[[API Server]]
42+
D[Controller]
43+
classDef deploy fill:#0033cc,stroke:#0066cc,stroke-width:2px
44+
end
45+
subgraph Pods
46+
G1[job1]
47+
G2[job2]
48+
G3[job3]
49+
classDef pod fill:#990000,stroke:#990000,stroke-width:2px
50+
end
51+
D --> |Create/Delete pod| G1:::pod & G2:::pod & G3:::pod
52+
end
53+
subgraph Control-Plane
54+
E[(etcd)]
55+
F([kube-apiserver])
56+
classDef control fill:#339966,stroke:#669999,stroke-width:2px
57+
end
58+
D:::deploy <--> |reconcile LMEvalJob| F:::control
59+
C:::deploy <--> |Create/Get/Update LMEvalJob| F
60+
G1 & G2 & G3 --> |Collect results and update LMEvalJob| D
61+
F <--> E:::control
62+
end
63+
```
64+
65+
## State Transition of a LMEvalJob
66+
67+
```mermaid
68+
---
69+
title: State Transition of a LMEvalJob
70+
---
71+
stateDiagram-v2
72+
[*] --> New
73+
New --> Scheduled : Prepare resources and create a pod to run the job
74+
Scheduled --> Running : Get update from the driver
75+
Running --> Complete : Collect results
76+
Scheduled --> Failed : Time-out or fail to initialize the pod
77+
Running --> Failed : Program error or time-out
78+
Failed --> Complete : Collect logs
79+
Complete --> [*]
80+
81+
```
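
The transitions in the diagram can also be written as a lookup table. The following is a minimal sketch that mirrors only the arrows drawn above; the `State` type, the constant names, and `canTransition` are hypothetical names chosen for illustration and are not taken from the project's code.

```go
package main

import "fmt"

// State is a hypothetical type for the states shown in the diagram.
type State string

const (
	New       State = "New"
	Scheduled State = "Scheduled"
	Running   State = "Running"
	Failed    State = "Failed"
	Complete  State = "Complete"
)

// transitions lists, for each state, the states the diagram allows next.
var transitions = map[State][]State{
	New:       {Scheduled},
	Scheduled: {Running, Failed},
	Running:   {Complete, Failed},
	Failed:    {Complete},
}

// canTransition reports whether the diagram has an arrow from one state to another.
func canTransition(from, to State) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(New, Scheduled))
	fmt.Println(canTransition(New, Running))
}
```

A table like this keeps the allowed moves in one place, so a reconcile loop could reject an unexpected update instead of applying it silently.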
82+
83+
## Design
84+
85+
### Custom Resource Definition: LMEvalJob
86+
87+
Since the LM-Eval-aaS is a wrapper of the `lm-evaluation-harness + unitxt`, most of the data fields of the `LMEvalJob`
88+
CRD can be mapped to the arguments of the lm-evaluation-harness. The [data struct](../api/v1beta1/evaljob_types.go) for
89+
the LMEvalJob contains the following fields:
90+
91+
| LMEvalJob | Data Type | Optional | Parameter in lm-evaluation-harness | Description |
92+
| --- | --- | --- | --- | -- |
93+
| Model | string | | --model | Model type or model provider |
94+
| ModelArgs | [][Arg](../api/v1beta1/evaljob_types.go#L57-L60) | X | --model_args | Parameters to the selected model type or model provider. The data is converted to a string in this format and passed to lm-evaluation-harness: `arg1=val1,arg2=val2` |
95+
| Tasks | []string | | --tasks | Specify the tasks or task groups to evaluate |
96+
| NumFewShot | int | X | --num_fewshot | Sets the number of few-shot examples to place in context |
97+
| Limit | string | X | --limit | Limit the number of documents to evaluate. Use integer string to specify an explicit number or a float between 0.0 and 1.0 in the string format for a specific portion |
98+
| LogSamples | boolean | X | --log_samples | If this flag is passed, then the model's outputs, and the text fed into the model, will be saved at per-document granularity. |
99+
100+
101+
The `status` subresource of the `LMEvalJob` custom resources contains the following information:
102+
- `PodName`: the controller uses this field to store the name of the Pod that runs the lm-eval job.
103+
- `State`: records the lm-eval job's status in this field. Possible values are:
104+
- `New`: means the lm-eval job is created and not processed by the controller yet
105+
- `Scheduled`: means a Pod is created by the controller for the job
106+
- `Running`: the driver in the Pod reports the job is running.
107+
- `Complete`: the job finishes or fails and the driver reports the job is complete
108+
- `Canceled`: means the job cancellation has been initiated; the controller will cancel the job
109+
and change to Complete state when the job is canceled
110+
- `Reason`: the information about the current state:
111+
- `NoReason`: No information about the current state
112+
- `Succeeded`: The job finished successfully
113+
- `Failed`: The job failed
114+
- `Cancelled`: the job is canceled
115+
- `Message`: more details about the final state
116+
- `LastScheduleTime`: the time the job's Pod is scheduled
117+
- `CompleteTime`: the time the job's state becomes `Complete`
118+
- `Results`: stores the lm-eval job's results. Because etcd currently limits the size of stored objects,
119+
the results JSON, together with the CR's other fields, must stay under that limit. We may move
120+
the results to another data store in the future.
121+
122+
### The Controller
123+
124+
The controller is responsible for monitoring the `LMEvalJob` CRs and reconciling the corresponding resources -
125+
the Pods in the current design. If a more complex/flexible job scheduling is needed, the controller will watch
126+
other resources instead. The skeleton of the controller is generated by the [kubebuilder](https://book.kubebuilder.io/).
127+
To reduce the reconciliations triggered by the `LMEvalJob` CRs and Pods, the controller doesn't register for the
128+
`Deletion` events of the `LMEvalJob` CRs and only monitors the `Deletion` events of the corresponding Pods.
129+
Here are the details of how the controller handles an `LMEvalJob` CR:
130+
131+
- Admission Webhooks: The controller implements the admission webhooks for the `LMEvalJob` specifically for
132+
validation. Currently, it only validates the `Limit` field, which should be either an integer or a float string.
133+
- ConfigMap: The controller uses a ConfigMap for its settings, including:
134+
- driver-image: This is used in the init container which contains the driver binary.
135+
- pod-image: This is the image for the main container of the job's Pod. It contains the
136+
`lm-evaluation-harness + unitxt` Python packages and is used to run the lm-eval jobs.
137+
- pod-checking-interval: The controller checks the scheduled Pods at the fixed interval given by this value.
138+
It uses the `time.Duration` [format](https://pkg.go.dev/time#ParseDuration). The default value is `10s`.
139+
- image-pull-policy: This is used for the ImagePullPolicy of the Pod. The Pods created by the controller
140+
use this config value as the ImagePullPolicy. The default value is `Always`.
141+
- Arguments: The controller supports the following command line arguments:
142+
- `--namespace`: Where you deploy the controller, by default the namespace of the controller deployment
143+
is used
144+
- `--configmap`: Specify the ConfigMap's name that stores the config settings
145+
- kubebuilder's built-in arguments: `--metrics-bind-address`, `--health-probe-bind-address`, `--leader-elect`
146+
, `--metrics-secure`, and `--enable-http2`
147+
- Finalizer: The controller adds itself as one of the `LMEvalJob`'s finalizers, using
148+
`lm-eval-job.foundation-model-stack.github.com.github.com/finalizer`. This makes sure the controller
149+
reconciles the LMEvalJob CRs before deletion.
150+
- Workflow: The normal flow of an `LMEvalJob` CR is:
151+
- New: Update CR's finalizer and insert the controller's finalizer ID.
152+
- New (Reconcile for the previous update of the finalizer): prepare and create a Pod for the job while
153+
recording the time and Pod name in the `LMEvalJob` CR, and transitioning to the `Scheduled` state.
154+
The Pod contains the OwnerReference pointing back to the LMEvalJob CR as well.
155+
- Scheduled: Periodically check the Pod and transition the state to `Complete` if the Pod fails to start, and
156+
store the error message in the status's `Message` field.
157+
158+
TODO: Need a timeout mechanism here to stop the check and mark the job as failed.
159+
160+
- Running: Similar to the `Scheduled` state, check the Pod's status to see if the job fails or not.
161+
- Complete: Records the time into the status
162+
- Canceled: Receive the cancel request and delete the Pod for the LMEvalJob, then transition to the
163+
`Complete` state when the Pod is deleted.
164+
165+
The workflow on the controller side is simple because some of the work is offloaded to the driver.
166+
Let's get to the driver and complete the whole picture.
167+
168+
### The Driver
169+
170+
The driver is a lightweight program that wraps the `lm-evaluation-harness + unitxt` and actively updates
171+
job statuses through the gRPC API the controller provides, so the controller doesn't have to keep monitoring
172+
the Pods and reconciling on every Pod change. Here is how the driver plays
173+
the role in the LMEvalJob workflow:
174+
175+
- Scheduled: In this state, a Pod has been created for the job, the driver binary is copied to the main container,
176+
and is launched to run the lm-eval job. Once the driver is ready to spawn a sub-process to run the
177+
lm-eval job, it transitions the job to the Running state. Otherwise, it marks the job as Complete with
178+
failure information.
179+
- Running: Once the job is done, the driver collects the results, invokes the gRPC API to update the job's status and result,
180+
and updates its status to the Complete state.
181+
182+
183+
## Code Structure
184+
185+
- [api](../api): contains the REST API definitions and the Go package for the `LMEvalJob`'s data struct, group, and kind information
186+
- [backend](../backend/): contains the controller and driver's implementation
187+
- [controller](../backend/controller/): the controller's code
188+
- [driver](../backend/driver/): the driver's code
189+
- [cmd](../cmd/): main programs for the controller and driver
190+
- [config](../config/): manifests for the controller's deployment
191+
- [docker](../docker/): Dockerfile for building controller, driver, and `lm-eval + unitxt` images