| 1 | + |
| 2 | +# The Backend of LM-Eval-aaS # |
| 3 | + |
| 4 | +The backend of LM-Eval-aaS handles the LM-Eval tasks received from the API server. The API details |
| 5 | +are described in the [OpenAPI specification](../api/OpenAPI.yaml). |
| 6 | +Currently, the backend can be deployed on an OpenShift/Kubernetes cluster. Its key components are: |
| 7 | +- CustomResourceDefinition: Kind: `LMEvalJob`, Group: `foundation-model-stack.github.com.github.com`, Version: `v1beta1` |
| 8 | + This CRD carries the parameters of the `submit_job` API and the status fields that |
| 9 | + the controller uses to record the job status and results. |
| 10 | +- Controller: The controller reconciles `LMEvalJob` custom resources, creates corresponding Pods to run the lm-eval |
| 11 | + tasks, collects results when lm-eval jobs finish, and cancels the jobs when a `cancel_job` request is received. The |
| 12 | + controller also registers a validating admission webhook for `LMEvalJob` and serves a |
| 13 | + gRPC API for updating the `LMEvalJob` status. |
| 14 | +- Driver: A lightweight program that wraps `lm-eval + unitxt`: it runs the lm-eval program, collects outputs and results, |
| 15 | + and updates the `LMEvalJob` status via the controller's gRPC API. When the controller creates a Pod to run the |
| 16 | + `LMEvalJob`, an init container copies the driver binary into the main container. In the main container, |
| 17 | + the driver becomes the `Command`, and the original job's command is passed as the `Args`. |
| 18 | + |
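The command wrapping described for the driver can be sketched in Go as follows. This is a minimal illustration, not the controller's code: the `Container` struct is a simplified, hypothetical stand-in for the Kubernetes container spec, and `WrapWithDriver`, the image name, and the driver path are all names assumed for this example.

```go
package main

import "fmt"

// Container is a simplified, hypothetical stand-in for a Kubernetes container spec.
type Container struct {
	Name    string
	Image   string
	Command []string
	Args    []string
}

// WrapWithDriver returns a copy of c whose Command is the driver binary and
// whose Args carry the original command followed by the original args.
func WrapWithDriver(c Container, driverPath string) Container {
	wrapped := c
	wrapped.Command = []string{driverPath}
	// Build a fresh slice so the original container's slices are not modified.
	wrapped.Args = append(append([]string{}, c.Command...), c.Args...)
	return wrapped
}

func main() {
	// Illustrative values only; the image and paths are assumptions.
	job := Container{
		Name:    "main",
		Image:   "example/lm-eval:latest",
		Command: []string{"python", "-m", "lm_eval"},
		Args:    []string{"--model", "hf"},
	}
	wrapped := WrapWithDriver(job, "/opt/driver/driver")
	fmt.Println("command:", wrapped.Command)
	fmt.Println("args:", wrapped.Args)
}
```

Keeping the original command intact inside `Args` lets the driver launch it unchanged as a sub-process.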
| 19 | +## High-Level Architecture ## |
| 20 | +```mermaid |
| 21 | +--- |
| 22 | +title: High-Level Architecture Diagram |
| 23 | +--- |
| 24 | +flowchart RL |
| 25 | + A((fa:fa-user Client)) |
| 26 | + classDef client fill:#9900ff,stroke:#9900ff,stroke-width:2px |
| 27 | + A:::client --> |LM-Eval Requests| OpenShift |
| 28 | + OpenShift --> |Response| A |
| 29 | + subgraph OpenShift |
| 30 | + direction TB |
| 31 | + subgraph ingress |
| 32 | + direction LR |
| 33 | + B[Load Balancer] |
| 34 | + classDef ocingress fill:#cc6600,stroke:#ff9900,stroke-width:2px |
| 35 | + end |
| 36 | + B:::ocingress <--> C |
| 37 | + subgraph LM-Eval-aaS |
| 38 | + direction RL |
| 39 | + subgraph Deployments |
| 40 | + direction LR |
| 41 | + C[[API Server]] |
| 42 | + D[Controller] |
| 43 | + classDef deploy fill:#0033cc,stroke:#0066cc,stroke-width:2px |
| 44 | + end |
| 45 | + subgraph Pods |
| 46 | + G1[job1] |
| 47 | + G2[job2] |
| 48 | + G3[job3] |
| 49 | + classDef pod fill:#990000,stroke:#990000,stroke-width:2px |
| 50 | + end |
| 51 | + D --> |Create/Delete pod| G1:::pod & G2:::pod & G3:::pod |
| 52 | + end |
| 53 | + subgraph Control-Plane |
| 54 | + E[(etcd)] |
| 55 | + F([kube-apiserver]) |
| 56 | + classDef control fill:#339966,stroke:#669999,stroke-width:2px |
| 57 | + end |
| 58 | + D:::deploy <--> |reconcile LMEvalJob| F:::control |
| 59 | + C:::deploy <--> |Create/Get/Update LMEvalJob| F |
| 60 | + G1 & G2 & G3 --> |Collect results and update LMEvalJob| D |
| 61 | + F <--> E:::control |
| 62 | + end |
| 63 | +``` |
| 64 | + |
| 65 | +## State Transition of an LMEvalJob |
| 66 | + |
| 67 | +```mermaid |
| 68 | +--- |
| 69 | +title: State Transition of an LMEvalJob |
| 70 | +--- |
| 71 | +stateDiagram-v2 |
| 72 | + [*] --> New |
| 73 | + New --> Scheduled : Prepare resources and create a pod to run the job |
| 74 | + Scheduled --> Running : Get update from the driver |
| 75 | + Running --> Complete : Collect results |
| 76 | + Scheduled --> Failed : Time-out or fail to initialize the pod |
| 77 | + Running --> Failed : Program error or time-out |
| 78 | + Failed --> Complete : Collect logs |
| 79 | + Complete --> [*] |
| 80 | + |
| 81 | +``` |
| 82 | + |
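The edges of the state diagram can be encoded as a small lookup table. The Go sketch below is illustrative only: the `JobState` type, the constant names, and the `CanTransition` helper are assumptions made for this example and are not taken from the controller's code.

```go
package main

import "fmt"

// JobState is a hypothetical type mirroring the states in the diagram.
type JobState string

const (
	StateNew       JobState = "New"
	StateScheduled JobState = "Scheduled"
	StateRunning   JobState = "Running"
	StateFailed    JobState = "Failed"
	StateComplete  JobState = "Complete"
)

// transitions lists, for each state, the states the diagram allows next.
var transitions = map[JobState][]JobState{
	StateNew:       {StateScheduled},
	StateScheduled: {StateRunning, StateFailed},
	StateRunning:   {StateComplete, StateFailed},
	StateFailed:    {StateComplete},
}

// CanTransition reports whether the diagram has an edge from one state to another.
func CanTransition(from, to JobState) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println("New -> Scheduled:", CanTransition(StateNew, StateScheduled))
	fmt.Println("New -> Running:", CanTransition(StateNew, StateRunning))
}
```

A table like this makes it easy to reject an update that skips a state, for example a status report that tries to move a job from `New` straight to `Running`.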
| 83 | +## Design |
| 84 | + |
| 85 | +### Custom Resource Definition: LMEvalJob |
| 86 | + |
| 87 | +Since LM-Eval-aaS is a wrapper around `lm-evaluation-harness + unitxt`, most of the data fields of the `LMEvalJob` |
| 88 | +CRD map directly to lm-evaluation-harness arguments. The [data struct](../api/v1beta1/evaljob_types.go) for |
| 89 | +the `LMEvalJob` contains the following fields: |
| 90 | + |
| 91 | +| LMEvalJob Field | Data Type | Optional | Parameter in lm-evaluation-harness | Description | |
| 92 | +| --- | --- | --- | --- | --- | |
| 93 | +| Model | string | | --model | Model type or model provider | |
| 94 | +| ModelArgs | [][Arg](../api/v1beta1/evaljob_types.go#L57-L60) | X | --model_args | Parameters for the selected model type or model provider. The data is converted to a string in the format `arg1=val1,arg2=val2` and passed to lm-evaluation-harness | |
| 95 | +| Tasks | []string | | --tasks | The tasks or task groups to evaluate | |
| 96 | +| NumFewShot | int | X | --num_fewshot | The number of few-shot examples to place in context | |
| 97 | +| Limit | string | X | --limit | Limits the number of documents to evaluate. Use an integer string for an explicit count, or a float string between 0.0 and 1.0 for a fraction of the documents | |
| 98 | +| LogSamples | boolean | X | --log_samples | If set, the model's outputs and the text fed into the model are saved at per-document granularity | |
| 99 | + |
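To make the mapping in the table concrete, here is a minimal Go sketch that turns a job spec into lm-evaluation-harness command-line arguments. It is not the project's implementation: `JobSpec` is a simplified, hypothetical view of the fields above, the `Name`/`Value` fields of `Arg` are assumed (the real struct is defined in `evaljob_types.go`), and `BuildArgs` is a name invented for this example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Arg is a hypothetical name/value pair; the field names are assumptions.
type Arg struct{ Name, Value string }

// JobSpec is a simplified, hypothetical view of the LMEvalJob fields in the table.
type JobSpec struct {
	Model      string
	ModelArgs  []Arg
	Tasks      []string
	NumFewShot *int   // nil means "not set"
	Limit      string // empty means "not set"
	LogSamples bool
}

// BuildArgs maps each field to the lm-evaluation-harness parameter listed in the table.
func BuildArgs(s JobSpec) []string {
	args := []string{"--model", s.Model}
	if len(s.ModelArgs) > 0 {
		pairs := make([]string, 0, len(s.ModelArgs))
		for _, a := range s.ModelArgs {
			pairs = append(pairs, a.Name+"="+a.Value)
		}
		// Rendered as arg1=val1,arg2=val2
		args = append(args, "--model_args", strings.Join(pairs, ","))
	}
	args = append(args, "--tasks", strings.Join(s.Tasks, ","))
	if s.NumFewShot != nil {
		args = append(args, "--num_fewshot", strconv.Itoa(*s.NumFewShot))
	}
	if s.Limit != "" {
		args = append(args, "--limit", s.Limit)
	}
	if s.LogSamples {
		args = append(args, "--log_samples")
	}
	return args
}

func main() {
	shots := 5
	spec := JobSpec{
		Model:      "hf",
		ModelArgs:  []Arg{{"arg1", "val1"}, {"arg2", "val2"}},
		Tasks:      []string{"task1", "task2"},
		NumFewShot: &shots,
		Limit:      "0.5",
	}
	fmt.Println(strings.Join(BuildArgs(spec), " "))
}
```

Optional fields are emitted only when set, which is why `NumFewShot` is modeled as a pointer in this sketch: it distinguishes "zero shots" from "not specified".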
| 100 | + |
| 101 | +The `status` subresource of an `LMEvalJob` custom resource contains the following information: |
| 102 | +- `PodName`: the name of the Pod that runs the lm-eval job, stored by the controller. |
| 103 | +- `State`: the current state of the lm-eval job. Possible values are: |
| 104 | + - `New`: the lm-eval job has been created but not yet processed by the controller |
| 105 | + - `Scheduled`: the controller has created a Pod for the job |
| 106 | + - `Running`: the driver in the Pod reports that the job is running |
| 107 | + - `Complete`: the job has finished or failed, and the driver reports that the job is complete |
| 108 | + - `Canceled`: job cancellation has been initiated; the controller cancels the job |
| 109 | + and changes the state to `Complete` once the job is canceled |
| 110 | +- `Reason`: additional information about the current state: |
| 111 | + - `NoReason`: no information about the current state |
| 112 | + - `Succeeded`: the job finished successfully |
| 113 | + - `Failed`: the job failed |
| 114 | + - `Cancelled`: the job was canceled |
| 115 | +- `Message`: more details about the final state |
| 116 | +- `LastScheduleTime`: the time the job's Pod is scheduled |
| 117 | +- `CompleteTime`: the time the job's state becomes `Complete` |
| 118 | +- `Results`: the lm-eval job's results. Because etcd limits the size of stored objects, |
| 119 | + the results JSON, together with the CR's other fields, must stay within that limit. We may move |
| 120 | + the results to another data store in the future. |
| 121 | + |
| 122 | +### The Controller |
| 123 | + |
| 124 | +The controller is responsible for monitoring the `LMEvalJob` CRs and reconciling the corresponding resources, |
| 125 | +which are Pods in the current design. If more complex or flexible job scheduling is needed, the controller will watch |
| 126 | +other resources instead. The skeleton of the controller is generated by [kubebuilder](https://book.kubebuilder.io/). |
| 127 | +To avoid unnecessary reconciliations triggered by the `LMEvalJob` CRs and Pods, the controller doesn't register for the |
| 128 | +`Deletion` events of the `LMEvalJob` CRs and only monitors the `Deletion` events of the corresponding Pods. |
| 129 | +Here are the details of how the controller handles an `LMEvalJob` CR: |
| 130 | + |
| 131 | +- Admission Webhooks: The controller implements a validating admission webhook for `LMEvalJob`. |
| 132 | + Currently, it only validates the `Limit` field, which must be either an integer or a float string. |
| 133 | +- ConfigMap: The controller uses a ConfigMap for its settings, including: |
| 134 | + - driver-image: The image used by the init container; it contains the driver binary. |
| 135 | + - pod-image: This is the image for the main container of the job's Pod. It contains the |
| 136 | + `lm-evaluation-harness + unitxt` Python packages and is used to run the lm-eval jobs. |
| 137 | + - pod-checking-interval: The controller checks the scheduled Pods at the fixed interval given by this value. |
| 138 | + It uses the `time.Duration` [format](https://pkg.go.dev/time#ParseDuration). The default value is `10s`. |
| 139 | + - image-pull-policy: The Pods created by the controller use this value as their ImagePullPolicy. |
| 140 | + The default value is `Always`. |
| 141 | +- Arguments: The controller supports the following command line arguments: |
| 142 | + - `--namespace`: The namespace where the controller is deployed. By default, the namespace of the controller |
| 143 | + deployment is used. |
| 144 | + - `--configmap`: The name of the ConfigMap that stores the config settings. |
| 145 | + - kubebuilder's built-in arguments: `--metrics-bind-address`, `--health-probe-bind-address`, `--leader-elect`, |
| 146 | + `--metrics-secure`, and `--enable-http2` |
| 147 | +- Finalizer: The controller adds itself as one of the `LMEvalJob`'s finalizers, using |
| 148 | + `lm-eval-job.foundation-model-stack.github.com.github.com/finalizer`. This ensures that the controller |
| 149 | + reconciles the `LMEvalJob` CRs before deletion. |
| 150 | +- Workflow: The normal flow of an `LMEvalJob` CR is: |
| 151 | + - New: Update the CR's finalizers to include the controller's finalizer ID. |
| 152 | + - New (reconcile triggered by the previous finalizer update): Prepare and create a Pod for the job, |
| 153 | + record the time and Pod name in the `LMEvalJob` CR, and transition to the `Scheduled` state. |
| 154 | + The Pod also contains an OwnerReference pointing back to the `LMEvalJob` CR. |
| 155 | + - Scheduled: Periodically check the Pod. If the Pod fails to start, transition to the `Complete` state and |
| 156 | + store the error message in the status's `Message` field. |
| 157 | + |
| 158 | + TODO: Need a timeout mechanism here to stop the check and mark the job as failed. |
| 159 | + |
| 160 | + - Running: Similar to the `Scheduled` state, check the Pod's status to see whether the job has failed. |
| 161 | + - Complete: Record the completion time in the status. |
| 162 | + - Canceled: On receiving a cancel request, delete the Pod for the `LMEvalJob`, then transition to the |
| 163 | + `Complete` state once the Pod is deleted. |
| 164 | + |
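The `Limit` check performed by the admission webhook can be approximated with the standard `strconv` package. The sketch below is an assumption about how such a check might look, not the webhook's actual code: `ValidateLimit` is a hypothetical helper, and it only tests whether the string parses as an integer or a float, without enforcing any range.

```go
package main

import (
	"fmt"
	"strconv"
)

// ValidateLimit accepts an empty value (the field is optional), an integer
// string, or a float string, and rejects everything else.
func ValidateLimit(limit string) error {
	if limit == "" {
		return nil
	}
	if _, err := strconv.ParseInt(limit, 10, 64); err == nil {
		return nil
	}
	if _, err := strconv.ParseFloat(limit, 64); err == nil {
		return nil
	}
	return fmt.Errorf("limit %q must be an integer or a float string", limit)
}

func main() {
	for _, v := range []string{"", "100", "0.25", "ten"} {
		if err := ValidateLimit(v); err != nil {
			fmt.Printf("%q rejected: %v\n", v, err)
		} else {
			fmt.Printf("%q accepted\n", v)
		}
	}
}
```

Note that `strconv.ParseFloat` also accepts forms such as `1e3` or `NaN`, so a stricter validator would need additional checks on top of this sketch.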
| 165 | +The workflow on the controller side is simple because part of the work is offloaded to the driver. |
| 166 | +Let's get to the driver and complete the whole picture. |
| 167 | + |
| 168 | +### The Driver |
| 169 | + |
| 170 | +The driver is a lightweight program that wraps `lm-evaluation-harness + unitxt` and actively updates |
| 171 | +the job status through the gRPC API that the controller provides. As a result, the controller doesn't have to keep |
| 172 | +monitoring the Pods and reconciling on every Pod change. Here is the role the driver plays |
| 173 | +in the `LMEvalJob` workflow: |
| 174 | + |
| 175 | +- Scheduled: In this state, a Pod has been created for the job, and the driver binary has been copied to the main container |
| 176 | + and launched to run the lm-eval job. Once the driver is ready to spawn a sub-process to run the |
| 177 | + lm-eval job, it transitions the job to the `Running` state. Otherwise, it marks the job as `Complete` with |
| 178 | + failure information. |
| 179 | +- Running: Once the job is done, the driver collects the results, invokes the gRPC API to update the job's status and results, |
| 180 | + and transitions the job to the `Complete` state. |
| 181 | + |
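The driver's part of the workflow can be sketched as a function that reports `Running`, executes the job, and then reports `Complete` with a reason. Everything named here is an assumption for illustration: `StatusReporter` stands in for the gRPC client, `RunJob` and `recorder` are hypothetical, and the job is an injected function where the real driver would spawn the lm-eval sub-process.

```go
package main

import (
	"errors"
	"fmt"
)

// StatusReporter is a hypothetical stand-in for the controller's gRPC client.
type StatusReporter interface {
	Report(state, reason, message string)
}

// RunJob reports Running before the job starts and Complete when it ends,
// attaching Succeeded or Failed as the reason.
func RunJob(r StatusReporter, job func() error) {
	r.Report("Running", "NoReason", "")
	if err := job(); err != nil {
		r.Report("Complete", "Failed", err.Error())
		return
	}
	r.Report("Complete", "Succeeded", "")
}

// recorder keeps updates in memory instead of sending them to the controller.
type recorder struct{ updates []string }

func (rec *recorder) Report(state, reason, message string) {
	rec.updates = append(rec.updates, state+"/"+reason)
}

// failingJob simulates an lm-eval run that exits with an error.
func failingJob() error { return errors.New("lm-eval exited with an error") }

func main() {
	rec := &recorder{}
	RunJob(rec, func() error { return nil }) // simulated successful run
	RunJob(rec, failingJob)                  // simulated failed run
	for _, u := range rec.updates {
		fmt.Println(u)
	}
}
```

Pushing status from the driver in this way is what lets the controller avoid reconciling on every Pod change.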
| 182 | + |
| 183 | +## Code Structure |
| 184 | + |
| 185 | +- [api](../api): contains the REST API definitions and the Go package for the `LMEvalJob`'s data struct, group, and kind information |
| 186 | +- [backend](../backend/): contains the controller and driver implementations |
| 187 | + - [controller](../backend/controller/): the controller's code |
| 188 | + - [driver](../backend/driver/): the driver's code |
| 189 | +- [cmd](../cmd/): main programs for the controller and driver |
| 190 | +- [config](../config/): manifests for the controller's deployment |
| 191 | +- [docker](../docker/): Dockerfile for building controller, driver, and `lm-eval + unitxt` images |