Commit db7461e

Merge pull request #5 from compspec/add-ml-sidecar
feat: adding mlserver for selection
2 parents c4eb8d0 + 72816ab commit db7461e

31 files changed: +3991 -173 lines changed

Makefile

Lines changed: 18 additions & 1 deletion
@@ -16,16 +16,33 @@ build:
 	$(BUILD_CONTEXT)
 	@echo "Docker image $(FULL_IMAGE_NAME) built successfully."

+# The mlserver
+mlserver:
+	make -C ./mlserver
+
+# Kind setup - we want to build both and push
+kind: mlserver mlserver-push build push
+
+
 # Push the docker image
 push:
 	@echo "Pushing image $(FULL_IMAGE_NAME)..."
 	docker push $(FULL_IMAGE_NAME)

+mlserver-push:
+	@echo "Pushing image ghcr.io/converged-computing/aws-performance-study:model-server..."
+	docker push ghcr.io/converged-computing/aws-performance-study:model-server
+
 # Install the webhook
 install:
 	@echo "Installing $(FULL_IMAGE_NAME)..."
 	kubectl apply -f ./deploy/webhook.yaml

+# Install the webhook with the ML server
+install-mlserver:
+	@echo "Installing $(FULL_IMAGE_NAME) with the ML server..."
+	kubectl apply -f ./deploy/webhook-with-mlserver.yaml
+
 # Uninstall the webhook
 uninstall:
 	@echo "Uninstalling $(FULL_IMAGE_NAME)..."
@@ -37,5 +54,5 @@ clean:
 	docker rmi $(FULL_IMAGE_NAME) || true
 	@echo "Docker image $(FULL_IMAGE_NAME) removed (if it existed)."

-.PHONY: all build clean
+.PHONY: all build clean mlserver

README.md

Lines changed: 38 additions & 8 deletions
@@ -1,22 +1,31 @@
-
 # OCIFit

 <p align="center">
 <img src="docs/ocifit-k8s.png" height="500" alt="OCIFit Kubernetes">
 </p>

+This is a Kubernetes controller that can:
+
+- Select a container for a pod based on a compatibility artifact.
+- Select an instance type for a pod based on a machine learning model.
+
+For the latter, a sidecar runs alongside the controller and serves the models. The controller sends the node features to the sidecar, along with a request to use a specific model, to determine the optimal instance type.

+## Details

-This is a Kubernetes controller that will do the following:
+We do the following:

 * Start running in a cluster with NFD, retrieve metadata about nodes in the cluster, and receive updates when nodes are added or removed.
 * Receive pods and check whether they are flagged for image selection.
 * Being flagged means having the label "oci.image.compatibilities.selection/enabled" and (optionally) a node selector.
-* If the cluster is not homogenous, a node selector is required, and should be the instance type that the pod is intended for.
+* If the cluster is not homogeneous, a node selector is required, and should be the instance type that the pod is intended for.
+
+When a pod (or an abstraction that creates pods) is created:
+
 * If enabled, a URI is provided that points to a compatibility artifact.
 * The artifact describes several images (and criteria for checking) that can be used for the Pod.
 * The controller checks known nodes for the instance type against the spec.
-
+* If an ML server model is specified in the artifact, the entire set of node metadata is sent to it.

 ## Notes

@@ -41,6 +50,7 @@ but needs further discussion and thinking.
 | oci.image.compatibilities.selection/target-image | annotation | placeholder:latest | yes | image URI to replace in pod |
 | oci.image.compatibilities.selection/image-ref | annotation | placeholder:latest | no | artifact reference "image" in OCI registry |
 | oci.image.compatibilities.selection/enabled | label | unset | yes | Flag to indicate we want to do compatibility image selection |
+| oci.image.compatibilities.selection/model | annotation | unset | no | If retrieving a model specification, choose this model. |

 Note that if you remove enabled, the webhook won't trigger, so it is required.

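As a rough sketch of how the label and annotations in the table above fit together, the script below writes a Pod manifest that is flagged for selection and names a model. The pod name, the container name, the label value `"true"`, and the model name `example-model` are assumptions for illustration and are not taken from the repository. The artifact reference reuses the `ml-example` tag that this README pushes with `oras`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write a Pod manifest that opts in to selection.
# Assumptions (hypothetical): pod name, container name, label value "true",
# and the model name "example-model".
manifest="$(mktemp)"
cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: ocifit-example
  labels:
    # Required: without this label the webhook does not trigger
    oci.image.compatibilities.selection/enabled: "true"
  annotations:
    # Artifact reference in the OCI registry
    oci.image.compatibilities.selection/image-ref: "ghcr.io/compspec/ocifit-k8s-compatibility:ml-example"
    # Image URI in the pod that the controller replaces
    oci.image.compatibilities.selection/target-image: "placeholder:latest"
    # Optional: which model to use when the artifact is a model specification
    oci.image.compatibilities.selection/model: "example-model"
spec:
  containers:
  - name: app
    image: placeholder:latest
EOF

echo "Wrote manifest to $manifest"
# With a cluster available, it could then be applied with:
#   kubectl apply -f "$manifest"
```

The `kubectl apply` step is left commented out so the script can be run without a cluster. If the cluster is not homogeneous, the pod would also need a node selector for the intended instance type.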
@@ -62,12 +72,21 @@ make push
 kind load docker-image ghcr.io/compspec/ocifit-k8s:latest
 ```

+You'll need the certificate manager.
+
+```bash
+kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.17.2/cert-manager.yaml
+```
+
 And install the deployment manifest (assuming you are sitting in the cloned repository)

 ```bash
 kubectl apply -f deploy/webhook.yaml

-# The same
+# or with the ml server
+kubectl apply -f deploy/webhook-with-mlserver.yaml
+
+# The same (just the webhook)
 make install
 ```

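Because the webhook depends on cert-manager, applying the deployment manifest right after installing cert-manager may fail while cert-manager's own deployments are still starting. Below is a minimal sketch of a wait step, assuming cert-manager was installed into its default `cert-manager` namespace. The function name `wait_for_cert_manager` and the default timeout are hypothetical and are not part of this repository.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: block until all cert-manager deployments report Available.
# Assumes the default "cert-manager" namespace created by the release manifest.
wait_for_cert_manager() {
  local timeout="${1:-120s}"
  if ! command -v kubectl >/dev/null 2>&1; then
    echo "kubectl not found in PATH" >&2
    return 1
  fi
  kubectl wait --namespace cert-manager \
    --for=condition=Available deployment --all \
    --timeout="$timeout"
}

# Example usage (requires a running cluster):
#   wait_for_cert_manager 180s && kubectl apply -f deploy/webhook.yaml
```

The helper returns a non-zero status if `kubectl` is missing or the deployments do not become available in time, so it can gate the install step in a script.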
@@ -94,11 +113,11 @@ And install NFD. This will add node feature discovery labels to each node.
 kubectl apply -k https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.17.3
 ```

-### 2. Add Custom Labels
+### 2. Custom Labels (Optional)

 Since we want to test that NFD is working, we are going to add custom labels. We just want to test and don't need the labels to persist with recreations, so we can just use `kubectl label`. However,
 if we do it the right (persistent) way we would write a configuration file to `/etc/kubernetes/node-feature-discovery/features.d` on the node.
-In our real world use case we would select based on operating system and kernel version. For our test case, we will just use a script that will programaticallly update worker nodes. In this example,
+In our real world use case we would select based on operating system and kernel version. For our test case, we will just use a script that will programmatically update worker nodes. In this example,
 we are just going to add the same label to all nodes and then check our controller based on the image selected. Let's first add "vanilla":

 ```bash
@@ -144,6 +163,10 @@ kubectl logs ocifit-k8s-deployment-68d5bf5865-494mg -f

 ### 4. Test Compatibility

+#### Custom Labels
+
+**Requires generation of custom labels above**
+
 At this point, we want to test compatibility. This step is already done, but I'll show you how I designed the compatibility spec. The logic for this dummy case is the following:

 1. If our custom label "feature.node.ocifit-k8s.flavor" is vanilla, we want to choose a debian container.
@@ -155,6 +178,9 @@ here we are flipping the logic a bit. We don't know the image, and instead we ar

 ```bash
 oras push ghcr.io/compspec/ocifit-k8s-compatibility:kind-example ./example/compatibility-test.json:application/vnd.oci.image.compatibilities.v1+json
+
+# For the ml model spec
+oras push ghcr.io/compspec/ocifit-k8s-compatibility:ml-example ./example/ml-compatibility-artifact.json:application/vnd.oci.image.model-compatibilities.v1+json
 ```

 We aren't going to be using any referrers API or linking this to an image. The target images are in the artifact, and we get there directly from the associated manifest.
@@ -234,6 +260,10 @@ REDHAT_SUPPORT_PRODUCT_VERSION="9.3"

 Boum! Conceptually, we are selecting a different image depending on the rules in the compatibility spec. Our node features were dummy, but they could be real attributes related to kernel, networking, etc.

+#### ML Server Decision
+
+See [these experiments](https://github.com/converged-computing/aws-performance-study/tree/main/experiment/eks/cpu/models) for an example of using models.
+
 ## License

 HPCIC DevTools is distributed under the terms of the MIT license.
@@ -245,4 +275,4 @@ See [LICENSE](https://github.com/converged-computing/cloud-select/blob/main/LICE

 SPDX-License-Identifier: (MIT)

-LLNL-CODE- 842614
+LLNL-CODE- 842614
