Commit 97a98d7

AI inference: demonstrate in-cluster storage of models
This example demonstrates how we can serve models from inside the cluster, without needing to bake them into the container images. We may also in future want to support storing models in GCS or S3, but this example focuses on storing models without cloud dependencies. We may also want to investigate serving models from container images, particularly given the upcoming support for mounting container images as volumes, but this approach works today and allows for more dynamic model loading (e.g. loading new models without restarting pods). Moreover, a container image server is backed by a blob server, as introduced here.
1 parent 75b7b4f commit 97a98d7

File tree: 20 files changed, +1513 −0 lines

AI/modelcloud/README.md

Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
# modelcloud example

## Goal: Production-grade inference on AI-conformant kubernetes clusters

The goal of this example is to build up production-grade inference
on AI-conformant kubernetes clusters.

We (aspirationally) aim to demonstrate the capabilities of the AI-conformance
profile. Where we cannot achieve production-grade inference, we hope to
motivate discussion of extensions to the AI-conformance profile to plug those gaps.
## Walkthrough

### Deploying to a kubernetes cluster

Create a kubernetes cluster. We currently test with GKE and gcr.io, but do not aim
to depend on non-conformant functionality; PRs to add support for deployment
to other conformant environments are very welcome.

1. From the modelcloud directory, run `dev/tools/push-images` to push to `gcr.io/$(gcloud config get project)/...`

1. Run `dev/tools/deploy-to-kube` to deploy.

We deploy two workloads:

1. `blob-server`, a statefulset with a persistent volume to hold the model blobs (files)

1. `gemma3`, a deployment running vLLM, with a frontend go process that will download the model from `blob-server`.

### Uploading a model

For now, we will likely be dealing with models from huggingface.

Begin by cloning the model locally (and note that currently only google/gemma-3-1b-it is supported):

```bash
git clone https://huggingface.co/google/gemma-3-1b-it
```

If you now run `go run ./cmd/model-hasher --src gemma-3-1b-it` you should see it print the hashes for each file:

```yaml
spec:
  files:
  - hash: 1468e03275c15d522c721bd987f03c4ee21b8e246b901803c67448bbc25acf58
    path: .gitattributes
  - hash: 60be259533fe3acba21d109a51673815a4c29aefdb7769862695086fcedbeb7a
    path: README.md
  - hash: 50b2f405ba56a26d4913fd772089992252d7f942123cc0a034d96424221ba946
    path: added_tokens.json
  - hash: 19cb5d28c97778271ba2b3c3df47bf76bdd6706724777a2318b3522230afe91e
    path: config.json
  - hash: fd9324becc53c4be610db39e13a613006f09fd6ef71a95fb6320dc33157490a3
    path: generation_config.json
  - hash: 3d4ef8d71c14db7e448a09ebe891cfb6bf32c57a9b44499ae0d1c098e48516b6
    path: model.safetensors
  - hash: 2f7b0adf4fb469770bb1490e3e35df87b1dc578246c5e7e6fc76ecf33213a397
    path: special_tokens_map.json
  - hash: 4667f2089529e8e7657cfb6d1c19910ae71ff5f28aa7ab2ff2763330affad795
    path: tokenizer.json
  - hash: 1299c11d7cf632ef3b4e11937501358ada021bbdf7c47638d13c0ee982f2e79c
    path: tokenizer.model
  - hash: bfe25c2735e395407beb78456ea9a6984a1f00d8c16fa04a8b75f2a614cf53e1
    path: tokenizer_config.json
```

Inside the vllm-frontend, this [list of files is currently embedded](cmd/vllm-frontend/models/google/gemma-3-1b-it/model.yaml).
This is why we only support gemma-3-1b-it today, though we plan to relax this in future (e.g. a CRD?)

The blob-server accepts uploads, and we can upload the blobs using a port-forward:

```bash
kubectl port-forward blob-server-0 8081:8080 &
go run ./cmd/model-hasher/ --src gemma-3-1b-it/ --upload http://127.0.0.1:8081
```

This will then store the blobs on the persistent disk of blob-server, so they are now available in-cluster.
You can verify this with `kubectl debug`:

```
> kubectl debug blob-server-0 -it --image=debian:latest --profile=general --share-processes --target blob-server
root@blob-server-0:/# cat /proc/1/mounts | grep blob
/dev/sdb /blobs ext4 rw,relatime 0 0
root@blob-server-0:/# ls -l /proc/1/root/blobs/
total 1991324
-rw------- 1 root root    4689074 Sep  8 21:40 1299c11d7cf632ef3b4e11937501358ada021bbdf7c47638d13c0ee982f2e79c
-rw------- 1 root root       1676 Sep  8 21:39 1468e03275c15d522c721bd987f03c4ee21b8e246b901803c67448bbc25acf58
-rw------- 1 root root        899 Sep  8 21:39 19cb5d28c97778271ba2b3c3df47bf76bdd6706724777a2318b3522230afe91e
-rw------- 1 root root        662 Sep  8 21:40 2f7b0adf4fb469770bb1490e3e35df87b1dc578246c5e7e6fc76ecf33213a397
-rw------- 1 root root 1999811208 Sep  8 21:40 3d4ef8d71c14db7e448a09ebe891cfb6bf32c57a9b44499ae0d1c098e48516b6
-rw------- 1 root root   33384568 Sep  8 21:40 4667f2089529e8e7657cfb6d1c19910ae71ff5f28aa7ab2ff2763330affad795
-rw------- 1 root root         35 Sep  8 21:39 50b2f405ba56a26d4913fd772089992252d7f942123cc0a034d96424221ba946
-rw------- 1 root root      24265 Sep  8 21:39 60be259533fe3acba21d109a51673815a4c29aefdb7769862695086fcedbeb7a
-rw------- 1 root root    1156999 Sep  8 21:40 bfe25c2735e395407beb78456ea9a6984a1f00d8c16fa04a8b75f2a614cf53e1
-rw------- 1 root root        215 Sep  8 21:39 fd9324becc53c4be610db39e13a613006f09fd6ef71a95fb6320dc33157490a3
drwx------ 2 root root      16384 Sep  8 21:38 lost+found
```

## Using the inference server

At this point the vLLM process should (hopefully) have downloaded the model and started.

```bash
kubectl wait --for=condition=Available --timeout=10s deployment/gemma3
kubectl get pods -l app=gemma3
```

To check logs (particularly if the deployment is not yet ready):

```bash
kubectl logs -f -l app=gemma3
```

## Verification / Seeing it Work

1. Forward local requests to the vLLM service:

```bash
# Forward a local port (e.g., 8080) to the service port (80)
kubectl port-forward service/gemma3 8080:80 &
```

2. Send a request to the local forwarding port:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-1b-it",
    "messages": [{"role": "user", "content": "Explain Quantum Computing in simple terms."}],
    "max_tokens": 100
  }'
```

Expected output (or similar):

```json
{"id":"chatcmpl-462b3e153fd34e5ca7f5f02f3bcb6b0c","object":"chat.completion","created":1753164476,"model":"google/gemma-3-1b-it","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Okay, let’s break down quantum computing in a way that’s hopefully understandable without getting lost in too much jargon. Here's the gist:\n\n**1. Classical Computers vs. Quantum Computers:**\n\n* **Classical Computers:** These are the computers you use every day – laptops, phones, servers. They store information as *bits*. A bit is like a light switch: it's either on (1) or off (0). Everything a classical computer does – from playing games","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":16,"total_tokens":116,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null}
```

---

## Cleanup

```bash
kubectl delete deployment gemma3
kubectl delete statefulset blob-server
```
Lines changed: 196 additions & 0 deletions
@@ -0,0 +1,196 @@
```go
// Copyright 2025 The Kubernetes Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package main

import (
	"context"
	"flag"
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
	"strings"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
	"k8s.io/examples/AI/modelcloud/pkg/blobs"
	"k8s.io/klog/v2"
)

func main() {
	if err := run(context.Background()); err != nil {
		fmt.Fprintf(os.Stderr, "%v\n", err)
		os.Exit(1)
	}
}

func run(ctx context.Context) error {
	log := klog.FromContext(ctx)

	listen := ":8080"
	cacheDir := os.Getenv("CACHE_DIR")
	if cacheDir == "" {
		// We expect CACHE_DIR to be set when running on kubernetes, but default sensibly for local dev
		cacheDir = "~/.cache/blob-server/blobs"
	}
	flag.StringVar(&listen, "listen", listen, "listen address")
	flag.StringVar(&cacheDir, "cache-dir", cacheDir, "cache directory")
	flag.Parse()

	if strings.HasPrefix(cacheDir, "~/") {
		homeDir, err := os.UserHomeDir()
		if err != nil {
			return fmt.Errorf("getting home directory: %w", err)
		}
		cacheDir = filepath.Join(homeDir, strings.TrimPrefix(cacheDir, "~/"))
	}

	if err := os.MkdirAll(cacheDir, 0755); err != nil {
		return fmt.Errorf("creating cache directory %q: %w", cacheDir, err)
	}

	blobStore := &blobs.LocalBlobStore{
		LocalDir: cacheDir,
	}

	blobCache := &blobCache{
		CacheDir:  cacheDir,
		blobStore: blobStore,
	}

	s := &httpServer{
		blobCache: blobCache,
		tmpDir:    filepath.Join(cacheDir, "tmp"),
	}

	log.Info("serving http", "endpoint", listen)
	if err := http.ListenAndServe(listen, s); err != nil {
		return fmt.Errorf("serving on %q: %w", listen, err)
	}

	return nil
}

type httpServer struct {
	blobCache *blobCache
	tmpDir    string
}

func (s *httpServer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	tokens := strings.Split(strings.TrimPrefix(r.URL.Path, "/"), "/")
	if len(tokens) == 1 {
		if r.Method == "GET" {
			hash := tokens[0]
			s.serveGETBlob(w, r, hash)
			return
		}
		if r.Method == "PUT" {
			hash := tokens[0]
			s.servePUTBlob(w, r, hash)
			return
		}
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}

	http.Error(w, "not found", http.StatusNotFound)
}

func (s *httpServer) serveGETBlob(w http.ResponseWriter, r *http.Request, hash string) {
	ctx := r.Context()

	log := klog.FromContext(ctx)

	// TODO: Validate hash is hex, right length etc

	f, err := s.blobCache.GetBlob(ctx, hash)
	if err != nil {
		if status.Code(err) == codes.NotFound {
			log.Info("blob not found", "hash", hash)
			http.Error(w, "not found", http.StatusNotFound)
			return
		}
		log.Error(err, "error getting blob")
		http.Error(w, "internal server error", http.StatusInternalServerError)
		return
	}
	defer f.Close()
	p := f.Name()

	log.Info("serving blob", "hash", hash, "path", p)
	http.ServeFile(w, r, p)
}

func (s *httpServer) servePUTBlob(w http.ResponseWriter, r *http.Request, hash string) {
	ctx := r.Context()

	log := klog.FromContext(ctx)

	// TODO: Download to temp file first?

	if err := s.blobCache.PutBlob(ctx, hash, r.Body); err != nil {
		log.Error(err, "error storing blob")
		http.Error(w, "internal server error", http.StatusInternalServerError)
		return
	}

	log.Info("uploaded blob", "hash", hash)

	w.WriteHeader(http.StatusCreated)
}

type blobCache struct {
	CacheDir  string
	blobStore blobs.BlobStore
}

func (c *blobCache) GetBlob(ctx context.Context, hash string) (*os.File, error) {
	log := klog.FromContext(ctx)

	localPath := filepath.Join(c.CacheDir, hash)
	f, err := os.Open(localPath)
	if err == nil {
		return f, nil
	} else if !os.IsNotExist(err) {
		return nil, fmt.Errorf("opening blob %q: %w", hash, err)
	}

	log.Info("blob not found in cache, downloading", "hash", hash)

	err = c.blobStore.Download(ctx, blobs.BlobInfo{Hash: hash}, localPath)
	if err == nil {
		f, err := os.Open(localPath)
		if err != nil {
			return nil, fmt.Errorf("opening blob %q after download: %w", hash, err)
		}
		return f, nil
	}

	return nil, err
}

func (c *blobCache) PutBlob(ctx context.Context, hash string, r io.Reader) error {
	log := klog.FromContext(ctx)

	if err := c.blobStore.Upload(ctx, r, blobs.BlobInfo{Hash: hash}); err != nil {
		log.Error(err, "error uploading blob")
		return fmt.Errorf("uploading blob %q: %w", hash, err)
	}

	// TODO: Side-load into local cache too?

	return nil
}
```
