
Commit 24de683

Merge pull request #23 from jgarciao/add-overlay-remote-model
Add example overlay to use an inference model deployed remotely when creating a LlamaStackDistribution
2 parents 5ed1f57 + 64ea827

4 files changed: +99 -1 lines changed

DEPLOYMENT.md

Lines changed: 26 additions & 1 deletion

````diff
@@ -24,7 +24,7 @@ oc new-project rag-stack
 
 ### 3. Deploy the Stack
 
-The project offers three deployment options:
+The project offers multiple deployment options:
 
 #### Option A: Default Setup (KServe vLLM + Llama 3.2)
 ```bash
@@ -42,6 +42,31 @@ oc apply -k stack/overlays/vllm-standalone-llama3.2
 oc patch secret hf-token-secret --type='merge' -p='{"data":{"HF_TOKEN":"'$(echo -n "hf_your_token" | base64)'"}}'
 ```
 
+#### Option D: Setup Using an Inference Model Deployed Remotely
+
+```bash
+# Create the secret llama-stack-inference-model-secret providing the model info
+# Important:
+# - Make sure the value of INFERENCE_MODEL is correct (it must not contain dots)
+# - VLLM_URL can be an internal or external endpoint for the model; append /v1 at the end
+# - NEVER set VLLM_TLS_VERIFY=false in production
+export INFERENCE_MODEL="llama-3-2-3b"
+export VLLM_URL="https://llama-3-2-3b.apps.remote-cluster.com:443/v1"
+export VLLM_TLS_VERIFY="false"
+export VLLM_API_TOKEN="XXXXXXXXXXXXXXXXXXXXXXX"
+
+oc create secret generic llama-stack-inference-model-secret \
+  --from-literal INFERENCE_MODEL="$INFERENCE_MODEL" \
+  --from-literal VLLM_URL="$VLLM_URL" \
+  --from-literal VLLM_TLS_VERIFY="$VLLM_TLS_VERIFY" \
+  --from-literal VLLM_API_TOKEN="$VLLM_API_TOKEN"
+
+# Deploy the LlamaStackDistribution
+oc apply -k stack/overlays/vllm-remote-inference-model
+```
+
 ### 4. Verify Deployment
 
 Check if all pods are running:
````
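One sanity check worth doing before applying the Option D overlay is to confirm that the remote endpoint actually serves the model. A minimal sketch reusing the variables exported above (vLLM exposes an OpenAI-compatible `/v1/models` route, and `VLLM_URL` already ends in `/v1`; `-k` skips TLS verification, mirroring `VLLM_TLS_VERIFY=false`, and should likewise never be used in production):

```bash
# List the models served by the remote vLLM endpoint;
# the returned ids should include $INFERENCE_MODEL
curl -sk -H "Authorization: Bearer $VLLM_API_TOKEN" "$VLLM_URL/models"
```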
stack/overlays/vllm-remote-inference-model/kustomization.yaml

Lines changed: 6 additions & 0 deletions

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- llama-stack-distribution.yaml
```
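The overlay itself can be rendered locally before it is applied; a small sketch, taking the overlay path from the `oc apply -k` command in DEPLOYMENT.md (`kubectl kustomize` behaves the same):

```bash
# Render the overlay's manifests to stdout without applying them
oc kustomize stack/overlays/vllm-remote-inference-model

# Or validate them with a client-side dry run
oc apply -k stack/overlays/vllm-remote-inference-model --dry-run=client
```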
stack/overlays/vllm-remote-inference-model/llama-stack-distribution.yaml

Lines changed: 51 additions & 0 deletions

```yaml
---
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: lsd-llama-milvus
spec:
  replicas: 1
  server:
    containerSpec:
      resources:
        requests:
          cpu: "250m"
          memory: "500Mi"
        limits:
          cpu: "2"
          memory: "12Gi"
      env:
        - name: INFERENCE_MODEL
          valueFrom:
            secretKeyRef:
              key: INFERENCE_MODEL
              name: llama-stack-inference-model-secret
              optional: true
        - name: VLLM_URL
          valueFrom:
            secretKeyRef:
              key: VLLM_URL
              name: llama-stack-inference-model-secret
              optional: true
        - name: VLLM_TLS_VERIFY
          valueFrom:
            secretKeyRef:
              key: VLLM_TLS_VERIFY
              name: llama-stack-inference-model-secret
              optional: true
        - name: VLLM_API_TOKEN
          valueFrom:
            secretKeyRef:
              key: VLLM_API_TOKEN
              name: llama-stack-inference-model-secret
              optional: true
        - name: MILVUS_DB_PATH
          value: ~/.llama/milvus.db
        - name: FMS_ORCHESTRATOR_URL
          value: "http://localhost"
      name: llama-stack
      port: 8321
    distribution:
      image: quay.io/opendatahub/llama-stack:odh
    storage:
      size: "5Gi"
```
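Each secret-backed variable is marked `optional: true`, so the server pod starts even when `llama-stack-inference-model-secret` has not been created yet; it just cannot reach the remote model until the secret exists. A possible smoke test once the pod is running, assuming `<pod>` is filled in from `oc get pods` and that the image exposes llama-stack's `/v1/models` listing route:

```bash
# Forward the server port (8321, per the manifest) to localhost
oc port-forward pod/<pod> 8321:8321 &

# Ask the stack which models are registered; the remote model should appear
curl -s http://localhost:8321/v1/models
```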
Lines changed: 16 additions & 0 deletions

```yaml
# Secret included as an example. It should be created manually with the right
# values via `oc create secret generic ...` before creating the LlamaStackDistribution.
# Important:
# - Make sure the value of INFERENCE_MODEL is correct (it must not contain dots)
# - VLLM_URL can be an internal or external endpoint for the model; append /v1 at the end
# - NEVER set VLLM_TLS_VERIFY=false in production
apiVersion: v1
kind: Secret
metadata:
  name: llama-stack-inference-model-secret
type: Opaque
stringData:
  INFERENCE_MODEL: "<your-model-id>"
  VLLM_API_TOKEN: "<paste-api-token>"
  VLLM_TLS_VERIFY: "true" # or "false"
  VLLM_URL: "https://your-model-id.example.com/v1"
```
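Once created, the secret can be double-checked in place; a brief sketch (keys set via `stringData` are stored base64-encoded under `data`):

```bash
# Print a single key of the secret, decoding its base64 value
oc get secret llama-stack-inference-model-secret \
  -o jsonpath='{.data.VLLM_URL}' | base64 -d
```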
