
Commit 67bd89d
add vllm for gpu based inference
1 parent 7d6f21d

16 files changed: +1065 / -813 lines

README.md

Lines changed: 130 additions & 12 deletions
@@ -1,38 +1,156 @@
# Cost effective and Scalable Model Inference on AWS Graviton with Ray on EKS

## Overview
The solution implements a scalable ML inference architecture using Amazon EKS, leveraging both Graviton processors for CPU-based inference and GPU instances for accelerated inference. The system utilizes Ray Serve for model serving, deployed as containerized workloads within a Kubernetes environment.

## Prerequisites

### 1. EKS cluster with KubeRay Operator installed

You can set up the EKS cluster and install the KubeRay Operator using the provided script in the `base_eks_setup` directory:

```bash
cd base_eks_setup
./provision-v2.sh
```

This script performs the following operations:
- Creates an EKS cluster (version 1.31) with 2 initial nodes
- Installs required EKS add-ons (vpc-cni, coredns, eks-pod-identity-agent)
- Installs cert-manager
- Sets up Karpenter for auto-scaling
- Installs the KubeRay Operator
- Configures Prometheus and Grafana for monitoring

You can modify the following variables in the script to customize your deployment (a sketch for overriding them follows this list):
- `REGION`: AWS region (default: us-east-1)
- `CLUSTER_NAME`: EKS cluster name (default: llm-eks-cluster)
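
If you prefer not to edit the script by hand, and assuming `provision-v2.sh` defines these as plain `REGION=...` and `CLUSTER_NAME=...` assignments near the top (check the script first; this is a sketch, not part of the repo), you could switch them with:

```bash
# Hypothetical overrides: adjust the patterns to match how provision-v2.sh
# actually declares these variables (GNU sed syntax shown).
sed -i 's/^REGION=.*/REGION="us-west-2"/' base_eks_setup/provision-v2.sh
sed -i 's/^CLUSTER_NAME=.*/CLUSTER_NAME="my-llm-eks-cluster"/' base_eks_setup/provision-v2.sh
```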

### 2. Karpenter node pool setup for both Graviton and x86-based GPU instances

The `karpenter-pools` directory contains YAML files for configuring Karpenter node pools:

- **karpenter-cpu-inference-Graviton.yaml**: Configures a node pool for Graviton-based CPU inference
  - Uses ARM64 architecture (Graviton)
  - Targets m7g/c7g instance types (4xlarge)
  - Configured for on-demand instances
  - Includes appropriate taints and labels for workload targeting

- **karpenter-cpu-inference-Graviton-Spot.yaml**: Similar to the above, but configured for Spot instances

- **karpenter-cpu-agent-Graviton.yaml**: Configures a node pool for agent workloads on Graviton

- **Karpenter-gpu-inference-x86.yaml**: Configures a node pool for GPU-based inference
  - Uses x86_64 architecture
  - Targets NVIDIA GPU instances (g5, g6 families)
  - Configured with appropriate EBS storage and system resources

To apply these configurations after the EKS cluster is set up:

```bash
kubectl apply -f karpenter-pools/karpenter-cpu-inference-Graviton.yaml
kubectl apply -f karpenter-pools/Karpenter-gpu-inference-x86.yaml
kubectl apply -f karpenter-pools/Karpenter-agent-Graviton.yaml
```
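
After applying the manifests, you can confirm that Karpenter registered the node pools and node classes (these are the standard Karpenter v1 resources installed by the setup script):

```bash
# The node pools and EC2 node classes defined above should be listed
kubectl get nodepools
kubectl get ec2nodeclasses

# Watch the nodes Karpenter provisions once inference workloads are scheduled
kubectl get nodeclaims -w
```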

### 3. Make sure to run all commands from the appropriate directory

## Deployment Options

### Option 1: CPU-based Inference with llama.cpp on Graviton

Deploy an elastic Ray service hosting the Llama 3.2 model on Graviton:

#### 1. Set your Hugging Face token as the env `HUGGING_FACE_HUB_TOKEN` in the secret section of `ray-service-llamacpp-with-function-calling.yaml`

#### 2. Configure model and inference parameters in the YAML file (a sketch for checking the instance vCPU count follows the notes below):
- `MODEL_ID`: Hugging Face model repository
- `MODEL_FILENAME`: Model file name in the Hugging Face repo
- `N_THREADS`: Number of threads for inference (recommended: match the host EC2 instance vCPU count)
- `CMAKE_ARGS`: C/C++ compile flags for llama.cpp on Graviton

> Note: The example model uses the GGUF format, which is optimized for llama.cpp. See the [GGUF documentation](https://huggingface.co/docs/hub/en/gguf) for details. Different quantization variants of a model can be found in Hugging Face repos such as https://huggingface.co/bartowski or https://huggingface.co/unsloth.
> Note: For function calling, a reasoning model works best; this example uses Qwen-QwQ-32B.
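
Since `N_THREADS` should match the vCPU count of the inference node, you can look that up ahead of time. For example, for the c7g.4xlarge instances targeted by the Graviton inference node pool:

```bash
# Prints the default vCPU count for the instance type (16 for c7g.4xlarge)
aws ec2 describe-instance-types \
  --instance-types c7g.4xlarge \
  --query 'InstanceTypes[0].VCpuInfo.DefaultVCpus' \
  --output text
```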

#### 3. Create the Kubernetes service:
```bash
kubectl create -f ray-service-llamacpp-with-function-calling.yaml
```

#### 4. Get the Kubernetes service name:
```bash
kubectl get svc
```
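
With the service name from the previous step, you can run a quick smoke test. This is a minimal sketch that assumes the Serve application exposes an OpenAI-compatible `/v1/chat/completions` route on port 8000; `<SERVICE-NAME>` and the model name are placeholders, so check the Serve config in the YAML for the actual route, port, and served model name:

```bash
# Forward the Serve port locally, then send a single chat completion request
kubectl port-forward svc/<SERVICE-NAME> 8000:8000 &

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<SERVED-MODEL-NAME>",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```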

### Option 2: GPU-based Inference with vLLM

Deploy an elastic Ray service hosting models on GPU instances using vLLM:

#### 1. Set your Hugging Face token as the env `HUGGING_FACE_HUB_TOKEN` in the secret section of `ray-service-vllm-with-function-calling.yaml`

#### 2. Configure model and inference parameters in the YAML file (these map onto standard vLLM engine arguments, sketched below):
- `MODEL_ID`: Hugging Face model repository (default: mistralai/Mistral-7B-Instruct-v0.2)
- `GPU_MEMORY_UTILIZATION`: Fraction of GPU memory to utilize (default: 0.9)
- `MAX_MODEL_LEN`: Maximum sequence length for the model (default: 8192)
- `MAX_NUM_SEQ`: Maximum number of sequences to process in parallel (default: 4)
- `MAX_NUM_BATCHED_TOKENS`: Maximum number of tokens in a batch (default: 32768)
- `ENABLE_FUNCTION_CALLING`: Set to "true" to enable function calling support
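
For reference, these settings correspond to standard vLLM engine arguments; the Ray Serve application is expected to pass them through to vLLM. Shown only to illustrate what each value controls (this is not part of the deployment), an equivalent standalone OpenAI-compatible vLLM server launch would look roughly like:

```bash
# Illustrative only: maps the env vars above onto vLLM's engine arguments
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 32768
```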

#### 3. Create the Kubernetes service:
```bash
kubectl create namespace rayserve-vllm
kubectl create -f ray-service-vllm-with-function-calling.yaml
```

#### 4. Get the Kubernetes service name:
```bash
kubectl get svc -n rayserve-vllm
```
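
Because the vLLM workers need GPU nodes, it is worth confirming that Karpenter provisioned capacity from the GPU node pool and that the Ray pods are running:

```bash
# Ray head and worker pods for the vLLM service
kubectl get pods -n rayserve-vllm -o wide

# Nodes provisioned by Karpenter, including their node pool and instance type
kubectl get nodeclaims
kubectl get nodes -L karpenter.sh/nodepool,node.kubernetes.io/instance-type
```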

## Agentic AI with Function Calling

This solution supports building agentic AI applications that can leverage either CPU-based (llama.cpp) or GPU-based (vLLM) model inference backends. The agent architecture enables models to call external functions and services.

### Deploying the Function Service

#### 1. Configure the function service:
The function service is defined in `agent/kubernetes/combined.yaml` and includes:
- A Kubernetes Secret for API credentials
- A Deployment for the function service (weather service example)
- A LoadBalancer Service to expose the function API

#### 2. Deploy the function service:
```bash
kubectl apply -f agent/kubernetes/combined.yaml
```
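
The curl examples in step 4 need the external URL of the LoadBalancer Service. `<AGENT-SERVICE-NAME>` below is a placeholder; check `agent/kubernetes/combined.yaml` for the actual Service name:

```bash
# List services and note the EXTERNAL-IP/hostname of the LoadBalancer Service
kubectl get svc

# Or read the hostname directly once the load balancer has been provisioned
kubectl get svc <AGENT-SERVICE-NAME> \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```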

#### 3. Configure your model backend for function calling:
- For CPU-based inference: use `ray-service-llamacpp-with-function-calling.yaml`
- For GPU-based inference: use `ray-service-vllm-with-function-calling.yaml` with `ENABLE_FUNCTION_CALLING: "true"`

#### 4. Test function calling:
Once deployed, you can test the weather function service with a simple curl command:

```bash
curl -X POST http://<YOUR-LOAD-BALANCER-URL>/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is the current weather in London?"}'
```
```bash
curl -X POST http://<YOUR-LOAD-BALANCER-URL>/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is the weather in London over the next 2 days?"}'
```

The service will:
1. Process your natural language query
2. Identify the need to call the weather function
3. Make the appropriate API call
4. Return the weather information in a conversational format

## How do we measure
Our client program generates prompts at varying concurrency levels for each run. Every run assembles common GenAI-related prompts into standard HTTP requests, and the number of concurrent calls keeps increasing until CPU usage on the worker pods approaches 100%. We capture the total time from when an HTTP request is initiated to when the HTTP response is received as the latency metric, and the number of output tokens generated per second as the throughput metric. The test aims to reach maximum CPU utilization on the worker pods to assess concurrency performance. A minimal sketch of the idea follows.
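
The following is not the actual client program, just a minimal sketch of the measurement idea: fire a fixed number of concurrent requests against an (assumed) OpenAI-compatible endpoint and record per-request wall-clock latency. Measuring throughput would additionally require counting the output tokens in each response.

```bash
# Minimal latency-measurement sketch; endpoint, port, and payload are assumptions
ENDPOINT="http://<SERVICE-NAME>:8000/v1/chat/completions"
CONCURRENCY=8

seq "$CONCURRENCY" | xargs -P "$CONCURRENCY" -I{} \
  curl -s -o /dev/null -w "request {}: %{time_total}s\n" \
    -X POST "$ENDPOINT" \
    -H "Content-Type: application/json" \
    -d '{"model": "<SERVED-MODEL-NAME>", "messages": [{"role": "user", "content": "Summarize what RAG is in two sentences."}]}'
```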

Follow this guidance if you want to set it up and replicate the experiment.

agent/kubernetes/combined.yaml

Lines changed: 1 addition & 0 deletions
@@ -24,6 +24,7 @@ spec:
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64
        karpenter.sh/nodepool: karpenter-cpu-agent-Graviton
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
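
After redeploying the agent with this nodeSelector, you can confirm the pod landed on a node from the `karpenter-cpu-agent-Graviton` pool (pod names depend on the manifest):

```bash
# Show which node each pod was scheduled on
kubectl get pods -o wide

# Confirm that node carries the expected Karpenter node pool label
kubectl get nodes -L karpenter.sh/nodepool
```
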
Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
---
# https://karpenter.sh/docs/concepts/scheduling/
# https://karpenter.sh/docs/concepts/nodepools/
# https://github.com/awslabs/amazon-eks-ami/releases
# https://marcincuber.medium.com/amazon-eks-implementing-and-using-gpu-nodes-with-nvidia-drivers-08d50fd637fe
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-general
spec:
  limits:
    # cpu: 10
    # memory: 512Gi
    cpu: 1024
    memory: 8192Gi
    # nvidia.com/gpu: "2"
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  template:
    metadata:
      labels:
        model-inferencing: "gpu-general"
        ray-control-plane: "false"
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-general
      taints:
        - key: "model-inferencing"
          value: "gpu-general"
          effect: NoSchedule
      expireAfter: 8h
      requirements:
        # - key: karpenter.k8s.aws/instance-size
        #   operator: In
        #   values:
        #     - 4xlarge
        #     - 8xlarge
        # https://karpenter.sh/docs/reference/instance-types/#g5-family
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values:
            - g
            # - p
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "g6"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-gpu-manufacturer
          operator: In
          values: ["nvidia"]
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: In
          values: ["1", "2", "4", "8"]
        # - key: "node.kubernetes.io/instance-type"
        #   operator: In
        #   values: ["p3.2xlarge", "p3.8xlarge", "p3.16xlarge", "g4dn.xlarge", "g4dn.2xlarge", "g4dn.4xlarge", "g4dn.8xlarge", "g4dn.12xlarge", "g4dn.16xlarge"]
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-general
spec:
  kubelet:
    podsPerCore: 2
    maxPods: 20
    systemReserved:
      cpu: 500m
      memory: 900Mi
  subnetSelectorTerms:
    - tags:
        eksctl.cluster.k8s.io/v1alpha1/cluster-name: "llm-eks-cluster"
    # - id: "subnet-06cec24e5bcb56f31"
  securityGroupSelectorTerms:
    - tags:
        eksctl.cluster.k8s.io/v1alpha1/cluster-name: "llm-eks-cluster"
    # - id: "sg-08658ba17c0fe1ad0"
  amiFamily: "AL2023"
  # AMI names acquired from https://github.com/awslabs/amazon-eks-ami/releases
  amiSelectorTerms:
    - name: "amazon-eks-node-al2023-x86_64-nvidia-1.30-v*"
    # - alias: al2023@latest
    # - name: "amazon-eks-gpu-node-1.30-v*"
    # - name: "amazon-eks-gpu-node-1.31-v*"
    # - name: "amazon-eks-node-al2023-x86_64-nvidia-560-1.30-v20241011" # "al2023-ami-minimal-2023.5.20241001.1-kernel-6.1-x86_64"
    #   id: "ami-0770ab88ec35aa875"
    # - name: "amazon-eks-gpu-node-1.30-v20241011"
    #   id: "ami-07c27f5bd7921bea1"
    # - id: "ami-01637a5ffbb75ef5c"  # EKS image for CPU
  role: eksctl-llm-eks-cluster-nodegroup-n-NodeInstanceRole-y411lzob4Y8u
  tags:
    model-inferencing: "gpu-general"
    ray-control-plane: "false"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 700Gi
        volumeType: gp3
        iops: 10000
        encrypted: false
        # kmsKeyID: "1234abcd-12ab-34cd-56ef-1234567890ab"
        deleteOnTermination: true
        throughput: 512
        # snapshotID: snap-0123456789

Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
---
# https://karpenter.sh/docs/concepts/scheduling/
# https://karpenter.sh/docs/concepts/nodepools/
# aws ssm get-parameters --names /aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2 --region us-east-1
# https://github.com/awslabs/amazon-eks-ami/releases

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: karpenter-cpu-agent-Graviton
spec:
  limits:
    cpu: 512
    memory: 8192Gi
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s
  template:
    metadata:
      labels:
        model-inferencing: "agent-arm"
        ray-control-plane: "false"
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: karpenter-cpu-agent-Graviton
      taints:
        - key: "model-inferencing"
          value: "agent-arm"
          effect: NoSchedule
      expireAfter: 1h
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values:
            - m
            - c
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge"]
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
        - key: karpenter.k8s.aws/instance-generation
          operator: In
          values: ["6", "7", "8"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["c7g.4xlarge"]
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: karpenter-cpu-agent-Graviton
spec:
  kubelet:
    podsPerCore: 2
    maxPods: 20
    systemReserved:
      cpu: 100m
      memory: 100Mi
  subnetSelectorTerms:
    - tags:
        eksctl.cluster.k8s.io/v1alpha1/cluster-name: "llm-eks-cluster"
  securityGroupSelectorTerms:
    - tags:
        eksctl.cluster.k8s.io/v1alpha1/cluster-name: "llm-eks-cluster"
  amiFamily: "AL2023"
  amiSelectorTerms:
    - name: "amazon-eks-node-al2023-arm64-standard-1.30-*"
  role: "eksctl-llm-eks-cluster-nodegroup-n-NodeInstanceRole-y411lzob4Y8u"
  tags:
    model-inferencing: "agent-arm"
    ray-control-plane: "false"
  detailedMonitoring: true
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 10000
        encrypted: false
        deleteOnTermination: true
        throughput: 256
