# Cost-Effective and Scalable Model Inference on AWS Graviton with Ray on EKS

## Overview
The solution implements a scalable ML inference architecture using Amazon EKS, leveraging both Graviton processors for CPU-based inference and GPU instances for accelerated inference. The system utilizes Ray Serve for model serving, deployed as containerized workloads within a Kubernetes environment.

## Prerequisites

### 1. EKS cluster with KubeRay Operator installed

You can set up the EKS cluster and install the KubeRay Operator using the provided script in the `base_eks_setup` directory:

```bash
cd base_eks_setup
./provision-v2.sh
```

This script performs the following operations:
- Creates an EKS cluster (version 1.31) with 2 initial nodes
- Installs required EKS add-ons (vpc-cni, coredns, eks-pod-identity-agent)
- Installs cert-manager
- Sets up Karpenter for auto-scaling
- Installs the KubeRay Operator
- Configures Prometheus and Grafana for monitoring

You can modify the following variables in the script to customize your deployment:
- `REGION`: AWS region (default: us-east-1)
- `CLUSTER_NAME`: EKS cluster name (default: llm-eks-cluster)

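After the script completes, a quick sanity check along these lines confirms that the cluster, Karpenter, and the KubeRay Operator are running (a minimal sketch; namespaces may differ in your installation):

```bash
# Confirm the initial worker nodes joined the cluster
kubectl get nodes -o wide

# Confirm the KubeRay Operator pod is running (namespace depends on how the script installs it)
kubectl get pods -A | grep -i kuberay

# Confirm Karpenter is healthy (assumes the default "karpenter" namespace)
kubectl get pods -n karpenter
```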
### 2. Karpenter node pool setup for both Graviton and x86-based GPU instances

The `karpenter-pools` directory contains YAML files for configuring Karpenter node pools:

- **karpenter-cpu-inference-Graviton.yaml**: Configures a node pool for Graviton-based CPU inference
  - Uses ARM64 architecture (Graviton)
  - Targets m7g/c7g instance types (4xlarge)
  - Configured for on-demand instances
  - Includes appropriate taints and labels for workload targeting

- **karpenter-cpu-inference-Graviton-Spot.yaml**: Similar to the above, but configured for Spot instances

- **karpenter-cpu-agent-Graviton.yaml**: Configures a node pool for agent workloads on Graviton

- **Karpenter-gpu-inference-x86.yaml**: Configures a node pool for GPU-based inference
  - Uses x86_64 architecture
  - Targets NVIDIA GPU instances (g5, g6 families)
  - Configured with appropriate EBS storage and system resources

To apply these configurations after the EKS cluster is set up:

```bash
kubectl apply -f karpenter-pools/karpenter-cpu-inference-Graviton.yaml
kubectl apply -f karpenter-pools/Karpenter-gpu-inference-x86.yaml
kubectl apply -f karpenter-pools/karpenter-cpu-agent-Graviton.yaml
# Optionally, use the Spot variant for CPU inference instead:
# kubectl apply -f karpenter-pools/karpenter-cpu-inference-Graviton-Spot.yaml
```

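After applying them, you can confirm the node pools were registered and, once workloads are scheduled, inspect the nodes Karpenter provisions (a sketch; the label keys assume a recent Karpenter release):

```bash
# List the registered Karpenter node pools
kubectl get nodepools

# Show provisioned nodes with their CPU architecture and capacity type labels
kubectl get nodes -L kubernetes.io/arch -L karpenter.sh/capacity-type
```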
### 3. Run all commands from the appropriate directory for your chosen deployment option

## Deployment Options

### Option 1: CPU-based Inference with llama.cpp on Graviton

Deploy an elastic Ray service hosting a Llama 3.2 model on Graviton:

#### 1. Set your Hugging Face token as the `HUGGING_FACE_HUB_TOKEN` env value in the secret section of `ray-service-llamacpp-with-function-calling.yaml`

#### 2. Configure model and inference parameters in the YAML file:
- `MODEL_ID`: Hugging Face model repository
- `MODEL_FILENAME`: Model file name in the Hugging Face repo
- `N_THREADS`: Number of threads for inference (recommended: match the host EC2 instance's vCPU count)
- `CMAKE_ARGS`: C/C++ compile flags for llama.cpp on Graviton

> Note: The example model uses GGUF format, optimized for llama.cpp. See the [GGUF documentation](https://huggingface.co/docs/hub/en/gguf) for details. Different quantization versions of a model are available in Hugging Face repos such as https://huggingface.co/bartowski or https://huggingface.co/unsloth.

> Note: For function calling, a reasoning model such as Qwen-QwQ-32B (used in this example) works best.

#### 3. Create the Kubernetes service:
```bash
kubectl create -f ray-service-llamacpp-with-function-calling.yaml
```

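It can take a few minutes for Karpenter to provision Graviton nodes and for the model to download. A sketch of how to watch progress (the RayService resource name comes from the YAML file):

```bash
# Watch the Ray head and worker pods come up
kubectl get pods -w

# Check the status the KubeRay Operator reports for the RayService
kubectl get rayservice
```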
#### 4. Get the Kubernetes service name:
```bash
kubectl get svc
```

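Once the service is ready, you can send a test request. This is a minimal sketch that assumes the Serve application exposes an OpenAI-compatible chat completions route on port 8000; the actual service name, port, and route come from `ray-service-llamacpp-with-function-calling.yaml`:

```bash
# Forward the Ray Serve port locally; replace <serve-service-name> with the name from `kubectl get svc`
kubectl port-forward svc/<serve-service-name> 8000:8000 &

# Send a test request (path and payload assume an OpenAI-compatible API)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Give a one-sentence summary of AWS Graviton."}]}'
```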
### Option 2: GPU-based Inference with vLLM

Deploy an elastic Ray service hosting models on GPU instances using vLLM:

#### 1. Set your Hugging Face token as the `HUGGING_FACE_HUB_TOKEN` env value in the secret section of `ray-service-vllm-with-function-calling.yaml`

#### 2. Configure model and inference parameters in the YAML file:
- `MODEL_ID`: Hugging Face model repository (default: mistralai/Mistral-7B-Instruct-v0.2)
- `GPU_MEMORY_UTILIZATION`: Fraction of GPU memory to utilize (default: 0.9)
- `MAX_MODEL_LEN`: Maximum sequence length for the model (default: 8192)
- `MAX_NUM_SEQ`: Maximum number of sequences to process in parallel (default: 4)
- `MAX_NUM_BATCHED_TOKENS`: Maximum number of tokens in a batch (default: 32768)
- `ENABLE_FUNCTION_CALLING`: Set to "true" to enable function calling support

#### 3. Create the Kubernetes service:
```bash
kubectl create namespace rayserve-vllm
kubectl create -f ray-service-vllm-with-function-calling.yaml
```

#### 4. Get the Kubernetes service name:
```bash
kubectl get svc -n rayserve-vllm
```

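As with Option 1, you can send a quick test request once the service is ready (a sketch under the same assumption of an OpenAI-compatible route; the service name and port come from the YAML file):

```bash
# Forward the Ray Serve port locally; replace <serve-service-name> with the name from `kubectl get svc -n rayserve-vllm`
kubectl port-forward -n rayserve-vllm svc/<serve-service-name> 8000:8000 &

# Send a test request against the default Mistral model
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Mistral-7B-Instruct-v0.2", "messages": [{"role": "user", "content": "Hello!"}]}'
```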
## Agentic AI with Function Calling

This solution supports building agentic AI applications that can leverage either CPU-based (llama.cpp) or GPU-based (vLLM) model inference backends. The agent architecture enables models to call external functions and services.

### Deploying the Function Service

#### 1. Configure the function service:
The function service is defined in `agent/kubernetes/combined.yaml` and includes:
- A Kubernetes Secret for API credentials
- A Deployment for the function service (weather service example)
- A LoadBalancer Service to expose the function API

#### 2. Deploy the function service:
```bash
kubectl apply -f agent/kubernetes/combined.yaml
```

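The test commands in step 4 need the external URL of the LoadBalancer Service. A sketch for retrieving it (replace `<function-service-name>` with the Service name defined in `agent/kubernetes/combined.yaml`):

```bash
# Retrieve the external hostname of the function service's LoadBalancer
kubectl get svc <function-service-name> \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```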
#### 3. Configure your model backend for function calling:
- For CPU-based inference: Use `ray-service-llamacpp-with-function-calling.yaml`
- For GPU-based inference: Use `ray-service-vllm-with-function-calling.yaml` with `ENABLE_FUNCTION_CALLING: "true"`

#### 4. Test function calling:
Once deployed, you can test the weather function service with simple curl commands:

```bash
curl -X POST http://<YOUR-LOAD-BALANCER-URL>/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is the current weather in London?"}'

curl -X POST http://<YOUR-LOAD-BALANCER-URL>/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is the weather forecast for the next 2 days in London?"}'
```

The service will:
1. Process your natural language query
2. Identify the need to call the weather function
3. Make the appropriate API call
4. Return the weather information in a conversational format

## How do we measure
Our client program generates prompts at different concurrency levels for each run. Every run assembles common GenAI-related prompts into standard HTTP requests, and the number of concurrent calls keeps increasing until CPU usage reaches nearly 100%. We capture the total time from when an HTTP request is initiated to when an HTTP response is received as the latency metric of model performance, and output tokens generated per second as throughput. The test aims to drive the worker pods to maximum CPU utilization to assess concurrency performance.

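For illustration, a minimal load-generation sketch along these lines captures the idea; this is not the project's client program, and the endpoint URL, request payload, and concurrency values are placeholder assumptions:

```bash
# Minimal concurrency sketch (illustration only, not the project's client program).
ENDPOINT="http://localhost:8000/v1/chat/completions"   # placeholder endpoint
CONCURRENCY=8   # parallel requests per run; increase between runs until CPU nears 100%
REQUESTS=32     # total requests per run

send_request() {
  local start end
  start=$(date +%s.%N)
  curl -s -X POST "$ENDPOINT" \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Summarize what Ray Serve does."}]}' \
    > /dev/null
  end=$(date +%s.%N)
  # Per-request latency: time from request initiation to response received
  echo "latency_seconds=$(echo "$end - $start" | bc)"
}
export -f send_request
export ENDPOINT

# Fire REQUESTS requests with CONCURRENCY of them in flight at a time
seq "$REQUESTS" | xargs -P "$CONCURRENCY" -I{} bash -c 'send_request'
```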
Follow this guidance if you want to set it up and replicate the experiment.