
Commit 40b5326

revise readme
1 parent 5a9cec9 commit 40b5326

File tree

4 files changed: +102 −0 lines changed

.DS_Store

0 Bytes
Binary file not shown.

README.md

Lines changed: 27 additions & 0 deletions
@@ -3,6 +3,33 @@

## Overview

The solution implements a scalable ML inference architecture using Amazon EKS, leveraging both Graviton processors for CPU-based inference and GPU instances for accelerated inference. The system utilizes Ray Serve for model serving, deployed as containerized workloads within a Kubernetes environment.

## Architecture

![Architecture Diagram](image/Diagram.png)

The architecture diagram illustrates our scalable ML inference solution with the following components:
1. **Amazon EKS Cluster**: The foundation of our architecture, providing a managed Kubernetes environment.

2. **Karpenter Auto-scaling**: Dynamically provisions and scales compute resources based on workload demands.

3. **Node Pools**:
   - **Graviton-based nodes (ARM64)**: Cost-effective CPU inference using m8g/c8g instances
   - **GPU-based nodes (x86_64)**: High-performance inference using NVIDIA GPU instances (g5, g6 families)

4. **Ray Serve Deployment**:
   - **Ray Head**: Manages the Ray cluster and coordinates workload distribution
   - **Ray Workers**: Execute the inference tasks with either llama.cpp (on Graviton) or vLLM (on GPU)

5. **LiteLLM Proxy**: Acts as a unified inference API gateway, providing standardized OpenAI-compatible endpoints and handling request routing, load balancing, and fallback mechanisms across multiple model backends.

6. **Function Calling Service**: Enables agentic AI capabilities by allowing models to interact with external APIs and services.

7. **Monitoring & Observability**: Prometheus and Grafana for performance monitoring and visualization.
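
Because the LiteLLM proxy exposes OpenAI-compatible endpoints, any OpenAI-style client can target it. Below is a minimal standard-library sketch; the service URL and model name are illustrative placeholders, not values defined in this repository:

```python
import json
import urllib.request

# Hypothetical in-cluster address for the LiteLLM proxy; adjust to your Service.
LITELLM_URL = "http://litellm-proxy:4000/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(model: str, prompt: str) -> str:
    """POST a chat request to the proxy and return the first reply."""
    payload = build_chat_request(model, prompt)
    req = urllib.request.Request(
        LITELLM_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The same request shape works regardless of whether the proxy routes to the llama.cpp or vLLM backend, which is the point of the unified gateway.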
This architecture provides flexibility to choose between cost-optimized CPU inference on Graviton processors or high-throughput GPU inference based on your specific requirements, all while maintaining elastic scalability through Kubernetes and Karpenter.
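
In practice, the CPU-or-GPU choice is expressed per workload through Kubernetes node selection. The sketch below shows selectors a Ray worker pod spec could use for each pool; it relies on the well-known `kubernetes.io/arch` label and the Karpenter `karpenter.sh/nodepool` label, with pool names that are hypothetical and should match your NodePool definitions:

```python
# Sketch: nodeSelector values steering a pod onto either inference pool.
# The nodepool names below are made up for illustration.

def node_selector(target: str) -> dict:
    """Return a Kubernetes nodeSelector for the chosen inference pool."""
    pools = {
        "graviton": {
            "kubernetes.io/arch": "arm64",
            "karpenter.sh/nodepool": "graviton-cpu",   # hypothetical name
        },
        "gpu": {
            "kubernetes.io/arch": "amd64",
            "karpenter.sh/nodepool": "gpu-inference",  # hypothetical name
        },
    }
    return pools[target]
```

Karpenter observes pending pods carrying these selectors and provisions matching Graviton or GPU capacity on demand.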
For networking-intensive workloads such as agent orchestration and API proxying, we deploy these components on Graviton instances: their strong price-performance for high-throughput, concurrent connection handling makes them a good fit for the LiteLLM proxy and agent services, which manage numerous simultaneous requests and API calls. This keeps these connection-oriented workloads cost-efficient without sacrificing responsiveness.
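
The proxy's routing-with-fallback behavior can be sketched in plain Python. This is illustrative only — LiteLLM implements this logic internally — and the backend names here are made up:

```python
def route_with_fallback(prompt, backends):
    """Try each backend in priority order, falling back on failure."""
    errors = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real proxy would catch narrower errors
            errors.append((name, exc))
    raise RuntimeError(f"all backends failed: {errors}")

def flaky_gpu(prompt):
    raise TimeoutError("gpu busy")  # simulate an overloaded GPU backend

def cpu_backend(prompt):
    return f"echo: {prompt}"        # stand-in for a llama.cpp reply

# Prefer the GPU/vLLM backend, fall back to Graviton/llama.cpp.
backends = [("vllm-gpu", flaky_gpu), ("llamacpp-graviton", cpu_backend)]
```

When the preferred backend times out, the request transparently lands on the next one, which is what lets the CPU and GPU pools back each other up behind a single endpoint.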
## Prerequisites

### 1. EKS cluster with KubeRay Operator installed
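
The KubeRay operator is commonly installed from its Helm chart; a sketch of the usual steps (consult the KubeRay documentation for the current chart version and recommended flags):

```shell
# Add the KubeRay Helm repository and install the operator into the cluster.
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator
```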

image/Diagram.drawio

Lines changed: 75 additions & 0 deletions
Large diffs are not rendered by default.

image/Diagram.png

112 KB
