The solution implements a scalable ML inference architecture using Amazon EKS, leveraging both Graviton processors for CPU-based inference and GPU instances for accelerated inference. The system utilizes Ray Serve for model serving, deployed as containerized workloads within a Kubernetes environment.
## Architecture
The architecture diagram illustrates our scalable ML inference solution with the following components:
6. **Function Calling Service**: Enables agentic AI capabilities by allowing models to interact with external APIs and services.

7. **MCP (Model Context Protocol)**: Provides augmented LLM capabilities by combining tool usage with Retrieval Augmented Generation (RAG) for enhanced context awareness.

8. **Monitoring & Observability**: Prometheus and Grafana for performance monitoring and visualization.
This architecture provides flexibility to choose between cost-optimized CPU inference on Graviton processors or high-throughput GPU inference based on your specific requirements, all while maintaining elastic scalability through Kubernetes and Karpenter.
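For example, once Karpenter has provisioned capacity you can see which node types are serving inference by listing nodes with their architecture and instance-type labels (these are standard Kubernetes/EKS labels; the exact node pools depend on your configuration):

```bash
# Distinguish Graviton (arm64) CPU nodes from GPU instances
kubectl get nodes -L kubernetes.io/arch -L node.kubernetes.io/instance-type
```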
The service will:
3. Make the appropriate API call
4. Return the weather information in a conversational format
## Deploy LLM Gateway
The LLM Gateway serves as a unified API layer for accessing multiple model backends through a standardized OpenAI-compatible interface. This section guides you through deploying LiteLLM as a proxy service on your EKS cluster.

### Overview
LiteLLM Proxy provides:

- A unified OpenAI-compatible API for multiple model backends
- Load balancing and routing between different models
- Fallback mechanisms for high availability
- Observability and monitoring capabilities
- Authentication and rate limiting
### Deployment Steps

#### 1. Configure the LiteLLM service:

The LiteLLM service is defined in `dockerfiles/litellm/combined.yaml` and includes:

- A ConfigMap for the LiteLLM configuration
- A Secret for API keys and database connection
- A Deployment for the LiteLLM proxy service
- A ClusterIP Service to expose the proxy internally
- An Ingress to expose the service externally via an ALB
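After the configuration and secrets described below have been customized, the whole stack can be applied in one step; a minimal sketch (the `app=litellm` label selector is an assumption, use whatever labels the Deployment actually sets):

```bash
# Create or update the LiteLLM ConfigMap, Secret, Deployment, Service, and Ingress
kubectl apply -f dockerfiles/litellm/combined.yaml

# Check that the proxy pods and the ALB-backed ingress come up
kubectl get pods,ingress -l app=litellm
```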
#### 2. Customize the model configuration:

Edit the `config.yaml` section in the ConfigMap to specify your model backends:
```yaml
model_list:
  - model_name: your-model-name
    litellm_params:
      model: openai/your-model-name
      api_base: http://your-model-endpoint/v1
      api_key: os.environ/OPENAI_API_KEY
```
#### 3. Update the secrets:

Replace the base64-encoded API keys in the Secret section with your actual keys.
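Kubernetes stores Secret values base64-encoded, so encode each key before placing it into the manifest, for example:

```bash
# Base64-encode an API key for the Secret manifest (the key shown is a placeholder)
echo -n 'sk-your-api-key' | base64
```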
"messages": [{"role": "user", "content": "Hello, how are you?"}]
259
+
}'
260
+
```
The LiteLLM proxy will route your request to the appropriate model backend based on the model name specified in the request.
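Because the proxy follows the OpenAI API surface, you can also list the model names it is configured to serve before wiring up clients (endpoint and key are placeholders):

```bash
# List the models registered in the proxy's model_list
curl http://<your-litellm-endpoint>/v1/models \
  -H "Authorization: Bearer <your-litellm-api-key>"
```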
## Installing Milvus Vector Database in EKS

Milvus is an open-source vector database that powers embedding similarity search and AI applications. This section guides you through deploying Milvus on your EKS cluster with Graviton processors.
### Prerequisites

- Your EKS cluster is already set up with Graviton (ARM64) nodes
- Cert-manager is installed on the cluster
- AWS EBS CSI driver is configured for persistent storage
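These prerequisites can be verified before installing Milvus (the `cert-manager` namespace shown is the chart default and may differ in your cluster):

```bash
# cert-manager pods should be Running (default namespace assumed)
kubectl get pods -n cert-manager

# The EBS CSI driver should be registered and a StorageClass available
kubectl get csidriver ebs.csi.aws.com
kubectl get storageclass
```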
### Deployment Steps
#### 1. Install cert-manager (if not already installed):
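If cert-manager is not already present, one common way to install it is through its official Helm chart (the values shown are the commonly documented defaults, not necessarily what this repository uses):

```bash
# Add the Jetstack chart repository and install cert-manager with its CRDs
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true
```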
Once Milvus is installed, you can access it through the Network Load Balancer if you deployed the NLB service.
## Deploying MCP (Model Context Protocol) Service

The MCP service enables augmented LLM capabilities by combining tool usage with Retrieval Augmented Generation (RAG) for enhanced context awareness. This implementation is framework-independent, not relying on LangChain or LlamaIndex.

### Architecture
The MCP service consists of several modular components:

- **Agent**: Coordinates workflow and manages tool usage
- **ChatOpenAI**: Handles interactions with the language model and tool calling
- **MCPClient**: Connects to MCP servers and manages tool calls
- **EmbeddingRetriever**: Creates and searches vector embeddings for relevant context
- **VectorStore**: Interfaces with Milvus for storing and retrieving embeddings
### Workflow
1. **Knowledge Embedding**
   - Documents from the `knowledge` directory are converted to vector embeddings
   - Embeddings and source documents are stored in Milvus vector database

2. **Context Retrieval (RAG)**
   - User queries are converted to embeddings
   - The system finds relevant documents by calculating similarity between embeddings
   - Top matching documents form context for the LLM

3. **MCP Tool Setup**
   - MCP clients connect to tool servers (e.g., filesystem operations)
   - Tools are registered with the agent

4. **Task Execution**
   - User tasks are processed by the LLM with retrieved context
   - The LLM may use tools via MCP clients
   - Tool results are fed back to the LLM to continue the conversation
The implementation can be extended by:

- Adding more MCP servers for additional tool capabilities
- Implementing advanced Milvus features like filtering and hybrid search
- Adding more sophisticated RAG techniques
- Implementing conversation history for multi-turn interactions
- Deploying as a service with API endpoints
## How do we measure
Our client program generates prompts at a different concurrency level for each run. Each run assembles common GenAI-related prompts into standard HTTP requests, and the number of concurrent calls keeps increasing until CPU usage on the worker pods reaches nearly 100%. We capture the total time from when an HTTP request is initiated to when the HTTP response is received as the latency metric of model performance, and the number of output tokens generated per second as the throughput metric. The test aims to drive the worker pods to maximum CPU utilization in order to assess concurrency performance.
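As a rough illustration of this approach (not the actual client program), a concurrency sweep with a generic HTTP load tool such as `hey` might look like the following; the endpoint, payload, and concurrency levels are placeholders:

```bash
# Sweep increasing concurrency levels against the inference endpoint.
# 'hey' reports the request latency distribution; token throughput is read
# from server-side metrics (e.g., Prometheus) rather than from this tool.
for c in 1 5 10 20 50 100; do
  echo "=== concurrency: $c ==="
  hey -n $((c * 20)) -c "$c" -m POST \
    -T "application/json" \
    -d '{"model": "your-model-name", "messages": [{"role": "user", "content": "Explain what a vector database is."}]}' \
    http://<your-inference-endpoint>/v1/chat/completions
done
```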