
Commit e3362c7

Remove node_modules from git tracking
1 parent 68703f1 commit e3362c7


65 files changed: +5401 -85 lines

.DS_Store

2 KB
Binary file not shown.

.gitignore

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
# Node.js dependencies
node_modules/
npm-debug.log
yarn-debug.log
yarn-error.log
.pnpm-debug.log

# Environment variables
.env
.env.local
.env.development.local
.env.test.local
.env.production.local

# Build outputs
dist/
build/
out/
.next/

# Cache directories
.npm/
.pnpm-store/
.yarn/cache
.yarn/unplugged
.yarn/build-state.yml
.yarn/install-state.gz

# Editor directories and files
.idea/
.vscode/
*.swp
*.swo
*~

# OS generated files
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db
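
Since `.gitignore` only stops future tracking, a `node_modules/` directory that was already committed also has to be dropped from the index. A minimal sketch of the usual git recipe (shown for context; not taken from this commit's actual command history):

```bash
# Stop tracking node_modules while keeping it on disk
git rm -r --cached node_modules

# Commit the removal together with the new .gitignore
git add .gitignore
git commit -m "Remove node_modules from git tracking"
```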

README.md

Lines changed: 176 additions & 2 deletions
@@ -4,7 +4,7 @@
The solution implements a scalable ML inference architecture using Amazon EKS, leveraging both Graviton processors for CPU-based inference and GPU instances for accelerated inference. The system utilizes Ray Serve for model serving, deployed as containerized workloads within a Kubernetes environment.

## Architecture
-![Architecture Diagram](image/Diagram.png)
+![Architecture Diagram](image/arch.png)

The architecture diagram illustrates our scalable ML inference solution with the following components:

@@ -24,7 +24,9 @@ The architecture diagram illustrates our scalable ML inference solution with the
6. **Function Calling Service**: Enables agentic AI capabilities by allowing models to interact with external APIs and services.

-7. **Monitoring & Observability**: Prometheus and Grafana for performance monitoring and visualization.
+7. **MCP (Model Context Protocol)**: Provides augmented LLM capabilities by combining tool usage with Retrieval Augmented Generation (RAG) for enhanced context awareness.
+8. **Monitoring & Observability**: Prometheus and Grafana for performance monitoring and visualization.

This architecture provides flexibility to choose between cost-optimized CPU inference on Graviton processors or high-throughput GPU inference based on your specific requirements, all while maintaining elastic scalability through Kubernetes and Karpenter.

@@ -199,6 +201,178 @@ The service will:
3. Make the appropriate API call
4. Return the weather information in a conversational format

## Deploy LLM Gateway

The LLM Gateway serves as a unified API layer for accessing multiple model backends through a standardized OpenAI-compatible interface. This section guides you through deploying LiteLLM as a proxy service on your EKS cluster.

### Overview

LiteLLM Proxy provides:
- A unified OpenAI-compatible API for multiple model backends
- Load balancing and routing between different models
- Fallback mechanisms for high availability
- Observability and monitoring capabilities
- Authentication and rate limiting

### Deployment Steps

#### 1. Configure the LiteLLM service:
The LiteLLM service is defined in `dockerfiles/litellm/combined.yaml` and includes the following resources (a skeletal sketch follows this list):
- A ConfigMap for the LiteLLM configuration
- A Secret for API keys and database connection
- A Deployment for the LiteLLM proxy service
- A ClusterIP Service to expose the proxy internally
- An Ingress to expose the service externally via an ALB
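
A minimal sketch of how such a manifest is commonly laid out is shown below. Apart from `litellm-ingress` (which is referenced in step 5), the resource names, image tag, port, and Ingress settings are illustrative assumptions rather than the actual contents of `dockerfiles/litellm/combined.yaml`:

```yaml
# Hypothetical skeleton of the LiteLLM proxy manifest (values are illustrative)
apiVersion: v1
kind: ConfigMap
metadata:
  name: litellm-config
data:
  config.yaml: |
    model_list:
      - model_name: your-model-name
        litellm_params:
          model: openai/your-model-name
          api_base: http://your-model-endpoint/v1
          api_key: os.environ/OPENAI_API_KEY
---
apiVersion: v1
kind: Secret
metadata:
  name: litellm-secrets
type: Opaque
data:
  OPENAI_API_KEY: <base64-encoded-key>
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-proxy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: litellm-proxy
  template:
    metadata:
      labels:
        app: litellm-proxy
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-latest  # assumed upstream image
          args: ["--config", "/app/config.yaml", "--port", "4000"]
          ports:
            - containerPort: 4000
          envFrom:
            - secretRef:
                name: litellm-secrets
          volumeMounts:
            - name: config
              mountPath: /app/config.yaml
              subPath: config.yaml
      volumes:
        - name: config
          configMap:
            name: litellm-config
---
apiVersion: v1
kind: Service
metadata:
  name: litellm-service
spec:
  type: ClusterIP
  selector:
    app: litellm-proxy
  ports:
    - port: 80
      targetPort: 4000
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: litellm-ingress
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: litellm-service
                port:
                  number: 80
```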

#### 2. Customize the model configuration:
Edit the `config.yaml` section in the ConfigMap to specify your model backends:
```yaml
model_list:
  - model_name: your-model-name
    litellm_params:
      model: openai/your-model-name
      api_base: http://your-model-endpoint/v1
      api_key: os.environ/OPENAI_API_KEY
```

#### 3. Update the secrets:
Replace the base64-encoded API keys in the Secret section with your actual keys.
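
For example, a plain-text key can be base64-encoded before being pasted into the Secret; the key value below is a placeholder and the Secret name follows the sketch above:

```bash
# Encode an API key for use in the Kubernetes Secret (-n avoids a trailing newline)
echo -n 'sk-your-actual-key' | base64

# Alternatively, generate the whole Secret manifest without hand-encoding values
kubectl create secret generic litellm-secrets \
  --from-literal=OPENAI_API_KEY='sk-your-actual-key' \
  --dry-run=client -o yaml
```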

#### 4. Deploy the LiteLLM proxy:
```bash
kubectl apply -f dockerfiles/litellm/combined.yaml
```

#### 5. Access the LiteLLM proxy:
Once deployed, you can access the LiteLLM proxy through the ALB created by the Ingress:
```bash
# Get the ALB URL
kubectl get ingress litellm-ingress

# Test the API
curl -X POST https://<YOUR-ALB-URL>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-1234" \
  -d '{
    "model": "unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF",
    "messages": [{"role": "user", "content": "Hello, how are you?"}]
  }'
```

The LiteLLM proxy will route your request to the appropriate model backend based on the model name specified in the request.
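
To see which model names the proxy will accept for routing, the OpenAI-compatible model listing endpoint can be queried as a quick check (a minimal sketch; the URL and key are the same placeholders as above):

```bash
# List the models configured in the LiteLLM proxy
curl -s https://<YOUR-ALB-URL>/v1/models \
  -H "Authorization: Bearer sk-1234"
```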

## Installing Milvus Vector Database in EKS

Milvus is an open-source vector database that powers embedding similarity search and AI applications. This section guides you through deploying Milvus on your EKS cluster with Graviton processors.

### Prerequisites

- Your EKS cluster is already set up with Graviton (ARM64) nodes
- Cert-manager is installed on the cluster
- AWS EBS CSI driver is configured for persistent storage

### Deployment Steps

#### 1. Install cert-manager (if not already installed):
```bash
kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.5.3/cert-manager.yaml
kubectl get pods -n cert-manager
```

#### 2. Install Milvus Operator:
```bash
kubectl apply -f https://raw.githubusercontent.com/zilliztech/milvus-operator/main/deploy/manifests/deployment.yaml
kubectl get pods -n milvus-operator
```

#### 3. Create EBS Storage Class:
```bash
kubectl apply -f milvus/ebs-storage-class.yaml
```
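
The contents of `milvus/ebs-storage-class.yaml` are not shown here; a typical gp3 StorageClass for the EBS CSI driver looks roughly like the sketch below (name and parameters are illustrative assumptions):

```yaml
# Hypothetical EBS gp3 StorageClass for Milvus persistent volumes
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3-sc
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
```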

#### 4. Deploy Milvus in standalone mode:
```bash
kubectl apply -f milvus/milvus-standalone.yaml
```
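
The Milvus Operator consumes a `Milvus` custom resource; a minimal standalone definition, assuming the release name `my-release` used in step 6, typically looks like this (a sketch, not the repository's `milvus-standalone.yaml`):

```yaml
# Hypothetical standalone Milvus custom resource handled by milvus-operator
apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
  name: my-release
spec:
  mode: standalone
  dependencies: {}
  components: {}
  config: {}
```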

#### 5. Create Network Load Balancer Service (optional, for external access):
```bash
kubectl apply -f milvus/milvus-nlb-service.yaml
```
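
If you need external access, the NLB Service is usually a `LoadBalancer` Service carrying AWS Load Balancer Controller annotations along these lines (a sketch with assumed names and selector labels, not the repository's `milvus-nlb-service.yaml`):

```yaml
# Hypothetical NLB Service exposing the Milvus gRPC port externally
apiVersion: v1
kind: Service
metadata:
  name: milvus-nlb
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
spec:
  type: LoadBalancer
  selector:
    # Assumed labels; they must match those on the Milvus pods in your cluster
    app.kubernetes.io/instance: my-release
    app.kubernetes.io/name: milvus
  ports:
    - name: milvus
      port: 19530
      targetPort: 19530
```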

#### 6. Access Milvus:
You can access Milvus using port-forwarding:
```bash
kubectl port-forward service/my-release-milvus 19530:19530
```

Or through the Network Load Balancer if you deployed the NLB service.

## Deploying MCP (Model Context Protocol) Service

The MCP service enables augmented LLM capabilities by combining tool usage with Retrieval Augmented Generation (RAG) for enhanced context awareness. This implementation is framework-independent and does not rely on LangChain or LlamaIndex.

### Architecture

The MCP service consists of several modular components:
- **Agent**: Coordinates the workflow and manages tool usage
- **ChatOpenAI**: Handles interactions with the language model and tool calling
- **MCPClient**: Connects to MCP servers and manages tool calls
- **EmbeddingRetriever**: Creates and searches vector embeddings for relevant context
- **VectorStore**: Interfaces with Milvus for storing and retrieving embeddings

### Workflow

1. **Knowledge Embedding**
   - Documents from the `knowledge` directory are converted to vector embeddings
   - Embeddings and source documents are stored in the Milvus vector database

2. **Context Retrieval (RAG)**
   - User queries are converted to embeddings
   - The system finds relevant documents by calculating similarity between embeddings
   - The top matching documents form the context for the LLM

3. **MCP Tool Setup**
   - MCP clients connect to tool servers (e.g., for filesystem operations)
   - Tools are registered with the agent

4. **Task Execution**
   - User tasks are processed by the LLM with the retrieved context
   - The LLM may use tools via MCP clients
   - Tool results are fed back to the LLM to continue the conversation

### Deployment Steps

#### 1. Set up environment variables:
Create a `.env` file in the `mcp` directory with:
```
OPENAI_API_KEY=your_openai_api_key
OPENAI_BASE_URL=your_openai_model_inference_endpoint
EMBEDDING_BASE_URL=https://bedrock-runtime.us-west-2.amazonaws.com
EMBEDDING_KEY=not_needed_for_aws_credentials
AWS_REGION=us-west-2
MILVUS_ADDRESS=your_milvus_service_address
```

#### 2. Install dependencies:
```bash
cd mcp
pnpm install
```

#### 3. Run the application:
```bash
pnpm dev
```

#### 4. Extend the system:
This modular architecture can be extended by:
- Adding more MCP servers for additional tool capabilities
- Implementing advanced Milvus features like filtering and hybrid search
- Adding more sophisticated RAG techniques
- Implementing conversation history for multi-turn interactions
- Deploying it as a service with API endpoints

## How do we measure
Our client program generates prompts at different concurrency levels for each run. Every run assembles common GenAI-related prompts into standard HTTP requests, and the number of concurrent calls keeps increasing until CPU usage reaches nearly 100%. We capture the total time from when an HTTP request is initiated to when the HTTP response is received as the latency metric of model performance, and the number of output tokens generated per second as throughput. The test aims to drive the worker pods to maximum CPU utilization in order to assess concurrency performance.
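
As a simple illustration of the latency measurement, a single request's end-to-end time can be sampled with curl's timing output; the endpoint and model name below are placeholders rather than the actual benchmark client:

```bash
# Time one request from initiation to full response (rough per-request latency)
curl -s -o /dev/null -w "total_time: %{time_total}s\n" \
  -X POST http://<YOUR-INFERENCE-ENDPOINT>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-1234" \
  -d '{
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "Explain the benefits of Graviton for CPU inference."}]
  }'
```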

agent/kubernetes/combined.yaml

Lines changed: 4 additions & 3 deletions
@@ -24,7 +24,6 @@ spec:
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64
-       karpenter.sh/nodepool: karpenter-cpu-agent-Graviton
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:

@@ -36,12 +35,14 @@ spec:
            - arm64
      containers:
      - name: weather-function-service
-       image: 412381761882.dkr.ecr.us-west-2.amazonaws.com/function:latest
+       image: 412381761882.dkr.ecr.us-west-2.amazonaws.com/function:v5
        ports:
        - containerPort: 8000
        env:
        - name: LLM_SERVER_URL
-         value: "http://llama-cpp-cpu-lb-2137543273.us-west-2.elb.amazonaws.com/v1/chat/completions"
+         value: "http://52.11.105.97:8080/v1/chat/completions"
+       - name: LLM_MODEL
+         value: "Qwen/QwQ-32B-AWQ"
        - name: LLM_API_KEY
          valueFrom:
            secretKeyRef:
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
# Weather Function Call Service Fixes

## Issue Fixed
Fixed the error: `'NoneType' object is not subscriptable` that occurred when processing chat requests.

## Root Cause
The error was occurring because:
1. The JSON payload in the curl request was missing a closing curly brace `}`, causing invalid JSON
2. The server code wasn't properly handling errors when parsing function arguments
3. There was insufficient error handling around function responses

## Changes Made
1. Added robust error handling for function call processing
2. Added safe parsing of function arguments with better error messages
3. Implemented a fallback response when function call processing fails
4. Added better handling for None responses from weather functions
5. Improved logging for debugging function call issues

## How to Test
Use the following corrected curl command:

```bash
curl -i -X POST 'http://k8s-kuberays-weatherf-32a06e12b4-9249af50fedb7b66.elb.us-west-2.amazonaws.com/chat' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer sk-1234' \
  -d '{
    "model": "Qwen/QwQ-32B-AWQ",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful weather assistant. Use the provided functions to get weather information."
      },
      {
        "role": "user",
        "content": "what is the current weather in London"
      }
    ]
  }'
```

Note the closing curly brace at the end of the JSON payload.

dockerfiles/functioncall/Dockerfile

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@ COPY server.py weather_service.py ./
# These can be overridden when running the container
ENV LLM_SERVER_URL="http://llm-service:8080/v1/chat/completions" \
    LLM_API_KEY="sk-1234" \
+   LLM_MODEL="llama3" \
    CONNECT_TIMEOUT=10 \
    READ_TIMEOUT=300 \
    LLM_MAX_RETRIES=3

dockerfiles/functioncall/README.md

Lines changed: 4 additions & 0 deletions
@@ -15,6 +15,7 @@ The application can be configured using the following environment variables:

- `LLM_SERVER_URL`: URL of the LLM server (default: "http://llm-service:8080/v1/chat/completions")
- `LLM_API_KEY`: API key for the LLM service (default: "sk-1234")
+- `LLM_MODEL`: Model name to use for inference (default: "llama3")
- `CONNECT_TIMEOUT`: Connection timeout in seconds (default: 10)
- `READ_TIMEOUT`: Read timeout in seconds (default: 300)
- `LLM_MAX_RETRIES`: Maximum number of retries for LLM requests (default: 3)

@@ -33,6 +34,7 @@ docker build -t weather-function-service:latest .
docker run -p 8000:8000 \
  -e LLM_SERVER_URL="http://your-llm-server:8080/v1/chat/completions" \
  -e LLM_API_KEY="your-api-key" \
+ -e LLM_MODEL="llama3" \
  weather-function-service:latest
```

@@ -69,6 +71,8 @@ spec:
          secretKeyRef:
            name: llm-credentials
            key: api-key
+     - name: LLM_MODEL
+       value: "llama3"
      resources:
        requests:
          memory: "256Mi"
