The solution implements a scalable ML inference architecture using Amazon EKS, leveraging both Graviton processors for CPU-based inference and GPU instances for accelerated inference. The system utilizes Ray Serve for model serving, deployed as containerized workloads within a Kubernetes environment.
## Architecture
The architecture diagram illustrates our scalable ML inference solution with the following components:
6. **Function Calling Service**: Enables agentic AI capabilities by allowing models to interact with external APIs and services.

7. **MCP (Model Context Protocol)**: Provides augmented LLM capabilities by combining tool usage with Retrieval Augmented Generation (RAG) for enhanced context awareness.

8. **Monitoring & Observability**: Prometheus and Grafana for performance monitoring and visualization.
This architecture provides flexibility to choose between cost-optimized CPU inference on Graviton processors or high-throughput GPU inference based on your specific requirements, all while maintaining elastic scalability through Kubernetes and Karpenter.
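For example, once Karpenter has provisioned capacity you can see which node types are serving inference by listing nodes with their architecture and instance-type labels (these are standard Kubernetes/EKS labels; the exact node pools depend on your configuration):

```bash
# Distinguish Graviton (arm64) CPU nodes from GPU instances
kubectl get nodes -L kubernetes.io/arch -L node.kubernetes.io/instance-type
```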
The service will:
3. Make the appropriate API call
4. Return the weather information in a conversational format
## Deploy LLM Gateway
The LLM Gateway serves as a unified API layer for accessing multiple model backends through a standardized OpenAI-compatible interface. This section guides you through deploying LiteLLM as a proxy service on your EKS cluster.

### Overview
LiteLLM Proxy provides:

- A unified OpenAI-compatible API for multiple model backends
- Load balancing and routing between different models
- Fallback mechanisms for high availability
- Observability and monitoring capabilities
- Authentication and rate limiting
### Deployment Steps

#### 1. Configure the LiteLLM service:

The LiteLLM service is defined in `dockerfiles/litellm/combined.yaml` and includes:

- A ConfigMap for the LiteLLM configuration
- A Secret for API keys and database connection
- A Deployment for the LiteLLM proxy service
- A ClusterIP Service to expose the proxy internally
- An Ingress to expose the service externally via an ALB
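After the configuration and secrets described below have been customized, the whole stack can be applied in one step; a minimal sketch (the `app=litellm` label selector is an assumption, use whatever labels the Deployment actually sets):

```bash
# Create or update the LiteLLM ConfigMap, Secret, Deployment, Service, and Ingress
kubectl apply -f dockerfiles/litellm/combined.yaml

# Check that the proxy pods and the ALB-backed ingress come up
kubectl get pods,ingress -l app=litellm
```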
#### 2. Customize the model configuration:

Edit the `config.yaml` section in the ConfigMap to specify your model backends:
```yaml
model_list:
  - model_name: your-model-name
    litellm_params:
      model: openai/your-model-name
      api_base: http://your-model-endpoint/v1
      api_key: os.environ/OPENAI_API_KEY
```
#### 3. Update the secrets:

Replace the base64-encoded API keys in the Secret section with your actual keys.
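Kubernetes stores Secret values base64-encoded, so encode each key before placing it into the manifest, for example:

```bash
# Base64-encode an API key for the Secret manifest (the key shown is a placeholder)
echo -n 'sk-your-api-key' | base64
```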
"messages": [{"role": "user", "content": "Hello, how are you?"}]
259
+
}'
260
+
```
The LiteLLM proxy will route your request to the appropriate model backend based on the model name specified in the request.
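Because the proxy follows the OpenAI API surface, you can also list the model names it is configured to serve before wiring up clients (endpoint and key are placeholders):

```bash
# List the models registered in the proxy's model_list
curl http://<your-litellm-endpoint>/v1/models \
  -H "Authorization: Bearer <your-litellm-api-key>"
```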
## Installing Milvus Vector Database in EKS

Milvus is an open-source vector database that powers embedding similarity search and AI applications. This section guides you through deploying Milvus on your EKS cluster with Graviton processors.
### Prerequisites

- Your EKS cluster is already set up with Graviton (ARM64) nodes
- Cert-manager is installed on the cluster
- AWS EBS CSI driver is configured for persistent storage
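These prerequisites can be verified before installing Milvus (the `cert-manager` namespace shown is the chart default and may differ in your cluster):

```bash
# cert-manager pods should be Running (default namespace assumed)
kubectl get pods -n cert-manager

# The EBS CSI driver should be registered and a StorageClass available
kubectl get csidriver ebs.csi.aws.com
kubectl get storageclass
```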
### Deployment Steps
#### 1. Install cert-manager (if not already installed):
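If cert-manager is not already present, one common way to install it is through its official Helm chart (the values shown are the commonly documented defaults, not necessarily what this repository uses):

```bash
# Add the Jetstack chart repository and install cert-manager with its CRDs
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true
```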
Once Milvus is installed, you can access it through the Network Load Balancer if you deployed the NLB service.
## Deploying MCP (Model Context Protocol) Service

The MCP service enables augmented LLM capabilities by combining tool usage with Retrieval Augmented Generation (RAG) for enhanced context awareness. This implementation is framework-independent, not relying on LangChain or LlamaIndex.

### Architecture
The MCP service consists of several modular components:

- **Agent**: Coordinates workflow and manages tool usage
- **ChatOpenAI**: Handles interactions with the language model and tool calling
- **MCPClient**: Connects to MCP servers and manages tool calls
- **EmbeddingRetriever**: Creates and searches vector embeddings for relevant context
- **VectorStore**: Interfaces with Milvus for storing and retrieving embeddings
### Workflow
1. **Knowledge Embedding**
   - Documents from the `knowledge` directory are converted to vector embeddings
   - Embeddings and source documents are stored in Milvus vector database

2. **Context Retrieval (RAG)**
   - User queries are converted to embeddings
   - The system finds relevant documents by calculating similarity between embeddings
   - Top matching documents form context for the LLM

3. **MCP Tool Setup**
   - MCP clients connect to tool servers (e.g., filesystem operations)
   - Tools are registered with the agent

4. **Task Execution**
   - User tasks are processed by the LLM with retrieved context
   - The LLM may use tools via MCP clients
   - Tool results are fed back to the LLM to continue the conversation
The implementation can be extended by:

- Adding more MCP servers for additional tool capabilities
- Implementing advanced Milvus features like filtering and hybrid search
- Adding more sophisticated RAG techniques
- Implementing conversation history for multi-turn interactions
- Deploying as a service with API endpoints
## How do we measure
Our client program generates prompts at a different concurrency level for each run. Each run assembles common GenAI-related prompts into standard HTTP requests, and the number of concurrent calls keeps increasing until CPU usage on the worker pods reaches nearly 100%. We capture the total time from when an HTTP request is initiated to when the HTTP response is received as the latency metric of model performance, and the number of output tokens generated per second as the throughput metric. The test aims to drive the worker pods to maximum CPU utilization in order to assess concurrency performance.
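As a rough illustration of this approach (not the actual client program), a concurrency sweep with a generic HTTP load tool such as `hey` might look like the following; the endpoint, payload, and concurrency levels are placeholders:

```bash
# Sweep increasing concurrency levels against the inference endpoint.
# 'hey' reports the request latency distribution; token throughput is read
# from server-side metrics (e.g., Prometheus) rather than from this tool.
for c in 1 5 10 20 50 100; do
  echo "=== concurrency: $c ==="
  hey -n $((c * 20)) -c "$c" -m POST \
    -T "application/json" \
    -d '{"model": "your-model-name", "messages": [{"role": "user", "content": "Explain what a vector database is."}]}' \
    http://<your-inference-endpoint>/v1/chat/completions
done
```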