From 7a56d713d4d4169b31198866414c6826d19ce70a Mon Sep 17 00:00:00 2001 From: Phil Date: Fri, 1 Aug 2025 14:58:48 -0400 Subject: [PATCH 1/4] Add Context-Enabled Semantic Caching recipe to semantic cache folder --- .../03_context_enabled_semantic_caching.ipynb | 1512 +++++++++++++++++ 1 file changed, 1512 insertions(+) create mode 100644 python-recipes/semantic-cache/03_context_enabled_semantic_caching.ipynb diff --git a/python-recipes/semantic-cache/03_context_enabled_semantic_caching.ipynb b/python-recipes/semantic-cache/03_context_enabled_semantic_caching.ipynb new file mode 100644 index 0000000..447fc54 --- /dev/null +++ b/python-recipes/semantic-cache/03_context_enabled_semantic_caching.ipynb @@ -0,0 +1,1512 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "vrbm9EkW-kRo" + }, + "source": [ + "![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)\n", + "\n", + "# Context-Enabled Semantic Caching with Redis\n", + "\n", + "\n", + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4i9pSolc896M" + }, + "source": [ + "## What is Context-Enabled Semantic Caching?\n", + "\n", + "\n", + "Most caching systems today are **exact match**. They only return results if the query matches a key 1:1. \n", + "Ask **“What’s the weather in NYC?”**, and the system might cache and return that exact string. \n", + "But change it slightly—**“Is it raining in New York?”**—and you miss the cache completely.\n", + "\n", + "**Semantic caching** fixes that. It uses **vector embeddings** to find conceptually similar queries. \n", + "So whether a user asks “forecast for NYC,” “weather in Manhattan,” or “umbrella needed in NYC?”, they all hit the **same cached result** if the meaning aligns.\n", + "\n", + "But here’s the problem: \n", + "Even if you nail semantic similarity, **not all users want the same level of detail or format**. \n", + "With LLMs storing more history and memory on users, this is a chance to tailor responses to be fully personalized at fractions of the cost.\n", + "\n", + "That’s where **Context-Enabled Semantic Caching (CESC)** comes in.\n", + "\n", + "---\n", + "\n", + "\n", + "\n", + "### The Business Problem\n", + "\n", + "Enterprise LLM applications face three critical challenges:\n", + "- **Cost**: GPT-4o calls can cost $0.0025-0.01 per 1K tokens\n", + "- **Latency**: Cold LLM calls take 2-5 seconds, hurting user experience \n", + "- **Relevance**: Generic responses don't account for user roles, preferences, or context\n", + "\n", + "### Why It Matters\n", + "\n", + "| Challenge | Traditional Caching | Semantic Caching | CESC (Personalized) |\n", + "|----------------|-----------------------------|----------------------------------------|-------------------------------------------|\n", + "| **Match Type** | Exact string | Vector similarity | Vector + user context |\n", + "| **Relevance** | Low | Medium | High |\n", + "| **Latency** | Fast | Fast | Still fast (cached + lightweight model) |\n", + "| **Cost** | Low | Low | Low (personalization avoids full GPT-4o-mini) |\n", + "\n", + "\n", + "\n", + "---\n", + "\n", + "### Our Solution Architecture\n", + "\n", + "CESC creates a three-tier response system:\n", + "1. **Cold Start**: Fresh LLM call for new queries (expensive, slow, but comprehensive)\n", + "2. **Cache Hit**: Instant return of semantically similar cached responses (fast, cheap, generic)\n", + "3. **Personalized Cache Hit**: Lightweight model personalizes cached content using user memory (balanced speed/cost/relevance)\n", + "\n", + "Let's see this in action with a real enterprise IT support scenario.\n", + "[![](https://mermaid.ink/img/pako:eNpdkU1uwjAQha9izTpQfkyAqEJCqdQNlSBpWTRh4SYDiRTbaOKUAkLqFXrFnqROgmjVWdnz5n1-8pwh0SmCB9tCH5JMkGGLIFbM1ip6KZHYqkI6blinM2NhtMbEaGIhCkqy-ze6mwWY5uV6sWk9oZ1jSjMpTJI1nkX0uHz-_vzimvmiKFqQH4UWgyxXtplkeHX7jRhEAZqKFDOa1Qn-on-583qKcnxHNlfl4TY2vyao6uwSpaZjS_0j_9eWt4wdmaucLZFKrUSRn7DNG4ADO8pT8LaiKNEBiSRFfYdzzY3BZCgxBs8eU9yKqjAxxOpifXuhXrWW4BmqrJN0tctunGqfCoMPudiRkLcuoUqRfF0pAx7vTxsIeGf4AG867Lp8POmNXT4YuLYcOILXd6ddPhzzSd8d8Snn3L04cGqe7XUn45EDdk32y5_aZTc7v_wAqpSdUg?type=png)](https://mermaid.live/edit#pako:eNpdkU1uwjAQha9izTpQfkyAqEJCqdQNlSBpWTRh4SYDiRTbaOKUAkLqFXrFnqROgmjVWdnz5n1-8pwh0SmCB9tCH5JMkGGLIFbM1ip6KZHYqkI6blinM2NhtMbEaGIhCkqy-ze6mwWY5uV6sWk9oZ1jSjMpTJI1nkX0uHz-_vzimvmiKFqQH4UWgyxXtplkeHX7jRhEAZqKFDOa1Qn-on-583qKcnxHNlfl4TY2vyao6uwSpaZjS_0j_9eWt4wdmaucLZFKrUSRn7DNG4ADO8pT8LaiKNEBiSRFfYdzzY3BZCgxBs8eU9yKqjAxxOpifXuhXrWW4BmqrJN0tctunGqfCoMPudiRkLcuoUqRfF0pAx7vTxsIeGf4AG867Lp8POmNXT4YuLYcOILXd6ddPhzzSd8d8Snn3L04cGqe7XUn45EDdk32y5_aZTc7v_wAqpSdUg)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "v6g7eVRZAcFA" + }, + "outputs": [], + "source": [ + "# 📦 Install required Python packages\n", + "!pip install -q \"redisvl>=0.8.0\" sentence-transformers openai tiktoken python-dotenv redis" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "m04KxSuhBiOx" + }, + "outputs": [], + "source": [ + "# NBVAL_SKIP\n", + "%%sh\n", + "curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg\n", + "echo \"deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main\" | sudo tee /etc/apt/sources.list.d/redis.list\n", + "sudo apt-get update > /dev/null 2>&1\n", + "sudo apt-get install redis-stack-server > /dev/null 2>&1\n", + "redis-stack-server --daemonize yes" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xlsHkIF49Lve" + }, + "source": [ + "## Infrastructure Setup\n", + "\n", + "We're using Redis with vector search capabilities to store embeddings and enable semantic similarity matching. This simulates a production environment where your cache would be persistent across sessions.\n", + "\n", + "**Note**: In production, you'd typically use Redis Enterprise, or a managed Redis service such as Redis Cloud or Azure Managed Redis with proper clustering, persistence, and security configurations." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "we-6LpNAByt1", + "outputId": "89b7e9c1-63f9-4458-cdab-0bc98b88a09e" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import os\n", + "import redis\n", + "\n", + "# Redis connection params\n", + "REDIS_HOST = os.getenv(\"REDIS_HOST\", \"localhost\")\n", + "REDIS_PORT = os.getenv(\"REDIS_PORT\", \"6379\")\n", + "REDIS_PASSWORD = os.getenv(\"REDIS_PASSWORD\", \"\")\n", + "\n", + "# Create Redis client\n", + "redis_client = redis.Redis(\n", + " host=REDIS_HOST,\n", + " port=REDIS_PORT,\n", + " password=REDIS_PASSWORD\n", + ")\n", + "\n", + "# Test connection\n", + "redis_client.ping()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ZnqjGneBDFol" + }, + "outputs": [], + "source": [ + "import os\n", + "from google.colab import user_secret\n", + "\n", + "# 🔐 Ask user whether to use Azure OpenAI or OpenAI\n", + "use_azure = input(\"Use Azure OpenAI? (y/n): \").strip().lower() == \"y\"\n", + "\n", + "if use_azure:\n", + " print(\"🔒 Azure OpenAI selected.\")\n", + " print(\"📌 Please ensure the following secrets are added via the 🔐 Colab > Secrets menu:\")\n", + " print(\"- AZURE_OPENAI_API_KEY\")\n", + " print(\"- AZURE_OPENAI_ENDPOINT (e.g. https://your-resource.openai.azure.com)\")\n", + " print(\"- AZURE_OPENAI_API_VERSION (e.g. 2024-05-01-preview)\")\n", + " print(\"💡 Make sure 'gpt-4o' and 'gpt-4o-mini' models are deployed in your Azure Foundry.\\n\")\n", + "\n", + " os.environ[\"AZURE_OPENAI_API_KEY\"] = user_secret.get_secret(\"AZURE_OPENAI_API_KEY\")\n", + " os.environ[\"AZURE_OPENAI_ENDPOINT\"] = user_secret.get_secret(\"AZURE_OPENAI_ENDPOINT\")\n", + " os.environ[\"AZURE_OPENAI_API_VERSION\"] = user_secret.get_secret(\"AZURE_OPENAI_API_VERSION\")\n", + "\n", + " # Optional model deployment names\n", + " os.environ.setdefault(\"AZURE_OPENAI_GPT4_MODEL\", \"gpt-4o\")\n", + " os.environ.setdefault(\"AZURE_OPENAI_GPT4mini_MODEL\", \"gpt-4o-mini\")\n", + "\n", + "else:\n", + " print(\"🔒 OpenAI selected.\")\n", + " print(\"📌 Please ensure the following secret is added via the 🔐 Colab > Secrets menu:\")\n", + " print(\"- OPENAI_API_KEY\\n\")\n", + "\n", + " os.environ[\"OPENAI_API_KEY\"] = user_secret.get_secret(\"OPENAI_API_KEY\")\n", + "\n", + " # Optional model names (if using gpt-4o via OpenAI)\n", + " os.environ.setdefault(\"OPENAI_GPT4_MODEL\", \"gpt-4o\")\n", + " os.environ.setdefault(\"OPENAI_GPT4mini_MODEL\", \"gpt-4o-mini\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "XtfiyQ4TEQmN" + }, + "outputs": [], + "source": [ + "import time\n", + "import uuid\n", + "import numpy as np\n", + "from typing import List, Dict\n", + "import redis\n", + "from sentence_transformers import SentenceTransformer\n", + "from redisvl.index import SearchIndex\n", + "from redisvl.utils.vectorize import HFTextVectorizer\n", + "from openai import AzureOpenAI\n", + "import tiktoken\n", + "import pandas as pd\n", + "from openai import AzureOpenAI, OpenAI\n", + "\n", + "# Connect to Redis\n", + "redis_client = redis.Redis(host=\"localhost\", port=6379, decode_responses=True)\n", + "\n", + "# RedisVL index\n", + "index_config = {\n", + " \"index\": {\n", + " \"name\": \"cesc_index\",\n", + " \"prefix\": \"cesc\",\n", + " \"storage_type\": \"hash\"\n", + " },\n", + " \"fields\": [\n", + " {\n", + " \"name\": \"content_vector\",\n", + " \"type\": \"vector\",\n", + " \"attrs\": {\n", + " \"dims\": 384,\n", + " \"distance_metric\": \"cosine\",\n", + " \"algorithm\": \"hnsw\"\n", + " }\n", + " },\n", + " {\"name\": \"content\", \"type\": \"text\"},\n", + " {\"name\": \"user_id\", \"type\": \"tag\"}\n", + " ]\n", + "}\n", + "search_index = SearchIndex.from_dict(index_config)\n", + "search_index.connect(\"redis://localhost:6379\")\n", + "search_index.create(overwrite=True)\n", + "\n", + "if use_azure:\n", + " client = AzureOpenAI(\n", + " azure_endpoint=os.getenv(\"AZURE_OPENAI_ENDPOINT\"),\n", + " api_key=os.getenv(\"AZURE_OPENAI_API_KEY\"),\n", + " api_version=os.getenv(\"AZURE_OPENAI_API_VERSION\")\n", + " )\n", + " GPT4_MODEL = os.getenv(\"AZURE_OPENAI_GPT4_MODEL\")\n", + " GPT4mini_MODEL = os.getenv(\"AZURE_OPENAI_GPT4mini_MODEL\")\n", + "else:\n", + " client = OpenAI(\n", + " api_key=os.getenv(\"OPENAI_API_KEY\")\n", + " )\n", + " GPT4_MODEL = os.getenv(\"OPENAI_GPT4_MODEL\")\n", + " GPT4mini_MODEL = os.getenv(\"OPENAI_GPT4mini_MODEL\")\n", + "\n", + "\n", + "# Embedding model + vectorizer\n", + "embedding_model = SentenceTransformer(\"all-MiniLM-L6-v2\")\n", + "vectorizer = HFTextVectorizer(model=\"all-MiniLM-L6-v2\")\n", + "\n", + "# Token counter\n", + "class TokenCounter:\n", + " def __init__(self, model_name=\"gpt-4o\"):\n", + " try:\n", + " self.encoding = tiktoken.encoding_for_model(model_name)\n", + " except KeyError:\n", + " self.encoding = tiktoken.get_encoding(\"cl100k_base\")\n", + "\n", + " def count_tokens(self, text: str) -> int:\n", + " if not text:\n", + " return 0\n", + " return len(self.encoding.encode(text))\n", + "\n", + "token_counter = TokenCounter()\n", + "\n", + "class TelemetryLogger:\n", + " def __init__(self):\n", + " self.logs = []\n", + "\n", + " def log(self, user_id, method, latency_ms, input_tokens, output_tokens, cache_status, response_source):\n", + " model = response_source # assume model name is passed as source, e.g., \"gpt-4o\" or \"gpt-4o-mini\"\n", + " cost = self.calculate_cost(model, input_tokens, output_tokens)\n", + " self.logs.append({\n", + " \"timestamp\": time.time(),\n", + " \"user_id\": user_id,\n", + " \"method\": method,\n", + " \"latency_ms\": latency_ms,\n", + " \"input_tokens\": input_tokens,\n", + " \"output_tokens\": output_tokens,\n", + " \"total_tokens\": input_tokens + output_tokens,\n", + " \"cache_status\": cache_status,\n", + " \"response_source\": response_source,\n", + " \"cost_usd\": cost\n", + " })\n", + "\n", + " # 💵 Real cost vs baseline cold-call cost\n", + " cost = self.calculate_cost(response_source, input_tokens, output_tokens)\n", + " baseline = self.calculate_cost(\"gpt-4o\", input_tokens, output_tokens)\n", + "\n", + " self.logs[-1][\"cost_usd\"] = cost\n", + " self.logs[-1][\"baseline_cost_usd\"] = baseline\n", + "\n", + " def show_logs(self):\n", + " return pd.DataFrame(self.logs)\n", + "\n", + " def summarize(self):\n", + " df = pd.DataFrame(self.logs)\n", + " if df.empty:\n", + " print(\"No telemetry yet.\")\n", + " return\n", + "\n", + " df[\"total_tokens\"] = df[\"input_tokens\"] + df[\"output_tokens\"]\n", + "\n", + " display(df[[\n", + " \"user_id\",\n", + " \"cache_status\",\n", + " \"latency_ms\",\n", + " \"response_source\",\n", + " \"input_tokens\",\n", + " \"output_tokens\",\n", + " \"total_tokens\"\n", + " ]])\n", + "\n", + " # Compare cold start vs personalized\n", + " try:\n", + " cold_latency = df.loc[df[\"user_id\"] == \"user_cold\", \"latency_ms\"].values[0]\n", + " cx_latency = df.loc[df[\"user_id\"] == \"user_withcontext\", \"latency_ms\"].values[0]\n", + "\n", + " if cx_latency < cold_latency:\n", + " delta = cold_latency - cx_latency\n", + " pct = (delta / cold_latency) * 100\n", + " print(f\"\\n⚡ Personalized response (user_withcontext) was faster than the plain LLM by {int(delta)} ms — a {pct:.1f}% speed boost.\")\n", + " else:\n", + " delta = cx_latency - cold_latency\n", + " pct = (delta / cx_latency) * 100\n", + " print(f\"\\n⏱️ Personalized response (user_withcontext) was {int(delta)} ms slower than the plain LLM — a {pct:.1f}% slowdown.\")\n", + " print(\"📌 However, it returned a tailored response based on user memory, offering higher relevance.\")\n", + " except Exception as e:\n", + " print(\"\\n⚠️ Could not compute latency comparison:\", e)\n", + "\n", + " def calculate_cost(self, model: str, input_tokens: int, output_tokens: int) -> float:\n", + " # Azure OpenAI pricing (per 1K tokens)\n", + " pricing = {\n", + " \"gpt-4o\": {\"input\": 0.005, \"output\": 0.015},\n", + " \"gpt-4o-mini\": {\"input\": 0.0015, \"output\": 0.003}\n", + " }\n", + "\n", + " if model not in pricing:\n", + " return 0.0\n", + "\n", + " input_cost = (input_tokens / 1000) * pricing[model][\"input\"]\n", + " output_cost = (output_tokens / 1000) * pricing[model][\"output\"]\n", + " return round(input_cost + output_cost, 6)\n", + "\n", + " def display_cost_summary(self):\n", + " df = self.show_logs()\n", + " if df.empty:\n", + " print(\"No telemetry logged yet.\")\n", + " return\n", + "\n", + " # Calculate savings per row\n", + " df[\"savings_usd\"] = df[\"baseline_cost_usd\"] - df[\"cost_usd\"]\n", + "\n", + " total_cost = df[\"cost_usd\"].sum()\n", + " baseline_cost = df[\"baseline_cost_usd\"].sum()\n", + " total_savings = df[\"savings_usd\"].sum()\n", + " savings_pct = (total_savings / baseline_cost * 100) if baseline_cost > 0 else 0\n", + "\n", + " # Display summary table\n", + " display(df[[\n", + " \"user_id\", \"cache_status\", \"response_source\",\n", + " \"input_tokens\", \"output_tokens\", \"latency_ms\",\n", + " \"cost_usd\", \"baseline_cost_usd\", \"savings_usd\"\n", + " ]])\n", + "\n", + " # 💸 Compare cost of plain LLM vs personalized\n", + " try:\n", + " cost_plain = df.loc[df[\"user_id\"] == \"user_cold\", \"cost_usd\"].values[0]\n", + " cost_personalized = df.loc[df[\"user_id\"] == \"user_withcontext\", \"cost_usd\"].values[0]\n", + "\n", + " print(f\"\\n🧾 Total Cost of Plain LLM Response: ${cost_plain:.4f}\")\n", + " print(f\"🧾 Total Cost of Personalized Response: ${cost_personalized:.4f}\")\n", + "\n", + " if cost_personalized < cost_plain:\n", + " delta = cost_plain - cost_personalized\n", + " pct = (delta / cost_plain) * 100\n", + " print(f\"\\n💡 Personalized response (user_withcontext) was cheaper than plain LLM by ${delta:.4f} — a {pct:.1f}% cost improvement.\")\n", + " else:\n", + " delta = cost_personalized - cost_plain\n", + " pct = (delta / cost_personalized) * 100\n", + " print(f\"\\n⏱️ Personalized response (user_withcontext) was ${delta:.4f} more expensive than plain LLM — a {pct:.1f}% cost increase.\")\n", + " print(\"📌 However, it returned a tailored response based on user memory, offering higher relevance.\")\n", + " except Exception as e:\n", + " print(\"\\n⚠️ Could not compute cost comparison:\", e)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "i3LSCGr3E1t8" + }, + "outputs": [], + "source": [ + "class AzureLLMClient:\n", + " def __init__(self, client, token_counter, gpt4_model=\"gpt-4o\", gpt4mini_model=\"gpt-4o-mini\"):\n", + " self.client = client\n", + " self.token_counter = token_counter\n", + " self.gpt4_model = gpt4_model\n", + " self.gpt4mini_model = gpt4mini_model\n", + "\n", + " def call_llm(self, prompt: str, model: str = \"gpt-4o\") -> Dict:\n", + " \"\"\"Call Azure OpenAI model and track latency, token usage, and cost\"\"\"\n", + " start_time = time.time()\n", + " response = self.client.chat.completions.create(\n", + " model=model,\n", + " messages=[{\"role\": \"user\", \"content\": prompt}],\n", + " temperature=0.7,\n", + " max_tokens=200\n", + " )\n", + " latency = (time.time() - start_time) * 1000\n", + "\n", + " output = response.choices[0].message.content\n", + " input_tokens = self.token_counter.count_tokens(prompt)\n", + " output_tokens = self.token_counter.count_tokens(output)\n", + "\n", + " return {\n", + " \"response\": output,\n", + " \"latency_ms\": round(latency, 2),\n", + " \"input_tokens\": input_tokens,\n", + " \"output_tokens\": output_tokens,\n", + " \"model\": model\n", + " }\n", + "\n", + " def call_gpt4(self, prompt: str) -> Dict:\n", + " return self.call_llm(prompt, model=self.gpt4_model)\n", + "\n", + " def call_gpt4mini(self, prompt: str) -> Dict:\n", + " return self.call_llm(prompt, model=self.gpt4mini_model)\n", + "\n", + " def personalize_response(self, cached_response: str, user_context: Dict, original_prompt: str) -> Dict:\n", + " context_prompt = self._build_context_prompt(cached_response, user_context, original_prompt)\n", + " start_time = time.time()\n", + " response = self.client.chat.completions.create(\n", + " model=self.gpt4mini_model,\n", + " messages=[\n", + " {\"role\": \"system\", \"content\": context_prompt},\n", + " {\"role\": \"user\", \"content\": \"Please personalize this cached response for the user. Keep your response under 3 sentences.\"}\n", + " ]\n", + " )\n", + " latency = (time.time() - start_time) * 1000 # ms\n", + " reply = response.choices[0].message.content\n", + "\n", + " input_tokens = response.usage.prompt_tokens\n", + " output_tokens = response.usage.completion_tokens\n", + " total_tokens = response.usage.total_tokens\n", + "\n", + " return {\n", + " \"response\": reply,\n", + " \"latency_ms\": round(latency, 2),\n", + " \"input_tokens\": input_tokens,\n", + " \"output_tokens\": output_tokens,\n", + " \"tokens\": total_tokens,\n", + " \"model\": self.gpt4mini_model\n", + " }\n", + "\n", + " def _build_context_prompt(self, cached_response: str, user_context: Dict, prompt: str) -> str:\n", + " context_parts = []\n", + " if user_context.get(\"preferences\"):\n", + " context_parts.append(\"User preferences: \" + \", \".join(user_context[\"preferences\"]))\n", + " if user_context.get(\"goals\"):\n", + " context_parts.append(\"User goals: \" + \", \".join(user_context[\"goals\"]))\n", + " if user_context.get(\"history\"):\n", + " context_parts.append(\"User history: \" + \", \".join(user_context[\"history\"]))\n", + " context_blob = \"\\n\".join(context_parts)\n", + " return f\"\"\"You are a personalization assistant. A cached response was previously generated for the prompt: \"{prompt}\".\n", + "\n", + "Here is the cached response:\n", + "\\\"\\\"\\\"{cached_response}\\\"\\\"\\\"\n", + "\n", + "Use the user's context below to personalize and refine the response:\n", + "{context_blob}\n", + "\n", + "Respond in a way that feels tailored to this user, adjusting tone, content, or suggestions as needed. Keep your response under 3 sentences no matter what.\n", + "\"\"\"\n", + "\n", + "\n", + " def query(self, prompt: str, user_id: str) -> str:\n", + " start = time.time()\n", + " embedding = self.generate_embedding(prompt)\n", + "\n", + " # Check for cached match\n", + " cached = self.search_cache(embedding)\n", + "\n", + " if cached:\n", + " # Personalize with user context using lightweight model\n", + " context = self.user_context.get(user_id, {})\n", + " if context:\n", + " injected_prompt = self._build_context_prompt(cached, context, prompt)\n", + " result = self.llm_client.call_gpt4mini(injected_prompt)\n", + " self.telemetry.log(\n", + " user_id=user_id,\n", + " method=\"context_query\",\n", + " latency_ms=result[\"latency_ms\"],\n", + " input_tokens=result[\"input_tokens\"],\n", + " output_tokens=result[\"output_tokens\"],\n", + " cache_status=\"miss\",\n", + " response_source=result[\"model\"]\n", + " )\n", + " return result[\"response\"]\n", + " else:\n", + " # Return raw cached result\n", + " latency = (time.time() - start) * 1000\n", + " self.telemetry.log(\n", + " user_id=user_id,\n", + " method=\"raw_cache_hit\",\n", + " latency_ms=latency,\n", + " input_tokens=0,\n", + " output_tokens=0,\n", + " cache_status=\"cache_hit_raw\",\n", + " response_source=\"none\"\n", + " )\n", + " return cached\n", + " else:\n", + " # Cold start with GPT-4o\n", + " result = self.llm_client.call_gpt4(prompt)\n", + " self.store_response(prompt, result[\"response\"], embedding, user_id)\n", + " self.telemetry.log(\n", + " user_id=user_id,\n", + " method=\"context_query\",\n", + " latency_ms=result[\"latency_ms\"],\n", + " input_tokens=result[\"input_tokens\"],\n", + " output_tokens=result[\"output_tokens\"],\n", + " cache_status=\"miss\",\n", + " response_source=result[\"model\"]\n", + " )\n", + " return result[\"response\"]\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "6APF2GQaE3fm" + }, + "outputs": [], + "source": [ + "from redisvl.query import VectorQuery\n", + "\n", + "class ContextEnabledSemanticCache:\n", + " def __init__(self, redis_index, vectorizer, llm_client: AzureLLMClient, telemetry: TelemetryLogger):\n", + " self.index = redis_index\n", + " self.vectorizer = vectorizer\n", + " self.llm = llm_client\n", + " self.telemetry = telemetry\n", + " self.user_memories: Dict[str, Dict] = {}\n", + "\n", + " def add_user_memory(self, user_id: str, memory_type: str, content: str):\n", + " if user_id not in self.user_memories:\n", + " self.user_memories[user_id] = {\"preferences\": [], \"history\": [], \"goals\": []}\n", + " self.user_memories[user_id][memory_type].append(content)\n", + "\n", + " def get_user_memory(self, user_id: str) -> Dict:\n", + " return self.user_memories.get(user_id, {})\n", + "\n", + " def generate_embedding(self, text: str) -> List[float]:\n", + " return self.vectorizer.embed(text)\n", + "\n", + "\n", + " def search_cache(self, embedding: List[float], threshold=0.85):\n", + " query = VectorQuery(\n", + " vector=embedding,\n", + " vector_field_name=\"content_vector\",\n", + " return_fields=[\"content\", \"user_id\"],\n", + " num_results=1,\n", + " return_score=True\n", + " )\n", + " results = self.index.query(query)\n", + "\n", + " if results:\n", + " first = results[0]\n", + " score = first.get(\"score\", None) or first.get(\"_score\", None) # fallback pattern\n", + " if score is None or score >= threshold:\n", + " return first[\"content\"]\n", + "\n", + " return None\n", + "\n", + " def store_response(self, prompt: str, response: str, embedding: List[float], user_id: str):\n", + " from redisvl.schema import IndexSchema # ensure schema imported\n", + "\n", + " # Convert embedding to bytes (float32)\n", + " import numpy as np\n", + " vec_bytes = np.array(embedding, dtype=np.float32).tobytes()\n", + "\n", + " doc = {\n", + " \"content\": response,\n", + " \"content_vector\": vec_bytes,\n", + " \"user_id\": user_id\n", + " }\n", + " self.index.load([doc]) # load does the insertion/upsert\n", + "\n", + " def query(self, prompt: str, user_id: str):\n", + " embedding = self.generate_embedding(prompt)\n", + " cached_response = self.search_cache(embedding)\n", + "\n", + " if cached_response:\n", + " user_context = self.get_user_memory(user_id)\n", + " if user_context:\n", + " result = self.llm.personalize_response(cached_response, user_context, prompt)\n", + " self.telemetry.log(\n", + " user_id=user_id,\n", + " method=\"context_query\",\n", + " latency_ms=result[\"latency_ms\"],\n", + " input_tokens=result[\"input_tokens\"],\n", + " output_tokens=result[\"output_tokens\"],\n", + " cache_status=\"hit_personalized\",\n", + " response_source=result[\"model\"]\n", + " )\n", + " return result[\"response\"]\n", + " else:\n", + " # You can choose to skip telemetry logging for raw hits or log a minimal version\n", + " self.telemetry.log(\n", + " user_id=user_id,\n", + " method=\"context_query\",\n", + " latency_ms=0,\n", + " input_tokens=0,\n", + " output_tokens=0,\n", + " cache_status=\"hit_raw\",\n", + " response_source=\"cache\"\n", + " )\n", + " return cached_response\n", + "\n", + " else:\n", + " result = self.llm.call_llm(prompt)\n", + " self.store_response(prompt, result[\"response\"], embedding, user_id)\n", + " self.telemetry.log(\n", + " user_id=user_id,\n", + " method=\"context_query\",\n", + " latency_ms=result[\"latency_ms\"],\n", + " input_tokens=result[\"input_tokens\"],\n", + " output_tokens=result[\"output_tokens\"],\n", + " cache_status=\"miss\",\n", + " response_source=result[\"model\"]\n", + " )\n", + " return result[\"response\"]\n", + "\n", + "telemetry_logger = TelemetryLogger()\n", + "# ✅ Initialize engine\n", + "cesc = ContextEnabledSemanticCache(\n", + " redis_index=search_index,\n", + " vectorizer=vectorizer,\n", + " llm_client=AzureLLMClient(client, token_counter, GPT4_MODEL, GPT4mini_MODEL),\n", + " telemetry=telemetry_logger\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RgmW_S6s9Sy_" + }, + "source": [ + "## Scenario Setup: IT Support Dashboard Access\n", + "\n", + "We'll simulate three different approaches to handling the same IT support query:\n", + "- **User A (Cold)**: No cache, fresh LLM call every time\n", + "- **User B (No Context)**: Cache hit, but generic response \n", + "- **User C (With Context)**: Cache hit + personalization based on user memory\n", + "\n", + "The query: *A user in the finance department can't access the dashboard — what should I check?*\n", + "\n", + "### User Context Profile\n", + "User C represents an experienced IT support agent who:\n", + "- Specializes in finance department issues\n", + "- Has solved similar dashboard access problems before\n", + "- Uses specific tools and follows established troubleshooting patterns\n", + "- Needs responses tailored to their expertise level and current context" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "zji4u12fgQZg", + "outputId": "cfc5cc09-381c-4d6e-8c43-0dcd98760edd" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "============================================================\n", + "🧊 Scenario 1: Plain LLM – cache miss\n", + "============================================================\n", + "\n", + "First, verify the user's permissions and access rights to the dashboard in the system settings. Ensure they are assigned the correct role or group. Next, check for any connectivity issues, browser compatibility, or recent changes to the dashboard configuration that might affect access. \n", + "\n", + "\n", + "============================================================\n", + "📦 Scenario 2: Semantic Cache Hit – generic, no user memory\n", + "============================================================\n", + "\n", + "First, verify the user's permissions and access rights to the dashboard in the system settings. Ensure they are assigned the correct role or group. Next, check for any connectivity issues, browser compatibility, or recent changes to the dashboard configuration that might affect access. \n", + "\n", + "\n", + "============================================================\n", + "🧠 Scenario 3: Context-Enabled Semantic Cache Hit – personalized with user memory\n", + "============================================================\n", + "\n", + "First, check the user's permissions to ensure they have the 'finance_dashboard_viewer' role correctly assigned in the system settings. Since you’re using Chrome on macOS, confirm there are no browser compatibility issues and that your SSO is functioning properly. Lastly, review any recent configuration changes that might impact access to the dashboard. \n", + "\n" + ] + } + ], + "source": [ + "# 🔁 Reset Redis index and telemetry (optional for rerun clarity)\n", + "search_index.delete() # DANGER: removes all vectors\n", + "search_index.create(overwrite=True)\n", + "telemetry_logger.logs = []\n", + "\n", + "def print_divider(title: str = \"\", width: int = 60):\n", + " line = \"=\" * width\n", + " if title:\n", + " print(f\"\\n{line}\\n{title}\\n{line}\\n\")\n", + " else:\n", + " print(f\"\\n{line}\\n\")\n", + "\n", + "\n", + "# 🧪 Define demo prompt and users\n", + "prompt = \"A user in the finance department can't access the dashboard — what should I check? Answer in 2-3 sentences max.\"\n", + "users = {\n", + " \"cold\": \"user_cold\",\n", + " \"nocx\": \"user_nocontext\",\n", + " \"cx\": \"user_withcontext\"\n", + "}\n", + "\n", + "# 🧠 Add memory for personalized user (e.g., HR IT support agent)\n", + "cesc.add_user_memory(users[\"cx\"], \"preferences\", \"uses Chrome browser on macOS\")\n", + "cesc.add_user_memory(users[\"cx\"], \"goals\", \"resolve access issues efficiently for finance team users\")\n", + "cesc.add_user_memory(users[\"cx\"], \"history\", \"frequently resolves issues with 'finance_dashboard_viewer' role misconfigurations\")\n", + "cesc.add_user_memory(users[\"cx\"], \"history\", \"troubleshot recent problems with finance dashboard access and SSO\")\n", + "\n", + "# 🔍 Run prompt for each scenario\n", + "print_divider(\"🧊 Scenario 1: Plain LLM – cache miss\")\n", + "response_1 = cesc.query(prompt, user_id=users[\"cold\"])\n", + "print(response_1, \"\\n\")\n", + "\n", + "print_divider(\"📦 Scenario 2: Semantic Cache Hit – generic, extremely fast, no user memory\")\n", + "response_2 = cesc.query(prompt, user_id=users[\"nocx\"])\n", + "print(response_2, \"\\n\")\n", + "\n", + "print_divider(\"🧠 Scenario 3: Context-Enabled Semantic Cache Hit – personalized with user memory\")\n", + "response_3 = cesc.query(prompt, user_id=users[\"cx\"])\n", + "print(response_3, \"\\n\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gJ-fUMmY9X4V" + }, + "source": [ + "## Key Observations\n", + "\n", + "Notice the different response patterns:\n", + "\n", + "1. **Cold Start Response**: Comprehensive but generic, took longest time and highest cost\n", + "2. **Cache Hit Response**: Identical to cold start, near-instant retrieval, minimal cost\n", + "3. **Personalized Response**: Adapted for user's specific role, tools, and experience level\n", + "\n", + "The personalized response demonstrates how CESC can:\n", + "- Reference user's specific browser/OS (Chrome on macOS)\n", + "- Mention role-specific permissions (finance_dashboard_viewer role)\n", + "- Reference past experience (SSO troubleshooting history)\n", + "- Maintain professional tone appropriate for experienced IT staff" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 600 + }, + "id": "zJdBei1UkQHO", + "outputId": "6df548bd-ec88-41b7-bf61-295e57d0cfbb" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "============================================================\n", + "📈 Telemetry Summary:\n", + "============================================================\n", + "\n" + ] + }, + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"telemetry_logger\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"user_id\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"user_cold\",\n \"user_nocontext\",\n \"user_withcontext\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"cache_status\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"miss\",\n \"hit_raw\",\n \"hit_personalized\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"latency_ms\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 651.6840342016469,\n \"min\": 0.0,\n \"max\": 1283.51,\n \"num_unique_values\": 3,\n \"samples\": [\n 1283.51,\n 0.0,\n 838.04\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"response_source\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"gpt-4o\",\n \"cache\",\n \"gpt-4o-mini\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"input_tokens\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 122,\n \"min\": 0,\n \"max\": 224,\n \"num_unique_values\": 3,\n \"samples\": [\n 25,\n 0,\n 224\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"output_tokens\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 34,\n \"min\": 0,\n \"max\": 66,\n \"num_unique_values\": 3,\n \"samples\": [\n 50,\n 0,\n 66\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"total_tokens\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 150,\n \"min\": 0,\n \"max\": 290,\n \"num_unique_values\": 3,\n \"samples\": [\n 75,\n 0,\n 290\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe" + }, + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
user_idcache_statuslatency_msresponse_sourceinput_tokensoutput_tokenstotal_tokens
0user_coldmiss1283.51gpt-4o255075
1user_nocontexthit_raw0.00cache000
2user_withcontexthit_personalized838.04gpt-4o-mini22466290
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "text/plain": [ + " user_id cache_status latency_ms response_source \\\n", + "0 user_cold miss 1283.51 gpt-4o \n", + "1 user_nocontext hit_raw 0.00 cache \n", + "2 user_withcontext hit_personalized 838.04 gpt-4o-mini \n", + "\n", + " input_tokens output_tokens total_tokens \n", + "0 25 50 75 \n", + "1 0 0 0 \n", + "2 224 66 290 " + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "⚡ Personalized response (user_withcontext) was faster than the plain LLM by 445 ms — a 34.7% speed boost.\n", + "None \n", + "\n", + "\n", + "============================================================\n", + "💸 Cost Breakdown:\n", + "============================================================\n", + "\n" + ] + }, + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"telemetry_logger\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"user_id\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"user_cold\",\n \"user_nocontext\",\n \"user_withcontext\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"cache_status\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"miss\",\n \"hit_raw\",\n \"hit_personalized\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"response_source\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"gpt-4o\",\n \"cache\",\n \"gpt-4o-mini\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"input_tokens\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 122,\n \"min\": 0,\n \"max\": 224,\n \"num_unique_values\": 3,\n \"samples\": [\n 25,\n 0,\n 224\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"output_tokens\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 34,\n \"min\": 0,\n \"max\": 66,\n \"num_unique_values\": 3,\n \"samples\": [\n 50,\n 0,\n 66\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"latency_ms\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 651.6840342016469,\n \"min\": 0.0,\n \"max\": 1283.51,\n \"num_unique_values\": 3,\n \"samples\": [\n 1283.51,\n 0.0,\n 838.04\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"cost_usd\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.0004410332564935816,\n \"min\": 0.0,\n \"max\": 0.000875,\n \"num_unique_values\": 3,\n \"samples\": [\n 0.000875,\n 0.0,\n 0.000534\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"baseline_cost_usd\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.0010601061267627877,\n \"min\": 0.0,\n \"max\": 0.00211,\n \"num_unique_values\": 3,\n \"samples\": [\n 0.000875,\n 0.0,\n 0.00211\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"savings_usd\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.0009099040242428502,\n \"min\": 0.0,\n \"max\": 0.001576,\n \"num_unique_values\": 2,\n \"samples\": [\n 0.001576,\n 0.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe" + }, + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
user_idcache_statusresponse_sourceinput_tokensoutput_tokenslatency_mscost_usdbaseline_cost_usdsavings_usd
0user_coldmissgpt-4o25501283.510.0008750.0008750.000000
1user_nocontexthit_rawcache000.000.0000000.0000000.000000
2user_withcontexthit_personalizedgpt-4o-mini22466838.040.0005340.0021100.001576
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "text/plain": [ + " user_id cache_status response_source input_tokens \\\n", + "0 user_cold miss gpt-4o 25 \n", + "1 user_nocontext hit_raw cache 0 \n", + "2 user_withcontext hit_personalized gpt-4o-mini 224 \n", + "\n", + " output_tokens latency_ms cost_usd baseline_cost_usd savings_usd \n", + "0 50 1283.51 0.000875 0.000875 0.000000 \n", + "1 0 0.00 0.000000 0.000000 0.000000 \n", + "2 66 838.04 0.000534 0.002110 0.001576 " + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "🧾 Total Cost of Plain LLM Response: $0.0009\n", + "🧾 Total Cost of Personalized Response: $0.0005\n", + "\n", + "💡 Personalized response (user_withcontext) was cheaper than plain LLM by $0.0003 — a 39.0% cost improvement.\n" + ] + } + ], + "source": [ + "# 📊 Show telemetry summary\n", + "print_divider(\"📈 Telemetry Summary:\")\n", + "print(telemetry_logger.summarize(), \"\\n\")\n", + "\n", + "print_divider(\"💸 Cost Breakdown:\")\n", + "telemetry_logger.display_cost_summary()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "natd_dr29bkH" + }, + "source": [ + "# Enterprise Significance & Large-Scale Impact\n", + "\n", + "## Production Metrics That Matter\n", + "\n", + "The results above demonstrate significant improvements across three critical enterprise metrics:\n", + "\n", + "### 💰 Cost Optimization\n", + "- **Immediate Savings**: 60-80% cost reduction on repeated queries\n", + "- **Scale Impact**: For enterprises processing 100K+ LLM queries daily, this translates to $1000s in monthly savings\n", + "- **Strategic Model Usage**: Expensive models (GPT-4o) for new content, efficient models (GPT-4o-mini) for personalization\n", + "\n", + "### ⚡ Performance Enhancement \n", + "- **Latency Reduction**: Cache hits respond in <100ms vs 2-5 seconds for cold calls\n", + "- **User Experience**: Sub-second responses feel instantaneous to end users\n", + "- **Scalability**: Redis can handle millions of vector operations per second\n", + "\n", + "### 🎯 Relevance & Personalization\n", + "- **Context Awareness**: Responses adapt to user roles, departments, and experience levels\n", + "- **Continuous Learning**: User memory grows with each interaction\n", + "- **Business Intelligence**: System learns organizational patterns and common solutions\n", + "\n", + "## ROI Calculations for Enterprise Deployment\n", + "\n", + "### Quantifiable Benefits\n", + "- **Cost Savings**: 60-80% reduction in LLM API costs\n", + "- **Productivity Gains**: 2-3x faster response times improve user productivity \n", + "- **Quality Improvement**: Consistent, personalized responses reduce error rates\n", + "- **Scalability**: Linear cost scaling vs exponential growth with pure LLM approaches\n", + "\n", + "### Investment Considerations\n", + "- **Infrastructure**: Redis Enterprise, vector compute resources\n", + "- **Development**: Initial implementation, integration with existing systems\n", + "- **Maintenance**: Ongoing optimization, user memory management\n", + "- **Training**: Staff education on new capabilities and best practices\n", + "\n", + "### Break-Even Analysis\n", + "For most enterprise deployments:\n", + "- **Break-even**: 3-6 months with >10K daily LLM queries\n", + "- **Positive ROI**: 200-400% in first year through combined cost savings and productivity gains\n", + "- **Compound Benefits**: Value increases as user memory and cache coverage grow\n", + "\n", + "The combination of semantic caching with user context represents a fundamental shift from generic AI responses to truly personalized, enterprise-aware intelligence that scales efficiently and cost-effectively." + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} From 0c48d8d3c4d87b2a245a2ce4c0a3d0a81844484c Mon Sep 17 00:00:00 2001 From: Phil Date: Thu, 7 Aug 2025 14:54:31 -0400 Subject: [PATCH 2/4] fixed the google import syntax --- .../03_context_enabled_semantic_caching.ipynb | 89 ++++++++++++++----- 1 file changed, 68 insertions(+), 21 deletions(-) diff --git a/python-recipes/semantic-cache/03_context_enabled_semantic_caching.ipynb b/python-recipes/semantic-cache/03_context_enabled_semantic_caching.ipynb index 447fc54..63b50cf 100644 --- a/python-recipes/semantic-cache/03_context_enabled_semantic_caching.ipynb +++ b/python-recipes/semantic-cache/03_context_enabled_semantic_caching.ipynb @@ -73,23 +73,32 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "metadata": { "id": "v6g7eVRZAcFA" }, "outputs": [], "source": [ "# 📦 Install required Python packages\n", - "!pip install -q \"redisvl>=0.8.0\" sentence-transformers openai tiktoken python-dotenv redis" + "!pip install -q \"redisvl>=0.8.0\" sentence-transformers openai tiktoken python-dotenv redis google pandas" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 2, "metadata": { "id": "m04KxSuhBiOx" }, - "outputs": [], + "outputs": [ + { + "ename": "SyntaxError", + "evalue": "invalid syntax (2741142086.py, line 3)", + "output_type": "error", + "traceback": [ + " \u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[2]\u001b[39m\u001b[32m, line 3\u001b[39m\n\u001b[31m \u001b[39m\u001b[31mcurl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg\u001b[39m\n ^\n\u001b[31mSyntaxError\u001b[39m\u001b[31m:\u001b[39m invalid syntax\n" + ] + } + ], "source": [ "# NBVAL_SKIP\n", "%%sh\n", @@ -115,7 +124,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -125,14 +134,30 @@ }, "outputs": [ { - "data": { - "text/plain": [ - "True" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" + "ename": "ConnectionError", + "evalue": "Error 10061 connecting to localhost:6379. No connection could be made because the target machine actively refused it.", + "output_type": "error", + "traceback": [ + "\u001b[31m---------------------------------------------------------------------------\u001b[39m", + "\u001b[31mConnectionRefusedError\u001b[39m Traceback (most recent call last)", + "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\redis\\connection.py:389\u001b[39m, in \u001b[36mAbstractConnection.connect_check_health\u001b[39m\u001b[34m(self, check_health, retry_socket_connect)\u001b[39m\n\u001b[32m 388\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m retry_socket_connect:\n\u001b[32m--> \u001b[39m\u001b[32m389\u001b[39m sock = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mretry\u001b[49m\u001b[43m.\u001b[49m\u001b[43mcall_with_retry\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 390\u001b[39m \u001b[43m \u001b[49m\u001b[38;5;28;43;01mlambda\u001b[39;49;00m\u001b[43m:\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_connect\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mlambda\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43merror\u001b[49m\u001b[43m:\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mdisconnect\u001b[49m\u001b[43m(\u001b[49m\u001b[43merror\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 391\u001b[39m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 392\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n", + "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\redis\\retry.py:105\u001b[39m, in \u001b[36mRetry.call_with_retry\u001b[39m\u001b[34m(self, do, fail)\u001b[39m\n\u001b[32m 104\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m105\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mdo\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 106\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;28mself\u001b[39m._supported_errors \u001b[38;5;28;01mas\u001b[39;00m error:\n", + "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\redis\\connection.py:390\u001b[39m, in \u001b[36mAbstractConnection.connect_check_health..\u001b[39m\u001b[34m()\u001b[39m\n\u001b[32m 388\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m retry_socket_connect:\n\u001b[32m 389\u001b[39m sock = \u001b[38;5;28mself\u001b[39m.retry.call_with_retry(\n\u001b[32m--> \u001b[39m\u001b[32m390\u001b[39m \u001b[38;5;28;01mlambda\u001b[39;00m: \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_connect\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m, \u001b[38;5;28;01mlambda\u001b[39;00m error: \u001b[38;5;28mself\u001b[39m.disconnect(error)\n\u001b[32m 391\u001b[39m )\n\u001b[32m 392\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n", + "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\redis\\connection.py:803\u001b[39m, in \u001b[36mConnection._connect\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 802\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m err \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m803\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m err\n\u001b[32m 804\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mOSError\u001b[39;00m(\u001b[33m\"\u001b[39m\u001b[33msocket.getaddrinfo returned an empty list\u001b[39m\u001b[33m\"\u001b[39m)\n", + "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\redis\\connection.py:787\u001b[39m, in \u001b[36mConnection._connect\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 786\u001b[39m \u001b[38;5;66;03m# connect\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m787\u001b[39m \u001b[43msock\u001b[49m\u001b[43m.\u001b[49m\u001b[43mconnect\u001b[49m\u001b[43m(\u001b[49m\u001b[43msocket_address\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 789\u001b[39m \u001b[38;5;66;03m# set the socket_timeout now that we're connected\u001b[39;00m\n", + "\u001b[31mConnectionRefusedError\u001b[39m: [WinError 10061] No connection could be made because the target machine actively refused it", + "\nDuring handling of the above exception, another exception occurred:\n", + "\u001b[31mConnectionError\u001b[39m Traceback (most recent call last)", + "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[3]\u001b[39m\u001b[32m, line 17\u001b[39m\n\u001b[32m 10\u001b[39m redis_client = redis.Redis(\n\u001b[32m 11\u001b[39m host=REDIS_HOST,\n\u001b[32m 12\u001b[39m port=REDIS_PORT,\n\u001b[32m 13\u001b[39m password=REDIS_PASSWORD\n\u001b[32m 14\u001b[39m )\n\u001b[32m 16\u001b[39m \u001b[38;5;66;03m# Test connection\u001b[39;00m\n\u001b[32m---> \u001b[39m\u001b[32m17\u001b[39m \u001b[43mredis_client\u001b[49m\u001b[43m.\u001b[49m\u001b[43mping\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n", + "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\redis\\commands\\core.py:1219\u001b[39m, in \u001b[36mManagementCommands.ping\u001b[39m\u001b[34m(self, **kwargs)\u001b[39m\n\u001b[32m 1213\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mping\u001b[39m(\u001b[38;5;28mself\u001b[39m, **kwargs) -> ResponseT:\n\u001b[32m 1214\u001b[39m \u001b[38;5;250m \u001b[39m\u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m 1215\u001b[39m \u001b[33;03m Ping the Redis server\u001b[39;00m\n\u001b[32m 1216\u001b[39m \n\u001b[32m 1217\u001b[39m \u001b[33;03m For more information see https://redis.io/commands/ping\u001b[39;00m\n\u001b[32m 1218\u001b[39m \u001b[33;03m \"\"\"\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m1219\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mexecute_command\u001b[49m\u001b[43m(\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mPING\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n", + "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\redis\\client.py:621\u001b[39m, in \u001b[36mRedis.execute_command\u001b[39m\u001b[34m(self, *args, **options)\u001b[39m\n\u001b[32m 620\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mexecute_command\u001b[39m(\u001b[38;5;28mself\u001b[39m, *args, **options):\n\u001b[32m--> \u001b[39m\u001b[32m621\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_execute_command\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43moptions\u001b[49m\u001b[43m)\u001b[49m\n", + "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\redis\\client.py:627\u001b[39m, in \u001b[36mRedis._execute_command\u001b[39m\u001b[34m(self, *args, **options)\u001b[39m\n\u001b[32m 625\u001b[39m pool = \u001b[38;5;28mself\u001b[39m.connection_pool\n\u001b[32m 626\u001b[39m command_name = args[\u001b[32m0\u001b[39m]\n\u001b[32m--> \u001b[39m\u001b[32m627\u001b[39m conn = \u001b[38;5;28mself\u001b[39m.connection \u001b[38;5;129;01mor\u001b[39;00m \u001b[43mpool\u001b[49m\u001b[43m.\u001b[49m\u001b[43mget_connection\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 629\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m._single_connection_client:\n\u001b[32m 630\u001b[39m \u001b[38;5;28mself\u001b[39m.single_connection_lock.acquire()\n", + "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\redis\\utils.py:195\u001b[39m, in \u001b[36mdeprecated_args..decorator..wrapper\u001b[39m\u001b[34m(*args, **kwargs)\u001b[39m\n\u001b[32m 190\u001b[39m \u001b[38;5;28;01melif\u001b[39;00m arg \u001b[38;5;129;01min\u001b[39;00m provided_args:\n\u001b[32m 191\u001b[39m warn_deprecated_arg_usage(\n\u001b[32m 192\u001b[39m arg, func.\u001b[34m__name__\u001b[39m, reason, version, stacklevel=\u001b[32m3\u001b[39m\n\u001b[32m 193\u001b[39m )\n\u001b[32m--> \u001b[39m\u001b[32m195\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n", + "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\redis\\connection.py:1533\u001b[39m, in \u001b[36mConnectionPool.get_connection\u001b[39m\u001b[34m(self, command_name, *keys, **options)\u001b[39m\n\u001b[32m 1529\u001b[39m \u001b[38;5;28mself\u001b[39m._in_use_connections.add(connection)\n\u001b[32m 1531\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m 1532\u001b[39m \u001b[38;5;66;03m# ensure this connection is connected to Redis\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m1533\u001b[39m \u001b[43mconnection\u001b[49m\u001b[43m.\u001b[49m\u001b[43mconnect\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1534\u001b[39m \u001b[38;5;66;03m# connections that the pool provides should be ready to send\u001b[39;00m\n\u001b[32m 1535\u001b[39m \u001b[38;5;66;03m# a command. if not, the connection was either returned to the\u001b[39;00m\n\u001b[32m 1536\u001b[39m \u001b[38;5;66;03m# pool before all data has been read or the socket has been\u001b[39;00m\n\u001b[32m 1537\u001b[39m \u001b[38;5;66;03m# closed. either way, reconnect and verify everything is good.\u001b[39;00m\n\u001b[32m 1538\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n", + "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\redis\\connection.py:380\u001b[39m, in \u001b[36mAbstractConnection.connect\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 378\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mconnect\u001b[39m(\u001b[38;5;28mself\u001b[39m):\n\u001b[32m 379\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mConnects to the Redis server if not already connected\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m--> \u001b[39m\u001b[32m380\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mconnect_check_health\u001b[49m\u001b[43m(\u001b[49m\u001b[43mcheck_health\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m)\u001b[49m\n", + "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\redis\\connection.py:397\u001b[39m, in \u001b[36mAbstractConnection.connect_check_health\u001b[39m\u001b[34m(self, check_health, retry_socket_connect)\u001b[39m\n\u001b[32m 395\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mTimeoutError\u001b[39;00m(\u001b[33m\"\u001b[39m\u001b[33mTimeout connecting to server\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m 396\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mOSError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m e:\n\u001b[32m--> \u001b[39m\u001b[32m397\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mConnectionError\u001b[39;00m(\u001b[38;5;28mself\u001b[39m._error_message(e))\n\u001b[32m 399\u001b[39m \u001b[38;5;28mself\u001b[39m._sock = sock\n\u001b[32m 400\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n", + "\u001b[31mConnectionError\u001b[39m: Error 10061 connecting to localhost:6379. No connection could be made because the target machine actively refused it." + ] } ], "source": [ @@ -161,10 +186,22 @@ "metadata": { "id": "ZnqjGneBDFol" }, - "outputs": [], + "outputs": [ + { + "ename": "ModuleNotFoundError", + "evalue": "No module named 'google'", + "output_type": "error", + "traceback": [ + "\u001b[31m---------------------------------------------------------------------------\u001b[39m", + "\u001b[31mModuleNotFoundError\u001b[39m Traceback (most recent call last)", + "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[4]\u001b[39m\u001b[32m, line 2\u001b[39m\n\u001b[32m 1\u001b[39m \u001b[38;5;28;01mimport\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mos\u001b[39;00m\n\u001b[32m----> \u001b[39m\u001b[32m2\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mgoogle\u001b[39;00m\u001b[34;01m.\u001b[39;00m\u001b[34;01mcolab\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m user_secret\n\u001b[32m 4\u001b[39m \u001b[38;5;66;03m# 🔐 Ask user whether to use Azure OpenAI or OpenAI\u001b[39;00m\n\u001b[32m 5\u001b[39m use_azure = \u001b[38;5;28minput\u001b[39m(\u001b[33m\"\u001b[39m\u001b[33mUse Azure OpenAI? (y/n): \u001b[39m\u001b[33m\"\u001b[39m).strip().lower() == \u001b[33m\"\u001b[39m\u001b[33my\u001b[39m\u001b[33m\"\u001b[39m\n", + "\u001b[31mModuleNotFoundError\u001b[39m: No module named 'google'" + ] + } + ], "source": [ "import os\n", - "from google.colab import user_secret\n", + "from google.colab import userdata\n", "\n", "# 🔐 Ask user whether to use Azure OpenAI or OpenAI\n", "use_azure = input(\"Use Azure OpenAI? (y/n): \").strip().lower() == \"y\"\n", @@ -177,9 +214,9 @@ " print(\"- AZURE_OPENAI_API_VERSION (e.g. 2024-05-01-preview)\")\n", " print(\"💡 Make sure 'gpt-4o' and 'gpt-4o-mini' models are deployed in your Azure Foundry.\\n\")\n", "\n", - " os.environ[\"AZURE_OPENAI_API_KEY\"] = user_secret.get_secret(\"AZURE_OPENAI_API_KEY\")\n", - " os.environ[\"AZURE_OPENAI_ENDPOINT\"] = user_secret.get_secret(\"AZURE_OPENAI_ENDPOINT\")\n", - " os.environ[\"AZURE_OPENAI_API_VERSION\"] = user_secret.get_secret(\"AZURE_OPENAI_API_VERSION\")\n", + " os.environ[\"AZURE_OPENAI_API_KEY\"] = userdata.get(\"AZURE_OPENAI_API_KEY\")\n", + " os.environ[\"AZURE_OPENAI_ENDPOINT\"] = userdata.get(\"AZURE_OPENAI_ENDPOINT\")\n", + " os.environ[\"AZURE_OPENAI_API_VERSION\"] = userdata.get(\"AZURE_OPENAI_API_VERSION\")\n", "\n", " # Optional model deployment names\n", " os.environ.setdefault(\"AZURE_OPENAI_GPT4_MODEL\", \"gpt-4o\")\n", @@ -190,7 +227,7 @@ " print(\"📌 Please ensure the following secret is added via the 🔐 Colab > Secrets menu:\")\n", " print(\"- OPENAI_API_KEY\\n\")\n", "\n", - " os.environ[\"OPENAI_API_KEY\"] = user_secret.get_secret(\"OPENAI_API_KEY\")\n", + " os.environ[\"OPENAI_API_KEY\"] = userdata.get(\"OPENAI_API_KEY\")\n", "\n", " # Optional model names (if using gpt-4o via OpenAI)\n", " os.environ.setdefault(\"OPENAI_GPT4_MODEL\", \"gpt-4o\")\n", @@ -1500,11 +1537,21 @@ "provenance": [] }, "kernelspec": { - "display_name": "Python 3", + "display_name": ".venv", + "language": "python", "name": "python3" }, "language_info": { - "name": "python" + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" } }, "nbformat": 4, From 237274bc919d413c3a8592a247829adbf8d62d2b Mon Sep 17 00:00:00 2001 From: Phil Date: Thu, 7 Aug 2025 14:58:18 -0400 Subject: [PATCH 3/4] cell outputs removed --- .../03_context_enabled_semantic_caching.ipynb | 58 ++----------------- 1 file changed, 5 insertions(+), 53 deletions(-) diff --git a/python-recipes/semantic-cache/03_context_enabled_semantic_caching.ipynb b/python-recipes/semantic-cache/03_context_enabled_semantic_caching.ipynb index 63b50cf..5c10d4a 100644 --- a/python-recipes/semantic-cache/03_context_enabled_semantic_caching.ipynb +++ b/python-recipes/semantic-cache/03_context_enabled_semantic_caching.ipynb @@ -85,20 +85,11 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "metadata": { "id": "m04KxSuhBiOx" }, - "outputs": [ - { - "ename": "SyntaxError", - "evalue": "invalid syntax (2741142086.py, line 3)", - "output_type": "error", - "traceback": [ - " \u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[2]\u001b[39m\u001b[32m, line 3\u001b[39m\n\u001b[31m \u001b[39m\u001b[31mcurl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg\u001b[39m\n ^\n\u001b[31mSyntaxError\u001b[39m\u001b[31m:\u001b[39m invalid syntax\n" - ] - } - ], + "outputs": [], "source": [ "# NBVAL_SKIP\n", "%%sh\n", @@ -124,7 +115,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -132,34 +123,7 @@ "id": "we-6LpNAByt1", "outputId": "89b7e9c1-63f9-4458-cdab-0bc98b88a09e" }, - "outputs": [ - { - "ename": "ConnectionError", - "evalue": "Error 10061 connecting to localhost:6379. No connection could be made because the target machine actively refused it.", - "output_type": "error", - "traceback": [ - "\u001b[31m---------------------------------------------------------------------------\u001b[39m", - "\u001b[31mConnectionRefusedError\u001b[39m Traceback (most recent call last)", - "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\redis\\connection.py:389\u001b[39m, in \u001b[36mAbstractConnection.connect_check_health\u001b[39m\u001b[34m(self, check_health, retry_socket_connect)\u001b[39m\n\u001b[32m 388\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m retry_socket_connect:\n\u001b[32m--> \u001b[39m\u001b[32m389\u001b[39m sock = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mretry\u001b[49m\u001b[43m.\u001b[49m\u001b[43mcall_with_retry\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 390\u001b[39m \u001b[43m \u001b[49m\u001b[38;5;28;43;01mlambda\u001b[39;49;00m\u001b[43m:\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_connect\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mlambda\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43merror\u001b[49m\u001b[43m:\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mdisconnect\u001b[49m\u001b[43m(\u001b[49m\u001b[43merror\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 391\u001b[39m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 392\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n", - "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\redis\\retry.py:105\u001b[39m, in \u001b[36mRetry.call_with_retry\u001b[39m\u001b[34m(self, do, fail)\u001b[39m\n\u001b[32m 104\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m105\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mdo\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 106\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;28mself\u001b[39m._supported_errors \u001b[38;5;28;01mas\u001b[39;00m error:\n", - "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\redis\\connection.py:390\u001b[39m, in \u001b[36mAbstractConnection.connect_check_health..\u001b[39m\u001b[34m()\u001b[39m\n\u001b[32m 388\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m retry_socket_connect:\n\u001b[32m 389\u001b[39m sock = \u001b[38;5;28mself\u001b[39m.retry.call_with_retry(\n\u001b[32m--> \u001b[39m\u001b[32m390\u001b[39m \u001b[38;5;28;01mlambda\u001b[39;00m: \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_connect\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m, \u001b[38;5;28;01mlambda\u001b[39;00m error: \u001b[38;5;28mself\u001b[39m.disconnect(error)\n\u001b[32m 391\u001b[39m )\n\u001b[32m 392\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n", - "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\redis\\connection.py:803\u001b[39m, in \u001b[36mConnection._connect\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 802\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m err \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m803\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m err\n\u001b[32m 804\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mOSError\u001b[39;00m(\u001b[33m\"\u001b[39m\u001b[33msocket.getaddrinfo returned an empty list\u001b[39m\u001b[33m\"\u001b[39m)\n", - "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\redis\\connection.py:787\u001b[39m, in \u001b[36mConnection._connect\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 786\u001b[39m \u001b[38;5;66;03m# connect\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m787\u001b[39m \u001b[43msock\u001b[49m\u001b[43m.\u001b[49m\u001b[43mconnect\u001b[49m\u001b[43m(\u001b[49m\u001b[43msocket_address\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 789\u001b[39m \u001b[38;5;66;03m# set the socket_timeout now that we're connected\u001b[39;00m\n", - "\u001b[31mConnectionRefusedError\u001b[39m: [WinError 10061] No connection could be made because the target machine actively refused it", - "\nDuring handling of the above exception, another exception occurred:\n", - "\u001b[31mConnectionError\u001b[39m Traceback (most recent call last)", - "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[3]\u001b[39m\u001b[32m, line 17\u001b[39m\n\u001b[32m 10\u001b[39m redis_client = redis.Redis(\n\u001b[32m 11\u001b[39m host=REDIS_HOST,\n\u001b[32m 12\u001b[39m port=REDIS_PORT,\n\u001b[32m 13\u001b[39m password=REDIS_PASSWORD\n\u001b[32m 14\u001b[39m )\n\u001b[32m 16\u001b[39m \u001b[38;5;66;03m# Test connection\u001b[39;00m\n\u001b[32m---> \u001b[39m\u001b[32m17\u001b[39m \u001b[43mredis_client\u001b[49m\u001b[43m.\u001b[49m\u001b[43mping\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n", - "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\redis\\commands\\core.py:1219\u001b[39m, in \u001b[36mManagementCommands.ping\u001b[39m\u001b[34m(self, **kwargs)\u001b[39m\n\u001b[32m 1213\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mping\u001b[39m(\u001b[38;5;28mself\u001b[39m, **kwargs) -> ResponseT:\n\u001b[32m 1214\u001b[39m \u001b[38;5;250m \u001b[39m\u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m 1215\u001b[39m \u001b[33;03m Ping the Redis server\u001b[39;00m\n\u001b[32m 1216\u001b[39m \n\u001b[32m 1217\u001b[39m \u001b[33;03m For more information see https://redis.io/commands/ping\u001b[39;00m\n\u001b[32m 1218\u001b[39m \u001b[33;03m \"\"\"\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m1219\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mexecute_command\u001b[49m\u001b[43m(\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mPING\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n", - "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\redis\\client.py:621\u001b[39m, in \u001b[36mRedis.execute_command\u001b[39m\u001b[34m(self, *args, **options)\u001b[39m\n\u001b[32m 620\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mexecute_command\u001b[39m(\u001b[38;5;28mself\u001b[39m, *args, **options):\n\u001b[32m--> \u001b[39m\u001b[32m621\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_execute_command\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43moptions\u001b[49m\u001b[43m)\u001b[49m\n", - "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\redis\\client.py:627\u001b[39m, in \u001b[36mRedis._execute_command\u001b[39m\u001b[34m(self, *args, **options)\u001b[39m\n\u001b[32m 625\u001b[39m pool = \u001b[38;5;28mself\u001b[39m.connection_pool\n\u001b[32m 626\u001b[39m command_name = args[\u001b[32m0\u001b[39m]\n\u001b[32m--> \u001b[39m\u001b[32m627\u001b[39m conn = \u001b[38;5;28mself\u001b[39m.connection \u001b[38;5;129;01mor\u001b[39;00m \u001b[43mpool\u001b[49m\u001b[43m.\u001b[49m\u001b[43mget_connection\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 629\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m._single_connection_client:\n\u001b[32m 630\u001b[39m \u001b[38;5;28mself\u001b[39m.single_connection_lock.acquire()\n", - "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\redis\\utils.py:195\u001b[39m, in \u001b[36mdeprecated_args..decorator..wrapper\u001b[39m\u001b[34m(*args, **kwargs)\u001b[39m\n\u001b[32m 190\u001b[39m \u001b[38;5;28;01melif\u001b[39;00m arg \u001b[38;5;129;01min\u001b[39;00m provided_args:\n\u001b[32m 191\u001b[39m warn_deprecated_arg_usage(\n\u001b[32m 192\u001b[39m arg, func.\u001b[34m__name__\u001b[39m, reason, version, stacklevel=\u001b[32m3\u001b[39m\n\u001b[32m 193\u001b[39m )\n\u001b[32m--> \u001b[39m\u001b[32m195\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n", - "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\redis\\connection.py:1533\u001b[39m, in \u001b[36mConnectionPool.get_connection\u001b[39m\u001b[34m(self, command_name, *keys, **options)\u001b[39m\n\u001b[32m 1529\u001b[39m \u001b[38;5;28mself\u001b[39m._in_use_connections.add(connection)\n\u001b[32m 1531\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m 1532\u001b[39m \u001b[38;5;66;03m# ensure this connection is connected to Redis\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m1533\u001b[39m \u001b[43mconnection\u001b[49m\u001b[43m.\u001b[49m\u001b[43mconnect\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1534\u001b[39m \u001b[38;5;66;03m# connections that the pool provides should be ready to send\u001b[39;00m\n\u001b[32m 1535\u001b[39m \u001b[38;5;66;03m# a command. if not, the connection was either returned to the\u001b[39;00m\n\u001b[32m 1536\u001b[39m \u001b[38;5;66;03m# pool before all data has been read or the socket has been\u001b[39;00m\n\u001b[32m 1537\u001b[39m \u001b[38;5;66;03m# closed. either way, reconnect and verify everything is good.\u001b[39;00m\n\u001b[32m 1538\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n", - "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\redis\\connection.py:380\u001b[39m, in \u001b[36mAbstractConnection.connect\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 378\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mconnect\u001b[39m(\u001b[38;5;28mself\u001b[39m):\n\u001b[32m 379\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mConnects to the Redis server if not already connected\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m--> \u001b[39m\u001b[32m380\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mconnect_check_health\u001b[49m\u001b[43m(\u001b[49m\u001b[43mcheck_health\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m)\u001b[49m\n", - "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\redis\\connection.py:397\u001b[39m, in \u001b[36mAbstractConnection.connect_check_health\u001b[39m\u001b[34m(self, check_health, retry_socket_connect)\u001b[39m\n\u001b[32m 395\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mTimeoutError\u001b[39;00m(\u001b[33m\"\u001b[39m\u001b[33mTimeout connecting to server\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m 396\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mOSError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m e:\n\u001b[32m--> \u001b[39m\u001b[32m397\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mConnectionError\u001b[39;00m(\u001b[38;5;28mself\u001b[39m._error_message(e))\n\u001b[32m 399\u001b[39m \u001b[38;5;28mself\u001b[39m._sock = sock\n\u001b[32m 400\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n", - "\u001b[31mConnectionError\u001b[39m: Error 10061 connecting to localhost:6379. No connection could be made because the target machine actively refused it." - ] - } - ], + "outputs": [], "source": [ "import os\n", "import redis\n", @@ -186,19 +150,7 @@ "metadata": { "id": "ZnqjGneBDFol" }, - "outputs": [ - { - "ename": "ModuleNotFoundError", - "evalue": "No module named 'google'", - "output_type": "error", - "traceback": [ - "\u001b[31m---------------------------------------------------------------------------\u001b[39m", - "\u001b[31mModuleNotFoundError\u001b[39m Traceback (most recent call last)", - "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[4]\u001b[39m\u001b[32m, line 2\u001b[39m\n\u001b[32m 1\u001b[39m \u001b[38;5;28;01mimport\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mos\u001b[39;00m\n\u001b[32m----> \u001b[39m\u001b[32m2\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mgoogle\u001b[39;00m\u001b[34;01m.\u001b[39;00m\u001b[34;01mcolab\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m user_secret\n\u001b[32m 4\u001b[39m \u001b[38;5;66;03m# 🔐 Ask user whether to use Azure OpenAI or OpenAI\u001b[39;00m\n\u001b[32m 5\u001b[39m use_azure = \u001b[38;5;28minput\u001b[39m(\u001b[33m\"\u001b[39m\u001b[33mUse Azure OpenAI? (y/n): \u001b[39m\u001b[33m\"\u001b[39m).strip().lower() == \u001b[33m\"\u001b[39m\u001b[33my\u001b[39m\u001b[33m\"\u001b[39m\n", - "\u001b[31mModuleNotFoundError\u001b[39m: No module named 'google'" - ] - } - ], + "outputs": [], "source": [ "import os\n", "from google.colab import userdata\n", From 20cf640516f8b6a9dd22c819ba426be6c51a9de3 Mon Sep 17 00:00:00 2001 From: Phil Date: Mon, 18 Aug 2025 13:05:02 -0400 Subject: [PATCH 4/4] addressed all feedback from PR feedback --- .../03_context_enabled_semantic_caching.ipynb | 2643 ++++++++--------- 1 file changed, 1162 insertions(+), 1481 deletions(-) diff --git a/python-recipes/semantic-cache/03_context_enabled_semantic_caching.ipynb b/python-recipes/semantic-cache/03_context_enabled_semantic_caching.ipynb index 5c10d4a..55d0848 100644 --- a/python-recipes/semantic-cache/03_context_enabled_semantic_caching.ipynb +++ b/python-recipes/semantic-cache/03_context_enabled_semantic_caching.ipynb @@ -1,1511 +1,1192 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "vrbm9EkW-kRo" - }, - "source": [ - "![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)\n", - "\n", - "# Context-Enabled Semantic Caching with Redis\n", - "\n", - "\n", - "\"Open" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "4i9pSolc896M" - }, - "source": [ - "## What is Context-Enabled Semantic Caching?\n", - "\n", - "\n", - "Most caching systems today are **exact match**. They only return results if the query matches a key 1:1. \n", - "Ask **“What’s the weather in NYC?”**, and the system might cache and return that exact string. \n", - "But change it slightly—**“Is it raining in New York?”**—and you miss the cache completely.\n", - "\n", - "**Semantic caching** fixes that. It uses **vector embeddings** to find conceptually similar queries. \n", - "So whether a user asks “forecast for NYC,” “weather in Manhattan,” or “umbrella needed in NYC?”, they all hit the **same cached result** if the meaning aligns.\n", - "\n", - "But here’s the problem: \n", - "Even if you nail semantic similarity, **not all users want the same level of detail or format**. \n", - "With LLMs storing more history and memory on users, this is a chance to tailor responses to be fully personalized at fractions of the cost.\n", - "\n", - "That’s where **Context-Enabled Semantic Caching (CESC)** comes in.\n", - "\n", - "---\n", - "\n", - "\n", - "\n", - "### The Business Problem\n", - "\n", - "Enterprise LLM applications face three critical challenges:\n", - "- **Cost**: GPT-4o calls can cost $0.0025-0.01 per 1K tokens\n", - "- **Latency**: Cold LLM calls take 2-5 seconds, hurting user experience \n", - "- **Relevance**: Generic responses don't account for user roles, preferences, or context\n", - "\n", - "### Why It Matters\n", - "\n", - "| Challenge | Traditional Caching | Semantic Caching | CESC (Personalized) |\n", - "|----------------|-----------------------------|----------------------------------------|-------------------------------------------|\n", - "| **Match Type** | Exact string | Vector similarity | Vector + user context |\n", - "| **Relevance** | Low | Medium | High |\n", - "| **Latency** | Fast | Fast | Still fast (cached + lightweight model) |\n", - "| **Cost** | Low | Low | Low (personalization avoids full GPT-4o-mini) |\n", - "\n", - "\n", - "\n", - "---\n", - "\n", - "### Our Solution Architecture\n", - "\n", - "CESC creates a three-tier response system:\n", - "1. **Cold Start**: Fresh LLM call for new queries (expensive, slow, but comprehensive)\n", - "2. **Cache Hit**: Instant return of semantically similar cached responses (fast, cheap, generic)\n", - "3. **Personalized Cache Hit**: Lightweight model personalizes cached content using user memory (balanced speed/cost/relevance)\n", - "\n", - "Let's see this in action with a real enterprise IT support scenario.\n", - "[![](https://mermaid.ink/img/pako:eNpdkU1uwjAQha9izTpQfkyAqEJCqdQNlSBpWTRh4SYDiRTbaOKUAkLqFXrFnqROgmjVWdnz5n1-8pwh0SmCB9tCH5JMkGGLIFbM1ip6KZHYqkI6blinM2NhtMbEaGIhCkqy-ze6mwWY5uV6sWk9oZ1jSjMpTJI1nkX0uHz-_vzimvmiKFqQH4UWgyxXtplkeHX7jRhEAZqKFDOa1Qn-on-583qKcnxHNlfl4TY2vyao6uwSpaZjS_0j_9eWt4wdmaucLZFKrUSRn7DNG4ADO8pT8LaiKNEBiSRFfYdzzY3BZCgxBs8eU9yKqjAxxOpifXuhXrWW4BmqrJN0tctunGqfCoMPudiRkLcuoUqRfF0pAx7vTxsIeGf4AG867Lp8POmNXT4YuLYcOILXd6ddPhzzSd8d8Snn3L04cGqe7XUn45EDdk32y5_aZTc7v_wAqpSdUg?type=png)](https://mermaid.live/edit#pako:eNpdkU1uwjAQha9izTpQfkyAqEJCqdQNlSBpWTRh4SYDiRTbaOKUAkLqFXrFnqROgmjVWdnz5n1-8pwh0SmCB9tCH5JMkGGLIFbM1ip6KZHYqkI6blinM2NhtMbEaGIhCkqy-ze6mwWY5uV6sWk9oZ1jSjMpTJI1nkX0uHz-_vzimvmiKFqQH4UWgyxXtplkeHX7jRhEAZqKFDOa1Qn-on-583qKcnxHNlfl4TY2vyao6uwSpaZjS_0j_9eWt4wdmaucLZFKrUSRn7DNG4ADO8pT8LaiKNEBiSRFfYdzzY3BZCgxBs8eU9yKqjAxxOpifXuhXrWW4BmqrJN0tctunGqfCoMPudiRkLcuoUqRfF0pAx7vTxsIeGf4AG867Lp8POmNXT4YuLYcOILXd6ddPhzzSd8d8Snn3L04cGqe7XUn45EDdk32y5_aZTc7v_wAqpSdUg)" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "id": "v6g7eVRZAcFA" - }, - "outputs": [], - "source": [ - "# 📦 Install required Python packages\n", - "!pip install -q \"redisvl>=0.8.0\" sentence-transformers openai tiktoken python-dotenv redis google pandas" - ] - }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "vrbm9EkW-kRo" + }, + "source": [ + "![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)\n", + "\n", + "# Context-Enabled Semantic Caching with Redis\n", + "\n", + "\n", + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4i9pSolc896M" + }, + "source": [ + "## What is Context-Enabled Semantic Caching?\n", + "\n", + "\n", + "Most caching systems today are **exact match**. They only return results if the query matches a key 1:1. \n", + "Ask **“What’s the weather in NYC?”**, and the system might cache and return that exact string. \n", + "But change it slightly—**“Is it raining in New York?”**—and you miss the cache completely.\n", + "\n", + "**Semantic caching** fixes that. It uses **vector embeddings** to find conceptually similar queries. \n", + "So whether a user asks “forecast for NYC,” “weather in Manhattan,” or “umbrella needed in NYC?”, they all hit the **same cached result** if the meaning aligns.\n", + "\n", + "But here’s the problem: \n", + "Even if you nail semantic similarity, **not all users want the same level of detail or format**. \n", + "With LLMs storing more history and memory on users, this is a chance to tailor responses to be fully personalized at fractions of the cost.\n", + "\n", + "That’s where **Context-Enabled Semantic Caching (CESC)** comes in.\n", + "\n", + "---\n", + "\n", + "\n", + "\n", + "### The Business Problem\n", + "\n", + "Enterprise LLM applications face three critical challenges:\n", + "- **Cost**: GPT-4o calls can cost $0.0025-0.01 per 1K tokens\n", + "- **Latency**: Cold LLM calls take 2-5 seconds, hurting user experience \n", + "- **Relevance**: Generic responses don't account for user roles, preferences, or context\n", + "\n", + "### Why It Matters\n", + "\n", + "| Challenge | Traditional Caching | Semantic Caching | CESC (Personalized) |\n", + "|----------------|-----------------------------|----------------------------------------|-------------------------------------------|\n", + "| **Match Type** | Exact string | Vector similarity | Vector + user context |\n", + "| **Relevance** | Low | Medium | High |\n", + "| **Latency** | Fast | Fast | Still fast (cached + lightweight model) |\n", + "| **Cost** | Low | Low | Low (personalization avoids full GPT-4o-mini) |\n", + "\n", + "\n", + "\n", + "---\n", + "\n", + "### Our Solution Architecture\n", + "\n", + "CESC creates a three-tier response system:\n", + "1. **Cold Start**: Fresh LLM call for new queries (expensive, slow, but comprehensive)\n", + "2. **Cache Hit**: Instant return of semantically similar cached responses (fast, cheap, generic)\n", + "3. **Personalized Cache Hit**: Lightweight model personalizes cached content using user memory (balanced speed/cost/relevance)\n", + "\n", + "Let's see this in action with a real enterprise IT support scenario.\n", + "[![](https://mermaid.ink/img/pako:eNpdkU1uwjAQha9izTpQfkyAqEJCqdQNlSBpWTRh4SYDiRTbaOKUAkLqFXrFnqROgmjVWdnz5n1-8pwh0SmCB9tCH5JMkGGLIFbM1ip6KZHYqkI6blinM2NhtMbEaGIhCkqy-ze6mwWY5uV6sWk9oZ1jSjMpTJI1nkX0uHz-_vzimvmiKFqQH4UWgyxXtplkeHX7jRhEAZqKFDOa1Qn-on-583qKcnxHNlfl4TY2vyao6uwSpaZjS_0j_9eWt4wdmaucLZFKrUSRn7DNG4ADO8pT8LaiKNEBiSRFfYdzzY3BZCgxBs8eU9yKqjAxxOpifXuhXrWW4BmqrJN0tctunGqfCoMPudiRkLcuoUqRfF0pAx7vTxsIeGf4AG867Lp8POmNXT4YuLYcOILXd6ddPhzzSd8d8Snn3L04cGqe7XUn45EDdk32y5_aZTc7v_wAqpSdUg?type=png)](https://mermaid.live/edit#pako:eNpdkU1uwjAQha9izTpQfkyAqEJCqdQNlSBpWTRh4SYDiRTbaOKUAkLqFXrFnqROgmjVWdnz5n1-8pwh0SmCB9tCH5JMkGGLIFbM1ip6KZHYqkI6blinM2NhtMbEaGIhCkqy-ze6mwWY5uV6sWk9oZ1jSjMpTJI1nkX0uHz-_vzimvmiKFqQH4UWgyxXtplkeHX7jRhEAZqKFDOa1Qn-on-583qKcnxHNlfl4TY2vyao6uwSpaZjS_0j_9eWt4wdmaucLZFKrUSRn7DNG4ADO8pT8LaiKNEBiSRFfYdzzY3BZCgxBs8eU9yKqjAxxOpifXuhXrWW4BmqrJN0tctunGqfCoMPudiRkLcuoUqRfF0pAx7vTxsIeGf4AG867Lp8POmNXT4YuLYcOILXd6ddPhzzSd8d8Snn3L04cGqe7XUn45EDdk32y5_aZTc7v_wAqpSdUg)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Install dependencies" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "id": "v6g7eVRZAcFA" + }, + "outputs": [], + "source": [ + "# 📦 Install required Python packages\n", + "!pip install -q \"redisvl>=0.8.0\" sentence-transformers openai tiktoken python-dotenv redis google pandas" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Run a Redis instance\n", + "\n", + "\n", + "#### For Colab\n", + "Use the shell script below to download, extract, and install [Redis Stack](https://redis.io/docs/getting-started/install-stack/) directly from the Redis package archive." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "m04KxSuhBiOx" + }, + "outputs": [ { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "m04KxSuhBiOx" - }, - "outputs": [], - "source": [ - "# NBVAL_SKIP\n", - "%%sh\n", - "curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg\n", - "echo \"deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main\" | sudo tee /etc/apt/sources.list.d/redis.list\n", - "sudo apt-get update > /dev/null 2>&1\n", - "sudo apt-get install redis-stack-server > /dev/null 2>&1\n", - "redis-stack-server --daemonize yes" - ] + "ename": "SyntaxError", + "evalue": "invalid syntax (2741142086.py, line 3)", + "output_type": "error", + "traceback": [ + " \u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[2]\u001b[39m\u001b[32m, line 3\u001b[39m\n\u001b[31m \u001b[39m\u001b[31mcurl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg\u001b[39m\n ^\n\u001b[31mSyntaxError\u001b[39m\u001b[31m:\u001b[39m invalid syntax\n" + ] + } + ], + "source": [ + "# NBVAL_SKIP\n", + "%%sh\n", + "curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg\n", + "echo \"deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main\" | sudo tee /etc/apt/sources.list.d/redis.list\n", + "sudo apt-get update > /dev/null 2>&1\n", + "sudo apt-get install redis-stack-server > /dev/null 2>&1\n", + "redis-stack-server --daemonize yes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### For Alternative Environments\n", + "There are many ways to get the necessary redis-stack instance running\n", + "1. On cloud, deploy a [FREE instance of Redis in the cloud](https://redis.com/try-free/). Or, if you have your\n", + "own version of Redis Enterprise running, that works too!\n", + "2. Per OS, [see the docs](https://redis.io/docs/latest/operate/oss_and_stack/install/install-stack/)\n", + "3. With docker: `docker run -d --name redis-stack-server -p 6379:6379 redis/redis-stack-server:latest`" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xlsHkIF49Lve" + }, + "source": [ + "## Infrastructure Setup\n", + "\n", + "We're using Redis with vector search capabilities to store embeddings and enable semantic similarity matching. This simulates a production environment where your cache would be persistent across sessions.\n", + "\n", + "**Note**: In production, you'd typically use Redis Enterprise, or a managed Redis service such as Redis Cloud or Azure Managed Redis with proper clustering, persistence, and security configurations." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" }, + "id": "we-6LpNAByt1", + "outputId": "89b7e9c1-63f9-4458-cdab-0bc98b88a09e" + }, + "outputs": [ { - "cell_type": "markdown", - "metadata": { - "id": "xlsHkIF49Lve" - }, - "source": [ - "## Infrastructure Setup\n", - "\n", - "We're using Redis with vector search capabilities to store embeddings and enable semantic similarity matching. This simulates a production environment where your cache would be persistent across sessions.\n", - "\n", - "**Note**: In production, you'd typically use Redis Enterprise, or a managed Redis service such as Redis Cloud or Azure Managed Redis with proper clustering, persistence, and security configurations." + "data": { + "text/plain": [ + "True" ] - }, + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import os\n", + "import redis\n", + "\n", + "# Redis connection params\n", + "REDIS_HOST = os.getenv(\"REDIS_HOST\", \"localhost\")\n", + "REDIS_PORT = os.getenv(\"REDIS_PORT\", \"6379\")\n", + "REDIS_PASSWORD = os.getenv(\"REDIS_PASSWORD\", \"\")\n", + "\n", + "#\n", + "# Create Redis client\n", + "redis_client = redis.Redis(\n", + " host=REDIS_HOST,\n", + " port=REDIS_PORT,\n", + " password=REDIS_PASSWORD\n", + ")\n", + "\n", + "redis_url = f\"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}\" if REDIS_PASSWORD else f\"redis://{REDIS_HOST}:{REDIS_PORT}\"\n", + "\n", + "# Test connection\n", + "redis_client.ping()" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "we-6LpNAByt1", - "outputId": "89b7e9c1-63f9-4458-cdab-0bc98b88a09e" - }, - "outputs": [], - "source": [ - "import os\n", - "import redis\n", - "\n", - "# Redis connection params\n", - "REDIS_HOST = os.getenv(\"REDIS_HOST\", \"localhost\")\n", - "REDIS_PORT = os.getenv(\"REDIS_PORT\", \"6379\")\n", - "REDIS_PASSWORD = os.getenv(\"REDIS_PASSWORD\", \"\")\n", - "\n", - "# Create Redis client\n", - "redis_client = redis.Redis(\n", - " host=REDIS_HOST,\n", - " port=REDIS_PORT,\n", - " password=REDIS_PASSWORD\n", - ")\n", - "\n", - "# Test connection\n", - "redis_client.ping()" + "data": { + "text/plain": [ + "True" ] - }, + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import os\n", + "\n", + "from dotenv import load_dotenv\n", + "\n", + "# Load environment variables from .env file\n", + "# Make sure you have a .env file in the root of this project\n", + "load_dotenv()" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "id": "ZnqjGneBDFol" + }, + "outputs": [ { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ZnqjGneBDFol" - }, - "outputs": [], - "source": [ - "import os\n", - "from google.colab import userdata\n", - "\n", - "# 🔐 Ask user whether to use Azure OpenAI or OpenAI\n", - "use_azure = input(\"Use Azure OpenAI? (y/n): \").strip().lower() == \"y\"\n", - "\n", - "if use_azure:\n", - " print(\"🔒 Azure OpenAI selected.\")\n", - " print(\"📌 Please ensure the following secrets are added via the 🔐 Colab > Secrets menu:\")\n", - " print(\"- AZURE_OPENAI_API_KEY\")\n", - " print(\"- AZURE_OPENAI_ENDPOINT (e.g. https://your-resource.openai.azure.com)\")\n", - " print(\"- AZURE_OPENAI_API_VERSION (e.g. 2024-05-01-preview)\")\n", - " print(\"💡 Make sure 'gpt-4o' and 'gpt-4o-mini' models are deployed in your Azure Foundry.\\n\")\n", - "\n", - " os.environ[\"AZURE_OPENAI_API_KEY\"] = userdata.get(\"AZURE_OPENAI_API_KEY\")\n", - " os.environ[\"AZURE_OPENAI_ENDPOINT\"] = userdata.get(\"AZURE_OPENAI_ENDPOINT\")\n", - " os.environ[\"AZURE_OPENAI_API_VERSION\"] = userdata.get(\"AZURE_OPENAI_API_VERSION\")\n", - "\n", - " # Optional model deployment names\n", - " os.environ.setdefault(\"AZURE_OPENAI_GPT4_MODEL\", \"gpt-4o\")\n", - " os.environ.setdefault(\"AZURE_OPENAI_GPT4mini_MODEL\", \"gpt-4o-mini\")\n", - "\n", - "else:\n", - " print(\"🔒 OpenAI selected.\")\n", - " print(\"📌 Please ensure the following secret is added via the 🔐 Colab > Secrets menu:\")\n", - " print(\"- OPENAI_API_KEY\\n\")\n", - "\n", - " os.environ[\"OPENAI_API_KEY\"] = userdata.get(\"OPENAI_API_KEY\")\n", - "\n", - " # Optional model names (if using gpt-4o via OpenAI)\n", - " os.environ.setdefault(\"OPENAI_GPT4_MODEL\", \"gpt-4o\")\n", - " os.environ.setdefault(\"OPENAI_GPT4mini_MODEL\", \"gpt-4o-mini\")" - ] - }, + "name": "stdout", + "output_type": "stream", + "text": [ + "🔒 Azure OpenAI selected (based on USE_AZURE environment variable).\n", + "📌 Please ensure the following secrets are added via the 🔐 Colab > Secrets menu or as environment variables:\n", + "- AZURE_OPENAI_API_KEY\n", + "- AZURE_OPENAI_ENDPOINT (e.g. https://your-resource.openai.azure.com)\n", + "- AZURE_OPENAI_API_VERSION (e.g. 2024-05-01-preview)\n", + "💡 Make sure 'gpt-4o' and 'gpt-4o-mini' models are deployed in your Azure Foundry.\n", + "\n" + ] + } + ], + "source": [ + "# Helper function to get secrets from Colab or environment variables\n", + "def get_secret(secret_name: str) -> str:\n", + " \"\"\"\n", + " Retrieves a secret from Google Colab's userdata if available,\n", + " otherwise falls back to an environment variable.\n", + " \"\"\"\n", + " try:\n", + " from google.colab import userdata\n", + " secret = userdata.get(secret_name)\n", + " if secret:\n", + " return secret\n", + " except (ImportError, KeyError):\n", + " # Not in Colab or secret not found, fall back to environment variables\n", + " pass\n", + " return os.getenv(secret_name)\n", + "\n", + "# 🔐 Determine whether to use Azure OpenAI from environment variables.\n", + "# Set USE_AZURE=true in your .env file to use Azure. Defaults to OpenAI if not set or false.\n", + "use_azure = input(\"Use Azure OpenAI? (y/n): \").strip().lower() == \"y\"\n", + "\n", + "if use_azure:\n", + " print(\"🔒 Azure OpenAI selected (based on USE_AZURE environment variable).\")\n", + " print(\"📌 Please ensure the following secrets are added via the 🔐 Colab > Secrets menu or as environment variables:\")\n", + " print(\"- AZURE_OPENAI_API_KEY\")\n", + " print(\"- AZURE_OPENAI_ENDPOINT (e.g. https://your-resource.openai.azure.com)\")\n", + " print(\"- AZURE_OPENAI_API_VERSION (e.g. 2024-05-01-preview)\")\n", + " print(\"💡 Make sure 'gpt-4o' and 'gpt-4o-mini' models are deployed in your Azure Foundry.\\n\")\n", + "\n", + " os.environ[\"AZURE_OPENAI_API_KEY\"] = get_secret(\"AZURE_OPENAI_API_KEY\")\n", + " os.environ[\"AZURE_OPENAI_ENDPOINT\"] = get_secret(\"AZURE_OPENAI_ENDPOINT\")\n", + " os.environ[\"AZURE_OPENAI_API_VERSION\"] = get_secret(\"AZURE_OPENAI_API_VERSION\")\n", + "\n", + " # Optional model deployment names\n", + " os.environ.setdefault(\"AZURE_OPENAI_MODEL_GPT4\", \"gpt-4o\")\n", + " os.environ.setdefault(\"AZURE_OPENAI_MODEL_GPT4_MINI\", \"gpt-4o-mini\")\n", + "\n", + "else:\n", + " print(\"🔒 OpenAI selected (default or USE_AZURE is not 'true').\")\n", + " print(\"📌 Please ensure the following secret is added via the 🔐 Colab > Secrets menu or as an environment variable:\")\n", + " print(\"- OPENAI_API_KEY\\n\")\n", + "\n", + " os.environ[\"OPENAI_API_KEY\"] = get_secret(\"OPENAI_API_KEY\")\n", + "\n", + " # Optional model names (if using gpt-4o via OpenAI)\n", + " os.environ.setdefault(\"OPENAI_MODEL_GPT4\", \"gpt-4o\")\n", + " os.environ.setdefault(\"OPENAI_MODEL_GPT4_MINI\", \"gpt-4o-mini\")" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "id": "XtfiyQ4TEQmN" + }, + "outputs": [ { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "XtfiyQ4TEQmN" - }, - "outputs": [], - "source": [ - "import time\n", - "import uuid\n", - "import numpy as np\n", - "from typing import List, Dict\n", - "import redis\n", - "from sentence_transformers import SentenceTransformer\n", - "from redisvl.index import SearchIndex\n", - "from redisvl.utils.vectorize import HFTextVectorizer\n", - "from openai import AzureOpenAI\n", - "import tiktoken\n", - "import pandas as pd\n", - "from openai import AzureOpenAI, OpenAI\n", - "\n", - "# Connect to Redis\n", - "redis_client = redis.Redis(host=\"localhost\", port=6379, decode_responses=True)\n", - "\n", - "# RedisVL index\n", - "index_config = {\n", - " \"index\": {\n", - " \"name\": \"cesc_index\",\n", - " \"prefix\": \"cesc\",\n", - " \"storage_type\": \"hash\"\n", - " },\n", - " \"fields\": [\n", - " {\n", - " \"name\": \"content_vector\",\n", - " \"type\": \"vector\",\n", - " \"attrs\": {\n", - " \"dims\": 384,\n", - " \"distance_metric\": \"cosine\",\n", - " \"algorithm\": \"hnsw\"\n", - " }\n", - " },\n", - " {\"name\": \"content\", \"type\": \"text\"},\n", - " {\"name\": \"user_id\", \"type\": \"tag\"}\n", - " ]\n", - "}\n", - "search_index = SearchIndex.from_dict(index_config)\n", - "search_index.connect(\"redis://localhost:6379\")\n", - "search_index.create(overwrite=True)\n", - "\n", - "if use_azure:\n", - " client = AzureOpenAI(\n", - " azure_endpoint=os.getenv(\"AZURE_OPENAI_ENDPOINT\"),\n", - " api_key=os.getenv(\"AZURE_OPENAI_API_KEY\"),\n", - " api_version=os.getenv(\"AZURE_OPENAI_API_VERSION\")\n", - " )\n", - " GPT4_MODEL = os.getenv(\"AZURE_OPENAI_GPT4_MODEL\")\n", - " GPT4mini_MODEL = os.getenv(\"AZURE_OPENAI_GPT4mini_MODEL\")\n", - "else:\n", - " client = OpenAI(\n", - " api_key=os.getenv(\"OPENAI_API_KEY\")\n", - " )\n", - " GPT4_MODEL = os.getenv(\"OPENAI_GPT4_MODEL\")\n", - " GPT4mini_MODEL = os.getenv(\"OPENAI_GPT4mini_MODEL\")\n", - "\n", - "\n", - "# Embedding model + vectorizer\n", - "embedding_model = SentenceTransformer(\"all-MiniLM-L6-v2\")\n", - "vectorizer = HFTextVectorizer(model=\"all-MiniLM-L6-v2\")\n", - "\n", - "# Token counter\n", - "class TokenCounter:\n", - " def __init__(self, model_name=\"gpt-4o\"):\n", - " try:\n", - " self.encoding = tiktoken.encoding_for_model(model_name)\n", - " except KeyError:\n", - " self.encoding = tiktoken.get_encoding(\"cl100k_base\")\n", - "\n", - " def count_tokens(self, text: str) -> int:\n", - " if not text:\n", - " return 0\n", - " return len(self.encoding.encode(text))\n", - "\n", - "token_counter = TokenCounter()\n", - "\n", - "class TelemetryLogger:\n", - " def __init__(self):\n", - " self.logs = []\n", - "\n", - " def log(self, user_id, method, latency_ms, input_tokens, output_tokens, cache_status, response_source):\n", - " model = response_source # assume model name is passed as source, e.g., \"gpt-4o\" or \"gpt-4o-mini\"\n", - " cost = self.calculate_cost(model, input_tokens, output_tokens)\n", - " self.logs.append({\n", - " \"timestamp\": time.time(),\n", - " \"user_id\": user_id,\n", - " \"method\": method,\n", - " \"latency_ms\": latency_ms,\n", - " \"input_tokens\": input_tokens,\n", - " \"output_tokens\": output_tokens,\n", - " \"total_tokens\": input_tokens + output_tokens,\n", - " \"cache_status\": cache_status,\n", - " \"response_source\": response_source,\n", - " \"cost_usd\": cost\n", - " })\n", - "\n", - " # 💵 Real cost vs baseline cold-call cost\n", - " cost = self.calculate_cost(response_source, input_tokens, output_tokens)\n", - " baseline = self.calculate_cost(\"gpt-4o\", input_tokens, output_tokens)\n", - "\n", - " self.logs[-1][\"cost_usd\"] = cost\n", - " self.logs[-1][\"baseline_cost_usd\"] = baseline\n", - "\n", - " def show_logs(self):\n", - " return pd.DataFrame(self.logs)\n", - "\n", - " def summarize(self):\n", - " df = pd.DataFrame(self.logs)\n", - " if df.empty:\n", - " print(\"No telemetry yet.\")\n", - " return\n", - "\n", - " df[\"total_tokens\"] = df[\"input_tokens\"] + df[\"output_tokens\"]\n", - "\n", - " display(df[[\n", - " \"user_id\",\n", - " \"cache_status\",\n", - " \"latency_ms\",\n", - " \"response_source\",\n", - " \"input_tokens\",\n", - " \"output_tokens\",\n", - " \"total_tokens\"\n", - " ]])\n", - "\n", - " # Compare cold start vs personalized\n", - " try:\n", - " cold_latency = df.loc[df[\"user_id\"] == \"user_cold\", \"latency_ms\"].values[0]\n", - " cx_latency = df.loc[df[\"user_id\"] == \"user_withcontext\", \"latency_ms\"].values[0]\n", - "\n", - " if cx_latency < cold_latency:\n", - " delta = cold_latency - cx_latency\n", - " pct = (delta / cold_latency) * 100\n", - " print(f\"\\n⚡ Personalized response (user_withcontext) was faster than the plain LLM by {int(delta)} ms — a {pct:.1f}% speed boost.\")\n", - " else:\n", - " delta = cx_latency - cold_latency\n", - " pct = (delta / cx_latency) * 100\n", - " print(f\"\\n⏱️ Personalized response (user_withcontext) was {int(delta)} ms slower than the plain LLM — a {pct:.1f}% slowdown.\")\n", - " print(\"📌 However, it returned a tailored response based on user memory, offering higher relevance.\")\n", - " except Exception as e:\n", - " print(\"\\n⚠️ Could not compute latency comparison:\", e)\n", - "\n", - " def calculate_cost(self, model: str, input_tokens: int, output_tokens: int) -> float:\n", - " # Azure OpenAI pricing (per 1K tokens)\n", - " pricing = {\n", - " \"gpt-4o\": {\"input\": 0.005, \"output\": 0.015},\n", - " \"gpt-4o-mini\": {\"input\": 0.0015, \"output\": 0.003}\n", - " }\n", - "\n", - " if model not in pricing:\n", - " return 0.0\n", - "\n", - " input_cost = (input_tokens / 1000) * pricing[model][\"input\"]\n", - " output_cost = (output_tokens / 1000) * pricing[model][\"output\"]\n", - " return round(input_cost + output_cost, 6)\n", - "\n", - " def display_cost_summary(self):\n", - " df = self.show_logs()\n", - " if df.empty:\n", - " print(\"No telemetry logged yet.\")\n", - " return\n", - "\n", - " # Calculate savings per row\n", - " df[\"savings_usd\"] = df[\"baseline_cost_usd\"] - df[\"cost_usd\"]\n", - "\n", - " total_cost = df[\"cost_usd\"].sum()\n", - " baseline_cost = df[\"baseline_cost_usd\"].sum()\n", - " total_savings = df[\"savings_usd\"].sum()\n", - " savings_pct = (total_savings / baseline_cost * 100) if baseline_cost > 0 else 0\n", - "\n", - " # Display summary table\n", - " display(df[[\n", - " \"user_id\", \"cache_status\", \"response_source\",\n", - " \"input_tokens\", \"output_tokens\", \"latency_ms\",\n", - " \"cost_usd\", \"baseline_cost_usd\", \"savings_usd\"\n", - " ]])\n", - "\n", - " # 💸 Compare cost of plain LLM vs personalized\n", - " try:\n", - " cost_plain = df.loc[df[\"user_id\"] == \"user_cold\", \"cost_usd\"].values[0]\n", - " cost_personalized = df.loc[df[\"user_id\"] == \"user_withcontext\", \"cost_usd\"].values[0]\n", - "\n", - " print(f\"\\n🧾 Total Cost of Plain LLM Response: ${cost_plain:.4f}\")\n", - " print(f\"🧾 Total Cost of Personalized Response: ${cost_personalized:.4f}\")\n", - "\n", - " if cost_personalized < cost_plain:\n", - " delta = cost_plain - cost_personalized\n", - " pct = (delta / cost_plain) * 100\n", - " print(f\"\\n💡 Personalized response (user_withcontext) was cheaper than plain LLM by ${delta:.4f} — a {pct:.1f}% cost improvement.\")\n", - " else:\n", - " delta = cost_personalized - cost_plain\n", - " pct = (delta / cost_personalized) * 100\n", - " print(f\"\\n⏱️ Personalized response (user_withcontext) was ${delta:.4f} more expensive than plain LLM — a {pct:.1f}% cost increase.\")\n", - " print(\"📌 However, it returned a tailored response based on user memory, offering higher relevance.\")\n", - " except Exception as e:\n", - " print(\"\\n⚠️ Could not compute cost comparison:\", e)\n" - ] + "name": "stderr", + "output_type": "stream", + "text": [ + "c:\\Users\\PhilipLaussermair\\Desktop\\Code\\Internal\\sc recipe\\redis-ai-resources\\.venv\\Lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n" + ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "i3LSCGr3E1t8" - }, - "outputs": [], - "source": [ - "class AzureLLMClient:\n", - " def __init__(self, client, token_counter, gpt4_model=\"gpt-4o\", gpt4mini_model=\"gpt-4o-mini\"):\n", - " self.client = client\n", - " self.token_counter = token_counter\n", - " self.gpt4_model = gpt4_model\n", - " self.gpt4mini_model = gpt4mini_model\n", - "\n", - " def call_llm(self, prompt: str, model: str = \"gpt-4o\") -> Dict:\n", - " \"\"\"Call Azure OpenAI model and track latency, token usage, and cost\"\"\"\n", - " start_time = time.time()\n", - " response = self.client.chat.completions.create(\n", - " model=model,\n", - " messages=[{\"role\": \"user\", \"content\": prompt}],\n", - " temperature=0.7,\n", - " max_tokens=200\n", - " )\n", - " latency = (time.time() - start_time) * 1000\n", - "\n", - " output = response.choices[0].message.content\n", - " input_tokens = self.token_counter.count_tokens(prompt)\n", - " output_tokens = self.token_counter.count_tokens(output)\n", - "\n", - " return {\n", - " \"response\": output,\n", - " \"latency_ms\": round(latency, 2),\n", - " \"input_tokens\": input_tokens,\n", - " \"output_tokens\": output_tokens,\n", - " \"model\": model\n", - " }\n", - "\n", - " def call_gpt4(self, prompt: str) -> Dict:\n", - " return self.call_llm(prompt, model=self.gpt4_model)\n", - "\n", - " def call_gpt4mini(self, prompt: str) -> Dict:\n", - " return self.call_llm(prompt, model=self.gpt4mini_model)\n", - "\n", - " def personalize_response(self, cached_response: str, user_context: Dict, original_prompt: str) -> Dict:\n", - " context_prompt = self._build_context_prompt(cached_response, user_context, original_prompt)\n", - " start_time = time.time()\n", - " response = self.client.chat.completions.create(\n", - " model=self.gpt4mini_model,\n", - " messages=[\n", - " {\"role\": \"system\", \"content\": context_prompt},\n", - " {\"role\": \"user\", \"content\": \"Please personalize this cached response for the user. Keep your response under 3 sentences.\"}\n", - " ]\n", - " )\n", - " latency = (time.time() - start_time) * 1000 # ms\n", - " reply = response.choices[0].message.content\n", - "\n", - " input_tokens = response.usage.prompt_tokens\n", - " output_tokens = response.usage.completion_tokens\n", - " total_tokens = response.usage.total_tokens\n", - "\n", - " return {\n", - " \"response\": reply,\n", - " \"latency_ms\": round(latency, 2),\n", - " \"input_tokens\": input_tokens,\n", - " \"output_tokens\": output_tokens,\n", - " \"tokens\": total_tokens,\n", - " \"model\": self.gpt4mini_model\n", - " }\n", - "\n", - " def _build_context_prompt(self, cached_response: str, user_context: Dict, prompt: str) -> str:\n", - " context_parts = []\n", - " if user_context.get(\"preferences\"):\n", - " context_parts.append(\"User preferences: \" + \", \".join(user_context[\"preferences\"]))\n", - " if user_context.get(\"goals\"):\n", - " context_parts.append(\"User goals: \" + \", \".join(user_context[\"goals\"]))\n", - " if user_context.get(\"history\"):\n", - " context_parts.append(\"User history: \" + \", \".join(user_context[\"history\"]))\n", - " context_blob = \"\\n\".join(context_parts)\n", - " return f\"\"\"You are a personalization assistant. A cached response was previously generated for the prompt: \"{prompt}\".\n", - "\n", - "Here is the cached response:\n", - "\\\"\\\"\\\"{cached_response}\\\"\\\"\\\"\n", - "\n", - "Use the user's context below to personalize and refine the response:\n", - "{context_blob}\n", - "\n", - "Respond in a way that feels tailored to this user, adjusting tone, content, or suggestions as needed. Keep your response under 3 sentences no matter what.\n", - "\"\"\"\n", - "\n", - "\n", - " def query(self, prompt: str, user_id: str) -> str:\n", - " start = time.time()\n", - " embedding = self.generate_embedding(prompt)\n", - "\n", - " # Check for cached match\n", - " cached = self.search_cache(embedding)\n", - "\n", - " if cached:\n", - " # Personalize with user context using lightweight model\n", - " context = self.user_context.get(user_id, {})\n", - " if context:\n", - " injected_prompt = self._build_context_prompt(cached, context, prompt)\n", - " result = self.llm_client.call_gpt4mini(injected_prompt)\n", - " self.telemetry.log(\n", - " user_id=user_id,\n", - " method=\"context_query\",\n", - " latency_ms=result[\"latency_ms\"],\n", - " input_tokens=result[\"input_tokens\"],\n", - " output_tokens=result[\"output_tokens\"],\n", - " cache_status=\"miss\",\n", - " response_source=result[\"model\"]\n", - " )\n", - " return result[\"response\"]\n", - " else:\n", - " # Return raw cached result\n", - " latency = (time.time() - start) * 1000\n", - " self.telemetry.log(\n", - " user_id=user_id,\n", - " method=\"raw_cache_hit\",\n", - " latency_ms=latency,\n", - " input_tokens=0,\n", - " output_tokens=0,\n", - " cache_status=\"cache_hit_raw\",\n", - " response_source=\"none\"\n", - " )\n", - " return cached\n", - " else:\n", - " # Cold start with GPT-4o\n", - " result = self.llm_client.call_gpt4(prompt)\n", - " self.store_response(prompt, result[\"response\"], embedding, user_id)\n", - " self.telemetry.log(\n", - " user_id=user_id,\n", - " method=\"context_query\",\n", - " latency_ms=result[\"latency_ms\"],\n", - " input_tokens=result[\"input_tokens\"],\n", - " output_tokens=result[\"output_tokens\"],\n", - " cache_status=\"miss\",\n", - " response_source=result[\"model\"]\n", - " )\n", - " return result[\"response\"]\n" - ] + "name": "stdout", + "output_type": "stream", + "text": [ + "12:46:22 redisvl.index.index INFO Index already exists, overwriting.\n" + ] + } + ], + "source": [ + "import time\n", + "import uuid\n", + "import numpy as np\n", + "from typing import List, Dict\n", + "import redis\n", + "from sentence_transformers import SentenceTransformer\n", + "from redisvl.index import SearchIndex\n", + "from redisvl.utils.vectorize import HFTextVectorizer\n", + "from openai import AzureOpenAI\n", + "import tiktoken\n", + "import pandas as pd\n", + "from openai import AzureOpenAI, OpenAI\n", + "import logging\n", + "\n", + "# Suppress noisy loggers\n", + "logging.getLogger(\"sentence_transformers\").setLevel(logging.WARNING)\n", + "logging.getLogger(\"httpx\").setLevel(logging.WARNING)\n", + "\n", + "\n", + "# RedisVL index\n", + "index_config = {\n", + " \"index\": {\n", + " \"name\": \"cesc_index\",\n", + " \"prefix\": \"cesc\",\n", + " \"storage_type\": \"hash\"\n", + " },\n", + " \"fields\": [\n", + " {\n", + " \"name\": \"content_vector\",\n", + " \"type\": \"vector\",\n", + " \"attrs\": {\n", + " \"dims\": 384,\n", + " \"distance_metric\": \"cosine\",\n", + " \"algorithm\": \"hnsw\"\n", + " }\n", + " },\n", + " {\"name\": \"content\", \"type\": \"text\"},\n", + " {\"name\": \"user_id\", \"type\": \"tag\"},\n", + " {\"name\": \"prompt\", \"type\": \"text\"},\n", + " {\"name\": \"model\", \"type\": \"tag\"},\n", + " {\"name\": \"created_at\", \"type\": \"numeric\"},\n", + " ]\n", + "}\n", + "search_index = SearchIndex.from_dict(index_config)\n", + "# Connect using the redis_url defined in the previous cell\n", + "search_index.connect(redis_url)\n", + "search_index.create(overwrite=True)\n", + "\n", + "if use_azure:\n", + " client = AzureOpenAI(\n", + " azure_endpoint=os.getenv(\"AZURE_OPENAI_ENDPOINT\"),\n", + " api_key=os.getenv(\"AZURE_OPENAI_API_KEY\"),\n", + " api_version=os.getenv(\"AZURE_OPENAI_API_VERSION\")\n", + " )\n", + " MODEL_GPT4 = os.getenv(\"AZURE_OPENAI_MODEL_GPT4\", \"gpt-4o\")\n", + " MODEL_GPT4_MINI = os.getenv(\"AZURE_OPENAI_MODEL_GPT4_MINI\", \"gpt-4o-mini\")\n", + "else:\n", + " client = OpenAI(\n", + " api_key=os.getenv(\"OPENAI_API_KEY\")\n", + " )\n", + " MODEL_GPT4 = os.getenv(\"OPENAI_MODEL_GPT4\", \"gpt-4o\")\n", + " MODEL_GPT4_MINI = os.getenv(\"OPENAI_MODEL_GPT4_MINI\", \"gpt-4o-mini\")\n", + "\n", + "\n", + "# Embedding model + vectorizer\n", + "embedding_model = SentenceTransformer(\"all-MiniLM-L6-v2\")\n", + "vectorizer = HFTextVectorizer(model=\"all-MiniLM-L6-v2\")\n", + "\n", + "# Token counter\n", + "class TokenCounter:\n", + " def __init__(self, model_name=\"gpt-4o\"):\n", + " try:\n", + " self.encoding = tiktoken.encoding_for_model(model_name)\n", + " except KeyError:\n", + " self.encoding = tiktoken.get_encoding(\"cl100k_base\")\n", + "\n", + " def count_tokens(self, text: str) -> int:\n", + " if not text:\n", + " return 0\n", + " return len(self.encoding.encode(text))\n", + "\n", + "token_counter = TokenCounter()\n", + "\n", + "class TelemetryLogger:\n", + " def __init__(self):\n", + " self.logs = []\n", + "\n", + " def log(self, user_id, method, latency_ms, input_tokens, output_tokens, cache_status, response_source):\n", + " model = response_source # assume model name is passed as source, e.g., \"gpt-4o\" or \"gpt-4o-mini\"\n", + " cost = self.calculate_cost(model, input_tokens, output_tokens)\n", + " self.logs.append({\n", + " \"timestamp\": time.time(),\n", + " \"user_id\": user_id,\n", + " \"method\": method,\n", + " \"latency_ms\": latency_ms,\n", + " \"input_tokens\": input_tokens,\n", + " \"output_tokens\": output_tokens,\n", + " \"total_tokens\": input_tokens + output_tokens,\n", + " \"cache_status\": cache_status,\n", + " \"response_source\": response_source,\n", + " \"cost_usd\": cost\n", + " })\n", + "\n", + " # 💵 Real cost vs baseline cold-call cost\n", + " cost = self.calculate_cost(response_source, input_tokens, output_tokens)\n", + " baseline = self.calculate_cost(\"gpt-4o\", input_tokens, output_tokens)\n", + "\n", + " self.logs[-1][\"cost_usd\"] = cost\n", + " self.logs[-1][\"baseline_cost_usd\"] = baseline\n", + "\n", + " def show_logs(self):\n", + " return pd.DataFrame(self.logs)\n", + "\n", + " def summarize(self):\n", + " df = pd.DataFrame(self.logs)\n", + " if df.empty:\n", + " print(\"No telemetry yet.\")\n", + " return\n", + "\n", + " df[\"total_tokens\"] = df[\"input_tokens\"] + df[\"output_tokens\"]\n", + "\n", + " display(df[[\n", + " \"user_id\",\n", + " \"cache_status\",\n", + " \"latency_ms\",\n", + " \"response_source\",\n", + " \"input_tokens\",\n", + " \"output_tokens\",\n", + " \"total_tokens\"\n", + " ]])\n", + "\n", + " # Compare cold start vs personalized\n", + " try:\n", + " cold_latency = df.loc[df[\"user_id\"] == \"user_cold\", \"latency_ms\"].values[0]\n", + " cx_latency = df.loc[df[\"user_id\"] == \"user_withcontext\", \"latency_ms\"].values[0]\n", + "\n", + " if cx_latency < cold_latency:\n", + " delta = cold_latency - cx_latency\n", + " pct = (delta / cold_latency) * 100\n", + " print(f\"\\n⚡ Personalized response (user_withcontext) was faster than the plain LLM by {int(delta)} ms — a {pct:.1f}% speed boost.\")\n", + " else:\n", + " delta = cx_latency - cold_latency\n", + " pct = (delta / cx_latency) * 100\n", + " print(f\"\\n⏱️ Personalized response (user_withcontext) was {int(delta)} ms slower than the plain LLM — a {pct:.1f}% slowdown.\")\n", + " print(\"📌 However, it returned a tailored response based on user memory, offering higher relevance.\")\n", + " except Exception as e:\n", + " print(\"\\n⚠️ Could not compute latency comparison:\", e)\n", + "\n", + " def calculate_cost(self, model: str, input_tokens: int, output_tokens: int) -> float:\n", + " # Azure OpenAI pricing (per 1K tokens)\n", + " pricing = {\n", + " \"gpt-4o\": {\"input\": 0.005, \"output\": 0.015},\n", + " \"gpt-4o-mini\": {\"input\": 0.0015, \"output\": 0.003}\n", + " }\n", + "\n", + " if model not in pricing:\n", + " return 0.0\n", + "\n", + " input_cost = (input_tokens / 1000) * pricing[model][\"input\"]\n", + " output_cost = (output_tokens / 1000) * pricing[model][\"output\"]\n", + " return round(input_cost + output_cost, 6)\n", + "\n", + " def display_cost_summary(self):\n", + " df = self.show_logs()\n", + " if df.empty:\n", + " print(\"No telemetry logged yet.\")\n", + " return\n", + "\n", + " # Calculate savings per row\n", + " df[\"savings_usd\"] = df[\"baseline_cost_usd\"] - df[\"cost_usd\"]\n", + "\n", + " total_cost = df[\"cost_usd\"].sum()\n", + " baseline_cost = df[\"baseline_cost_usd\"].sum()\n", + " total_savings = df[\"savings_usd\"].sum()\n", + " savings_pct = (total_savings / baseline_cost * 100) if baseline_cost > 0 else 0\n", + "\n", + " # Display summary table\n", + " display(df[[\n", + " \"user_id\", \"cache_status\", \"response_source\",\n", + " \"input_tokens\", \"output_tokens\", \"latency_ms\",\n", + " \"cost_usd\", \"baseline_cost_usd\", \"savings_usd\"\n", + " ]])\n", + "\n", + " # 💸 Compare cost of plain LLM vs personalized\n", + " try:\n", + " cost_plain = df.loc[df[\"user_id\"] == \"user_cold\", \"cost_usd\"].values[0]\n", + " cost_personalized = df.loc[df[\"user_id\"] == \"user_withcontext\", \"cost_usd\"].values[0]\n", + "\n", + " print(f\"\\n🧾 Total Cost of Plain LLM Response: ${cost_plain:.4f}\")\n", + " print(f\"🧾 Total Cost of Personalized Response: ${cost_personalized:.4f}\")\n", + "\n", + " if cost_personalized < cost_plain:\n", + " delta = cost_plain - cost_personalized\n", + " pct = (delta / cost_plain) * 100\n", + " print(f\"\\n💡 Personalized response (user_withcontext) was cheaper than plain LLM by ${delta:.4f} — a {pct:.1f}% cost improvement.\")\n", + " else:\n", + " delta = cost_personalized - cost_plain\n", + " pct = (delta / cost_personalized) * 100\n", + " print(f\"\\n⏱️ Personalized response (user_withcontext) was ${delta:.4f} more expensive than plain LLM — a {pct:.1f}% cost increase.\")\n", + " print(\"📌 However, it returned a tailored response based on user memory, offering higher relevance.\")\n", + " except Exception as e:\n", + " print(\"\\n⚠️ Could not compute cost comparison:\", e)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "id": "i3LSCGr3E1t8" + }, + "outputs": [], + "source": [ + "class AzureLLMClient:\n", + " def __init__(self, client, token_counter, gpt4_model=\"gpt-4o\", gpt4mini_model=\"gpt-4o-mini\"):\n", + " self.client = client\n", + " self.token_counter = token_counter\n", + " self.gpt4_model = gpt4_model\n", + " self.gpt4mini_model = gpt4mini_model\n", + "\n", + " def call_llm(self, prompt: str, model: str = \"gpt-4o\") -> Dict:\n", + " \"\"\"Call Azure OpenAI model and track latency, token usage, and cost\"\"\"\n", + " start_time = time.time()\n", + " response = self.client.chat.completions.create(\n", + " model=model,\n", + " messages=[{\"role\": \"user\", \"content\": prompt}],\n", + " temperature=0.7,\n", + " max_tokens=200\n", + " )\n", + " latency = (time.time() - start_time) * 1000\n", + "\n", + " output = response.choices[0].message.content\n", + " input_tokens = self.token_counter.count_tokens(prompt)\n", + " output_tokens = self.token_counter.count_tokens(output)\n", + "\n", + " return {\n", + " \"response\": output,\n", + " \"latency_ms\": round(latency, 2),\n", + " \"input_tokens\": input_tokens,\n", + " \"output_tokens\": output_tokens,\n", + " \"model\": model\n", + " }\n", + "\n", + " def call_gpt4(self, prompt: str) -> Dict:\n", + " return self.call_llm(prompt, model=self.gpt4_model)\n", + "\n", + " def call_gpt4mini(self, prompt: str) -> Dict:\n", + " return self.call_llm(prompt, model=self.gpt4mini_model)\n", + "\n", + " def personalize_response(self, cached_response: str, user_context: Dict, original_prompt: str) -> Dict:\n", + " context_prompt = self._build_context_prompt(cached_response, user_context, original_prompt)\n", + " start_time = time.time()\n", + " response = self.client.chat.completions.create(\n", + " model=self.gpt4mini_model,\n", + " messages=[\n", + " {\"role\": \"system\", \"content\": context_prompt},\n", + " {\"role\": \"user\", \"content\": \"Please personalize this cached response for the user. Keep your response under 3 sentences.\"}\n", + " ]\n", + " )\n", + " latency = (time.time() - start_time) * 1000 # ms\n", + " reply = response.choices[0].message.content\n", + "\n", + " input_tokens = response.usage.prompt_tokens\n", + " output_tokens = response.usage.completion_tokens\n", + " total_tokens = response.usage.total_tokens\n", + "\n", + " return {\n", + " \"response\": reply,\n", + " \"latency_ms\": round(latency, 2),\n", + " \"input_tokens\": input_tokens,\n", + " \"output_tokens\": output_tokens,\n", + " \"tokens\": total_tokens,\n", + " \"model\": self.gpt4mini_model\n", + " }\n", + "\n", + " def _build_context_prompt(self, cached_response: str, user_context: Dict, prompt: str) -> str:\n", + " context_parts = []\n", + " if user_context.get(\"preferences\"):\n", + " context_parts.append(\"User preferences: \" + \", \".join(user_context[\"preferences\"]))\n", + " if user_context.get(\"goals\"):\n", + " context_parts.append(\"User goals: \" + \", \".join(user_context[\"goals\"]))\n", + " if user_context.get(\"history\"):\n", + " context_parts.append(\"User history: \" + \", \".join(user_context[\"history\"]))\n", + " context_blob = \"\\n\".join(context_parts)\n", + " return f\"\"\"You are a personalization assistant. A cached response was previously generated for the prompt: \"{prompt}\".\n", + "\n", + "Here is the cached response:\n", + "\\\"\\\"\\\"{cached_response}\\\"\\\"\\\"\n", + "\n", + "Use the user's context below to personalize and refine the response:\n", + "{context_blob}\n", + "\n", + "Respond in a way that feels tailored to this user, adjusting tone, content, or suggestions as needed. Keep your response under 3 sentences no matter what.\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "6APF2GQaE3fm" + }, + "outputs": [], + "source": [ + "from redisvl.query import VectorQuery\n", + "\n", + "class ContextEnabledSemanticCache:\n", + " def __init__(self, redis_index, vectorizer, llm_client: \"AzureLLMClient\", telemetry: \"TelemetryLogger\", cache_ttl: int = -1):\n", + " self.index = redis_index\n", + " self.vectorizer = vectorizer\n", + " self.llm = llm_client\n", + " self.telemetry = telemetry\n", + " self.user_memories: Dict[str, Dict] = {}\n", + " self.cache_ttl = cache_ttl # seconds, -1 for no expiry\n", + "\n", + " def add_user_memory(self, user_id: str, memory_type: str, content: str):\n", + " if user_id not in self.user_memories:\n", + " self.user_memories[user_id] = {\"preferences\": [], \"history\": [], \"goals\": []}\n", + " self.user_memories[user_id][memory_type].append(content)\n", + "\n", + " def get_user_memory(self, user_id: str) -> Dict:\n", + " return self.user_memories.get(user_id, {})\n", + "\n", + " def generate_embedding(self, text: str) -> List[float]:\n", + " # Disable progress bar for cleaner output\n", + " return self.vectorizer.embed(text, show_progress_bar=False)\n", + "\n", + "\n", + " def search_cache(\n", + " self,\n", + " embedding: List[float],\n", + " distance_threshold: float = 0.2, # Loosened for consistency\n", + " ):\n", + " \"\"\"\n", + " Find the best cached match and gate it by a distance threshold.\n", + " The score returned by RediSearch (HNSW + cosine) is a distance (lower is better).\n", + " We accept a hit if distance <= distance_threshold.\n", + " \"\"\"\n", + " return_fields = [\"content\", \"user_id\", \"prompt\", \"model\", \"created_at\"]\n", + " query = VectorQuery(\n", + " vector=embedding,\n", + " vector_field_name=\"content_vector\",\n", + " return_fields=return_fields,\n", + " num_results=1,\n", + " return_score=True,\n", + " )\n", + " results = self.index.query(query)\n", + "\n", + " if results:\n", + " first = results[0]\n", + " # Use 'vector_distance' which is the standard score field in redisvl\n", + " score = first.get(\"vector_distance\", None)\n", + " if score is not None and float(score) <= distance_threshold:\n", + " return {field: first[field] for field in return_fields}\n", + "\n", + " return None\n", + "\n", + " def store_response(self, prompt: str, response: str, embedding: List[float], user_id: str, model: str):\n", + " import numpy as np\n", + " vec_bytes = np.array(embedding, dtype=np.float32).tobytes()\n", + "\n", + " doc = {\n", + " \"content\": response,\n", + " \"content_vector\": vec_bytes,\n", + " \"user_id\": user_id,\n", + " \"prompt\": prompt,\n", + " \"model\": model,\n", + " \"created_at\": int(time.time())\n", + " }\n", + " \n", + " # Use a unique key for each entry and set TTL\n", + " key = f\"{self.index.prefix}:{uuid.uuid4()}\"\n", + " self.index.load([doc], keys=[key])\n", + " \n", + " if self.cache_ttl > 0:\n", + " # We need a direct redis-py client to set TTL on the hash key\n", + " redis_client = self.index.client\n", + " redis_client.expire(key, self.cache_ttl)\n", + "\n", + "\n", + " def query(self, prompt: str, user_id: str):\n", + " start_time = time.time()\n", + " embedding = self.generate_embedding(prompt)\n", + " cached_result = self.search_cache(embedding)\n", + "\n", + " if cached_result:\n", + " cached_response = cached_result[\"content\"]\n", + " user_context = self.get_user_memory(user_id)\n", + " if user_context:\n", + " result = self.llm.personalize_response(cached_response, user_context, prompt)\n", + " self.telemetry.log(\n", + " user_id=user_id,\n", + " method=\"context_query\",\n", + " latency_ms=result[\"latency_ms\"],\n", + " input_tokens=result[\"input_tokens\"],\n", + " output_tokens=result[\"output_tokens\"],\n", + " cache_status=\"hit_personalized\",\n", + " response_source=result[\"model\"]\n", + " )\n", + " return result[\"response\"]\n", + " else:\n", + " # Measure actual cache hit latency (embedding + Redis query time)\n", + " cache_latency = (time.time() - start_time) * 1000\n", + " self.telemetry.log(\n", + " user_id=user_id,\n", + " method=\"context_query\",\n", + " latency_ms=round(cache_latency, 2),\n", + " input_tokens=0,\n", + " output_tokens=0,\n", + " cache_status=\"hit_raw\",\n", + " response_source=\"cache\"\n", + " )\n", + " return cached_response\n", + "\n", + " else:\n", + " result = self.llm.call_llm(prompt)\n", + " self.store_response(prompt, result[\"response\"], embedding, user_id, result[\"model\"])\n", + " self.telemetry.log(\n", + " user_id=user_id,\n", + " method=\"context_query\",\n", + " latency_ms=result[\"latency_ms\"],\n", + " input_tokens=result[\"input_tokens\"],\n", + " output_tokens=result[\"output_tokens\"],\n", + " cache_status=\"miss\",\n", + " response_source=result[\"model\"]\n", + " )\n", + " return result[\"response\"]\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RgmW_S6s9Sy_" + }, + "source": [ + "## Scenario Setup: IT Support Dashboard Access\n", + "\n", + "We'll simulate three different approaches to handling the same IT support query:\n", + "- **User A (Cold)**: No cache, fresh LLM call every time\n", + "- **User B (No Context)**: Cache hit, but generic response \n", + "- **User C (With Context)**: Cache hit + personalization based on user memory\n", + "\n", + "The query: *A user in the finance department can't access the dashboard — what should I check?*\n", + "\n", + "### User Context Profile\n", + "User C represents an experienced IT support agent who:\n", + "- Specializes in finance department issues\n", + "- Has solved similar dashboard access problems before\n", + "- Uses specific tools and follows established troubleshooting patterns\n", + "- Needs responses tailored to their expertise level and current context" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" }, + "id": "zji4u12fgQZg", + "outputId": "cfc5cc09-381c-4d6e-8c43-0dcd98760edd" + }, + "outputs": [ { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "6APF2GQaE3fm" - }, - "outputs": [], - "source": [ - "from redisvl.query import VectorQuery\n", - "\n", - "class ContextEnabledSemanticCache:\n", - " def __init__(self, redis_index, vectorizer, llm_client: AzureLLMClient, telemetry: TelemetryLogger):\n", - " self.index = redis_index\n", - " self.vectorizer = vectorizer\n", - " self.llm = llm_client\n", - " self.telemetry = telemetry\n", - " self.user_memories: Dict[str, Dict] = {}\n", - "\n", - " def add_user_memory(self, user_id: str, memory_type: str, content: str):\n", - " if user_id not in self.user_memories:\n", - " self.user_memories[user_id] = {\"preferences\": [], \"history\": [], \"goals\": []}\n", - " self.user_memories[user_id][memory_type].append(content)\n", - "\n", - " def get_user_memory(self, user_id: str) -> Dict:\n", - " return self.user_memories.get(user_id, {})\n", - "\n", - " def generate_embedding(self, text: str) -> List[float]:\n", - " return self.vectorizer.embed(text)\n", - "\n", - "\n", - " def search_cache(self, embedding: List[float], threshold=0.85):\n", - " query = VectorQuery(\n", - " vector=embedding,\n", - " vector_field_name=\"content_vector\",\n", - " return_fields=[\"content\", \"user_id\"],\n", - " num_results=1,\n", - " return_score=True\n", - " )\n", - " results = self.index.query(query)\n", - "\n", - " if results:\n", - " first = results[0]\n", - " score = first.get(\"score\", None) or first.get(\"_score\", None) # fallback pattern\n", - " if score is None or score >= threshold:\n", - " return first[\"content\"]\n", - "\n", - " return None\n", - "\n", - " def store_response(self, prompt: str, response: str, embedding: List[float], user_id: str):\n", - " from redisvl.schema import IndexSchema # ensure schema imported\n", - "\n", - " # Convert embedding to bytes (float32)\n", - " import numpy as np\n", - " vec_bytes = np.array(embedding, dtype=np.float32).tobytes()\n", - "\n", - " doc = {\n", - " \"content\": response,\n", - " \"content_vector\": vec_bytes,\n", - " \"user_id\": user_id\n", - " }\n", - " self.index.load([doc]) # load does the insertion/upsert\n", - "\n", - " def query(self, prompt: str, user_id: str):\n", - " embedding = self.generate_embedding(prompt)\n", - " cached_response = self.search_cache(embedding)\n", - "\n", - " if cached_response:\n", - " user_context = self.get_user_memory(user_id)\n", - " if user_context:\n", - " result = self.llm.personalize_response(cached_response, user_context, prompt)\n", - " self.telemetry.log(\n", - " user_id=user_id,\n", - " method=\"context_query\",\n", - " latency_ms=result[\"latency_ms\"],\n", - " input_tokens=result[\"input_tokens\"],\n", - " output_tokens=result[\"output_tokens\"],\n", - " cache_status=\"hit_personalized\",\n", - " response_source=result[\"model\"]\n", - " )\n", - " return result[\"response\"]\n", - " else:\n", - " # You can choose to skip telemetry logging for raw hits or log a minimal version\n", - " self.telemetry.log(\n", - " user_id=user_id,\n", - " method=\"context_query\",\n", - " latency_ms=0,\n", - " input_tokens=0,\n", - " output_tokens=0,\n", - " cache_status=\"hit_raw\",\n", - " response_source=\"cache\"\n", - " )\n", - " return cached_response\n", - "\n", - " else:\n", - " result = self.llm.call_llm(prompt)\n", - " self.store_response(prompt, result[\"response\"], embedding, user_id)\n", - " self.telemetry.log(\n", - " user_id=user_id,\n", - " method=\"context_query\",\n", - " latency_ms=result[\"latency_ms\"],\n", - " input_tokens=result[\"input_tokens\"],\n", - " output_tokens=result[\"output_tokens\"],\n", - " cache_status=\"miss\",\n", - " response_source=result[\"model\"]\n", - " )\n", - " return result[\"response\"]\n", - "\n", - "telemetry_logger = TelemetryLogger()\n", - "# ✅ Initialize engine\n", - "cesc = ContextEnabledSemanticCache(\n", - " redis_index=search_index,\n", - " vectorizer=vectorizer,\n", - " llm_client=AzureLLMClient(client, token_counter, GPT4_MODEL, GPT4mini_MODEL),\n", - " telemetry=telemetry_logger\n", - ")\n" - ] + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "============================================================\n", + "🧊 Scenario 1: Plain LLM – cache miss\n", + "============================================================\n", + "First, ensure the user has the appropriate permissions or access rights to view the dashboard. Check if their role or group membership includes access to the dashboard. Additionally, verify that there are no technical issues, such as network restrictions or dashboard configuration errors.\n", + "\n", + "============================================================\n", + "📦 Scenario 2: Semantic Cache Hit – generic, extremely fast, no user memory\n", + "============================================================\n", + "First, ensure the user has the appropriate permissions or access rights to view the dashboard. Check if their role or group membership includes access to the dashboard. Additionally, verify that there are no technical issues, such as network restrictions or dashboard configuration errors.\n", + "\n", + "============================================================\n", + "🧠 Scenario 3: Context-Enabled Semantic Cache Hit – personalized with user memory\n", + "============================================================\n", + "First, check if the user has the correct 'finance_dashboard_viewer' role assigned and ensure there are no recent misconfigurations affecting their access. Since you're using Chrome on macOS, also verify that there are no network restrictions or issues with SSO that might be preventing the login. This should help you quickly resolve the issue for the finance team user.\n", + "\n" + ] + } + ], + "source": [ + "from IPython.display import clear_output, display, Markdown\n", + "clear_output(wait=True)\n", + "\n", + "# 🔁 Reset Redis index and telemetry (optional for rerun clarity)\n", + "search_index.delete()\n", + "search_index.create(overwrite=True)\n", + "\n", + "# Initialize telemetry and engine\n", + "telemetry_logger = TelemetryLogger()\n", + "cesc = ContextEnabledSemanticCache(\n", + " redis_index=search_index,\n", + " vectorizer=vectorizer,\n", + " llm_client=AzureLLMClient(client, token_counter, MODEL_GPT4, MODEL_GPT4_MINI),\n", + " telemetry=telemetry_logger,\n", + " cache_ttl=3600 # Expire cache entries after 1 hour\n", + ")\n", + "\n", + "def get_divider(title: str = \"\", width: int = 60) -> str:\n", + " line = \"=\" * width\n", + " if title:\n", + " return f\"\\n{line}\\n{title}\\n{line}\\n\"\n", + " else:\n", + " return f\"\\n{line}\\n\"\n", + "\n", + "# 🧪 Define demo prompt and users\n", + "prompt = \"A user in the finance department can't access the dashboard — what should I check? Answer in 2-3 sentences max.\"\n", + "users = {\n", + " \"cold\": \"user_cold\",\n", + " \"nocx\": \"user_nocontext\",\n", + " \"cx\": \"user_withcontext\"\n", + "}\n", + "\n", + "# 🧠 Add memory for personalized user (e.g., HR IT support agent)\n", + "cesc.add_user_memory(users[\"cx\"], \"preferences\", \"uses Chrome browser on macOS\")\n", + "cesc.add_user_memory(users[\"cx\"], \"goals\", \"resolve access issues efficiently for finance team users\")\n", + "cesc.add_user_memory(users[\"cx\"], \"history\", \"frequently resolves issues with 'finance_dashboard_viewer' role misconfigurations\")\n", + "cesc.add_user_memory(users[\"cx\"], \"history\", \"troubleshot recent problems with finance dashboard access and SSO\")\n", + "\n", + "# 🔍 Run prompt for each scenario and collect output\n", + "output_parts = []\n", + "\n", + "output_parts.append(get_divider(\"🧊 Scenario 1: Plain LLM – cache miss\"))\n", + "response_1 = cesc.query(prompt, user_id=users[\"cold\"])\n", + "output_parts.append(response_1 + \"\\n\")\n", + "\n", + "output_parts.append(get_divider(\"📦 Scenario 2: Semantic Cache Hit – generic, extremely fast, no user memory\"))\n", + "response_2 = cesc.query(prompt, user_id=users[\"nocx\"])\n", + "output_parts.append(response_2 + \"\\n\")\n", + "\n", + "output_parts.append(get_divider(\"🧠 Scenario 3: Context-Enabled Semantic Cache Hit – personalized with user memory\"))\n", + "response_3 = cesc.query(prompt, user_id=users[\"cx\"])\n", + "output_parts.append(response_3 + \"\\n\")\n", + "\n", + "# Print all collected output at once\n", + "print(\"\".join(output_parts))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gJ-fUMmY9X4V" + }, + "source": [ + "## Key Observations\n", + "\n", + "Notice the different response patterns:\n", + "\n", + "1. **Cold Start Response**: Comprehensive but generic, took longest time and highest cost\n", + "2. **Cache Hit Response**: Identical to cold start, near-instant retrieval, minimal cost\n", + "3. **Personalized Response**: Adapted for user's specific role, tools, and experience level\n", + "\n", + "The personalized response demonstrates how CESC can:\n", + "- Reference user's specific browser/OS (Chrome on macOS)\n", + "- Mention role-specific permissions (finance_dashboard_viewer role)\n", + "- Reference past experience (SSO troubleshooting history)\n", + "- Maintain professional tone appropriate for experienced IT staff" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 600 }, + "id": "zJdBei1UkQHO", + "outputId": "6df548bd-ec88-41b7-bf61-295e57d0cfbb" + }, + "outputs": [ { - "cell_type": "markdown", - "metadata": { - "id": "RgmW_S6s9Sy_" - }, - "source": [ - "## Scenario Setup: IT Support Dashboard Access\n", - "\n", - "We'll simulate three different approaches to handling the same IT support query:\n", - "- **User A (Cold)**: No cache, fresh LLM call every time\n", - "- **User B (No Context)**: Cache hit, but generic response \n", - "- **User C (With Context)**: Cache hit + personalization based on user memory\n", - "\n", - "The query: *A user in the finance department can't access the dashboard — what should I check?*\n", - "\n", - "### User Context Profile\n", - "User C represents an experienced IT support agent who:\n", - "- Specializes in finance department issues\n", - "- Has solved similar dashboard access problems before\n", - "- Uses specific tools and follows established troubleshooting patterns\n", - "- Needs responses tailored to their expertise level and current context" - ] + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "============================================================\n", + "📈 Telemetry Summary:\n", + "============================================================\n", + "\n" + ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "zji4u12fgQZg", - "outputId": "cfc5cc09-381c-4d6e-8c43-0dcd98760edd" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "============================================================\n", - "🧊 Scenario 1: Plain LLM – cache miss\n", - "============================================================\n", - "\n", - "First, verify the user's permissions and access rights to the dashboard in the system settings. Ensure they are assigned the correct role or group. Next, check for any connectivity issues, browser compatibility, or recent changes to the dashboard configuration that might affect access. \n", - "\n", - "\n", - "============================================================\n", - "📦 Scenario 2: Semantic Cache Hit – generic, no user memory\n", - "============================================================\n", - "\n", - "First, verify the user's permissions and access rights to the dashboard in the system settings. Ensure they are assigned the correct role or group. Next, check for any connectivity issues, browser compatibility, or recent changes to the dashboard configuration that might affect access. \n", - "\n", - "\n", - "============================================================\n", - "🧠 Scenario 3: Context-Enabled Semantic Cache Hit – personalized with user memory\n", - "============================================================\n", - "\n", - "First, check the user's permissions to ensure they have the 'finance_dashboard_viewer' role correctly assigned in the system settings. Since you’re using Chrome on macOS, confirm there are no browser compatibility issues and that your SSO is functioning properly. Lastly, review any recent configuration changes that might impact access to the dashboard. \n", - "\n" - ] - } + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
user_idcache_statuslatency_msresponse_sourceinput_tokensoutput_tokenstotal_tokens
0user_coldmiss1757.95gpt-4o254974
1user_nocontexthit_raw19.64cache000
2user_withcontexthit_personalized1795.41gpt-4o-mini22373296
\n", + "
" ], - "source": [ - "# 🔁 Reset Redis index and telemetry (optional for rerun clarity)\n", - "search_index.delete() # DANGER: removes all vectors\n", - "search_index.create(overwrite=True)\n", - "telemetry_logger.logs = []\n", - "\n", - "def print_divider(title: str = \"\", width: int = 60):\n", - " line = \"=\" * width\n", - " if title:\n", - " print(f\"\\n{line}\\n{title}\\n{line}\\n\")\n", - " else:\n", - " print(f\"\\n{line}\\n\")\n", - "\n", - "\n", - "# 🧪 Define demo prompt and users\n", - "prompt = \"A user in the finance department can't access the dashboard — what should I check? Answer in 2-3 sentences max.\"\n", - "users = {\n", - " \"cold\": \"user_cold\",\n", - " \"nocx\": \"user_nocontext\",\n", - " \"cx\": \"user_withcontext\"\n", - "}\n", - "\n", - "# 🧠 Add memory for personalized user (e.g., HR IT support agent)\n", - "cesc.add_user_memory(users[\"cx\"], \"preferences\", \"uses Chrome browser on macOS\")\n", - "cesc.add_user_memory(users[\"cx\"], \"goals\", \"resolve access issues efficiently for finance team users\")\n", - "cesc.add_user_memory(users[\"cx\"], \"history\", \"frequently resolves issues with 'finance_dashboard_viewer' role misconfigurations\")\n", - "cesc.add_user_memory(users[\"cx\"], \"history\", \"troubleshot recent problems with finance dashboard access and SSO\")\n", - "\n", - "# 🔍 Run prompt for each scenario\n", - "print_divider(\"🧊 Scenario 1: Plain LLM – cache miss\")\n", - "response_1 = cesc.query(prompt, user_id=users[\"cold\"])\n", - "print(response_1, \"\\n\")\n", - "\n", - "print_divider(\"📦 Scenario 2: Semantic Cache Hit – generic, extremely fast, no user memory\")\n", - "response_2 = cesc.query(prompt, user_id=users[\"nocx\"])\n", - "print(response_2, \"\\n\")\n", - "\n", - "print_divider(\"🧠 Scenario 3: Context-Enabled Semantic Cache Hit – personalized with user memory\")\n", - "response_3 = cesc.query(prompt, user_id=users[\"cx\"])\n", - "print(response_3, \"\\n\")" + "text/plain": [ + " user_id cache_status latency_ms response_source \\\n", + "0 user_cold miss 1757.95 gpt-4o \n", + "1 user_nocontext hit_raw 19.64 cache \n", + "2 user_withcontext hit_personalized 1795.41 gpt-4o-mini \n", + "\n", + " input_tokens output_tokens total_tokens \n", + "0 25 49 74 \n", + "1 0 0 0 \n", + "2 223 73 296 " ] + }, + "metadata": {}, + "output_type": "display_data" }, { - "cell_type": "markdown", - "metadata": { - "id": "gJ-fUMmY9X4V" - }, - "source": [ - "## Key Observations\n", - "\n", - "Notice the different response patterns:\n", - "\n", - "1. **Cold Start Response**: Comprehensive but generic, took longest time and highest cost\n", - "2. **Cache Hit Response**: Identical to cold start, near-instant retrieval, minimal cost\n", - "3. **Personalized Response**: Adapted for user's specific role, tools, and experience level\n", - "\n", - "The personalized response demonstrates how CESC can:\n", - "- Reference user's specific browser/OS (Chrome on macOS)\n", - "- Mention role-specific permissions (finance_dashboard_viewer role)\n", - "- Reference past experience (SSO troubleshooting history)\n", - "- Maintain professional tone appropriate for experienced IT staff" - ] + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "⏱️ Personalized response (user_withcontext) was 37 ms slower than the plain LLM — a 2.1% slowdown.\n", + "📌 However, it returned a tailored response based on user memory, offering higher relevance.\n", + "\n", + "============================================================\n", + "💸 Cost Breakdown:\n", + "============================================================\n", + "\n" + ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 600 - }, - "id": "zJdBei1UkQHO", - "outputId": "6df548bd-ec88-41b7-bf61-295e57d0cfbb" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "============================================================\n", - "📈 Telemetry Summary:\n", - "============================================================\n", - "\n" - ] - }, - { - "data": { - "application/vnd.google.colaboratory.intrinsic+json": { - "summary": "{\n \"name\": \"telemetry_logger\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"user_id\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"user_cold\",\n \"user_nocontext\",\n \"user_withcontext\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"cache_status\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"miss\",\n \"hit_raw\",\n \"hit_personalized\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"latency_ms\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 651.6840342016469,\n \"min\": 0.0,\n \"max\": 1283.51,\n \"num_unique_values\": 3,\n \"samples\": [\n 1283.51,\n 0.0,\n 838.04\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"response_source\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"gpt-4o\",\n \"cache\",\n \"gpt-4o-mini\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"input_tokens\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 122,\n \"min\": 0,\n \"max\": 224,\n \"num_unique_values\": 3,\n \"samples\": [\n 25,\n 0,\n 224\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"output_tokens\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 34,\n \"min\": 0,\n \"max\": 66,\n \"num_unique_values\": 3,\n \"samples\": [\n 50,\n 0,\n 66\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"total_tokens\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 150,\n \"min\": 0,\n \"max\": 290,\n \"num_unique_values\": 3,\n \"samples\": [\n 75,\n 0,\n 290\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", - "type": "dataframe" - }, - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
user_idcache_statuslatency_msresponse_sourceinput_tokensoutput_tokenstotal_tokens
0user_coldmiss1283.51gpt-4o255075
1user_nocontexthit_raw0.00cache000
2user_withcontexthit_personalized838.04gpt-4o-mini22466290
\n", - "
\n", - "
\n", - "\n", - "
\n", - " \n", - "\n", - " \n", - "\n", - " \n", - "
\n", - "\n", - "\n", - "
\n", - " \n", - "\n", - "\n", - "\n", - " \n", - "
\n", - "\n", - "
\n", - "
\n" - ], - "text/plain": [ - " user_id cache_status latency_ms response_source \\\n", - "0 user_cold miss 1283.51 gpt-4o \n", - "1 user_nocontext hit_raw 0.00 cache \n", - "2 user_withcontext hit_personalized 838.04 gpt-4o-mini \n", - "\n", - " input_tokens output_tokens total_tokens \n", - "0 25 50 75 \n", - "1 0 0 0 \n", - "2 224 66 290 " - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "⚡ Personalized response (user_withcontext) was faster than the plain LLM by 445 ms — a 34.7% speed boost.\n", - "None \n", - "\n", - "\n", - "============================================================\n", - "💸 Cost Breakdown:\n", - "============================================================\n", - "\n" - ] - }, - { - "data": { - "application/vnd.google.colaboratory.intrinsic+json": { - "summary": "{\n \"name\": \"telemetry_logger\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"user_id\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"user_cold\",\n \"user_nocontext\",\n \"user_withcontext\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"cache_status\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"miss\",\n \"hit_raw\",\n \"hit_personalized\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"response_source\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"gpt-4o\",\n \"cache\",\n \"gpt-4o-mini\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"input_tokens\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 122,\n \"min\": 0,\n \"max\": 224,\n \"num_unique_values\": 3,\n \"samples\": [\n 25,\n 0,\n 224\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"output_tokens\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 34,\n \"min\": 0,\n \"max\": 66,\n \"num_unique_values\": 3,\n \"samples\": [\n 50,\n 0,\n 66\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"latency_ms\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 651.6840342016469,\n \"min\": 0.0,\n \"max\": 1283.51,\n \"num_unique_values\": 3,\n \"samples\": [\n 1283.51,\n 0.0,\n 838.04\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"cost_usd\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.0004410332564935816,\n \"min\": 0.0,\n \"max\": 0.000875,\n \"num_unique_values\": 3,\n \"samples\": [\n 0.000875,\n 0.0,\n 0.000534\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"baseline_cost_usd\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.0010601061267627877,\n \"min\": 0.0,\n \"max\": 0.00211,\n \"num_unique_values\": 3,\n \"samples\": [\n 0.000875,\n 0.0,\n 0.00211\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"savings_usd\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.0009099040242428502,\n \"min\": 0.0,\n \"max\": 0.001576,\n \"num_unique_values\": 2,\n \"samples\": [\n 0.001576,\n 0.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", - "type": "dataframe" - }, - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
user_idcache_statusresponse_sourceinput_tokensoutput_tokenslatency_mscost_usdbaseline_cost_usdsavings_usd
0user_coldmissgpt-4o25501283.510.0008750.0008750.000000
1user_nocontexthit_rawcache000.000.0000000.0000000.000000
2user_withcontexthit_personalizedgpt-4o-mini22466838.040.0005340.0021100.001576
\n", - "
\n", - "
\n", - "\n", - "
\n", - " \n", - "\n", - " \n", - "\n", - " \n", - "
\n", - "\n", - "\n", - "
\n", - " \n", - "\n", - "\n", - "\n", - " \n", - "
\n", - "\n", - "
\n", - "
\n" - ], - "text/plain": [ - " user_id cache_status response_source input_tokens \\\n", - "0 user_cold miss gpt-4o 25 \n", - "1 user_nocontext hit_raw cache 0 \n", - "2 user_withcontext hit_personalized gpt-4o-mini 224 \n", - "\n", - " output_tokens latency_ms cost_usd baseline_cost_usd savings_usd \n", - "0 50 1283.51 0.000875 0.000875 0.000000 \n", - "1 0 0.00 0.000000 0.000000 0.000000 \n", - "2 66 838.04 0.000534 0.002110 0.001576 " - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "🧾 Total Cost of Plain LLM Response: $0.0009\n", - "🧾 Total Cost of Personalized Response: $0.0005\n", - "\n", - "💡 Personalized response (user_withcontext) was cheaper than plain LLM by $0.0003 — a 39.0% cost improvement.\n" - ] - } + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
user_idcache_statusresponse_sourceinput_tokensoutput_tokenslatency_mscost_usdbaseline_cost_usdsavings_usd
0user_coldmissgpt-4o25491757.950.0008600.000860.000000
1user_nocontexthit_rawcache0019.640.0000000.000000.000000
2user_withcontexthit_personalizedgpt-4o-mini223731795.410.0005530.002210.001657
\n", + "
" ], - "source": [ - "# 📊 Show telemetry summary\n", - "print_divider(\"📈 Telemetry Summary:\")\n", - "print(telemetry_logger.summarize(), \"\\n\")\n", - "\n", - "print_divider(\"💸 Cost Breakdown:\")\n", - "telemetry_logger.display_cost_summary()" + "text/plain": [ + " user_id cache_status response_source input_tokens \\\n", + "0 user_cold miss gpt-4o 25 \n", + "1 user_nocontext hit_raw cache 0 \n", + "2 user_withcontext hit_personalized gpt-4o-mini 223 \n", + "\n", + " output_tokens latency_ms cost_usd baseline_cost_usd savings_usd \n", + "0 49 1757.95 0.000860 0.00086 0.000000 \n", + "1 0 19.64 0.000000 0.00000 0.000000 \n", + "2 73 1795.41 0.000553 0.00221 0.001657 " ] + }, + "metadata": {}, + "output_type": "display_data" }, { - "cell_type": "markdown", - "metadata": { - "id": "natd_dr29bkH" - }, - "source": [ - "# Enterprise Significance & Large-Scale Impact\n", - "\n", - "## Production Metrics That Matter\n", - "\n", - "The results above demonstrate significant improvements across three critical enterprise metrics:\n", - "\n", - "### 💰 Cost Optimization\n", - "- **Immediate Savings**: 60-80% cost reduction on repeated queries\n", - "- **Scale Impact**: For enterprises processing 100K+ LLM queries daily, this translates to $1000s in monthly savings\n", - "- **Strategic Model Usage**: Expensive models (GPT-4o) for new content, efficient models (GPT-4o-mini) for personalization\n", - "\n", - "### ⚡ Performance Enhancement \n", - "- **Latency Reduction**: Cache hits respond in <100ms vs 2-5 seconds for cold calls\n", - "- **User Experience**: Sub-second responses feel instantaneous to end users\n", - "- **Scalability**: Redis can handle millions of vector operations per second\n", - "\n", - "### 🎯 Relevance & Personalization\n", - "- **Context Awareness**: Responses adapt to user roles, departments, and experience levels\n", - "- **Continuous Learning**: User memory grows with each interaction\n", - "- **Business Intelligence**: System learns organizational patterns and common solutions\n", - "\n", - "## ROI Calculations for Enterprise Deployment\n", - "\n", - "### Quantifiable Benefits\n", - "- **Cost Savings**: 60-80% reduction in LLM API costs\n", - "- **Productivity Gains**: 2-3x faster response times improve user productivity \n", - "- **Quality Improvement**: Consistent, personalized responses reduce error rates\n", - "- **Scalability**: Linear cost scaling vs exponential growth with pure LLM approaches\n", - "\n", - "### Investment Considerations\n", - "- **Infrastructure**: Redis Enterprise, vector compute resources\n", - "- **Development**: Initial implementation, integration with existing systems\n", - "- **Maintenance**: Ongoing optimization, user memory management\n", - "- **Training**: Staff education on new capabilities and best practices\n", - "\n", - "### Break-Even Analysis\n", - "For most enterprise deployments:\n", - "- **Break-even**: 3-6 months with >10K daily LLM queries\n", - "- **Positive ROI**: 200-400% in first year through combined cost savings and productivity gains\n", - "- **Compound Benefits**: Value increases as user memory and cache coverage grow\n", - "\n", - "The combination of semantic caching with user context represents a fundamental shift from generic AI responses to truly personalized, enterprise-aware intelligence that scales efficiently and cost-effectively." - ] - } - ], - "metadata": { - "colab": { - "provenance": [] - }, - "kernelspec": { - "display_name": ".venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.9" + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "🧾 Total Cost of Plain LLM Response: $0.0009\n", + "🧾 Total Cost of Personalized Response: $0.0006\n", + "\n", + "💡 Personalized response (user_withcontext) was cheaper than plain LLM by $0.0003 — a 35.7% cost improvement.\n" + ] } + ], + "source": [ + "def print_divider(title: str = \"\", width: int = 60):\n", + " line = \"=\" * width\n", + " if title:\n", + " print(f\"\\n{line}\\n{title}\\n{line}\\n\")\n", + " else:\n", + " print(f\"\\n{line}\\n\")\n", + "\n", + "# 📊 Show telemetry summary\n", + "print_divider(\"📈 Telemetry Summary:\")\n", + "telemetry_logger.summarize()\n", + "\n", + "print_divider(\"💸 Cost Breakdown:\")\n", + "telemetry_logger.display_cost_summary()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "natd_dr29bkH" + }, + "source": [ + "# Enterprise Significance & Large-Scale Impact\n", + "\n", + "## Production Metrics That Matter\n", + "\n", + "The results above demonstrate significant improvements across three critical enterprise metrics:\n", + "\n", + "### 💰 Cost Optimization\n", + "- **Immediate Savings**: 60-80% cost reduction on repeated queries\n", + "- **Scale Impact**: For enterprises processing 100K+ LLM queries daily, this translates to $1000s in monthly savings\n", + "- **Strategic Model Usage**: Expensive models (GPT-4o) for new content, efficient models (GPT-4o-mini) for personalization\n", + "\n", + "### ⚡ Performance Enhancement \n", + "- **Latency Reduction**: Cache hits respond in <100ms vs 2-5 seconds for cold calls\n", + "- **User Experience**: Sub-second responses feel instantaneous to end users\n", + "- **Scalability**: Redis can handle millions of vector operations per second\n", + "\n", + "### 🎯 Relevance & Personalization\n", + "- **Context Awareness**: Responses adapt to user roles, departments, and experience levels\n", + "- **Continuous Learning**: User memory grows with each interaction\n", + "- **Business Intelligence**: System learns organizational patterns and common solutions\n", + "\n", + "## ROI Calculations for Enterprise Deployment\n", + "\n", + "### Quantifiable Benefits\n", + "- **Cost Savings**: 60-80% reduction in LLM API costs\n", + "- **Productivity Gains**: 2-3x faster response times improve user productivity \n", + "- **Quality Improvement**: Consistent, personalized responses reduce error rates\n", + "- **Scalability**: Linear cost scaling vs exponential growth with pure LLM approaches\n", + "\n", + "### Investment Considerations\n", + "- **Infrastructure**: Redis Enterprise, vector compute resources\n", + "- **Development**: Initial implementation, integration with existing systems\n", + "- **Maintenance**: Ongoing optimization, user memory management\n", + "- **Training**: Staff education on new capabilities and best practices\n", + "\n", + "### Break-Even Analysis\n", + "For most enterprise deployments:\n", + "- **Break-even**: 3-6 months with >10K daily LLM queries\n", + "- **Positive ROI**: 200-400% in first year through combined cost savings and productivity gains\n", + "- **Compound Benefits**: Value increases as user memory and cache coverage grow\n", + "\n", + "The combination of semantic caching with user context represents a fundamental shift from generic AI responses to truly personalized, enterprise-aware intelligence that scales efficiently and cost-effectively." + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" }, - "nbformat": 4, - "nbformat_minor": 0 + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 0 }