Labels: architecture, enhancement (New feature or request), microsoft (Microsoft team is involved and/or leads the task)
Description
Summary
No `llm-semantic-cache-lookup` / `llm-semantic-cache-store` policies are configured. Every semantically similar request goes through to the backend, consuming tokens and incurring costs.
Gap
Semantic caching reduces TPM consumption and improves response latency for repeated/similar queries. In a government context with multiple users asking similar questions (e.g., policy Q&A, form assistance), caching could significantly reduce Azure OpenAI costs.
Proposed Implementation
- Infrastructure: Deploy Azure Managed Redis (Enterprise tier with RediSearch) in Canada East
- Model: Deploy an embeddings model (e.g., `text-embedding-ada-002`) in the shared AI Foundry Hub
- APIM Configuration:
  - Configure Redis as an APIM external cache
  - Add `llm-semantic-cache-lookup` in inbound (after content safety, before the backend call)
  - Add `llm-semantic-cache-store` in outbound
  - Use `<vary-by>@(context.Subscription.Id)</vary-by>` for cross-tenant isolation
- Terraform: Add the Redis module, embeddings model deployment, and APIM external cache resource
- Make it opt-in per tenant via a `semantic_caching_enabled` flag
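The policy wiring above could look roughly like the following APIM policy fragment. This is a sketch, not the final configuration: the embeddings backend id `embeddings-backend` and the cache `duration` are illustrative assumptions, and the 0.05 score threshold is the starting value discussed under Key Considerations.

```xml
<policies>
    <inbound>
        <base />
        <!-- Lookup runs after content safety checks, before the backend call.
             score-threshold 0.05 is deliberately strict; tune against hit-ratio metrics. -->
        <llm-semantic-cache-lookup
            score-threshold="0.05"
            embeddings-backend-id="embeddings-backend"
            embeddings-backend-auth="system-assigned">
            <!-- Partition the cache by subscription so tenants never share hits. -->
            <vary-by>@(context.Subscription.Id)</vary-by>
        </llm-semantic-cache-lookup>
    </inbound>
    <outbound>
        <!-- Store the completion on the way out; duration (seconds) is a placeholder. -->
        <llm-semantic-cache-store duration="120" />
        <base />
    </outbound>
</policies>
```

The lookup policy requires Redis to already be registered as the APIM external cache, and the per-tenant `semantic_caching_enabled` flag would gate whether Terraform renders this fragment into a tenant's policy at all.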
Key Considerations
- Data Residency: Cached prompts/completions stored in Redis must remain in Canada East for BC Gov compliance
- Cross-tenant isolation: Critical; must use `<vary-by>` with the subscription ID to prevent cross-tenant cache hits
- Score threshold: Start with 0.05 (low) and tune based on cache hit ratio monitoring
- Sensitivity: Government queries may be too varied or sensitive for caching to be cost-effective; needs a POC
- Cost: Azure Managed Redis Enterprise is not cheap; ROI depends on query repetition patterns
Prerequisites
- Estimate current OpenAI TPM usage and query repetition patterns across tenants
- Cost analysis: Redis Enterprise vs. token savings
- POC with representative queries to measure cache hit rates
- Confirm data residency requirements for cached AI responses
Severity
LOW-MEDIUM: high infrastructure overhead; value depends on query patterns and cost optimization needs.
Context
Identified during APIM multi-tenancy and AI gateway policy gap analysis (Feb 2026).