APIM AI Gateway: Evaluate semantic caching for cost reduction #101

@mishraomp

Description

Summary

No llm-semantic-cache-lookup / llm-semantic-cache-store policies are configured, so every request — even one semantically similar to an earlier request — passes through to the backend, consuming tokens and incurring cost.

Gap

Semantic caching reduces TPM consumption and improves response latency for repeated/similar queries. In a government context with multiple users asking similar questions (e.g., policy Q&A, form assistance), caching could significantly reduce Azure OpenAI costs.

Proposed Implementation

  1. Infrastructure: Deploy Azure Managed Redis (Enterprise tier with RediSearch) in Canada East
  2. Model: Deploy an embeddings model (e.g., text-embedding-ada-002) in the shared AI Foundry Hub
  3. APIM Configuration:
    • Configure Redis as an APIM external cache
    • Add llm-semantic-cache-lookup in inbound (after content safety, before backend call)
    • Add llm-semantic-cache-store in outbound
    • Use <vary-by>@(context.Subscription.Id)</vary-by> for cross-tenant isolation
  4. Terraform: Add Redis module, embeddings model deployment, APIM external cache resource
  5. Make it opt-in per tenant via semantic_caching_enabled flag
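A sketch of the policy fragments for step 3, assuming the embeddings deployment from step 2 is registered as an APIM backend; the backend id, auth mode, threshold, and duration values below are illustrative starting points, not a definitive configuration:

```xml
<policies>
    <inbound>
        <!-- ... content safety policies run first ... -->
        <!-- Look up a semantically similar cached response before calling the
             backend. "embeddings-backend" is a placeholder backend id. -->
        <llm-semantic-cache-lookup
            score-threshold="0.05"
            embeddings-backend-id="embeddings-backend"
            embeddings-backend-auth="system-assigned">
            <!-- Partition the cache per subscription to prevent cross-tenant hits -->
            <vary-by>@(context.Subscription.Id)</vary-by>
        </llm-semantic-cache-lookup>
    </inbound>
    <outbound>
        <!-- Store the completion for reuse; duration is in seconds -->
        <llm-semantic-cache-store duration="120" />
    </outbound>
</policies>
```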

Key Considerations

  • Data Residency: Cached prompts/completions stored in Redis must remain in Canada East for BC Gov compliance
  • Cross-tenant isolation: Critical - must use <vary-by> with subscription ID to prevent cross-tenant cache hits
  • Score threshold: Start low at 0.05 so only near-duplicate prompts match, then tune based on cache hit ratio monitoring (higher thresholds risk false-positive cache hits)
  • Sensitivity: Government queries may be too varied/sensitive for caching to be cost-effective - needs POC
  • Cost: Azure Managed Redis Enterprise is not cheap - ROI depends on query repetition patterns

Prerequisites

  • Estimate current OpenAI TPM usage and query repetition patterns across tenants
  • Cost analysis: Redis Enterprise vs. token savings
  • POC with representative queries to measure cache hit rates
  • Confirm data residency requirements for cached AI responses

Severity

LOW-MEDIUM. High infrastructure overhead; value depends on query repetition patterns and cost-optimization goals.

Context

Identified during APIM multi-tenancy and AI gateway policy gap analysis (Feb 2026).

Metadata

Labels

architecture · enhancement (New feature or request) · microsoft (where the Microsoft team is involved and/or they lead the task)

Projects

Status

Backlog
