APIM AI Gateway: Evaluate semantic caching for cost reduction #101

@mishraomp

Description

Summary

No llm-semantic-cache-lookup / llm-semantic-cache-store policies are configured, so every request — even one semantically similar to an earlier request — passes through to the backend, consuming tokens and incurring cost.

Gap

Semantic caching reduces TPM consumption and improves response latency for repeated/similar queries. In a government context with multiple users asking similar questions (e.g., policy Q&A, form assistance), caching could significantly reduce Azure OpenAI costs.

Proposed Implementation

  1. Infrastructure: Deploy Azure Managed Redis (Enterprise tier with RediSearch) in Canada East
  2. Model: Deploy an embeddings model (e.g., text-embedding-ada-002) in the shared AI Foundry Hub
  3. APIM Configuration:
    • Configure Redis as an APIM external cache
    • Add llm-semantic-cache-lookup in inbound (after content safety, before backend call)
    • Add llm-semantic-cache-store in outbound
    • Use <vary-by>@(context.Subscription.Id)</vary-by> for cross-tenant isolation
  4. Terraform: Add Redis module, embeddings model deployment, APIM external cache resource
  5. Make it opt-in per tenant via semantic_caching_enabled flag
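A sketch of the policy fragments for step 3, assuming the embeddings deployment from step 2 is registered as an APIM backend; the backend id, auth mode, threshold, and duration values below are illustrative starting points, not a definitive configuration:

```xml
<policies>
    <inbound>
        <!-- ... content safety policies run first ... -->
        <!-- Look up a semantically similar cached response before calling the
             backend. "embeddings-backend" is a placeholder backend id. -->
        <llm-semantic-cache-lookup
            score-threshold="0.05"
            embeddings-backend-id="embeddings-backend"
            embeddings-backend-auth="system-assigned">
            <!-- Partition the cache per subscription to prevent cross-tenant hits -->
            <vary-by>@(context.Subscription.Id)</vary-by>
        </llm-semantic-cache-lookup>
    </inbound>
    <outbound>
        <!-- Store the completion for reuse; duration is in seconds -->
        <llm-semantic-cache-store duration="120" />
    </outbound>
</policies>
```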

Key Considerations

  • Data Residency: Cached prompts/completions stored in Redis must remain in Canada East for BC Gov compliance
  • Cross-tenant isolation: Critical - must use <vary-by> with subscription ID to prevent cross-tenant cache hits
  • Score threshold: Start low at 0.05 so only near-duplicate prompts match, then tune based on cache hit ratio monitoring (higher thresholds risk false-positive cache hits)
  • Sensitivity: Government queries may be too varied/sensitive for caching to be cost-effective - needs POC
  • Cost: Azure Managed Redis Enterprise is not cheap - ROI depends on query repetition patterns

Prerequisites

  • Estimate current OpenAI TPM usage and query repetition patterns across tenants
  • Cost analysis: Redis Enterprise vs. token savings
  • POC with representative queries to measure cache hit rates
  • Confirm data residency requirements for cached AI responses

Severity

LOW-MEDIUM. High infrastructure overhead; value depends on query repetition patterns and cost-optimization goals.

Context

Identified during APIM multi-tenancy and AI gateway policy gap analysis (Feb 2026).

Metadata

Labels

architecture · enhancement (New feature or request) · microsoft (where the Microsoft team is involved and/or they lead the task)

Projects

Status

Backlog
