Hi team 👋
First of all, thanks for publishing this demo — it’s a great reference for building a RAG solution on Azure.
I’m currently evaluating this architecture for a scenario where the application would need to scale to thousands of concurrent users, and I had a question/concern regarding token consumption and TPM limits as conversations grow.
Current flow
From what I see, the current flow involves:
- An LLM call to rewrite the user query
- A second LLM call to generate the final response, using:
  - the rewritten query
  - context retrieved from Azure AI Search
In both cases, the full conversation history is sent to the model.
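
To make sure I'm reading the flow correctly, here is a minimal sketch of the pattern I'm describing (this is not the demo's actual code; the deployment name, index name, and the `content` field are placeholders/assumptions on my part):

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI

# Placeholder endpoints, keys, deployment and index names
openai_client = AzureOpenAI(
    azure_endpoint="https://<aoai-resource>.openai.azure.com",
    api_key="<aoai-key>",
    api_version="2024-06-01",
)
search_client = SearchClient(
    endpoint="https://<search-resource>.search.windows.net",
    index_name="<index-name>",
    credential=AzureKeyCredential("<search-key>"),
)

def answer(history: list[dict], question: str) -> str:
    # Call 1: rewrite the follow-up question into a standalone search query,
    # sending the full conversation history along with it.
    rewrite = openai_client.chat.completions.create(
        model="<chat-deployment>",
        messages=[{"role": "system", "content": "Rewrite the user's last question as a standalone search query."}]
        + history
        + [{"role": "user", "content": question}],
    ).choices[0].message.content

    # Retrieve grounding documents from Azure AI Search using the rewritten query
    sources = "\n".join(doc["content"] for doc in search_client.search(search_text=rewrite, top=3))

    # Call 2: generate the final answer, again sending the full history
    # plus the retrieved sources.
    response = openai_client.chat.completions.create(
        model="<chat-deployment>",
        messages=[{"role": "system", "content": f"Answer the question using only these sources:\n{sources}"}]
        + history
        + [{"role": "user", "content": question}],
    )
    return response.choices[0].message.content
```

The key point is that `history` is sent in full to both calls on every turn.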
Scalability concern
While this works well for short conversations, I’m concerned about what happens over time:
- Early interactions are relatively cheap in terms of tokens
- As the conversation grows, each new user question includes an increasingly large history
- Token usage per request can grow significantly, even for simple follow-up questions
- At scale (thousands of users), this could:
  - Drive token costs up quickly
  - Lead to `rateLimitException` errors due to TPM saturation, even if individual users are low-volume (rough math sketch below)
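
To make the growth concrete, a rough back-of-the-envelope sketch (the per-turn token counts are invented purely for illustration, and it ignores the system prompt and the retrieved context):

```python
# Hypothetical average sizes, purely for illustration
USER_TOKENS = 150       # tokens in a typical user question
ASSISTANT_TOKENS = 300  # tokens in a typical assistant answer

def prompt_tokens_for_turn(turn: int) -> int:
    """Prompt tokens sent on the Nth user turn when the full history is resent."""
    history = (turn - 1) * (USER_TOKENS + ASSISTANT_TOKENS)
    # The two-call flow (rewrite + answer) resends the history twice per turn
    return 2 * (history + USER_TOKENS)

for turn in (1, 5, 10, 20):
    print(f"turn {turn:>2}: ~{prompt_tokens_for_turn(turn):,} prompt tokens")
# turn  1: ~300 prompt tokens
# turn  5: ~3,900 prompt tokens
# turn 10: ~8,400 prompt tokens
# turn 20: ~17,400 prompt tokens
```

Per-request usage grows linearly with the turn number, so cumulative usage per conversation grows roughly quadratically; multiplied across thousands of concurrent users, that eats into a TPM quota quickly.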
I’m aware that Azure OpenAI / Foundry allows configuring very high TPM limits (up to millions), but I’m wondering:
- Is the expectation that scaling is mainly handled by increasing TPM quotas?
- Or has the team considered token optimization strategies, such as:
  - Limiting or summarizing conversation history
  - Using rolling context windows (rough sketch after this list)
  - Avoiding sending the full history to the query-rewrite step
  - More aggressively separating “retrieval context” from conversational context
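
For reference, this is roughly what I mean by a rolling window. It's just a sketch under my own assumptions (a fixed token budget, `tiktoken` with the `cl100k_base` encoding, and messages as `role`/`content` dicts), not something the demo does today:

```python
import tiktoken

ENCODING = tiktoken.get_encoding("cl100k_base")
HISTORY_TOKEN_BUDGET = 1500  # assumed budget for prior turns

def trim_history(history: list[dict]) -> list[dict]:
    """Keep only the most recent messages that fit within the token budget."""
    kept: list[dict] = []
    used = 0
    for message in reversed(history):  # walk from newest to oldest
        cost = len(ENCODING.encode(message["content"]))
        if used + cost > HISTORY_TOKEN_BUDGET:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order
```

Both the query-rewrite call and the answer-generation call would then receive `trim_history(history)` instead of the full history, which caps prompt size per request regardless of conversation length.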
Question
I’d love to understand:
- Whether this tradeoff was already considered in the design
- If there are recommended best practices for adapting this demo to large-scale, multi-user production scenarios
Thanks again for the great sample, and looking forward to your thoughts!