Hi team 👋
First of all, thanks for publishing this demo — it’s a great reference for building a RAG solution on Azure.
I’m currently evaluating this architecture for a scenario where the application would need to scale to thousands of concurrent users, and I had a question/concern regarding token consumption and TPM limits as conversations grow.
Current flow
From what I see, the current flow involves:
- An LLM call to rewrite the user query
- A second LLM call to generate the final response, using:
  - the rewritten query
  - context retrieved from Azure AI Search
In both cases, the full conversation history is sent to the model.
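
To make sure I'm reading the flow correctly, here is a minimal sketch of the pattern I'm describing (this is not the demo's actual code; the deployment name, index name, and the `content` field are placeholders/assumptions on my part):

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI

# Placeholder endpoints, keys, deployment and index names
openai_client = AzureOpenAI(
    azure_endpoint="https://<aoai-resource>.openai.azure.com",
    api_key="<aoai-key>",
    api_version="2024-06-01",
)
search_client = SearchClient(
    endpoint="https://<search-resource>.search.windows.net",
    index_name="<index-name>",
    credential=AzureKeyCredential("<search-key>"),
)

def answer(history: list[dict], question: str) -> str:
    # Call 1: rewrite the follow-up question into a standalone search query,
    # sending the full conversation history along with it.
    rewrite = openai_client.chat.completions.create(
        model="<chat-deployment>",
        messages=[{"role": "system", "content": "Rewrite the user's last question as a standalone search query."}]
        + history
        + [{"role": "user", "content": question}],
    ).choices[0].message.content

    # Retrieve grounding documents from Azure AI Search using the rewritten query
    sources = "\n".join(doc["content"] for doc in search_client.search(search_text=rewrite, top=3))

    # Call 2: generate the final answer, again sending the full history
    # plus the retrieved sources.
    response = openai_client.chat.completions.create(
        model="<chat-deployment>",
        messages=[{"role": "system", "content": f"Answer the question using only these sources:\n{sources}"}]
        + history
        + [{"role": "user", "content": question}],
    )
    return response.choices[0].message.content
```

The key point is that `history` is sent in full to both calls on every turn.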
Scalability concern
While this works well for short conversations, I’m concerned about what happens over time:
- Early interactions are relatively cheap in terms of tokens
- As the conversation grows, each new user question includes an increasingly large history
- Token usage per request can grow significantly, even for simple follow-up questions
- At scale (thousands of users), this could:
  - Drive token costs up quickly
  - Lead to `rateLimitException` errors due to TPM saturation, even if individual users are low-volume (rough math sketch below)
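
To make the growth concrete, a rough back-of-the-envelope sketch (the per-turn token counts are invented purely for illustration, and it ignores the system prompt and the retrieved context):

```python
# Hypothetical average sizes, purely for illustration
USER_TOKENS = 150       # tokens in a typical user question
ASSISTANT_TOKENS = 300  # tokens in a typical assistant answer

def prompt_tokens_for_turn(turn: int) -> int:
    """Prompt tokens sent on the Nth user turn when the full history is resent."""
    history = (turn - 1) * (USER_TOKENS + ASSISTANT_TOKENS)
    # The two-call flow (rewrite + answer) resends the history twice per turn
    return 2 * (history + USER_TOKENS)

for turn in (1, 5, 10, 20):
    print(f"turn {turn:>2}: ~{prompt_tokens_for_turn(turn):,} prompt tokens")
# turn  1: ~300 prompt tokens
# turn  5: ~3,900 prompt tokens
# turn 10: ~8,400 prompt tokens
# turn 20: ~17,400 prompt tokens
```

Per-request usage grows linearly with the turn number, so cumulative usage per conversation grows roughly quadratically; multiplied across thousands of concurrent users, that eats into a TPM quota quickly.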
I’m aware that Azure OpenAI / Foundry allows configuring very high TPM limits (up to millions), but I’m wondering:
- Is the expectation that scaling is mainly handled by increasing TPM quotas?
- Or has the team considered token optimization strategies, such as:
  - Limiting or summarizing conversation history
  - Using rolling context windows (rough sketch after this list)
  - Avoiding sending the full history to the query-rewrite step
  - More aggressively separating “retrieval context” from conversational context
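
For reference, this is roughly what I mean by a rolling window. It's just a sketch under my own assumptions (a fixed token budget, `tiktoken` with the `cl100k_base` encoding, and messages as `role`/`content` dicts), not something the demo does today:

```python
import tiktoken

ENCODING = tiktoken.get_encoding("cl100k_base")
HISTORY_TOKEN_BUDGET = 1500  # assumed budget for prior turns

def trim_history(history: list[dict]) -> list[dict]:
    """Keep only the most recent messages that fit within the token budget."""
    kept: list[dict] = []
    used = 0
    for message in reversed(history):  # walk from newest to oldest
        cost = len(ENCODING.encode(message["content"]))
        if used + cost > HISTORY_TOKEN_BUDGET:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order
```

Both the query-rewrite call and the answer-generation call would then receive `trim_history(history)` instead of the full history, which caps prompt size per request regardless of conversation length.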
Question
I’d love to understand:
- Whether this tradeoff was already considered in the design
- If there are recommended best practices for adapting this demo to large-scale, multi-user production scenarios
Thanks again for the great sample, and looking forward to your thoughts!