Skip to content

About token consumption & TPM scalability for large user bases #2963

@Kevinjvn

Description

@Kevinjvn

Hi team 👋

First of all, thanks for publishing this demo — it’s a great reference for building a RAG solution on Azure.

I’m currently evaluating this architecture for a scenario where the application would need to scale to thousands of concurrent users, and I had a question/concern regarding token consumption and TPM limits as conversations grow.

Current flow

From what I see, the current flow involves:

  1. An LLM call to rewrite the user query
  2. A second LLM call to generate the final response, using:
    • the rewritten query
    • context retrieved from Azure AI Search

In both cases, the full conversation history is sent to the model.

Scalability concern

While this works well for short conversations, I’m concerned about what happens over time:

  • Early interactions are relatively cheap in terms of tokens
  • As the conversation grows, each new user question includes an increasingly large history
  • Token usage per request can grow significantly, even for simple follow-up questions
  • At scale (thousands of users), this could:
    • Drive token costs up quickly
    • Lead to rateLimitException errors due to TPM saturation, even if individual users are low-volume

I’m aware that Azure OpenAI / Foundry allows configuring very high TPM limits (up to millions), but I’m wondering:

  • Is the expectation that scaling is mainly handled by increasing TPM quotas?
  • Or has the team considered token optimization strategies, such as:
    • Limiting or summarizing conversation history
    • Using rolling context windows
    • Avoiding sending full history to the query-rewrite step
    • More aggressively separating “retrieval context” from conversational context

Question

I’d love to understand:

  • Whether this tradeoff was already considered in the design
  • If there are recommended best practices for adapting this demo to large-scale, multi-user production scenarios

Thanks again for the great sample, and looking forward to your thoughts!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions