
Commit 6922b90

committed
token usage
1 parent cef7d4c commit 6922b90

File tree

1 file changed: +30 -0 lines changed

articles/ai-services/openai/concepts/use-your-data.md

Lines changed: 30 additions & 0 deletions
@@ -415,8 +415,36 @@ When you chat with a model, providing a history of the chat will help the model
## Token usage estimation for Azure OpenAI On Your Data

Azure OpenAI On Your Data is a Retrieval Augmented Generation (RAG) service that leverages both a search service (such as Azure AI Search) and generation (Azure OpenAI models) to let users get answers to their questions based on provided data.
As part of this RAG pipeline, there are three steps at a high level (sketched in code after the list):
1. Reformulate the user query into a list of search intents. This is done by making a call to the model with a prompt that includes instructions, the user question, and conversation history. Let's call this an *intent prompt*.
2. For each intent, multiple document chunks are retrieved from the search service. After filtering out irrelevant chunks based on the user-specified strictness threshold, and reranking and aggregating the chunks based on internal logic, the user-specified number of document chunks is chosen.
3. These document chunks, along with the user question, conversation history, role information, and instructions, are sent to the model to generate the final model response. Let's call this the *generation prompt*.
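The following is a minimal, hypothetical sketch of these three steps in Python. None of the function names are part of the actual service API; they are placeholder stubs standing in for the service's internal logic.

```python
def reformulate_intents(question: str, history: list[str]) -> list[str]:
    # Call 1 (the *intent prompt*): the model turns the question and
    # conversation history into a list of search intents.
    return [f"search intent for: {question}"]  # placeholder

def retrieve_chunks(intent: str, strictness: int = 3, top_n: int = 5) -> list[str]:
    # The search service returns candidate chunks; irrelevant ones are
    # filtered by the strictness threshold and the top_n chunks are kept.
    return [f"document chunk matching {intent!r}"][:top_n]  # placeholder

def generate_answer(question: str, history: list[str], chunks: list[str]) -> str:
    # Call 2 (the *generation prompt*): chunks, question, history, role
    # information, and instructions produce the final model response.
    return "answer grounded in: " + "; ".join(chunks)  # placeholder

def on_your_data_turn(question: str, history: list[str]) -> str:
    intents = reformulate_intents(question, history)
    chunks = [c for intent in intents for c in retrieve_chunks(intent)]
    return generate_answer(question, history, chunks)

print(on_your_data_turn("What does my data say about pricing?", []))
```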
In total, there are two calls made to GPT:
* For the intent step, the token estimate for the *intent prompt* includes those for the user question, conversation history, and the instructions sent to the model for intent generation.
* For the generation step, the token estimate for the *generation prompt* includes those for the user question, conversation history, the retrieved list of document chunks, role information, and the instructions sent to the model for generation.
The model-generated output tokens (both intents and response) also need to be taken into account for the total token estimation.
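As a rough way to reproduce such an estimate, here is a minimal sketch using the open-source `tiktoken` tokenizer with the `cl100k_base` encoding used by the GPT-3.5 and GPT-4 model family; the strings below are placeholders for the actual prompt parts and model outputs.

```python
import tiktoken  # pip install tiktoken

# cl100k_base is the encoding used by the GPT-3.5 and GPT-4 model family.
encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(encoding.encode(text))

# Placeholder strings standing in for the real prompt parts and outputs.
intent_prompt = "instructions + user question + conversation history"
intent_output = "search intents generated by the first call"
generation_prompt = "instructions + question + history + chunks + role information"
response_output = "final model response from the second call"

total = sum(count_tokens(part) for part in
            (intent_prompt, intent_output, generation_prompt, response_output))
print(f"Estimated total tokens for this turn: {total}")
```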
| Model | Generation prompt token count | Intent prompt token count | Response token count | Intent token count |
|--|--|--|--|--|
| gpt-35-turbo-16k | 4297 | 1366 | 111 | 25 |
| gpt-4-0613 | 3997 | 1385 | 118 | 18 |
| gpt-4-1106-preview | 4538 | 811 | 119 | 27 |
| gpt-35-turbo-1106 | 4854 | 1372 | 110 | 26 |
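For example, with the gpt-4-0613 measurements above, the estimated total for the turn is 3997 + 1385 + 118 + 18 = 5518 tokens.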
<!--
| Model | Max tokens for system message | Max tokens for model response |
|--|--|--|
| GPT-35-0301 | 400 | 1500 |
@@ -447,6 +475,8 @@ class TokenEstimator(object):
token_output = TokenEstimator.estimate_tokens(input_text)
```
-->
## Troubleshooting

### Failed ingestion jobs
