Commit f81aa24
Hackathon - Token Limits (#780)

Co-authored-by: Maedah Batool <[email protected]>

1 parent ca6797f commit f81aa24

1 file changed: 51 additions & 32 deletions
@@ -1,12 +1,13 @@
 # Cody Input and Output Token Limits

-For all models, Cody allows up to **4,000 tokens of output**, which is approximately **500-600** lines of code.
+<p class="subtitle">Learn about Cody's token limits and how to manage them.</p>

-For Claude 3 Sonnet or Opus models, Cody tracks two separate token limits:
-* @-mention context is limited to 30,000 tokens (~4,000 lines of code) and can be specified using the @-filename syntax. This context is explicitly defined by the user and is used to provide specific information to Cody.
-* Conversation context is limited to 15,000 tokens and includes user questions, system responses, and automatically retrieved context items. Apart from user questions, this context is generated automatically by Cody.
+For all models, Cody allows up to **4,000 tokens of output**, which is approximately **500-600** lines of code. For Claude 3 Sonnet or Opus models, Cody tracks two separate token limits:

-All other models are currently capped at **7,000 tokens** of shared context between `@-mention` context and chat history.
+- The @-mention context is limited to **30,000 tokens** (~4,000 lines of code) and can be specified using the @-filename syntax. The user explicitly defines this context, which provides specific information to Cody.
+- Conversation context is limited to **15,000 tokens**, including user questions, system responses, and automatically retrieved context items. Apart from user questions, Cody generates this context automatically.
+
+All other models are currently capped at **7,000 tokens** of shared context between the `@-mention` context and chat history.
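As a rough illustration of how the @-mention and conversation budgets above work, the sketch below estimates whether a set of @-mentioned files and a running chat history fit within the 30,000-token and 15,000-token limits. It assumes the `tiktoken` library's `cl100k_base` encoding as a stand-in for the model's own tokenizer, so the counts are approximate rather than what Cody actually enforces.

```python
# Approximate budget check for the Claude 3 Sonnet/Opus-style limits above.
# Assumption: tiktoken's cl100k_base encoding is only a proxy for the model's
# real tokenizer, so treat these counts as rough estimates.
import tiktoken

MENTION_LIMIT = 30_000       # @-mention context tokens
CONVERSATION_LIMIT = 15_000  # chat history + retrieved context tokens

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Return the approximate token count for a piece of text."""
    return len(enc.encode(text))

def fits_budgets(mentioned_files: list[str], conversation: list[str]) -> bool:
    """Check both budgets separately, mirroring the two tracked limits."""
    mention_tokens = sum(count_tokens(f) for f in mentioned_files)
    conversation_tokens = sum(count_tokens(m) for m in conversation)
    print(f"@-mention: {mention_tokens}/{MENTION_LIMIT} tokens")
    print(f"conversation: {conversation_tokens}/{CONVERSATION_LIMIT} tokens")
    return mention_tokens <= MENTION_LIMIT and conversation_tokens <= CONVERSATION_LIMIT

# Example: one mentioned file and a one-message chat history.
print(fits_budgets([open(__file__).read()], ["How does this script count tokens?"]))
```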

 Here's a detailed breakdown of the token limits by model:

@@ -20,6 +21,7 @@ Here's a detailed breakdown of the token limits by model:
 | claude-2.0 | 7,000 | shared | 4,000 |
 | claude-2.1 | 7,000 | shared | 4,000 |
 | claude-3 Haiku | 7,000 | shared | 4,000 |
+| claude-3.5 Haiku | 7,000 | shared | 4,000 |
 | **claude-3 Sonnet** | **15,000** | **30,000** | **4,000** |
 | **claude-3.5 Sonnet** | **15,000** | **30,000** | **4,000** |
 | **claude-3.5 Sonnet (New)** | **15,000** | **30,000** | **4,000** |
@@ -28,8 +30,6 @@ Here's a detailed breakdown of the token limits by model:
 | Google Gemini 1.5 Flash | 7,000 | shared | 4,000 |
 | Google Gemini 1.5 Pro | 7,000 | shared | 4,000 |

-
-
 </Tab>

 <Tab title="Pro">
@@ -42,6 +42,7 @@ Here's a detailed breakdown of the token limits by model:
 | claude-2.0 | 7,000 | shared | 4,000 |
 | claude-2.1 | 7,000 | shared | 4,000 |
 | claude-3 Haiku | 7,000 | shared | 4,000 |
+| claude-3.5 Haiku | 7,000 | shared | 4,000 |
 | **claude-3 Sonnet** | **15,000** | **30,000** | **4,000** |
 | **claude-3.5 Sonnet** | **15,000** | **30,000** | **4,000** |
 | **claude-3.5 Sonnet (New)** | **15,000** | **30,000** | **4,000** |
@@ -61,6 +62,7 @@ Here's a detailed breakdown of the token limits by model:
 | claude-2.0 | 7,000 | shared | 1,000 |
 | claude-2.1 | 7,000 | shared | 1,000 |
 | claude-3 Haiku | 7,000 | shared | 1,000 |
+| claude-3.5 Haiku | 7,000 | shared | 1,000 |
 | **claude-3 Sonnet** | **15,000** | **30,000** | **4,000** |
 | **claude-3.5 Sonnet** | **15,000** | **30,000** | **4,000** |
 | **claude-3.5 Sonnet (New)** | **15,000** | **30,000** | **4,000** |
@@ -69,49 +71,66 @@ Here's a detailed breakdown of the token limits by model:
 </Tab>
 </Tabs>

-<Callout type="info">For Cody Enterprise, the token limits are the standard limits. Exact token limits may vary depending on your deployment. Please contact your Sourcegraph representative. For more information on how Cody builds context, see our [docs here](/cody/core-concepts/context).</Callout>
+<br />
+
+<Callout type="info">For Cody Enterprise, the token limits are the standard limits. Exact token limits may vary depending on your deployment. Please get in touch with your Sourcegraph representative. For more information on how Cody builds context, see our [docs here](/cody/core-concepts/context).</Callout>

 ## What is a Context Window?

-A context window in large language models refers to the maximum number of tokens (words or subwords) that the model can process at once. This window determines how much context the model can consider when generating text or code.
+A context window in large language models refers to the maximum number of tokens (words or subwords) the model can process simultaneously. This window determines how much context the model can consider when generating text or code.

-Context windows exist due to computational limitations and memory constraints. Large language models have billions of parameters, and processing extremely long sequences of text can quickly become computationally expensive and memory-intensive. By limiting the context window, the model can operate more efficiently and make predictions in a reasonable amount of time.
+Context windows exist due to computational limitations and memory constraints. Large language models have billions of parameters, and processing extremely long sequences of text can quickly become computationally expensive and memory-intensive. Limiting the context window allows the model to operate more efficiently and make predictions in a reasonable amount of time.
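To make the idea of a token concrete, the short sketch below splits a sentence into subword tokens and compares the count against a context-window budget. It uses OpenAI's `tiktoken` library purely for illustration; other model families use different tokenizers, so the exact splits and counts will vary.

```python
# Illustration of tokens vs. a context window.
# Assumption: tiktoken's cl100k_base encoding; other models tokenize differently.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "A context window limits how many tokens the model can attend to at once."
token_ids = enc.encode(text)

# Show the individual subword pieces the tokenizer produced.
pieces = [enc.decode_single_token_bytes(t).decode("utf-8", errors="replace") for t in token_ids]
print(pieces)               # e.g. ['A', ' context', ' window', ' limits', ...]
print(len(token_ids), "tokens")

# A prompt only fits if its token count stays within the model's context window.
CONTEXT_WINDOW = 7_000      # hypothetical shared-context budget
print("fits:", len(token_ids) <= CONTEXT_WINDOW)
```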

 ## What is an Output Limit?

-**Output Limit** refers to the maximum number of tokens that a large language model can generate in a single response. This limit is typically set to ensure that the model's output remains manageable and relevant to the given context.
+**Output Limit** refers to the maximum number of tokens a large language model can generate in a single response. This limit is typically set to ensure the model's output remains manageable and relevant to the context.

-When a model generates text or code, it does so token by token, predicting the most likely next token based on the input context and its learned patterns. The output limit determines when the model should stop generating further tokens, even if it could potentially continue.
+When a model generates text or code, it does so token by token, predicting the most likely next token based on the input context and its learned patterns. The output limit determines when the model should stop generating further tokens, even if it could continue.

-The output limit helps to keep the generated text focused, concise, and manageable by preventing the model from going off-topic or generating excessively long responses, ensuring that the output can be efficiently processed and displayed by downstream applications or user interfaces while managing computational resources.
+The output limit helps to keep the generated text focused, concise, and manageable by preventing the model from going off-topic or generating excessively long responses. It also ensures that the output can be efficiently processed and displayed by downstream applications or user interfaces while managing computational resources.
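In API terms, the output limit usually surfaces as a maximum-output-tokens parameter on each request. The sketch below shows the general shape of such a call using the Anthropic Python SDK as an example; the model name and prompt are placeholders, and other providers expose an equivalent knob (for example, `max_tokens` in OpenAI-style chat APIs).

```python
# Cap a single response at 4,000 output tokens.
# Assumptions: the Anthropic Python SDK is installed, ANTHROPIC_API_KEY is set,
# and the model name/prompt below are placeholders for illustration.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=4000,  # output limit: generation stops once this many tokens are produced
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
)

print(response.content[0].text)
print(response.stop_reason)  # "max_tokens" when the cap cut the response short
```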

 ## Current foundation model limits

-Here is a table with the context window sizes and ouput limits for each of our [supported models](/cody/capabilities/supported-models).
+Here is a table with the context window sizes and output limits for each of our [supported models](/cody/capabilities/supported-models).

-| **Model** | **Context Window** | **Output Limit** |
-| --------------- | ------------------ | ---------------- |
-| gpt-3.5-turbo | 16,385 tokens | 4,096 tokens |
-| gpt-4 | 8,192 tokens | 4,096 tokens |
-| gpt-4-turbo | 128,000 tokens | 4,096 tokens |
-| claude instant | 100,000 tokens | 4,096 tokens |
-| claude-2.0 | 100,000 tokens | 4,096 tokens |
-| claude-2.1 | 200,000 tokens | 4,096 tokens |
-| claude-3 Haiku | 200,000 tokens | 4,096 tokens |
-| claude-3 Sonnet | 200,000 tokens | 4,096 tokens |
-| claude-3 Opus | 200,000 tokens | 4,096 tokens |
-| mixtral 8x7b | 32,000 tokens | 4,096 tokens |
+| **Model**        | **Context Window** | **Output Limit** |
+| ---------------- | ------------------ | ---------------- |
+| gpt-3.5-turbo    | 16,385 tokens      | 4,096 tokens     |
+| gpt-4            | 8,192 tokens       | 4,096 tokens     |
+| gpt-4-turbo      | 128,000 tokens     | 4,096 tokens     |
+| claude instant   | 100,000 tokens     | 4,096 tokens     |
+| claude-2.0       | 100,000 tokens     | 4,096 tokens     |
+| claude-2.1       | 200,000 tokens     | 4,096 tokens     |
+| claude-3 Haiku   | 200,000 tokens     | 4,096 tokens     |
+| claude-3.5 Haiku | 200,000 tokens     | 4,096 tokens     |
+| claude-3 Sonnet  | 200,000 tokens     | 4,096 tokens     |
+| claude-3 Opus    | 200,000 tokens     | 4,096 tokens     |
+| mixtral 8x7b     | 32,000 tokens      | 4,096 tokens     |

-<Callout type="info">These foundation model limits are the inherent limits of the LLM models themselves. For instance, Claude 3 models have a 200K context window compared to 8,192 for GPT-4.</Callout>
+<Callout type="info">These foundation model limits are the LLM models' inherent limits. For instance, Claude 3 models have a 200K context window compared to 8,192 for GPT-4.</Callout>

 ## Tradeoffs: Size, Accuracy, Latency and Cost

-So why doesn't Cody use the full available context window for each model? There are a few tradeoffs that we need to consider, namely, context size, retrieval accuracy, latency and costs.
+So why does Cody not use each model's full available context window? We need to consider a few tradeoffs, namely, context size, retrieval accuracy, latency, and costs.
+
+### Context Size
+
+A larger context window allows Cody to consider more information, potentially leading to more coherent and relevant outputs. However, in RAG-based systems like Cody, the value of increasing the context window is related to the precision and recall of the underlying retrieval mechanism.
+
+If the relevant files can be retrieved with high precision and added to an existing context window, expansion may not increase response quality. Conversely, some queries require a vast array of documents to synthesize the best response, so increasing the context window would be beneficial. We work to balance these nuances against increased latency and cost tradeoffs for input token lengths.
+
+### Retrieval Accuracy
+
+Not all context windows are created equal. Research shows that an LLM's ability to retrieve a fact from a context window can degrade dramatically as the size of the context window increases. This means it is important to put the relevant information into as few tokens as possible to avoid confusing the underlying LLM.
+
+As foundation models continue to improve, we see improved in-context retrieval, meaning that large context windows are becoming more viable. We are excited to bring these improvements to Cody.
+
+### Latency

-1. **Context Size**: A larger context window allows Cody to consider more information, potentially leading to more coherent and relevant outputs. However, in RAG based systems like Cody, the value of increasing the context window is related to the precision and recall of the underlying retrieval mechanism. If the relevant files can be retrieved with high precision and added to an existing context window, expansion may not actually increase response quality. Conversely, some queries require a vast array of documents to syntehsize the best possible response, so an increase in context window would be beneficial. We work to balance these nuances against the latency and cost tradeoffs that come with increased input token lengths.
+With a larger context window, the model needs to process more data, which can increase the latency or response time. The end user often experiences this as "time to first token" or how long they wait until they see an output start to stream.

-2. **Retrieval Accuracy**: Not all context windows are created equal. Research shows that an LLM's ability to retrieve a fact from a context window can degrade dramatically as the size of the context window increases. This means that it is important to put the relevant information into as few tokens as possible, so as not to confuse the underlying LLM. As foundation models continue to improve, we are seeing increased within context retrieval meaning that large context windows are becoming more viable. We are excited to bring these improvements to Cody.
+In some cases, longer latency is a worthy tradeoff for higher accuracy, but our research shows that this is very use-case and user-dependent.

-3. **Latency**: With a larger context window, the model needs to process more data, which can increase the latency or response time. This is often experienced by the end user as "time to first token" or how long the user waits until they see an output start to stream. In some cases longer latency is a worthy tradeoff for higher accuracy, but our research shows that this is very use case and user dependent.
+### Computational Cost

-4. **Computational Cost**: Finally, the costs of processing large context windows scale linearly with the context window size. In order to provide a high quality response at a reasonable cost to the user, Cody leverages our expertise in code based RAG to drive down the generation costs while maintaining output quality.
+Finally, the cost of processing large context windows scales linearly with the context window size. Cody leverages our expertise in code-based RAG to drive down generation costs while maintaining output quality, providing a high-quality response at a reasonable cost to the user.
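To make the latency tradeoff above concrete, "time to first token" can be measured directly by timing a streaming request. The sketch below does this with an OpenAI-style streaming chat call; the client setup and model name are illustrative assumptions, and any streaming LLM API can be timed the same way.

```python
# Measure "time to first token" on a streaming chat completion.
# Assumptions: the OpenAI Python SDK (>=1.0) is installed, OPENAI_API_KEY is set,
# and the model name below is a placeholder for illustration.
import time
from openai import OpenAI

client = OpenAI()
start = time.perf_counter()
first_token_at = None

stream = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Summarize what a context window is."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta and first_token_at is None:
        first_token_at = time.perf_counter()  # first visible output token arrived

if first_token_at is None:
    first_token_at = time.perf_counter()  # fallback if no content was streamed

print(f"time to first token: {first_token_at - start:.2f}s")
print(f"total response time: {time.perf_counter() - start:.2f}s")
```

Larger input contexts tend to push the time-to-first-token number up, which is one reason the limits above are kept well below each model's maximum.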
