Commit f81aa24
Hackathon - Token Limits (#780)

Co-authored-by: Maedah Batool <[email protected]>

1 parent ca6797f commit f81aa24

1 file changed: 51 additions & 32 deletions
@@ -1,12 +1,13 @@
 # Cody Input and Output Token Limits

-For all models, Cody allows up to **4,000 tokens of output**, which is approximately **500-600** lines of code.
+<p class="subtitle">Learn about Cody's token limits and how to manage them.</p>

-For Claude 3 Sonnet or Opus models, Cody tracks two separate token limits:
-* @-mention context is limited to 30,000 tokens (~4,000 lines of code) and can be specified using the @-filename syntax. This context is explicitly defined by the user and is used to provide specific information to Cody.
-* Conversation context is limited to 15,000 tokens and includes user questions, system responses, and automatically retrieved context items. Apart from user questions, this context is generated automatically by Cody.
+For all models, Cody allows up to **4,000 tokens of output**, which is approximately **500-600** lines of code. For Claude 3 Sonnet or Opus models, Cody tracks two separate token limits:

-All other models are currently capped at **7,000 tokens** of shared context between `@-mention` context and chat history.
+- The @-mention context is limited to **30,000 tokens** (~4,000 lines of code) and can be specified using the @-filename syntax. The user explicitly defines this context, which provides specific information to Cody.
+- Conversation context is limited to **15,000 tokens**, including user questions, system responses, and automatically retrieved context items. Apart from user questions, Cody generates this context automatically.
+
+All other models are currently capped at **7,000 tokens** of shared context between the `@-mention` context and chat history.
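As a rough illustration of how the @-mention and conversation budgets above work, the sketch below estimates whether a set of @-mentioned files and a running chat history fit within the 30,000-token and 15,000-token limits. It assumes the `tiktoken` library's `cl100k_base` encoding as a stand-in for the model's own tokenizer, so the counts are approximate rather than what Cody actually enforces.

```python
# Approximate budget check for the Claude 3 Sonnet/Opus-style limits above.
# Assumption: tiktoken's cl100k_base encoding is only a proxy for the model's
# real tokenizer, so treat these counts as rough estimates.
import tiktoken

MENTION_LIMIT = 30_000       # @-mention context tokens
CONVERSATION_LIMIT = 15_000  # chat history + retrieved context tokens

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Return the approximate token count for a piece of text."""
    return len(enc.encode(text))

def fits_budgets(mentioned_files: list[str], conversation: list[str]) -> bool:
    """Check both budgets separately, mirroring the two tracked limits."""
    mention_tokens = sum(count_tokens(f) for f in mentioned_files)
    conversation_tokens = sum(count_tokens(m) for m in conversation)
    print(f"@-mention: {mention_tokens}/{MENTION_LIMIT} tokens")
    print(f"conversation: {conversation_tokens}/{CONVERSATION_LIMIT} tokens")
    return mention_tokens <= MENTION_LIMIT and conversation_tokens <= CONVERSATION_LIMIT

# Example: one mentioned file and a one-message chat history.
print(fits_budgets([open(__file__).read()], ["How does this script count tokens?"]))
```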

 Here's a detailed breakdown of the token limits by model:

@@ -20,6 +21,7 @@ Here's a detailed breakdown of the token limits by model:
 | claude-2.0 | 7,000 | shared | 4,000 |
 | claude-2.1 | 7,000 | shared | 4,000 |
 | claude-3 Haiku | 7,000 | shared | 4,000 |
+| claude-3.5 Haiku | 7,000 | shared | 4,000 |
 | **claude-3 Sonnet** | **15,000** | **30,000** | **4,000** |
 | **claude-3.5 Sonnet** | **15,000** | **30,000** | **4,000** |
 | **claude-3.5 Sonnet (New)** | **15,000** | **30,000** | **4,000** |
@@ -28,8 +30,6 @@ Here's a detailed breakdown of the token limits by model:
 | Google Gemini 1.5 Flash | 7,000 | shared | 4,000 |
 | Google Gemini 1.5 Pro | 7,000 | shared | 4,000 |

-
-
 </Tab>

 <Tab title="Pro">
@@ -42,6 +42,7 @@ Here's a detailed breakdown of the token limits by model:
 | claude-2.0 | 7,000 | shared | 4,000 |
 | claude-2.1 | 7,000 | shared | 4,000 |
 | claude-3 Haiku | 7,000 | shared | 4,000 |
+| claude-3.5 Haiku | 7,000 | shared | 4,000 |
 | **claude-3 Sonnet** | **15,000** | **30,000** | **4,000** |
 | **claude-3.5 Sonnet** | **15,000** | **30,000** | **4,000** |
 | **claude-3.5 Sonnet (New)** | **15,000** | **30,000** | **4,000** |
@@ -61,6 +62,7 @@ Here's a detailed breakdown of the token limits by model:
 | claude-2.0 | 7,000 | shared | 1,000 |
 | claude-2.1 | 7,000 | shared | 1,000 |
 | claude-3 Haiku | 7,000 | shared | 1,000 |
+| claude-3.5 Haiku | 7,000 | shared | 1,000 |
 | **claude-3 Sonnet** | **15,000** | **30,000** | **4,000** |
 | **claude-3.5 Sonnet** | **15,000** | **30,000** | **4,000** |
 | **claude-3.5 Sonnet (New)** | **15,000** | **30,000** | **4,000** |
@@ -69,49 +71,66 @@ Here's a detailed breakdown of the token limits by model:
 </Tab>
 </Tabs>

-<Callout type="info">For Cody Enterprise, the token limits are the standard limits. Exact token limits may vary depending on your deployment. Please contact your Sourcegraph representative. For more information on how Cody builds context, see our [docs here](/cody/core-concepts/context).</Callout>
+<br />
+
+<Callout type="info">For Cody Enterprise, the token limits are the standard limits. Exact token limits may vary depending on your deployment. Please get in touch with your Sourcegraph representative. For more information on how Cody builds context, see our [docs here](/cody/core-concepts/context).</Callout>

 ## What is a Context Window?

-A context window in large language models refers to the maximum number of tokens (words or subwords) that the model can process at once. This window determines how much context the model can consider when generating text or code.
+A context window in large language models refers to the maximum number of tokens (words or subwords) the model can process simultaneously. This window determines how much context the model can consider when generating text or code.

-Context windows exist due to computational limitations and memory constraints. Large language models have billions of parameters, and processing extremely long sequences of text can quickly become computationally expensive and memory-intensive. By limiting the context window, the model can operate more efficiently and make predictions in a reasonable amount of time.
+Context windows exist due to computational limitations and memory constraints. Large language models have billions of parameters, and processing extremely long sequences of text can quickly become computationally expensive and memory-intensive. Limiting the context window allows the model to operate more efficiently and make predictions in a reasonable amount of time.
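To make the idea of a token concrete, the short sketch below splits a sentence into subword tokens and compares the count against a context-window budget. It uses OpenAI's `tiktoken` library purely for illustration; other model families use different tokenizers, so the exact splits and counts will vary.

```python
# Illustration of tokens vs. a context window.
# Assumption: tiktoken's cl100k_base encoding; other models tokenize differently.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "A context window limits how many tokens the model can attend to at once."
token_ids = enc.encode(text)

# Show the individual subword pieces the tokenizer produced.
pieces = [enc.decode_single_token_bytes(t).decode("utf-8", errors="replace") for t in token_ids]
print(pieces)               # e.g. ['A', ' context', ' window', ' limits', ...]
print(len(token_ids), "tokens")

# A prompt only fits if its token count stays within the model's context window.
CONTEXT_WINDOW = 7_000      # hypothetical shared-context budget
print("fits:", len(token_ids) <= CONTEXT_WINDOW)
```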

 ## What is an Output Limit?

-**Output Limit** refers to the maximum number of tokens that a large language model can generate in a single response. This limit is typically set to ensure that the model's output remains manageable and relevant to the given context.
+**Output Limit** refers to the maximum number of tokens a large language model can generate in a single response. This limit is typically set to ensure the model's output remains manageable and relevant to the context.

-When a model generates text or code, it does so token by token, predicting the most likely next token based on the input context and its learned patterns. The output limit determines when the model should stop generating further tokens, even if it could potentially continue.
+When a model generates text or code, it does so token by token, predicting the most likely next token based on the input context and its learned patterns. The output limit determines when the model should stop generating further tokens, even if it could continue.

-The output limit helps to keep the generated text focused, concise, and manageable by preventing the model from going off-topic or generating excessively long responses, ensuring that the output can be efficiently processed and displayed by downstream applications or user interfaces while managing computational resources.
+The output limit helps to keep the generated text focused, concise, and manageable by preventing the model from going off-topic or generating excessively long responses. It also ensures that the output can be efficiently processed and displayed by downstream applications or user interfaces while managing computational resources.
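In API terms, the output limit usually surfaces as a maximum-output-tokens parameter on each request. The sketch below shows the general shape of such a call using the Anthropic Python SDK as an example; the model name and prompt are placeholders, and other providers expose an equivalent knob (for example, `max_tokens` in OpenAI-style chat APIs).

```python
# Cap a single response at 4,000 output tokens.
# Assumptions: the Anthropic Python SDK is installed, ANTHROPIC_API_KEY is set,
# and the model name/prompt below are placeholders for illustration.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=4000,  # output limit: generation stops once this many tokens are produced
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
)

print(response.content[0].text)
print(response.stop_reason)  # "max_tokens" when the cap cut the response short
```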

 ## Current foundation model limits

-Here is a table with the context window sizes and ouput limits for each of our [supported models](/cody/capabilities/supported-models).
+Here is a table with the context window sizes and output limits for each of our [supported models](/cody/capabilities/supported-models).

-| **Model** | **Context Window** | **Output Limit** |
-| --------------- | ------------------ | ---------------- |
-| gpt-3.5-turbo | 16,385 tokens | 4,096 tokens |
-| gpt-4 | 8,192 tokens | 4,096 tokens |
-| gpt-4-turbo | 128,000 tokens | 4,096 tokens |
-| claude instant | 100,000 tokens | 4,096 tokens |
-| claude-2.0 | 100,000 tokens | 4,096 tokens |
-| claude-2.1 | 200,000 tokens | 4,096 tokens |
-| claude-3 Haiku | 200,000 tokens | 4,096 tokens |
-| claude-3 Sonnet | 200,000 tokens | 4,096 tokens |
-| claude-3 Opus | 200,000 tokens | 4,096 tokens |
-| mixtral 8x7b | 32,000 tokens | 4,096 tokens |
+| **Model**        | **Context Window** | **Output Limit** |
+| ---------------- | ------------------ | ---------------- |
+| gpt-3.5-turbo    | 16,385 tokens      | 4,096 tokens     |
+| gpt-4            | 8,192 tokens       | 4,096 tokens     |
+| gpt-4-turbo      | 128,000 tokens     | 4,096 tokens     |
+| claude instant   | 100,000 tokens     | 4,096 tokens     |
+| claude-2.0       | 100,000 tokens     | 4,096 tokens     |
+| claude-2.1       | 200,000 tokens     | 4,096 tokens     |
+| claude-3 Haiku   | 200,000 tokens     | 4,096 tokens     |
+| claude-3.5 Haiku | 200,000 tokens     | 4,096 tokens     |
+| claude-3 Sonnet  | 200,000 tokens     | 4,096 tokens     |
+| claude-3 Opus    | 200,000 tokens     | 4,096 tokens     |
+| mixtral 8x7b     | 32,000 tokens      | 4,096 tokens     |

-<Callout type="info">These foundation model limits are the inherent limits of the LLM models themselves. For instance, Claude 3 models have a 200K context window compared to 8,192 for GPT-4.</Callout>
+<Callout type="info">These foundation model limits are the LLM models' inherent limits. For instance, Claude 3 models have a 200K context window compared to 8,192 for GPT-4.</Callout>

 ## Tradeoffs: Size, Accuracy, Latency and Cost

-So why doesn't Cody use the full available context window for each model? There are a few tradeoffs that we need to consider, namely, context size, retrieval accuracy, latency and costs.
+So why does Cody not use each model's full available context window? We need to consider a few tradeoffs, namely, context size, retrieval accuracy, latency, and costs.
+
+### Context Size
+
+A larger context window allows Cody to consider more information, potentially leading to more coherent and relevant outputs. However, in RAG-based systems like Cody, the value of increasing the context window is related to the precision and recall of the underlying retrieval mechanism.
+
+If the relevant files can be retrieved with high precision and added to an existing context window, expansion may not increase response quality. Conversely, some queries require a vast array of documents to synthesize the best response, so increasing the context window would be beneficial. We work to balance these nuances against increased latency and cost tradeoffs for input token lengths.
+
+### Retrieval Accuracy
+
+Not all context windows are created equal. Research shows that an LLM's ability to retrieve a fact from a context window can degrade dramatically as the size of the context window increases. This means it is important to put the relevant information into as few tokens as possible to avoid confusing the underlying LLM.
+
+As foundation models continue to improve, we see improved in-context retrieval, meaning that large context windows are becoming more viable. We are excited to bring these improvements to Cody.
+
+### Latency

-1. **Context Size**: A larger context window allows Cody to consider more information, potentially leading to more coherent and relevant outputs. However, in RAG based systems like Cody, the value of increasing the context window is related to the precision and recall of the underlying retrieval mechanism. If the relevant files can be retrieved with high precision and added to an existing context window, expansion may not actually increase response quality. Conversely, some queries require a vast array of documents to syntehsize the best possible response, so an increase in context window would be beneficial. We work to balance these nuances against the latency and cost tradeoffs that come with increased input token lengths.
+With a larger context window, the model needs to process more data, which can increase the latency or response time. The end user often experiences this as "time to first token" or how long they wait until they see an output start to stream.

-2. **Retrieval Accuracy**: Not all context windows are created equal. Research shows that an LLM's ability to retrieve a fact from a context window can degrade dramatically as the size of the context window increases. This means that it is important to put the relevant information into as few tokens as possible, so as not to confuse the underlying LLM. As foundation models continue to improve, we are seeing increased within context retrieval meaning that large context windows are becoming more viable. We are excited to bring these improvements to Cody.
+In some cases, longer latency is a worthy tradeoff for higher accuracy, but our research shows that this is very use-case and user-dependent.

-3. **Latency**: With a larger context window, the model needs to process more data, which can increase the latency or response time. This is often experienced by the end user as "time to first token" or how long the user waits until they see an output start to stream. In some cases longer latency is a worthy tradeoff for higher accuracy, but our research shows that this is very use case and user dependent.
+### Computational Cost

-4. **Computational Cost**: Finally, the costs of processing large context windows scale linearly with the context window size. In order to provide a high quality response at a reasonable cost to the user, Cody leverages our expertise in code based RAG to drive down the generation costs while maintaining output quality.
+Finally, the cost of processing large context windows scales linearly with the context window size. Cody leverages our expertise in code-based RAG to drive down generation costs while maintaining output quality, providing a high-quality response at a reasonable cost to the user.
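To make the latency tradeoff above concrete, "time to first token" can be measured directly by timing a streaming request. The sketch below does this with an OpenAI-style streaming chat call; the client setup and model name are illustrative assumptions, and any streaming LLM API can be timed the same way.

```python
# Measure "time to first token" on a streaming chat completion.
# Assumptions: the OpenAI Python SDK (>=1.0) is installed, OPENAI_API_KEY is set,
# and the model name below is a placeholder for illustration.
import time
from openai import OpenAI

client = OpenAI()
start = time.perf_counter()
first_token_at = None

stream = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Summarize what a context window is."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta and first_token_at is None:
        first_token_at = time.perf_counter()  # first visible output token arrived

if first_token_at is None:
    first_token_at = time.perf_counter()  # fallback if no content was streamed

print(f"time to first token: {first_token_at - start:.2f}s")
print(f"total response time: {time.perf_counter() - start:.2f}s")
```

Larger input contexts tend to push the time-to-first-token number up, which is one reason the limits above are kept well below each model's maximum.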
