Fix: Using Local Token Counting instead of making an API request #5418
Conversation
- gracefully handle the case where `tiktoken` throws an error
daniel-lxs
left a comment
Thank you @HahaBill!
LGTM
The rationale behind adding the API request for token counting was to get more accurate counts, since Roo relies heavily on the estimated counts to know when to condense context to avoid overflows (and tiktoken can be fairly inaccurate). A couple things that would persuade me that this is a good idea:
But in the meantime, I don't know that we have enough to convince me we should just remove the API-based counting without any other changes to compensate.
I am currently reading this and your points are valid. We could compensate for it by doing:
I am also interested in seeing how slow this is. I will try to measure it quantitatively.
…ing at the beginning + do API-based counting at 90% of effective threshold + minor factor/parameters tuning
@mrubens @daniel-lxs Hi! I have some updates regarding this PR and did extra implementation based on the feedback.

**Introduction**

From quantitative measurement, we see that this is indeed slow. However, this is a tradeoff between speed and accuracy.

**Solution**

I implemented a new strategy which gives us both speed and accuracy:

1. At the start of the task, do the first three API-based counts to collect samples for calculating the initial `safetyFactor`.
2. After those three API-based counts, do full local-based counting with the `safetyFactor`.
3. Then check whether we're at 90% of the effective threshold. If yes, do API-based token counting; if not, do local-based token counting with the `safetyFactor`.

Models used to test the new implementation:
How the `safetyFactor` is calculated:

```typescript
const totalRatio = this.samples.reduce((sum, sample) => sum + sample.api / sample.local, 0)
const averageRatio = totalRatio / this.samples.length
this.safetyFactor = averageRatio * TokenCountComparator.ADDITIONAL_SAFETY_FACTOR
// ADDITIONAL_SAFETY_FACTOR = 1.05 -> preventing context window overruns
```

Code design changes:
**Conclusion**

Let me know what you think :)
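For reference, here is a minimal sketch of how the three phases above could fit together. Apart from `TokenCountComparator` and `ADDITIONAL_SAFETY_FACTOR`, which appear in the snippet above, the names and method boundaries are assumptions for illustration, not the PR's actual code:

```typescript
// Illustrative sketch only: everything except TokenCountComparator and
// ADDITIONAL_SAFETY_FACTOR is an assumption about how the pieces could fit
// together, not the PR's actual API.
interface TokenSample {
	api: number
	local: number
}

class TokenCountComparator {
	static readonly ADDITIONAL_SAFETY_FACTOR = 1.05 // margin against context window overruns
	static readonly CALIBRATION_SAMPLES = 3 // first N counts are API-based
	static readonly API_THRESHOLD_RATIO = 0.9 // switch back to the API near the threshold

	private samples: TokenSample[] = []
	private safetyFactor = TokenCountComparator.ADDITIONAL_SAFETY_FACTOR

	// Phase 1: record API vs. local counts and recalibrate the safety factor.
	addSample(api: number, local: number): void {
		this.samples.push({ api, local })
		const totalRatio = this.samples.reduce((sum, sample) => sum + sample.api / sample.local, 0)
		const averageRatio = totalRatio / this.samples.length
		this.safetyFactor = averageRatio * TokenCountComparator.ADDITIONAL_SAFETY_FACTOR
	}

	needsCalibration(): boolean {
		return this.samples.length < TokenCountComparator.CALIBRATION_SAMPLES
	}

	// Phase 3 check: near the effective threshold, fall back to API-based counting.
	shouldUseApi(estimatedTokens: number, effectiveThreshold: number): boolean {
		return estimatedTokens >= effectiveThreshold * TokenCountComparator.API_THRESHOLD_RATIO
	}

	// Phase 2: local (tiktoken) count scaled by the calibrated safety factor.
	adjustLocalCount(localTokens: number): number {
		return Math.ceil(localTokens * this.safetyFactor)
	}
}
```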
Hey @HahaBill, thank you for looking into this. After reviewing the code, I realized the purpose of this API request is to correctly determine when to truncate or condense the history. I don't think it's a good idea to handle this with `tiktoken` alone.

I'm open to discussing this further if you have other ideas.
Hi @daniel-lxs! Thanks for the review and your openness to more discussion :) I agree with you on that point.

On top of introducing context overflow recovery, we can still use the implementation I did of doing regression and API-based token counting both at the beginning and near the end of the conversation. I can also lower the start_api_based_percentage, e.g. reduce it from 90% to 85%, to get more accurate results towards the end.

**Handling Context Overruns with Hierarchical Merging**

You made me realize something and made a really good point: we don't want condensation/truncation happening too early because of context loss. That also ties into mrubens's comment that maybe we should handle context window overruns more gracefully.

What we can do is that when we overflow, we do hierarchical merging. This would essentially solve the issue of context loss and give better context window overrun handling. That would also mean we could expand our internal sliding window a bit more. We could go with different chunk summarization/merging techniques to make the history fit into the context window, but we have to keep the API cost in mind. I imagine the algorithm to be like this:
**Conclusion**

This proposal would not only allow us to reduce the number of API-based token counting calls but also improve the condensation / content generation (`createMessage`) logic and overrun error handling. Hierarchical merging is also an approach for dealing with super long context. The overall solution is:
If these are large changes, we could first put this behind a feature flag as experimental; once we get user feedback and have enough evidence that it is stable, we can migrate. Let me know what you think.
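As a rough illustration of the hierarchical merging idea above, here is a sketch that assumes hypothetical `summarize` and `countTokens` helpers; none of these names are taken from the Roo codebase, and the chunk/merge policy shown is only one possibility:

```typescript
// Hypothetical helpers: summarize() condenses text with an LLM call and
// countTokens() estimates size. Neither name comes from the Roo codebase,
// and the chunk/merge policy here is only one possible variant.
declare function summarize(text: string): Promise<string>
declare function countTokens(text: string): number

// On overflow: split the history into chunks, summarize each chunk, then
// repeatedly merge and re-summarize pairs until the result fits the window.
async function hierarchicalMerge(messages: string[], maxTokens: number, chunkSize = 8): Promise<string> {
	let level: string[] = []
	for (let i = 0; i < messages.length; i += chunkSize) {
		level.push(await summarize(messages.slice(i, i + chunkSize).join("\n")))
	}
	// Each merge level roughly halves the number of summaries; every
	// summarize() call is an API request, so cost grows with history depth.
	while (level.length > 1 && countTokens(level.join("\n")) > maxTokens) {
		const merged: string[] = []
		for (let i = 0; i < level.length; i += 2) {
			merged.push(await summarize(level.slice(i, i + 2).join("\n")))
		}
		level = merged
	}
	return level.join("\n")
}
```

Pairwise merging keeps each summarization prompt small, at the cost of more API calls for very long histories, which is the cost tradeoff mentioned above.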
@daniel-lxs One user is also asking for more context overflow handling: #6102, and listed some ideas too. They also pointed out that long context might not be possible to condense, and I think that is most likely the case for non-Gemini models. Maybe the topic of handling context window overruns would be better off as another GitHub issue? I think if we design and get this right, it will help this PR.

Edit: Another user facing issues with large context: #6069
One more thing: if we wanted to increase the accuracy of each chunk, codebase indexing could be used per chunk.
@mrubens Any additional thoughts you might want to share about this? I understand the benefits but I'm not entirely sure if they outweigh the potential downsides.
I appreciate the exploration, but I don't think the tradeoff is worth it at this point.
Alright, got it! If I come up with some other idea that is less complex, I will let you know.
Related GitHub Issue
Closes: #4526
Closes: #3666
Description
However, in case `tiktoken` fails, I gracefully handle that by using the Gemini and Anthropic SDKs to make a request to get the token count. If that also fails, I `console.warn` and return 0.
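For illustration, a minimal sketch of that fallback order (local `tiktoken` count first, then the provider SDK, then warn and return 0); the function and parameter names here are placeholders, not the handlers' actual signatures:

```typescript
// Placeholder signatures: localCount wraps tiktoken, apiCount wraps the
// provider SDK (Anthropic/Gemini). The handlers' real signatures may differ.
async function countTokensWithFallback(
	content: string,
	localCount: (text: string) => number,
	apiCount: (text: string) => Promise<number>,
): Promise<number> {
	try {
		// Preferred path: local tiktoken-based counting, no network request.
		return localCount(content)
	} catch (localError) {
		try {
			// Fallback: ask the provider API for the token count.
			return await apiCount(content)
		} catch (apiError) {
			// Last resort: warn and return 0 rather than failing the task.
			console.warn("Token counting failed locally and via the API:", apiError)
			return 0
		}
	}
}
```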
Test Procedure
This is a small change; it just swaps the online counting for local counting. To see that local counting still works, write a simple prompt and check whether it still counts tokens.
Whether the `tiktoken` count is accurate would be another issue.
Pre-Submission Checklist
Screenshots / Videos
No UI changes
Documentation Updates
No documentation updates
Additional Notes
Get in Touch
Important

Switch `countTokens()` in `AnthropicHandler` and `GeminiHandler` to prioritize local counting, with fallbacks to the remote API and zero.

- `countTokens()` in `AnthropicHandler` and `GeminiHandler` now uses local token counting first.
- … `countTokens()` methods.

This description was created by … for 33393fe. You can customize this summary. It will automatically update as commits are pushed.