
Conversation

@HahaBill
Contributor

@HahaBill HahaBill commented Jul 5, 2025

Related GitHub Issue

Closes: #4526
Closes: #3666

Description

This PR swaps the remote, API-based token counting in the AnthropicHandler and GeminiHandler for local counting with tiktoken. If tiktoken fails, the handler falls back to the Gemini or Anthropic SDK to request a token count from the API. If that also fails, it logs a console.warn and returns 0.
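
As a rough sketch of this fallback chain (countTokensLocally and countTokensViaApi are hypothetical helper names, not code from this PR):

declare function countTokensLocally(content: string): number
declare function countTokensViaApi(content: string): Promise<number>

async function countTokens(content: string): Promise<number> {
  try {
    // 1. Prefer local counting (tiktoken).
    return countTokensLocally(content)
  } catch (localError) {
    console.warn("Local token counting failed, falling back to the provider API:", localError)
    try {
      // 2. Fall back to the provider's count-tokens endpoint (Anthropic/Gemini SDK).
      return await countTokensViaApi(content)
    } catch (apiError) {
      // 3. Last resort: warn and return 0.
      console.warn("API-based token counting also failed, returning 0:", apiError)
      return 0
    }
  }
}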

Test Procedure

This is a small change: it just swaps the online (API-based) counting for local counting. To see that local counting still works, write a simple prompt and check whether the tokens are still counted.

Whether the tiktoken count is accurate is a separate issue.

Pre-Submission Checklist

  • Issue Linked: This PR is linked to an approved GitHub Issue (see "Related GitHub Issue" above).
  • Scope: My changes are focused on the linked issue (one major feature/fix per PR).
  • Self-Review: I have performed a thorough self-review of my code.
  • Testing: New and/or updated tests have been added to cover my changes (if applicable) -> not needed, since tiktoken already has unit tests: https://github.com/RooCodeInc/Roo-Code/blob/main/src/utils/__tests__/tiktoken.spec.ts
  • Documentation Impact: I have considered if my changes require documentation updates (see "Documentation Updates" section below).
  • Contribution Guidelines: I have read and agree to the Contributor Guidelines.

Screenshots / Videos

No UI changes

Documentation Updates

No documentation updates

Additional Notes

Get in Touch


Important

Switch countTokens() in AnthropicHandler and GeminiHandler to prioritize local counting, with fallbacks to remote API and zero.

  • Behavior:
    • countTokens() in AnthropicHandler and GeminiHandler now uses local token counting first.
    • If local counting fails, falls back to remote API request for token count.
    • If remote API also fails, logs a warning and returns 0.
  • Error Handling:
    • Logs warnings for failures in both local and remote token counting attempts in countTokens() methods.

This description was created by Ellipsis for 33393fe.

@hannesrudolph hannesrudolph added the Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. label Jul 5, 2025
@daniel-lxs daniel-lxs moved this from Triage to PR [Draft / In Progress] in Roo Code Roadmap Jul 6, 2025
@hannesrudolph hannesrudolph added PR - Draft / In Progress and removed Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. labels Jul 6, 2025
@HahaBill HahaBill marked this pull request as ready for review July 7, 2025 15:10
@HahaBill HahaBill requested review from cte, jr and mrubens as code owners July 7, 2025 15:10
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. bug Something isn't working labels Jul 7, 2025
@HahaBill
Contributor Author

HahaBill commented Jul 7, 2025

This PR resolves Case 2 of issue #4526. If issue #5444 is approved, then Case 1 is no longer needed.

@HahaBill
Contributor Author

HahaBill commented Jul 7, 2025

Issue #5444 seems to be approved, which means that Case 1 in #4526 is outdated and no longer relevant, because the preview versions of 2.5 Pro will be replaced by gemini-2.5-pro.

@hannesrudolph hannesrudolph moved this from PR [Draft / In Progress] to PR [Needs Prelim Review] in Roo Code Roadmap Jul 7, 2025
Member

@daniel-lxs daniel-lxs left a comment

Thank you @HahaBill!

LGTM

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Jul 10, 2025
@daniel-lxs daniel-lxs moved this from PR [Needs Prelim Review] to PR [Needs Review] in Roo Code Roadmap Jul 10, 2025
@mrubens
Collaborator

mrubens commented Jul 10, 2025

The rationale behind adding the API request for token counting was to get more accurate counts, since Roo relies heavily on the estimated counts to know when to condense context to avoid overflows (and tiktoken can be fairly inaccurate).

A couple things that would persuade me that this is a good idea:

  1. Some quantitative measurement of how much slower this extra request makes things
  2. An accompanying change to handle context window overruns more gracefully than we currently do

But in the meantime, I'm not convinced we have enough to justify removing the API-based counting without any other changes to compensate.
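
One way to produce the quantitative measurement asked for in point 1 would be a simple timing comparison between the two counting paths. This is a hypothetical sketch (the CountingHandler interface and method names are assumptions, not part of the PR):

interface CountingHandler {
  countTokensLocally(text: string): number
  countTokensViaApi(text: string): Promise<number>
}

async function benchmarkTokenCounting(handler: CountingHandler, samples: string[]): Promise<void> {
  let localMs = 0
  let apiMs = 0
  for (const text of samples) {
    let start = performance.now()
    handler.countTokensLocally(text) // tiktoken path
    localMs += performance.now() - start

    start = performance.now()
    await handler.countTokensViaApi(text) // provider count-tokens endpoint
    apiMs += performance.now() - start
  }
  console.log(`local avg: ${(localMs / samples.length).toFixed(1)} ms`)
  console.log(`API avg:   ${(apiMs / samples.length).toFixed(1)} ms`)
}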

@daniel-lxs daniel-lxs moved this from PR [Needs Review] to PR [Changes Requested] in Roo Code Roadmap Jul 10, 2025
@HahaBill
Contributor Author

The rationale behind adding the API request for token counting was to get more accurate counts, since Roo relies heavily on the estimated counts to know when to condense context to avoid overflows (and tiktoken can be fairly inaccurate).

A couple things that would persuade me that this is a good idea:

  1. Some quantitative measurement of how much slower this extra request makes things

  2. An accompanying change to handle context window overruns more gracefully than we currently do

But in the meantime, I'm not convinced we have enough to justify removing the API-based counting without any other changes to compensate.

I am currently reading this, and your points are valid.

We could compensate by:

  • incorporating usage metadata from responses for more accurate token counting (a rough sketch of this idea follows at the end of this comment)
  • introducing a buffer/margin so that we bias towards staying under the limit when handling context window overruns
  • allowing users to switch between local and API-based counting

I am also interested in seeing how slow this is; I will try to measure it quantitatively.
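
As a rough sketch of the first idea above, calibrating local estimates against the usage metadata that providers already return on normal responses (all names here are hypothetical, not code from this PR):

class UsageCalibrator {
  private ratios: number[] = []

  // Record one (reported, locally estimated) pair after a completed request.
  record(reportedInputTokens: number, localEstimate: number): void {
    if (localEstimate > 0) {
      this.ratios.push(reportedInputTokens / localEstimate)
    }
  }

  // Scale a raw local estimate by the average observed ratio (no-op until samples exist).
  adjust(localEstimate: number): number {
    if (this.ratios.length === 0) return localEstimate
    const average = this.ratios.reduce((sum, r) => sum + r, 0) / this.ratios.length
    return Math.ceil(localEstimate * average)
  }
}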

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Jul 16, 2025
@HahaBill
Contributor Author

HahaBill commented Jul 16, 2025

@mrubens @daniel-lxs Hi! I have some updates regarding this PR and have done extra implementation work based on the feedback.

Introduction

From the quantitative measurements, we can see that the extra request is indeed slow. However, tiktoken is not as accurate as API-based counting; in fact, from my personal observations, the local tiktoken count was a bit higher (around 10-20%) than both Anthropic's and Gemini's counts, which is better than the other way around, since we do not want context window overruns. We will still use API-based counting to get better token count estimates for context condensation.

This is a tradeoff between speed and accuracy.

Solution

I implemented a new strategy that gives us both speed and accuracy:

1. At the start of the task, the first three counts are API-based, to collect samples for calculating the initial safetyFactor, which is then multiplied by the local token count. This keeps local counting close to API-based counting. You can basically view this as a regression problem: x = y * safetyFactor (API count = local count * safetyFactor). From my observations, the collected samples are quite stable and consistent, meaning that all sample pairs have a similar ratio.

2. After those three API-based counts, we do fully local counting with Math.ceil(localEstimate * this.tokenComparator.getSafetyFactor());

3. It then checks whether we are at 90% of the effective threshold: if yes, it does API-based token counting; if not, it does local counting with tiktoken (a rough sketch of this flow follows the safetyFactor snippet below).

Models used to test the new implementation:

  • Anthropic: Claude Sonnet 4
  • Gemini: 2.5 Pro

How the safetyFactor is calculated:

const totalRatio = this.samples.reduce((sum, sample) => sum + sample.api / sample.local, 0)
const averageRatio = totalRatio / this.samples.length
this.safetyFactor = averageRatio * TokenCountComparator.ADDITIONAL_SAFETY_FACTOR
// ADDITIONAL_SAFETY_FACTOR = 1.05 -> preventing context window overruns
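
Putting the three steps together, the decision flow might look roughly like this. This is a sketch only; the interface and method names are assumptions based on the description above, not the actual implementation:

interface HybridCountingHandler {
  tokenComparator: {
    getSampleCount(): number
    getSafetyFactor(): number
    addSample(apiCount: number, localCount: number): void
  }
  apiBasedTokenCount(text: string): Promise<number>
  localTokenCount(text: string): number
  contextWindow: number
}

async function countTokensHybrid(handler: HybridCountingHandler, text: string): Promise<number> {
  const localEstimate = handler.localTokenCount(text)

  // Phase 1: the first three counts are API-based and double as calibration samples.
  if (handler.tokenComparator.getSampleCount() < 3) {
    const apiCount = await handler.apiBasedTokenCount(text)
    handler.tokenComparator.addSample(apiCount, localEstimate)
    return apiCount
  }

  // Phase 2: calibrated local counting.
  const adjusted = Math.ceil(localEstimate * handler.tokenComparator.getSafetyFactor())

  // Phase 3: once we are near the effective threshold (90% here), switch back to API-based counting.
  if (adjusted >= 0.9 * handler.contextWindow) {
    return handler.apiBasedTokenCount(text)
  }
  return adjusted
}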

Code design changes:

  • The decision of whether to trigger context condensation now lives in context-safety.ts, so that in the future we can easily reuse it anywhere in the system, which adds modularity.
  • Every API provider can now override the apiBasedTokenCount function to plug in its own API, which adds flexibility. Providers no longer need to implement their own countTokens function, which avoids code duplication (see the sketch after this list).
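
A minimal sketch of that override-based design (class and method names are assumptions, not the real code):

abstract class BaseProviderSketch {
  // The shared counting policy lives in the base class: local by default,
  // API-based only when precision matters (e.g. near the context limit).
  async countTokens(text: string, nearThreshold: boolean): Promise<number> {
    return nearThreshold ? this.apiBasedTokenCount(text) : this.localTokenCount(text)
  }

  protected localTokenCount(text: string): number {
    return Math.ceil(text.length / 4) // crude placeholder for the tiktoken path
  }

  // Providers with a remote count-tokens endpoint override only this hook.
  protected abstract apiBasedTokenCount(text: string): Promise<number>
}

class GeminiProviderSketch extends BaseProviderSketch {
  protected async apiBasedTokenCount(text: string): Promise<number> {
    // A real implementation would call the Gemini countTokens endpoint here.
    return this.localTokenCount(text)
  }
}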

Conclusion

Let me know what you think :)

@daniel-lxs daniel-lxs moved this from PR [Changes Requested] to PR [Needs Prelim Review] in Roo Code Roadmap Jul 17, 2025
@daniel-lxs daniel-lxs moved this from PR [Needs Prelim Review] to PR [Needs Review] in Roo Code Roadmap Jul 18, 2025
@daniel-lxs
Member

Hey @HahaBill, thank you for looking into this. After reviewing the code, I realized the purpose of this API request is to correctly determine when to truncate or condense the history.

I don't think it's a good idea to handle this with tiktoken. If truncation or condensing happens too late, the request can fail entirely. If it happens too early, it can lead to unnecessary context loss.

I'm open to discussing this further if you have other ideas.

@HahaBill
Contributor Author

HahaBill commented Jul 23, 2025

Hey @HahaBill, thank you for looking into this. After reviewing the code, I realized the purpose of this API request is to correctly determine when to truncate or condense the history.

I don't think it's a good idea to handle this with tiktoken. If truncation or condensing happens too late, the request can fail entirely. If it happens too early, it can lead to unnecessary context loss.

I'm open to discussing this further if you have other ideas.

Hi @daniel-lxs! Thanks for the review and your openness to more discussion :)

I agree with you that tiktoken is not as accurate, and I am not proposing to remove API-based token counting completely. mrubens's assessment is also valid: handling context window overruns is the crucial part if we want to move away from API-based token counting.

On top of introducing context overflow recovery, we can still use the implementation I did: doing regression and using API-based token counting both at the beginning and near the end of the conversation. I can also lower the start_api_based_percentage, for example from 90% to 85%, to get more accurate results towards the end.

Handling Context Overruns with Hierarchical Merging

You made me realize something, and you make a really good point: we don't want condensation/truncation to happen too early because of the context loss. That also ties back to mrubens's comment that we should perhaps handle context window overruns more gracefully.

What we can do is perform hierarchical merging when we overflow. This would essentially solve the context-loss issue and give us better handling of context window overruns. It would also mean that we could expand our internal sliding window a bit more.

We can go with different chunk-summarization and merging techniques to make the history fit into the context window, but we have to keep the API cost in mind. I imagine the algorithm like this (a rough sketch follows the list):

  1. Get an error from the API response because of a context window overflow.
  2. Get the last summary, the conversation history, and the user's last message, and figure out their token counts.
  3. Based on that, intelligently divide them into chunks, ensuring that we won't get context window overflow errors while not making too many API calls or going over the tokens-per-minute limit. We will always try to keep the user's last message intact without summarizing it. I would estimate at most 2-3 additional API calls.
  4. Perform hierarchical merging.
  5. Provide the condensed token count to the user and call the createMessage function.
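
A very rough sketch of the merging step, assuming a hypothetical summarize() helper that condenses a list of messages into a single summary:

declare function summarize(messages: string[]): Promise<string>

async function hierarchicalMerge(
  history: string[],
  lastUserMessage: string,
  chunkSize: number,
): Promise<string[]> {
  // Step 3: split the overflowing history into chunks.
  const chunks: string[][] = []
  for (let i = 0; i < history.length; i += chunkSize) {
    chunks.push(history.slice(i, i + chunkSize))
  }

  // Step 4: summarize each chunk, then merge summaries pairwise until one remains.
  let summaries = await Promise.all(chunks.map((chunk) => summarize(chunk)))
  while (summaries.length > 1) {
    const merged: string[] = []
    for (let i = 0; i < summaries.length; i += 2) {
      const pair = summaries.slice(i, i + 2)
      merged.push(pair.length === 2 ? await summarize(pair) : pair[0])
    }
    summaries = merged
  }

  // The user's last message is kept intact and appended after the merged summary.
  return [...summaries, lastUserMessage]
}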

Conclusion

This proposal would not only allow us to reduce the number of API-based token-counting calls, but also improve the condensation/content-generation (createMessage) logic and the handling of overrun errors. Hierarchical merging is also an approach for dealing with very long context.

The overall solution is:

  • Do API-based token counting at the beginning of the task and near the end (85-90%) to get accurate estimates and reduce the number of context overruns.
  • Introduce hierarchical merging when we run into context window overruns. That reduces context loss and adds more handling to the condensation and content-generation parts.
  • Collect samples from API-based token counting to do regression and adjust tiktoken's output accordingly. (This might be premature optimization and can be introduced later if beneficial.)

If these are large changes, we could first put them behind a feature flag as experimental; once we get user feedback and have enough evidence that this is stable, we can migrate.

Let me know what you think.

@HahaBill
Contributor Author

HahaBill commented Jul 24, 2025

@daniel-lxs One user is also asking for more context overflow handling: #6102, and they listed some ideas too. They also pointed out that a long context might not be possible to condense, which I think is most likely the case for non-Gemini models.

Maybe the topic of handling context window overruns would be better off as a separate GitHub issue? I think that if we design and get this right, it will help this PR.

Edit: another user facing issues with large context: #6069

@HahaBill
Contributor Author

One more thing: if we wanted to increase the accuracy of each chunk, codebase indexing could be used for each chunk.

@daniel-lxs
Member

@mrubens Any additional thoughts you might want to share about this? I understand the benefits but I'm not entirely sure if they outweigh the potential downsides.

@mrubens mrubens closed this Jul 31, 2025
@github-project-automation github-project-automation bot moved this from New to Done in Roo Code Roadmap Jul 31, 2025
@github-project-automation github-project-automation bot moved this from PR [Needs Review] to Done in Roo Code Roadmap Jul 31, 2025
@mrubens
Collaborator

mrubens commented Jul 31, 2025

I appreciate the exploration, but I don’t think the tradeoff is worth it at this point.

@HahaBill
Contributor Author

HahaBill commented Jul 31, 2025

I appreciate the exploration, but I don’t think the tradeoff is worth it at this point.

Alright, got it! If I come up with another idea that is less complex, I will let you know.

Labels

  • bug (Something isn't working)
  • lgtm (This PR has been approved by a maintainer)
  • PR - Needs Review
  • size:L (This PR changes 100-499 lines, ignoring generated files)

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

  • Improving User Experience with Gemini in Roo Code
  • Gemini/Anthropic: Stop using remote token count APIs

4 participants