Fix: Using Local Token Counting instead of making an API request #5418
Conversation
- gracefully handle the case where `tiktoken` throws an error
daniel-lxs
left a comment
Thank you @HahaBill!
LGTM
The rationale behind adding the API request for token counting was to get more accurate counts, since Roo relies heavily on the estimated counts to know when to condense context to avoid overflows (and tiktoken can be fairly inaccurate). A couple things that would persuade me that this is a good idea:
But in the meantime, I don't know that we have enough to convince me we should just remove the API-based counting without any other changes to compensate.
I am currently reading this and your points are valid. We could compensate for it by doing:
I am also interested in seeing how slow this is. I will try to measure it quantitatively.
…ing at the beginning + do API-based counting at 90% of effective threshold + minor factor/parameters tuning
@mrubens @daniel-lxs Hi! I have some updates regarding this PR and did extra implementation based on the feedback.

**Introduction**

From quantitative measurement, we see that this is indeed slow. However, this is a tradeoff between speed and accuracy.

**Solution**

I implemented a new strategy which gives us both speed and accuracy:

1. At the start of the task, do the first three API-based counts to collect samples for calculating the initial `safetyFactor`.
2. After those three API-based counts, do full local-based counting with the `safetyFactor`.
3. Then check whether we're at 90% of the effective threshold. If yes, do API-based token counting; if not, do local-based token counting with the `safetyFactor`.

Models used to test the new implementation:
How the `safetyFactor` is calculated:

```typescript
const totalRatio = this.samples.reduce((sum, sample) => sum + sample.api / sample.local, 0)
const averageRatio = totalRatio / this.samples.length
this.safetyFactor = averageRatio * TokenCountComparator.ADDITIONAL_SAFETY_FACTOR
// ADDITIONAL_SAFETY_FACTOR = 1.05 -> preventing context window overruns
```

Code design changes:
**Conclusion**

Let me know what you think :)
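For reference, here is a minimal sketch of how the three phases above could fit together. Apart from `TokenCountComparator` and `ADDITIONAL_SAFETY_FACTOR`, which appear in the snippet above, the names and method boundaries are assumptions for illustration, not the PR's actual code:

```typescript
// Illustrative sketch only: everything except TokenCountComparator and
// ADDITIONAL_SAFETY_FACTOR is an assumption about how the pieces could fit
// together, not the PR's actual API.
interface TokenSample {
	api: number
	local: number
}

class TokenCountComparator {
	static readonly ADDITIONAL_SAFETY_FACTOR = 1.05 // margin against context window overruns
	static readonly CALIBRATION_SAMPLES = 3 // first N counts are API-based
	static readonly API_THRESHOLD_RATIO = 0.9 // switch back to the API near the threshold

	private samples: TokenSample[] = []
	private safetyFactor = TokenCountComparator.ADDITIONAL_SAFETY_FACTOR

	// Phase 1: record API vs. local counts and recalibrate the safety factor.
	addSample(api: number, local: number): void {
		this.samples.push({ api, local })
		const totalRatio = this.samples.reduce((sum, sample) => sum + sample.api / sample.local, 0)
		const averageRatio = totalRatio / this.samples.length
		this.safetyFactor = averageRatio * TokenCountComparator.ADDITIONAL_SAFETY_FACTOR
	}

	needsCalibration(): boolean {
		return this.samples.length < TokenCountComparator.CALIBRATION_SAMPLES
	}

	// Phase 3 check: near the effective threshold, fall back to API-based counting.
	shouldUseApi(estimatedTokens: number, effectiveThreshold: number): boolean {
		return estimatedTokens >= effectiveThreshold * TokenCountComparator.API_THRESHOLD_RATIO
	}

	// Phase 2: local (tiktoken) count scaled by the calibrated safety factor.
	adjustLocalCount(localTokens: number): number {
		return Math.ceil(localTokens * this.safetyFactor)
	}
}
```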
Hey @HahaBill, thank you for looking into this. After reviewing the code, I realized the purpose of this API request is to correctly determine when to truncate or condense the history. I don't think it's a good idea to handle this with `tiktoken` alone.

I'm open to discussing this further if you have other ideas.
Hi @daniel-lxs! Thanks for the review and your openness to more discussion :) I agree with you on that point.

On top of introducing context overflow recovery, we can still use the implementation I did of doing regression and API-based token counting both at the beginning and near the end of the conversation. I can also lower the start_api_based_percentage, e.g. reduce it from 90% to 85%, to get more accurate results towards the end.

**Handling Context Overruns with Hierarchical Merging**

You made me realize something and made a really good point: we don't want condensation/truncation happening too early because of context loss. That also ties into mrubens's comment that maybe we should handle context window overruns more gracefully.

What we can do is that when we overflow, we do hierarchical merging. This would essentially solve the issue of context loss and give better context window overrun handling. That would also mean we could expand our internal sliding window a bit more. We could go with different chunk summarization/merging techniques to make the history fit into the context window, but we have to keep the API cost in mind. I imagine the algorithm to be like this:
**Conclusion**

This proposal would not only allow us to reduce the number of API-based token counting calls but also improve the condensation / content generation (`createMessage`) logic and overrun error handling. Hierarchical merging is also an approach for dealing with super long context. The overall solution is:
If these are large changes, we could first put this behind a feature flag as experimental; once we get user feedback and have enough evidence that it is stable, we can migrate. Let me know what you think.
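As a rough illustration of the hierarchical merging idea above, here is a sketch that assumes hypothetical `summarize` and `countTokens` helpers; none of these names are taken from the Roo codebase, and the chunk/merge policy shown is only one possibility:

```typescript
// Hypothetical helpers: summarize() condenses text with an LLM call and
// countTokens() estimates size. Neither name comes from the Roo codebase,
// and the chunk/merge policy here is only one possible variant.
declare function summarize(text: string): Promise<string>
declare function countTokens(text: string): number

// On overflow: split the history into chunks, summarize each chunk, then
// repeatedly merge and re-summarize pairs until the result fits the window.
async function hierarchicalMerge(messages: string[], maxTokens: number, chunkSize = 8): Promise<string> {
	let level: string[] = []
	for (let i = 0; i < messages.length; i += chunkSize) {
		level.push(await summarize(messages.slice(i, i + chunkSize).join("\n")))
	}
	// Each merge level roughly halves the number of summaries; every
	// summarize() call is an API request, so cost grows with history depth.
	while (level.length > 1 && countTokens(level.join("\n")) > maxTokens) {
		const merged: string[] = []
		for (let i = 0; i < level.length; i += 2) {
			merged.push(await summarize(level.slice(i, i + 2).join("\n")))
		}
		level = merged
	}
	return level.join("\n")
}
```

Pairwise merging keeps each summarization prompt small, at the cost of more API calls for very long histories, which is the cost tradeoff mentioned above.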
@daniel-lxs One user is also asking for more context overflow handling: #6102, and listed some ideas too. They also pointed out that long context might not be possible to condense, and I think that is most likely the case for non-Gemini models. Maybe the topic of handling context window overruns would be better off as another GitHub issue? I think if we design and get this right, it will help this PR.

Edit: Another user facing issues with large context: #6069
One more thing: if we wanted to increase the accuracy of each chunk, codebase indexing could be used per chunk.
@mrubens Any additional thoughts you might want to share about this? I understand the benefits but I'm not entirely sure if they outweigh the potential downsides.
I appreciate the exploration, but I don't think the tradeoff is worth it at this point.
Alright, got it! If I come up with some other idea that is less complex, I will let you know.
Related GitHub Issue
Closes: #4526
Closes: #3666
Description
However, in case `tiktoken` fails, I gracefully handle that by using the Gemini and Anthropic SDKs to make a request to get the token count. If that also fails, I `console.warn` and return 0.
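For illustration, a minimal sketch of that fallback order (local `tiktoken` count first, then the provider SDK, then warn and return 0); the function and parameter names here are placeholders, not the handlers' actual signatures:

```typescript
// Placeholder signatures: localCount wraps tiktoken, apiCount wraps the
// provider SDK (Anthropic/Gemini). The handlers' real signatures may differ.
async function countTokensWithFallback(
	content: string,
	localCount: (text: string) => number,
	apiCount: (text: string) => Promise<number>,
): Promise<number> {
	try {
		// Preferred path: local tiktoken-based counting, no network request.
		return localCount(content)
	} catch (localError) {
		try {
			// Fallback: ask the provider API for the token count.
			return await apiCount(content)
		} catch (apiError) {
			// Last resort: warn and return 0 rather than failing the task.
			console.warn("Token counting failed locally and via the API:", apiError)
			return 0
		}
	}
}
```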
Test Procedure
This is a small change; it just swaps the online counting for local counting. To see that local counting still works, write a simple prompt and check whether it still counts tokens.
Whether the `tiktoken` count is accurate would be another issue.
Pre-Submission Checklist
Screenshots / Videos
No UI changes
Documentation Updates
No documentation updates
Additional Notes
Get in Touch
Important

Switch `countTokens()` in `AnthropicHandler` and `GeminiHandler` to prioritize local counting, with fallbacks to the remote API and zero.

- `countTokens()` in `AnthropicHandler` and `GeminiHandler` now uses local token counting first.
- … `countTokens()` methods.

This description was created by … for 33393fe. You can customize this summary. It will automatically update as commits are pushed.