fix: adjust token clamping threshold from 20% to 80% for GLM-4.5 compatibility #6807
Conversation
fix: adjust token clamping threshold from 20% to 80% for GLM-4.5 compatibility

The previous 20% clamping threshold was too restrictive for models like GLM-4.5 that have legitimate high output token requirements (98,304 tokens out of 131,072 context window = 75%). This change only applies clamping when maxTokens exceeds 80% of the context window, preventing models from using the entire context for output while still allowing models with high output requirements to function properly.

Fixes #6806
Reviewing my own code because apparently I trust no one, not even myself.
```ts
// Only apply clamping if maxTokens is more than 80% of context window
if (model.maxTokens > model.contextWindow * 0.8) {
	// Clamp to 80% to leave room for input
	return Math.floor(model.contextWindow * 0.8)
```
Is using Math.floor() here intentional? With a context window of 131,072, this gives 104,857 tokens instead of 104,858. While minor, would Math.ceil() or Math.round() better maximize available tokens for edge cases?
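For reference, the arithmetic behind that question as a small illustrative sketch (the context-window value is the GLM-4.5 figure from this PR; the snippet is not part of the diff):

```ts
const contextWindow = 131_072

console.log(contextWindow * 0.8)             // 104857.6
console.log(Math.floor(contextWindow * 0.8)) // 104857 (current behavior: rounds down)
console.log(Math.ceil(contextWindow * 0.8))  // 104858 (keeps one extra output token)
console.log(Math.round(contextWindow * 0.8)) // 104858 (same as ceil for this value)
```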
```diff
 if (model.maxTokens) {
-	return Math.min(model.maxTokens, model.contextWindow * 0.2)
+	// Only apply clamping if maxTokens is more than 80% of context window
+	if (model.maxTokens > model.contextWindow * 0.8) {
```
Consider extracting this magic number to a named constant like MAX_OUTPUT_TOKEN_RATIO = 0.8 for better maintainability. This would make it easier to adjust in the future and clearer about the intent.
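A minimal sketch of what that extraction could look like (the constant name follows the reviewer's suggestion; this is not the committed code):

```ts
// Sketch only: fraction of the context window that output tokens may occupy before clamping.
const MAX_OUTPUT_TOKEN_RATIO = 0.8

// ...then inside getModelMaxOutputTokens:
if (model.maxTokens > model.contextWindow * MAX_OUTPUT_TOKEN_RATIO) {
	// Clamp to the ratio to leave room for input
	return Math.floor(model.contextWindow * MAX_OUTPUT_TOKEN_RATIO)
}
```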
```diff
 })
-expect(result).toBe(20_000) // Should use model.maxTokens since it's exactly at 20%
+expect(result).toBe(80_000) // Should use model.maxTokens since it's at 80%
 })
```
Would it be helpful to add a specific test case for the GLM-4.5 scenario that triggered this issue? Something like:
```ts
})

test("should handle GLM-4.5 model with 98,304 tokens out of 131,072 context window", () => {
	const model: ModelInfo = {
		contextWindow: 131_072,
		supportsPromptCache: false,
		maxTokens: 98_304, // 75% of context window
	}
	const settings: ProviderSettings = {
		apiProvider: "openrouter",
	}
	const result = getModelMaxOutputTokens({
		modelId: "z.al/glm-4.5",
		model,
		settings,
		format: "openrouter",
	})
	expect(result).toBe(98_304) // Should use model.maxTokens since 75% < 80%
})
```
```diff
-// If model has explicit maxTokens, clamp it to 20% of the context window
+// If model has explicit maxTokens, only clamp it if it exceeds 80% of the context window
+// This prevents models from using the entire context for output while still allowing
```
The comment is good, but could we be more explicit about why 80% was chosen? Perhaps mention that this leaves approximately 20% for input tokens and system prompts, which is typically sufficient for most use cases?
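One possible wording for that comment, sketched here rather than quoted from the PR:

```ts
// If model has explicit maxTokens, only clamp it if it exceeds 80% of the context window.
// The 80% threshold leaves roughly 20% of the window for input tokens and the system prompt,
// which is typically sufficient, while models that advertise high output limits
// (e.g. GLM-4.5 at 75% of its window) are left unclamped.
```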
This PR fixes issue #6806 where GLM-4.5 models on OpenRouter were failing with token limit errors after upgrading to v3.25.8.
Problem
The commit c52fdc4 introduced a 20% clamping threshold for model max tokens relative to the context window. This was too restrictive for models like GLM-4.5 that legitimately require high output token counts (98,304 tokens out of 131,072 context window = 75%).
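To make the numbers concrete, an illustrative sketch (values taken from the GLM-4.5 report above; the variable names are not from the codebase):

```ts
const contextWindow = 131_072
const glmMaxTokens = 98_304 // 98_304 / 131_072 = 0.75 of the window

const oldCap = contextWindow * 0.2       // 26_214.4 → GLM-4.5 was clamped far below its advertised limit
const newThreshold = contextWindow * 0.8 // 104_857.6 → 98_304 stays under this, so it is no longer clamped
```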
Solution
Adjusted the clamping threshold from 20% to 80% of the context window. This prevents models from using the entire context window for output while still allowing models with legitimately high output requirements, such as GLM-4.5, to function properly.
Changes
- Updated the `getModelMaxOutputTokens` function in `src/shared/api.ts` to use the 80% threshold

Testing
- `src/shared/__tests__/api.spec.ts` - 21 tests passing
- `src/api/providers/__tests__/openrouter.spec.ts` - 12 tests passing
- `src/api/transform/__tests__/model-params.spec.ts` - 45 tests passing

Fixes #6806
Important
Adjusts token clamping threshold from 20% to 80% for GLM-4.5 compatibility, updating `getModelMaxOutputTokens` and related tests.

- `getModelMaxOutputTokens` in `api.ts` now clamps only when `maxTokens` > 80% of the context window.
- `api.spec.ts`, `openrouter.spec.ts`, and `model-params.spec.ts` updated to reflect the new 80% threshold.