Fix errors with metric accumulation #266
Conversation
Signed-off-by: Samuel Monson <[email protected]>

Force-pushed f211e30 to 1ea30d4
Pull Request Overview
This PR fixes two metric calculation issues in the benchmark statistics system: a double-counting bug in concurrency calculations when events are merged due to epsilon tolerance, and incorrect token-per-second calculations that excluded the first decode token.
- Fixed concurrency metric accumulation logic to prevent double-counting when events are merged
- Corrected token-per-second calculations to include the first decode token by adding 1 to prompt token counts
- Added comprehensive regression and edge case tests for the concurrency calculation fixes
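The merge logic behind the concurrency fix can be illustrated with a short sketch. This is a hypothetical reconstruction of the idea only; the function name, epsilon value, and data shapes below are assumptions for illustration and are not taken from src/guidellm/objects/statistics.py.

```python
# Hypothetical sketch of epsilon-merged concurrency accumulation; names are
# illustrative and do not come from guidellm's statistics module.
def concurrency_timeline(requests, epsilon=1e-6):
    """Return (timestamp, active_requests) pairs from (start, end) request pairs.

    Events that land within ``epsilon`` of each other are merged into one
    timestamp by summing their +1/-1 deltas, so the running concurrency is
    updated once per merged group instead of being double-counted.
    """
    # +1 event at each request start, -1 event at each request end
    events = sorted(
        [(start, +1) for start, _ in requests]
        + [(end, -1) for _, end in requests]
    )

    timeline = []
    active = 0
    i = 0
    while i < len(events):
        time, delta = events[i]
        i += 1
        # Fold every event within epsilon of this timestamp into a single delta.
        while i < len(events) and events[i][0] - time <= epsilon:
            delta += events[i][1]
            i += 1
        active += delta
        timeline.append((time, active))
    return timeline
```

For example, `concurrency_timeline([(0.0, 1.0), (0.0, 2.0), (1.0, 3.0)])` merges the two starts at t=0.0 into a single +2 delta and yields `[(0.0, 2), (1.0, 2), (2.0, 1), (3.0, 0)]`, rather than re-adding the running concurrency for each merged event.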
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/guidellm/objects/statistics.py | Restructured concurrency event processing to fix double-counting in merged events |
| src/guidellm/benchmark/benchmark.py | Added +1 to prompt tokens to include the first decode token in total token calculations |
| tests/unit/objects/test_statistics.py | Added regression tests for concurrency double-counting and epsilon edge cases |
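The benchmark-side change can be summarized with a similar hypothetical sketch. The names below (`token_events`, `first_token_time`, and so on) are assumptions for illustration and are not copied from src/guidellm/benchmark/benchmark.py; the point is only where the first decode token is attributed.

```python
# Illustrative sketch of the +1 attribution idea, not guidellm's actual code.
def token_events(prompt_tokens, output_tokens, first_token_time, end_time):
    """Yield (timestamp, token_count) pairs for one request's total-token stats.

    By the first-token timestamp the server has processed the whole prompt and
    emitted one decode token, so prompt_tokens + 1 tokens are attributed there;
    the remaining output_tokens - 1 decode tokens are attributed at completion.
    Omitting the +1 silently drops the first decode token from the totals.
    """
    yield (first_token_time, prompt_tokens + 1)
    if output_tokens > 1:
        yield (end_time, output_tokens - 1)
```

Total tokens per second then follows from summing these counts over the benchmark duration.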
Signed-off-by: Samuel Monson <[email protected]>
One minor update, otherwise looks good
Summary
Fixes an issue in metric calculation that caused incorrect statistics at extreme changes in concurrency, and an issue where the first decode token was not counted in total tokens per second.
Details
- [x] Fixed issue where merged concurrency change events would double-count concurrency
- [x] Ensure first decode token is counted when calculating total tokens per second
Test Plan
- Run unit tests: `tox -e test-unit -- -m "regression and sanity"`
Use of AI
- [x] "I certify that all code in this PR is my own, except as noted below."
- [x] Includes AI-assisted code completion
- [ ] Includes code generated by an AI application
- [x] Includes AI-generated tests (NOTE: AI written tests should have a docstring that includes `## WRITTEN BY AI ##`)