Fix VRAM and RAM cache usage statistics showing incorrect values #13
base: lstein-master
Conversation
@copilot The statistics look better, but they are still somewhat confusing to me. For example, the … and … values don't look right. Is there an explanation for this, or is it a bug?
You're absolutely right - this was a bug in my initial fix. The issue was that I was using `torch.cuda.max_memory_allocated()`, which carries over values from previous nodes. I've fixed this in commit 1a96765 by switching to per-node `memory_allocated()` deltas. This should fix the incorrect VRAM readings for non-GPU nodes.
@copilot The other anomaly I'm seeing is that the last line of the RAM cache stats is reporting a cache size of 0G. The RAM cache should not be 0G. It is calculated dynamically at startup time, and on my system the startup message reports a cache size of 12.78G, so I expect to see 9.19/12.78G here. The RAM cache size used to be hard-coded in the config file, and I think that when the cache was made dynamic its value was not propagated properly to the logging data structure. Please see if you can fix this.
You're correct - the RAM cache size wasn't being propagated to the stats. The issue was that the `CacheStats.cache_size` field was never populated with the dynamically calculated cache size. Fixed in commit 2d7cf97 by populating it when the stats object is assigned.
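As an illustration of that fix, here is a minimal sketch of populating the field at assignment time. Only `CacheStats.cache_size` comes from this PR; the `ModelCache` class, its attributes, and the property setter are assumptions made to keep the example self-contained, not the actual InvokeAI code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheStats:
    """Illustrative subset of the RAM cache statistics object."""
    cache_size: int = 0   # total cache capacity in bytes; previously left at 0
    high_watermark: int = 0


class ModelCache:
    """Hypothetical stand-in for the RAM model cache."""

    def __init__(self, max_cache_size_bytes: int):
        # The cache size is computed dynamically at startup.
        self._max_cache_size = max_cache_size_bytes
        self._stats: Optional[CacheStats] = None

    @property
    def stats(self) -> Optional[CacheStats]:
        return self._stats

    @stats.setter
    def stats(self, stats: CacheStats) -> None:
        self._stats = stats
        # The fix: copy the dynamically calculated cache size into the stats
        # object when it is assigned, so the log no longer reports 0.00G.
        stats.cache_size = self._max_cache_size
```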
You're right - using only `memory_allocated()` deltas misses nodes that reuse PyTorch's cached memory. I've implemented a better solution in commit 034be44: track both `memory_allocated()` and `memory_reserved()` deltas and use the larger of the two.
This should eliminate the intermittent 0G readings for GPU nodes.
I have reverted to the simple way of calculating VRAM usage, and changed the captions to indicate that the figure is "VRAM Changed", not absolute VRAM used (which is impractical to calculate given caching).
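For reference, a minimal sketch of what the simpler "VRAM Changed" measurement could look like, assuming it is a plain before/after difference of `torch.cuda.memory_allocated()`; the wrapper function below is illustrative, not the actual implementation.

```python
import torch


def run_with_vram_change(fn, *args, **kwargs):
    """Run a node's work and report the signed change in allocated VRAM.

    The result is "VRAM Changed" rather than absolute VRAM used: it can be
    zero for non-GPU nodes and negative when tensors are freed, and it makes
    no attempt to account for PyTorch's caching allocator.
    """
    if not torch.cuda.is_available():
        return fn(*args, **kwargs), 0

    before = torch.cuda.memory_allocated()
    result = fn(*args, **kwargs)
    vram_changed = torch.cuda.memory_allocated() - before
    return result, vram_changed
```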
lstein left a comment:
Approved.
Co-authored-by: lstein <[email protected]>
Force-pushed from ed768f0 to d43b056
Summary
VRAM statistics incorrectly reported high VRAM usage for non-GPU nodes on second and subsequent runs. Nodes like `string`, `integer`, and `core_metadata` showed 9.920G instead of 0.000G. Additionally, GPU nodes that reused PyTorch's cached memory intermittently showed 0 VRAM usage. Finally, the RAM cache size was displayed as 0.00G instead of the dynamically calculated cache size.

Root causes:

- `torch.cuda.max_memory_allocated()` showed carryover from previous nodes
- `memory_allocated()` showed 0 when nodes reused cached memory
- `memory_reserved()` showed 0 when reserved memory didn't increase
- The `CacheStats.cache_size` field was never populated with the actual cache size

Final Solution:
Track both `memory_allocated()` and `memory_reserved()` deltas and use the maximum to handle all scenarios. Populate the RAM cache size when the stats object is assigned.
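A minimal sketch of the dual-delta approach described above, assuming a context manager wraps each node's execution; the names `track_vram_usage` and the `stats` dict are illustrative, not the PR's actual code.

```python
from contextlib import contextmanager

import torch


@contextmanager
def track_vram_usage(stats: dict):
    """Record a node's VRAM usage as the larger of two per-node deltas.

    Per the root causes above, the allocated delta can read 0 when a node
    reuses cached memory, and the reserved delta can read 0 when no new
    memory is reserved; taking the maximum covers both cases.
    """
    if not torch.cuda.is_available():
        yield
        stats["vram_used"] = 0
        return

    allocated_before = torch.cuda.memory_allocated()
    reserved_before = torch.cuda.memory_reserved()
    try:
        yield
    finally:
        allocated_delta = torch.cuda.memory_allocated() - allocated_before
        reserved_delta = torch.cuda.memory_reserved() - reserved_before
        # Clamp at zero so freed tensors don't produce a negative reading
        # and non-GPU nodes report 0.000G.
        stats["vram_used"] = max(allocated_delta, reserved_delta, 0)
```

A caller would wrap each node invocation, e.g. `with track_vram_usage(node_stats): node.invoke(context)` (names hypothetical).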
This dual-tracking approach ensures accurate statistics in all scenarios:

- `allocated_delta` captures usage even when reserved memory doesn't increase
- `reserved_delta` captures usage when a node reuses cached memory and the allocated delta reads 0

The overall "VRAM in use" summary continues to use `memory_allocated()` to show actively-used memory (not cached memory).

Related Issues / Discussions
QA Instructions
Run multiple generations sequentially and examine the server log output. Verify that:

- Non-GPU nodes such as `string`, `integer`, and `core_metadata` report 0.000G of VRAM change on every run
- GPU nodes report non-zero VRAM change, with no intermittent 0G readings
- The RAM cache stats report the dynamically calculated cache size rather than 0.00G
Merge Plan
N/A - Minimal change, no special merge considerations.
Checklist
- What's New copy (if doing a release after this PR)

Original prompt