fix some memleak #13306
The avx_component module is retained twice, once by avx_component_op_query and once by ompi_op_base_op_select; mca_base_var_register calls strdup and does not free the old string; opal_patcher_base_framework is never closed. Signed-off-by: wjjahah <[email protected]>
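Two of the leak patterns named in this commit message can be sketched in isolation. The following is a minimal illustration, not Open MPI code: `toy_obj_t`, `toy_obj_retain`, `toy_obj_release`, `toy_var_t`, and `toy_var_set` are hypothetical names, and the real OPAL object system and `mca_base_var_register` are more involved. It shows why a second retain without a matching release keeps an object alive, and how replacing a `strdup`'d string should free the previous copy.

```c
#define _POSIX_C_SOURCE 200809L  /* for strdup */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reference-counted object; NOT the real OPAL object system. */
typedef struct {
    int refcount;
    int *freed_flag;  /* set to 1 when the object is destroyed */
} toy_obj_t;

static toy_obj_t *toy_obj_new(int *freed_flag)
{
    toy_obj_t *obj = malloc(sizeof(*obj));
    if (NULL == obj) {
        return NULL;
    }
    obj->refcount = 1;  /* the creator holds the first reference */
    obj->freed_flag = freed_flag;
    *freed_flag = 0;
    return obj;
}

static void toy_obj_retain(toy_obj_t *obj)
{
    obj->refcount++;
}

/* Drops one reference and returns how many remain; destroys at zero. */
static int toy_obj_release(toy_obj_t *obj)
{
    assert(obj->refcount > 0);
    int remaining = --obj->refcount;
    if (0 == remaining) {
        *obj->freed_flag = 1;
        free(obj);
    }
    return remaining;
}

/* Hypothetical variable slot that owns a heap copy of its string value. */
typedef struct {
    char *value;
} toy_var_t;

/* Replaces the stored string. Writing `var->value = strdup(s);` alone
 * would leak the previous copy, so the old string is freed first. */
static void toy_var_set(toy_var_t *var, const char *s)
{
    char *copy = strdup(s);
    if (NULL == copy) {
        return;  /* keep the old value on allocation failure */
    }
    free(var->value);  /* free(NULL) is a no-op */
    var->value = copy;
}
```

If two code paths each call `toy_obj_retain` but teardown issues a single `toy_obj_release` per path that is expected to own the object, the count never reaches zero and the object is never freed, which matches the shape of the double retain described above.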
opal_common_ucx opens opal_memory_base_framework but never closes it. Signed-off-by: wjjahah <[email protected]>
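The open-without-close leak can be pictured with a counted open/close pair. This is a sketch under the assumption that framework opens are reference counted, so that every open needs a matching close before resources are released; `toy_framework_t`, `toy_framework_open`, and `toy_framework_close` are hypothetical stand-ins, not the MCA base API.

```c
#include <assert.h>

/* Hypothetical framework handle; NOT the real MCA framework structure. */
typedef struct {
    int refcount;     /* number of outstanding opens */
    int initialized;  /* 1 while the framework holds its resources */
} toy_framework_t;

/* The first open acquires resources; later opens only bump the count. */
static int toy_framework_open(toy_framework_t *fw)
{
    if (0 == fw->refcount++) {
        fw->initialized = 1;
    }
    return 0;
}

/* The last close releases resources; earlier closes only drop the count. */
static int toy_framework_close(toy_framework_t *fw)
{
    assert(fw->refcount > 0);
    if (0 == --fw->refcount) {
        fw->initialized = 0;
    }
    return 0;
}
```

With this model, a user that opens the framework and never closes it leaves `refcount` above zero forever, so the resources stay allocated even after every other user has closed its handle.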
Hi @bosilca , this PR is waiting for workflow approval. Could you please approve it when you have a moment?
@bosilca Looks like you queued up the NVIDIA job last night, but it sat in queue for 10+ hours, so I tried to cancel it. But now it seems stuck. Can you check what's happening on the NVIDIA side?
Sorry, not directly related to the topic of this PR: Why are there queues at Nvidia? I'm having the same problem (13261), and I think 13211 is having the same problem.
@xbw22109 Sorry about this. We actually run CI at a variety of different locations -- not just on Github cloud resources (including NVIDIA). That being said, sometimes something goes wrong on these CI resources, and sometimes they need some care and feeding. That happened with NVIDIA's CI resources this past week; they fixed the issue, but we probably missed some of the pending PRs that falsely failed. I've re-queued the jobs on #13261 and #13211. Sorry for the hassle! 😦 FYI: know that you can also re-trigger CI by pushing a new commit to a PR (even if you
@jsquyres Thanks for the explanation!
I need to test the UCX change to convince myself it is safe, but I can't right now.
@bwbarrett , @hppritcha can you chime in regarding the patcher change? It won't impact UCX, as UCX does its own memory tracking, but it might impact others that I'm not aware of.
    return opal_patcher->patch_fini();
    }

    return OPAL_SUCCESS;
I am really very skeptical about this change. While logically it sounds reasonable, the patcher is a very special module: it changes the way memory allocations and deallocations are tracked, and I'm definitely not sure we can unload it. I personally would be against this change without proper testing, but I defer to others for additional insights.
It just releases the component's resources, and patch_list has already been released at line 86.
That's exactly my concern. The patcher works by interposition, i.e., it interposes function calls to sbrk or similar APIs. If the shared library where these functions are defined goes out of scope and is unloaded, bad things will happen, as a call will land in a memory area that has been released and possibly reused.
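The hazard described here can be sketched with a function-pointer slot standing in for whatever mechanism the real patcher uses. Everything below is hypothetical (`toy_slot`, `toy_patch_install`, `toy_patch_fini`, and the two target functions are illustrative names only); the point is just the ordering constraint: the interposed entry has to be restored before the code it points at goes away.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*toy_hook_fn)(int);

/* The original function that callers expect to reach. */
static int toy_original(int x)
{
    return x;
}

/* Stand-in for a tracking wrapper that would live in a loadable component. */
static int toy_interposed(int x)
{
    return x + 1000;
}

static toy_hook_fn toy_slot = toy_original;  /* what callers actually invoke */
static toy_hook_fn toy_saved = NULL;         /* original, kept for restoration */

/* Redirect callers to the interposed function. */
static int toy_patch_install(void)
{
    if (NULL != toy_saved) {
        return -1;  /* already patched */
    }
    toy_saved = toy_slot;
    toy_slot = toy_interposed;
    return 0;
}

/* Restore the original. If the code holding toy_interposed were unloaded
 * (for example via dlclose) WITHOUT this step, the next call through
 * toy_slot would jump into memory that may be unmapped or reused. */
static int toy_patch_fini(void)
{
    if (NULL == toy_saved) {
        return -1;  /* nothing to restore */
    }
    toy_slot = toy_saved;
    toy_saved = NULL;
    return 0;
}

static int toy_call(int x)
{
    return toy_slot(x);
}
```

In this toy model unloading is safe only after `toy_patch_fini` has succeeded; whether the real patcher components restore every patched entry point on finalization is exactly what the thread says has not been tested.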
@hjelmn is the one that should really chime in on this. Looking at the code, all the right bits are there such that unloading the component should work. But clearly we're not testing it.
To avoid unnecessary delays, I propose you remove the patcher change from this PR and move it into another PR (by itself). We would then merge this one, and the additional discussion regarding the patcher would go into the new PR.
Signed-off-by: wjjahah <[email protected]>
Thanks for the suggestion! I've updated the code.
The new PR is #13322.