Skip to content

Fix memory leaks in Clock#1265

Closed
leekillough wants to merge 2 commits intosstsimulator:develfrom
leekillough:valgrind_errors_part2
Closed

Fix memory leaks in Clock#1265
leekillough wants to merge 2 commits intosstsimulator:develfrom
leekillough:valgrind_errors_part2

Conversation

@leekillough
Copy link
Contributor

This is another Valgrind-related memory leak fix, but because it is not as straightforward, it deserves a separate PR and more careful review.

There is a "definite leak" reported by Valgrind in Clock::staticHandlerMap, which is a vector of pointers to handlers, when running the coreTest_Checkpoint test.

To fix this leak, and to simplify the code, staticHandlerMap was changed from a vector of raw pointers to a vector of std::unique_ptr, so that when the class is destroyed and its member staticHandlerMap is destroyed, all of its pointers are deleted automatically.

However, in Clock::execute(), the handler is executed, and if successful, the handler member in staticHandlerMap is erased. Previously it erased the raw pointer member without deleting it, while now we have to release() it from std::unique_ptr before the mapping entry gets erased, because the handler apparently uses the raw handler pointer elsewhere, passing it around and maybe adding it back to staticHandlerMap later. If we do not release() it and simply erase the std::unique_ptr entry, it will get deleted, which causes it not to work correctly (the previous code did not delete the raw pointer either).

On the surface, this change looks like it does not change the behavior, because it uses std::unique_ptr instead of raw pointers, and it has the same behavior as before, adding allocated handler pointers to staticHandlerMap and erasing them without deleting them when they are successfully executed. When the Clock is destroyed, any remaining members of staticHandlerMap are deleted and erased. So this change superficially looks like it uses smart pointers for better, simpler code without changing its behavior.

But I have seen "definite leak" Valgrind errors fixed by this change in coreTest_Checkpoint, so it has something to do with checkpoint/restart. I am not sure this solution is right. I wish that SST used smart pointers in more places, and that ownership semantics were better defined in APIs. There is this comment in the file, which indicates that staticHandlerMap is not serialized, which may have something to do with the leak in the coreTest_Checkpoint test:

void
Clock::serialize_order(SST::Core::Serialization::serializer& ser)
{
    Action::serialize_order(ser);

    // Won't serialize the handlers; they'll be re-registered at
    // restart
    ser& currentCycle;
    ser& period;
    ser& next;
    ser& scheduled;
}

"Won't serialize the handlers; they'll be re-registered at restart". Somehow turning the handler pointers into smart std::unique_ptrs is able to get rid of one of the leaks in coreTest_Checkpoint.

I suspect that when the checkpointing test does a serialization and Checkpointing, it discards the old value of staticHandlerMap, and that by making the pointers std::unique_ptr, they get deleted when the staticHandlerMap is erased/destroyed/whatever when the checkpoint/restart sequence happens, whereas before this change, the raw pointers to allocated memory simply got thrown away and caused a memory leak.

@github-actions github-actions bot added AT: WIP Mark PR as a Work in Progress (No Autotesting Performed) AT: CLANG-FORMAT FAIL labels Mar 17, 2025
@github-actions
Copy link

CLANG-FORMAT TEST - FAILED (on last commit):
Run > ./scripts/clang-format-test.sh using clang-format v12 to check formatting

@github-actions github-actions bot added AT: CMAKE-FORMAT PASS and removed AT: WIP Mark PR as a Work in Progress (No Autotesting Performed) labels Mar 17, 2025
@github-actions
Copy link

CMAKE-FORMAT TEST - PASSED

@github-actions github-actions bot added AT: WIP Mark PR as a Work in Progress (No Autotesting Performed) AT: CLANG-FORMAT PASS and removed AT: WIP Mark PR as a Work in Progress (No Autotesting Performed) AT: CLANG-FORMAT FAIL labels Mar 17, 2025
@github-actions
Copy link

CLANG-FORMAT TEST - PASSED

@github-actions github-actions bot added AT: WIP Mark PR as a Work in Progress (No Autotesting Performed) and removed AT: WIP Mark PR as a Work in Progress (No Autotesting Performed) labels Mar 17, 2025
@github-actions
Copy link

CMAKE-FORMAT TEST - PASSED

for ( ; iter != staticHandlerMap.end(); iter++ ) {
if ( *iter == handler ) {
if ( iter->get() == handler ) {
staticHandlerMap.erase(iter);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed: This is one place where there was a memory leak: If you call unregisterHandler(), the old code did not delete the handler, but the new code does. Here we call erase() on an element which contains std::unique_ptr, causing it to delete the handler. This fixes a "definite" memory leak reported by Valgrind.

I suppose to minimize code changes we could simply add delete handler; before the erase() in the old code, and then most if not all of the other changes would be unnecessary, but the new code using std::unique_ptr is much cleaner and avoids having to manually delete everything in the destructor. One thing I always try to do, is: If there is a delete in a destructor, see if changing to smart pointers will make it get deleted automatically and avoid the need to do it in the destructor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't delete the handler because the clock object doesn't own it. I'm work on another PR that already changes how clock handlers are deleted, taking into account who owns them. Once that merges, we can revisit possible memory leaks. Closing this PR for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I'm definitely seeing memory leaks in Clock() as I pour through Valgrind reports, so you and I are both onto something about Clock ownership.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, before the checkpointing code went in, there was probably no way to avoid memory leaks because there was no way to automatically delete clock handers that were not currently registered with a Clock object. And, CoreTest_Checkpoint definitely deregisters clocks, so I suspect the deregistered clock handlers are the ones that are leaking.

With checkpointing we had to track clock handlers in BaseComponent in order to serialize everything properly. Now, BaseComponent just owns all the Clock::Handlers registered with it and "loans" them out to the Clock objects. So, BaseComponent can delete them as the simulation finishes. While working on watchpoints, which touch clock handlers, I noticed that we hadn't changed how the handlers were deleted. I already have that implemented and haven't run into any errors with the change.

@sst-autotester
Copy link
Contributor

Status Flag 'Pre-Test Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging
NO INSPECTION HAS BEEN PERFORMED ON THIS PULL REQUEST! - This PR must be inspected by setting label 'AT: PRE-TEST INSPECTED'.

@feldergast feldergast closed this Mar 17, 2025
@leekillough
Copy link
Contributor Author

If Clock does not own the handlers, then you should consider using std::weak_ptr for the pointers to the handlers in staticHandlerMap, and std::shared_ptr in the "true owner" of the handlers, passing a const std::shared_ptr& to Clock's methods which require a handler, so that Clock temporarily creates a std::shared_ptr from its std::weak_ptr when it needs access to the handler, but Clock never owns or deletes the handler.

std::vector<std::weak_ptr<Clock::HandlerBase>> staticHandlerMap;

bool
Clock::registerHandler(const std::shared_ptr<Clock::HandlerBase>& handler)
{
  staticHandlerMap.push_back(handler);   // Pushes a std::weak_ptr created from a std::shared_ptr
...
}

void
Clock::execute()
{
...

std::shared_ptr<Clock::HandlerBase> handler(*sop_iter);  // Create a std::shared_ptr "borrowing" from std::weak_ptr in staticHandlerMap
 if ( (*handler)(currentCycle) )
   sop_iter = staticHandlerMap.erase(sop_iter);   // Erase the std::weak_ptr
else
  ++sop_iter;

@leekillough leekillough deleted the valgrind_errors_part2 branch March 31, 2025 04:05
@berquist berquist added this to the SST V15.0.0 milestone Apr 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants