
Conversation

@davisp (Contributor) commented Mar 13, 2025

I discovered a couple of race conditions leading to segfaults in a separate repository. After debugging, I managed to narrow this down to issues around the lifetimes of our static global state instances. Since these instances are all destructed as the main thread exits, anything running in separate threads can end up attempting to reuse them after they have been destructed.

The test included in this PR doesn't actually use threads due to flakiness concerns. Instead, we can observe the same phenomenon by attempting to store a `static std::optional<Context>` instance, which ends up having its destructor run after the destructors for `GlobalState` and the root `Logger`.
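
A minimal sketch of that destruction-order hazard (the type and member names here are illustrative, not the actual TileDB code): the empty optional finishes construction during static initialization, before the lazily created GlobalState, so at process exit it is destroyed after GlobalState, and its Context destructor touches an already-destructed object.

#include <iostream>
#include <optional>

struct GlobalState {
  ~GlobalState() { std::cout << "GlobalState destructed\n"; }
  void cancel_all_tasks() { std::cout << "cancel_all_tasks\n"; }
};

GlobalState& global_state() {
  // Constructed on first use, i.e. during Context's constructor below.
  static GlobalState gs;
  return gs;
}

struct Context {
  Context() { global_state(); }
  // Undefined behavior at exit: gs was destructed before this runs.
  ~Context() { global_state().cancel_all_tasks(); }
};

// Constant-initialized (empty) long before GlobalState exists, so it is
// destructed last, after GlobalState is already gone.
static std::optional<Context> ctx;

int main() {
  ctx.emplace();
}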

This change updates the `GlobalState` to be a `static std::shared_ptr<GlobalState>` and the `Logger` to be a `static Logger*`. In the case of `GlobalState`, we then rely on `shared_ptr` reference counting to ensure that the instance stays alive for as long as a `Context` needs it. The `Logger*`, on the other hand, is simply never `delete`d, so that log messages generated after the main thread exits are still emitted rather than silently dropped.
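
A sketch of the resulting lifetime pattern, under the simplifying assumption that a Context just stores a shared_ptr copy (names abbreviated from the real code):

#include <memory>

class GlobalState {};

class Logger {
 public:
  void log(const char* msg) {}
};

std::shared_ptr<GlobalState> global_state() {
  // The static shared_ptr's destructor only drops one reference at
  // process exit; any Context still holding a copy keeps the instance
  // alive until that Context is itself destructed.
  static std::shared_ptr<GlobalState> gs = std::make_shared<GlobalState>();
  return gs;
}

Logger& root_logger() {
  // Intentionally never deleted: the object outlives every static
  // destructor, so threads that log during process exit always see a
  // live Logger instead of a destructed one.
  static Logger* logger = new Logger();
  return *logger;
}

class Context {
  std::shared_ptr<GlobalState> global_state_{global_state()};
};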

Resolves CORE-40


TYPE: BUG
DESC: Fix segfault race condition at process exit

// Note: We are *not* deallocating this root `Logger` instance so that
// during process exit, threads other than main will not trigger segfaults
// if they attempt to log after the main thread has exited.
static Logger* l = tiledb_new<Logger>(
Member

Is the heap profiler ever enabled by the time you get here? It looks like it is enabled only via the C API, not even via an environment option at startup.

I suppose if that changes then you will pick it up here, which will be nice; unless it expects memory usage to be zero by the time it exits for some reason.

Contributor (Author)

To my knowledge, heap profiling is a compile-time option, so “yes”, I guess? Assuming that code still works.

Member

I would have expected it to be something that has to last for the life of the process, but there is a C API, `tiledb_heap_profiler_enable`, which looks like it connects to the same thing used here.

Contributor (Author)

A quick skim suggests that the reporter logic is always available, but if it's not enabled at compile time, it just won't have anything to report.

See:

#if defined(TILEDB_MEMTRACE)

Member

What I'm looking at is:

In `tiledb/common/heap_memory.h`:

template <typename T, typename... Args>
T* tiledb_new(const std::string& label, Args&&... args) {
  if (!heap_profiler.enabled()) {
    return new T(std::forward<Args>(args)...);
  }
  ...
}

In `tiledb/common/heap_profiler.h`:

extern HeapProfiler heap_profiler;

class HeapProfiler {
  inline bool enabled() const {
    // We know that this instance has been initialized
    // if `reserved_memory_` has been set.
    return reserved_memory_ != nullptr;
  }
};

And in `tiledb/common/heap_profiler.cc` we have `HeapProfiler::HeapProfiler` initializing its `reserved_memory_` to null. The enable function I mentioned above is what initializes it.
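
For what it's worth, a simplified sketch of that enable flow as I read it (the real HeapProfiler::enable signature and bookkeeping differ; this only illustrates the null-pointer-as-flag idea):

#include <cstddef>
#include <cstdlib>

class HeapProfiler {
 public:
  void enable(std::size_t reserved_bytes) {
    // Allocating this block is what flips enabled() to true: the
    // non-null pointer doubles as the "profiling is on" flag.
    reserved_memory_ = std::malloc(reserved_bytes);
  }

  inline bool enabled() const {
    return reserved_memory_ != nullptr;
  }

 private:
  void* reserved_memory_ = nullptr;  // null until enable() is called
};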

Contributor (Author)

Ah. Looks like that's a missed-optimization issue. The `make_shared` path was updated specifically to be a compile-time switch. You can see the `constexpr` part here:

if constexpr (detail::global_tracing<void>::enabled::value) {
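
To illustrate the difference (a sketch only; aside from the quoted `if constexpr` line, the names here are stand-ins, not the real headers): with the compile-time switch, the disabled branch is discarded entirely, whereas `tiledb_new` above still pays a runtime `enabled()` check.

#include <utility>

namespace detail {
template <typename T>
struct global_tracing {
#if defined(TILEDB_MEMTRACE)
  struct enabled { static constexpr bool value = true; };
#else
  struct enabled { static constexpr bool value = false; };
#endif
};
}  // namespace detail

template <typename T, typename... Args>
T* traced_new(Args&&... args) {  // hypothetical counterpart to tiledb_new
  if constexpr (detail::global_tracing<void>::enabled::value) {
    // Tracing builds would record the allocation here before constructing.
    return new T(std::forward<Args>(args)...);
  } else {
    // In non-tracing builds this branch is the only code that survives
    // compilation: no runtime enabled() check remains.
    return new T(std::forward<Args>(args)...);
  }
}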

@bekadavis9 (Contributor)

Haven't given the code a full look-over yet, but on first pass I'm hesitant about the implementation. I'd still like to see `StorageManager` completely eliminated, and I fear this is a bandage over a larger issue. Note the (draft) job-tracking work (#5284), which we should still consider prioritizing.

@davisp (Contributor, Author) commented Mar 13, 2025

@bekadavis9 I'm right there with you. This was the least-worst thing I could think of on a first pass, while also being much more concerned with just debugging the CI issues discussed in the story.

The `GlobalState` stuff, for instance, exists mostly so that cloud can ask query processing to stop, which I'm not entirely sure even works; but if the new cloud stuff doesn't need it any more (double-crossed fingers on that one), maybe the fix there is to just delete the global state.

For the logger side, I still think that should be application-provided. It's still weird to me that we even attempt to write logs to disk instead of just exposing an API, but it is what it is for the moment.

However, using the library as-is, within our guarantees, can segfault, and I put “don't segfault” at a higher priority than “eww, this touches code I want to delete”.

@rroelke (Member) commented May 13, 2025

We should revisit this. We're seeing it reasonably often in tiledb-tables CI. Even if we do remove the StorageManager component eventually, I don't have the impression that landing this change will increase the complexity of doing so; and having these additional tests will give us another axis of confidence that the removal is done correctly.

@ypatia (Member) commented May 23, 2025

We should get some traction here. Fixing this issue somehow is necessary to avoid the segfaults we observe from time to time. Honestly, I am a bit concerned about the choice not to free the logger; isn't that leaking memory? Is that acceptable for our library?

@davisp (Contributor, Author) commented May 28, 2025

@ypatia I believe the technical description would be "sorta kinda". First off, I would say that it's not technically a leak, because the pointer is always valid; if we can still access it, it's not leaked. The comment could probably be updated to state the effect more clearly: "Allocate the logger on the heap so that its destructor is not run at process exit. This prevents segfaults when threads attempt to log after the main thread has exited."

In terms of "could this lead to memory exhaustion in some pathological case?", the only thing I could come up with as even an outside possibility would be someone looping dlopen/dlclose on the order of $TOTAL_RAM / sizeof(Logger) times; theoretically, that could leak enough Logger instances. However, I don't know enough about how static globals interact with dlopen to say exactly what would happen there. I could see both leaking and not leaking, depending on whether the static initializers are re-run on the second and subsequent dlopen calls or are marked as having already run. I'd also add that if it did leak in that case, I'd be vaguely surprised if this were the only thing leaked, since libtiledb.so was never designed to be dlopen'ed in the first place.
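
For scale, a rough back-of-the-envelope with assumed numbers (neither figure is from the real code): at sizeof(Logger) ≈ 512 bytes and 16 GiB of RAM, exhausting memory would take about 16 * 2^30 / 512 ≈ 33.5 million dlopen/dlclose cycles, far beyond any plausible usage.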

@ypatia (Member) left a comment

Thanks for the clarifications!

@rroelke rroelke requested a review from a team as a code owner June 3, 2025 17:14
@rroelke rroelke merged commit e6dfe4f into main Jun 3, 2025
56 checks passed
@rroelke rroelke deleted the pd/sc-64412/fix-process-exit-segfaults branch June 3, 2025 20:37