Description
Problem
ASF CI runs are showing segmentation faults during regression test execution.
Here's a recent fedora CI job:
https://ci.trafficserver.apache.org/view/Github/job/Github_Builds/job/fedora/7240/console
[Dec 31 17:54:47.229] traffic_crashlo NOTE: crashlog started, target=5436, debug=false syslog=true, uid=1200 euid=1200
[Dec 31 17:54:47.315] traffic_crashlo WARNING: received 0 of 128 expected signal info bytes
[Dec 31 17:54:47.315] traffic_crashlo WARNING: received 0 of 968 expected thread context bytes
+ popd
[Dec 31 17:54:47.315] traffic_crashlo NOTE: logging to 0x1004880
[Dec 31 17:54:47.315] traffic_crashlo NOTE: readlink failed with No such file or directory
~/workspace/Github_Builds/fedora/src
[Dec 31 17:54:47.315] traffic_crashlo ERROR: wrote crash log to /tmp/ats/var/log/trafficserver/crash-2025-12-31-175447.log
Because the crash occurs at shutdown, the test does not fail: the regression tests all pass, so the CI job passes. Nevertheless, the process typically crashes on shutdown.
Investigation
Diagnostic logging showed that the crash is caused by a use-after-free in the DbgCtl objects. Despite careful design to handle static initialization/destruction order issues, the reference counting mechanism does not seem to prevent a use-after-free of the std::map.
The rest of this description is from Cursor. I include it in case it is helpful. I'm not entirely convinced that its root cause, in which one thread deleting the tag registry affects the other threads, is correct. It is clear, however, that simply not freeing the std::map registry does make the test stable.
Root Cause
The Reference Counting Issue
The DbgCtl class uses a shared registry with reference counting:
- Each DbgCtl object increments registry_reference_count on construction.
- Each DbgCtl destructor calls _rm_reference(), which decrements the count.
- When ref_count reaches 0, the registry (including the std::map of all tags) is deleted.
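To make the mechanism above concrete, here is a minimal sketch of a reference-counted registry. It is not the actual DbgCtl code: the names Ctl, add_reference, rm_reference, registry_alive and registry_freed_after_last_ctl are hypothetical and only modeled on the description in this issue. It shows that the map, and every entry a handle points into, is freed the moment the count returns to 0.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <string>

namespace sketch {

using TagMap = std::map<std::string, bool>; // tag name -> enabled flag

// Hypothetical shared state, standing in for the real registry.
std::mutex  g_mutex;
std::size_t g_ref_count = 0;
TagMap     *g_map       = nullptr;

// Take a reference, creating the map on first use, and return a pointer
// to the entry for this tag (std::map entries have stable addresses).
const TagMap::value_type *add_reference(const std::string &tag) {
  std::lock_guard<std::mutex> lock(g_mutex);
  if (g_ref_count++ == 0) {
    g_map = new TagMap;
  }
  return &*g_map->emplace(tag, false).first;
}

// Drop a reference; the last one out deletes the map and all its entries.
void rm_reference() {
  std::lock_guard<std::mutex> lock(g_mutex);
  if (--g_ref_count == 0) {
    delete g_map;
    g_map = nullptr;
  }
}

bool registry_alive() {
  std::lock_guard<std::mutex> lock(g_mutex);
  return g_map != nullptr;
}

// Hypothetical handle: holds a raw pointer into the shared map.
class Ctl {
public:
  explicit Ctl(const std::string &tag) : ptr_(add_reference(tag)) {}
  ~Ctl() { rm_reference(); }
  Ctl(const Ctl &)            = delete;
  Ctl &operator=(const Ctl &) = delete;

  bool on() const { return ptr_->second; } // only valid while the map exists

private:
  const TagMap::value_type *ptr_;
};

// Returns true if the map exists while handles live and is gone afterwards.
bool registry_freed_after_last_ctl() {
  {
    Ctl a("http");
    Ctl b("dns");
    if (!registry_alive() || a.on() || b.on()) {
      return false;
    }
  } // both handles destruct here, count goes 2 -> 1 -> 0
  return !registry_alive();
}

} // namespace sketch
```

In this sketch every handle is counted, so the map cannot disappear while a counted handle exists. Any pointer into the map that outlives the count reaching 0, for whatever reason, would be dangling.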
The fatal flaw: The reference count tracks how many DbgCtl objects currently exist, but it cannot account for whether those objects will be accessed before they're destroyed.
How the Crash Occurs
- Multiple Threads/Contexts: The application has DbgCtl objects across multiple threads. Some are static/global, some are thread-local, and some may be in function scopes.
- Thread Exit During Execution: When a thread exits mid-execution (not just at program exit), all its DbgCtl objects destruct:
    Thread A exits → its DbgCtl objects destruct → _rm_reference() called repeatedly
    ref_count: 17 → 16 → 15 → ... → 1 → 0
- Registry Deleted Prematurely: When the last DbgCtl from Thread A destructs, ref_count hits 0, and the registry (with its std::map containing all 229 tag entries) is deleted.
- Dangling Pointers in Other Threads: Thread B is still running and has DbgCtl objects with _ptr pointing into the now-deleted registry:
    bool DbgCtl::on() const {
      // ...
      if (!_ptr->second) { // ← Accessing freed memory!
        return false;
      }
      // ...
    }
- Crash: Thread B calls .on() or .tag() on its DbgCtl → accesses freed memory → segmentation fault.
Evidence from Diagnostics
From the diagnostic logs during crash reproduction:
Registry 1 (main traffic_server process):
- Created with ref_count starting at 0
- Grew to ref_count = 351 (351 DbgCtl objects created)
- Contains 229 unique tags in the registry map
During test execution (NOT at program exit):
ref_count: 17 → 16 → ... → 2 → 1 → 0
DEBUG: ~Registry() - deleting registry at 0xc6b930 with 229 entries
DEBUG: ~Registry() - finished, registry deleted
[Tests continue running...]
[CRASH - traffic_crashlog invoked]
The crash happens during active test execution, not during static destruction at program exit. This proves that:
- Some threads/scopes had their DbgCtl objects destruct
- The registry was deleted when their ref_count contributions were removed
- Other threads were still running with dangling _ptr pointers
Why Reference Counting Cannot Solve This
The reference counting is working exactly as designed, but the design cannot handle this scenario:
- What it tracks: Number of currently existing DbgCtl objects
- What it cannot track: Whether those objects will be accessed before destruction
- The gap: When Thread A's DbgCtls destruct and ref_count hits 0, Thread B's DbgCtls still exist but their _ptr members now point to freed memory
No amount of careful reference counting can solve this because:
- C++ provides no way to know if a pointer will be dereferenced in the future
- Thread exit order is non-deterministic
- Static destruction order across compilation units is undefined
- Some DbgCtl objects may be in shared libraries/plugins with different lifetimes
Proposed Fix
Use the Leaky Singleton pattern for the std::map registry and simply let the OS reclaim the memory at process shutdown. It has been verified that this addresses the crash.
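For reference, a minimal sketch of what a leaky singleton for the tag map could look like. This is an illustration under assumptions, not the actual patch: leaky_registry, registry_mutex, lookup_tag and pointers_stay_valid are hypothetical names. The map is heap-allocated on first use and never deleted, so no destructor runs for it at thread exit or during static destruction, and pointers into it stay valid for the life of the process.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

namespace leaky {

using TagMap = std::map<std::string, bool>; // tag name -> enabled flag

// Allocated once on first call (thread-safe since C++11) and intentionally
// never deleted; the OS reclaims the memory when the process exits.
TagMap &leaky_registry() {
  static TagMap *const registry = new TagMap;
  return *registry;
}

// The mutex is leaked as well so it also survives static destruction.
std::mutex &registry_mutex() {
  static std::mutex *const mutex = new std::mutex;
  return *mutex;
}

// Find or create the entry for a tag and return a pointer to it.
// std::map never relocates existing entries on insertion, so the
// returned pointer stays valid as long as the map itself exists.
const TagMap::value_type *lookup_tag(const std::string &tag) {
  std::lock_guard<std::mutex> lock(registry_mutex());
  return &*leaky_registry().emplace(tag, false).first;
}

// Returns true if repeated lookups yield the same, still-valid entry.
bool pointers_stay_valid() {
  const TagMap::value_type *first = lookup_tag("http");
  lookup_tag("dns"); // inserting another tag does not move "http"
  const TagMap::value_type *again = lookup_tag("http");
  return first == again && first->first == "http" && !first->second;
}

} // namespace leaky
```

The trade-off is that the map is never freed, so memory checkers may flag the allocation as still reachable at exit. With no reference count left, there is also no point at which a handle's pointer into the map can become dangling.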