CI Crashes Due to DbgCtl Use-After-Free #12776

@bneradt

Problem

ASF CI runs are showing segmentation faults during regression test execution.

Here's a recent fedora CI job:
https://ci.trafficserver.apache.org/view/Github/job/Github_Builds/job/fedora/7240/console

[Dec 31 17:54:47.229] traffic_crashlo NOTE: crashlog started, target=5436, debug=false syslog=true, uid=1200 euid=1200
[Dec 31 17:54:47.315] traffic_crashlo WARNING: received 0 of 128 expected signal info bytes
[Dec 31 17:54:47.315] traffic_crashlo WARNING: received 0 of 968 expected thread context bytes
+ popd
[Dec 31 17:54:47.315] traffic_crashlo NOTE: logging to 0x1004880
[Dec 31 17:54:47.315] traffic_crashlo NOTE: readlink failed with No such file or directory
~/workspace/Github_Builds/fedora/src
[Dec 31 17:54:47.315] traffic_crashlo ERROR: wrote crash log to /tmp/ats/var/log/trafficserver/crash-2025-12-31-175447.log

Note that because the crash occurs at shutdown, the test does not fail: the regression tests all pass, so the CI job passes. Nevertheless, the process typically crashes on shutdown.

Investigation

Diagnostic logging showed that the crash is caused by a use-after-free in the DbgCtl objects. Despite a careful design to handle static initialization/destruction order issues, the reference counting mechanism does not seem to prevent a use-after-free of the std::map.

The rest of this description is from Cursor. I include it in case it is helpful. I'm not entirely convinced that its root cause, in which one thread deleting the tag registry impacts the other threads, is correct. It is clear, however, that simply not freeing the std::map registry does make the test stable.

Root Cause

The Reference Counting Issue

The DbgCtl class uses a shared registry with reference counting:

  • Each DbgCtl object increments registry_reference_count on construction.
  • Each DbgCtl destructor calls _rm_reference(), which decrements the count.
  • When ref_count reaches 0, the registry (including the std::map of all tags) is deleted.

The fatal flaw: The reference count tracks how many DbgCtl objects currently exist, but it cannot account for whether those objects will be accessed before they're destroyed.
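
To make the mechanism described above concrete, here is a minimal sketch of a reference-counted registry. It is a simplified model under stated assumptions, not the Traffic Server implementation: the class name SketchDbgCtl, the Shared struct, and the helper functions ref_count() and registry_exists() are all hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

// Hypothetical model of a reference-counted tag registry. All names here are
// illustrative assumptions, not the actual Traffic Server code.
class SketchDbgCtl {
public:
  using TagMap = std::map<std::string, bool>;

  explicit SketchDbgCtl(const std::string &tag) {
    Shared &s = shared();
    std::lock_guard<std::mutex> lock(s.mutex);
    if (s.registry == nullptr) {
      s.registry = new TagMap; // the first handle creates the registry
    }
    ++s.ref_count;
    // std::map never relocates its nodes, so this pointer stays valid
    // for as long as the map itself is alive.
    ptr_ = &*s.registry->emplace(tag, false).first;
  }

  ~SketchDbgCtl() {
    Shared &s = shared();
    std::lock_guard<std::mutex> lock(s.mutex);
    assert(s.ref_count > 0);
    if (--s.ref_count == 0) {
      delete s.registry; // the last handle frees the map and every tag entry
      s.registry = nullptr;
    }
  }

  // Copying is disabled so that every live handle is counted exactly once.
  SketchDbgCtl(const SketchDbgCtl &) = delete;
  SketchDbgCtl &operator=(const SketchDbgCtl &) = delete;

  bool on() const { return ptr_->second; }
  const std::string &tag() const { return ptr_->first; }

  static int ref_count() {
    Shared &s = shared();
    std::lock_guard<std::mutex> lock(s.mutex);
    return s.ref_count;
  }

  static bool registry_exists() {
    Shared &s = shared();
    std::lock_guard<std::mutex> lock(s.mutex);
    return s.registry != nullptr;
  }

private:
  struct Shared {
    std::mutex mutex;
    TagMap *registry = nullptr;
    int ref_count = 0;
  };

  // The bookkeeping struct is allocated once and never freed, so the sketch
  // itself does not depend on static destruction order.
  static Shared &shared() {
    static Shared *s = new Shared;
    return *s;
  }

  const TagMap::value_type *ptr_ = nullptr;
};
```

In this model the map is freed as soon as the count returns to zero, so any pointer into the map that outlives that moment is invalid. The sketch does not show how such a pointer could outlive the count in the real code.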

How the Crash Occurs

  1. Multiple Threads/Contexts: The application has DbgCtl objects across multiple threads. Some are static/global, some are thread-local, and some may be in function scopes.

  2. Thread Exit During Execution: When a thread exits mid-execution (not just at program exit), all its DbgCtl objects destruct:

    Thread A exits → its DbgCtl objects destruct → _rm_reference() called repeatedly
    ref_count: 17 → 16 → 15 → ... → 1 → 0
    
  3. Registry Deleted Prematurely: When the last DbgCtl from Thread A destructs, ref_count hits 0, and the registry (with its std::map containing all 229 tag entries) is deleted.

  4. Dangling Pointers in Other Threads: Thread B is still running and has DbgCtl objects with _ptr pointing into the now-deleted registry:

   bool DbgCtl::on() const {
     // ...
     if (!_ptr->second) {  // ← Accessing freed memory!
       return false;
     }
     // ...
   }
  5. Crash: Thread B calls .on() or .tag() on its DbgCtl → accesses freed memory → segmentation fault.
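
The hypothesized sequence in the steps above can be modeled without triggering undefined behavior. The following sketch swaps in std::shared_ptr and std::weak_ptr, which the real DbgCtl is not known to use, so that a handle can report that its registry is gone instead of dereferencing freed memory. ObservingHandle and handle_dangles_after_release() are hypothetical names, and the sketch only models the lifetime relationship. It does not explain how the real reference count reaches zero.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

using TagMap = std::map<std::string, bool>;

// Hypothetical handle that observes the registry without owning it, much like
// a raw _ptr, except that it can detect that the registry has been destroyed.
struct ObservingHandle {
  std::weak_ptr<TagMap> registry;
  std::string tag;

  // True once the registry this handle refers to has been freed.
  bool dangling() const { return registry.expired(); }

  // Safe lookup: returns false instead of touching a freed map.
  bool on() const {
    std::shared_ptr<TagMap> reg = registry.lock();
    if (!reg) {
      return false;
    }
    auto it = reg->find(tag);
    return it != reg->end() && it->second;
  }
};

// Models steps 3 and 4: the owner releases the registry while a handle that
// belongs to some other context still refers to it.
bool handle_dangles_after_release() {
  auto registry = std::make_shared<TagMap>();
  (*registry)["http"] = true;

  ObservingHandle handle{registry, "http"};
  bool was_on = handle.on(); // true while the registry is alive

  registry.reset(); // models ref_count reaching 0 and the map being deleted

  // A raw pointer would now point at freed memory; the weak_ptr reports it.
  return was_on && handle.dangling() && !handle.on();
}
```

With a raw pointer there is no equivalent of dangling(), which is why the failure in step 5 surfaces as a segmentation fault rather than a detectable error.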

Evidence from Diagnostics

From the diagnostic logs during crash reproduction:

Registry 1 (main traffic_server process):    
  - Created with ref_count starting at 0     
  - Grew to ref_count = 351 (351 DbgCtl objects created)                                  
  - Contains 229 unique tags in the registry map                                          

During test execution (NOT at program exit):                                              
  ref_count: 17 → 16 → ... → 2 → 1 → 0       
  DEBUG: ~Registry() - deleting registry at 0xc6b930 with 229 entries
  DEBUG: ~Registry() - finished, registry deleted

  [Tests continue running...]                

  [CRASH - traffic_crashlog invoked]         

The crash happens during active test execution, not during static destruction at program exit. This suggests that:

  • Some threads/scopes had their DbgCtl objects destruct
  • The registry was deleted when their ref_count contributions were removed
  • Other threads were still running with dangling _ptr pointers

Why Reference Counting Cannot Solve This

The reference counting is working exactly as designed, but the design cannot handle this scenario:

  • What it tracks: Number of currently existing DbgCtl objects
  • What it cannot track: Whether those objects will be accessed before destruction
  • The gap: When Thread A's DbgCtls destruct and ref_count hits 0, Thread B's DbgCtls still exist but their _ptr members now point to freed memory

No amount of careful reference counting can solve this because:

  1. C++ provides no way to know if a pointer will be dereferenced in the future
  2. Thread exit order is non-deterministic
  3. Static destruction order across translation units is unspecified
  4. Some DbgCtl objects may be in shared libraries/plugins with different lifetimes

Proposed Fix

Use the Leaky Singleton pattern for the std::map registry and simply let the OS reclaim the memory on process shutdown. It has been verified that this addresses the crash.
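
As a rough illustration of the pattern, the sketch below allocates the registry on first use and never deletes it, so pointers into the map remain valid for the rest of the process lifetime. It is a minimal sketch under stated assumptions and not the actual patch: LeakyTagRegistry, LeakyDbgCtl, find_or_add(), and size() are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <string>

// Hypothetical leaky singleton holding the tag map. Illustrative only.
class LeakyTagRegistry {
public:
  using TagMap = std::map<std::string, bool>;

  static LeakyTagRegistry &instance() {
    // Allocated on first use and intentionally never deleted. No destructor
    // runs at exit, so handles destroyed late cannot observe a freed map.
    static LeakyTagRegistry *const registry = new LeakyTagRegistry;
    return *registry;
  }

  // Returns a stable pointer to the entry for the tag, adding it if needed.
  const TagMap::value_type *find_or_add(const std::string &tag) {
    std::lock_guard<std::mutex> lock(mutex_);
    return &*tags_.emplace(tag, false).first;
  }

  std::size_t size() {
    std::lock_guard<std::mutex> lock(mutex_);
    return tags_.size();
  }

  LeakyTagRegistry(const LeakyTagRegistry &) = delete;
  LeakyTagRegistry &operator=(const LeakyTagRegistry &) = delete;

private:
  LeakyTagRegistry() = default;

  std::mutex mutex_;
  TagMap tags_;
};

// Hypothetical handle: no reference counting, and the destructor has nothing
// to release because the registry outlives every handle.
class LeakyDbgCtl {
public:
  explicit LeakyDbgCtl(const std::string &tag)
    : ptr_(LeakyTagRegistry::instance().find_or_add(tag)) {}

  bool on() const { return ptr_->second; }
  const std::string &tag() const { return ptr_->first; }

private:
  const LeakyTagRegistry::TagMap::value_type *ptr_;
};
```

The trade-off is that the map is never freed by the program, which leak checkers may report. The memory is bounded by the number of distinct tags.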
