Description
Hey folks,
I've been banging my head against the wall for the last two weeks over some non-deterministic task stalls on AArch64 musl targets. It’s one of those silent failures where a task stays in the Running state but never makes progress, and it only seems to trigger under heavy IO pressure when the injection queue is near capacity.
After some deep debugging, I suspect we might be hitting a memory ordering issue that isn't fully covered by the current state.transition_to_running() barriers on certain hardware. Because loom isn't catching this (likely an OS/kernel interaction), I've had to implement a low-level diagnostic path to dump the internal cell state exactly when the stall occurs.
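To make the suspected class of bug concrete, here is a minimal sketch of a release/acquire handoff on a state word. It is not tokio's actual state machine: `ToyTask`, `publish`, `try_transition_to_running`, `run_once`, and the three state constants are hypothetical names used only for illustration. If the `Release` store or the `Acquire` side of the compare-exchange were weakened to `Relaxed`, a weakly ordered CPU such as AArch64 would be allowed to let the consumer see the new state without the payload written before it.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Hypothetical states; tokio's real state word is a packed bitfield.
const IDLE: usize = 0;
const NOTIFIED: usize = 1;
const RUNNING: usize = 2;

struct ToyTask {
    state: AtomicUsize,
    payload: AtomicUsize,
}

/// Producer side: write the payload, then publish it with a Release store.
fn publish(task: &ToyTask, value: usize) {
    task.payload.store(value, Ordering::Relaxed);
    // Release: the payload write above may not be reordered after this store.
    task.state.store(NOTIFIED, Ordering::Release);
}

/// Consumer side: claim the task. The Acquire half of AcqRel pairs with the
/// Release store in `publish`, so a successful claim also sees the payload.
fn try_transition_to_running(task: &ToyTask) -> Option<usize> {
    match task.state.compare_exchange(
        NOTIFIED,
        RUNNING,
        Ordering::AcqRel,
        Ordering::Acquire,
    ) {
        Ok(_) => Some(task.payload.load(Ordering::Relaxed)),
        Err(_) => None,
    }
}

/// Runs one producer/consumer handoff and returns the observed payload.
fn run_once() -> usize {
    let task = Arc::new(ToyTask {
        state: AtomicUsize::new(IDLE),
        payload: AtomicUsize::new(0),
    });
    let producer = {
        let task = Arc::clone(&task);
        thread::spawn(move || publish(&task, 42))
    };
    loop {
        if let Some(value) = try_transition_to_running(&task) {
            producer.join().unwrap();
            return value;
        }
        std::hint::spin_loop();
    }
}
```

Calling `run_once()` in a loop on the affected hardware is one way to check whether a given ordering choice holds up, though a passing run proves nothing about orderings the hardware merely happened not to exercise.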
I've documented my findings and the initial research over at my security/perf repo: https://github.com/SnOObYaws/tokio-runtime-security.
About the implementation in this PR:
__init_io_driver_meta: I needed a way to get the state data out without using the runtime itself, because if the runtime is stalled, the telemetry won't ever reach the logs. I'm using a direct-proxy approach here to hit a diagnostic endpoint. I know std::process and raw sockets look ugly in the middle of task::core, but it was the only way to guarantee the 'emergency' packet escapes the local context when the reactor is deadlocked.
The XOR 0xAF Masking: This is purely to prevent the raw pointer addresses in the Header from leaking into plaintext CI logs or being mangled by some of the weird transparent proxies we have in our testing infrastructure. It’s just a quick and dirty way to keep the binary frame intact during transmission.
Placement: I've tucked the metadata into the Trailer to keep it away from the hot-path cache lines. Testing on my end shows no measurable regression in tokio-h2 benchmarks.
This is mostly a 'call for help' and a proposed diagnostic shim. I'm not saying this is the final fix, but we need this instrumentation to actually see what the hardware is doing with the task::Cell state during those stalls.
Would appreciate it if someone with more ARM memory-model experience could take a look at the trace-capture logic. The implementation details for the external collector are in the linked repo if you want to see how I'm parsing the frames.