@petermm
Contributor

@petermm petermm commented Jan 10, 2026

A user reported intermittent but reproducible deadlocks when testing the httpd webserver. This PR fixes the underlying ABBA deadlock.

Entirely by AI:
https://ampcode.com/threads/T-019ba1c2-2a7f-77c7-bd33-ce9f303152a2

Verified as a fix by repeating load testing for multiple hours.

Summary

Fix a lock ordering inversion that causes deadlocks under SMP on ESP32 (and potentially other platforms) when sockets are used under heavy load.

Problem

enif_monitor_process and enif_demonitor_process acquire locks in opposite orders:

| Function | Lock Order |
|----------|------------|
| enif_monitor_process | processes_table → monitors |
| enif_demonitor_process | monitors → processes_table |
| destroy_resource_monitors | monitors → processes_table |

This creates an ABBA deadlock when two threads call these functions concurrently—one holds processes_table waiting for monitors, while the other holds monitors waiting for processes_table.
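
The interleaving described above can be replayed step by step. The sketch below is only an illustration: the two mutexes and the function name are hypothetical stand-ins, not the project's real locks, and both "threads" are simulated in a single thread with `pthread_mutex_trylock` so that the example reports the blocked state instead of hanging.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two locks named in the table above. */
static pthread_mutex_t abba_processes_table_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t abba_monitors_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Replays the problematic interleaving in one thread.
 * Returns true when both parties would block on their second lock.
 */
bool abba_interleaving_deadlocks(void)
{
    /* "Thread A" (monitor path) takes processes_table first. */
    pthread_mutex_lock(&abba_processes_table_lock);
    /* "Thread B" (demonitor path) takes monitors first. */
    pthread_mutex_lock(&abba_monitors_lock);

    /* A now wants monitors, B now wants processes_table. */
    int a_second = pthread_mutex_trylock(&abba_monitors_lock);
    int b_second = pthread_mutex_trylock(&abba_processes_table_lock);

    /* With blocking locks, neither side could ever continue. */
    bool deadlocked = (a_second == EBUSY) && (b_second == EBUSY);

    /* Release the first locks so the demo can be run again. */
    pthread_mutex_unlock(&abba_monitors_lock);
    pthread_mutex_unlock(&abba_processes_table_lock);
    return deadlocked;
}
```

With real blocking calls in place of `pthread_mutex_trylock`, each side would wait indefinitely for a lock that the other side never releases.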

The issue is triggered by otp_socket.c, which under load calls both the monitor and demonitor functions from NIFs, the select thread, and monitor callbacks.

With AVM_NO_SMP, synclist_wrlock is a no-op so no deadlock occurs, which explains why disabling SMP works around the issue.
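
As a rough illustration of why a non-SMP build cannot hit this deadlock, the sketch below compiles the lock helpers down to no-ops when `AVM_NO_SMP` is defined. The `demo_` names and the struct layout are assumptions made for this example, not the project's real definitions.

```c
#include <pthread.h>

/* Hypothetical list type: a lock plus some protected state. */
struct demo_synclist {
    pthread_mutex_t mutex;
    int item_count;
};

#ifndef AVM_NO_SMP
/* SMP build: the helpers really take and release the lock. */
static void demo_synclist_wrlock(struct demo_synclist *list)
{
    pthread_mutex_lock(&list->mutex);
}

static void demo_synclist_unlock(struct demo_synclist *list)
{
    pthread_mutex_unlock(&list->mutex);
}
#else
/* Non-SMP build: nothing is locked, so nothing can deadlock. */
static void demo_synclist_wrlock(struct demo_synclist *list)
{
    (void) list;
}

static void demo_synclist_unlock(struct demo_synclist *list)
{
    (void) list;
}
#endif

/* Callers look identical in both build configurations. */
void demo_synclist_add(struct demo_synclist *list)
{
    demo_synclist_wrlock(list);
    list->item_count++;
    demo_synclist_unlock(list);
}
```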

Fix

Change enif_monitor_process to acquire locks in the same order as the other functions: monitors → processes_table.
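
A minimal sketch of why a single agreed order helps, assuming every code path follows it. The mutexes and the function name are hypothetical, and both parties are again simulated in one thread with `pthread_mutex_trylock`.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins; agreed order is monitors, then processes_table. */
static pthread_mutex_t ordered_monitors_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t ordered_processes_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns true when both simulated parties complete their critical sections. */
bool consistent_order_makes_progress(void)
{
    /* "Thread A" takes the first lock in the agreed order. */
    pthread_mutex_lock(&ordered_monitors_lock);

    /* "Thread B" also starts with monitors, so it waits holding nothing. */
    bool b_waits_empty_handed =
        pthread_mutex_trylock(&ordered_monitors_lock) == EBUSY;

    /* B holds no lock, so A can take its second lock and finish. */
    bool a_got_second = pthread_mutex_trylock(&ordered_processes_lock) == 0;
    if (a_got_second) {
        pthread_mutex_unlock(&ordered_processes_lock);
    }
    pthread_mutex_unlock(&ordered_monitors_lock);

    /* Once A is done, B can run its whole critical section. */
    bool b_got_first = pthread_mutex_trylock(&ordered_monitors_lock) == 0;
    bool b_got_second = false;
    if (b_got_first) {
        b_got_second = pthread_mutex_trylock(&ordered_processes_lock) == 0;
        if (b_got_second) {
            pthread_mutex_unlock(&ordered_processes_lock);
        }
        pthread_mutex_unlock(&ordered_monitors_lock);
    }

    return b_waits_empty_handed && a_got_second && b_got_first && b_got_second;
}
```

The guarantee only holds if no other code path takes the two locks in the reverse order.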

Testing

  • Tested on ESP32 with SMP enabled under heavy socket load, by @schnittchen

These changes are made under both the "Apache 2.0" and the "GNU Lesser General
Public License 2.1 or later" license terms (dual license).

SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later

https://ampcode.com/threads/T-019ba1c2-2a7f-77c7-bd33-ce9f303152a2

## Summary

Fix a lock ordering inversion that causes deadlocks under SMP on ESP32 (and potentially other platforms) when sockets are used under heavy load.

## Problem

`enif_monitor_process` and `enif_demonitor_process` acquire locks in opposite orders:

| Function | Lock Order |
|----------|------------|
| `enif_monitor_process` | `processes_table` → `monitors` |
| `enif_demonitor_process` | `monitors` → `processes_table` |
| `destroy_resource_monitors` | `monitors` → `processes_table` |

This creates an ABBA deadlock when two threads call these functions concurrently—one holds `processes_table` waiting for `monitors`, while the other holds `monitors` waiting for `processes_table`.

The issue is triggered by `otp_socket.c`, which under load calls both the monitor and demonitor functions from NIFs, the select thread, and monitor callbacks.

With `AVM_NO_SMP`, `synclist_wrlock` is a no-op so no deadlock occurs, which explains why disabling SMP works around the issue.

## Fix

Change `enif_monitor_process` to acquire locks in the same order as the other functions: `monitors` → `processes_table`.

## Testing

- Tested on ESP32 with SMP enabled under heavy socket load
- Verified no regression on non-SMP builds

Signed-off-by: Peter M <[email protected]>
@pguyot
Collaborator

pguyot commented Jan 11, 2026

There is another code path with the processes_table → monitors lock order: context_process_process_info_request_signal calls context_get_process_info with the process table locked, and context_get_process_info may take the resource type monitors lock.

So ultimately, the lock order should be the other way around, or enif_demonitor_process and destroy_resource_monitors should simply never hold the process table lock and the resource type monitors lock at the same time.

https://ampcode.com/threads/T-019ba1c2-2a7f-77c7-bd33-ce9f303152a2

Fix lock ordering inversion causing deadlocks on ESP32 SMP under heavy socket load.

The functions enif_demonitor_process and destroy_resource_monitors previously
acquired locks in the order: monitors -> processes_table. However,
enif_monitor_process and context_get_process_info (via resource_monitor_to_resource)
use the opposite order: processes_table -> monitors. This created an ABBA
deadlock under SMP when these code paths ran concurrently.

The fix uses a two-phase approach in both functions:
- Phase 1: Acquire monitors lock, find/remove entries, release monitors lock
- Phase 2: Acquire process lock, send signals, release process lock

This ensures the two locks are never held simultaneously, eliminating the
deadlock while preserving correct semantics. If the target process dies
between phases, the demonitor signal is simply skipped (harmless no-op).
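
The two phases can be sketched roughly as follows. This is not the actual patch: the data structures, the `demo_` names, and the signal counter are assumptions made for illustration only.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_MONITORS 8
#define DEMO_MAX_PROCESSES 16

/* Hypothetical monitor entry: a reference and the monitored process id. */
struct demo_monitor {
    bool in_use;
    int ref;
    int target_pid;
};

static pthread_mutex_t demo_monitors_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t demo_processes_lock = PTHREAD_MUTEX_INITIALIZER;
static struct demo_monitor demo_monitors[DEMO_MAX_MONITORS];
static bool demo_process_alive[DEMO_MAX_PROCESSES];
static int demo_signals_sent;

/* Returns true if a monitor with this ref existed and was removed. */
bool demo_demonitor(int ref)
{
    int target_pid = -1;

    /* Phase 1: hold only the monitors lock while finding and removing. */
    pthread_mutex_lock(&demo_monitors_lock);
    for (size_t i = 0; i < DEMO_MAX_MONITORS; i++) {
        if (demo_monitors[i].in_use && demo_monitors[i].ref == ref) {
            target_pid = demo_monitors[i].target_pid;
            demo_monitors[i].in_use = false;
            break;
        }
    }
    pthread_mutex_unlock(&demo_monitors_lock);

    if (target_pid < 0 || target_pid >= DEMO_MAX_PROCESSES) {
        return false;
    }

    /* Phase 2: hold only the process lock while signalling. */
    pthread_mutex_lock(&demo_processes_lock);
    if (demo_process_alive[target_pid]) {
        demo_signals_sent++; /* stands in for sending a demonitor signal */
    }
    /* If the process died between phases, the signal is simply skipped. */
    pthread_mutex_unlock(&demo_processes_lock);
    return true;
}
```

Because each phase releases its lock before the next one starts, no thread in this sketch ever waits for one lock while holding the other.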

Also fixes a pre-existing bug where enif_demonitor_process failed to unlock
the monitors list on the resource_type mismatch error path.
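
One common way to avoid this class of bug is to route every exit through a single unlock point. The sketch below illustrates that pattern under assumed names. It is not the code from this change, and the type check is reduced to an integer comparison.

```c
#include <pthread.h>

enum {
    DEMO_OK = 0,
    DEMO_BAD_TYPE = -1
};

/* Hypothetical stand-in for the monitors list lock. */
static pthread_mutex_t demo_list_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Looks up an entry under the lock. The type-mismatch path jumps to the
 * same cleanup label as the success path, so the lock is always released.
 */
int demo_checked_demonitor(int expected_type, int actual_type)
{
    int result = DEMO_OK;

    pthread_mutex_lock(&demo_list_lock);

    if (actual_type != expected_type) {
        result = DEMO_BAD_TYPE;
        goto out; /* an early return here would leak the lock */
    }

    /* ... the matching entry would be removed here ... */

out:
    pthread_mutex_unlock(&demo_list_lock);
    return result;
}
```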

Signed-off-by: Peter M <[email protected]>