[UPDATED: TclX patch available] Potential solution to "Make TclX's signal trap handlers safe to use with threaded Tcl" #32

@fredericbonnet

This issue is not TclX-specific: it is a bug (or shortcoming) in the Tcl core, and TclX just happens to be the simplest way to expose it at script level. The ticket below provides such a script, but the problem can also be reproduced through the C API, more specifically by combining Tcl async handlers with the low-level signal handlers on which TclX relies.

https://core.tcl.tk/tcl/tktview/f4f44174

I was eventually able to reproduce the problem after several hours, and the deadlock is indeed caused by the async thread deadlocking itself when it attempts to lock its mutex twice.

The bug is difficult to reproduce because it is very timing-sensitive. However, one can force the hand of destiny by artificially slowing down the async thread, which is what I did, and the process then deadlocks immediately. Tcl_AsyncInvoke has a mutex-protected loop over its registered handlers, so I just added a sleep(0) inside the loop, and the test script hangs immediately for the exact same reason (double locking).

Tcl uses pthreads on Unix and Win32 critical sections on Windows. Win32 critical sections are reentrant, but pthread mutexes are not by default (you have to request the PTHREAD_MUTEX_RECURSIVE mutex type, and this feature is not available on all systems). So when the async thread is interrupted by a signal while in the middle of a mutex-protected operation, it deadlocks itself on Unix but not on Windows.
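
To illustrate the difference between pthread mutex types, here is a minimal sketch that locks a mutex twice from the same thread and reports the result of the second attempt. It is not taken from the Tcl sources, and the helper name second_lock_result is hypothetical. It uses PTHREAD_MUTEX_ERRORCHECK rather than a default mutex so that the failure shows up as an error code instead of an actual hang; with a default mutex the second lock is undefined behavior and typically deadlocks.

```c
#define _XOPEN_SOURCE 700   /* expose PTHREAD_MUTEX_* types on strict compilers */
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper: lock a mutex of the given type twice from the
 * calling thread and return the result of the second lock attempt. */
static int second_lock_result(int type)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t mutex;
    int rc;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, type);
    pthread_mutex_init(&mutex, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&mutex);
    rc = pthread_mutex_lock(&mutex);    /* second lock, same thread */
    if (rc == 0) {
        pthread_mutex_unlock(&mutex);   /* balance the nested lock */
    }
    pthread_mutex_unlock(&mutex);
    pthread_mutex_destroy(&mutex);
    return rc;
}
```

Called with PTHREAD_MUTEX_RECURSIVE, the second lock should succeed and return 0; called with PTHREAD_MUTEX_ERRORCHECK, it should fail with EDEADLK, which is the situation that turns into a silent deadlock with a default mutex.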

The Tcl docs say:

The result of locking a mutex twice from the same thread is undefined. On some platforms it will result in a deadlock.

So this is the expected behavior. However, nothing prevents the core from using reentrant mutexes on all platforms that Tcl supports. Implementing reentrancy on top of non-reentrant mutexes is a trivial task, so OS support is a non-issue. I'm going to write a TIP to make Tcl_Mutex reentrant on all platforms starting at version 8.7.

I've already implemented a quick hack to make mutexes reentrant on Unix:

  • a first version uses the native PTHREAD_MUTEX_RECURSIVE,
  • a second version uses regular mutexes with a per-thread call counter.

Both versions fix the deadlock problem with or without my extra sleep(0), and with no impact on Tcl tests.
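
For readers unfamiliar with the second approach, the idea can be sketched roughly as follows. This is not the actual patch: it assumes a single global mutex and a C11 thread-local counter, and the names plain_mutex, lock_depth, reentrant_lock, reentrant_unlock and current_depth are all hypothetical. Only the outermost lock and unlock calls in a thread touch the underlying non-recursive mutex; nested calls just adjust the counter.

```c
#include <pthread.h>

/* Assumption: one global, plain (non-recursive) mutex is being wrapped. */
static pthread_mutex_t plain_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Per-thread nesting depth for plain_mutex (C11 thread-local storage). */
static _Thread_local int lock_depth = 0;

/* Acquire the mutex, or just bump the depth if this thread already holds it. */
static void reentrant_lock(void)
{
    if (lock_depth++ == 0) {
        pthread_mutex_lock(&plain_mutex);
    }
}

/* Release the mutex only when the outermost lock is being undone. */
static void reentrant_unlock(void)
{
    if (--lock_depth == 0) {
        pthread_mutex_unlock(&plain_mutex);
    }
}

/* Report the calling thread's current nesting depth. */
static int current_depth(void)
{
    return lock_depth;
}
```

With this scheme a thread can call reentrant_lock() again while it already holds the mutex without blocking on itself. A general-purpose version would need one counter per mutex per thread, and a version meant to be entered from a signal handler would need more care than this sketch takes.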
