Commit f78a5f0

issue #605: ansible: share a sem_t instead of a pthread_mutex_t
The previous version quite reliably causes worker deadlocks within 10 minutes running:

    # 100 times:
    - import_playbook: integration/async/runner_one_job.yml
    # 100 times:
    - import_playbook: integration/module_utils/adjacent_to_playbook.yml

via .ci/soak/mitogen.sh with PLAYBOOK= set to the above playbook.

Attaching to the worker with gdb reveals it in an instruction immediately following a futex() call, which likely returned EINTR due to attaching gdb. Examining the pthread_mutex_t state reveals it to be completely unlocked.

pthread_mutex_t on Linux should have zero trouble living in shmem, so it's not clear how this deadlock is happening. Meanwhile POSIX semaphores are explicitly designed for cross-process use and have a completely different internal implementation, so try those instead. 1 hour of soaking reveals no deadlock.

This is about avoiding managing a lockable temporary file on disk to contain our counter, and somehow communicating a reference to it into subprocesses (despite the subprocess module closing inherited fds, etc), somehow deleting it reliably at exit, and somehow avoiding concurrent Ansible runs stepping on the same file. For now ctypes is still less pain.

A final possibility would be to abandon a shared counter and instead pick a CPU based on the hash of e.g. the new child's process ID. That would likely balance equally well, and might be worth exploring when making this code work on BSD.
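The hash-based alternative mentioned in the last paragraph could look roughly like the sketch below. It is not part of this commit: the function name `pick_cpu`, the `reserved` parameter, and the choice of CRC32 as the hash are all assumptions made for illustration.

```python
import os
import zlib


def pick_cpu(pid, cpu_count, reserved=0):
    """
    Hypothetical helper: map a process ID onto a CPU index without any
    shared counter or lock. The first `reserved` CPUs are skipped, on the
    assumption that they are kept for some other process.
    """
    usable = cpu_count - reserved
    if usable <= 0:
        raise ValueError('no usable CPUs left after reserving %d' % reserved)
    # CRC32 is deterministic across processes and Python versions; the
    # mask keeps the result unsigned on every interpreter.
    digest = zlib.crc32(str(pid).encode('ascii')) & 0xffffffff
    return reserved + (digest % usable)


if __name__ == '__main__':
    # Each child would call this with its own PID after fork().
    print(pick_cpu(os.getpid(), os.cpu_count() or 1))
```

Because every child derives its CPU from its own PID, nothing needs to live in shared memory. The balance depends on how PIDs are distributed, which is why the commit message only says it would "likely" balance equally well.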
1 parent 4fa760c commit f78a5f0

1 file changed: +13 -13 lines changed


ansible_mitogen/affinity.py

Lines changed: 13 additions & 13 deletions
@@ -92,37 +92,37 @@
     _libc = ctypes.CDLL(None, use_errno=True)
     _strerror = _libc.strerror
     _strerror.restype = ctypes.c_char_p
-    _pthread_mutex_init = _libc.pthread_mutex_init
-    _pthread_mutex_lock = _libc.pthread_mutex_lock
-    _pthread_mutex_unlock = _libc.pthread_mutex_unlock
+    _sem_init = _libc.sem_init
+    _sem_wait = _libc.sem_wait
+    _sem_post = _libc.sem_post
     _sched_setaffinity = _libc.sched_setaffinity
 except (OSError, AttributeError):
     _libc = None
     _strerror = None
-    _pthread_mutex_init = None
-    _pthread_mutex_lock = None
-    _pthread_mutex_unlock = None
+    _sem_init = None
+    _sem_wait = None
+    _sem_post = None
     _sched_setaffinity = None


-class pthread_mutex_t(ctypes.Structure):
+class sem_t(ctypes.Structure):
     """
-    Wrap pthread_mutex_t to allow storing a lock in shared memory.
+    Wrap sem_t to allow storing a lock in shared memory.
     """
     _fields_ = [
-        ('data', ctypes.c_uint8 * 512),
+        ('data', ctypes.c_uint8 * 128),
     ]

     def init(self):
-        if _pthread_mutex_init(self.data, 0):
+        if _sem_init(self.data, 1, 1):
             raise Exception(_strerror(ctypes.get_errno()))

     def acquire(self):
-        if _pthread_mutex_lock(self.data):
+        if _sem_wait(self.data):
             raise Exception(_strerror(ctypes.get_errno()))

     def release(self):
-        if _pthread_mutex_unlock(self.data):
+        if _sem_post(self.data):
             raise Exception(_strerror(ctypes.get_errno()))


@@ -133,7 +133,7 @@ class State(ctypes.Structure):
     the context of the new child process.
     """
     _fields_ = [
-        ('lock', pthread_mutex_t),
+        ('lock', sem_t),
         ('counter', ctypes.c_uint8),
     ]
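
To illustrate how a process-shared semaphore like the one in this diff can guard a counter across fork(), here is a minimal standalone sketch. It assumes Linux (or another platform where `ctypes.CDLL(None)` exposes `sem_init`, `sem_wait` and `sem_post`) and an anonymous shared `mmap`. The helper names `make_shared_state`, `next_counter` and `demo` are hypothetical and are not taken from affinity.py. Unlike the diff, it retries `sem_wait` on EINTR.

```python
import ctypes
import errno
import mmap
import os

_libc = ctypes.CDLL(None, use_errno=True)


class sem_t(ctypes.Structure):
    # Opaque storage, sized generously as in the diff above.
    _fields_ = [('data', ctypes.c_uint8 * 128)]


class State(ctypes.Structure):
    _fields_ = [('lock', sem_t), ('counter', ctypes.c_uint8)]


def make_shared_state():
    # Anonymous shared mapping: inherited across fork(), no file on disk.
    mem = mmap.mmap(-1, ctypes.sizeof(State))
    state = State.from_buffer(mem)
    # pshared=1 makes the semaphore usable across processes; an initial
    # value of 1 makes it behave like a mutex.
    if _libc.sem_init(ctypes.byref(state.lock), 1, 1):
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return mem, state


def next_counter(state):
    # Retry if a signal interrupts the wait.
    while _libc.sem_wait(ctypes.byref(state.lock)):
        err = ctypes.get_errno()
        if err != errno.EINTR:
            raise OSError(err, os.strerror(err))
    try:
        value = state.counter
        state.counter = (value + 1) % 256
        return value
    finally:
        _libc.sem_post(ctypes.byref(state.lock))


def demo(children=4, increments=50):
    mem, state = make_shared_state()
    pids = []
    for _ in range(children):
        pid = os.fork()
        if pid == 0:
            # Child: bump the shared counter, then exit without cleanup.
            for _ in range(increments):
                next_counter(state)
            os._exit(0)
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)
    return state.counter


if __name__ == '__main__':
    print(demo())
```

If the semaphore serializes the increments correctly, `demo()` returns `children * increments` (kept below 256 here because the counter is a `c_uint8`).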
