Description
What happened?
It was detected that the resource-usage-tracker service did not have a lock inside Redis and was reporting the following error: redis.exceptions.LockNotOwnedError: Cannot reacquire a lock that's no longer owned.
What triggered this situation?
The cause is currently unknown.
What could trigger this situation?
- The lock is removed manually or somehow expires.
- The Redis database becomes unavailable for a short amount of time (a minimal reproduction of the resulting error is sketched below).
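For illustration only, here is a minimal, hypothetical way to surface the same error outside the service, assuming a local Redis instance and redis-py >= 5 (the lock name is invented):

    import asyncio

    from redis.asyncio import Redis
    from redis.exceptions import LockNotOwnedError


    async def main() -> None:
        client = Redis.from_url("redis://localhost:6379")
        lock = client.lock("some-protected-resource", timeout=5)
        await lock.acquire()

        # simulate the key expiring or being removed manually while the
        # lock is still believed to be held
        await client.delete("some-protected-resource")

        try:
            # reacquire() refreshes the TTL, but the key (and its token) is gone:
            # redis.exceptions.LockNotOwnedError: Cannot reacquire a lock
            # that's no longer owned.
            await lock.reacquire()
        except LockNotOwnedError as err:
            print(f"lock was lost: {err}")
        finally:
            await client.aclose()


    asyncio.run(main())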
Why is this bad?
Imagine having multiple instances relying on this lock: if it is somehow freed (and lock_context does nothing about it), a different instance will acquire it and use the resource protected by the lock as if it were free.
How does it present itself?
Consider the following code:
    async with lock_context(...):
        # some user defined
        # code here

Two situations compete to produce the issue:
- lock_context creates a task which raises redis.exceptions.LockNotOwnedError, but only logs the issue without handling it.
- lock_context (a context manager) has no way of stopping the execution of the code defined by the user (a rough sketch of this pattern follows).
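The pattern looks roughly like this (a simplification for illustration, not the actual lock_context implementation):

    import asyncio
    import logging
    from contextlib import asynccontextmanager, suppress

    from redis.asyncio.lock import Lock
    from redis.exceptions import LockError, LockNotOwnedError

    logger = logging.getLogger(__name__)


    @asynccontextmanager
    async def lock_context_sketch(lock: Lock, extend_interval: float):
        await lock.acquire()

        async def _auto_extend() -> None:
            while True:
                await asyncio.sleep(extend_interval)
                try:
                    await lock.reacquire()
                except LockNotOwnedError:
                    # situation 1: the failure is only logged, nothing else happens
                    logger.exception("lock is no longer owned, but nobody is told")

        extend_task = asyncio.create_task(_auto_extend())
        try:
            # situation 2: whatever runs in the `async with` body is outside this
            # context manager's control and cannot be stopped from here
            yield lock
        finally:
            extend_task.cancel()
            with suppress(LockError):
                await lock.release()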
Is there a way to reproduce it in a test?
Yes, apply the changes from changes.diff.zip to your repo.
Then run one of the following tests to make the issue appear:
- tests/test_redis_utils.py::test_possible_regression_lock_extension_fails_if_key_is_missing
- tests/test_redis_utils.py::test_possible_regression_first_extension_delayed_and_expires_key
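The actual tests are in changes.diff.zip; as a rough idea, such a regression test could look like this (hypothetical code, reusing the lock_context_sketch helper from above and assuming pytest-asyncio and a local Redis):

    import asyncio

    import pytest
    from redis.asyncio import Redis


    @pytest.mark.asyncio
    async def test_user_code_keeps_running_after_lock_key_is_missing() -> None:
        client = Redis.from_url("redis://localhost:6379")
        lock_name = "some-protected-resource"
        lock = client.lock(lock_name, timeout=2)

        async with lock_context_sketch(lock, extend_interval=1):
            # remove the key behind the lock's back
            await client.delete(lock_name)
            # give the extension task time to fail (and only log)
            await asyncio.sleep(5)
            # the body is still running even though the lock is gone: this is the bug
            assert not await client.exists(lock_name)

        await client.aclose()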
What was already tried?
To mitigate the issue, it was proposed to try to stop the user-defined code.
The only possible solution for a context manager is using signals (which are only available on Linux).
When running tests/test_redis_utils.py::test_context_manager_timing_out, the code hangs unexpectedly inside the signal handler function. This means the entire asyncio runtime is halted.
This approach cannot be used.
Running tests/test_redis_utils.py::test_context_manager_timing_out produces a very unexpected result:
def handler(signum, frame):
> raise RuntimeError(f"Operation timed out after {after} seconds")
E RuntimeError: Operation timed out after 10 seconds
tests/test_redis_utils.py:190: RuntimeError
> /home/silenthk/work/pr-osparc-redis-lock-issues/packages/service-library/tests/test_redis_utils.py(190)handler()
-> raise RuntimeError(f"Operation timed out after {after} seconds")
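For reference, the signal-based idea is roughly the following (a reconstruction, not the exact test code); the handler is only usable on Linux and, as shown above, ends up halting the asyncio runtime:

    import signal
    from contextlib import contextmanager


    @contextmanager
    def raise_after(after: float):
        # SIGALRM-based timeout: the handler interrupts whatever the main thread
        # is doing by raising; this is what misbehaves under asyncio
        def handler(signum, frame):
            raise RuntimeError(f"Operation timed out after {after} seconds")

        previous = signal.signal(signal.SIGALRM, handler)
        signal.setitimer(signal.ITIMER_REAL, after)
        try:
            yield
        finally:
            signal.setitimer(signal.ITIMER_REAL, 0)  # disarm the timer
            signal.signal(signal.SIGALRM, previous)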
Different proposals
Feel free to suggest more below.
- @GitHK: I would do the following. The locking mechanism should receive a task (which runs the user-defined code) which it can cancel if something goes wrong. This solution no longer uses a context manager, but relies on a battle-tested method of stopping the running user-defined code. A rough sketch follows below.
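For example (the function and exception names below are invented, not an existing servicelib API):

    import asyncio
    from collections.abc import Awaitable, Callable
    from contextlib import suppress

    from redis.asyncio.lock import Lock
    from redis.exceptions import LockError


    class LockLostError(RuntimeError):
        """Raised when the Redis lock could not be kept alive."""


    async def run_exclusively(
        lock: Lock,
        user_code: Callable[[], Awaitable[None]],
        extend_interval: float,
    ) -> None:
        await lock.acquire()
        user_task = asyncio.create_task(user_code())
        try:
            while True:
                done, _ = await asyncio.wait({user_task}, timeout=extend_interval)
                if done:
                    await user_task  # re-raise anything the user code raised
                    return
                try:
                    await lock.reacquire()
                except LockError as err:
                    # the lock was lost: cancel the user code instead of only logging
                    user_task.cancel()
                    raise LockLostError("lock lost, user code was cancelled") from err
        finally:
            user_task.cancel()
            with suppress(LockError):
                await lock.release()

Task cancellation is the standard asyncio way of stopping running code, which is what makes this approach more robust than signal-based interruption.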