
Conversation


@jason-famedly jason-famedly commented Jan 20, 2026

Fixes the symptoms of #19315 but not the underlying cause of the number growing so large in the first place.

ValueError: Exceeds the limit (4300 digits) for integer string conversion; use sys.set_int_max_str_digits() to increase the limit
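
For context, this error is CPython's default guard against huge int-to-str conversions (Python 3.11+); a minimal reproduction of the symptom, unrelated to the Synapse code itself, might look like:

```python
# Minimal reproduction of the symptom only (Python 3.11+); not Synapse code.
# CPython refuses to convert integers with more than 4300 decimal digits to a
# string unless sys.set_int_max_str_digits() is used to raise the limit.
big = 2 ** 20_000  # roughly 6,000 decimal digits

try:
    str(big)
except ValueError as exc:
    print(exc)  # Exceeds the limit (4300 digits) for integer string conversion; ...
```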

Copied from the original pull request on Famedly's Synapse repo (with some edits):

Basing the retry interval around 5 seconds leaves a big window of waiting, especially as this window is doubled on each retry, during which another worker could be making progress but cannot.

Right now, the retry interval in seconds looks like [0.2, 5, 10, 20, 40, 80, 160, 320, ...] and continues to double; logging about excessive times then starts, and the interval (relatively quickly) grows to an extremely large value implying a wait well past the heat death of the universe (for scale, 1 year is 31,536,000 seconds).

With this change, retry intervals in seconds should look more like:

[
0.2, 
0.4, 
0.8, 
1.6, 
3.2, 
6.4, 
12.8, 
25.6, 
51.2, 
102.4,  # 1.7 minutes
204.8,  # 3.41 minutes
409.6,  # 6.83 minutes
819.2,  # 13.65 minutes  < logging about excessive times will start here, 13th iteration
900,  # 15 minutes < never goes higher than this
]
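
As a minimal sketch of the capped doubling that produces this sequence (the 0.2 s start, 15-minute cap and 10-minute warning threshold are taken from the description above; the constant and function names are illustrative, not Synapse's):

```python
# Illustrative sketch of the capped doubling described above;
# not the actual Synapse implementation.
MAX_RETRY_INTERVAL_SECS = 15 * 60   # 900s cap, never exceeded
WARN_RETRY_INTERVAL_SECS = 10 * 60  # warn once the interval passes 10 minutes

def next_retry_interval(current: float) -> float:
    """Double the interval, but never beyond the cap."""
    return min(MAX_RETRY_INTERVAL_SECS, current * 2)

interval = 0.2
for iteration in range(1, 15):
    warn = " (excessive-time warning)" if interval > WARN_RETRY_INTERVAL_SECS else ""
    print(f"iteration {iteration}: {interval}s{warn}")
    interval = next_retry_interval(interval)
```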

Further suggested work in this area could be to define the cap, the retry interval starting point, and the multiplier depending on how frequently a given lock should be checked (see the data below for the reasoning). Increasing the jitter range may also be a good idea; a rough sketch follows.
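
Purely as an illustration of a wider jitter range (the 25% spread below is an assumed value, not something proposed in this PR):

```python
import random

def jittered_interval(interval: float, jitter_fraction: float = 0.25) -> float:
    """Spread the retry interval by +/- jitter_fraction so that waiting
    workers do not all wake and hit the lock in lockstep."""
    return random.uniform(interval * (1 - jitter_fraction),
                          interval * (1 + jitter_fraction))
```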

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)

@jason-famedly jason-famedly requested a review from a team as a code owner January 20, 2026 12:42
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

@@ -0,0 +1 @@
Prevent excessively long numbers for the retry interval of `WorkerLock`s. Contributed by Famedly.
Contributor

In #19390 (comment) (another Famedly PR),

I am submitting this PR as an employee of Famedly, who has signed the corporate CLA, and used my company email in the commit.

I assume the same applies here?

Author

Yes, this is correct. Will we have to state such each time we upstream changes?

Comment on lines +278 to +279
self._retry_interval = min(Duration(minutes=15).as_secs(), next * 2)
if self._retry_interval > Duration(minutes=10).as_secs(): # >12 iterations
Contributor

It would be nice to have these as constants, WORKER_LOCK_MAX_RETRY_INTERVAL and WORKER_LOCK_WARN_RETRY_INTERVAL (perhaps there is a better name), so we can share and better describe these values.
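
A hedged sketch of that suggestion applied to the two quoted lines (the constant names are the ones suggested above; Duration, self._retry_interval and next come from the diff, so this is not a standalone snippet):

```python
# Hypothetical: hoist the two durations from the diff into shared, documented constants.
WORKER_LOCK_MAX_RETRY_INTERVAL = Duration(minutes=15).as_secs()
WORKER_LOCK_WARN_RETRY_INTERVAL = Duration(minutes=10).as_secs()

# ...the quoted lines +278 to +279 would then read:
self._retry_interval = min(WORKER_LOCK_MAX_RETRY_INTERVAL, next * 2)
if self._retry_interval > WORKER_LOCK_WARN_RETRY_INTERVAL:  # >12 iterations
```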

Author

I had actually considered that, before also considering that a more flexible approach for different locks may be worth exploring. For example: when a lock is taken because an event is being persisted, the retry interval could be capped to a much smaller value, and the same for the logging of excessive times; whereas a lock for purging a room might start with a longer retry interval but keep the same cap.

Perhaps as defaults, if that exploration bears fruit. I shall add it to my notes for further work in this area, but I would rather do that separately. A rough sketch of the idea is below.
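
A hedged sketch of the per-lock idea, with hypothetical lock names and values that do not come from Synapse:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LockRetryPolicy:
    """Hypothetical per-lock retry tuning: start, cap and warn threshold in seconds."""
    initial_interval: float = 0.2
    max_interval: float = 15 * 60
    warn_interval: float = 10 * 60

# Illustrative defaults only: tighter cap for event persistence,
# slower start for long-running purges.
LOCK_RETRY_POLICIES = {
    "event_persistence": LockRetryPolicy(max_interval=60, warn_interval=30),
    "purge_room": LockRetryPolicy(initial_interval=5),
}

def policy_for(lock_name: str) -> LockRetryPolicy:
    return LOCK_RETRY_POLICIES.get(lock_name, LockRetryPolicy())
```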

Co-authored-by: Eric Eastwood <madlittlemods@gmail.com>

denzs commented Jan 23, 2026

After the issue occurred again in our prod:

2026-01-22 13:36:53.725 error synapse.handlers.worker_lock - 280 - WARNING - sentinel - Lock timeout is getting excessive: 5120s. There may be a deadlock.
2026-01-22 13:36:53.981 error synapse.handlers.worker_lock - 280 - WARNING - sentinel - Lock timeout is getting excessive: 10240s. There may be a deadlock.
2026-01-22 13:36:54.560 error synapse.handlers.worker_lock - 280 - WARNING - sentinel - Lock timeout is getting excessive: 20480s. There may be a deadlock.
2026-01-22 13:36:54.798 error synapse.handlers.worker_lock - 280 - WARNING - sentinel - Lock timeout is getting excessive: 40960s. There may be a deadlock.
2026-01-22 13:36:56.342 error synapse.handlers.worker_lock - 280 - WARNING - sentinel - Lock timeout is getting excessive: 81920s. There may be a deadlock.

My hypothesis would be: the issue is not primarily about the size of the growing timeout, but about the timeout being ignored altogether?

At least the logged timeout is not reflected in the timestamp deltas of the log lines?!

@jason-famedly
Author

> My hypothesis would be: the issue is not primarily about the size of the growing timeout, but about the timeout being ignored altogether?

Yes, there is more than one thing going on here. This fix (switching max() to min() and adjusting the iteration assumptions) only stops the obnoxiously long numbers that are trying to reach infinity from being introduced in the first place. The underlying cause is something else: the timeouts seem to not be honored, and there is also the question of what the request that triggers the situation is doing to cause the locks to pile up and not make progress in the first place. A small illustration of the max()/min() difference is below.
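
To make the max()-to-min() point concrete, an illustrative comparison (not the exact Synapse code; the 5-second floor matches the old sequence in the description above):

```python
# Illustrative only: why the switch matters. A doubling update that only
# enforces a floor grows without bound; one with a cap plateaus.
def old_style(interval: float) -> float:
    return max(5, interval * 2)        # floor at 5s, no ceiling -> unbounded growth

def new_style(interval: float) -> float:
    return min(15 * 60, interval * 2)  # ceiling at 15 minutes

interval_old = interval_new = 0.2
for _ in range(60):
    interval_old = old_style(interval_old)
    interval_new = new_style(interval_new)

print(interval_old)  # astronomically large after enough retries
print(interval_new)  # capped at 900.0
```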

