
Conversation


@jason-famedly jason-famedly commented Jan 20, 2026

Fixes the symptoms of #19315 but not the underlying cause of the number growing so large in the first place.

ValueError: Exceeds the limit (4300 digits) for integer string conversion; use sys.set_int_max_str_digits() to increase the limit
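
For context, this error is CPython's default guard against huge int-to-str conversions (Python 3.11+); a minimal reproduction of the symptom, unrelated to the Synapse code itself, might look like:

```python
# Minimal reproduction of the symptom only (Python 3.11+); not Synapse code.
# CPython refuses to convert integers with more than 4300 decimal digits to a
# string unless sys.set_int_max_str_digits() is used to raise the limit.
big = 2 ** 20_000  # roughly 6,000 decimal digits

try:
    str(big)
except ValueError as exc:
    print(exc)  # Exceeds the limit (4300 digits) for integer string conversion; ...
```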

Copied from the original pull request on Famedly's Synapse repo (with some edits):

Basing the retry interval around 5 seconds leaves a big window of waiting, especially as this window is doubled on each retry, during which another worker could be making progress but cannot.

Right now, the retry interval in seconds looks like [0.2, 5, 10, 20, 40, 80, 160, 320, ...] and continues to double; logging about excessive times then starts, and the interval (relatively quickly) grows to an extremely large value implying a wait well past the heat death of the universe (for scale, 1 year is 31,536,000 seconds).

With this change, retry intervals in seconds should look more like:

[
0.2, 
0.4, 
0.8, 
1.6, 
3.2, 
6.4, 
12.8, 
25.6, 
51.2, 
102.4,  # 1.7 minutes
204.8,  # 3.41 minutes
409.6,  # 6.83 minutes
819.2,  # 13.65 minutes  < logging about excessive times will start here, 13th iteration
900,  # 15 minutes < never goes higher than this
]
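
As a minimal sketch of the capped doubling that produces this sequence (the 0.2 s start, 15-minute cap and 10-minute warning threshold are taken from the description above; the constant and function names are illustrative, not Synapse's):

```python
# Illustrative sketch of the capped doubling described above;
# not the actual Synapse implementation.
MAX_RETRY_INTERVAL_SECS = 15 * 60   # 900s cap, never exceeded
WARN_RETRY_INTERVAL_SECS = 10 * 60  # warn once the interval passes 10 minutes

def next_retry_interval(current: float) -> float:
    """Double the interval, but never beyond the cap."""
    return min(MAX_RETRY_INTERVAL_SECS, current * 2)

interval = 0.2
for iteration in range(1, 15):
    warn = " (excessive-time warning)" if interval > WARN_RETRY_INTERVAL_SECS else ""
    print(f"iteration {iteration}: {interval}s{warn}")
    interval = next_retry_interval(interval)
```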

Further suggested work in this area could be to define the cap, the retry interval starting point, and the multiplier depending on how frequently a given lock should be checked (see the data below for the reasoning). Increasing the jitter range may also be a good idea; a rough sketch follows.
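
Purely as an illustration of a wider jitter range (the 25% spread below is an assumed value, not something proposed in this PR):

```python
import random

def jittered_interval(interval: float, jitter_fraction: float = 0.25) -> float:
    """Spread the retry interval by +/- jitter_fraction so that waiting
    workers do not all wake and hit the lock in lockstep."""
    return random.uniform(interval * (1 - jitter_fraction),
                          interval * (1 + jitter_fraction))
```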

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)

@jason-famedly jason-famedly requested a review from a team as a code owner January 20, 2026 12:42
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

@@ -0,0 +1 @@
Prevent excessively long numbers for the retry interval of `WorkerLock`s. Contributed by Famedly.
Contributor

In #19390 (comment) (another Famedly PR),

I am submitting this PR as an employee of Famedly, who has signed the corporate CLA, and used my company email in the commit.

I assume the same applies here?

Author

Yes, this is correct. Will we have to state such each time we upstream changes?

Comment on lines +278 to +279
self._retry_interval = min(Duration(minutes=15).as_secs(), next * 2)
if self._retry_interval > Duration(minutes=10).as_secs(): # >12 iterations
Contributor

It would be nice to have these as constants, WORKER_LOCK_MAX_RETRY_INTERVAL and WORKER_LOCK_WARN_RETRY_INTERVAL (perhaps there is a better name), so we can share and better describe these values.
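
A hedged sketch of that suggestion applied to the two quoted lines (the constant names are the ones suggested above; Duration, self._retry_interval and next come from the diff, so this is not a standalone snippet):

```python
# Hypothetical: hoist the two durations from the diff into shared, documented constants.
WORKER_LOCK_MAX_RETRY_INTERVAL = Duration(minutes=15).as_secs()
WORKER_LOCK_WARN_RETRY_INTERVAL = Duration(minutes=10).as_secs()

# ...the quoted lines +278 to +279 would then read:
self._retry_interval = min(WORKER_LOCK_MAX_RETRY_INTERVAL, next * 2)
if self._retry_interval > WORKER_LOCK_WARN_RETRY_INTERVAL:  # >12 iterations
```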

Author

I had actually considered that, before also considering that a more flexible approach for different locks may be worth exploring. For example: when a lock is taken because an event is being persisted, the retry interval could be capped to a much smaller value, and the same for the logging of excessive times; whereas a lock for purging a room might start with a longer retry interval but keep the same cap.

Perhaps as defaults, if that exploration bears fruit. I shall add it to my notes for further work in this area, but I would rather do that separately. A rough sketch of the idea is below.
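
A hedged sketch of the per-lock idea, with hypothetical lock names and values that do not come from Synapse:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LockRetryPolicy:
    """Hypothetical per-lock retry tuning: start, cap and warn threshold in seconds."""
    initial_interval: float = 0.2
    max_interval: float = 15 * 60
    warn_interval: float = 10 * 60

# Illustrative defaults only: tighter cap for event persistence,
# slower start for long-running purges.
LOCK_RETRY_POLICIES = {
    "event_persistence": LockRetryPolicy(max_interval=60, warn_interval=30),
    "purge_room": LockRetryPolicy(initial_interval=5),
}

def policy_for(lock_name: str) -> LockRetryPolicy:
    return LOCK_RETRY_POLICIES.get(lock_name, LockRetryPolicy())
```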

Co-authored-by: Eric Eastwood <madlittlemods@gmail.com>

denzs commented Jan 23, 2026

After the issue occurred again in our prod:

2026-01-22 13:36:53.725 error synapse.handlers.worker_lock - 280 - WARNING - sentinel - Lock timeout is getting excessive: 5120s. There may be a deadlock.
2026-01-22 13:36:53.981 error synapse.handlers.worker_lock - 280 - WARNING - sentinel - Lock timeout is getting excessive: 10240s. There may be a deadlock.
2026-01-22 13:36:54.560 error synapse.handlers.worker_lock - 280 - WARNING - sentinel - Lock timeout is getting excessive: 20480s. There may be a deadlock.
2026-01-22 13:36:54.798 error synapse.handlers.worker_lock - 280 - WARNING - sentinel - Lock timeout is getting excessive: 40960s. There may be a deadlock.
2026-01-22 13:36:56.342 error synapse.handlers.worker_lock - 280 - WARNING - sentinel - Lock timeout is getting excessive: 81920s. There may be a deadlock.

My hypothesis would be: the issue is not primarily about the size of the growing timeout, but about the timeout being ignored altogether?

At least the logged timeout is not reflected in the timestamp deltas of the log lines?!

@jason-famedly
Author

> My hypothesis would be: the issue is not primarily about the size of the growing timeout, but about the timeout being ignored altogether?

Yes, there is more than one thing going on here. This fix (switching max() to min() and adjusting the iteration assumptions) only stops the obnoxiously long numbers that are trying to reach infinity from being introduced in the first place. The underlying cause is something else: the timeouts seem to not be honored, and there is also the question of what the request that triggers the situation is doing to cause the locks to pile up and not make progress in the first place. A small illustration of the max()/min() difference is below.
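
To make the max()-to-min() point concrete, an illustrative comparison (not the exact Synapse code; the 5-second floor matches the old sequence in the description above):

```python
# Illustrative only: why the switch matters. A doubling update that only
# enforces a floor grows without bound; one with a cap plateaus.
def old_style(interval: float) -> float:
    return max(5, interval * 2)        # floor at 5s, no ceiling -> unbounded growth

def new_style(interval: float) -> float:
    return min(15 * 60, interval * 2)  # ceiling at 15 minutes

interval_old = interval_new = 0.2
for _ in range(60):
    interval_old = old_style(interval_old)
    interval_new = new_style(interval_new)

print(interval_old)  # astronomically large after enough retries
print(interval_new)  # capped at 900.0
```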

