Skip to content

Conversation

@arteam
Copy link
Contributor

@arteam arteam commented May 13, 2025

Engine.PauseLock#throttle can be called when the lock is being throttled, so we can't guarantee that all permits are available before throttling.

Resolve #126359
See #127173

`Engine.PauseLock#throttle` can be called when the lock is being throttled,
so we can't guarantee that all permits are available before throttling.

Resolve elastic#126359
See elastic#127173
@arteam arteam added >test Issues or PRs that are addressing/adding tests :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. labels May 13, 2025
@arteam arteam marked this pull request as ready for review May 13, 2025 19:16
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Indexing Meta label for Distributed Indexing team label May 13, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@ankikuma
Copy link
Contributor

ankikuma commented May 20, 2025

Hi Artem, sorry for taking so long to get to this. I am not convinced this is the right fix. We do not call throttle and unthrottle directly on the lock. We go through InternalEngine#activateThrottling and InternalEngine#deactivateThrottling instead, which update a counter to make sure that the throttle and unthrottle calls are matched.
Basically, the callers are expected to follow the semantics that for each activateThrottling call there should be a matching deactivateThrottling call and we throw asserts in the code if this is not the case.
I think the problem might be in the way the ThreadPoolMergeScheduler is calling activateThrottling and deactivateThrottling.

But I am not sure yet where the mismatch might be caused.

@ankikuma ankikuma requested a review from albertzaharovits May 20, 2025 18:54
@ankikuma
Copy link
Contributor

I believe there is a race condition between IndexThrottle#activate() and IndexThrottle#deactivate() because both can be called by different threads.

@henningandersen
Copy link
Contributor

I believe there is a race condition between IndexThrottle#activate() and IndexThrottle#deactivate() because both can be called by different threads.

This makes sense to me too, I think the test could be rewritten to use those handles and provoke the issue.

I agree the fix needs to be done differently, we might need to have a lock inside activeate/deactivateThrottling (though other solutions can be done too, but seems more complex and this should not be a heavily called mechanism).

@albertzaharovits
Copy link
Contributor

albertzaharovits commented May 27, 2025

I believe there is a race condition between IndexThrottle#activate() and IndexThrottle#deactivate() because both can be called by different threads.

But the ThreadPoolMergeScheduler calls the throttling methods under a lock:

. What I think is going is a cycle of IndexThrottle#activate -> IndexThrottle#deactivate -> IndexThrottle#activate where some IndexThrottle#acquireThrottle from the first activate haven't released yet, before the second one. If that's the case, I believe Artem's fix proposal, to remove the assert, should fix this test failure (but I don't think that the added test exercises the fault here).

@ankikuma
Copy link
Contributor

ankikuma commented Jun 2, 2025

This has been fixed in #128405. We can close this PR.

@ankikuma ankikuma closed this Jun 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. Team:Distributed Indexing Meta label for Distributed Indexing team >test Issues or PRs that are addressing/adding tests v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] IndexStatsIT testThrottleStats failing

5 participants