Conversation

@JeremyDahlgren
Contributor

@JeremyDahlgren JeremyDahlgren commented Apr 25, 2025

Replaces the use of a SingleResultDeduplicator by refactoring the cache as a subclass of CancellableSingleObjectCache. Refactored the AllocationStatsService and NodeAllocationStatsAndWeightsCalculator to accept the Runnable used to test for cancellation.

Closes #123248
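
As a rough illustration of passing a cancellation check into a calculation, here is a minimal sketch. The class and method names are hypothetical and are not the Elasticsearch classes named above; it assumes the `Runnable` throws once the task has been cancelled, which lets a long-running loop stop early.

```java
import java.util.List;
import java.util.concurrent.CancellationException;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative only: these names are hypothetical, not the Elasticsearch API.
final class CancellableTotals {
    private CancellableTotals() {}

    /** Builds a check that throws once the flag is set, standing in for a task's cancellation check. */
    static Runnable checkFor(AtomicBoolean cancelled) {
        return () -> {
            if (cancelled.get()) {
                throw new CancellationException("task cancelled");
            }
        };
    }

    /** Sums the sizes, running the check before each element so a cancelled request stops early. */
    static long total(List<Long> sizes, Runnable ensureNotCancelled) {
        long sum = 0;
        for (long size : sizes) {
            ensureNotCancelled.run();
            sum += size;
        }
        return sum;
    }
}
```

The calculation itself knows nothing about tasks or transport actions; it only runs whatever check it is handed.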

@JeremyDahlgren JeremyDahlgren added the >feature, :Distributed Coordination/Allocation (all issues relating to the decision making around placing a shard, both master logic and on the nodes), and Team:Distributed Coordination (meta label for the Distributed Coordination team) labels Apr 25, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine
Collaborator

Hi @JeremyDahlgren, I've created a changelog YAML for you.

Contributor

@DaveCTurner DaveCTurner left a comment

My feeling is that there has to be a simpler way to achieve what we want than adding all this bare-handed locking etc. I would have expected something based on ref-counting would be neater: each waiting listener should hold a ref, releasing it on cancellation, and an ongoing computation stops early if the number of waiting refs drops to zero.
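
The ref-counting idea in this comment could look roughly like the sketch below. It is not the code from this pull request: `WaiterRefs` and `RefCountedComputation` are hypothetical names, and the "computation" is a trivial sum used only to show the early exit when the last waiter releases its ref.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of ref-counted waiters; not the Elasticsearch implementation.
final class WaiterRefs {
    private final AtomicInteger waiting = new AtomicInteger();

    /** Registers a waiting listener and returns a handle that releases its ref at most once. */
    Runnable acquire() {
        waiting.incrementAndGet();
        AtomicBoolean released = new AtomicBoolean();
        return () -> {
            if (released.compareAndSet(false, true)) {
                waiting.decrementAndGet();
            }
        };
    }

    boolean hasWaiters() {
        return waiting.get() > 0;
    }
}

final class RefCountedComputation {
    private RefCountedComputation() {}

    /**
     * Sums 0..steps-1, checking before each step whether anyone is still waiting.
     * Returns null if the computation was abandoned because all refs were released.
     * The afterStep hook exists only so a caller can simulate a cancellation mid-run.
     */
    static Long sumWhileWanted(int steps, WaiterRefs refs, Runnable afterStep) {
        long total = 0;
        for (int i = 0; i < steps; i++) {
            if (!refs.hasWaiters()) {
                return null;
            }
            total += i;
            afterStep.run();
        }
        return total;
    }
}
```

Each listener would call its release handle on cancellation, so the computation only keeps running while at least one caller still wants the result.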

@JeremyDahlgren JeremyDahlgren requested a review from a team as a code owner April 27, 2025 00:58
@JeremyDahlgren
Contributor Author

> My feeling is that there has to be a simpler way to achieve what we want than adding all this bare-handed locking etc. I would have expected something based on ref-counting would be neater: each waiting listener should hold a ref, releasing it on cancellation, and an ongoing computation stops early if the number of waiting refs drops to zero.

I applied your suggested change to call onFailure() on the listener as soon as cancellation is detected in CancellableSingleObjectCache, and refactored TransportGetAllocationStatsAction to use that cache. The changes are less complex now.

@JeremyDahlgren
Contributor Author

@DaveCTurner - when you have some time could you take a pass at reviewing this new version that uses CancellableSingleObjectCache? Thanks!

Contributor

@DaveCTurner DaveCTurner left a comment

One request for a missing test, otherwise just tiny nits. Production code looks good.

private volatile long ttlMillis;
private final ThreadPool threadPool;
private final AtomicReference<CachedAllocationStats> cachedStats;
private final ExecutorService executorService;
Contributor

nit: could this be an Executor? No need for it to be AutoCloseable (it's a disaster if you try and close one of the threadpool's executors)
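
To illustrate the point about depending on the narrower interface, here is a small sketch. `StatsRefresher` is a hypothetical class, not the one under review; it only needs to submit work, so it accepts a plain `Executor` and never gets the chance to shut down a shared pool.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example: the class only submits tasks, so Executor is enough.
final class StatsRefresher {
    private final Executor executor;
    private final AtomicInteger refreshes = new AtomicInteger();

    StatsRefresher(Executor executor) {
        this.executor = executor;
    }

    /** Hands the refresh to the executor; the caller decides which thread runs it. */
    void refreshAsync() {
        executor.execute(refreshes::incrementAndGet);
    }

    int refreshCount() {
        return refreshes.get();
    }
}
```

Any `ExecutorService` can still be passed in, since it extends `Executor`, and tests can pass `Runnable::run` to run tasks on the calling thread.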

routingAllocation.metadata(),
routingAllocation.routingNodes(),
routingAllocation.clusterInfo(),
() -> {},
Contributor

Could we make this into a static constant so that it has a name, something like NEVER_CANCELLED perhaps? Otherwise it leaves the reader wondering what this lambda is for (and also saves allocating a fresh lambda each time it's called, although the compiler might be clever enough to skip that anyway)
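
A sketch of what the named constant might look like, assuming the `NEVER_CANCELLED` name proposed in this comment. The holder class and the `countNonEmpty` method are hypothetical and exist only to show a call site.

```java
import java.util.List;

// Hypothetical holder class; only the NEVER_CANCELLED name comes from the review comment.
final class CancellationChecks {
    /** A check that never throws, for callers that cannot be cancelled. */
    static final Runnable NEVER_CANCELLED = () -> {};

    private CancellationChecks() {}

    /** Counts non-empty names, running the cancellation check before each element. */
    static int countNonEmpty(List<String> names, Runnable ensureNotCancelled) {
        int count = 0;
        for (String name : names) {
            ensureNotCancelled.run();
            if (!name.isEmpty()) {
                count++;
            }
        }
        return count;
    }
}
```

A call such as `CancellationChecks.countNonEmpty(names, CancellationChecks.NEVER_CANCELLED)` then states its intent at the call site, where a bare `() -> {}` would not.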

/**
 * Sets the currently cached item reference to {@code null}, which will result in a {@code refresh()} on the next {@code get()} call.
 */
protected void clearCurrentCachedItem() {
Contributor

This isn't covered by the CancellableSingleObjectCache test suite yet. Also should this be final? We don't want subclasses overriding it.
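
The behaviour described in the Javadoc, and the kind of test requested here, can be sketched with a toy cache. This is not `CancellableSingleObjectCache`: `TinySingleItemCache` and `CountingCache` are hypothetical, single-threaded simplifications that only show clearing the cached item forcing a refresh on the next `get()`, with the clear method declared `final` so subclasses cannot override it.

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy stand-in for a single-item cache; deliberately ignores concurrency and cancellation.
abstract class TinySingleItemCache<T> {
    private final AtomicReference<T> cached = new AtomicReference<>();

    /** Computes a fresh value; called only when nothing is cached. */
    protected abstract T refresh();

    final T get() {
        T item = cached.get();
        if (item == null) {
            item = refresh();
            cached.set(item);
        }
        return item;
    }

    /** Final so that subclasses can call it but not change what clearing means. */
    protected final void clearCurrentCachedItem() {
        cached.set(null);
    }
}

// Hypothetical subclass that counts refreshes so a test can observe them.
final class CountingCache extends TinySingleItemCache<Integer> {
    private int refreshCount;

    @Override
    protected Integer refresh() {
        return ++refreshCount;
    }

    void invalidate() {
        clearCurrentCachedItem();
    }
}
```

A test along these lines would call `get()` twice, check that only one refresh happened, clear the item, and check that the next `get()` refreshes again.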

}

private void verifyAllocationStatsServiceNumCallsEqualTo(int numCalls) {
verify(allocationStatsService, times(numCalls)).stats(argThat(r -> true));
Contributor

Is this the same thing?

Suggested change
- verify(allocationStatsService, times(numCalls)).stats(argThat(r -> true));
+ verify(allocationStatsService, times(numCalls)).stats(any());

Contributor Author

Yes, I applied this in the other uses as well.

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM

@JeremyDahlgren JeremyDahlgren merged commit 4e7b99c into elastic:main May 20, 2025
17 checks passed

Labels

:Distributed Coordination/Allocation (all issues relating to the decision making around placing a shard, both master logic and on the nodes), >feature, Team:Distributed Coordination (meta label for the Distributed Coordination team), v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TransportGetAllocationStatsAction should be cancellable

3 participants