-
Notifications
You must be signed in to change notification settings - Fork 25.6k
[Draft] Add IndexShard.withEngine method #123688
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Alternative of elastic#122749, this draft adds an `IndexShard.withEngine` method based on a reentrant lock that can be used to execute an operation while preventing any engine change during execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a good direction, hope it works out.
| getEngine().prepareForEngineReset(); | ||
| var newEngine = createEngine(newEngineConfig(replicationTracker)); | ||
| IOUtils.close(currentEngineReference.getAndSet(newEngine)); | ||
| IOUtils.close(getAndSetCurrentEngine(newEngine)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we only do this if not operable, since it could be swapped by a reading thread already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could already have been swapped indeed. But since resetEngine is used to transition from hollow to unhollow and vice versa, I had to add a operability parameter to indicate in which state we expect the new engine to be in.
See 1ddac5e
|
Thanks for the feedback @henningandersen. I'm letting CI running on this draft while I'm continuing to work on the corresponding change in serverless (see linked draft PR). Feel free to comment on any of the two PR if you have more ideas/reviews. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @tlrx ! I like how this looks. The only concern I have is the comment around intertwining the mutability requirement with the "withEngine" concept (of keeping the engine intact while an operation is ongoing).
| SubscribableListener.<IndexShard>newForked(l -> { | ||
| var indexShard = indicesService.indexServiceSafe(request.shardId().getIndex()).getShard(request.shardId().id()); | ||
| assert indexShard.routingEntry().isPromotableToPrimary() : "not an indexing shard" + indexShard.routingEntry(); | ||
| indexShard.ensureMutable(l.map(unused -> indexShard)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do real-time (m)GETs need to ensureMutable? I do not think we need a full unhollow engine for them. They should be able to work on hollow shards -- always returning null so that the search shard can service them themselves.
| return getEngine().getFromTranslog(get, mappingLookup, mapperService.documentParser(), searcherWrapper); | ||
| } | ||
| return getEngine().get(get, mappingLookup, mapperService.documentParser(), searcherWrapper); | ||
| return withMutableEngine(engine -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Real-time gets should work either with IndexEngine or HollowIndexEngine. We just need to ensure that the engine stays intact while the operations is ongoing (so there's no ACE thrown). So I think we'd need withMutableEngineOrNull here.
| return withEngine(operation, true, false); | ||
| } | ||
|
|
||
| public <R> R withMutableEngineOrNull(Function<Engine, R> operation) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should better be called withEngine. And rename current withEngine to something like withEngineInternal or something else.
| * Executes an operation while preventing the shard's engine instance to be changed or closed during the execution. The parameter | ||
| * {@code requiredMutability} can be used to force the engine to be reset to a given mutable or immutable state before executing the | ||
| * operation. The parameter {@code allowNoEngine} is used to allow the operation to be executed with a null engine instance, in which | ||
| * case the {@code requiredMutability} is ignored. When {@code allowNoEngine} is set to {@code `false`} the method will throw an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading the code, if allowNoEngine = true, it still tries to reset the engine according to the requiredMutability. I think you should skip resetting the engine if allowNoEngine = true.
In general, I was a bit baffled with the parameters, but I now see what you're doing with them. Maybe a better naming could be:
private <R> R withEngine(Function<Engine, R> operation, boolean requiredMutability, boolean mutable)
and if requiredMutability == false, you call getCurrentEngine with true.
| onNewEngine(newEngine); | ||
| if (currentEngine.isMutable() != mutability) { | ||
| currentEngine.beforeReset(); | ||
| var newEngine = createEngine(newEngineConfig(replicationTracker)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general this PR intertwines the "mutability" concept with the "do not reset the engine while this operaiton is ongoing" concept. I do appreciate though the explicitness of withMutableEngine and withImmutableEngine. But we're now limiting the resetEngine to a boolean state of mutability. This was not meant before; one could consider resetting engines for various reasons, not just mutability. And serverless was using the resetEngine to ensure mutability -- maybe another plugin or for another reason we'd like to reset the engine in the future.
I thus wonder if we should/can somehow demultiplex the "mutability" requirement outside of the resetEngine. So basically this PR should just introduce a withEngine(Function<Engine, R> operation) { functionality for operations in general that want to access the engine (like the real-time gets). And we leave the current main's logic around ensureMutable intact -- it will come ultimately to this point of resetEngine, that will use the new engineLock.writeLock() to ensure the engine is intact while resetting it.
That way, the stateless PR can also leave code intact more or less. E.g, unhollowing will resetEngine as it is currently doing in main, and will simply access the new engine afterwards (withEngine or without it, either is fine).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the writeEngine method also changes the state of the engine and that's the idea I wanted to pursue in this draft. I get your point, and now the cause of the deadlocks have been identified I can probably revert this to make the method not reset the engine and just holds an engine read lock while executing the operation. Honestly I need to try it and have the tests on CI run for a few days.
But having several methods (withEngine to hold a read lock, ensureMutable to trigger an unhollowing, resetEngine to reset, HollowShardsService to know if we should reset etc) make it really hard to reason about the transitions between engine types and their correctness. So having a single entry point for checking if the engine should be reset and triggers it looks seducing to me. I suspect that with some effort we could push the existing logic of HollowShardsService into the IndexShard and the engines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work @tlrx , I see that conceptually it's not far from the initial idea of using ref counting but this has some nicer properties:
- The lifecycle of the engine is well contained in the provided function and if the reference escapes it, it's a caller bug (it would be nice to make this harder to happen).
- It's completely clear which thread is triggering the engine reset, whereas with a ref counting API you kind of let the caller to pass the engine to other threads as long as it finally dec-refs it and makes it harder to decide about what to do when a ref count reaches 0, dispatch to generic? do the reset in the caller thread.
- I like the fact that the caller it's the one specifying if it needs a mutable engine or not, it sets a precedent and makes implementors to think about why you need to call that, whereas with the current approach it's hidden in a few transport actions. But I see the point of diverging from the current approach.
I'm leaning slightly towards taking this path, but I don't have a strong opinion. I like its simplicity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the reference escapes it, it's a caller bug (it would be nice to make this harder to happen).
I think there are a few inner refs we may allow like a searcher (which IIRC only holds on to store so is also fine). I think we may have to accept javadoc being the primary means to avoid escaping the engine though we could check that the result ofwithEngineis not anEngineto avoid the simplewithEngine(e -> return e)style bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So having a single entry point for checking if the engine should be reset and triggers it looks seducing to me. I suspect that with some effort we could push the existing logic of HollowShardsService into the IndexShard and the engines.
Just to say on this, I think so far we're trying to push as much stateless-only logic into the stateless plugin rather than in core. Although I understand the reasoning if we can make things simpler to read / reason about, it may be worth putting them in core.
But so far, I don't like intertwining the mutability with the withEngine. I'd prefer we leave this PR for implementing the withEngine to ensure that the engine does not change underneath an operation when a resetEngine is called. And in the future, if we deem so, we can move things closer together in core ES. Maybe Henning's idea on a pendingReset() method might be a middle ground.
Not too strong on this though. If team feels more confident with the current approach, I'm fine, because I think it's important we feel confident about the approach as well. If having the pieces feeling like a puzzle through the code, I understand the reason for better readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, that's an idea I wanted to pursue and see how it goes. There is no need to have withEngine resetting the engine so far I think, so we can move forward without this bit and see how it goes.
Just to say on this, I think so far we're trying to push as much stateless-only logic into the stateless plugin rather than in core.
Sorry, I did not mean to push serverless logic into core but rather have the specific engine types being able to tell if they shouldReset or not, and the specific engine factories to recreate the engine instance accordingly. The mutability variables in this draft were only added to help me move forward with the tests.
Anyway, I'm happy to see that we are agreeing on a way to move forward so I'll go with a simple withEngine method without reset. We'll have the pieces in place if we decided to change things in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look like a good direction @tlrx. I wonder if you are thinking about using this mechanism to trigger a reset during indexing when the index is hollow, if that's the case maybe we should explore if we can make the read lock a no-op in environments where we don't expect engine resets midway.
| @Override | ||
| default void afterRefresh(boolean didRefresh) throws IOException {} | ||
|
|
||
| void afterRefresh(boolean didRefresh, ElasticsearchDirectoryReader reader) throws IOException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add some javadocs here mentioning that the lifecycle of the reader is managed by the listener invocation?
| onNewEngine(newEngine); | ||
| if (currentEngine.isMutable() != mutability) { | ||
| currentEngine.beforeReset(); | ||
| var newEngine = createEngine(newEngineConfig(replicationTracker)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work @tlrx , I see that conceptually it's not far from the initial idea of using ref counting but this has some nicer properties:
- The lifecycle of the engine is well contained in the provided function and if the reference escapes it, it's a caller bug (it would be nice to make this harder to happen).
- It's completely clear which thread is triggering the engine reset, whereas with a ref counting API you kind of let the caller to pass the engine to other threads as long as it finally dec-refs it and makes it harder to decide about what to do when a ref count reaches 0, dispatch to generic? do the reset in the caller thread.
- I like the fact that the caller it's the one specifying if it needs a mutable engine or not, it sets a precedent and makes implementors to think about why you need to call that, whereas with the current approach it's hidden in a few transport actions. But I see the point of diverging from the current approach.
I'm leaning slightly towards taking this path, but I don't have a strong opinion. I like its simplicity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments while reading through this. I like how it looks but maybe we can simplify to not reset during withEngine? If not, having a pending-reset rather than mutability check could perhaps make this more generic.
| store.incRef(); | ||
| try { | ||
| synchronized (engineMutex) { | ||
| engineLock.writeLock().lock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine for now, but I wonder if we could relax this to be the readlock only? Perhaps even use with...Engine?
| var release = true; | ||
| try { | ||
| var engine = getCurrentEngine(allowNoEngine); | ||
| if (engine != null && (engine.isMutable() == requiredMutability) == false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there is a way where we can instead check something like engine.resetPending(), i.e., care less about the specific case and more about reset in progress?
| var newEngine = createEngine(newEngineConfig(replicationTracker)); | ||
| IOUtils.close(currentEngineReference.getAndSet(newEngine)); | ||
| onNewEngine(newEngine); | ||
| if (currentEngine.isMutable() != mutability) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to previous comment (and Iraklis') can we make this more like currentEngine.pendingReset()?
| public interface ReaderAwareRefreshListener extends ReferenceManager.RefreshListener { | ||
|
|
||
| @Override | ||
| default void afterRefresh(boolean didRefresh) throws IOException {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the default implementation here throw something or just assert false?
| onNewEngine(newEngine); | ||
| if (currentEngine.isMutable() != mutability) { | ||
| currentEngine.beforeReset(); | ||
| var newEngine = createEngine(newEngineConfig(replicationTracker)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the reference escapes it, it's a caller bug (it would be nice to make this harder to happen).
I think there are a few inner refs we may allow like a searcher (which IIRC only holds on to store so is also fine). I think we may have to accept javadoc being the primary means to avoid escaping the engine though we could check that the result ofwithEngineis not anEngineto avoid the simplewithEngine(e -> return e)style bug.
|
Implemented in #124635, closing. |
Alternative of #122749, this draft adds an
IndexShard.withEnginemethod based on a reentrant lock that can be used to execute an operation while preventing any engine change during execution.