
Conversation

@masseyke (Member) commented Mar 19, 2025

We noticed that when running reindex on a large index on a large cluster, a single node (the one running TransportReindexAction) would have 100% CPU usage, while the other nodes would all be at ~15%. It turns out that we execute all slices for the reindex on the same node. So if there is any pipeline, all processors for all documents are executed on the single node, before the shard-specific indexing requests are sent out to the nodes where the shards live.
This change makes BulkByScrollParallelizationHelper round-robin slice requests across all of the ingest nodes so that the pipeline work is spread more evenly. In practice, we have seen significant performance improvements when running reindex with pipelines over large indices on large (10-node) clusters.
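For context, here is a condensed sketch of the selection logic (the class and method names are illustrative; the counter, field name, and node types mirror the diff below). Each slice request picks the next ingest node from a counter that starts at a random offset, so coordinating nodes do not all target the same ingest node first.

import java.util.concurrent.atomic.AtomicInteger;
import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.common.Randomness;

// Condensed sketch of the round-robin selection (illustrative names, not the actual class).
final class SliceIngestNodeRoundRobin {
    // Random starting offset so coordinating nodes don't all send their first slice to the same ingest node.
    private static final AtomicInteger ingestNodeOffsetGenerator = new AtomicInteger(Randomness.get().nextInt(2048));

    static DiscoveryNode nextIngestNode(DiscoveryNode[] ingestNodes) {
        return ingestNodes[Math.floorMod(ingestNodeOffsetGenerator.incrementAndGet(), ingestNodes.length)];
    }
}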
Relates #125171

@masseyke masseyke changed the title Sending BulkByScrollParallelizationHelper to different nodes to improve performance Sending slice requests to different nodes in BulkByScrollParallelizationHelper to improve performance Mar 19, 2025
@masseyke masseyke added >enhancement :Distributed Indexing/Reindex Issues relating to reindex that are not caused by issues further down auto-backport Automatically create backport pull requests when merged v8.18.1 v8.19.0 v9.0.1 labels Mar 21, 2025
@elasticsearchmachine (Collaborator)

Hi @masseyke, I've created a changelog YAML for you.

@masseyke masseyke marked this pull request as ready for review March 21, 2025 20:08
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Indexing Meta label for Distributed Indexing team label Mar 21, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@masseyke masseyke requested a review from a team March 24, 2025 18:17
@henningandersen (Contributor) left a comment

I am slightly unsure about the need for this, hope you can provide more data on it.

I would like to also have an IT demonstrating that the slices are indeed handled on different nodes.

And we may need to do a more exhaustive search for local node expectations.

DiscoveryNode ingestNode = ingestNodes[Math.floorMod(ingestNodeOffsetGenerator.incrementAndGet(), ingestNodes.length)];
logger.debug("Sending request for slice to {}", ingestNode.getName());
transportService.sendRequest(
    ingestNode,
@henningandersen (Contributor)

I think there are expectations of this running on the local node, for instance here

@masseyke (Member Author)

I didn't realize that we had a rethrottle API, and that it worked on the assumption that all subtasks were local. That would be a much bigger change to handle (I assume we'd have to put information about where each child task is running into the LeaderBulkByScrollTaskState?). So I think I'll close this for now, and maybe revisit it if we see evidence of this causing performance problems in the wild.
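Purely as an illustration of the extra bookkeeping being alluded to (nothing like this exists in the codebase; the class and method names are hypothetical): the leader task would need to remember which node each slice subtask was sent to, so that actions such as rethrottle could be routed to the right nodes.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical bookkeeping, not part of LeaderBulkByScrollTaskState today:
// remember which node each slice subtask was sent to, keyed by slice id.
final class SliceLocations {
    private final Map<Integer, String> nodeIdBySlice = new ConcurrentHashMap<>();

    void recordSlice(int sliceId, String nodeId) {
        nodeIdBySlice.put(sliceId, nodeId);
    }

    String nodeForSlice(int sliceId) {
        return nodeIdBySlice.get(sliceId);
    }
}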

    client.execute(action, requestForSlice, sliceListener);
} else {
    /*
     * Indexing will potentially run a pipeline for each document. If we run all slices on the same node (locally), that
@henningandersen (Contributor)

Pipelines that hog the CPU that much during reindex sound problematic; I wonder if that is worth looking into instead? Perhaps you have more detail to share (privately is good too).

@masseyke (Member Author)

It doesn't even have to be a really CPU-heavy pipeline. The mere existence of a trivial set processor, for example, means that the data has to be deserialized and reserialized, and that serialization alone slows things down a good bit (without a pipeline, the data is never deserialized before being sent to the correct node for indexing). There might be something smarter we can do on the pipeline side, but this seemed like an easy workaround to spread out that work (although, taking the comment below about rethrottling into account, it is no longer an easy workaround).
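To make the overhead concrete, here is a minimal sketch (using Jackson rather than Elasticsearch's XContent machinery, so none of these class names are what reindex actually uses): with any pipeline, even a trivial set processor, each document's source has to be parsed, mutated, and re-serialized before the shard-level index request can be built, whereas with no pipeline the source bytes are forwarded untouched.

import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;

public class SetProcessorCostSketch {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void main(String[] args) throws Exception {
        String source = "{\"user\":\"kimchy\",\"message\":\"reindex me\"}";

        // With no pipeline, these bytes would be forwarded to the shard as-is.
        // With any pipeline, even a trivial set processor, the node running the slice must:
        Map<String, Object> doc = MAPPER.readValue(source, Map.class); // 1. deserialize the source
        doc.put("migrated", true);                                     // 2. apply the processor
        String rewritten = MAPPER.writeValueAsString(doc);             // 3. re-serialize before indexing

        System.out.println(rewritten);
    }
}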

* The following is incremented in order to keep track of the current round-robin position for ingest nodes that we send sliced requests
* to. We randomize where it starts so that all nodes don't begin by sending data to the same node.
*/
private static final AtomicInteger ingestNodeOffsetGenerator = new AtomicInteger(Randomness.get().nextInt(2048));
@henningandersen (Contributor)

It would be good to get rid of the static here; perhaps this could be kept on the action instead and passed in?
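A hedged sketch of one way that could look (class, field, and constructor names are illustrative, not an actual change): the counter becomes instance state that the action owns and hands to the helper, instead of a static on BulkByScrollParallelizationHelper.

import java.util.concurrent.atomic.AtomicInteger;
import org.elasticsearch.cluster.node.DiscoveryNode;

// Illustrative only: a small picker owned by the transport action and passed into the helper.
final class SliceIngestNodePicker {
    private final DiscoveryNode[] ingestNodes;
    private final AtomicInteger offset;

    SliceIngestNodePicker(DiscoveryNode[] ingestNodes, int randomStart) {
        this.ingestNodes = ingestNodes;
        this.offset = new AtomicInteger(randomStart);
    }

    DiscoveryNode next() {
        return ingestNodes[Math.floorMod(offset.incrementAndGet(), ingestNodes.length)];
    }
}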

@masseyke (Member Author)

> I am slightly unsure about the need for this, hope you can provide more data on it.

This came out of some testing that @parkertimmins and I did. We noticed that when we ran reindex on a 10-node cluster with a very simple pipeline (setting a single field) against an index big enough to be split into many slices, one node would be pegged at 100% CPU running the pipeline, while the rest sat at a much lower utilization (maybe ~10-15%) doing the indexing.
As an experiment to see what would happen if we spread the ingest pipeline work out, we ran with the code in this PR and found that the total reindex time went down from ~11 minutes to ~3.5 minutes. It's definitely possible that there's a better way to get the same gains (for example, improving serialization/deserialization in the ingest node).

@masseyke (Member Author)

Closing because this solution is not compatible with the rethrottle action.

@masseyke masseyke closed this Apr 21, 2025