Skip to content

Elasticsearch: Async handling of indexing/deletion requests#8465

Open
tobias-hotz wants to merge 39 commits intogeonetwork:mainfrom
tobias-hotz:async_indexing
Open

Elasticsearch: Async handling of indexing/deletion requests#8465
tobias-hotz wants to merge 39 commits intogeonetwork:mainfrom
tobias-hotz:async_indexing

Conversation

@tobias-hotz
Copy link
Copy Markdown
Contributor

@tobias-hotz tobias-hotz commented Oct 24, 2024

Currently, indexing is batched as a global queue of 200 elements (except when forceRefreshReaders is true). When the threshold is reached, the entries are submitted, and the thread submitting the 200th element waits until the elasticsearch returns the result of the request.
With deletion, we currently always send one request per deletion request and always use the deleteByQuery method.

The current design has a number of flaws:

  • When multiple threads index metadata and both do not use forceRefreshReaders, the queue could grow past 200 (e.g., two threads add at about the same time, and at the time the queue size is checked, the queue already has 201 elements). This causes the queue to grow indefinitely, meaning no further entries are submitted before an explicit call to sendDocumentsToIndex. This can cause Out Of Memory errors (we observed this when using multiple reindexing threads)
  • An indexing thread is spending a significant amount of time waiting for the elasticsearch, while it could already prepare the next batch for the ES to be consumed
  • As we wait after every delete call, deleting many entries takes a lot more time as needed
  • Due to deletion always using deleteByQuery (even when we could delete by uuid), no batching is possible
  • When multiple threads index metadata and both do not use forceRefreshReaders, the queue contains mixed entries of Thread 1 and Thread 2. At most call sites, after all entries have been indexed, sendDocumentsToIndex is called. It is currently possible that Thread 1 is currently sending a batch that contains entries from Thread 2 as well and Thread 2 already finished submitting the rest of the batch, causing Thread 2 to think that all its entries have been sent to Elasticsearch, but they are still being submitted by Thread 1. This is a very rare problem, though

This PR solves all of these problems. The main takeaway is that it significantly improves the performance of deleting and indexing many entries.

This is accomplished by introducing a IIndexSubmitter and IDeletionSubmitter. These new classes handle how new entries are sent to the index. The direct implementations (DirectIndexSubmitter and DirectDeletionSubmitter) are similar to how the old forceIndexChanges parameter worked in that they directly send the data to the index.
With the use of the BatchingIndexSubmitter and BatchingDeletionSubmittor, chunks are sent periodically to the elasticsearch (just as before), but a local queue is used, and we do not wait for the elasticearch. The index responses are handled asynchronously on a different thread instead. We still guarantee that the indexing will be complete once the whole block is done, as the close method sends the rest of the local queue and waits for all async responses to be complete.

We made some performance measurements on a smaller scale. Here is the average result of a bunch of runs with different CSW harvesters:

Operation Average time before change Average time after change Improvement in %
Harvesting ~3600 entries 11 minutes and 17 seconds 06 minutes and 30 seconds ~74%
Harvesting ~1250 entries 06 minutes and 33 seconds 04 minutes and 28 seconds ~47%
Harvesting ~700 entries 03 minutes and 14 seconds 01 minutes and 30 seconds ~131%
Harvesting ~100 entries 00 minutes and 29 seconds 00 minutes and 29 seconds ~35%
Reindexing ~5700 entries 04 minutes and 39 seconds 02 minutes and 40 seconds ~74%
Deleting ~3600 entries 01 minutes and 35 seconds 00 minutes and 20 seconds ~475%
Deleting ~1250 entries 00 minutes and 38 seconds 00 minutes and 09 seconds ~433%
Deleting ~700 entries 00 minutes and 20 seconds 00 minutes and 04 seconds ~500%
Deleting ~100 entries 00 minutes and 05 seconds 00 minutes and 01 seconds ~500%

As you can see, there are very significant performance gains. These numbers were recorded on a local machine, if you use a remote index on a different machine, the effect may be even higher due to latency/throughput limitations.

Checklist

  • I have read the contribution guidelines
  • Pull request provided for main branch, backports managed with label
  • Good housekeeping of code, cleaning up comments, tests, and documentation
  • Clean commit history broken into understandable chucks, avoiding big commits with hundreds of files, cautious of reformatting and whitespace changes
  • Clean commit messages, longer verbose messages are encouraged
  • API Changes are identified in commit messages
  • Testing provided for features or enhancements using automatic tests
  • User documentation provided for new features or enhancements in manual
  • Build documentation provided for development instructions in README.md files
  • Library management using pom.xml dependency management. Update build documentation with intended library use and library tutorials or documentation

Funded by LGL BW

This improves indexing time, especially when the ES connection has a high latency or the ES load is high.
@tobias-hotz tobias-hotz marked this pull request as draft October 24, 2024 13:51
This causes issues with some tests. This issue was already present before the changes
@fxprunayre
Copy link
Copy Markdown
Member

Interesting work @tobias-hotz. We are also investigating how to improve indexing performances for GeoNetwork 5. See draft work geonetwork/geonetwork#19 and
https://github.com/geonetwork/geonetwork/blob/main/src/modules/indexing/src/main/java/org/geonetwork/indexing/IndexingService.java#L108
Maybe some ideas can be shared, or tracking some GN4 hot spots to avoid to make similar mistakes in GN5 can be nice.

@tobias-hotz
Copy link
Copy Markdown
Contributor Author

Hi @fxprunayre
thanks for taking a look.
The idea for this is to improve performance of indexing while breaking none of the existing behaviour. Some code paths assume that an element is available in the index right after the call to index, so this has to be taken into account.
Also, the index preparation is still done in the main thread, as it is not unlikely that some code relies on this, and given how big the GN4 Codebase is, I'd rather go with this.

My first approach was to just return the Future of the index response to the caller, but that was getting pretty messy and it was easy to miss a call site. That's why I chose this approach.
For GN5, it could also be benificial to move the index preparation stuff to another thread.

This change allows allows the multithreaded reindexing to work again (which is somewhat broken at the moment, mainly because of concurrency issues with the single document buffer for the bulk requests). This reduces the time spend on reindexing by a lot. So support for multithreaded indexing is something GN5 should also provide out of the box.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Dec 8, 2024

CLA assistant check
All committers have signed the CLA.

@tobias-hotz tobias-hotz changed the title Elasticsearch Indexing: Handle index responses asynchronously Elasticsearch: Async handling of indexing/deletion requests Dec 10, 2024
@tobias-hotz tobias-hotz marked this pull request as ready for review December 10, 2024 14:36
…indexing

# Conflicts:
#	services/src/main/java/org/fao/geonet/api/processing/DatabaseProcessUtils.java
tobias-hotz and others added 8 commits June 3, 2025 11:17
…ectDeletionSubmitter.java

Co-authored-by: Jose García <josegar74@gmail.com>
…ectIndexSubmitter.java

Co-authored-by: Jose García <josegar74@gmail.com>
…letionSubmitter.java

Co-authored-by: Jose García <josegar74@gmail.com>
…dexSubmitter.java

Co-authored-by: Jose García <josegar74@gmail.com>
…ch/BatchingDeletionSubmitter.java

Co-authored-by: Jose García <josegar74@gmail.com>
…ch/BatchingIndexSubmitter.java

Co-authored-by: Jose García <josegar74@gmail.com>
…ch/BatchingSubmitterBase.java

Co-authored-by: Jose García <josegar74@gmail.com>
…ch/StateBase.java

Co-authored-by: Jose García <josegar74@gmail.com>
@fxprunayre
Copy link
Copy Markdown
Member

FYI, deployed on test env on some projects here. So far, all is working fine.

# Conflicts:
#	core/src/main/java/org/fao/geonet/kernel/datamanager/IMetadataIndexer.java
#	core/src/main/java/org/fao/geonet/kernel/datamanager/base/BaseMetadataManager.java
#	core/src/main/java/org/fao/geonet/kernel/search/index/BatchOpsMetadataReindexer.java
#	core/src/test/java/org/fao/geonet/AbstractCoreIntegrationTest.java
@tobias-hotz
Copy link
Copy Markdown
Contributor Author

I've rebased this PR onto the latest main branch once more.
Can someone please do a full review so this has a real chance to get in the next release? We've already got some positive test results from our company and @fxprunayre and promising performance numbers.

@fxprunayre
Copy link
Copy Markdown
Member

Thanks for the additional rebase @tobias-hotz. Indeed the changes looks to work well, no issue reported on indexing on my side. Discussing with @josegar74 yesterday, we propose to make 4.4.9 this month (or early October), and merge that PR just after so that everyone can test it for the 4.4.10 release.

@jahow jahow modified the milestones: 4.4.9, 4.4.10 Oct 7, 2025
# Conflicts:
#	core/src/test/java/org/fao/geonet/AbstractCoreIntegrationTest.java
#	harvesters/src/main/java/org/fao/geonet/kernel/harvest/harvester/geonet/BaseGeoNetworkAligner.java
# Conflicts:
#	harvesters/src/main/java/org/fao/geonet/kernel/harvest/harvester/AbstractHarvester.java
#	plugins/datahub-integration/geonetwork-ui
@tobias-hotz
Copy link
Copy Markdown
Contributor Author

Any news now that 4.4.10 is out? I've rebased this twice now since the release of that version

@tobias-hotz
Copy link
Copy Markdown
Contributor Author

Hi @josegar74 and @fxprunayre,
can you give me a status update on this PR? I really don't want to be pushy, but it would really suck if we miss the next release cycle for this PR again.
If there is anything we can do to help get this merged, feel free to ping me.

@fxprunayre
Copy link
Copy Markdown
Member

Hi @tobias-hotz, as reported before, no issue reported on this so far on my side so it sounds good to go but it would be good to have others opinion ...

@tobias-hotz
Copy link
Copy Markdown
Contributor Author

@josegar74 Do you (or another reviewer) have any plans to tackle this? If yes, then I will continue rebasing this, but if the consensus is that this change is too risky or big to review or something else, then I don't need to keep updating the patch

@fxprunayre fxprunayre modified the milestones: 4.4.10, 4.4.11 Mar 27, 2026
@fxprunayre
Copy link
Copy Markdown
Member

Sorry @tobias-hotz for being so slow reviewing PRs (definitely an area to improve).

Discussing with @josegar74 this morning about this work, we propose to make 4.4.10 first (date is still uncertain but should happen in April). It will contain Elasticsearch version update (cf. #9176). Wait for the release to fix conflicts one more time so that we merge it early for 4.4.11.

Is that fine for everyone?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants