OGC CSW 2.0.2 Harvesting / Indexing Performance / Bulk-Request

**Is your feature request related to a problem? Please describe.**

###### Background:
After upgrading GeoNetwork to version 4.2.5, we ran an initial harvest of all metadata.
The source node holds a large number of metadata records (> 280,000).
The initial harvest took so long that several consecutive runs failed.
As a result, multiple harvesting runs over several days were needed until all metadata was available in the database and in the index.

###### The question arises: Why does the initial harvesting take so much time?
We used the profiling tool VisualVM to analyze which methods take the most time during the initial harvesting process.
The exact indexing process during harvesting is described under "Additional context" further down in this ticket.

The following harvesting times were measured:
- The `addMetadata` method requires ~89% of the total harvesting time
   - The `indexMetadata` method requires ~37% of the total harvesting time

> [!IMPORTANT]
> Indexing of new metadata takes about 37% of the total harvesting time. Therefore, performance enhancement in indexing has a high potential to decrease harvesting time.
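
As a rough upper bound on what faster indexing alone could achieve, the measured share can be plugged into Amdahl's law. This is only an estimate: it assumes the ~37% figure is representative and that the remaining harvesting time stays unchanged. The symbol $s$ (the speedup factor of the indexing step) is introduced here for illustration and does not come from the profiling data.

```math
S(s) = \frac{1}{(1 - 0.37) + \frac{0.37}{s}}, \qquad \lim_{s \to \infty} S(s) = \frac{1}{0.63} \approx 1.59
```

In other words, even if indexing became free, total harvesting time would drop by at most ~37%. The actual gain from bulk requests would be smaller.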



**Describe the solution you'd like**
> [!TIP]
> Suggestion: Indexing performance during harvesting could possibly be improved by indexing several metadata records at once using bulk requests.

GeoNetwork already uses the bulk API, but during CSW harvesting each bulk request contains only one metadata record. The `addMetadata` method is called individually for each metadata record, and in the code the parameter `forceRefreshReaders` is set to `true`, which causes this behavior.

Performance could be improved by sending multiple documents per bulk request instead of indexing each document individually.
![image](https://github.com/geonetwork/core-geonetwork/assets/56172653/24ab37d6-eec4-4821-ba51-5a3b92b2b6f8)
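
The idea can be sketched as a small buffer that collects records and hands them over in batches. This is a minimal illustration, not GeoNetwork code: the class `BulkIndexBuffer`, its `batchSize`, and the `bulkSender` callback are hypothetical names, and the callback merely stands in for whatever would actually send a bulk request to the index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Hypothetical helper (not part of GeoNetwork): collects metadata UUIDs
 * and passes them to a bulk sender in batches instead of one by one.
 */
class BulkIndexBuffer {
    private final int batchSize;
    private final Consumer<List<String>> bulkSender; // stand-in for a real bulk request
    private final List<String> pending = new ArrayList<>();

    BulkIndexBuffer(int batchSize, Consumer<List<String>> bulkSender) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.bulkSender = bulkSender;
    }

    /** Flags a record for indexing; a request is only sent once the batch is full. */
    void add(String uuid) {
        pending.add(uuid);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is still pending; must be called once at the end of a harvest run. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        bulkSender.accept(new ArrayList<>(pending)); // copy so the sender owns its batch
        pending.clear();
    }

    public static void main(String[] args) {
        List<Integer> requestSizes = new ArrayList<>();
        BulkIndexBuffer buffer =
            new BulkIndexBuffer(100, batch -> requestSizes.add(batch.size()));

        for (int i = 0; i < 250; i++) {
            buffer.add("uuid-" + i);
        }
        buffer.flush(); // send the final, partially filled batch

        System.out.println(requestSizes); // prints [100, 100, 50]
    }
}
```

With a batch size of 100, the 250 records in this example result in 3 bulk requests instead of 250. A real implementation would also have to decide what happens to an unsent batch when a harvesting run fails, which this sketch does not address.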


**Additional context**
Indexing process during harvesting, as analyzed with VisualVM:

- `kernel.harvest.harvester.csw.Aligner.align()`
    -  `kernel.harvest.harvester.csw.Aligner.insertOrUpdate()` [Source-Code](https://github.com/geonetwork/core-geonetwork/blob/4.2.5/harvesters/src/main/java/org/fao/geonet/kernel/harvest/harvester/csw/Aligner.java)
        - foreach loop over each record → Call `Aligner.addMetadata()`
        - Result: `addMetadata` is called individually for each added metadata record
      
        - `kernel.harvest.harvester.csw.Aligner.addMetadata()` [Source-Code](https://github.com/geonetwork/core-geonetwork/blob/4.2.5/harvesters/src/main/java/org/fao/geonet/kernel/harvest/harvester/csw/Aligner.java) (**~89% of total harvesting time**)
            - Call `BaseMetadataIndexer.indexMetadata()` with parameter `forceRefreshReaders = true`
            - Only one metadata record is passed on for indexing
            - *Possible solution: Flag the metadata record for indexing, but don't index it immediately. Instead, index multiple metadata records at once with a bulk request.*

            - `kernel.datamanager.base.BaseMetadataIndexer.indexMetadata()` [Source-Code](https://github.com/geonetwork/core-geonetwork/blob/4.2.5/core/src/main/java/org/fao/geonet/kernel/datamanager/base/BaseMetadataIndexer.java) (**~37% of total harvesting time**)
                - Call `EsSearchManager.index()` with parameter `forceRefreshReaders = true`

                - `kernel.search.EsSearchManager.index()` [Source-Code](https://github.com/geonetwork/core-geonetwork/blob/4.2.5/core/src/main/java/org/fao/geonet/kernel/search/EsSearchManager.java)
                    - `forceRefreshReaders` is `true`
                    - Consequence: A bulk request is sent with only one document / metadata record

                    - `index.es.EsRestClient.bulkRequest()`


