GeoNetwork 4.2.14: GN 4.0 Harvester harvests fewer records than expected; suspect pagination off-by-one #9183

@erick-ouellette

Describe the bug

Using the GN 4.0 harvester on GeoNetwork 4.2.14, I set up instance-A to harvest from instance-B. I have 54 records on instance-B, but only 53 records are ever retrieved by instance-A. I have ensured that all 54 records are spotless with respect to validation, groups, category, etc., and are publicly available to 'All'.

I'm increasingly convinced it has nothing to do with the records themselves. When I identify the missing record and re-harvest, it often appears, but another record is dropped instead. The behaviour does not appear deterministic; it looks random. I have tried both UUID collision options, "Skip" and "Overwrite". I'm chasing ghosts.

Depending on the run, the harvester report looks like one of these:

53 record(s) harvested in 135 seconds
 3 minutes ago

privilegesAppendedOnExistingRecord: 53
total: 53
unchanged: 53

53 record(s) harvested in 132 seconds
 20 hours ago

added: 1
privilegesAppendedOnExistingRecord: 52
removed: 1
total: 53
unchanged: 52

I suspect pagination off-by-one.
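To make the suspicion concrete, here is a minimal Python sketch. It is not GeoNetwork code and it does not reproduce the harvester's actual logic, which I have not read. It only simulates a hypothetical paging bug in which every page after the first starts one record too late. The names `collect_uuids` and `off_by_one_backend`, the UUID values, and the page size of 30 are all assumptions made up for illustration.

```python
from typing import Callable, List, Set, Tuple


def collect_uuids(
    fetch_page: Callable[[int, int], List[str]],
    total: int,
    page_size: int,
) -> Tuple[Set[str], List[str]]:
    """Page through `total` hits; return (unique UUIDs, duplicates seen)."""
    seen: Set[str] = set()
    duplicates: List[str] = []
    for start in range(0, total, page_size):
        for uuid in fetch_page(start, page_size):
            if uuid in seen:
                duplicates.append(uuid)
            seen.add(uuid)
    return seen, duplicates


def off_by_one_backend(records: List[str]) -> Callable[[int, int], List[str]]:
    """Simulated source catalogue with a HYPOTHETICAL off-by-one paging bug."""

    def fetch_page(start: int, size: int) -> List[str]:
        # Assumed bug: every page after the first begins one record too late.
        begin = start + 1 if start > 0 else 0
        return records[begin:begin + size]

    return fetch_page


if __name__ == "__main__":
    # 54 made-up UUIDs standing in for the records on instance-B.
    records = [f"uuid-{i:02d}" for i in range(54)]
    backend = off_by_one_backend(records)

    # Page size 30 is an assumption borrowed from the UI search default.
    seen, duplicates = collect_uuids(backend, total=54, page_size=30)
    print(len(seen), sorted(set(records) - seen), duplicates)

    # A page size larger than the record count never crosses a page boundary.
    seen_all, _ = collect_uuids(backend, total=54, page_size=100)
    print(len(seen_all))
```

With this simulated bug, two pages of 30 yield 53 unique records and a single page of 100 yields all 54, which is why a larger page size looks like a plausible workaround. Note that this particular bug would always drop the same record, whereas what I observe changes between runs, so an unstable sort order between page requests may also be involved.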

If I'm correct, can the harvester be reconfigured to request larger pages as a workaround? The default page size in the UI search settings is 30. Does the harvester use the same pagination default as the UI, or is there an XML or JSON setting for it in the software distribution?

Another workaround could be some sort of 'do not delete' option. I read in the docs that such a setting should exist, but it is not available in the harvester's UI settings. If I could configure the harvester not to delete previously harvested records that are missing from a later retrieval, I might be able to keep that extra record after each harvest.

To Reproduce
.

Expected behavior
I expected all 54 valid, publicly available records to be harvested.

Screenshots

Unauthenticated on instance-B, I see my 54 records publicly available.

Image

On harvested instance-A side:

Image

Log file

Log files from runs with the "Overwrite" or "Skip" option:

harvester_geonetwork40_wf_test_records_from_DEV_nicebay__20260220142843.log

harvester_geonetwork40_wf_test_records_from_DEV_nicebay__20260219183545.log

Desktop (please complete the following information):

  • Browser Edge
  • GeoNetwork Version 4.2.14
  • Schema iso19139.ca.HNAP 4.2.14
  • Server Application Tomcat 9.0.106; Java Adoptium 8u462b08; Elasticsearch 7.17.15

Additional context
.
