Skip to content

Inconsistent sorting leads to entries missing from index #9205

@paul-hoeller

Description

@paul-hoeller

My company is maintaining a closed-source fork of GeoNetwork for a client. In our fork, we have noticed and eventually tracked down and fixed a significant bug where the index would be randomly missing entries from the database after a restart. While that fork is based on GeoNetwork 3.12, the bug appears to persist until today, in function synchronizeDbWithIndex().

This is the mechanism that caused the bug in our fork: At server startup, refreshIndex() (this is how synchronizeDbWithIndex() is called in 3.12) is invoked. This queries the database to determine what the state of the index should be. These queries are limited to METADATA_BATCH_PAGE_SIZE = 50,000 results. If there are more entries in the database, multiple queries are performed. Yes, apparently, a completely new query is performed each time, but we only use results 0 to 49,999 from the first query, then results 50,000 to 99,999 from the second query, and so on. The fact that these are separate queries is important because each query can yield the database entries in a different order, so which entries are results 0 to 49,999 and which are 50,000 to 99,999 is not necessarily consistent across queries. Consequently, it is entirely possible for an entry to be in the first chunk for query 1 and in the second chunk for query 2, meaning it is found twice. Or it could be in the second chunk for query 1 and in the first chunk for query 2, meaning it is not found at all. Now, the results are being sorted by changeDate; however, it is possible for many entries to have the exact same changeDate, in which case those entries can and will be returned in varying orders. If this results in database entries not being found, which it does with > 99 % probability for databases with many more than 50,000 entries of which many share the same changeDate, this causes those entries to be deleted from the index by refreshIndex().

The bug was fixed by sorting by id, which must be unique for each entry. Here is the patch for 3.12:

Patch for 3.12 Meant to be applied on top of c3c3339.
diff --git a/core/src/main/java/org/fao/geonet/kernel/datamanager/base/BaseMetadataManager.java b/core/src/main/java/org/fao/geonet/kernel/datamanager/base/BaseMetadataManager.java
index e944bb1020..5d7e24920c 100644
--- a/core/src/main/java/org/fao/geonet/kernel/datamanager/base/BaseMetadataManager.java
+++ b/core/src/main/java/org/fao/geonet/kernel/datamanager/base/BaseMetadataManager.java
@@ -50,7 +50,6 @@ import org.fao.geonet.domain.ISODate;
 import org.fao.geonet.domain.Metadata;
 import org.fao.geonet.domain.MetadataCategory;
 import org.fao.geonet.domain.MetadataDataInfo;
-import org.fao.geonet.domain.MetadataDataInfo_;
 import org.fao.geonet.domain.MetadataFileUpload;
 import org.fao.geonet.domain.MetadataFileUpload_;
 import org.fao.geonet.domain.MetadataSourceInfo;
@@ -97,7 +96,6 @@ import org.fao.geonet.repository.MetadataStatusRepository;
 import org.fao.geonet.repository.MetadataValidationRepository;
 import org.fao.geonet.repository.OperationAllowedRepository;
 import org.fao.geonet.repository.PathSpec;
-import org.fao.geonet.repository.SortUtils;
 import org.fao.geonet.repository.Updater;
 import org.fao.geonet.repository.UserRepository;
 import org.fao.geonet.repository.UserSavedSelectionRepository;
@@ -255,10 +253,10 @@ public class BaseMetadataManager implements IMetadataManager {
 
         LOGGER_DATA_MANAGER.debug("INDEX CONTENT:");
 
-        Sort sortByMetadataChangeDate = SortUtils.createSort(Metadata_.dataInfo, MetadataDataInfo_.changeDate);
+        Sort sortById = new Sort(Metadata_.id.getName());
         int currentPage = 0;
         Page<Pair<Integer, ISODate>> results = metadataUtils
-            .findAllIdsAndChangeDates(new PageRequest(currentPage, METADATA_BATCH_PAGE_SIZE, sortByMetadataChangeDate));
+            .findAllIdsAndChangeDates(new PageRequest(currentPage, METADATA_BATCH_PAGE_SIZE, sortById));
 
         // index all metadata in DBMS if needed
         while (results.getNumberOfElements() > 0) {
@@ -294,8 +292,8 @@ public class BaseMetadataManager implements IMetadataManager {
             }
 
             currentPage++;
-            results = metadataRepository
-                .findAllIdsAndChangeDates(new PageRequest(currentPage, METADATA_BATCH_PAGE_SIZE, sortByMetadataChangeDate));
+            results = metadataUtils
+                .findAllIdsAndChangeDates(new PageRequest(currentPage, METADATA_BATCH_PAGE_SIZE, sortById));
         }
 
         // if anything to index then schedule it to be done after servlet is

If synchronizeDbWithIndex() is of any relevance in GeoNetwork 4 (I see that, unlike refreshIndex() in 3, it is no longer called in Geonetwork.start()), then the same bug exists within GeoNetwork 4. (That is, unless the sorting order is actually guaranteed to be consistent and it just isn't in our fork due to some change we made that I'm unaware of.) I have written a variant of the patch for GeoNetwork 4, though this is NOT TESTED:

Patch for 4 Meant to be applied on top of de01d4b.
diff --git a/core/src/main/java/org/fao/geonet/kernel/datamanager/base/BaseMetadataManager.java b/core/src/main/java/org/fao/geonet/kernel/datamanager/base/BaseMetadataManager.java
index 616c732ab9..190101dd62 100644
--- a/core/src/main/java/org/fao/geonet/kernel/datamanager/base/BaseMetadataManager.java
+++ b/core/src/main/java/org/fao/geonet/kernel/datamanager/base/BaseMetadataManager.java
@@ -202,10 +202,10 @@ public class BaseMetadataManager implements IMetadataManager {
 
         LOGGER_DATA_MANAGER.debug("INDEX CONTENT:");
 
-        Sort sortByMetadataChangeDate = SortUtils.createSort(Sort.Direction.DESC, Metadata_.dataInfo, MetadataDataInfo_.changeDate);
+        Sort sortById = Sort.by(Sort.Direction.ASC, Metadata_.id.getName());
         int currentPage = 0;
         Page<Pair<Integer, ISODate>> results = metadataUtils.findAllIdsAndChangeDates(
-            PageRequest.of(currentPage, METADATA_BATCH_PAGE_SIZE, sortByMetadataChangeDate));
+            PageRequest.of(currentPage, METADATA_BATCH_PAGE_SIZE, sortById));
 
         // index all metadata in DBMS if needed
         while (results.getNumberOfElements() > 0) {
@@ -241,8 +241,8 @@ public class BaseMetadataManager implements IMetadataManager {
             }
 
             currentPage++;
-            results = metadataRepository.findIdsAndChangeDates(
-                PageRequest.of(currentPage, METADATA_BATCH_PAGE_SIZE, sortByMetadataChangeDate));
+            results = metadataUtils.findIdsAndChangeDates(
+                PageRequest.of(currentPage, METADATA_BATCH_PAGE_SIZE, sortById));
         }
 
         // if anything to index then schedule it to be done after servlet is

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions