ReindexJob does not always index all expected pages #58

@adunn49

Module version

1.4.1 (for CMS 5)

Problem statement

When the ReindexJob runs, it doesn't always index everything it is expected to. Once the job has finished, the Search Service admin shows a difference between 'Documents indexed in the database' and 'Documents indexed remotely'. I have verified that both numbers are reported correctly, so the issue is with the ReindexJob itself.

It is also worth noting that if I run the ReindexJob repeatedly, 'Documents indexed remotely' rises a little each time.

Steps to reproduce

I am running the ReindexJob, within a DDEV environment, for a project that has around 1600 pages that are expected to be indexed. We are indexing just Pages and the structure of the site tree is nested several layers deep with roughly an even spread - this could be important as lots of pages with the same 'Sort' value could be a factor here, which I explain further down.

Start off with an empty index (run the Clean Index task) and then run the ReindexJob job (either from the command prompt or the Jobs admin in the CMS).

Let it run through to completion and navigate to the Search Service admin in the CMS to observe the number of pages indexed.

The count shown there is lower than the number of pages expected to be indexed.

Reproducing the issue in skeleton project (cms50 tag)

Configuring search

Add the following dependencies:

        "silverstripe/silverstripe-discoverer-bifrost": "^2.0",
        "silverstripe/silverstripe-forager-bifrost": "^1.2",

Add search.yml (note that we set the batch size low):

---
Name: elastic-config
After:
    - '#silverstripe-search-service'
---
SilverStripe\Forager\Service\IndexConfiguration:
    crawl_page_content: false
    include_page_html: true
    batch_size: 10
    indexes:
        main:
            includeClasses:
                Page:
                    fields:
                        title: true

Create pages

For this you'll need a decent number of pages with the same Sort value. I did this with a dev task ...

<?php

namespace App\Tasks;

use Page;
use SilverStripe\Dev\BuildTask;

class PopulateTestPagesTask extends BuildTask
{
    private static $segment = 'PopulateTestPages';
    protected $title = 'Populate Test Pages';

    public function getDescription() // phpcs:ignore SlevomatCodingStandard.TypeHints
    {
        return 'Creates several test pages for development.';
    }

    // BuildTask::run() in CMS 5 declares no return type, so nothing is returned.
    public function run($request)
    {
        // Create 100 published pages that all share the same Sort value.
        for ($i = 1; $i <= 100; $i++) {
            $page = Page::create();
            $page->Title = sprintf('Test Page %d', $i);
            $page->Content = sprintf('<p>This is test page %d</p>', $i);
            $page->Sort = 1;
            $page->write();
            $page->publishSingle();
            echo "Created and published: {$page->Title}<br>";
        }
    }
}

Running ReindexJob

With search configured and pointing to a clean Elastic index, run the ReindexJob.
The Search Admin reports that only 68 documents were indexed.

Additional context

We have a site where it is crucial that a reindex actually indexes everything it should. For this reason I’ve been digging into this by running and re-running the job locally with some debug output. The list of IDs that are processed seems somewhat random, and I have observed the same record ID appearing in multiple batches of documents being indexed, so presumably some records are omitted.
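The duplicate and missing IDs described above can be confirmed mechanically from the debug output. The following is a minimal sketch in plain PHP, not part of the forager module: `findBatchAnomalies()` is a hypothetical helper, and the batch contents are made-up sample data standing in for the logged IDs.

```php
<?php

declare(strict_types=1);

/**
 * Compare the IDs seen across all batches with the IDs that should be indexed.
 * Hypothetical helper for analysing debug output; not part of any module.
 *
 * @param int[][] $batches     IDs logged for each batch, in processing order
 * @param int[]   $expectedIds every ID that should have been indexed
 * @return array{duplicates: int[], missing: int[]}
 */
function findBatchAnomalies(array $batches, array $expectedIds): array
{
    // Count how many times each ID appears across all batches.
    $seen = [];
    foreach ($batches as $batch) {
        foreach ($batch as $id) {
            $seen[$id] = ($seen[$id] ?? 0) + 1;
        }
    }

    // IDs that were processed more than once.
    $duplicates = array_keys(
        array_filter($seen, static fn (int $count): bool => $count > 1)
    );

    // IDs that never appeared in any batch.
    $missing = array_values(array_diff($expectedIds, array_keys($seen)));

    sort($duplicates);
    sort($missing);

    return ['duplicates' => $duplicates, 'missing' => $missing];
}

// Made-up sample: ID 3 is fetched twice and ID 4 is never fetched.
$result = findBatchAnomalies([[1, 2, 3], [3, 5, 6]], [1, 2, 3, 4, 5, 6]);

echo 'Duplicates: ' . implode(', ', $result['duplicates']) . PHP_EOL; // Duplicates: 3
echo 'Missing: ' . implode(', ', $result['missing']) . PHP_EOL;       // Missing: 4
```

If the job were fetching each record exactly once, both lists would be empty.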

In particular I’ve been looking at the DataObjectFetcher where it retrieves the batch of records for processing (see https://github.com/silverstripeltd/silverstripe-forager/blob/main/src/DataObject/DataObjectFetcher.php#L125-L130) and trying to understand how it might return the same record across multiple batches. Looking at the SQL statement that is being run (see attached) it would seem to sort results on the ‘Sort’ column, which I think is the default sort column for a page.

The Sort value is obviously not unique across the entire site tree (as mentioned previously, we have a deeply nested site tree, so a large number of pages are likely to share the same Sort value), and I wonder if it is possible that records with the same Sort value come back in a different order from one batch query to the next. One thing that might affect this (I'm not sure about it) is that records appear to be updated while they are being indexed (the SearchIndexed value is set).
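To illustrate the suspected mechanism, here is a small simulation in plain PHP. It is only a sketch of the hypothesis, not the forager code: `fetchInBatches()` and the sample rows are invented for illustration, and reversing the stored rows between batches is an assumed stand-in for the database returning tied rows in a different order on the next query. It pages through six rows that all have Sort = 1, first ordering by Sort alone and then by Sort with ID as a tie-breaker.

```php
<?php

declare(strict_types=1);

// Six hypothetical pages that all share Sort = 1, as in the reproduction task.
$pages = [];
foreach (range(1, 6) as $id) {
    $pages[] = ['ID' => $id, 'Sort' => 1];
}

/**
 * Fetch every row in batches, re-running the "query" for each batch.
 * Between batches the stored row order is reversed to mimic tied rows
 * coming back in a different order on the next query (an assumption).
 * Relies on usort() being stable (PHP 8.0+), so ties keep the stored order.
 *
 * @return int[] IDs in the order they were fetched
 */
function fetchInBatches(array $rows, int $batchSize, callable $orderBy): array
{
    $fetchedIds = [];
    $total = count($rows);

    for ($offset = 0; $offset < $total; $offset += $batchSize) {
        $result = $rows;
        usort($result, $orderBy); // ORDER BY

        // LIMIT $batchSize OFFSET $offset
        foreach (array_slice($result, $offset, $batchSize) as $row) {
            $fetchedIds[] = $row['ID'];
        }

        // Simulate the underlying row order changing between queries.
        $rows = array_reverse($rows);
    }

    return $fetchedIds;
}

// Ordering by Sort only: every row ties, so the order is left to the "database".
$bySortOnly = static fn (array $a, array $b): int => $a['Sort'] <=> $b['Sort'];

// Ordering by Sort, then ID: the unique ID breaks every tie.
$bySortThenId = static fn (array $a, array $b): int
    => [$a['Sort'], $a['ID']] <=> [$b['Sort'], $b['ID']];

$unstable = fetchInBatches($pages, 3, $bySortOnly);
$stable = fetchInBatches($pages, 3, $bySortThenId);

echo implode(', ', $unstable) . PHP_EOL; // 1, 2, 3, 3, 2, 1
echo implode(', ', $stable) . PHP_EOL;   // 1, 2, 3, 4, 5, 6
```

With the Sort-only ordering, IDs 1 to 3 are fetched twice and IDs 4 to 6 are never fetched, which matches the symptom. With the ID tie-breaker, every row is fetched exactly once regardless of how the stored order changes. If the hypothesis holds, a comparable change in the ORM would presumably be a secondary sort on ID, for example `->sort(['Sort' => 'ASC', 'ID' => 'ASC'])` on the list being fetched, though I have not verified this against the fetcher.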
