Segment Replication: Replica Lag Continuously Increases Under Sustained High Ingestion #20686

@Subrahmanyam-Gollapalli

Description

Describe the bug

We are testing segment replication under sustained high ingestion load and observing continuously increasing replica lag. We would like to confirm whether this behaviour is expected or indicates a limitation/bug.

Related component

No response

To Reproduce

1. Create a 2-node cluster
Instance type: i8g.2xlarge
2 nodes
Roles: data + cluster_manager
OpenSearch version: v3.4.0
Replication type: segment replication enabled

2. Create index with 2 primaries and 1 replica
PUT test-seg-repl
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1,
    "index.replication.type": "SEGMENT"
  }
}

Verify replication type:
GET test-seg-repl/_settings
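
The verification step can also be scripted. The following is a minimal sketch, assuming the `_settings` response is keyed by index name and returns settings either nested (`index` > `replication` > `type`) or flat (`index.replication.type`, as with `flat_settings=true`). The helper name `replication_type` is hypothetical and not part of any OpenSearch client.

```python
from typing import Optional


def replication_type(settings_response: dict, index: str) -> Optional[str]:
    """Return the replication type recorded for `index`, or None if absent.

    Assumes the usual response shape of GET <index>/_settings:
    {"<index>": {"settings": {...}}}.
    """
    settings = settings_response.get(index, {}).get("settings", {})

    # Flat form, e.g. when the request used ?flat_settings=true
    flat = settings.get("index.replication.type")
    if flat is not None:
        return flat

    # Nested form (assumed default): settings.index.replication.type
    return settings.get("index", {}).get("replication", {}).get("type")


if __name__ == "__main__":
    # Hand-written sample response, not captured from a real cluster.
    sample = {
        "test-seg-repl": {
            "settings": {
                "index": {
                    "replication": {"type": "SEGMENT"},
                    "number_of_shards": "2",
                    "number_of_replicas": "1",
                }
            }
        }
    }
    print(replication_type(sample, "test-seg-repl"))
```

Feeding it the parsed JSON body of the `GET test-seg-repl/_settings` call should return `SEGMENT` if the setting was applied at index creation.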

3. Generate sustained high ingest load
Example bulk workload:
~95,000 docs/sec sustained
Continuous bulk indexing for more than 60 minutes

4. Monitor shard-level document counts
Every 5 minutes run:
GET _cat/shards/test-seg-repl?h=shard,prirep,node,docs
Sum:
All primary shard docs
All replica shard docs
Or use the monitoring script (optional reference).
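
The summation in this step can be sketched as follows. This is not the monitoring script used for the results below; it is a minimal illustration, assuming the `_cat/shards` output has no header row and one whitespace-separated line per shard in the column order `shard prirep node docs`. The function name `sum_docs` is hypothetical.

```python
def sum_docs(cat_output: str) -> dict:
    """Sum document counts for primary and replica shards.

    Expects the plain-text body of
    GET _cat/shards/<index>?h=shard,prirep,node,docs
    Lines with missing columns (for example unassigned shards) are skipped.
    """
    totals = {"p": 0, "r": 0}
    for line in cat_output.splitlines():
        parts = line.split()
        # Need all four columns, a p/r marker, and a numeric docs value.
        if len(parts) < 4 or parts[1] not in totals or not parts[-1].isdigit():
            continue
        totals[parts[1]] += int(parts[-1])
    return {
        "primary": totals["p"],
        "replica": totals["r"],
        "diff": totals["p"] - totals["r"],
    }


if __name__ == "__main__":
    # Hand-written sample with made-up node names and counts.
    sample = (
        "0 p node-1 100\n"
        "0 r node-2 40\n"
        "1 p node-2 200\n"
        "1 r node-1 60\n"
    )
    print(sum_docs(sample))
```

Running this against the response text every 5 minutes gives the three values tracked in the next step.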

5. Observe replica lag growth
Track:
Primary Docs
Replica Docs
Diff = Primary - Replica

Under sustained ~95k docs/sec ingest:
Primary docs grow steadily
Replica docs update in bursts

Diff (Primary - Replica) continuously increases

Lag does not stabilize.

Expected behavior

Replica lag should stabilize at some steady state under sustained ingestion.

Additional Details

I used a script to sum the indexed document counts of the primary shards and of the replica shards every 5 minutes. Here are the results:

Segment replication cluster
Final Summary Table
Time | Primary Docs | Replica Docs | Diff(P-R)

07:41 | 0 | 0 | 0
07:46 | 24784169 | 3175940 | 21608229
07:51 | 54809288 | 12790705 | 42018583
07:56 | 85014044 | 12790705 | 72223339
08:01 | 113333928 | 12790705 | 100543223
08:06 | 142037570 | 12790705 | 129246865
08:11 | 170996024 | 30417645 | 140578379
08:16 | 197927300 | 47854693 | 150072607
08:21 | 227214766 | 47854693 | 179360073
08:26 | 256039188 | 47854693 | 208184495
08:31 | 284573742 | 47854693 | 236719049
08:36 | 312336950 | 47854693 | 264482257

Here, the difference (replica lag) keeps increasing over time, and the replica doc count stays unchanged across several consecutive 5-minute samples.
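
To quantify the trend, the table can be reduced to average rates. The sketch below hard-codes the rows from the table above and assumes all timestamps fall on the same day; the function name `summarize` is hypothetical. It reports the average primary and replica growth in docs/sec over the whole run and counts the 5-minute intervals in which the replica count did not change.

```python
from datetime import datetime

# (time, primary docs, replica docs), copied from the summary table.
SAMPLES = [
    ("07:41", 0, 0),
    ("07:46", 24784169, 3175940),
    ("07:51", 54809288, 12790705),
    ("07:56", 85014044, 12790705),
    ("08:01", 113333928, 12790705),
    ("08:06", 142037570, 12790705),
    ("08:11", 170996024, 30417645),
    ("08:16", 197927300, 47854693),
    ("08:21", 227214766, 47854693),
    ("08:26", 256039188, 47854693),
    ("08:31", 284573742, 47854693),
    ("08:36", 312336950, 47854693),
]


def summarize(samples):
    """Return elapsed seconds, average rates, and stalled-interval count."""
    fmt = "%H:%M"
    start = datetime.strptime(samples[0][0], fmt)
    end = datetime.strptime(samples[-1][0], fmt)
    elapsed_s = (end - start).total_seconds()

    primary_rate = (samples[-1][1] - samples[0][1]) / elapsed_s
    replica_rate = (samples[-1][2] - samples[0][2]) / elapsed_s

    # An interval is "stalled" when the replica count equals the previous one.
    stalled = sum(
        1 for prev, cur in zip(samples, samples[1:]) if cur[2] == prev[2]
    )
    return {
        "elapsed_s": elapsed_s,
        "primary_rate": primary_rate,
        "replica_rate": replica_rate,
        "lag_growth_rate": primary_rate - replica_rate,
        "stalled_intervals": stalled,
        "total_intervals": len(samples) - 1,
    }


if __name__ == "__main__":
    for key, value in summarize(SAMPLES).items():
        print(f"{key}: {value:,.0f}")
```

On this data the average primary rate comes out close to the ~95k docs/sec workload, while the replica average is a small fraction of that, which is what a continuously growing diff implies.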

Labels: Indexing:Replication, bug
