Description
Describe the bug
We are testing segment replication under sustained high ingestion load and observing continuously increasing replica lag. We would like to confirm whether this behaviour is expected or indicates a limitation/bug.
Related component
No response
To Reproduce
1. Create a 2-node cluster
Instance type: i8g.2xlarge
2 nodes
Roles: data + cluster_manager
OpenSearch version: v3.4.0
Replication type: segment replication enabled
2. Create an index with 2 primary shards and 1 replica
PUT test-seg-repl
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1,
    "index.replication.type": "SEGMENT"
  }
}
Verify replication type:
GET test-seg-repl/_settings
3. Generate sustained high ingest load
Example bulk workload:
~95,000 docs/sec sustained
Continuous bulk indexing for more than 60 minutes
4. Monitor shard-level document counts
Every 5 minutes run:
GET _cat/shards/test-seg-repl?h=shard,prirep,node,docs
Sum:
All primary shard docs
All replica shard docs
Or use the monitoring script (optional reference).
5. Observe replica lag growth
Track:
Primary Docs
Replica Docs
Diff = Primary - Replica
Under sustained ~95k docs/sec ingest:
Primary docs grow steadily
Replica docs update in bursts
Diff (Primary - Replica) continuously increases
Lag does not stabilize.
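For reference, the monitoring in steps 4 and 5 can be sketched roughly as follows. This is a minimal illustration, not the exact script used for the numbers in this report. It assumes the cluster is reachable at http://localhost:9200 without authentication, and it requests the `_cat/shards` output with `format=json` so that rows arrive as objects with string values. The helper names `fetch_shard_rows` and `sum_docs` are hypothetical.

```python
import json
import urllib.request


def fetch_shard_rows(base_url="http://localhost:9200", index="test-seg-repl"):
    """Fetch one row per shard copy from _cat/shards as JSON.

    base_url is an assumption; adjust host, port and auth for your cluster.
    """
    url = f"{base_url}/_cat/shards/{index}?h=shard,prirep,node,docs&format=json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def sum_docs(rows):
    """Return (primary_docs, replica_docs, diff) summed over all shard rows.

    Each row is expected to carry "prirep" ("p" or "r") and "docs" (a string).
    Rows without a doc count (for example unassigned shards) are skipped.
    """
    totals = {"p": 0, "r": 0}
    for row in rows:
        docs = row.get("docs")
        if docs in (None, ""):
            continue
        totals[row["prirep"]] += int(docs)
    return totals["p"], totals["r"], totals["p"] - totals["r"]
```

Calling `sum_docs(fetch_shard_rows())` every 5 minutes and recording the three values gives the Primary Docs, Replica Docs and Diff columns tracked in step 5.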
Expected behavior
Replica lag should stabilize at some steady state under sustained ingestion.
Additional Details
I used a script to sum the document counts of all primary shards and all replica shards every 5 minutes. Here are the results:
Segment replication cluster
Final Summary Table
Time | Primary Docs | Replica Docs | Diff(P-R)
07:41 | 0 | 0 | 0
07:46 | 24784169 | 3175940 | 21608229
07:51 | 54809288 | 12790705 | 42018583
07:56 | 85014044 | 12790705 | 72223339
08:01 | 113333928 | 12790705 | 100543223
08:06 | 142037570 | 12790705 | 129246865
08:11 | 170996024 | 30417645 | 140578379
08:16 | 197927300 | 47854693 | 150072607
08:21 | 227214766 | 47854693 | 179360073
08:26 | 256039188 | 47854693 | 208184495
08:31 | 284573742 | 47854693 | 236719049
08:36 | 312336950 | 47854693 | 264482257
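The difference (replica lag) keeps increasing over time, and the replica doc count stays unchanged for several consecutive samples (for example from 07:51 to 08:06 and from 08:16 to 08:36).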
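To put the table in rate terms, here is a small sketch that derives average rates from the samples above. It only uses the numbers already listed. The function names are hypothetical, and the replica figure is the rate at which documents became visible on the replicas, which is a rough proxy and not a direct measurement of replication throughput.

```python
from datetime import datetime

# (time, primary docs, replica docs), copied from the summary table
SAMPLES = [
    ("07:41", 0, 0),
    ("07:46", 24784169, 3175940),
    ("07:51", 54809288, 12790705),
    ("07:56", 85014044, 12790705),
    ("08:01", 113333928, 12790705),
    ("08:06", 142037570, 12790705),
    ("08:11", 170996024, 30417645),
    ("08:16", 197927300, 47854693),
    ("08:21", 227214766, 47854693),
    ("08:26", 256039188, 47854693),
    ("08:31", 284573742, 47854693),
    ("08:36", 312336950, 47854693),
]


def average_rates(samples):
    """Average docs/sec on primaries and replicas between first and last sample."""
    fmt = "%H:%M"
    (t0, p0, r0), (t1, p1, r1) = samples[0], samples[-1]
    elapsed = (datetime.strptime(t1, fmt) - datetime.strptime(t0, fmt)).total_seconds()
    return (p1 - p0) / elapsed, (r1 - r0) / elapsed


def stalled_intervals(samples):
    """Count 5-minute intervals in which the replica doc count did not change."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev[2] == cur[2])


primary_rate, replica_rate = average_rates(SAMPLES)
print(f"primary: {primary_rate:,.0f} docs/sec, replica: {replica_rate:,.0f} docs/sec")
print(f"intervals with no replica progress: {stalled_intervals(SAMPLES)} of {len(SAMPLES) - 1}")
```

Over the 55 minutes covered by the table, this works out to roughly 94.6k docs/sec on the primaries versus roughly 14.5k docs/sec becoming visible on the replicas, with no replica progress in 7 of the 11 intervals.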