Description
Elasticsearch Version
8.16.0+
Installed Plugins
No response
Java Version
bundled
OS Version
Linux kernel 6.3+
Problem Description
Starting from 8.16.0, on Linux kernel 6.3+ with MGLRU enabled (cat /sys/kernel/mm/lru_gen/enabled returns 0x0007), the vector index merge process might take a very long time to complete. This applies to the non-quantized vector format, as in the example below. Note that the default vector type has been quantized (int8_hnsw) since version 8.14.
"titleVector" : {
"type" : "dense_vector",
"dims" : 768,
"index" : true,
"similarity" : "dot_product",
"index_options" : {
"type" : "hnsw", <--- HERE
"m" : 16,
"ef_construction" : 100
}
},
The most characteristic symptom is that the Lucene merge thread is constantly calculating vector dot products.
(note: the outputs below are for ES 8.17.2 and Linux kernel 6.8)
$ ./jdk/bin/jstack -l -e $(./jdk/bin/jps | grep Elasticsearch | awk '{print $1}') | grep "Lucene Merge Thread" -A35
"elasticsearch[<REDACTED>][[vectors][0]: Lucene Merge Thread #0]" #85 [14473] daemon prio=5 os_prio=0 cpu=2086890.49ms elapsed=9886.33s allocated=11887M defined_classes=212 tid=0x00007f865c0188f0 nid=14473 runnable [0x00007f86f4afc000]
java.lang.Thread.State: RUNNABLE
at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport.dotProductBody([email protected]/PanamaVectorUtilSupport.java:122)
at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport.dotProduct([email protected]/PanamaVectorUtilSupport.java:102)
at org.apache.lucene.util.VectorUtil.dotProduct([email protected]/VectorUtil.java:66)
at org.apache.lucene.index.VectorSimilarityFunction$2.compare([email protected]/VectorSimilarityFunction.java:55)
at org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$FloatScoringSupplier$1.score([email protected]/DefaultFlatVectorScorer.java:149)
at org.apache.lucene.util.hnsw.HnswGraphSearcher.searchLevel([email protected]/HnswGraphSearcher.java:237)
at org.apache.lucene.util.hnsw.HnswGraphBuilder.addGraphNode([email protected]/HnswGraphBuilder.java:265)
at org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder$ConcurrentMergeWorker.addGraphNode([email protected]/HnswConcurrentMergeBuilder.java:200)
at org.apache.lucene.util.hnsw.HnswGraphBuilder.addVectors([email protected]/HnswGraphBuilder.java:197)
at org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder$ConcurrentMergeWorker.run([email protected]/HnswConcurrentMergeBuilder.java:180)
at org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder.lambda$build$0([email protected]/HnswConcurrentMergeBuilder.java:86)
at org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder$$Lambda/0x00007f86a06496a8.call([email protected]/Unknown Source)
at java.util.concurrent.FutureTask.run(java.base@23/FutureTask.java:317)
at org.apache.lucene.search.TaskExecutor$TaskGroup$1.run([email protected]/TaskExecutor.java:120)
at org.apache.lucene.search.TaskExecutor$TaskGroup.invokeAll([email protected]/TaskExecutor.java:176)
at org.apache.lucene.search.TaskExecutor.invokeAll([email protected]/TaskExecutor.java:84)
at org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder.build([email protected]/HnswConcurrentMergeBuilder.java:90)
at org.apache.lucene.util.hnsw.IncrementalHnswGraphMerger.merge([email protected]/IncrementalHnswGraphMerger.java:147)
at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsWriter.mergeOneField([email protected]/Lucene99HnswVectorsWriter.java:377)
at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsWriter.mergeOneField([email protected]/PerFieldKnnVectorsFormat.java:121)
at org.apache.lucene.codecs.KnnVectorsWriter.merge([email protected]/KnnVectorsWriter.java:106)
at org.apache.lucene.index.SegmentMerger.mergeVectorValues([email protected]/SegmentMerger.java:271)
at org.apache.lucene.index.SegmentMerger$$Lambda/0x00007f86a0638000.merge([email protected]/Unknown Source)
at org.apache.lucene.index.SegmentMerger.mergeWithLogging([email protected]/SegmentMerger.java:314)
at org.apache.lucene.index.SegmentMerger.merge([email protected]/SegmentMerger.java:158)
at org.apache.lucene.index.IndexWriter.mergeMiddle([email protected]/IndexWriter.java:5292)
at org.apache.lucene.index.IndexWriter.merge([email protected]/IndexWriter.java:4758)
at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge([email protected]/IndexWriter.java:6581)
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge([email protected]/ConcurrentMergeScheduler.java:660)
at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge([email protected]/ElasticsearchConcurrentMergeScheduler.java:103)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run([email protected]/ConcurrentMergeScheduler.java:721)
Locked ownable synchronizers:
- None
Hot threads show high I/O wait (notice other=88.1%):
$ curl -u <REDACTED> "http://localhost:9200/_nodes/hot_threads?threads=10000" -s | grep "\[vectors\]\[0\]: Lucene Merge Thread #0" -A25
100.0% [cpu=11.9%, other=88.1%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[ip-172-31-32-115][[vectors][0]: Lucene Merge Thread #0]'
10/10 snapshots sharing following 31 elements
app/[email protected]/org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport.dotProductBody(PanamaVectorUtilSupport.java:122)
app/[email protected]/org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport.dotProduct(PanamaVectorUtilSupport.java:102)
app/[email protected]/org.apache.lucene.util.VectorUtil.dotProduct(VectorUtil.java:66)
app/[email protected]/org.apache.lucene.index.VectorSimilarityFunction$2.compare(VectorSimilarityFunction.java:55)
app/[email protected]/org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$FloatScoringSupplier$1.score(DefaultFlatVectorScorer.java:149)
app/[email protected]/org.apache.lucene.util.hnsw.HnswGraphSearcher.searchLevel(HnswGraphSearcher.java:237)
app/[email protected]/org.apache.lucene.util.hnsw.HnswGraphBuilder.addGraphNode(HnswGraphBuilder.java:265)
app/[email protected]/org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder$ConcurrentMergeWorker.addGraphNode(HnswConcurrentMergeBuilder.java:200)
app/[email protected]/org.apache.lucene.util.hnsw.HnswGraphBuilder.addVectors(HnswGraphBuilder.java:197)
[..]
From the OS perspective, there is a high level of major page faults...
$ sar -B 1 5
Linux 6.8.0-1021-aws (<REDACTED>) 03/10/25 _x86_64_ (4 CPU)
13:30:01 pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff
13:30:02 77240.00 60.00 19313.00 19310.00 25420.00 25408.00 0.00 50816.00 200.00
13:30:03 77640.00 0.00 19410.00 19410.00 11.00 0.00 0.00 0.00 0.00
13:30:04 78244.00 0.00 19561.00 19561.00 25419.00 25408.00 0.00 50816.00 200.00
13:30:05 78756.00 0.00 19689.00 19689.00 25420.00 25408.00 0.00 50816.00 200.00
13:30:06 78344.00 0.00 19586.00 19586.00 25418.00 25408.00 0.00 50816.00 200.00
Average: 78044.80 12.00 19511.80 19511.20 20337.60 20326.40 0.00 40652.80 200.00
... and I/O wait, see the wa column:
(note: this is on a reproduction system that has nothing to do but the merge; on busy systems the CPU I/O wait might be "hidden")
$ vmstat -a -SM 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free inact active si so bi bo in cs us sy id wa st
2 1 0 331 8646 6168 0 0 6662 656 96 96 3 2 82 13 0
0 2 0 264 8715 6169 0 0 71296 0 20083 35919 1 5 51 43 0
0 2 0 287 8690 6169 0 0 75572 0 21292 38060 1 5 51 43 0
0 2 0 311 8659 6169 0 0 69932 0 19752 35268 2 4 51 43 0
0 2 0 350 8632 6169 0 0 74724 8 21119 37623 2 5 51 42 0
... and I/O PSI metrics:
$ cat /proc/pressure/io
some avg10=54.40 avg60=51.46 avg300=51.83 total=15287830871
full avg10=54.39 avg60=51.41 avg300=51.79 total=15251110144
The exact level of major page faults will depend on the system. The above was seen on a system with 2 concurrent Lucene merge threads and an NVMe SSD with ~100 µs read latency (r_await). With higher read latency, major page faults might be proportionally lower (see the back-of-the-envelope check after the iostat output below).
$ iostat -xdz 1
[..]
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme0n1 19402.00 77612.00 0.00 0.00 0.09 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.73 98.40
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme0n1 19681.00 78720.00 0.00 0.00 0.09 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.75 98.30
[..]
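A quick back-of-the-envelope check on these numbers: the ~19,500 majflt/s reported by sar lines up with the ~19,400-19,700 r/s reported by iostat at rareq-sz = 4, i.e. roughly 19,500 × 4 KiB ≈ 78,000 KiB/s, which matches both rkB/s and pgpgin/s. In other words, essentially every major fault is served by a single 4 KiB read with no readahead, which is exactly what MADV_RANDOM requests. With 2 merge threads each blocking on a ~0.1 ms read per fault (r_await ≈ 0.09 ms), the achievable rate is roughly 2 / 0.0001 s ≈ 20,000 faults/s, which matches the observation (and is consistent with aqu-sz ≈ 1.7); on a disk with higher read latency the fault rate would be correspondingly lower.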
The regression is a combination of:
- Lucene using MADV_RANDOM madvise for VEC files (see the sketch after this list)
- Elasticsearch not overriding this, starting from "Enable madvise by default for all builds" #110159
- Changed semantics of MADV_RANDOM in the Linux kernel with MGLRU, specifically after torvalds/linux@8788f67
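For readers less familiar with madvise, the sketch below shows roughly what that advice looks like at the syscall level. This is an illustrative, self-contained C program, not Lucene's actual code (Lucene issues the equivalent advice internally when it memory-maps index files); the file argument is hypothetical and the comments only tie the call back to the behaviour described in this report.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <vector-data-file>\n", argv[0]);
        return 1;
    }

    /* Open and memory-map a (hypothetical) vector data file read-only,
     * the way a reader of a large vector file might. */
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    void *addr = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }

    /* Advise the kernel that accesses to this mapping will be random.
     * Classically this mostly disables readahead; with MGLRU (kernel 6.3+,
     * after the commit referenced above) the semantics of this advice
     * changed, which is what this report ties to the constant major page
     * faults during HNSW merges. */
    if (madvise(addr, (size_t) st.st_size, MADV_RANDOM) != 0) {
        perror("madvise");
    }

    /* ... the HNSW graph build would now perform scattered reads through
     * `addr`, each cold access costing a 4 KiB major fault ... */

    munmap(addr, (size_t) st.st_size);
    close(fd);
    return 0;
}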
Steps to Reproduce
The following steps reproduce the problem reliably on hosts with 4 or 8 CPU cores, 16 GiB of RAM, and Intel processors supporting the AVX-512 extension. The problem might be present elsewhere, but that is how the reproductions were run. The data is based on the SO Vector Rally track, which creates a single vectors index with 2 shards. This specific payload uses the hnsw vector index type, which has not been the default since 8.14.0.
# make sure you have 6.3+ Linux kernel
uname -a
# make sure MGLRU is enabled (should return 0x0007)
cat /sys/kernel/mm/lru_gen/enabled
# make sure AVX-512 extensions are available
grep avx512 /proc/cpuinfo
# make sure you have 16 GiB of memory
grep MemTotal /proc/meminfo
# for fancy download
sudo apt-get update
sudo apt install aria2
# download (607 MiB) and unpack Elasticsearch version 8.17.2
aria2c -x 4 https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.17.2-linux-x86_64.tar.gz
tar zxf elasticsearch-8.17.2-linux-x86_64.tar.gz
cd elasticsearch-8.17.2
# download Lucene 9.12.0 with modified logging
mv lib/lucene-core-9.12.0.jar lib/lucene-core-9.12.0.jar.orig
wget https://storage.googleapis.com/so-vector-test-public/lucene-core-9.12.0.jar -O lib/lucene-core-9.12.0.jar
# configure Elasticsearch user/password
./bin/elasticsearch-users useradd test -p test123 -r superuser
# set heap size to 4 GiB
echo "-Xms4g" >> config/jvm.options
echo "-Xmx4g" >> config/jvm.options
# set force merge thread pool size to 2
echo "thread_pool.force_merge.size: 2" >> config/elasticsearch.yml
# download the SO Vector Rally track data with "hnsw" vector index type ingested into ES (31 GiB) and untar it
aria2c -x 4 https://storage.googleapis.com/so-vector-test-public/8.17.2-data.tar
tar -xf 8.17.2-data.tar
# start tmux session
tmux
# start Elasticsearch
./bin/elasticsearch
# wait for this log line:
# [2025-03-10T10:12:57,968][INFO ][o.e.c.r.a.AllocationService] [<REDACTED>] current.health="GREEN" message="Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[vectors][0]]])." previous.health="RED" reason="shards started [[vectors][0]]"
# --- in a new window
# list indices
curl -u test:test123 "http://localhost:9200/_cat/indices?v"
# --- example output ---
# health status index uuid pri rep docs.count docs.deleted store.size pri.store.size dataset.size
# green open vectors zJfxtEMfR5-V8o84Fsr2-w 2 0 2000000 0 30.7gb 30.7gb 30.7gb
# initiate force merge of the "vectors" index
curl -u test:test123 -X POST "http://localhost:9200/vectors/_forcemerge?max_num_segments=1&wait_for_completion=false"
# --- example output ---
# {"task":"KbsRAY4bScCt_8EcwpO1bw:212"}
Logs (if relevant)
~10 minutes after the force merge is initiated, the graph build process starts. Here, each 1000-vector chunk is processed in around a second:
[2025-03-10T10:47:17,579][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] addVectors [0 2048]
[2025-03-10T10:47:17,672][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 0 in 92/92 ms
[2025-03-10T10:47:17,869][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] addVectors [0 2048]
[2025-03-10T10:47:18,098][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 0 in 228/228 ms
[2025-03-10T10:47:19,121][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 1000 in 1449/1541 ms
[2025-03-10T10:47:20,290][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 2000 in 1168/2710 ms
[2025-03-10T10:47:20,346][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] addVectors [2048 4096]
[2025-03-10T10:47:21,435][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 3000 in 1089/1089 ms
[2025-03-10T10:47:22,613][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 4000 in 1178/2267 ms
But later the process degrades, and each chunk can take much longer:
[2025-03-10T11:15:22,990][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] addVectors [659456 661504]
[2025-03-10T11:15:27,159][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] addVectors [591872 593920]
[2025-03-10T11:15:30,313][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 592000 in 3154/3154 ms
[2025-03-10T11:15:36,239][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 660000 in 13249/13249 ms
[2025-03-10T11:15:54,574][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 593000 in 24260/27414 ms
[2025-03-10T11:15:59,629][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 661000 in 23389/36639 ms
(note the process ends at 1M vectors in each shard)
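For a sense of scale: at the initial rate of roughly 1-1.5 s per 1000-vector chunk, building the graph for the ~1M vectors in one shard would take on the order of 20-25 minutes, whereas at the degraded rates of 13-24+ s per chunk seen above the same build stretches to several hours, which is why the merge takes so long to complete.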
Workaround
Disable MGLRU:
sudo sh -c 'echo 0 > /sys/kernel/mm/lru_gen/enabled'
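Note that this sysfs setting does not survive a reboot, so the workaround has to be reapplied at boot (for example from an init script or a systemd unit) if it is to be kept in place.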