Description
Elasticsearch Version
8.16.0+
Installed Plugins
No response
Java Version
bundled
OS Version
Linux kernel 6.3+
Problem Description
Starting from 8.16.0, on Linux kernel 6.3+ with MGLRU enabled (cat /sys/kernel/mm/lru_gen/enabled returns 0x0007), the vector index merge process might take a very long time to complete. This applies to the non-quantized vector format, as in the example below. Note that the default vector type has been quantized (int8_hnsw) since version 8.14.
"titleVector" : {
"type" : "dense_vector",
"dims" : 768,
"index" : true,
"similarity" : "dot_product",
"index_options" : {
"type" : "hnsw", <--- HERE
"m" : 16,
"ef_construction" : 100
}
},
The most characteristic symptom is that the Lucene merge thread is constantly calculating vector dot products.
(note: the outputs below are for ES 8.17.2 and Linux kernel 6.8)
$ ./jdk/bin/jstack -l -e $(./jdk/bin/jps | grep Elasticsearch | awk '{print $1}') | grep "Lucene Merge Thread" -A35
"elasticsearch[<REDACTED>][[vectors][0]: Lucene Merge Thread #0]" #85 [14473] daemon prio=5 os_prio=0 cpu=2086890.49ms elapsed=9886.33s allocated=11887M defined_classes=212 tid=0x00007f865c0188f0 nid=14473 runnable [0x00007f86f4afc000]
java.lang.Thread.State: RUNNABLE
at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport.dotProductBody([email protected]/PanamaVectorUtilSupport.java:122)
at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport.dotProduct([email protected]/PanamaVectorUtilSupport.java:102)
at org.apache.lucene.util.VectorUtil.dotProduct([email protected]/VectorUtil.java:66)
at org.apache.lucene.index.VectorSimilarityFunction$2.compare([email protected]/VectorSimilarityFunction.java:55)
at org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$FloatScoringSupplier$1.score([email protected]/DefaultFlatVectorScorer.java:149)
at org.apache.lucene.util.hnsw.HnswGraphSearcher.searchLevel([email protected]/HnswGraphSearcher.java:237)
at org.apache.lucene.util.hnsw.HnswGraphBuilder.addGraphNode([email protected]/HnswGraphBuilder.java:265)
at org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder$ConcurrentMergeWorker.addGraphNode([email protected]/HnswConcurrentMergeBuilder.java:200)
at org.apache.lucene.util.hnsw.HnswGraphBuilder.addVectors([email protected]/HnswGraphBuilder.java:197)
at org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder$ConcurrentMergeWorker.run([email protected]/HnswConcurrentMergeBuilder.java:180)
at org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder.lambda$build$0([email protected]/HnswConcurrentMergeBuilder.java:86)
at org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder$$Lambda/0x00007f86a06496a8.call([email protected]/Unknown Source)
at java.util.concurrent.FutureTask.run(java.base@23/FutureTask.java:317)
at org.apache.lucene.search.TaskExecutor$TaskGroup$1.run([email protected]/TaskExecutor.java:120)
at org.apache.lucene.search.TaskExecutor$TaskGroup.invokeAll([email protected]/TaskExecutor.java:176)
at org.apache.lucene.search.TaskExecutor.invokeAll([email protected]/TaskExecutor.java:84)
at org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder.build([email protected]/HnswConcurrentMergeBuilder.java:90)
at org.apache.lucene.util.hnsw.IncrementalHnswGraphMerger.merge([email protected]/IncrementalHnswGraphMerger.java:147)
at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsWriter.mergeOneField([email protected]/Lucene99HnswVectorsWriter.java:377)
at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsWriter.mergeOneField([email protected]/PerFieldKnnVectorsFormat.java:121)
at org.apache.lucene.codecs.KnnVectorsWriter.merge([email protected]/KnnVectorsWriter.java:106)
at org.apache.lucene.index.SegmentMerger.mergeVectorValues([email protected]/SegmentMerger.java:271)
at org.apache.lucene.index.SegmentMerger$$Lambda/0x00007f86a0638000.merge([email protected]/Unknown Source)
at org.apache.lucene.index.SegmentMerger.mergeWithLogging([email protected]/SegmentMerger.java:314)
at org.apache.lucene.index.SegmentMerger.merge([email protected]/SegmentMerger.java:158)
at org.apache.lucene.index.IndexWriter.mergeMiddle([email protected]/IndexWriter.java:5292)
at org.apache.lucene.index.IndexWriter.merge([email protected]/IndexWriter.java:4758)
at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge([email protected]/IndexWriter.java:6581)
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge([email protected]/ConcurrentMergeScheduler.java:660)
at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge([email protected]/ElasticsearchConcurrentMergeScheduler.java:103)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run([email protected]/ConcurrentMergeScheduler.java:721)
Locked ownable synchronizers:
- None
Hot threads show high I/O wait (notice other=88.1%):
$ curl -u <REDACTED> "http://localhost:9200/_nodes/hot_threads?threads=10000" -s | grep "\[vectors\]\[0\]: Lucene Merge Thread #0" -A25
100.0% [cpu=11.9%, other=88.1%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[ip-172-31-32-115][[vectors][0]: Lucene Merge Thread #0]'
10/10 snapshots sharing following 31 elements
app/[email protected]/org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport.dotProductBody(PanamaVectorUtilSupport.java:122)
app/[email protected]/org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport.dotProduct(PanamaVectorUtilSupport.java:102)
app/[email protected]/org.apache.lucene.util.VectorUtil.dotProduct(VectorUtil.java:66)
app/[email protected]/org.apache.lucene.index.VectorSimilarityFunction$2.compare(VectorSimilarityFunction.java:55)
app/[email protected]/org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$FloatScoringSupplier$1.score(DefaultFlatVectorScorer.java:149)
app/[email protected]/org.apache.lucene.util.hnsw.HnswGraphSearcher.searchLevel(HnswGraphSearcher.java:237)
app/[email protected]/org.apache.lucene.util.hnsw.HnswGraphBuilder.addGraphNode(HnswGraphBuilder.java:265)
app/[email protected]/org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder$ConcurrentMergeWorker.addGraphNode(HnswConcurrentMergeBuilder.java:200)
app/[email protected]/org.apache.lucene.util.hnsw.HnswGraphBuilder.addVectors(HnswGraphBuilder.java:197)
[..]
From the OS perspective, there is a high level of major page faults...
$ sar -B 1 5
Linux 6.8.0-1021-aws (<REDACTED>) 03/10/25 _x86_64_ (4 CPU)
13:30:01 pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff
13:30:02 77240.00 60.00 19313.00 19310.00 25420.00 25408.00 0.00 50816.00 200.00
13:30:03 77640.00 0.00 19410.00 19410.00 11.00 0.00 0.00 0.00 0.00
13:30:04 78244.00 0.00 19561.00 19561.00 25419.00 25408.00 0.00 50816.00 200.00
13:30:05 78756.00 0.00 19689.00 19689.00 25420.00 25408.00 0.00 50816.00 200.00
13:30:06 78344.00 0.00 19586.00 19586.00 25418.00 25408.00 0.00 50816.00 200.00
Average: 78044.80 12.00 19511.80 19511.20 20337.60 20326.40 0.00 40652.80 200.00
... and I/O wait, see the wa column:
(note: this is on a reproduction system that has nothing to do but the merge; on busy systems the CPU I/O wait might be "hidden")
$ vmstat -a -SM 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free inact active si so bi bo in cs us sy id wa st
2 1 0 331 8646 6168 0 0 6662 656 96 96 3 2 82 13 0
0 2 0 264 8715 6169 0 0 71296 0 20083 35919 1 5 51 43 0
0 2 0 287 8690 6169 0 0 75572 0 21292 38060 1 5 51 43 0
0 2 0 311 8659 6169 0 0 69932 0 19752 35268 2 4 51 43 0
0 2 0 350 8632 6169 0 0 74724 8 21119 37623 2 5 51 42 0
... and I/O PSI metrics:
$ cat /proc/pressure/io
some avg10=54.40 avg60=51.46 avg300=51.83 total=15287830871
full avg10=54.39 avg60=51.41 avg300=51.79 total=15251110144
The exact level of major page faults will depend on the system. The above was seen on a system with 2 concurrent Lucene merge threads and an NVMe SSD with ~100 µs read latency (r_await). With higher read latency, major page faults might be proportionally lower (see the back-of-the-envelope check after the iostat output below).
$ iostat -xdz 1
[..]
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme0n1 19402.00 77612.00 0.00 0.00 0.09 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.73 98.40
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme0n1 19681.00 78720.00 0.00 0.00 0.09 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.75 98.30
[..]
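A quick back-of-the-envelope check on these numbers: the ~19,500 majflt/s reported by sar lines up with the ~19,400-19,700 r/s reported by iostat at rareq-sz = 4, i.e. roughly 19,500 × 4 KiB ≈ 78,000 KiB/s, which matches both rkB/s and pgpgin/s. In other words, essentially every major fault is served by a single 4 KiB read with no readahead, which is exactly what MADV_RANDOM requests. With 2 merge threads each blocking on a ~0.1 ms read per fault (r_await ≈ 0.09 ms), the achievable rate is roughly 2 / 0.0001 s ≈ 20,000 faults/s, which matches the observation (and is consistent with aqu-sz ≈ 1.7); on a disk with higher read latency the fault rate would be correspondingly lower.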
The regression is a combination of:
- Lucene using MADV_RANDOM madvise for VEC files (see the sketch after this list)
- Elasticsearch not overriding this, starting from "Enable madvise by default for all builds" #110159
- Changed semantics of MADV_RANDOM in the Linux kernel with MGLRU, specifically after torvalds/linux@8788f67
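For readers less familiar with madvise, the sketch below shows roughly what that advice looks like at the syscall level. This is an illustrative, self-contained C program, not Lucene's actual code (Lucene issues the equivalent advice internally when it memory-maps index files); the file argument is hypothetical and the comments only tie the call back to the behaviour described in this report.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <vector-data-file>\n", argv[0]);
        return 1;
    }

    /* Open and memory-map a (hypothetical) vector data file read-only,
     * the way a reader of a large vector file might. */
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    void *addr = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }

    /* Advise the kernel that accesses to this mapping will be random.
     * Classically this mostly disables readahead; with MGLRU (kernel 6.3+,
     * after the commit referenced above) the semantics of this advice
     * changed, which is what this report ties to the constant major page
     * faults during HNSW merges. */
    if (madvise(addr, (size_t) st.st_size, MADV_RANDOM) != 0) {
        perror("madvise");
    }

    /* ... the HNSW graph build would now perform scattered reads through
     * `addr`, each cold access costing a 4 KiB major fault ... */

    munmap(addr, (size_t) st.st_size);
    close(fd);
    return 0;
}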
Steps to Reproduce
The following steps reproduce the problem reliably on hosts with 4 or 8 CPU cores, 16 GiB of RAM, and Intel processors supporting the AVX-512 extension. The problem might be present elsewhere, but that is how the reproductions were run. The data is based on the SO Vector Rally track, which creates a single vectors index with 2 shards. This specific payload uses the hnsw vector index type, which has not been the default since 8.14.0.
# make sure you have 6.3+ Linux kernel
uname -a
# make sure MGLRU is enabled (should return 0x0007)
cat /sys/kernel/mm/lru_gen/enabled
# make sure AVX-512 extensions are available
grep avx512 /proc/cpuinfo
# make sure you have 16 GiB of memory
grep MemTotal /proc/meminfo
# for fancy download
sudo apt-get update
sudo apt install aria2
# download (607 MiB) and unpack Elasticsearch version 8.17.2
aria2c -x 4 https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.17.2-linux-x86_64.tar.gz
tar zxf elasticsearch-8.17.2-linux-x86_64.tar.gz
cd elasticsearch-8.17.2
# download Lucene 9.12.0 with modified logging
mv lib/lucene-core-9.12.0.jar lib/lucene-core-9.12.0.jar.orig
wget https://storage.googleapis.com/so-vector-test-public/lucene-core-9.12.0.jar -O lib/lucene-core-9.12.0.jar
# configure Elasticsearch user/password
./bin/elasticsearch-users useradd test -p test123 -r superuser
# set heap size to 4 GiB
echo "-Xms4g" >> config/jvm.options
echo "-Xmx4g" >> config/jvm.options
# set force merge thread pool size to 2
echo "thread_pool.force_merge.size: 2" >> config/elasticsearch.yml
# download the SO Vector Rally track data with "hnsw" vector index type ingested into ES (31 GiB) and untar it
aria2c -x 4 https://storage.googleapis.com/so-vector-test-public/8.17.2-data.tar
tar -xf 8.17.2-data.tar
# start tmux session
tmux
# start Elasticsearch
./bin/elasticsearch
# wait for this log line:
# [2025-03-10T10:12:57,968][INFO ][o.e.c.r.a.AllocationService] [<REDACTED>] current.health="GREEN" message="Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[vectors][0]]])." previous.health="RED" reason="shards started [[vectors][0]]"
# --- in a new window
# list indices
curl -u test:test123 "http://localhost:9200/_cat/indices?v"
# --- example output ---
# health status index uuid pri rep docs.count docs.deleted store.size pri.store.size dataset.size
# green open vectors zJfxtEMfR5-V8o84Fsr2-w 2 0 2000000 0 30.7gb 30.7gb 30.7gb
# initiate force merge of the "vectors" index
curl -u test:test123 -X POST "http://localhost:9200/vectors/_forcemerge?max_num_segments=1&wait_for_completion=false"
# --- example output ---
# {"task":"KbsRAY4bScCt_8EcwpO1bw:212"}
Logs (if relevant)
~10 minutes after the force merge is initiated, the graph build process starts. Here, each 1000-vector chunk is processed in around a second:
[2025-03-10T10:47:17,579][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] addVectors [0 2048]
[2025-03-10T10:47:17,672][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 0 in 92/92 ms
[2025-03-10T10:47:17,869][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] addVectors [0 2048]
[2025-03-10T10:47:18,098][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 0 in 228/228 ms
[2025-03-10T10:47:19,121][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 1000 in 1449/1541 ms
[2025-03-10T10:47:20,290][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 2000 in 1168/2710 ms
[2025-03-10T10:47:20,346][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] addVectors [2048 4096]
[2025-03-10T10:47:21,435][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 3000 in 1089/1089 ms
[2025-03-10T10:47:22,613][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 4000 in 1178/2267 ms
But later the process degrades, and each chunk can take much longer:
[2025-03-10T11:15:22,990][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] addVectors [659456 661504]
[2025-03-10T11:15:27,159][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] addVectors [591872 593920]
[2025-03-10T11:15:30,313][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 592000 in 3154/3154 ms
[2025-03-10T11:15:36,239][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 660000 in 13249/13249 ms
[2025-03-10T11:15:54,574][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 593000 in 24260/27414 ms
[2025-03-10T11:15:59,629][INFO ][o.a.l.u.h.HnswGraphBuilder] [<REDACTED>] built 661000 in 23389/36639 ms
(note the process ends at 1M vectors in each shard)
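For a sense of scale: at the initial rate of roughly 1-1.5 s per 1000-vector chunk, building the graph for the ~1M vectors in one shard would take on the order of 20-25 minutes, whereas at the degraded rates of 13-24+ s per chunk seen above the same build stretches to several hours, which is why the merge takes so long to complete.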
Workaround
Disable MGLRU:
sudo sh -c 'echo 0 > /sys/kernel/mm/lru_gen/enabled'
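Note that this sysfs setting does not survive a reboot, so the workaround has to be reapplied at boot (for example from an init script or a systemd unit) if it is to be kept in place.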