
Commit f9872fe

javanna, cbuescher, mayya-sharipova, ChrisHegarty, and brianseeders authored and committed
Upgrade to Lucene 10 (elastic#114741)
The most relevant ES changes that upgrading to Lucene 10 requires are:
- use the appropriate IOContext
- Scorer / ScorerSupplier breaking changes
- Regex automaton are no longer determinized by default
- minimize moved to test classes
- introduce Elasticsearch900Codec
- adjust slicing code according to the added support for intra-segment concurrency
- disable intra-segment concurrency in tests
- adjust accessor methods for many Lucene classes that became a record
- adapt to breaking changes in the analysis area

Co-authored-by: Christoph Büscher <[email protected]>
Co-authored-by: Mayya Sharipova <[email protected]>
Co-authored-by: ChrisHegarty <[email protected]>
Co-authored-by: Brian Seeders <[email protected]>
Co-authored-by: Armin Braun <[email protected]>
Co-authored-by: Panagiotis Bailis <[email protected]>
Co-authored-by: Benjamin Trent <[email protected]>
1 parent 2f93690 commit f9872fe
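
The commit message item "adjust accessor methods for many Lucene classes that became a record" describes call sites moving from public field reads to record accessor methods. A minimal sketch of that pattern is shown here. `LegacyHits` and `Hits` are hypothetical stand-ins defined only for this illustration; they are not Lucene classes, and the actual affected types are listed in the diff itself.

```java
public class RecordAccessorSketch {
    // Hypothetical "before" shape: a final class exposing public final fields.
    static final class LegacyHits {
        public final long value;
        public final String relation;

        LegacyHits(long value, String relation) {
            this.value = value;
            this.relation = relation;
        }
    }

    // Hypothetical "after" shape: the same data expressed as a record.
    record Hits(long value, String relation) {}

    // Old-style call site: reads fields directly.
    static String describeLegacy(LegacyHits hits) {
        return hits.value + " (" + hits.relation + ")";
    }

    // New-style call site: uses the accessor methods generated for the record.
    static String describe(Hits hits) {
        return hits.value() + " (" + hits.relation() + ")";
    }

    public static void main(String[] args) {
        System.out.println(describeLegacy(new LegacyHits(42L, "EQUAL_TO")));
        System.out.println(describe(new Hits(42L, "EQUAL_TO")));
    }
}
```

The change at each call site is mechanical (`x.field` becomes `x.field()`), which is why it touches many files without altering behaviour.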

File tree

662 files changed, +8792 -3627 lines changed


.buildkite/pipelines/lucene-snapshot/run-tests.yml

Lines changed: 0 additions & 1 deletion
@@ -56,7 +56,6 @@ steps:
     matrix:
       setup:
         BWC_VERSION:
-          - 7.17.13
           - 8.9.1
           - 8.10.0
     agents:

benchmarks/src/main/java/org/elasticsearch/benchmark/vector/VectorScorerBenchmark.java

Lines changed: 4 additions & 6 deletions
@@ -19,7 +19,7 @@
 import org.apache.lucene.store.MMapDirectory;
 import org.apache.lucene.util.hnsw.RandomVectorScorer;
 import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier;
-import org.apache.lucene.util.quantization.RandomAccessQuantizedByteVectorValues;
+import org.apache.lucene.util.quantization.QuantizedByteVectorValues;
 import org.apache.lucene.util.quantization.ScalarQuantizer;
 import org.elasticsearch.common.logging.LogConfigurator;
 import org.elasticsearch.core.IOUtils;
@@ -217,19 +217,17 @@ public float squareDistanceScalar() {
         return 1 / (1f + adjustedDistance);
     }

-    RandomAccessQuantizedByteVectorValues vectorValues(int dims, int size, IndexInput in, VectorSimilarityFunction sim) throws IOException {
+    QuantizedByteVectorValues vectorValues(int dims, int size, IndexInput in, VectorSimilarityFunction sim) throws IOException {
         var sq = new ScalarQuantizer(0.1f, 0.9f, (byte) 7);
         var slice = in.slice("values", 0, in.length());
         return new OffHeapQuantizedByteVectorValues.DenseOffHeapVectorValues(dims, size, sq, false, sim, null, slice);
     }

-    RandomVectorScorerSupplier luceneScoreSupplier(RandomAccessQuantizedByteVectorValues values, VectorSimilarityFunction sim)
-        throws IOException {
+    RandomVectorScorerSupplier luceneScoreSupplier(QuantizedByteVectorValues values, VectorSimilarityFunction sim) throws IOException {
         return new Lucene99ScalarQuantizedVectorScorer(null).getRandomVectorScorerSupplier(sim, values);
     }

-    RandomVectorScorer luceneScorer(RandomAccessQuantizedByteVectorValues values, VectorSimilarityFunction sim, float[] queryVec)
-        throws IOException {
+    RandomVectorScorer luceneScorer(QuantizedByteVectorValues values, VectorSimilarityFunction sim, float[] queryVec) throws IOException {
         return new Lucene99ScalarQuantizedVectorScorer(null).getRandomVectorScorer(sim, values, queryVec);
     }

build-tools-internal/src/main/resources/forbidden/es-server-signatures.txt

Lines changed: 0 additions & 4 deletions
@@ -59,10 +59,6 @@ org.apache.lucene.util.Version#parseLeniently(java.lang.String)

 org.apache.lucene.index.NoMergePolicy#INSTANCE @ explicit use of NoMergePolicy risks forgetting to configure NoMergeScheduler; use org.elasticsearch.common.lucene.Lucene#indexWriterConfigWithNoMerging() instead.

-@defaultMessage Spawns a new thread which is solely under lucenes control use ThreadPool#relativeTimeInMillis instead
-org.apache.lucene.search.TimeLimitingCollector#getGlobalTimerThread()
-org.apache.lucene.search.TimeLimitingCollector#getGlobalCounter()
-
 @defaultMessage Don't interrupt threads use FutureUtils#cancel(Future<T>) instead
 java.util.concurrent.Future#cancel(boolean)
6864

build-tools-internal/version.properties

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 elasticsearch = 9.0.0
-lucene = 9.12.0
+lucene = 10.0.0

 bundled_jdk_vendor = openjdk
 bundled_jdk = 22.0.1+8@c7ec1332f7bb44aeba2eb341ae18aca4

distribution/src/config/jvm.options

Lines changed: 3 additions & 0 deletions
@@ -62,6 +62,9 @@
 23:-XX:CompileCommand=dontinline,java/lang/invoke/MethodHandle.setAsTypeCache
 23:-XX:CompileCommand=dontinline,java/lang/invoke/MethodHandle.asTypeUncached

+# Lucene 10: apply MADV_NORMAL advice to enable more aggressive readahead
+-Dorg.apache.lucene.store.defaultReadAdvice=normal
+
 ## heap dumps

 # generate a heap dump when an allocation from the Java heap fails; heap dumps
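
The `-Dorg.apache.lucene.store.defaultReadAdvice=normal` line added to `jvm.options` sets a JVM system property. As a rough illustration of how a setting of this kind can be resolved at startup, the sketch below reads the property and maps it onto an enum. The `ReadAdviceSketch` class, its `Advice` enum, and the fallback value passed by the caller are assumptions made for this example; Lucene's own resolution logic may differ.

```java
import java.util.Locale;

public class ReadAdviceSketch {
    // Hypothetical enum for illustration only; this is not Lucene's own type.
    enum Advice { NORMAL, RANDOM, SEQUENTIAL }

    // Property name as it appears in the jvm.options diff.
    static final String PROPERTY = "org.apache.lucene.store.defaultReadAdvice";

    // Parses a raw property value, returning the fallback when it is absent or blank.
    // Unknown values make Enum.valueOf throw IllegalArgumentException.
    static Advice parse(String raw, Advice fallback) {
        if (raw == null || raw.isBlank()) {
            return fallback;
        }
        return Advice.valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }

    // Resolves the advice from the JVM system property set via -D.
    static Advice fromSystemProperty(Advice fallback) {
        return parse(System.getProperty(PROPERTY), fallback);
    }

    public static void main(String[] args) {
        System.out.println(fromSystemProperty(Advice.RANDOM));
    }
}
```

Failing fast on an unrecognised value is a deliberate choice in this sketch: a mistyped option in `jvm.options` surfaces at startup instead of silently falling back.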

docs/Versions.asciidoc

Lines changed: 2 additions & 2 deletions
@@ -1,8 +1,8 @@

 include::{docs-root}/shared/versions/stack/{source_branch}.asciidoc[]

-:lucene_version: 9.12.0
-:lucene_version_path: 9_12_0
+:lucene_version: 10.0.0
+:lucene_version_path: 10_0_0
 :jdk: 11.0.2
 :jdk_major: 11
 :build_type: tar

docs/changelog/113482.yaml

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+pr: 113482
+summary: The 'persian' analyzer has stemmer by default
+area: Analysis
+type: breaking
+issues:
+ - 113050
+breaking:
+  title: The 'persian' analyzer has stemmer by default
+  area: Analysis
+  details: >-
+    Lucene 10 has added a final stemming step to its PersianAnalyzer that Elasticsearch
+    exposes as 'persian' analyzer. Existing indices will keep the old
+    non-stemming behaviour while new indices will see the updated behaviour with
+    added stemming.
+    Users that wish to maintain the non-stemming behaviour need to define their
+    own analyzer as outlined in
+    https://www.elastic.co/guide/en/elasticsearch/reference/8.15/analysis-lang-analyzer.html#persian-analyzer.
+    Users that wish to use the new stemming behaviour for existing indices will
+    have to reindex their data.
+  impact: >-
+    Indexing with the 'persian' analyzer will produce slightly different tokens.
+    Users should check if this impacts their search results. If they wish to
+    maintain the legacy non-stemming behaviour they can define their own
+    analyzer equivalent as explained in
+    https://www.elastic.co/guide/en/elasticsearch/reference/8.15/analysis-lang-analyzer.html#persian-analyzer.
+  notable: false
+

docs/changelog/113614.yaml

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+pr: 113614
+summary: The 'german2' stemmer is now an alias for the 'german' snowball stemmer
+area: Analysis
+type: breaking
+issues: []
+breaking:
+  title: The "german2" snowball stemmer is now an alias for the "german" stemmer
+  area: Analysis
+  details: >-
+    Lucene 10 has merged the improved "german2" snowball language stemmer with the
+    "german" stemmer. For Elasticsearch, "german2" is now a deprecated alias for
+    "german". This may results in slightly different tokens being generated for
+    terms with umlaut substitution (like "ue" for "ü" etc...)
+  impact: >-
+    Replace usages of "german2" with "german" in analysis configuration. Old
+    indices that use the "german" stemmer should be reindexed if possible.
+  notable: false
+
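
The changelog entry for "german2" describes a deprecated alias that resolves to "german". A minimal sketch of that kind of alias resolution is shown here; the `StemmerAliasSketch` class and its methods are hypothetical names for this illustration and do not reflect how Elasticsearch implements the alias internally.

```java
import java.util.Map;

public class StemmerAliasSketch {
    // Deprecated alias mapped to its canonical name, as described in the changelog entry.
    private static final Map<String, String> DEPRECATED_ALIASES = Map.of("german2", "german");

    // Returns the canonical stemmer name, leaving non-alias names unchanged.
    static String canonicalName(String configured) {
        return DEPRECATED_ALIASES.getOrDefault(configured, configured);
    }

    // True when the configured name is a deprecated alias that should be replaced.
    static boolean isDeprecatedAlias(String configured) {
        return DEPRECATED_ALIASES.containsKey(configured);
    }

    public static void main(String[] args) {
        for (String name : new String[] { "german2", "german", "english" }) {
            System.out.println(name + " -> " + canonicalName(name)
                + (isDeprecatedAlias(name) ? " (deprecated alias)" : ""));
        }
    }
}
```

A check like `isDeprecatedAlias` could be run over existing analysis configuration to find the usages that the changelog recommends replacing.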

docs/changelog/114124.yaml

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+pr: 114124
+summary: The Korean dictionary for Nori has been updated
+area: Analysis
+type: breaking
+issues: []
+breaking:
+  title: The Korean dictionary for Nori has been updated
+  area: Analysis
+  details: >-
+    Lucene 10 ships with an updated Korean dictionary (mecab-ko-dic-2.1.1).
+    For details see https://github.com/apache/lucene/issues/11452. Users
+    experiencing changes in search behaviour on existing data are advised to
+    reindex.
+  impact: >-
+    The change is small and should generally provide better analysis results.
+    Existing indices for full-text use cases should be reindexed though.
+  notable: false
+

docs/changelog/114146.yaml

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+pr: 114146
+summary: Snowball stemmers have been upgraded
+area: Analysis
+type: breaking
+issues: []
+breaking:
+  title: Snowball stemmers have been upgraded
+  area: Analysis
+  details: >-
+    Lucene 10 ships with an upgrade of its Snowball stemmers.
+    For details see https://github.com/apache/lucene/issues/13209. Users using
+    Snowball stemmers that are experiencing changes in search behaviour on
+    existing data are advised to reindex.
+  impact: >-
+    The upgrade should generally provide improved stemming results. Small changes
+    in token analysis can lead to mismatches with previously index data, so
+    existing indices using Snowball stemmers as part of their analysis chain
+    should be reindexed.
+  notable: false
+
