-
Notifications
You must be signed in to change notification settings - Fork 25.5k
Conditional stemming for 'persian' analyzer #113421
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
/* | ||
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one | ||
* or more contributor license agreements. Licensed under the "Elastic License | ||
* 2.0", the "GNU Affero General Public License v3.0 only", and the "Server Side | ||
* Public License v 1"; you may not use this file except in compliance with, at | ||
* your election, the "Elastic License 2.0", the "GNU Affero General Public | ||
* License v3.0 only", or the "Server Side Public License, v 1". | ||
*/ | ||
|
||
package org.elasticsearch.analysis.common; | ||
|
||
import org.apache.lucene.analysis.Analyzer; | ||
import org.elasticsearch.cluster.metadata.IndexMetadata; | ||
import org.elasticsearch.common.settings.Settings; | ||
import org.elasticsearch.env.Environment; | ||
import org.elasticsearch.index.IndexSettings; | ||
import org.elasticsearch.index.IndexVersion; | ||
import org.elasticsearch.index.IndexVersions; | ||
import org.elasticsearch.test.ESTestCase; | ||
import org.elasticsearch.test.ESTokenStreamTestCase; | ||
import org.elasticsearch.test.IndexSettingsModule; | ||
import org.elasticsearch.test.index.IndexVersionUtils; | ||
|
||
import java.io.IOException; | ||
|
||
/** | ||
* Tests Persian Analyzer factory and behavioural changes with Lucene 10 | ||
*/ | ||
public class PersianAnalyzerProviderTests extends ESTokenStreamTestCase { | ||
|
||
public void testPersianAnalyzerPostLucene10() throws IOException { | ||
IndexVersion postLucene10Version = IndexVersionUtils.randomVersionBetween( | ||
random(), | ||
IndexVersions.UPGRADE_TO_LUCENE_10_0_0, | ||
IndexVersion.current() | ||
); | ||
Settings settings = ESTestCase.indexSettings(1, 1) | ||
.put(Environment.PATH_HOME_SETTING.getKey(), createTempDir().toString()) | ||
.put(IndexMetadata.SETTING_VERSION_CREATED, postLucene10Version) | ||
.build(); | ||
IndexSettings idxSettings = IndexSettingsModule.newIndexSettings("index", settings); | ||
Environment environment = new Environment(settings, null); | ||
|
||
PersianAnalyzerProvider persianAnalyzerProvider = new PersianAnalyzerProvider( | ||
idxSettings, | ||
environment, | ||
"my-analyzer", | ||
Settings.EMPTY | ||
); | ||
Analyzer analyzer = persianAnalyzerProvider.get(); | ||
assertAnalyzesTo(analyzer, " من کتاب های زیادی خوانده ام.", new String[] { "كتاب", "زياد", "خوانده" }); | ||
} | ||
|
||
public void testPersianAnalyzerPreLucene10() throws IOException { | ||
IndexVersion preLucene10Version = IndexVersionUtils.randomVersionBetween( | ||
random(), | ||
IndexVersionUtils.getFirstVersion(), | ||
IndexVersionUtils.getPreviousVersion(IndexVersions.UPGRADE_TO_LUCENE_10_0_0) | ||
); | ||
Settings settings = ESTestCase.indexSettings(1, 1) | ||
.put(Environment.PATH_HOME_SETTING.getKey(), createTempDir().toString()) | ||
.put(IndexMetadata.SETTING_VERSION_CREATED, preLucene10Version) | ||
.build(); | ||
IndexSettings idxSettings = IndexSettingsModule.newIndexSettings("index", settings); | ||
Environment environment = new Environment(settings, null); | ||
|
||
PersianAnalyzerProvider persianAnalyzerProvider = new PersianAnalyzerProvider( | ||
idxSettings, | ||
environment, | ||
"my-analyzer", | ||
Settings.EMPTY | ||
); | ||
Analyzer analyzer = persianAnalyzerProvider.get(); | ||
assertAnalyzesTo(analyzer, " من کتاب های زیادی خوانده ام.", new String[] { "كتاب", "زيادي", "خوانده" }); | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -901,6 +901,16 @@ | |
- length: { tokens: 1 } | ||
- match: { tokens.0.token: خورد } | ||
|
||
- do: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this to test stemming? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Then this test will need to be guarded with skip version or something. ES 9 will have mixed cluster tests with 8.last & 9.current and the 8.last won't have the stemming automatically correct? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would prefer a new test that is guarded, that way the original test isn't always skipped. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. All that would have made sense to me a few months ago. Now we live in a world where we merge Lucene 10 to "main" at some point which not necessarily is the point in which it becomes 9.0. So this is becoming a bit of a head-scratcher for me. I need to figure out if we can skip based on IndexVersion or not (since that is what we condition the behavioural change on), if we need some new sort of capability voodoo for that etc... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. fwiw I'm afraid I might have to introduce a "cluster_feature" for this. Maybe it makes sense to have one for "Cluster runs with Lucene 10". |
||
indices.analyze: | ||
index: test | ||
body: | ||
text: كتابها | ||
analyzer: my_analyzer | ||
- length: { tokens: 1 } | ||
- match: { tokens.0.token: كتاب } | ||
|
||
|
||
--- | ||
"portuguese": | ||
- do: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
# integration tests for persian analyzer changes from Lucene 9 to Lucene 10 | ||
setup: | ||
Comment on lines
+1
to
+2
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If you want to test with old data, then upgrade, then verify the query results don't change, a rolling upgrade test, or one of the full restart tests. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes I was afraid I'd need that full-blown infra and was somehow hoping I could leverage some yaml test. This at least shows that stemming works in the new version of the analyzer, i.e. both search terms are matching both documents which means they are analyzed to the same root form. |
||
- do: | ||
indices.create: | ||
index: test | ||
body: | ||
mappings: | ||
properties: | ||
text: | ||
type: text | ||
analyzer: persian | ||
|
||
--- | ||
"persian search": | ||
- do: | ||
bulk: | ||
refresh: true | ||
body: | ||
- '{"index": {"_index": "test", "_id": "1"}}' | ||
- '{"text" : "كتابها"}' | ||
- '{"index": {"_index": "test", "_id": "2"}}' | ||
- '{"text" : "كتاب"}' | ||
|
||
- do: | ||
search: | ||
rest_total_hits_as_int: true | ||
index: test | ||
body: | ||
query: | ||
match: | ||
text: | ||
query: كتابها | ||
- match: { hits.total: 2 } | ||
|
||
- do: | ||
search: | ||
rest_total_hits_as_int: true | ||
index: test | ||
body: | ||
query: | ||
match: | ||
text: | ||
query: كتاب | ||
- match: { hits.total: 2 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"كتاب", "زياد", "خوان"
would make the most sense to me. That is, compared to the one on L74, [1] is correct here, and in both [2] is the same and correct, and what is mentioned as stem in both of them in [0], seems not to be correct to me.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, thanks. Token 0 is the output in both cases though and might be a general issues with the "persian" analyzer in both versions though. The goal here is not to test the quality of the analyzer per-se but the changes between the output in Lucene 9 and Lucene 10. I was mostly interested in the input sentence and if that makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it makes sense, and is the same in both: I have read a lot of books. The dot is at the wrong end, I'm assuming right-to-left issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll probably remove the dots, already had trouble copy/pasting this since I'm not used to right-to-left languages unfortunately.