-
Notifications
You must be signed in to change notification settings - Fork 2.5k
feat(analysis): Add package-based Hunspell dictionary support with ref_path parameter and cache invalidation API #20741
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
865aab5
1be4dde
c114694
9361d76
3926f7a
c1dd00b
20ba2c0
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| [tools] | ||
| java = "corretto-21" | ||
|
||
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -117,6 +117,7 @@ | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| import java.io.IOException; | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| import java.io.UnsupportedEncodingException; | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| import java.nio.file.Files; | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| import java.nio.file.Files; |
Copilot
AI
Mar 3, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR description mentions adding ref_path validation in MetadataCreateIndexService, but in this diff the only change appears to be adding an unused java.nio.file.Files import (no ref_path-related validation code exists in this class, and ref_path isn’t referenced anywhere in server/src/main/java). Either include the intended validation logic here or update the PR description/scope.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we move this within getIndexSettingsValidationErrors to keep all validations in a single place?
Outdated
Copilot
AI
Feb 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ref_path validation expects the value to start with "packages/" (e.g., "packages/pkg-1234"), but HunspellTokenFilterFactory at line 87 expects ref_path to be just the package ID (e.g., "pkg-1234") without the "packages/" prefix. This creates an inconsistency where the validation will reject valid usage patterns.
Looking at HunspellService.loadDictionaryFromPackage (lines 199-201), it constructs the path as config/packages/{packageId}, meaning it expects just the package ID, not the full path.
The validation should either:
- Accept just the package ID (e.g., "pkg-1234") to match the usage, or
- Update HunspellTokenFilterFactory and HunspellService to expect the full path with "packages/" prefix
Option 1 is recommended for consistency with the implementation and to avoid confusion for users.
Outdated
Copilot
AI
Feb 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The validateRefPath method validates the format of index.ref_path at index creation time, but there's no code that actually uses this index-level setting. The HunspellTokenFilterFactory reads ref_path from the analyzer settings (line 87 in HunspellTokenFilterFactory.java), not from indexSettings.getRefPath().
This means the index.ref_path setting and its validation are currently unused and serve no purpose. Either:
- Remove the unused index-level setting and validation if ref_path is only meant to be an analyzer-level parameter, or
- Update HunspellTokenFilterFactory to use indexSettings.getRefPath() as the default if no analyzer-level ref_path is specified
The current implementation creates confusion about whether ref_path is an index-level setting or an analyzer-level parameter.
| /** | |
| * Validates the ref_path setting if present. | |
| * Checks that the path format is valid and the directory exists. | |
| * | |
| * @param settings the index settings | |
| * @param configDir the config directory path | |
| * @return a list containing validation errors or an empty list if valid | |
| */ | |
| private List<String> validateRefPath(Settings settings, Path configDir) { | |
| List<String> validationErrors = new ArrayList<>(); | |
| String refPath = settings.get(IndexSettings.INDEX_REF_PATH_SETTING.getKey()); | |
| if (refPath != null && !refPath.isEmpty()) { | |
| try { | |
| // Validate format: should be in packages/<package_id> format | |
| if (!refPath.startsWith("packages/")) { | |
| validationErrors.add("ref_path [" + refPath + "] must start with 'packages/'"); | |
| return validationErrors; | |
| } | |
| // Resolve and check if path exists | |
| Path resolvedPath = configDir.resolve(refPath).normalize(); | |
| if (!Files.isDirectory(resolvedPath)) { | |
| validationErrors.add("ref_path [" + refPath + "] does not exist or is not a directory"); | |
| } | |
| } catch (Exception e) { | |
| validationErrors.add("invalid ref_path [" + refPath + "]: " + e.getMessage()); | |
| } | |
| } | |
| return validationErrors; | |
| } |
Outdated
Copilot
AI
Feb 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing test coverage for the validateRefPath method. There are no tests verifying that:
- Valid ref_path values (e.g., "packages/pkg-1234") are accepted
- Invalid ref_path values (e.g., "pkg-1234" without "packages/" prefix) are rejected
- Non-existent package directories are rejected
- Path traversal attempts are blocked
Add tests to MetadataCreateIndexServiceTests.java to ensure this validation works correctly.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -917,6 +917,13 @@ public static IndexMergePolicy fromString(String text) { | |
| Property.Dynamic | ||
| ); | ||
|
|
||
| public static final Setting<String> INDEX_REF_PATH_SETTING = Setting.simpleString( | ||
| "index.ref_path", | ||
|
||
| "", | ||
| Property.IndexScope, | ||
| Property.Dynamic | ||
| ); | ||
|
|
||
| private final Index index; | ||
| private final Version version; | ||
| private final Logger logger; | ||
|
|
@@ -974,6 +981,7 @@ public static IndexMergePolicy fromString(String text) { | |
| private volatile boolean allowDerivedField; | ||
| private final boolean derivedSourceEnabled; | ||
| private volatile boolean derivedSourceEnabledForTranslog; | ||
| private volatile String refPath; | ||
|
|
||
| /** | ||
| * The maximum age of a retention lease before it is considered expired. | ||
|
|
@@ -1168,6 +1176,7 @@ public IndexSettings(final IndexMetadata indexMetadata, final Settings nodeSetti | |
| this.defaultAllowUnmappedFields = scopedSettings.get(ALLOW_UNMAPPED); | ||
| this.allowDerivedField = scopedSettings.get(ALLOW_DERIVED_FIELDS); | ||
| this.durability = scopedSettings.get(INDEX_TRANSLOG_DURABILITY_SETTING); | ||
| this.refPath = scopedSettings.get(INDEX_REF_PATH_SETTING); | ||
| this.translogReadForward = INDEX_TRANSLOG_READ_FORWARD_SETTING.get(settings); | ||
| defaultFields = scopedSettings.get(DEFAULT_FIELD_SETTING); | ||
| syncInterval = INDEX_TRANSLOG_SYNC_INTERVAL_SETTING.get(settings); | ||
|
|
@@ -1381,6 +1390,7 @@ public IndexSettings(final IndexMetadata indexMetadata, final Settings nodeSetti | |
| this::setRemoteStoreTranslogRepository | ||
| ); | ||
| scopedSettings.addSettingsUpdateConsumer(StarTreeIndexSettings.STAR_TREE_SEARCH_ENABLED_SETTING, this::setStarTreeIndexEnabled); | ||
| scopedSettings.addSettingsUpdateConsumer(INDEX_REF_PATH_SETTING, this::setRefPath); | ||
| } | ||
|
|
||
| private void setSearchIdleAfter(TimeValue searchIdleAfter) { | ||
|
|
@@ -2002,6 +2012,14 @@ public boolean getStarTreeIndexEnabled() { | |
| return isStarTreeIndexEnabled; | ||
| } | ||
|
|
||
| private void setRefPath(String refPath){ | ||
| this.refPath = refPath; | ||
| } | ||
|
|
||
| public String getRefPath(){ | ||
| return refPath; | ||
| } | ||
|
|
||
| /** | ||
| * Returns the merge policy that should be used for this index. | ||
| * | ||
|
|
||
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -35,13 +35,38 @@ | |||||||||||||||||||||||||||||||||||||||||||
| import org.apache.lucene.analysis.hunspell.Dictionary; | ||||||||||||||||||||||||||||||||||||||||||||
| import org.apache.lucene.analysis.hunspell.HunspellStemFilter; | ||||||||||||||||||||||||||||||||||||||||||||
| import org.opensearch.common.settings.Settings; | ||||||||||||||||||||||||||||||||||||||||||||
| import org.opensearch.env.Environment; | ||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||
| import org.opensearch.env.Environment; |
Outdated
Copilot
AI
Feb 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a double asterisk "* *" on line 46 which appears to be a formatting error in the Javadoc comment. Remove the extra asterisk to maintain proper documentation formatting.
| * * | |
| * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why only check this if ref_path is provided? Is this no longer always required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Covered that scenario below, re-written the conditions; first we are checking for ref_path (an additional parameter) if it is present -> then we go for checking locale is present or not.
In the else part if ref-path is not there, we fall back to the original check of locale
Outdated
Copilot
AI
Mar 3, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ref_path validation is currently limited to refPath.contains("/"). This still allows path traversal / cache-key injection cases like ref_path=".." (escapes config/packages/), Windows separators ("\\"), and values containing the cache separator ":" (can break package isolation/invalidation). Tighten validation to reject . / .., any path separators, and the cache-key separator, and ensure the resolved path stays under config/packages/<packageId>.
| // Validate ref_path is just package ID (no slashes allowed) | |
| if (refPath.contains("/")) { | |
| throw new IllegalArgumentException( | |
| String.format(Locale.ROOT, | |
| "ref_path should contain only the package ID, not a full path. Got: [%s]. " + | |
| "Use ref_path for package ID and locale for the dictionary locale.", | |
| refPath) | |
| // Validate ref_path is just a package ID (no path traversal or cache-key separators) | |
| if (".".equals(refPath) || "..".equals(refPath) | |
| || refPath.indexOf('/') != -1 | |
| || refPath.indexOf('\\') != -1 | |
| || refPath.indexOf(':') != -1) { | |
| throw new IllegalArgumentException( | |
| String.format( | |
| Locale.ROOT, | |
| "ref_path should contain only a package ID without path or cache separators. Got: [%s]. " + | |
| "Use ref_path for package ID and locale for the dictionary locale.", | |
| refPath | |
| ) |
Outdated
Copilot
AI
Mar 3, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the traditional (no ref_path) branch, locale is used directly as both a cache key and a filesystem path segment, but it is not validated. With the new package-key separator logic (':'), allowing ':' or path separators in locale can lead to ambiguous cache keys and potential path traversal. Consider applying the same identifier validation to locale in the traditional branch as well (or otherwise rejecting ':' and path separators).
Outdated
Copilot
AI
Feb 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing test coverage for the new ref_path parameter in HunspellTokenFilterFactory. The existing tests in HunspellTokenFilterFactoryTests.java only cover the traditional locale-based loading, not the package-based loading with ref_path.
Add tests to verify:
- Loading dictionaries with ref_path and locale parameters
- Validation that ref_path without locale throws IllegalArgumentException
- Validation that ref_path containing "/" throws IllegalArgumentException
- Error handling when package directory doesn't exist
- Integration with the updateable flag for hot-reload support
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^ let's address this comment
Uh oh!
There was an error while loading. Please reload this page.