feat(analysis): Add package-based Hunspell dictionary support with ref_path parameter and cache invalidation API #20741
shayush622 wants to merge 7 commits into opensearch-project:main
Conversation
PR Code Analyzer ❗AI-powered 'Code-Diff-Analyzer' found issues on commit 0e293c9.
Pull request overview
This PR adds support for loading Hunspell dictionaries from package-based directories using a new ref_path parameter, enabling multi-tenant dictionary isolation and hot-reload capabilities through a new cache management REST API.
Changes:
- Introduces package-based dictionary loading with ref_path parameter alongside traditional locale-based loading
- Adds REST API endpoints for Hunspell cache management (view, invalidate by package/key, invalidate all)
- Implements cache invalidation methods in HunspellService for hot-reload support
- Adds index.ref_path index-level setting with validation (though usage is unclear)
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| server/src/main/java/org/opensearch/indices/analysis/HunspellService.java | Adds getDictionaryFromPackage(), cache invalidation methods, and utility methods for package-based dictionary management |
| server/src/main/java/org/opensearch/rest/action/admin/indices/RestHunspellCacheInvalidateAction.java | New REST handler for cache management with GET/POST endpoints |
| server/src/main/java/org/opensearch/index/analysis/HunspellTokenFilterFactory.java | Updates to support ref_path parameter and hot-reload via updateable flag |
| server/src/main/java/org/opensearch/index/IndexSettings.java | Adds INDEX_REF_PATH_SETTING as dynamic index-level setting |
| server/src/main/java/org/opensearch/cluster/metadata/MetadataCreateIndexService.java | Adds validateRefPath() for index.ref_path validation |
| server/src/main/java/org/opensearch/node/Node.java | Registers RestHunspellCacheInvalidateAction |
| server/src/main/java/org/opensearch/indices/analysis/AnalysisModule.java | Makes getHunspellService() public and adds Environment parameter to HunspellTokenFilterFactory |
| server/src/main/java/org/opensearch/common/settings/IndexScopedSettings.java | Registers INDEX_REF_PATH_SETTING |
| server/src/test/java/org/opensearch/indices/analyze/HunspellServiceTests.java | Adds comprehensive tests for package-based dictionary loading and cache management |
| mise.toml | Adds development environment configuration (unrelated to PR) |
| if (!refPath.startsWith("packages/")) { | ||
| validationErrors.add("ref_path [" + refPath + "] must start with 'packages/'"); | ||
| return validationErrors; |
The ref_path validation expects the value to start with "packages/" (e.g., "packages/pkg-1234"), but HunspellTokenFilterFactory at line 87 expects ref_path to be just the package ID (e.g., "pkg-1234") without the "packages/" prefix. This creates an inconsistency where the validation will reject valid usage patterns.
Looking at HunspellService.loadDictionaryFromPackage (lines 199-201), it constructs the path as config/packages/{packageId}, meaning it expects just the package ID, not the full path.
The validation should either:
1. Accept just the package ID (e.g., "pkg-1234") to match the usage, or
2. Update HunspellTokenFilterFactory and HunspellService to expect the full path with "packages/" prefix
Option 1 is recommended for consistency with the implementation and to avoid confusion for users.
| // Add for hunspell invalidation cache testing: | ||
| HunspellService hunspellService = analysisModule.getHunspellService(); | ||
| restController.registerHandler(new RestHunspellCacheInvalidateAction(hunspellService)); | ||
REST handlers should typically be registered in ActionModule's constructor using the standard pattern with registerHandler.accept(), not directly in Node.java. This breaks the established pattern where all REST handlers are registered in one central location in ActionModule (around lines 850-1077 in ActionModule.java).
Consider moving this registration to ActionModule.java to maintain consistency with other REST handlers in the codebase. The HunspellService can be passed as a parameter to RestHunspellCacheInvalidateAction or retrieved from the AnalysisModule if needed.
| // Add for hunspell invalidation cache testing: | |
| HunspellService hunspellService = analysisModule.getHunspellService(); | |
| restController.registerHandler(new RestHunspellCacheInvalidateAction(hunspellService)); |
| Path packageDir = env.configDir() | ||
| .resolve("packages") | ||
| .resolve(packageId); | ||
| // Security check: ensure path is under config directory | ||
| if (!packageDir.normalize().startsWith(env.configDir().toAbsolutePath())) { | ||
| throw new IllegalArgumentException( | ||
| String.format(Locale.ROOT, "Package path must be under config directory. Package: [%s]", packageId) | ||
| ); | ||
| } | ||
| // Check if package directory exists | ||
| if (!Files.isDirectory(packageDir)) { | ||
| throw new OpenSearchException( | ||
| String.format(Locale.ROOT, | ||
| "Package directory not found: [%s]. Expected at: %s", | ||
| packageId, packageDir) | ||
| ); | ||
| } | ||
| // Auto-detect hunspell directory within package | ||
| Path hunspellDir = packageDir.resolve("hunspell"); | ||
| if (!Files.isDirectory(hunspellDir)) { | ||
| throw new OpenSearchException( | ||
| String.format(Locale.ROOT, | ||
| "Hunspell directory not found in package [%s]. " + | ||
| "Expected 'hunspell' subdirectory at: %s", | ||
| packageId, hunspellDir) | ||
| ); | ||
| } | ||
| // Resolve locale directory within hunspell | ||
| Path dicDir = hunspellDir.resolve(locale); |
The path traversal check at line 204 is not entirely correct. The check compares packageDir.normalize().startsWith(env.configDir().toAbsolutePath()), but packageDir is not converted to absolute path before normalization. This could allow path traversal attacks with inputs like "../../../etc" because:
- packageDir is relative (config/packages/../../../etc)
- After normalize(), it becomes a different relative path
- The startsWith check may fail to detect the traversal
The check should be: if (!packageDir.toAbsolutePath().normalize().startsWith(env.configDir().toAbsolutePath().normalize())) to ensure both paths are absolute and normalized before comparison.
Additionally, the locale parameter (line 231: dicDir = hunspellDir.resolve(locale)) is not validated for path traversal. A malicious locale value like "../../../etc/passwd" could escape the hunspell directory. Add similar validation for the locale parameter.
| Path packageDir = env.configDir() | |
| .resolve("packages") | |
| .resolve(packageId); | |
| // Security check: ensure path is under config directory | |
| if (!packageDir.normalize().startsWith(env.configDir().toAbsolutePath())) { | |
| throw new IllegalArgumentException( | |
| String.format(Locale.ROOT, "Package path must be under config directory. Package: [%s]", packageId) | |
| ); | |
| } | |
| // Check if package directory exists | |
| if (!Files.isDirectory(packageDir)) { | |
| throw new OpenSearchException( | |
| String.format(Locale.ROOT, | |
| "Package directory not found: [%s]. Expected at: %s", | |
| packageId, packageDir) | |
| ); | |
| } | |
| // Auto-detect hunspell directory within package | |
| Path hunspellDir = packageDir.resolve("hunspell"); | |
| if (!Files.isDirectory(hunspellDir)) { | |
| throw new OpenSearchException( | |
| String.format(Locale.ROOT, | |
| "Hunspell directory not found in package [%s]. " + | |
| "Expected 'hunspell' subdirectory at: %s", | |
| packageId, hunspellDir) | |
| ); | |
| } | |
| // Resolve locale directory within hunspell | |
| Path dicDir = hunspellDir.resolve(locale); | |
| Path configDir = env.configDir().toAbsolutePath().normalize(); | |
| Path packageDir = configDir | |
| .resolve("packages") | |
| .resolve(packageId); | |
| Path packageDirAbs = packageDir.toAbsolutePath().normalize(); | |
| // Security check: ensure path is under config directory | |
| if (!packageDirAbs.startsWith(configDir)) { | |
| throw new IllegalArgumentException( | |
| String.format(Locale.ROOT, "Package path must be under config directory. Package: [%s]", packageId) | |
| ); | |
| } | |
| // Check if package directory exists | |
| if (!Files.isDirectory(packageDirAbs)) { | |
| throw new OpenSearchException( | |
| String.format(Locale.ROOT, | |
| "Package directory not found: [%s]. Expected at: %s", | |
| packageId, packageDirAbs) | |
| ); | |
| } | |
| // Auto-detect hunspell directory within package | |
| Path hunspellDir = packageDirAbs.resolve("hunspell"); | |
| Path hunspellDirAbs = hunspellDir.toAbsolutePath().normalize(); | |
| if (!Files.isDirectory(hunspellDirAbs)) { | |
| throw new OpenSearchException( | |
| String.format(Locale.ROOT, | |
| "Hunspell directory not found in package [%s]. " + | |
| "Expected 'hunspell' subdirectory at: %s", | |
| packageId, hunspellDirAbs) | |
| ); | |
| } | |
| // Resolve locale directory within hunspell and validate against traversal | |
| Path dicDir = hunspellDirAbs.resolve(locale); | |
| Path dicDirAbs = dicDir.toAbsolutePath().normalize(); | |
| if (!dicDirAbs.startsWith(hunspellDirAbs)) { | |
| throw new IllegalArgumentException( | |
| String.format(Locale.ROOT, "Locale path must be under hunspell directory. Package: [%s], locale: [%s]", packageId, locale) | |
| ); | |
| } | |
| dicDir = dicDirAbs; |
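The containment check discussed in this thread can be isolated into a small helper. The following is only a sketch using `java.nio.file`; the class name `PathContainment` and the method `resolveUnder` are hypothetical and do not appear in the PR. It converts both paths to absolute, normalized form before comparing them.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the PR: illustrates the absolute + normalize comparison.
final class PathContainment {

    /** Resolves child under base and rejects any result that escapes base. */
    static Path resolveUnder(Path base, String child) {
        Path baseAbs = base.toAbsolutePath().normalize();
        // Normalizing after resolve collapses ".." segments so startsWith sees the real target.
        Path resolved = baseAbs.resolve(child).toAbsolutePath().normalize();
        if (!resolved.startsWith(baseAbs)) {
            throw new IllegalArgumentException("Path escapes base directory: [" + child + "]");
        }
        return resolved;
    }

    public static void main(String[] args) {
        Path base = Paths.get("config", "packages");
        System.out.println(resolveUnder(base, "pkg-1234"));
        try {
            resolveUnder(base, "../../etc");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Containment alone still accepts an input such as "foo/../bar", because it normalizes to a path inside the base directory, so this check is not a substitute for validating the identifier itself.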
| * | ||
| * @return count of invalidated cache entries | ||
| */ | ||
| public int invalidateAllDictionaries() { | ||
| int count = dictionaries.size(); | ||
| dictionaries.clear(); | ||
| logger.info("Invalidated all {} cached hunspell dictionaries", count); |
The invalidateAllDictionaries method has a race condition. Line 498 reads the size, then line 499 clears the map. Between these two operations, other threads could add or remove entries, making the returned count inaccurate. While this is a minor issue for a diagnostic return value, it could lead to confusion in logs.
Consider using int count = dictionaries.size(); dictionaries.clear(); return count; in a single atomic operation or accepting that the count may be approximate and documenting this behavior.
| * | |
| * @return count of invalidated cache entries | |
| */ | |
| public int invalidateAllDictionaries() { | |
| int count = dictionaries.size(); | |
| dictionaries.clear(); | |
| logger.info("Invalidated all {} cached hunspell dictionaries", count); | |
| * <p> | |
| * Note: The returned count is based on the cache size observed just before clearing and may be | |
| * approximate if other threads are concurrently adding or removing entries. | |
| * | |
| * @return approximate count of invalidated cache entries | |
| */ | |
| public int invalidateAllDictionaries() { | |
| int count = dictionaries.size(); | |
| dictionaries.clear(); | |
| logger.info( | |
| "Invalidated all cached hunspell dictionaries; previous observed cache size was {} (may be approximate due to concurrent updates)", | |
| count | |
| ); |
| /** | ||
| * The token filter factory for the hunspell analyzer | ||
| * * |
There's a double asterisk "* *" on line 46 which appears to be a formatting error in the Javadoc comment. Remove the extra asterisk to maintain proper documentation formatting.
| * * | |
| * |
| } | ||
| // Resolve locale directory within hunspell | ||
| Path dicDir = hunspellDir.resolve(locale); |
The locale parameter is resolved to a directory path (line 231) without validation for path traversal. A malicious locale value like "../../../etc/passwd" or "../../.." could escape the hunspell directory and access arbitrary files on the system.
Add validation after line 231 to ensure the resolved dicDir is still under the hunspellDir:
if (!dicDir.toAbsolutePath().normalize().startsWith(hunspellDir.toAbsolutePath().normalize())) {
throw new IllegalArgumentException(
String.format(Locale.ROOT, "Locale path must be under hunspell directory. Locale: [%s]", locale)
);
}
This prevents directory traversal attacks via the locale parameter.
| Path dicDir = hunspellDir.resolve(locale); | |
| Path dicDir = hunspellDir.resolve(locale); | |
| if (!dicDir.toAbsolutePath().normalize().startsWith(hunspellDir.toAbsolutePath().normalize())) { | |
| throw new IllegalArgumentException( | |
| String.format(Locale.ROOT, "Locale path must be under hunspell directory. Locale: [%s]", locale) | |
| ); | |
| } |
| * @throws IllegalArgumentException if packageId or locale is null | ||
| * @throws OpenSearchException if hunspell directory not found or dictionary cannot be loaded | ||
| */ | ||
| public Dictionary getDictionaryFromPackage(String packageId, String locale, Environment env) { |
The getDictionaryFromPackage method requires an Environment parameter, but HunspellService already has access to the environment from its constructor (used to resolve hunspellDir at line 125). This creates inconsistency:
- getDictionary() uses the environment from the constructor
- getDictionaryFromPackage() requires environment to be passed as a parameter
This inconsistency is error-prone because:
- Callers might pass a different Environment instance than the one used during construction
- It adds unnecessary complexity to the API
Consider removing the env parameter from getDictionaryFromPackage and using the environment from the constructor, similar to how getDictionary() works. Store the environment as a field during construction if needed.
- Add INDEX_REF_PATH_SETTING for package-based hunspell dictionaries
- Add RestHunspellCacheInvalidateAction for cache invalidation endpoint
- Update HunspellService with cache management methods
- Add ref_path validation in MetadataCreateIndexService
Signed-off-by: shayush622 <ayush5267@gmail.com>
PR Code Analyzer ❗AI-powered 'Code-Diff-Analyzer' found issues on commit 865aab5.
| List<String> getIndexSettingsValidationErrors(final Settings settings, final boolean forbidPrivateIndexSettings, String indexName) { | ||
| List<String> validationErrors = getIndexSettingsValidationErrors(settings, forbidPrivateIndexSettings, Optional.of(indexName)); | ||
| validationErrors.addAll(validateRefPath(settings, env.configDir())); |
nit: can we move this within getIndexSettingsValidationErrors to keep all validations in a single place?
| // Get both ref_path and locale parameters | ||
| String refPath = settings.get("ref_path"); // Package ID only (optional) | ||
| String locale = settings.get("locale", settings.get("language", settings.get("lang", null))); | ||
| if (locale == null) { |
Why only check this if ref_path is provided? Is this no longer always required?
I covered that scenario below by rewriting the conditions. We first check for ref_path (an additional parameter); if it is present, we then check whether locale is present.
In the else branch, where ref_path is absent, we fall back to the original locale check.
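The branching described in this reply might look roughly like the following. This is a sketch under assumptions: it uses a plain `Map` instead of OpenSearch `Settings`, and the class name `HunspellParams`, the error messages, and the returned strings are hypothetical. Only the `ref_path` key and the `locale` / `language` / `lang` fallback chain come from the diff.

```java
import java.util.Map;

// Hypothetical model of the parameter check; a Map stands in for the filter settings.
final class HunspellParams {

    /** Mirrors the fallback chain from the diff: locale, then language, then lang. */
    static String resolveLocale(Map<String, String> settings) {
        String locale = settings.get("locale");
        if (locale == null) {
            locale = settings.get("language");
        }
        if (locale == null) {
            locale = settings.get("lang");
        }
        return locale;
    }

    /** Returns a label for the dictionary source; the label format is illustrative only. */
    static String describeSource(Map<String, String> settings) {
        String refPath = settings.get("ref_path");
        String locale = resolveLocale(settings);
        if (refPath != null) {
            // Package-based loading: locale is still required alongside ref_path.
            if (locale == null) {
                throw new IllegalArgumentException("locale is required when ref_path is set");
            }
            return "package:" + refPath + "/" + locale;
        }
        // No ref_path: fall back to the original locale-only check.
        if (locale == null) {
            throw new IllegalArgumentException("missing locale, language, or lang setting");
        }
        return "locale:" + locale;
    }
}
```

In this reading, locale is required in both branches and ref_path only changes where the dictionary is loaded from.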
Pull request overview
Copilot reviewed 20 out of 22 changed files in this pull request and generated 2 comments.
| protected Set<String> responseParams() { | ||
| Set<String> params = new HashSet<>(); | ||
| params.add("package_id"); | ||
| params.add("cache_key"); | ||
| params.add("locale"); | ||
| return unmodifiableSet(params); | ||
| } | ||
| @Override |
responseParams() is meant for response-format parameters (e.g., filter_path, human), not for request routing params. Including package_id/locale/cache_key here can mask unconsumed parameters (e.g., /_hunspell/cache/_invalidate_all?package_id=... would silently ignore package_id instead of failing strict param checks). Drop this override (or only include real response params) so invalid/unconsumed request parameters are rejected.
| protected Set<String> responseParams() { | |
| Set<String> params = new HashSet<>(); | |
| params.add("package_id"); | |
| params.add("cache_key"); | |
| params.add("locale"); | |
| return unmodifiableSet(params); | |
| } | |
| @Override |
| // Additional check: ensure the resolved package directory is exactly one level under packages/ | ||
| // This prevents packageId=".." or "foo/../bar" from escaping | ||
| if (!packageDirAbsolute.getParent().equals(packagesBaseDirAbsolute)) { | ||
| throw new IllegalArgumentException( | ||
| String.format(Locale.ROOT, | ||
| "Invalid package ID: [%s]. Package ID cannot contain path traversal sequences.", packageId) | ||
| ); | ||
| } |
loadDictionaryFromPackage claims this check prevents packageId=".." or "foo/../bar" (line comment), but toAbsolutePath().normalize() will turn foo/../bar into bar, so the getParent().equals(...) check still passes and path segments are effectively accepted. If packageId is intended to be an ID (single path segment), add explicit validation rejecting path separators/traversal (and consider the same for locale) rather than relying on normalization + parent checks; also update the misleading comment.
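Explicit validation of the identifier, as suggested here, could be sketched as below. The class `PackageIdValidator` and the method `requireSingleSegment` are hypothetical names. The rejected set ("." and ".." as whole values, plus '/', '\', ':' and the null character) is an assumption modeled on the characters listed in the PR's commit messages.

```java
// Hypothetical validator, not the PR's code: rejects anything that is not a single path segment.
final class PackageIdValidator {

    /** Returns value unchanged if it is a plain single-segment name, otherwise throws. */
    static String requireSingleSegment(String name, String value) {
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException(name + " must not be null or empty");
        }
        if (value.equals(".") || value.equals("..")) {
            throw new IllegalArgumentException(name + " must not be a relative path segment: [" + value + "]");
        }
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            // Separators, drive/scheme colons, and null bytes never belong in an ID.
            if (c == '/' || c == '\\' || c == ':' || c == '\0') {
                throw new IllegalArgumentException(name + " contains an illegal character: [" + value + "]");
            }
        }
        return value;
    }
}
```

With this check in front of path resolution, "foo/../bar" is rejected because of the separator before normalization can turn it into "bar". The same method could be applied to locale.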
❌ Gradle check result for e84b5bc: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
- Add ref_path parameter for package-based dictionary loading
- Load from config/packages/{packageId}/hunspell/{locale}/
- Add cache invalidation REST API (GET/POST /_hunspell/cache/_invalidate)
- Add TransportAction with cluster:admin permission
- Add comprehensive security validation (path traversal, separators, cache-key injection)
- Add updateable flag for hot-reload via _reload_search_analyzers
- Add comprehensive test coverage
PR feedback addressed:
- Stricter validate() to reject conflicting params
- Path traversal checks now use config/packages/ as base
- ref_path/locale validation rejects ., .., /, \, : characters
Signed-off-by: shayush622 <ayush5267@gmail.com>
Persistent review updated to latest commit 3926f7a
❌ Gradle check result for 3926f7a: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Persistent review updated to latest commit 87eab2c
❌ Gradle check result for 87eab2c: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: shayush622 <ayush5267@gmail.com>
Persistent review updated to latest commit 1e9fbd5
- Add ref_path parameter for package-based dictionary loading
- Load from config/packages/{packageId}/hunspell/{locale}/
- Add cache info API: GET /_hunspell/cache (cluster:monitor/hunspell/cache)
- Add cache invalidation API: POST /_hunspell/cache/_invalidate (cluster:admin/hunspell/cache/invalidate)
- Support invalidation by package_id, locale, cache_key, or invalidate_all
- Add security validation (path traversal, separator injection, null bytes)
- Add updateable flag for hot-reload via _reload_search_analyzers
- Use Strings.hasText() and Strings.isNullOrEmpty() for validation consistency
- Consistent response schema with all fields always present
- Add unit tests, REST handler tests, and integration tests
Signed-off-by: shayush622 <ayush5267@gmail.com>
Persistent review updated to latest commit 20ba2c0
❌ Gradle check result for 20ba2c0: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Superseded by #20792 — recreated with clean history |
Description
This PR adds support for loading Hunspell dictionaries from package-based directories using a new ref_path parameter, enabling multi-tenant dictionary isolation and hot-reload capabilities.
Related Issues
Resolves #[20712]
Link to RFC.
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.