Description
Is your feature request related to a problem? Please describe
The Problem
Currently, the Hunspell token filter supports only the locale parameter, which loads dictionaries exclusively from config/hunspell/{locale}/. HunspellService caches dictionaries in a ConcurrentHashMap. Unlike synonyms (which reload from disk on every analyzer refresh), Hunspell always returns cached dictionaries. This creates the following issues:
- Single Dictionary Per Locale: Only one dictionary can exist for each locale name (e.g., en_US). If two different applications or tenants need different dictionaries for the same locale/language, this is not possible.
- Caching Conflict Risk: The current caching mechanism uses only the locale name as the cache key, preventing true isolation of dictionaries with the same locale name.
- No Hot-Reload Support: Once a dictionary is loaded into the cache (ConcurrentHashMap), it persists until node restart. Updating dictionary files on disk is not reflected in the running cluster without a full restart. This prevents dynamic dictionary updates for package re-associations or dictionary version changes.
Example of the Limitation:
# Current state - only one en_US dictionary possible
config/hunspell/en_US/
├── en_US.dic
└── en_US.aff
# Desired state - multiple en_US dictionaries for different packages
config/packages/
├── pkg-123/hunspell/en_US/
├── pkg-456/hunspell/en_US/
└── pkg-789/hunspell/en_US/

Describe the solution you'd like
Add a new ref_path parameter to the Hunspell token filter that allows specifying a package-based path to load dictionaries from, with full cache key isolation.
New Parameters: ref_path + locale (separate parameters)
- ref_path: Package ID only (e.g., "pkg-1234")
- locale: Dictionary locale (e.g., "en_US"); REQUIRED when using ref_path
- Format: {package-id}
- Resolves to: config/packages/{ref_path}/hunspell/{locale}/
- Cache Key: ref_path + ":" + locale ("{packageId}:{locale}")
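The path resolution and cache-key rules above can be sketched as follows. This is a minimal illustration only; the class and method names (`HunspellRefPathSketch`, `resolveDictionaryPath`) are hypothetical, not the actual OpenSearch code:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch of the proposed ref_path resolution and cache-key scheme.
public final class HunspellRefPathSketch {

    // Cache key format: "{packageId}:{locale}", e.g. "pkg-1234:en_US"
    public static String buildPackageCacheKey(String packageId, String locale) {
        return packageId + ":" + locale;
    }

    // Resolves to config/packages/{ref_path}/hunspell/{locale}/
    public static Path resolveDictionaryPath(Path configDir, String refPath, String locale) {
        return configDir.resolve("packages").resolve(refPath).resolve("hunspell").resolve(locale);
    }

    public static void main(String[] args) {
        System.out.println(buildPackageCacheKey("pkg-1234", "en_US"));
        System.out.println(resolveDictionaryPath(Paths.get("config"), "pkg-1234", "en_US"));
    }
}
```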
Usage Example
PUT /index_123
{
"settings": {
"analysis": {
"analyzer": {
"abc_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "abc_hunspell"]
}
},
"filter": {
"abc_hunspell": {
"type": "hunspell",
"ref_path": "pkg-123",
"locale": "en_US",
"dedup": true,
"updateable": true // Optional: enables hot-reload via _reload_search_analyzers
}
}
}
}
}

PUT /index_456
{
"settings": {
"analysis": {
"analyzer": {
"xyz_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "xyz_hunspell"]
}
},
"filter": {
"xyz_hunspell": {
"type": "hunspell",
"ref_path": "pkg-456",
"locale": "en_US",
"dedup": true,
"updateable": true
}
}
}
}
}

Caching Strategy:
// Cache key is "{packageId}:{locale}" format
String cacheKey = refPath + ":" + locale; // e.g., "pkg-1234:en_US"
return dictionaries.computeIfAbsent(cacheKey, (key) -> {
    return loadDictionaryFromPackage(packageId, locale, env);
});

Backward Compatibility
- Existing behavior unchanged: locale/language/lang parameters continue to work as before
- Priority: ref_path takes precedence if specified alongside locale
- No breaking changes: All existing index configurations remain valid
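The precedence rule can be sketched as a simple key-selection helper (illustrative only; `chooseCacheKey` is a hypothetical name, not part of the proposed code):

```java
// Hypothetical sketch of the parameter-precedence rule: ref_path wins when set,
// otherwise the traditional locale-only behavior applies.
public final class HunspellKeyPrecedenceSketch {

    // Returns the cache key the filter would use for the given settings.
    public static String chooseCacheKey(String refPath, String locale) {
        if (refPath != null && !refPath.isEmpty()) {
            // ref_path takes precedence: package-scoped key "{packageId}:{locale}"
            return refPath + ":" + locale;
        }
        // Traditional behavior: the locale alone is the cache key
        return locale;
    }

    public static void main(String[] args) {
        System.out.println(chooseCacheKey("pkg-123", "en_US")); // pkg-123:en_US
        System.out.println(chooseCacheKey(null, "en_US"));      // en_US
    }
}
```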
Phase 2: Hot-Reload / Cache Invalidation Support
Extend the existing _reload_search_analyzers transport action to support Hunspell cache invalidation as an internal operation.
OpenSearch Core Changes:
HunspellService.java - Added methods:
- getDictionaryFromPackage(String packageId, String locale, Environment env)
- buildPackageCacheKey(String packageId, String locale) - returns "{packageId}:{locale}"
- isPackageCacheKey(String cacheKey) - checks if key contains ":"
- invalidateDictionary(String cacheKey)
- invalidateDictionariesByPackage(String packageId) - removes all keys starting with "{packageId}:"
- invalidateAllDictionaries()
- reloadDictionaryFromPackage(String packageId, String locale, Environment env)
- getCachedDictionaryKeys() - for diagnostics
- getCachedDictionaryCount()
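The invalidation helpers listed above could look roughly like the sketch below. This is an assumption-laden illustration over a plain ConcurrentHashMap; the real cache holds Lucene Dictionary objects, which are stubbed as String here:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the proposed HunspellService invalidation helpers.
// "Dictionary" is stubbed as String; real code would hold Lucene Dictionary objects.
public final class HunspellCacheInvalidationSketch {
    private final ConcurrentHashMap<String, String> dictionaries = new ConcurrentHashMap<>();

    public void put(String key, String dict) { dictionaries.put(key, dict); }

    // Package-based keys contain ':'; traditional locale-only keys do not.
    public boolean isPackageCacheKey(String cacheKey) { return cacheKey.contains(":"); }

    public void invalidateDictionary(String cacheKey) { dictionaries.remove(cacheKey); }

    // Removes every key starting with "{packageId}:"
    public void invalidateDictionariesByPackage(String packageId) {
        dictionaries.keySet().removeIf(k -> k.startsWith(packageId + ":"));
    }

    public void invalidateAllDictionaries() { dictionaries.clear(); }

    public Set<String> getCachedDictionaryKeys() { return dictionaries.keySet(); }

    public int getCachedDictionaryCount() { return dictionaries.size(); }
}
```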
HunspellTokenFilterFactory.java - Updated:
- Accepts ref_path (package ID) + locale (separate params)
- Validates ref_path contains no slashes (package ID only)
- Supports "updateable" flag for hot-reload via _reload_search_analyzers
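The "no slashes" validation matters for safety: since ref_path is interpolated into a filesystem path under config/packages/, separators or ".." segments could escape that directory. A minimal sketch of such a check (hypothetical helper, not the actual factory code):

```java
// Hypothetical sketch of ref_path validation: the value must be a bare package ID,
// so path separators and parent-directory segments (possible traversal out of
// config/packages/) are rejected.
public final class RefPathValidationSketch {

    public static String validateRefPath(String refPath) {
        if (refPath == null || refPath.isEmpty()) {
            throw new IllegalArgumentException("ref_path must not be empty");
        }
        if (refPath.contains("/") || refPath.contains("\\") || refPath.contains("..")) {
            throw new IllegalArgumentException(
                "ref_path must be a package ID with no path separators: " + refPath);
        }
        return refPath;
    }

    public static void main(String[] args) {
        System.out.println(validateRefPath("pkg-1234")); // accepted
    }
}
```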
NEW: RestHunspellCacheInvalidateAction.java - Added:
- GET /_hunspell/cache - view cache info
- POST /_hunspell/cache/_invalidate - invalidate by package/key
- POST /_hunspell/cache/_invalidate_all - clear all
Node.java - Updated to register new REST handler
Hot-Reload Workflow:
- Deploy new dictionary files to node
- Call POST /_hunspell/cache/_invalidate?package_id=pkg-1234
- Call POST /{index}/_reload_search_analyzers
- Fresh dictionary is loaded from disk on the next analyzer access
New REST API: /_hunspell/cache
| Endpoint | Method | Description |
|---|---|---|
| /_hunspell/cache | GET | View cached keys |
| /_hunspell/cache/_invalidate?package_id=X | POST | Invalidate by package |
| /_hunspell/cache/_invalidate?package_id=X&locale=Y | POST | Invalidate specific |
| /_hunspell/cache/_invalidate?cache_key=X | POST | Invalidate by key |
| /_hunspell/cache/_invalidate_all | POST | Clear all |
Design Decisions & Discussion Notes
Caching Strategy: Node-Level Cache
Per discussion with @cwperks, a node-level cache (the existing behavior) is preferable to an index-level cache:
- Rationale: Index-level caching would cache duplicate Dictionary objects, resulting in redundant memory overhead. A node-level cache keeps the Dictionary object in memory so it can be shared across all indices using the same package and locale.
- Implementation: ConcurrentHashMap<String, Dictionary> in HunspellService (singleton per node). Cache key is {packageId}:{locale} for package-based dictionaries, {locale} for traditional dictionaries.
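The sharing benefit can be demonstrated with a small sketch: two indices resolving the same key trigger only one load. The load counter and String stand-in for Dictionary are illustrative assumptions:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the node-level cache: computeIfAbsent guarantees the
// (expensive) disk load of a .dic/.aff pair happens at most once per cache key,
// after which all indices share the same in-memory object.
public final class NodeLevelCacheSketch {
    private final ConcurrentHashMap<String, String> dictionaries = new ConcurrentHashMap<>();
    public final AtomicInteger loads = new AtomicInteger();

    public String getDictionary(String cacheKey) {
        return dictionaries.computeIfAbsent(cacheKey, key -> {
            loads.incrementAndGet(); // stand-in for reading dictionary files from disk
            return "dictionary-for-" + key;
        });
    }
}
```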
Hot-Reload Strategy: Reuse _reload_search_analyzers (Recommended)
Craig recommends reusing the existing _reload_search_analyzers API rather than introducing a new cache invalidation endpoint:
- Proposed behavior: Every invocation of _reload_search_analyzers should invalidate the Hunspell node-level cache, re-read dictionary data from disk, and repopulate the in-memory data structure.
- Assumption: This operation is infrequent, essentially a one-off event triggered after dictionary files are updated on disk. Craig has requested usage data on how often this API is called in practice to validate this assumption.
- New endpoint deferred: Craig is open to a dedicated /_hunspell/cache/_invalidate endpoint if needed, but considers it potential over-engineering or premature optimization at this stage.
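The proposed invalidate-on-reload behavior can be sketched as follows. The class, the version counter standing in for file contents on disk, and the hook name are all hypothetical:

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of tying cache invalidation to _reload_search_analyzers:
// on reload the whole node-level cache is dropped, and dictionaries are re-read
// lazily from "disk" on the next analyzer access.
public final class ReloadSearchAnalyzersSketch {
    private final ConcurrentHashMap<String, String> dictionaries = new ConcurrentHashMap<>();
    private int version = 1; // stand-in for the dictionary file contents on disk

    public void updateFilesOnDisk() { version++; }

    // Called from the (proposed) _reload_search_analyzers hook.
    public void onReloadSearchAnalyzers() { dictionaries.clear(); }

    public String getDictionary(String key) {
        return dictionaries.computeIfAbsent(key, k -> k + "-v" + version);
    }
}
```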
Phased Implementation
Phase 1: ref_path support (PR #20840)
- ref_path parameter for package-based dictionary loading
- Node-level cache with multi-tenant isolation
- Backward compatible with the traditional config/hunspell/{locale}/ path

Phase 2: Hot-reload via _reload_search_analyzers (to be raised later after ongoing discussion)
- Integrate cache invalidation into MapperService.reloadSearchAnalyzers()
- Add updateable: true flag support for SEARCH_TIME analysis mode
- Pending: usage data on _reload_search_analyzers call frequency
Related component
Plugins
Describe alternatives you've considered
No response
Additional context
Local Testing Results
Test Environment:
- OpenSearch built from source with POC changes
- Two packages with the same locale (en_US) but different dictionaries
Directory Structure Created:
config/packages/
├── pkg-1234/hunspell/en_US/
│   ├── en_US.dic (technology terms: computer, program, code, etc.)
│   └── en_US.aff (affix rules: -s, -ed, -ing)
└── pkg-2345/hunspell/en_US/
    ├── en_US.dic (fruit terms: apple, banana, orange, etc.)
    └── en_US.aff (affix rules: -s, un-, -able, -ness)

Test 1: Create indexes with different packages
# Index using pkg-1234 (tech dictionary)
curl -X PUT "localhost:9200/index_pkg_1234" -H "Content-Type: application/json" -d '{
"settings": {
"analysis": {
"analyzer": {
"hunspell_analyzer_1234": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "hunspell_filter_1234"]
}
},
"filter": {
"hunspell_filter_1234": {
"type": "hunspell",
"ref_path": "pkg-1234",
"locale": "en_US",
"dedup": true
}
}
}
}
}'
# Result: Index created successfully
# Index using pkg-2345 (fruit dictionary)
curl -X PUT "localhost:9200/index_pkg_2345" -H "Content-Type: application/json" -d '{
"settings": {
"analysis": {
"analyzer": {
"hunspell_analyzer_2345": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "hunspell_filter_2345"]
}
},
"filter": {
"hunspell_filter_2345": {
"type": "hunspell",
"ref_path": "pkg-2345",
"locale": "en_US",
"dedup": true
}
}
}
}
}'
# Result: Index created successfully

Test 2: Verify stemming works correctly per package
# pkg-1234 stemming tech words
curl -X POST "localhost:9200/index_pkg_1234/_analyze" -H "Content-Type: application/json" -d '{
"analyzer": "hunspell_analyzer_1234",
"text": "computers programs"
}'
# Result: {"tokens":[{"token":"computer",...},{"token":"program",...}]}
# pkg-2345 stemming fruit words
curl -X POST "localhost:9200/index_pkg_2345/_analyze" -H "Content-Type: application/json" -d '{
"analyzer": "hunspell_analyzer_2345",
"text": "apples bananas"
}'
# Result: {"tokens":[{"token":"apple",...},{"token":"banana",...}]}

Test 3: Verify dictionary isolation
# pkg-2345 should NOT stem tech words (not in its dictionary)
curl -X POST "localhost:9200/index_pkg_2345/_analyze" -H "Content-Type: application/json" -d '{
"analyzer": "hunspell_analyzer_2345",
"text": "computers"
}'
# Result: {"tokens":[{"token":"computers",...}]} (unchanged - word not in dictionary)
# pkg-1234 should NOT stem fruit words (not in its dictionary)
curl -X POST "localhost:9200/index_pkg_1234/_analyze" -H "Content-Type: application/json" -d '{
"analyzer": "hunspell_analyzer_1234",
"text": "apples"
}'
# Result: {"tokens":[{"token":"apples",...}]} (unchanged - word not in dictionary)

Test 4: Cross-package mixed words test
# Mixed words through pkg-1234 (only tech words should stem)
curl -X POST "localhost:9200/index_pkg_1234/_analyze" -H "Content-Type: application/json" -d '{
"analyzer": "hunspell_analyzer_1234",
"text": "computers apples programs bananas"
}'
# Result: computer, apples, program, bananas
# Mixed words through pkg-2345 (only fruit words should stem)
curl -X POST "localhost:9200/index_pkg_2345/_analyze" -H "Content-Type: application/json" -d '{
"analyzer": "hunspell_analyzer_2345",
"text": "computers apples programs bananas"
}'
# Result: computers, apple, programs, banana