
[RFC] Add ref_path support for Hunspell token filter to enable multi-package same-locale dictionaries; Enable Hot-reload for Hunspell #20712

@shayush622

Description


Is your feature request related to a problem? Please describe

The Problem

Currently, the Hunspell token filter supports only the locale parameter, which loads dictionaries exclusively from config/hunspell/{locale}/. HunspellService caches loaded dictionaries in a ConcurrentHashMap. Unlike synonyms (which reload from disk on every analyzer refresh), Hunspell always returns the cached dictionary. This creates the following issues:

  1. Single Dictionary Per Locale: Only one dictionary can exist for each locale name (e.g., en_US). If two different applications or tenants need different dictionaries for the same locale/language, this is not possible.

  2. Caching Conflict Risk: The current caching mechanism uses only the locale name as the cache key, preventing true isolation of dictionaries with the same locale name.

  3. No Hot-Reload Support: Once a dictionary is loaded into the cache (ConcurrentHashMap), it persists until node restart. Updating dictionary files on disk does not reflect in the running cluster without a full restart. This prevents dynamic dictionary updates for package re-associations or dictionary version changes.
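The staleness issue in point 3 falls directly out of how ConcurrentHashMap.computeIfAbsent behaves. A minimal standalone sketch (not the actual HunspellService code) showing why a cached entry never refreshes:

```java
import java.util.concurrent.ConcurrentHashMap;

public class StaleCacheDemo {
    // Simulates two lookups of the same locale: the second uses a "newer"
    // loader, standing in for updated dictionary files on disk.
    static String[] twoLookups() {
        ConcurrentHashMap<String, String> dictionaries = new ConcurrentHashMap<>();
        // First access: the loader runs and its result is cached under "en_US".
        String first = dictionaries.computeIfAbsent("en_US", k -> "dictionary-v1");
        // Later access: computeIfAbsent skips the loader entirely because the
        // key is already present, so the stale value survives until restart.
        String second = dictionaries.computeIfAbsent("en_US", k -> "dictionary-v2");
        return new String[] { first, second };
    }

    public static void main(String[] args) {
        String[] seen = twoLookups();
        System.out.println(seen[0] + " / " + seen[1]); // dictionary-v1 / dictionary-v1
    }
}
```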

Example of the Limitation:

# Current state - only one en_US dictionary possible
config/hunspell/en_US/
├── en_US.dic
└── en_US.aff

# Desired state - multiple en_US dictionaries for different packages
config/packages/
├── pkg-123/hunspell/en_US/
├── pkg-456/hunspell/en_US/
└── pkg-789/hunspell/en_US/

Describe the solution you'd like

Add a new ref_path parameter to the Hunspell token filter that allows specifying a package-based path to load dictionaries from, with full cache key isolation.

New Parameter: ref_path + locale (separate parameters)

  • ref_path: Package ID only (e.g., "pkg-1234")

  • locale: Dictionary locale (e.g., "en_US") - REQUIRED when using ref_path

  • Format: {package-id}

  • Resolves to: config/packages/{ref_path}/hunspell/{locale}/

  • Cache Key: ref_path + ":" + locale ("{packageId}:{locale}" )
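The resolution rules above can be sketched as two small helpers; method and class names here are hypothetical illustrations, not the actual HunspellService API:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class HunspellKeyResolution {
    // Proposed cache key format: "{packageId}:{locale}".
    static String cacheKey(String refPath, String locale) {
        return refPath + ":" + locale; // e.g., "pkg-1234:en_US"
    }

    // Proposed on-disk location: config/packages/{ref_path}/hunspell/{locale}/
    static Path dictionaryDir(Path configDir, String refPath, String locale) {
        return configDir.resolve("packages").resolve(refPath)
                        .resolve("hunspell").resolve(locale);
    }

    public static void main(String[] args) {
        System.out.println(cacheKey("pkg-1234", "en_US"));
        System.out.println(dictionaryDir(Paths.get("config"), "pkg-1234", "en_US"));
    }
}
```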

Usage Example

PUT /index_123
{
  "settings": {
    "analysis": {
      "analyzer": {
        "abc_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "abc_hunspell"]
        }
      },
      "filter": {
        "abc_hunspell": {
            "type": "hunspell",
            "ref_path": "pkg-123",
            "locale": "en_US",
            "dedup": true,
            "updateable": true  // Optional: enables hot-reload via _reload_search_analyzers
          }
      }
    }
  }
}
PUT /index_456
{
  "settings": {
    "analysis": {
      "analyzer": {
        "xyz_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "xyz_hunspell"]
        }
      },
      "filter": {
        "xyz_hunspell": {
            "type": "hunspell",
            "ref_path": "pkg-456",
            "locale": "en_US",
            "dedup": true,
            "updateable": true
          }
      }
    }
  }
}

Caching Strategy:

// Cache key is "{packageId}:{locale}" format
String cacheKey = refPath + ":" + locale;  // e.g., "pkg-1234:en_US"
return dictionaries.computeIfAbsent(cacheKey, (key) -> {
    return loadDictionaryFromPackage(refPath, locale, env);
});

Backward Compatibility

  • Existing behavior unchanged: locale/language/lang parameters continue to work as before
  • Priority: ref_path takes precedence if specified alongside locale
  • No breaking changes: All existing index configurations remain valid

Phase 2: Hot-Reload / Cache Invalidation Support

Extend the existing _reload_search_analyzers transport action to support Hunspell cache invalidation as an internal operation.

OpenSearch Core Changes:
HunspellService.java - Added methods:

  • getDictionaryFromPackage(String packageId, String locale, Environment env)
  • buildPackageCacheKey(String packageId, String locale) - returns "{packageId}:{locale}"
  • isPackageCacheKey(String cacheKey) - checks if key contains ":"
  • invalidateDictionary(String cacheKey)
  • invalidateDictionariesByPackage(String packageId) - removes all keys starting with "{packageId}:"
  • invalidateAllDictionaries()
  • reloadDictionaryFromPackage(String packageId, String locale, Environment env)
  • getCachedDictionaryKeys() - for diagnostics
  • getCachedDictionaryCount()
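To make the invalidation semantics concrete, here is a hedged sketch of how a few of these methods could behave over the node-level map (Dictionary is stubbed as Object; this is an illustration, not the PR's actual code):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class HunspellCacheSketch {
    private final ConcurrentHashMap<String, Object> dictionaries = new ConcurrentHashMap<>();

    void put(String key) { dictionaries.put(key, new Object()); }

    // Package keys carry a ':' separator; traditional locale-only keys do not.
    boolean isPackageCacheKey(String cacheKey) { return cacheKey.contains(":"); }

    // Drops every entry belonging to one package, leaving other packages and
    // traditional locale-keyed dictionaries untouched.
    void invalidateDictionariesByPackage(String packageId) {
        dictionaries.keySet().removeIf(key -> key.startsWith(packageId + ":"));
    }

    void invalidateAllDictionaries() { dictionaries.clear(); }

    Set<String> getCachedDictionaryKeys() { return dictionaries.keySet(); }

    int getCachedDictionaryCount() { return dictionaries.size(); }

    public static void main(String[] args) {
        HunspellCacheSketch cache = new HunspellCacheSketch();
        cache.put("pkg-1234:en_US");
        cache.put("pkg-1234:de_DE");
        cache.put("pkg-2345:en_US");
        cache.put("en_US"); // traditional, non-package key
        cache.invalidateDictionariesByPackage("pkg-1234");
        System.out.println(cache.getCachedDictionaryCount()); // 2
    }
}
```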

HunspellTokenFilterFactory.java - Updated:

  • Accepts ref_path (package ID) + locale (separate params)
  • Validates ref_path contains no slashes (package ID only)
  • Supports "updateable" flag for hot-reload via _reload_search_analyzers
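The "no slashes" validation matters because ref_path is interpolated into a filesystem path; rejecting separators blocks path traversal. A hypothetical check mirroring the described rule (not the factory's actual code):

```java
public class RefPathValidation {
    // ref_path must be a bare package ID, so path separators (and thus
    // traversal attempts like "../") are rejected outright.
    static void validateRefPath(String refPath) {
        if (refPath == null || refPath.isEmpty()) {
            throw new IllegalArgumentException("ref_path must not be empty");
        }
        if (refPath.contains("/") || refPath.contains("\\")) {
            throw new IllegalArgumentException(
                "ref_path must be a package ID without slashes: [" + refPath + "]");
        }
    }

    public static void main(String[] args) {
        validateRefPath("pkg-1234"); // accepted
        try {
            validateRefPath("../../etc"); // traversal attempt
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```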

NEW: RestHunspellCacheInvalidateAction.java - Added:

  • GET /_hunspell/cache - view cache info
  • POST /_hunspell/cache/_invalidate - invalidate by package/key
  • POST /_hunspell/cache/_invalidate_all - clear all

Node.java - Updated to register new REST handler

Hot-Reload Workflow:

  1. Deploy the new dictionary files to each node
  2. Call POST /_hunspell/cache/_invalidate?package_id=pkg-1234
  3. Call POST /{index}/_reload_search_analyzers
  4. The fresh dictionary is loaded from disk on the next analyzer access

New REST API: /_hunspell/cache

Endpoint                                             Method  Description
/_hunspell/cache                                     GET     View cached keys
/_hunspell/cache/_invalidate?package_id=X            POST    Invalidate by package
/_hunspell/cache/_invalidate?package_id=X&locale=Y   POST    Invalidate a specific package/locale
/_hunspell/cache/_invalidate?cache_key=X             POST    Invalidate by exact cache key
/_hunspell/cache/_invalidate_all                     POST    Clear all entries

Design Decisions & Discussion Notes

Caching Strategy: Node-Level Cache

Per discussion with @cwperks, a node-level cache (the existing behavior) is preferable to an index-level cache:

  • Rationale: Index-level caching would lead to unnecessary caching of Dictionary Objects, resulting in redundant memory overhead. Node-level cache keeps the Dictionary Object in-memory so it can be shared across all indices using the same package and locale.
  • Implementation: ConcurrentHashMap<String, Dictionary> in HunspellService (singleton per node). Cache key is {packageId}:{locale} for package-based dictionaries, {locale} for traditional dictionaries.
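The sharing property described above can be demonstrated in isolation: two indices resolving the same {packageId}:{locale} key trigger only one load and receive the same instance. A minimal sketch (Dictionary stubbed as Object):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedCacheDemo {
    // Counts how many times the expensive dictionary load runs when two
    // indices resolve the same "{packageId}:{locale}" cache key.
    static int loadsForSameKey() {
        ConcurrentHashMap<String, Object> dictionaries = new ConcurrentHashMap<>();
        AtomicInteger loads = new AtomicInteger();
        Object a = dictionaries.computeIfAbsent("pkg-1234:en_US",
                k -> { loads.incrementAndGet(); return new Object(); });
        Object b = dictionaries.computeIfAbsent("pkg-1234:en_US",
                k -> { loads.incrementAndGet(); return new Object(); });
        // Both callers end up sharing the exact same in-memory instance.
        if (a != b) throw new AssertionError("expected a single shared instance");
        return loads.get();
    }

    public static void main(String[] args) {
        System.out.println("loads=" + loadsForSameKey()); // loads=1
    }
}
```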

Hot-Reload Strategy: Reuse _reload_search_analyzers (Recommended)

Craig recommends reusing the existing _reload_search_analyzers API rather than introducing a new cache invalidation endpoint:

  • Proposed behavior: Every invocation of _reload_search_analyzers should invalidate the hunspell node-level cache, re-read dictionary data from disk, and repopulate the in-memory data structure.
  • Assumption: This operation is infrequent β€” essentially a one-off event triggered after dictionary files are updated on disk. Craig has requested usage data on how often this API is called in practice to validate this assumption.
  • New endpoint deferred: Craig is open to a dedicated /_hunspell/cache/_invalidate endpoint if needed, but considers it potential over-engineering or premature optimization at this stage.

Phased Implementation

Phase 1: ref_path support (PR #20840)

  • ref_path parameter for package-based dictionary loading
  • Node-level cache with multi-tenant isolation
  • Backward compatible with traditional config/hunspell/{locale}/ path

Phase 2: Hot-reload via _reload_search_analyzers (to be raised after the ongoing discussion concludes)

  • Integrate cache invalidation into MapperService.reloadSearchAnalyzers()
  • Add updateable: true flag support for SEARCH_TIME analysis mode
  • Pending: usage data on _reload_search_analyzers call frequency

Related component

Plugins

Describe alternatives you've considered

No response

Additional context

Local Testing Results

Test Environment:

  • OpenSearch built from source with POC changes
  • Two packages with same locale (en_US) but different dictionaries

Directory Structure Created:

config/packages/
├── pkg-1234/hunspell/en_US/
│   ├── en_US.dic  (technology terms: computer, program, code, etc.)
│   └── en_US.aff  (affix rules: -s, -ed, -ing)
└── pkg-2345/hunspell/en_US/
    ├── en_US.dic  (fruit terms: apple, banana, orange, etc.)
    └── en_US.aff  (affix rules: -s, un-, -able, -ness)

Test 1: Create indexes with different packages

# Index using pkg-1234 (tech dictionary)
curl -X PUT "localhost:9200/index_pkg_1234" -H "Content-Type: application/json" -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "hunspell_analyzer_1234": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "hunspell_filter_1234"]
        }
      },
      "filter": {
        "hunspell_filter_1234": {
          "type": "hunspell",
          "ref_path": "pkg-1234",
          "locale": "en_US",
          "dedup": true
        }
      }
    }
  }
}'

# Result: Index created successfully 

# Index using pkg-2345 (fruit dictionary)
curl -X PUT "localhost:9200/index_pkg_2345" -H "Content-Type: application/json" -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "hunspell_analyzer_2345": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "hunspell_filter_2345"]
        }
      },
      "filter": {
        "hunspell_filter_2345": {
          "type": "hunspell",
          "ref_path": "pkg-2345",
          "locale": "en_US",
          "dedup": true
        }
      }
    }
  }
}'

# Result: Index created successfully 

Test 2: Verify stemming works correctly per package

# pkg-1234 stemming tech words
curl -X POST "localhost:9200/index_pkg_1234/_analyze" -H "Content-Type: application/json" -d '{
  "analyzer": "hunspell_analyzer_1234",
  "text": "computers programs"
}'
# Result: {"tokens":[{"token":"computer",...},{"token":"program",...}]} 

# pkg-2345 stemming fruit words
curl -X POST "localhost:9200/index_pkg_2345/_analyze" -H "Content-Type: application/json" -d '{
  "analyzer": "hunspell_analyzer_2345",
  "text": "apples bananas"
}'
# Result: {"tokens":[{"token":"apple",...},{"token":"banana",...}]} 

Test 3: Verify dictionary isolation

# pkg-2345 should NOT stem tech words (not in its dictionary)
curl -X POST "localhost:9200/index_pkg_2345/_analyze" -H "Content-Type: application/json" -d '{
  "analyzer": "hunspell_analyzer_2345",
  "text": "computers"
}'
# Result: {"tokens":[{"token":"computers",...}]} (unchanged - word not in dictionary)

# pkg-1234 should NOT stem fruit words (not in its dictionary)
curl -X POST "localhost:9200/index_pkg_1234/_analyze" -H "Content-Type: application/json" -d '{
  "analyzer": "hunspell_analyzer_1234",
  "text": "apples"
}'
# Result: {"tokens":[{"token":"apples",...}]} (unchanged - word not in dictionary)

Test 4: Cross-package mixed words test

# Mixed words through pkg-1234 (only tech words should stem)
curl -X POST "localhost:9200/index_pkg_1234/_analyze" -H "Content-Type: application/json" -d '{
  "analyzer": "hunspell_analyzer_1234",
  "text": "computers apples programs bananas"
}'
# Result: computer, apples, program, bananas 

# Mixed words through pkg-2345 (only fruit words should stem)
curl -X POST "localhost:9200/index_pkg_2345/_analyze" -H "Content-Type: application/json" -d '{
  "analyzer": "hunspell_analyzer_2345",
  "text": "computers apples programs bananas"
}'
# Result: computers, apple, programs, banana 
