
[RFC] Add ref_path support for Hunspell token filter to enable multi-package same-locale dictionaries; Enable Hot-reload for Hunspell #20712

@shayush622

Description


Is your feature request related to a problem? Please describe

The Problem

Currently, the Hunspell token filter supports only the locale parameter, which loads dictionaries exclusively from config/hunspell/{locale}/. HunspellService caches loaded dictionaries in a ConcurrentHashMap. Unlike synonyms (which reload from disk on every analyzer refresh), Hunspell always returns the cached dictionary. This creates the following issues:

  1. Single Dictionary Per Locale: Only one dictionary can exist for each locale name (e.g., en_US). If two different applications or tenants need different dictionaries for the same locale/language, this is not possible.

  2. Caching Conflict Risk: The current caching mechanism uses only the locale name as the cache key, preventing true isolation of dictionaries with the same locale name.

  3. No Hot-Reload Support: Once a dictionary is loaded into the cache (ConcurrentHashMap), it persists until node restart. Updating dictionary files on disk does not reflect in the running cluster without a full restart. This prevents dynamic dictionary updates for package re-associations or dictionary version changes.
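The staleness issue in point 3 falls directly out of how ConcurrentHashMap.computeIfAbsent behaves. A minimal standalone sketch (not the actual HunspellService code) showing why a cached entry never refreshes:

```java
import java.util.concurrent.ConcurrentHashMap;

public class StaleCacheDemo {
    // Simulates two lookups of the same locale: the second uses a "newer"
    // loader, standing in for updated dictionary files on disk.
    static String[] twoLookups() {
        ConcurrentHashMap<String, String> dictionaries = new ConcurrentHashMap<>();
        // First access: the loader runs and its result is cached under "en_US".
        String first = dictionaries.computeIfAbsent("en_US", k -> "dictionary-v1");
        // Later access: computeIfAbsent skips the loader entirely because the
        // key is already present, so the stale value survives until restart.
        String second = dictionaries.computeIfAbsent("en_US", k -> "dictionary-v2");
        return new String[] { first, second };
    }

    public static void main(String[] args) {
        String[] seen = twoLookups();
        System.out.println(seen[0] + " / " + seen[1]); // dictionary-v1 / dictionary-v1
    }
}
```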

Example of the Limitation:

# Current state - only one en_US dictionary possible
config/hunspell/en_US/
├── en_US.dic
└── en_US.aff

# Desired state - multiple en_US dictionaries for different packages
config/packages/
├── pkg-123/hunspell/en_US/
├── pkg-456/hunspell/en_US/
└── pkg-789/hunspell/en_US/

Describe the solution you'd like

Add a new ref_path parameter to the Hunspell token filter that allows specifying a package-based path to load dictionaries from, with full cache key isolation.

New Parameter: ref_path + locale (separate parameters)

  • ref_path: Package ID only (e.g., "pkg-1234")

  • locale: Dictionary locale (e.g., "en_US") - REQUIRED when using ref_path

  • Format: {package-id}

  • Resolves to: config/packages/{ref_path}/hunspell/{locale}/

  • Cache Key: ref_path + ":" + locale ("{packageId}:{locale}" )
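The resolution rules above can be sketched as two small helpers; method and class names here are hypothetical illustrations, not the actual HunspellService API:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class HunspellKeyResolution {
    // Proposed cache key format: "{packageId}:{locale}".
    static String cacheKey(String refPath, String locale) {
        return refPath + ":" + locale; // e.g., "pkg-1234:en_US"
    }

    // Proposed on-disk location: config/packages/{ref_path}/hunspell/{locale}/
    static Path dictionaryDir(Path configDir, String refPath, String locale) {
        return configDir.resolve("packages").resolve(refPath)
                        .resolve("hunspell").resolve(locale);
    }

    public static void main(String[] args) {
        System.out.println(cacheKey("pkg-1234", "en_US"));
        System.out.println(dictionaryDir(Paths.get("config"), "pkg-1234", "en_US"));
    }
}
```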

Usage Example

PUT /index_123
{
  "settings": {
    "analysis": {
      "analyzer": {
        "abc_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "abc_hunspell"]
        }
      },
      "filter": {
        "abc_hunspell": {
            "type": "hunspell",
            "ref_path": "pkg-123",
            "locale": "en_US",
            "dedup": true,
            "updateable": true  // Optional: enables hot-reload via _reload_search_analyzers
          }
      }
    }
  }
}
PUT /index_456
{
  "settings": {
    "analysis": {
      "analyzer": {
        "xyz_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "xyz_hunspell"]
        }
      },
      "filter": {
        "xyz_hunspell": {
            "type": "hunspell",
            "ref_path": "pkg-456",
            "locale": "en_US",
            "dedup": true,
            "updateable": true
          }
      }
    }
  }
}

Caching Strategy:

// Cache key is "{packageId}:{locale}" format
String cacheKey = refPath + ":" + locale;  // e.g., "pkg-1234:en_US"
return dictionaries.computeIfAbsent(cacheKey, (key) -> {
    return loadDictionaryFromPackage(refPath, locale, env);
});

Backward Compatibility

  • Existing behavior unchanged: locale/language/lang parameters continue to work as before
  • Priority: ref_path takes precedence if specified alongside locale
  • No breaking changes: All existing index configurations remain valid

Phase 2: Hot-Reload / Cache Invalidation Support

Extend the existing _reload_search_analyzers transport action to support Hunspell cache invalidation as an internal operation.

OpenSearch Core Changes:
HunspellService.java - Added methods:

  • getDictionaryFromPackage(String packageId, String locale, Environment env)
  • buildPackageCacheKey(String packageId, String locale) - returns "{packageId}:{locale}"
  • isPackageCacheKey(String cacheKey) - checks if key contains ":"
  • invalidateDictionary(String cacheKey)
  • invalidateDictionariesByPackage(String packageId) - removes all keys starting with "{packageId}:"
  • invalidateAllDictionaries()
  • reloadDictionaryFromPackage(String packageId, String locale, Environment env)
  • getCachedDictionaryKeys() - for diagnostics
  • getCachedDictionaryCount()
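To make the invalidation semantics concrete, here is a hedged sketch of how a few of these methods could behave over the node-level map (Dictionary is stubbed as Object; this is an illustration, not the PR's actual code):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class HunspellCacheSketch {
    private final ConcurrentHashMap<String, Object> dictionaries = new ConcurrentHashMap<>();

    void put(String key) { dictionaries.put(key, new Object()); }

    // Package keys carry a ':' separator; traditional locale-only keys do not.
    boolean isPackageCacheKey(String cacheKey) { return cacheKey.contains(":"); }

    // Drops every entry belonging to one package, leaving other packages and
    // traditional locale-keyed dictionaries untouched.
    void invalidateDictionariesByPackage(String packageId) {
        dictionaries.keySet().removeIf(key -> key.startsWith(packageId + ":"));
    }

    void invalidateAllDictionaries() { dictionaries.clear(); }

    Set<String> getCachedDictionaryKeys() { return dictionaries.keySet(); }

    int getCachedDictionaryCount() { return dictionaries.size(); }

    public static void main(String[] args) {
        HunspellCacheSketch cache = new HunspellCacheSketch();
        cache.put("pkg-1234:en_US");
        cache.put("pkg-1234:de_DE");
        cache.put("pkg-2345:en_US");
        cache.put("en_US"); // traditional, non-package key
        cache.invalidateDictionariesByPackage("pkg-1234");
        System.out.println(cache.getCachedDictionaryCount()); // 2
    }
}
```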

HunspellTokenFilterFactory.java - Updated:

  • Accepts ref_path (package ID) + locale (separate params)
  • Validates ref_path contains no slashes (package ID only)
  • Supports "updateable" flag for hot-reload via _reload_search_analyzers
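The "no slashes" validation matters because ref_path is interpolated into a filesystem path; rejecting separators blocks path traversal. A hypothetical check mirroring the described rule (not the factory's actual code):

```java
public class RefPathValidation {
    // ref_path must be a bare package ID, so path separators (and thus
    // traversal attempts like "../") are rejected outright.
    static void validateRefPath(String refPath) {
        if (refPath == null || refPath.isEmpty()) {
            throw new IllegalArgumentException("ref_path must not be empty");
        }
        if (refPath.contains("/") || refPath.contains("\\")) {
            throw new IllegalArgumentException(
                "ref_path must be a package ID without slashes: [" + refPath + "]");
        }
    }

    public static void main(String[] args) {
        validateRefPath("pkg-1234"); // accepted
        try {
            validateRefPath("../../etc"); // traversal attempt
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```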

NEW: RestHunspellCacheInvalidateAction.java - Added:

  • GET /_hunspell/cache - view cache info
  • POST /_hunspell/cache/_invalidate - invalidate by package/key
  • POST /_hunspell/cache/_invalidate_all - clear all

Node.java - Updated to register new REST handler

Hot-Reload Workflow:

  1. Deploy the new dictionary files to each node
  2. Call POST /_hunspell/cache/_invalidate?package_id=pkg-1234
  3. Call POST /{index}/_reload_search_analyzers
  4. The fresh dictionary is loaded from disk on the next analyzer access

New REST API: /_hunspell/cache

Endpoint                                             Method  Description
/_hunspell/cache                                     GET     View cached keys
/_hunspell/cache/_invalidate?package_id=X            POST    Invalidate by package
/_hunspell/cache/_invalidate?package_id=X&locale=Y   POST    Invalidate a specific package/locale
/_hunspell/cache/_invalidate?cache_key=X             POST    Invalidate by exact cache key
/_hunspell/cache/_invalidate_all                     POST    Clear all entries

Design Decisions & Discussion Notes

Caching Strategy: Node-Level Cache

Per discussion with @cwperks, a node-level cache (the existing behavior) is preferable to an index-level cache:

  • Rationale: Index-level caching would lead to unnecessary caching of Dictionary Objects, resulting in redundant memory overhead. Node-level cache keeps the Dictionary Object in-memory so it can be shared across all indices using the same package and locale.
  • Implementation: ConcurrentHashMap<String, Dictionary> in HunspellService (singleton per node). Cache key is {packageId}:{locale} for package-based dictionaries, {locale} for traditional dictionaries.
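The sharing property described above can be demonstrated in isolation: two indices resolving the same {packageId}:{locale} key trigger only one load and receive the same instance. A minimal sketch (Dictionary stubbed as Object):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedCacheDemo {
    // Counts how many times the expensive dictionary load runs when two
    // indices resolve the same "{packageId}:{locale}" cache key.
    static int loadsForSameKey() {
        ConcurrentHashMap<String, Object> dictionaries = new ConcurrentHashMap<>();
        AtomicInteger loads = new AtomicInteger();
        Object a = dictionaries.computeIfAbsent("pkg-1234:en_US",
                k -> { loads.incrementAndGet(); return new Object(); });
        Object b = dictionaries.computeIfAbsent("pkg-1234:en_US",
                k -> { loads.incrementAndGet(); return new Object(); });
        // Both callers end up sharing the exact same in-memory instance.
        if (a != b) throw new AssertionError("expected a single shared instance");
        return loads.get();
    }

    public static void main(String[] args) {
        System.out.println("loads=" + loadsForSameKey()); // loads=1
    }
}
```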

Hot-Reload Strategy: Reuse _reload_search_analyzers (Recommended)

Craig recommends reusing the existing _reload_search_analyzers API rather than introducing a new cache invalidation endpoint:

  • Proposed behavior: Every invocation of _reload_search_analyzers should invalidate the hunspell node-level cache, re-read dictionary data from disk, and repopulate the in-memory data structure.
  • Assumption: This operation is infrequent β€” essentially a one-off event triggered after dictionary files are updated on disk. Craig has requested usage data on how often this API is called in practice to validate this assumption.
  • New endpoint deferred: Craig is open to a dedicated /_hunspell/cache/_invalidate endpoint if needed, but considers it potential over-engineering or premature optimization at this stage.

Phased Implementation

Phase 1: ref_path support (PR #20840)

  • ref_path parameter for package-based dictionary loading
  • Node-level cache with multi-tenant isolation
  • Backward compatible with traditional config/hunspell/{locale}/ path

Phase 2: Hot-reload via _reload_search_analyzers (to be raised after the ongoing discussion concludes)

  • Integrate cache invalidation into MapperService.reloadSearchAnalyzers()
  • Add updateable: true flag support for SEARCH_TIME analysis mode
  • Pending: usage data on _reload_search_analyzers call frequency

Related component

Plugins

Describe alternatives you've considered

No response

Additional context

Local Testing Results

Test Environment:

  • OpenSearch built from source with POC changes
  • Two packages with same locale (en_US) but different dictionaries

Directory Structure Created:

config/packages/
├── pkg-1234/hunspell/en_US/
│   ├── en_US.dic  (technology terms: computer, program, code, etc.)
│   └── en_US.aff  (affix rules: -s, -ed, -ing)
└── pkg-2345/hunspell/en_US/
    ├── en_US.dic  (fruit terms: apple, banana, orange, etc.)
    └── en_US.aff  (affix rules: -s, un-, -able, -ness)

Test 1: Create indexes with different packages

# Index using pkg-1234 (tech dictionary)
curl -X PUT "localhost:9200/index_pkg_1234" -H "Content-Type: application/json" -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "hunspell_analyzer_1234": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "hunspell_filter_1234"]
        }
      },
      "filter": {
        "hunspell_filter_1234": {
          "type": "hunspell",
          "ref_path": "pkg-1234",
          "locale": "en_US",
          "dedup": true
        }
      }
    }
  }
}'

# Result: Index created successfully 

# Index using pkg-2345 (fruit dictionary)
curl -X PUT "localhost:9200/index_pkg_2345" -H "Content-Type: application/json" -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "hunspell_analyzer_2345": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "hunspell_filter_2345"]
        }
      },
      "filter": {
        "hunspell_filter_2345": {
          "type": "hunspell",
          "ref_path": "pkg-2345",
          "locale": "en_US",
          "dedup": true
        }
      }
    }
  }
}'

# Result: Index created successfully 

Test 2: Verify stemming works correctly per package

# pkg-1234 stemming tech words
curl -X POST "localhost:9200/index_pkg_1234/_analyze" -H "Content-Type: application/json" -d '{
  "analyzer": "hunspell_analyzer_1234",
  "text": "computers programs"
}'
# Result: {"tokens":[{"token":"computer",...},{"token":"program",...}]} 

# pkg-2345 stemming fruit words
curl -X POST "localhost:9200/index_pkg_2345/_analyze" -H "Content-Type: application/json" -d '{
  "analyzer": "hunspell_analyzer_2345",
  "text": "apples bananas"
}'
# Result: {"tokens":[{"token":"apple",...},{"token":"banana",...}]} 

Test 3: Verify dictionary isolation

# pkg-2345 should NOT stem tech words (not in its dictionary)
curl -X POST "localhost:9200/index_pkg_2345/_analyze" -H "Content-Type: application/json" -d '{
  "analyzer": "hunspell_analyzer_2345",
  "text": "computers"
}'
# Result: {"tokens":[{"token":"computers",...}]} (unchanged - word not in dictionary)

# pkg-1234 should NOT stem fruit words (not in its dictionary)
curl -X POST "localhost:9200/index_pkg_1234/_analyze" -H "Content-Type: application/json" -d '{
  "analyzer": "hunspell_analyzer_1234",
  "text": "apples"
}'
# Result: {"tokens":[{"token":"apples",...}]} (unchanged - word not in dictionary)

Test 4: Cross-package mixed words test

# Mixed words through pkg-1234 (only tech words should stem)
curl -X POST "localhost:9200/index_pkg_1234/_analyze" -H "Content-Type: application/json" -d '{
  "analyzer": "hunspell_analyzer_1234",
  "text": "computers apples programs bananas"
}'
# Result: computer, apples, program, bananas 

# Mixed words through pkg-2345 (only fruit words should stem)
curl -X POST "localhost:9200/index_pkg_2345/_analyze" -H "Content-Type: application/json" -d '{
  "analyzer": "hunspell_analyzer_2345",
  "text": "computers apples programs bananas"
}'
# Result: computers, apple, programs, banana 
