Closed
2 changes: 2 additions & 0 deletions mise.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
[tools]
java = "corretto-21"
Copilot AI Mar 3, 2026

mise.toml was previously called out as unrelated to the Hunspell feature and requested to be removed, but it’s present in this diff again. Please drop it from this PR unless there’s a project-wide agreement to add mise tooling config (and, if so, it should be introduced in a separate PR with appropriate documentation).

Copilot uses AI. Check for mistakes.
@@ -117,6 +117,7 @@

import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.nio.file.Files;
Copilot AI Mar 3, 2026

java.nio.file.Files is imported but never used in this file. Please remove the unused import to avoid checkstyle failures.

Suggested change
import java.nio.file.Files;

import java.nio.file.Path;
import java.time.Instant;
import java.util.ArrayList;
Comment on lines 116 to 121
Copilot AI Mar 3, 2026

PR description mentions adding ref_path validation in MetadataCreateIndexService, but in this diff the only change appears to be adding an unused java.nio.file.Files import (no ref_path-related validation code exists in this class, and ref_path isn’t referenced anywhere in server/src/main/java). Either include the intended validation logic here or update the PR description/scope.

@@ -1701,6 +1702,7 @@ private static void validateErrors(String indexName, List<String> validationErro

List<String> getIndexSettingsValidationErrors(final Settings settings, final boolean forbidPrivateIndexSettings, String indexName) {
List<String> validationErrors = getIndexSettingsValidationErrors(settings, forbidPrivateIndexSettings, Optional.of(indexName));
validationErrors.addAll(validateRefPath(settings, env.configDir()));
Member

nit: can we move this within getIndexSettingsValidationErrors to keep all validations in a single place?

return validationErrors;
}

@@ -1732,6 +1734,38 @@ List<String> getIndexSettingsValidationErrors(
}
return validationErrors;
}
/**
* Validates the ref_path setting if present.
* Checks that the path format is valid and the directory exists.
*
* @param settings the index settings
* @param configDir the config directory path
* @return a list containing validation errors or an empty list if valid
*/
private List<String> validateRefPath(Settings settings, Path configDir) {
List<String> validationErrors = new ArrayList<>();
String refPath = settings.get(IndexSettings.INDEX_REF_PATH_SETTING.getKey());

if (refPath != null && !refPath.isEmpty()) {
try {
// Validate format: should be in packages/<package_id> format
if (!refPath.startsWith("packages/")) {
validationErrors.add("ref_path [" + refPath + "] must start with 'packages/'");
return validationErrors;
Copilot AI Feb 27, 2026

The ref_path validation expects the value to start with "packages/" (e.g., "packages/pkg-1234"), but HunspellTokenFilterFactory at line 87 expects ref_path to be just the package ID (e.g., "pkg-1234") without the "packages/" prefix. This creates an inconsistency where the validation will reject valid usage patterns.

Looking at HunspellService.loadDictionaryFromPackage (lines 199-201), it constructs the path as config/packages/{packageId}, meaning it expects just the package ID, not the full path.

The validation should either:

  1. Accept just the package ID (e.g., "pkg-1234") to match the usage, or
  2. Update HunspellTokenFilterFactory and HunspellService to expect the full path with "packages/" prefix

Option 1 is recommended for consistency with the implementation and to avoid confusion for users.
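
Option 1 above could be sketched roughly as follows. This is a minimal, self-contained illustration, not code from the PR: the class name `PackageIdRefPathCheck` and its `validate` method are hypothetical, and it takes the raw setting value rather than a `Settings` object so that it runs on its own.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the PR: treats ref_path as a bare package ID. */
final class PackageIdRefPathCheck {
    private PackageIdRefPathCheck() {}

    /** Returns validation errors for a ref_path value, or an empty list if it is acceptable. */
    static List<String> validate(String refPath) {
        List<String> errors = new ArrayList<>();
        if (refPath == null || refPath.isEmpty()) {
            return errors; // setting not provided: nothing to validate
        }
        if (refPath.startsWith("packages/")) {
            // The factory and service add the packages/ directory themselves.
            errors.add("ref_path [" + refPath + "] must be a package ID without the 'packages/' prefix");
        } else if (refPath.indexOf('/') != -1 || refPath.indexOf('\\') != -1) {
            errors.add("ref_path [" + refPath + "] must not contain path separators");
        }
        return errors;
    }
}
```

With this shape, "pkg-1234" passes and "packages/pkg-1234" is rejected, which matches what the comment says HunspellTokenFilterFactory and HunspellService expect.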

}

// Resolve and check if path exists
Path resolvedPath = configDir.resolve(refPath).normalize();
if (!Files.isDirectory(resolvedPath)) {
validationErrors.add("ref_path [" + refPath + "] does not exist or is not a directory");
}
} catch (Exception e) {
validationErrors.add("invalid ref_path [" + refPath + "]: " + e.getMessage());
}
}

return validationErrors;
}
Copilot AI Feb 27, 2026

The validateRefPath method validates the format of index.ref_path at index creation time, but there's no code that actually uses this index-level setting. The HunspellTokenFilterFactory reads ref_path from the analyzer settings (line 87 in HunspellTokenFilterFactory.java), not from indexSettings.getRefPath().

This means the index.ref_path setting and its validation are currently unused and serve no purpose. Either:

  1. Remove the unused index-level setting and validation if ref_path is only meant to be an analyzer-level parameter, or
  2. Update HunspellTokenFilterFactory to use indexSettings.getRefPath() as the default if no analyzer-level ref_path is specified

The current implementation creates confusion about whether ref_path is an index-level setting or an analyzer-level parameter.

Suggested change
/**
* Validates the ref_path setting if present.
* Checks that the path format is valid and the directory exists.
*
* @param settings the index settings
* @param configDir the config directory path
* @return a list containing validation errors or an empty list if valid
*/
private List<String> validateRefPath(Settings settings, Path configDir) {
List<String> validationErrors = new ArrayList<>();
String refPath = settings.get(IndexSettings.INDEX_REF_PATH_SETTING.getKey());
if (refPath != null && !refPath.isEmpty()) {
try {
// Validate format: should be in packages/<package_id> format
if (!refPath.startsWith("packages/")) {
validationErrors.add("ref_path [" + refPath + "] must start with 'packages/'");
return validationErrors;
}
// Resolve and check if path exists
Path resolvedPath = configDir.resolve(refPath).normalize();
if (!Files.isDirectory(resolvedPath)) {
validationErrors.add("ref_path [" + refPath + "] does not exist or is not a directory");
}
} catch (Exception e) {
validationErrors.add("invalid ref_path [" + refPath + "]: " + e.getMessage());
}
}
return validationErrors;
}
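
If option 2 from this comment were preferred instead of removing the method, the fallback could look something like the sketch below. The helper `RefPathResolution.effectiveRefPath` is hypothetical and not in the PR. It assumes that an empty string means "unset", since the diff declares `index.ref_path` with a default of "".

```java
/** Hypothetical helper, not part of the PR: picks the ref_path the filter should use. */
final class RefPathResolution {
    private RefPathResolution() {}

    /**
     * Prefers the analyzer-level value and falls back to the index-level one.
     * Returns null when neither is set, which would select locale-only loading.
     */
    static String effectiveRefPath(String analyzerRefPath, String indexRefPath) {
        if (analyzerRefPath != null && !analyzerRefPath.isEmpty()) {
            return analyzerRefPath;
        }
        if (indexRefPath != null && !indexRefPath.isEmpty()) {
            return indexRefPath;
        }
        return null;
    }
}
```

In the factory this would presumably be called with `settings.get("ref_path")` and `indexSettings.getRefPath()`, both of which appear in this diff.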

Copilot AI Feb 27, 2026

Missing test coverage for the validateRefPath method. There are no tests verifying that:

  1. Valid ref_path values (e.g., "packages/pkg-1234") are accepted
  2. Invalid ref_path values (e.g., "pkg-1234" without "packages/" prefix) are rejected
  3. Non-existent package directories are rejected
  4. Path traversal attempts are blocked

Add tests to MetadataCreateIndexServiceTests.java to ensure this validation works correctly.
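
The four cases above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the requested test class. `RefPathValidationCases.validateRefPath` is a stand-in for the private method in the diff that takes the raw value instead of `Settings`. The `startsWith(packagesRoot)` containment check is an addition that is not in the PR. Without it, a value such as "packages/../.." normalizes to an existing directory outside the config directory and would pass the `Files.isDirectory` check.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the validation in the diff, plus the four cases from the comment. */
final class RefPathValidationCases {
    private RefPathValidationCases() {}

    static List<String> validateRefPath(String refPath, Path configDir) {
        List<String> errors = new ArrayList<>();
        if (refPath == null || refPath.isEmpty()) {
            return errors;
        }
        if (!refPath.startsWith("packages/")) {
            errors.add("ref_path [" + refPath + "] must start with 'packages/'");
            return errors;
        }
        Path packagesRoot = configDir.resolve("packages").normalize();
        Path resolved = configDir.resolve(refPath).normalize();
        if (!resolved.startsWith(packagesRoot)) {
            // Assumption: containment check added for this sketch, not present in the PR.
            errors.add("ref_path [" + refPath + "] escapes the packages directory");
        } else if (!Files.isDirectory(resolved)) {
            errors.add("ref_path [" + refPath + "] does not exist or is not a directory");
        }
        return errors;
    }

    /** Builds a temporary config directory and returns true when all four expectations hold. */
    static boolean runCases() {
        try {
            Path configDir = Files.createTempDirectory("config");
            Files.createDirectories(configDir.resolve("packages").resolve("pkg-1234"));
            boolean validAccepted = validateRefPath("packages/pkg-1234", configDir).isEmpty();
            boolean missingPrefixRejected = !validateRefPath("pkg-1234", configDir).isEmpty();
            boolean missingDirRejected = !validateRefPath("packages/pkg-9999", configDir).isEmpty();
            boolean traversalRejected = !validateRefPath("packages/../..", configDir).isEmpty();
            return validAccepted && missingPrefixRejected && missingDirRejected && traversalRejected;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the repository these checks would more naturally be separate test methods in MetadataCreateIndexServiceTests using the project's test framework.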


private static List<String> validatePrivateSettingsNotExplicitlySet(Settings settings, IndexScopedSettings indexScopedSettings) {
List<String> validationErrors = new ArrayList<>();
@@ -146,6 +146,7 @@ public final class IndexScopedSettings extends AbstractScopedSettings {
IndexSortConfig.INDEX_SORT_ORDER_SETTING,
IndexSortConfig.INDEX_SORT_MISSING_SETTING,
IndexSortConfig.INDEX_SORT_MODE_SETTING,
IndexSettings.INDEX_REF_PATH_SETTING,
IndexSettings.INDEX_TRANSLOG_DURABILITY_SETTING,
IndexSettings.INDEX_TRANSLOG_READ_FORWARD_SETTING,
IndexSettings.INDEX_WARMER_ENABLED_SETTING,
18 changes: 18 additions & 0 deletions server/src/main/java/org/opensearch/index/IndexSettings.java
@@ -917,6 +917,13 @@ public static IndexMergePolicy fromString(String text) {
Property.Dynamic
);

public static final Setting<String> INDEX_REF_PATH_SETTING = Setting.simpleString(
"index.ref_path",
Member

Should we consider scoping this further so that it is grouped with other similar index settings, e.g. index.analyze.ref_path?

Contributor Author

ack, will do as recommended.

"",
Property.IndexScope,
Property.Dynamic
);

private final Index index;
private final Version version;
private final Logger logger;
@@ -974,6 +981,7 @@ public static IndexMergePolicy fromString(String text) {
private volatile boolean allowDerivedField;
private final boolean derivedSourceEnabled;
private volatile boolean derivedSourceEnabledForTranslog;
private volatile String refPath;

/**
* The maximum age of a retention lease before it is considered expired.
@@ -1168,6 +1176,7 @@ public IndexSettings(final IndexMetadata indexMetadata, final Settings nodeSetti
this.defaultAllowUnmappedFields = scopedSettings.get(ALLOW_UNMAPPED);
this.allowDerivedField = scopedSettings.get(ALLOW_DERIVED_FIELDS);
this.durability = scopedSettings.get(INDEX_TRANSLOG_DURABILITY_SETTING);
this.refPath = scopedSettings.get(INDEX_REF_PATH_SETTING);
this.translogReadForward = INDEX_TRANSLOG_READ_FORWARD_SETTING.get(settings);
defaultFields = scopedSettings.get(DEFAULT_FIELD_SETTING);
syncInterval = INDEX_TRANSLOG_SYNC_INTERVAL_SETTING.get(settings);
@@ -1381,6 +1390,7 @@ public IndexSettings(final IndexMetadata indexMetadata, final Settings nodeSetti
this::setRemoteStoreTranslogRepository
);
scopedSettings.addSettingsUpdateConsumer(StarTreeIndexSettings.STAR_TREE_SEARCH_ENABLED_SETTING, this::setStarTreeIndexEnabled);
scopedSettings.addSettingsUpdateConsumer(INDEX_REF_PATH_SETTING, this::setRefPath);
}

private void setSearchIdleAfter(TimeValue searchIdleAfter) {
@@ -2002,6 +2012,14 @@ public boolean getStarTreeIndexEnabled() {
return isStarTreeIndexEnabled;
}

private void setRefPath(String refPath){
this.refPath = refPath;
}

public String getRefPath(){
return refPath;
}

/**
* Returns the merge policy that should be used for this index.
*
@@ -35,13 +35,38 @@
import org.apache.lucene.analysis.hunspell.Dictionary;
import org.apache.lucene.analysis.hunspell.HunspellStemFilter;
import org.opensearch.common.settings.Settings;
import org.opensearch.env.Environment;
Copilot AI Mar 3, 2026

Unused import org.opensearch.env.Environment was added but is not used in this class. Please remove it to avoid checkstyle failures.

Suggested change
import org.opensearch.env.Environment;

import org.opensearch.index.IndexSettings;
import org.opensearch.indices.analysis.HunspellService;

import java.util.Locale;

/**
* The token filter factory for the hunspell analyzer
* *
Copilot AI Feb 27, 2026

There's a double asterisk "* *" on line 46 which appears to be a formatting error in the Javadoc comment. Remove the extra asterisk to maintain proper documentation formatting.

Suggested change
* *
*

* Supports hot-reload when used with {@code updateable: true} setting.
* The dictionary is loaded from either:
* <ul>
* <li>A ref_path (package ID, e.g., "pkg-1234") combined with locale for package-based dictionaries</li>
* <li>A locale (e.g., "en_US") for traditional hunspell dictionaries from config/hunspell/</li>
* </ul>
*
* <h2>Usage Examples:</h2>
* <pre>
* // Traditional locale-based (loads from config/hunspell/en_US/)
* {
* "type": "hunspell",
* "locale": "en_US"
* }
*
* // Package-based (loads from config/packages/pkg-1234/hunspell/en_US/)
* {
* "type": "hunspell",
* "ref_path": "pkg-1234",
* "locale": "en_US"
* }
* </pre>
*
*
* @opensearch.internal
*/
@@ -50,18 +75,58 @@ public class HunspellTokenFilterFactory extends AbstractTokenFilterFactory {
private final Dictionary dictionary;
private final boolean dedup;
private final boolean longestOnly;
private final AnalysisMode analysisMode;

public HunspellTokenFilterFactory(IndexSettings indexSettings, String name, Settings settings, HunspellService hunspellService) {
public HunspellTokenFilterFactory(IndexSettings indexSettings, String name, Settings settings, HunspellService hunspellService, Environment env) {
super(indexSettings, name, settings);
// Check for updateable flag - enables hot-reload support (same pattern as SynonymTokenFilterFactory)
boolean updateable = settings.getAsBoolean("updateable", false);
this.analysisMode = updateable ? AnalysisMode.SEARCH_TIME : AnalysisMode.ALL;

// Get both ref_path and locale parameters
String refPath = settings.get("ref_path"); // Package ID only (optional)
String locale = settings.get("locale", settings.get("language", settings.get("lang", null)));
if (locale == null) {
Member

Why only check this if ref_path is provided? Is this no longer always required?

Contributor Author

That scenario is covered below; I have rewritten the conditions. We first check whether ref_path (an additional parameter) is present, and if it is, we then check that locale is present.
Otherwise, when ref_path is absent, we fall back to the original locale check.

throw new IllegalArgumentException("missing [locale | language | lang] configuration for hunspell token filter");
}

dictionary = hunspellService.getDictionary(locale);
if (dictionary == null) {
throw new IllegalArgumentException(String.format(Locale.ROOT, "Unknown hunspell dictionary for locale [%s]", locale));

if (refPath != null) {
// Package-based loading: ref_path (package ID) + locale (required)
if (locale == null) {
throw new IllegalArgumentException(
"When using ref_path, the 'locale' parameter is required for hunspell token filter"
);
}

// Validate ref_path is just package ID (no slashes allowed)
if (refPath.contains("/")) {
throw new IllegalArgumentException(
String.format(Locale.ROOT,
"ref_path should contain only the package ID, not a full path. Got: [%s]. " +
"Use ref_path for package ID and locale for the dictionary locale.",
refPath)
Copilot AI Mar 3, 2026

ref_path validation is currently limited to refPath.contains("/"). This still allows path traversal / cache-key injection cases like ref_path=".." (escapes config/packages/), Windows separators ("\\"), and values containing the cache separator ":" (can break package isolation/invalidation). Tighten validation to reject . / .., any path separators, and the cache-key separator, and ensure the resolved path stays under config/packages/<packageId>.

Suggested change
// Validate ref_path is just package ID (no slashes allowed)
if (refPath.contains("/")) {
throw new IllegalArgumentException(
String.format(Locale.ROOT,
"ref_path should contain only the package ID, not a full path. Got: [%s]. " +
"Use ref_path for package ID and locale for the dictionary locale.",
refPath)
// Validate ref_path is just a package ID (no path traversal or cache-key separators)
if (".".equals(refPath) || "..".equals(refPath)
|| refPath.indexOf('/') != -1
|| refPath.indexOf('\\') != -1
|| refPath.indexOf(':') != -1) {
throw new IllegalArgumentException(
String.format(
Locale.ROOT,
"ref_path should contain only a package ID without path or cache separators. Got: [%s]. " +
"Use ref_path for package ID and locale for the dictionary locale.",
refPath
)
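
The suggested change covers the character checks but not the last point of the comment, that the resolved path must stay under config/packages/<packageId>. A minimal sketch of that containment check is below. The class `PackagePathContainment` and its method are hypothetical and not part of the PR or of HunspellService. It uses only `java.nio.file.Path` and requires the resolved directory to be a direct child of the packages root.

```java
import java.nio.file.Path;

/** Hypothetical helper, not part of the PR: resolves a package ID under config/packages. */
final class PackagePathContainment {
    private PackagePathContainment() {}

    /**
     * Resolves packageId against configDir/packages and rejects any value whose
     * normalized result is not a direct child of that directory.
     */
    static Path resolvePackageDir(Path configDir, String packageId) {
        Path packagesRoot = configDir.resolve("packages").toAbsolutePath().normalize();
        Path resolved = packagesRoot.resolve(packageId).normalize();
        // ".", "..", absolute paths and nested paths all fail this parent check.
        if (!packagesRoot.equals(resolved.getParent())) {
            throw new IllegalArgumentException("ref_path [" + packageId + "] resolves outside of config/packages");
        }
        return resolved;
    }
}
```

This is meant as a second line of defence after the character checks, so that a value that slips past them still cannot point outside the packages directory.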

);
}

// Load from package directory: config/packages/{ref_path}/hunspell/{locale}/
dictionary = hunspellService.getDictionaryFromPackage(refPath, locale, env);
if (dictionary == null) {
throw new IllegalArgumentException(
String.format(Locale.ROOT,
"Could not find hunspell dictionary for locale [%s] in package [%s]",
locale, refPath)
);
}
} else if (locale != null) {
// Traditional locale-based loading (backward compatible)
// Loads from config/hunspell/{locale}/
dictionary = hunspellService.getDictionary(locale);
if (dictionary == null) {
throw new IllegalArgumentException(
String.format(Locale.ROOT, "Unknown hunspell dictionary for locale [%s]", locale)
);
}
} else {
throw new IllegalArgumentException(
"missing [locale | language | lang] configuration for hunspell token filter"
);
Copilot AI Mar 3, 2026

In the traditional (no ref_path) branch, locale is used directly as both a cache key and a filesystem path segment, but it is not validated. With the new package-key separator logic (':'), allowing ':' or path separators in locale can lead to ambiguous cache keys and potential path traversal. Consider applying the same identifier validation to locale in the traditional branch as well (or otherwise rejecting ':' and path separators).
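
To illustrate the ambiguity this comment describes, the sketch below assumes package dictionaries are cached under a key of the form packageId + ":" + locale and traditional dictionaries under the bare locale. The exact key format is not shown in this diff, so that layout is an assumption, and the class `HunspellCacheKeys` and its methods are hypothetical.

```java
/** Hypothetical illustration, not part of the PR: cache keys and a shared identifier check. */
final class HunspellCacheKeys {
    private HunspellCacheKeys() {}

    /** Assumed key layout for package-based dictionaries. */
    static String packageKey(String packageId, String locale) {
        return packageId + ":" + locale;
    }

    /** Assumed key layout for traditional locale-based dictionaries. */
    static String localeKey(String locale) {
        return locale;
    }

    /** Rejects ".", "..", path separators and the ':' cache-key separator. */
    static boolean isSafeIdentifier(String value) {
        return value != null
            && !value.isEmpty()
            && !".".equals(value)
            && !"..".equals(value)
            && value.indexOf('/') == -1
            && value.indexOf('\\') == -1
            && value.indexOf(':') == -1;
    }
}
```

Under that assumed layout, a traditional locale of "pkg-1234:en_US" produces the same key as package "pkg-1234" with locale "en_US". Applying `isSafeIdentifier` to both ref_path and locale would rule that collision out.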

}
Copilot AI Feb 27, 2026

Missing test coverage for the new ref_path parameter in HunspellTokenFilterFactory. The existing tests in HunspellTokenFilterFactoryTests.java only cover the traditional locale-based loading, not the package-based loading with ref_path.

Add tests to verify:

  1. Loading dictionaries with ref_path and locale parameters
  2. Validation that ref_path without locale throws IllegalArgumentException
  3. Validation that ref_path containing "/" throws IllegalArgumentException
  4. Error handling when package directory doesn't exist
  5. Integration with the updateable flag for hot-reload support

Member

^ let's address this comment


dedup = settings.getAsBoolean("dedup", true);
@@ -73,6 +138,16 @@ public TokenStream create(TokenStream tokenStream) {
return new HunspellStemFilter(tokenStream, dictionary, dedup, longestOnly);
}

/**
* Returns the analysis mode for this filter.
* When {@code updateable: true} is set, returns {@code SEARCH_TIME} which enables hot-reload
* via the _reload_search_analyzers API.
*/
@Override
public AnalysisMode getAnalysisMode() {
return this.analysisMode;
}

public boolean dedup() {
return dedup;
}
@@ -119,7 +119,7 @@ public AnalysisModule(Environment environment, List<AnalysisPlugin> plugins) thr
);
}

HunspellService getHunspellService() {
public HunspellService getHunspellService() {
return hunspellService;
}

@@ -161,7 +161,7 @@ public boolean requiresAnalysisSettings() {
tokenFilters.register(
"hunspell",
requiresAnalysisSettings(
(indexSettings, env, name, settings) -> new HunspellTokenFilterFactory(indexSettings, name, settings, hunspellService)
(indexSettings, env, name, settings) -> new HunspellTokenFilterFactory(indexSettings, name, settings, hunspellService, env)
)
);
