feat(tantivy): add language detection and language-aware tokenization #2771
Summary
This PR adds language support to the tantivy plugin:
- `whichlang` dependency added to the workspace for language detection (supports 16 languages)
- `hypr-language` crate with:
  - `detect` feature: detects language from text using `whichlang`
  - `tantivy` feature: maps ISO 639 language codes to tantivy's stemmer `Language` enum
- `language` field
- `detect_language()` and `get_tokenizer_name_for_language()` helper functions
Updates since last revision
Added comprehensive test coverage (12 tests total):
Review & Testing Checklist for Human
- `language` field and custom "multilang" tokenizer: opening existing indexes will likely fail. Verify whether migration handling or a forced reindex is needed.
- The `detect_language` and `get_tokenizer_name_for_language` functions are exported but NOT called during document indexing. The infrastructure is ready, but actual language detection during indexing is not implemented. Confirm whether this is intentional (the caller is expected to use these) or whether indexing should auto-detect.
- `SearchDocument` now has a `language: Option<String>` field (TypeScript bindings updated). Verify that downstream consumers handle this.
Suggested test plan:
- Call `detect_language()` with sample text in different languages to verify detection works
Notes
`get_tokenizer_name_for_language()`.
Link to Devin run: https://app.devin.ai/sessions/48716941fa824b5e871562d3be3c8eae
Requested by: yujonglee (@yujonglee)
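For reviewers weighing the open question of whether callers or the indexer should pick the tokenizer, here is a minimal sketch of how a caller might turn a document's optional language code into a tokenizer name. It is not the code in this PR. `StemLanguage`, `stem_language_for_code`, `tokenizer_name_for_code`, `SketchDocument`, and the `lang_<code>` naming scheme are all hypothetical. The local enum only stands in for tantivy's stemmer `Language` enum so the example compiles with the standard library alone. The use of two-letter codes and the fallback to the "multilang" tokenizer are assumptions as well.

```rust
/// Stand-in for tantivy's stemmer `Language` enum (hypothetical, local to this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StemLanguage {
    English,
    French,
    German,
    Spanish,
}

/// Hypothetical mapping from a two-letter language code to a stemmer language.
/// Codes without a known stemmer return `None`.
fn stem_language_for_code(code: &str) -> Option<StemLanguage> {
    match code.to_ascii_lowercase().as_str() {
        "en" => Some(StemLanguage::English),
        "fr" => Some(StemLanguage::French),
        "de" => Some(StemLanguage::German),
        "es" => Some(StemLanguage::Spanish),
        _ => None,
    }
}

/// Hypothetical naming scheme: `lang_<code>` when a stemmer exists,
/// otherwise the generic "multilang" tokenizer (assumed fallback).
fn tokenizer_name_for_code(code: Option<&str>) -> String {
    match code {
        Some(c) if stem_language_for_code(c).is_some() => {
            format!("lang_{}", c.to_ascii_lowercase())
        }
        _ => "multilang".to_string(),
    }
}

/// Trimmed, hypothetical stand-in for a document carrying an optional language.
struct SketchDocument {
    content: String,
    language: Option<String>,
}

/// Picks a tokenizer name from the document's optional language field.
fn tokenizer_for_document(doc: &SketchDocument) -> String {
    tokenizer_name_for_code(doc.language.as_deref())
}
```

Under these assumptions, a document whose `language` is `None` or an unrecognized code falls back to "multilang", so documents indexed without detection would still be tokenized.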