feat(tantivy): add language detection and language-aware tokenization #2771

devin-ai-integration · 2026-01-03T14:01:22Z

feat(tantivy): add language detection and language-aware tokenization

Summary

This PR adds language support to the tantivy plugin by:

- Adding whichlang dependency to the workspace for language detection (supports 16 languages)
- Extending hypr-language crate with:
  - detect feature: Detects language from text using whichlang
  - tantivy feature: Maps ISO639 language codes to tantivy's stemmer Language enum
- Updating the tantivy plugin schema to include a language field
- Registering 18 language-specific tokenizers with stemming support (Arabic, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Turkish)
- Exporting detect_language() and get_tokenizer_name_for_language() helper functions
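
To illustrate the ISO639-to-stemmer mapping described above, here is a minimal sketch. It is not the code from this PR: `StemmerLanguage` is a local stand-in for tantivy's stemmer Language enum so the example compiles without external crates, `stemmer_for_iso639` is a hypothetical name, and the use of two-letter ISO 639-1 codes is an assumption (the PR only says "ISO639").

```rust
/// Local stand-in for tantivy's stemmer `Language` enum, listing the 18
/// languages named in this PR. Assumption: the real code maps to the
/// tantivy type directly.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StemmerLanguage {
    Arabic, Danish, Dutch, English, Finnish, French,
    German, Greek, Hungarian, Italian, Norwegian, Portuguese,
    Romanian, Russian, Spanish, Swedish, Tamil, Turkish,
}

/// Hypothetical helper: map an ISO 639-1 code (case-insensitive) to a
/// stemmer language, or `None` when no stemmer exists for it.
pub fn stemmer_for_iso639(code: &str) -> Option<StemmerLanguage> {
    use StemmerLanguage::*;
    let lang = match code.to_ascii_lowercase().as_str() {
        "ar" => Arabic,
        "da" => Danish,
        "nl" => Dutch,
        "en" => English,
        "fi" => Finnish,
        "fr" => French,
        "de" => German,
        "el" => Greek,
        "hu" => Hungarian,
        "it" => Italian,
        "no" => Norwegian,
        "pt" => Portuguese,
        "ro" => Romanian,
        "ru" => Russian,
        "es" => Spanish,
        "sv" => Swedish,
        "ta" => Tamil,
        "tr" => Turkish,
        // Chinese, Hindi, Japanese, Korean, Vietnamese, and anything else
        // have no stemmer in this list.
        _ => return None,
    };
    Some(lang)
}
```

Returning `Option` rather than a default language leaves the fallback decision to the caller, which matches the PR's description of unsupported languages being routed to a separate tokenizer.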

Updates since last revision

Added comprehensive test coverage (12 tests total):

Language detection tests for English, Spanish, French, German, Chinese, and Japanese
Tokenizer name mapping tests for all 18 supported languages
Tests verifying unsupported languages (Chinese, Japanese, Korean, Hindi, Vietnamese) fall back to "multilang" tokenizer
Schema validation tests for the new language field
Tokenizer registration tests
English stemmer functionality test verifying "running" → "run", "jumps" → "jump", "quickly" → "quick"

Review & Testing Checklist for Human

Schema migration: Existing search indexes were created with a different schema than the one this PR defines (which adds the language field and uses the custom "multilang" tokenizer). Opening existing indexes will likely fail. Verify whether migration handling or a forced reindex is needed.
Incomplete wiring: The detect_language and get_tokenizer_name_for_language functions are exported but NOT called during document indexing. The infrastructure is ready, but language detection during indexing is not implemented. Confirm whether this is intentional (the caller is expected to use these functions) or whether indexing should auto-detect.
Breaking change: SearchDocument now has language: Option<String> field (TypeScript bindings updated). Verify downstream consumers handle this.
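
For the "incomplete wiring" item, the following sketch shows one way a caller could pick a tokenizer per document at index time. Everything except the `language: Option<String>` field and the "multilang" fallback name is an assumption: the `SearchDocument` here is a reduced stand-in, `tokenizer_for_document` is a hypothetical helper, and the `lang_<code>` naming scheme is invented for illustration (the PR does not state what names get_tokenizer_name_for_language() returns).

```rust
/// Reduced stand-in for the plugin's `SearchDocument`. Only the
/// `language: Option<String>` field is confirmed by the PR description.
pub struct SearchDocument {
    pub content: String,
    pub language: Option<String>,
}

/// Fallback tokenizer name taken from the PR description.
const FALLBACK_TOKENIZER: &str = "multilang";

/// ISO 639-1 codes for the 18 languages with stemming support.
/// Assumption: two-letter codes are what the document carries.
const STEMMED_CODES: [&str; 18] = [
    "ar", "da", "nl", "en", "fi", "fr", "de", "el", "hu",
    "it", "no", "pt", "ro", "ru", "es", "sv", "ta", "tr",
];

/// Hypothetical helper: choose a tokenizer name for a document.
/// The `lang_<code>` naming scheme is an illustration only.
pub fn tokenizer_for_document(doc: &SearchDocument) -> String {
    match doc.language.as_deref().map(str::to_ascii_lowercase) {
        // Known stemmed language: use its language-specific tokenizer.
        Some(code) if STEMMED_CODES.iter().any(|&c| c == code.as_str()) => {
            format!("lang_{code}")
        }
        // Missing or unsupported language: fall back to "multilang".
        _ => FALLBACK_TOKENIZER.to_string(),
    }
}
```

A document with `language: None` takes the same path as an unsupported language, so consumers that have not yet started populating the new field would keep the current non-stemming behavior.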

Suggested test plan:

Delete existing search index and verify fresh index creation works
Test that search still functions correctly with the new schema
Manually call detect_language() with sample text in different languages to verify detection works
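
If a forced reindex turns out to be the preferred answer to the schema migration question, the first step of the test plan could be automated along these lines. This is only a sketch using the standard library: the marker file name, the version string, and the `reset_index_if_stale` function are all hypothetical and do not appear in the PR.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical marker file and version value; neither exists in the PR.
const MARKER_FILE: &str = "schema_version";
const CURRENT_SCHEMA_VERSION: &str = "2";

/// Wipe the index directory when its recorded schema version is missing or
/// different from the current one, then record the current version.
/// Returns `Ok(true)` when the caller must (re)build the index, including
/// the case where the directory did not exist yet.
pub fn reset_index_if_stale(index_dir: &Path) -> io::Result<bool> {
    let marker = index_dir.join(MARKER_FILE);
    // A missing or unreadable marker is treated as "stale".
    let stored = fs::read_to_string(&marker).ok();
    if stored.as_deref().map(str::trim) == Some(CURRENT_SCHEMA_VERSION) {
        return Ok(false);
    }
    if index_dir.exists() {
        // Drop all old index files so the new schema starts clean.
        fs::remove_dir_all(index_dir)?;
    }
    fs::create_dir_all(index_dir)?;
    fs::write(&marker, CURRENT_SCHEMA_VERSION)?;
    Ok(true)
}
```

The trade-off is that all documents must be re-indexed after an upgrade, which is simple but may be slow for large indexes; an in-place migration would avoid that at the cost of more code.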

Notes

The language-specific tokenizers with stemming are registered but the schema currently uses "multilang" (no stemming) for all text fields. To use language-specific stemming, the caller would need to use the appropriate tokenizer name from get_tokenizer_name_for_language().
Languages detected by whichlang but without tantivy stemmer support (Chinese, Hindi, Japanese, Korean, Vietnamese) will fall back to "multilang" tokenizer.

Link to Devin run: https://app.devin.ai/sessions/48716941fa824b5e871562d3be3c8eae
Requested by: yujonglee (@yujonglee)

- Add whichlang dependency to workspace for language detection
- Extend hypr-language crate with detect feature using whichlang
- Add tantivy feature to hypr-language for stemmer language mapping
- Update tantivy plugin schema to include language field
- Register language-specific tokenizers with stemming support
- Add detect_language and get_tokenizer_name_for_language helpers
- Support 18 languages with stemming: Arabic, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Turkish

Co-Authored-By: yujonglee <[email protected]>

devin-ai-integration · 2026-01-03T14:01:26Z

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

Address comments on this PR that start with 'DevinAI' or '@devin'.
Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

Disable automatic comment and CI monitoring

netlify · 2026-01-03T14:02:16Z

✅ Deploy Preview for hyprnote canceled.

Name	Link
🔨 Latest commit	`b75f673`
🔍 Latest deploy log	https://app.netlify.com/projects/hyprnote/deploys/6959225f929159000895f929

netlify · 2026-01-03T14:02:16Z

✅ Deploy Preview for howto-fix-macos-audio-selection canceled.

Name	Link
🔨 Latest commit	`b75f673`
🔍 Latest deploy log	https://app.netlify.com/projects/howto-fix-macos-audio-selection/deploys/6959225feccf100008e44ea4

netlify · 2026-01-03T14:02:21Z

✅ Deploy Preview for hyprnote-storybook canceled.

Name	Link
🔨 Latest commit	`b75f673`
🔍 Latest deploy log	https://app.netlify.com/projects/hyprnote-storybook/deploys/6959225fd02b190009b71247

…enization

- Add tests for language detection (English, Spanish, French, German, Chinese, Japanese)
- Add tests for tokenizer name mapping for supported languages
- Add tests for unsupported languages falling back to multilang tokenizer
- Add tests for schema language field
- Add tests for tokenizer registration
- Add test for English stemmer functionality

Co-Authored-By: yujonglee <[email protected]>

Co-Authored-By: yujonglee <[email protected]>

devin-ai-integration bot assigned yujonglee Jan 3, 2026

devin-ai-integration bot requested a review from yujonglee January 3, 2026 14:01

devin-ai-integration bot and others added 2 commits January 3, 2026 14:05

chore(tantivy): update TypeScript bindings with language field

b75f673

Co-Authored-By: yujonglee <[email protected]>

yujonglee merged commit 10b6f3e into main Jan 3, 2026
25 of 26 checks passed

yujonglee deleted the devin/1767448220-tantivy-language-support branch January 3, 2026 14:11
