@devin-ai-integration devin-ai-integration bot commented Jan 3, 2026

feat(tantivy): add language detection and language-aware tokenization

Summary

This PR adds language support to the tantivy plugin by:

  1. Adding the whichlang dependency to the workspace for language detection (whichlang supports 16 languages)
  2. Extending the hypr-language crate with two features:
    • detect feature: Detects the language of a text using whichlang
    • tantivy feature: Maps ISO 639 language codes to tantivy's stemmer Language enum
  3. Updating the tantivy plugin schema to include a language field
  4. Registering 18 language-specific tokenizers with stemming support (Arabic, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Turkish)
  5. Exporting detect_language() and get_tokenizer_name_for_language() helper functions
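
As a rough illustration of the mapping and fallback described above, here is a minimal std-only sketch. It is not the code in this PR: the function name `tokenizer_name_for`, the `lang_<code>` naming scheme, and the use of two-letter ISO 639-1 codes are assumptions made for the example. Only the "multilang" fallback name and the list of 18 stemmed languages come from this description.

```rust
/// Tokenizer used when no language-specific stemmer is available.
/// The name "multilang" is taken from the PR description.
const FALLBACK_TOKENIZER: &str = "multilang";

/// ISO 639-1 codes for the 18 languages listed as having stemming support.
const STEMMED_LANGUAGES: [&str; 18] = [
    "ar", "da", "nl", "en", "fi", "fr", "de", "el", "hu",
    "it", "no", "pt", "ro", "ru", "es", "sv", "ta", "tr",
];

/// Hypothetical helper: maps a language code to a tokenizer name.
/// The `lang_<code>` naming scheme is an assumption for illustration only.
fn tokenizer_name_for(code: &str) -> String {
    let code = code.to_ascii_lowercase();
    if STEMMED_LANGUAGES.iter().any(|&known| known == code.as_str()) {
        // A stemmer exists for this language: use its dedicated tokenizer.
        format!("lang_{code}")
    } else {
        // Detected but unsupported (e.g. "zh", "ja"): fall back to no stemming.
        FALLBACK_TOKENIZER.to_string()
    }
}
```

Under these assumptions, `tokenizer_name_for("en")` yields a language-specific name, while `tokenizer_name_for("zh")` yields "multilang", mirroring the fallback behaviour the tests are described as checking.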

Updates since last revision

Added comprehensive test coverage (12 tests total):

  • Language detection tests for English, Spanish, French, German, Chinese, and Japanese
  • Tokenizer name mapping tests for all 18 supported languages
  • Tests verifying that unsupported languages (Chinese, Japanese, Korean, Hindi, Vietnamese) fall back to the "multilang" tokenizer
  • Schema validation tests for the new language field
  • Tokenizer registration tests
  • English stemmer functionality test verifying "running" → "run", "jumps" → "jump", "quickly" → "quick"

Review & Testing Checklist for Human

  • Schema migration: The schema has changed (new language field, custom "multilang" tokenizer), so opening existing search indexes will likely fail. Verify whether migration handling or a forced reindex is needed.
  • Incomplete wiring: The detect_language and get_tokenizer_name_for_language functions are exported but NOT called during document indexing. The infrastructure is in place, but language detection at indexing time is not implemented. Confirm whether this is intentional (callers are expected to invoke these helpers) or whether indexing should auto-detect.
  • Breaking change: SearchDocument now has a language: Option<String> field (TypeScript bindings updated). Verify that downstream consumers handle this.
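
To make the "incomplete wiring" and "breaking change" items concrete, the following sketch shows one way a caller could combine detection and tokenizer lookup before indexing. Everything here is hypothetical: `prepare_document` is not part of this PR, the `SearchDocument` struct is a simplified stand-in (only the `language: Option<String>` field is taken from the description), and the detector and tokenizer lookup are passed in as closures so the example does not depend on the real signatures of detect_language() or get_tokenizer_name_for_language().

```rust
/// Simplified stand-in for the plugin's document type. Only the
/// `language: Option<String>` field comes from the PR description.
#[derive(Debug)]
struct SearchDocument {
    content: String,
    language: Option<String>,
}

/// Hypothetical helper: tags a document with its detected language and
/// returns the tokenizer name it should be indexed with.
fn prepare_document<D, T>(content: &str, detect: D, tokenizer_for: T) -> (SearchDocument, String)
where
    D: Fn(&str) -> Option<String>, // stand-in for a language detector
    T: Fn(&str) -> String,         // stand-in for a tokenizer-name lookup
{
    let language = detect(content);
    let tokenizer = match language.as_deref() {
        Some(code) => tokenizer_for(code),
        // Detection failed: use the non-stemming default tokenizer.
        None => "multilang".to_string(),
    };
    let doc = SearchDocument {
        content: content.to_string(),
        language,
    };
    (doc, tokenizer)
}
```

Because `language` is optional, downstream consumers need a defined behaviour for `None`; in this sketch that case simply maps to the "multilang" tokenizer.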

Suggested test plan:

  1. Delete existing search index and verify fresh index creation works
  2. Test that search still functions correctly with the new schema
  3. Manually call detect_language() with sample text in different languages to verify detection works

Notes

  • The language-specific tokenizers with stemming are registered, but the schema currently uses "multilang" (no stemming) for all text fields. To get language-specific stemming, the caller needs to use the tokenizer name returned by get_tokenizer_name_for_language().
  • Languages that whichlang can detect but for which tantivy has no stemmer (Chinese, Hindi, Japanese, Korean, Vietnamese) fall back to the "multilang" tokenizer.

Link to Devin run: https://app.devin.ai/sessions/48716941fa824b5e871562d3be3c8eae
Requested by: yujonglee (@yujonglee)

- Add whichlang dependency to workspace for language detection
- Extend hypr-language crate with detect feature using whichlang
- Add tantivy feature to hypr-language for stemmer language mapping
- Update tantivy plugin schema to include language field
- Register language-specific tokenizers with stemming support
- Add detect_language and get_tokenizer_name_for_language helpers
- Support 18 languages with stemming: Arabic, Danish, Dutch, English,
  Finnish, French, German, Greek, Hungarian, Italian, Norwegian,
  Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Turkish

Co-Authored-By: yujonglee <[email protected]>
netlify bot commented Jan 3, 2026

Deploy Preview for hyprnote canceled.

Name Link
🔨 Latest commit b75f673
🔍 Latest deploy log https://app.netlify.com/projects/hyprnote/deploys/6959225f929159000895f929

netlify bot commented Jan 3, 2026

Deploy Preview for howto-fix-macos-audio-selection canceled.

Name Link
🔨 Latest commit b75f673
🔍 Latest deploy log https://app.netlify.com/projects/howto-fix-macos-audio-selection/deploys/6959225feccf100008e44ea4

netlify bot commented Jan 3, 2026

Deploy Preview for hyprnote-storybook canceled.

Name Link
🔨 Latest commit b75f673
🔍 Latest deploy log https://app.netlify.com/projects/hyprnote-storybook/deploys/6959225fd02b190009b71247

devin-ai-integration bot and others added 2 commits January 3, 2026 14:05
…enization

- Add tests for language detection (English, Spanish, French, German, Chinese, Japanese)
- Add tests for tokenizer name mapping for supported languages
- Add tests for unsupported languages falling back to multilang tokenizer
- Add tests for schema language field
- Add tests for tokenizer registration
- Add test for English stemmer functionality

Co-Authored-By: yujonglee <[email protected]>
@yujonglee yujonglee merged commit 10b6f3e into main Jan 3, 2026
25 of 26 checks passed
@yujonglee yujonglee deleted the devin/1767448220-tantivy-language-support branch January 3, 2026 14:11