Description
Summary
The current Auto Detect search in STTM-WEB is not reliably finding expected results when users:
- type a full/partial panktee
- type in common Roman spellings (e.g. `jo mange thakur`)
- type with misspellings / phonetic variations (e.g. `jooo menge thakoor`)
- type what they remember “loosely” instead of exact spelling
This creates a poor user experience because users often search from memory, and the search should still find the expected shabad/panktee.
We currently use Meilisearch. Meilisearch already supports typo tolerance and configurable relevancy (ranking rules + searchable attribute order), so we should improve our indexing + query strategy to better support transliteration and fuzzy matching. ([meilisearch.com]1)
Problem Statement
Current issues observed
- Full panktee search is weak
  - Searching a full or near-full line does not reliably return the correct shabad/panktee.
- Roman transliteration queries often fail
  - Examples: `jo mange thakur`, `tu data datar`
  - Users expect the known shabad to appear, but results are missing or ranked poorly.
- Mistyped / phonetic queries fail
  - Example: `jooo menge thakoor`
  - Search should still return the intended shabad.
- Search seems overly dependent on “first letter of each shabad” style matching
  - This is useful as a fallback, but should not be the primary behavior for most user searches.
Goal
Make Auto Detect search behave more like how humans search:
- If user types roman transliteration (even imperfectly), return the intended shabad.
- If user types Gurmukhi, return strong Gurbani matches.
- If user types acrostic/first-letter style, still support that as fallback.
- If user types meaning/English recall phrase, support that as lower-priority fallback.
Desired Ranking Priority (Auto Detect)
For Roman queries (Latin script)
- Strong transliteration match (exact / prefix / fuzzy) ✅ highest priority
- Gurmukhi direct match (if relevant)
- First-letter/acrostic Gurbani match
- Meaning/translation match (lower priority fallback)
For Gurmukhi queries
- Direct Gurbani match ✅ highest priority
- First-letter/acrostic Gurbani match
- Transliteration match (if useful)
- Meaning/translation match
Proposed Solution (Implementation Direction)
1) Expand indexed search fields (Meilisearch documents)
For each searchable unit (shabad / panktee / line), index multiple searchable fields, not just one.
Suggested fields (example names):
- `gurbani` (original Gurmukhi text)
- `gurbani_first_letters` (existing/derived acrostic)
- `transliteration` (current transliteration)
- `transliteration_normalized` (normalized Roman text for fuzzy matching)
- `meaning` / `translation` (English meaning / gloss)
- (optional) `transliteration_aliases` (common spellings if available)
Why: Meilisearch relevancy depends heavily on which attributes are searchable and their order. Earlier searchable attributes are treated as more relevant. ([meilisearch.com]2)
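As a rough sketch of how the field list could translate into index configuration, the snippet below builds a settings body whose `searchableAttributes` order encodes field priority, plus a small check for documents that lack a searchable field. The field names follow the suggestions above; the particular order, the `OPTIONAL_FIELDS` set, and both helper functions are assumptions for illustration, not the current STTM schema.

```python
# One possible priority order (an assumption to be tuned during QA).
# Meilisearch treats attributes earlier in this list as more relevant.
SEARCHABLE_ATTRIBUTES = [
    "gurbani",
    "transliteration",
    "transliteration_normalized",
    "transliteration_aliases",
    "gurbani_first_letters",
    "meaning",
]

# Fields a document may legitimately omit (hypothetical choice).
OPTIONAL_FIELDS = {"transliteration_aliases"}


def build_index_settings() -> dict:
    """Return a settings body that sets the searchable attribute order."""
    return {"searchableAttributes": list(SEARCHABLE_ATTRIBUTES)}


def missing_searchable_fields(document: dict) -> list:
    """List required searchable fields that are absent or empty in a document."""
    return [
        field
        for field in SEARCHABLE_ATTRIBUTES
        if field not in OPTIONAL_FIELDS and not document.get(field)
    ]
```

A global order like this mainly matters for a single-pass search; a Roman query rarely matches Gurmukhi text, so placing `gurbani` first should not crowd out transliteration matches, but that is worth confirming against the benchmark queries.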
2) Add query normalization (app-side) before sending to Meilisearch
Roman-input search quality will improve significantly if we normalize user input before querying.
Examples of normalization:
- trim extra spaces
- lowercase
- collapse repeated characters: `jooo` → `joo` / `jo` (configurable)
- normalize common phonetic variants: `thakoor` → `thakur`, `menge` → `mange` (if rule-based mapping is safe)
- remove punctuation/noise

This can be done in a conservative way (don’t over-normalize).
Note: Meilisearch typo tolerance helps, but by default it depends on word length: words shorter than 5 characters must match exactly, and a second typo is only allowed from 9 characters. Short words like `jo` and `tu` therefore get no typo tolerance, so app-side normalization is important for Roman transliteration use cases. ([meilisearch.com]1)
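The normalization steps listed above could be sketched roughly as follows. This is a minimal illustration, not a vetted rule set: `normalize_roman_query`, the `max_repeat` parameter, and the two-entry `PHONETIC_VARIANTS` map are hypothetical, and the function is only meant for Roman input (it strips every character outside a-z, so it would erase a Gurmukhi query).

```python
import re

# Hypothetical, hand-curated variant map; real entries would need review by
# someone familiar with the transliteration conventions in use.
PHONETIC_VARIANTS = {"thakoor": "thakur", "menge": "mange"}

_REPEATS = re.compile(r"([a-z])\1+")


def _collapse(token: str, max_repeat: int) -> str:
    """Cap runs of the same letter at max_repeat characters."""
    return _REPEATS.sub(
        lambda m: m.group(1) * min(len(m.group(0)), max_repeat), token
    )


def normalize_roman_query(query: str, max_repeat: int = 2) -> str:
    """Normalize a Roman-script query before sending it to the search engine."""
    # Lowercase, then turn punctuation, digits, and other noise into spaces.
    text = re.sub(r"[^a-z\s]+", " ", query.lower())
    words = []
    for token in text.split():  # split() also trims and collapses whitespace
        if token not in PHONETIC_VARIANTS:
            token = _collapse(token, max_repeat)
        # Map known variants; unknown tokens pass through unchanged.
        words.append(PHONETIC_VARIANTS.get(token, token))
    return " ".join(words)


print(normalize_roman_query("  Jooo  menge, thakoor! "))           # joo mange thakur
print(normalize_roman_query("jooo menge thakoor", max_repeat=1))   # jo mange thakur
```

The variant lookup runs before collapsing so that a mapped spelling such as `thakoor` is not damaged first. If `transliteration_normalized` is produced at index time with the same function, the query and the indexed text stay consistent even when collapsing changes legitimate double letters.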
3) Detect script type (Roman vs Gurmukhi) and run ranked search strategy
Add lightweight query classification:
- Gurmukhi query
- Roman query
- Mixed query (edge case)
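A lightweight classifier along these lines might be enough to start with. The sketch assumes Gurmukhi queries arrive as Unicode code points in the Gurmukhi block (U+0A00 to U+0A7F); if the app accepts a font-encoded ASCII form of Gurmukhi, a different check would be needed. The function name and the extra `"unknown"` label (for empty or digits-only input) are illustrative additions.

```python
# Unicode Gurmukhi block boundaries.
GURMUKHI_START, GURMUKHI_END = "\u0a00", "\u0a7f"


def detect_script(query: str) -> str:
    """Classify a query as 'gurmukhi', 'roman', 'mixed', or 'unknown'."""
    has_gurmukhi = any(GURMUKHI_START <= ch <= GURMUKHI_END for ch in query)
    has_roman = any(ch.isascii() and ch.isalpha() for ch in query)
    if has_gurmukhi and has_roman:
        return "mixed"
    if has_gurmukhi:
        return "gurmukhi"
    if has_roman:
        return "roman"
    # No letters from either script (empty string, digits, punctuation).
    return "unknown"
```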
Then run search in priority order (multi-pass or weighted merge):
Option A (recommended): Multi-pass search + merge in app layer
Run multiple queries (or searches against different fields) and merge results with explicit priority buckets.
Example for Roman query:
- Pass 1: transliteration + transliteration_normalized
- Pass 2: gurbani_first_letters
- Pass 3: meaning
- Merge + dedupe + preserve priority
This gives us deterministic behavior and avoids fighting global index settings.
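A minimal sketch of Option A for a Roman query is shown below. It builds one query object per pass using Meilisearch's `attributesToSearchOn` search parameter (available in recent versions) and then merges the per-pass hits into priority buckets. The index name, the `id` primary key, the `matched_pass` marker, and the choice to send the normalized query only in pass 1 are assumptions; the query objects are shaped so they could be sent together as the `queries` array of a multi-search request.

```python
def build_roman_passes(index_uid: str, query: str, normalized: str,
                       limit: int = 20) -> list:
    """Build one search query per priority pass, highest priority first."""
    passes = [
        (normalized, ["transliteration", "transliteration_normalized"]),
        (query, ["gurbani_first_letters"]),
        (query, ["meaning"]),
    ]
    return [
        {"indexUid": index_uid, "q": q, "attributesToSearchOn": attrs, "limit": limit}
        for q, attrs in passes
    ]


def merge_passes(pass_hits: list, id_key: str = "id", limit: int = 20) -> list:
    """Merge hits from each pass, keeping the first (highest-priority) match."""
    seen = set()
    merged = []
    for pass_number, hits in enumerate(pass_hits, start=1):
        for hit in hits:
            if hit[id_key] in seen:
                continue  # already returned by a higher-priority pass
            seen.add(hit[id_key])
            merged.append({**hit, "matched_pass": pass_number})
    return merged[:limit]
```

Because earlier passes always win, a shabad that matches on transliteration can never be outranked by a meaning-only match, which is the deterministic behavior the ticket asks for.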
Option B: Single-pass search with tuned searchable attribute order
Possible, but harder to make behave differently for Roman vs Gurmukhi queries.
4) Tune Meilisearch typo tolerance and relevancy settings
Meilisearch supports:
- typo tolerance settings
- ranking rules
- searchable attribute ordering ([meilisearch.com]1)
We should evaluate:
- enabling/tuning typo tolerance on transliteration fields
- adjusting typo thresholds (`minWordSizeForTypos`) if needed
- ensuring transliteration fields participate in search and ranking correctly
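To make the threshold discussion concrete, the sketch below models how many typos a word is allowed for given `minWordSizeForTypos` values and builds a `typoTolerance` settings body. The model is simplified (it only looks at word length), the lowered thresholds of 4 and 8 are illustrative rather than recommended, and disabling typos on `gurbani_first_letters` is an assumption that acrostic matching should stay exact.

```python
def typos_allowed(word: str, one_typo: int = 5, two_typos: int = 9) -> int:
    """Simplified model: typos permitted for a word, based on length only.

    The defaults mirror Meilisearch's default minWordSizeForTypos values.
    """
    if len(word) >= two_typos:
        return 2
    if len(word) >= one_typo:
        return 1
    return 0


def build_typo_settings(one_typo: int = 4, two_typos: int = 8) -> dict:
    """Build a settings body with adjusted typo thresholds (values are examples)."""
    if not 0 <= one_typo <= two_typos <= 255:
        raise ValueError("expected 0 <= oneTypo <= twoTypos <= 255")
    return {
        "typoTolerance": {
            "enabled": True,
            "minWordSizeForTypos": {"oneTypo": one_typo, "twoTypos": two_typos},
            # Assumption: keep first-letter (acrostic) matching exact.
            "disableOnAttributes": ["gurbani_first_letters"],
        }
    }


for word in ("jo", "tu", "menge", "thakoor"):
    print(word, typos_allowed(word))
```

With the defaults, `jo` and `tu` get no tolerance while `menge` and `thakoor` each get one typo, which is why short transliterated words depend on app-side normalization.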
5) Add observability for search quality (debug mode / logging)
For QA and tuning:
- log query
- detected script
- normalized query
- search passes run
- top N results returned
- which field matched (if available)
- ranking score (optional during dev/QA)
This will make it easier to iterate quickly and compare improvements.
(Meilisearch can return ranking scores when configured in search parameters, useful for debugging relevance tuning.) ([meilisearch.com]3)
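One way to capture the items above is a single structured log line per search. The sketch below turns on `showRankingScore` and `showMatchesPosition` in debug mode and reads the `_rankingScore` and `_matchesPosition` keys that Meilisearch adds to hits when those parameters are set. The function names, the record layout, and the `id` key are assumptions.

```python
import json


def build_search_params(query: str, limit: int = 10, debug: bool = False) -> dict:
    """Build search parameters, requesting scoring details only in debug mode."""
    params = {"q": query, "limit": limit}
    if debug:
        params["showRankingScore"] = True
        params["showMatchesPosition"] = True
    return params


def build_debug_record(query: str, script: str, normalized: str, passes: list,
                       hits: list, top_n: int = 5, id_key: str = "id") -> str:
    """Serialize one search into a JSON log line for QA and tuning."""
    record = {
        "query": query,
        "script": script,
        "normalized": normalized,
        "passes": passes,
        "top_results": [
            {
                "id": hit.get(id_key),
                "score": hit.get("_rankingScore"),
                # Keys of _matchesPosition are the attributes that matched.
                "matched_fields": sorted(hit.get("_matchesPosition", {})),
            }
            for hit in hits[:top_n]
        ],
    }
    # ensure_ascii=False keeps Gurmukhi text readable in the log.
    return json.dumps(record, ensure_ascii=False)
```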
Acceptance Criteria
Functional
- Searching `jo mange thakur` returns the expected shabad in top results (ideally top 1–3)
- Searching `tu data datar` returns the expected shabad in top results
- Searching typo-heavy Roman input like `jooo menge thakoor` still returns the intended shabad in results
- Searching a full or near-full panktee in Gurmukhi reliably returns correct result(s)
- Existing first-letter/acrostic search continues to work
- Meaning-based search still works as fallback and does not overpower direct Gurbani/transliteration matches
Ranking behavior
- Roman queries prioritize transliteration matches over first-letter matches
- Gurmukhi queries prioritize direct Gurbani matches over transliteration/meaning matches
- Results are deduplicated when same shabad matches across multiple passes/fields
Quality / Regression
- No major regression in search response time (define threshold)
- No major regression in known existing STTM search flows
- Test coverage added for representative Roman/Gurmukhi/fuzzy cases
Suggested Test Queries (Initial QA Set)
Roman exact/common spellings
- `jo mange thakur`
- `tu data datar`
- `har har naam nidhan hai`
Roman fuzzy/mistyped
- `jooo menge thakoor`
- `jo maange thakur`
- `too data datar`
Gurmukhi
- (full panktee exact)
- (partial panktee)
- (first-letter style query)
Meaning fallback
- `those who ask from You`
- `giver of gifts`
- (or other known meaning phrases)
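A fixed QA set like this could be scored with a small harness that reports how often the expected shabad lands in the top k results. The sketch below is one possible shape: `search_fn` stands in for whatever function runs the Auto Detect search and returns result IDs in rank order, and the expected IDs would have to come from the real dataset (none are assumed here).

```python
def top_k_hit_rate(search_fn, cases, k: int = 3):
    """Return (hit_rate, missed_queries) for (query, expected_id) pairs.

    search_fn(query) must return a list of result IDs in ranked order.
    """
    missed = []
    for query, expected_id in cases:
        if expected_id not in search_fn(query)[:k]:
            missed.append(query)
    hit_rate = (len(cases) - len(missed)) / len(cases) if cases else 0.0
    return hit_rate, missed
```

Running the same cases before and after each rollout step gives a simple regression signal, and the list of missed queries points at which normalization or ranking rule to look at next.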
Out of Scope (for this ticket)
- Full semantic search / embeddings
- ML-based phonetic transliteration correction
- Personalized search ranking
- Cross-language intent understanding beyond current indexed fields
Implementation Notes / Hints
- Start with a small controlled dataset (a few known shabads) for tuning.
- Compare before/after relevance using fixed benchmark queries.
- Prefer incremental rollout:
  - Add fields + indexing
  - Add query normalization
  - Add script-aware ranking / multi-pass merge
  - Tune typo tolerance + thresholds
Why this matters
Users often remember:
- a few words,
- approximate roman spelling,
- a sound-alike version,
- or a meaning snippet.
Auto Detect should feel forgiving and intuitive — especially for Sangat searching from memory.