Improving Auto Detect Search #1857

@indersinghkhalis

Summary

The current Auto Detect search in STTM-WEB is not reliably finding expected results when users:

  • type a full/partial panktee
  • type in common roman spellings (e.g. jo mange thakur)
  • type with misspellings / phonetic variations (e.g. jooo menge thakoor)
  • type what they remember “loosely” instead of exact spelling

This creates a poor user experience because users often search from memory, and the search should still find the expected shabad/panktee.

We currently use Meilisearch. Meilisearch already supports typo tolerance and configurable relevancy (ranking rules + searchable attribute order), so we should improve our indexing + query strategy to better support transliteration and fuzzy matching. ([meilisearch.com]1)


Problem Statement

Current issues observed

  1. Full panktee search is weak

    • Searching a full line / near-full line does not reliably return the correct shabad/panktee.
  2. Roman transliteration queries often fail

    • Example:

      • jo mange thakur
      • tu data datar
    • Users expect the known shabad to appear, but results are missing or ranked poorly.

  3. Mistyped / phonetic queries fail

    • Example:

      • jooo menge thakoor
    • Search should still return the intended shabad.

  4. Search seems overly dependent on “first letter of each shabad” style matching

    • This is useful as a fallback, but should not be the primary behavior for most user searches.

Goal

Make Auto Detect search behave more like how humans search:

  • If user types roman transliteration (even imperfectly), return the intended shabad.
  • If user types Gurmukhi, return strong Gurbani matches.
  • If user types acrostic/first-letter style, still support that as fallback.
  • If user types meaning/English recall phrase, support that as lower-priority fallback.

Desired Ranking Priority (Auto Detect)

For Roman queries (Latin script)

  1. Strong transliteration match (exact / prefix / fuzzy) ✅ highest priority
  2. Gurmukhi direct match (if relevant)
  3. First-letter/acrostic Gurbani match
  4. Meaning/translation match (lower priority fallback)

For Gurmukhi queries

  1. Direct Gurbani match ✅ highest priority
  2. First-letter/acrostic Gurbani match
  3. Transliteration match (if useful)
  4. Meaning/translation match

Proposed Solution (Implementation Direction)

1) Expand indexed search fields (Meilisearch documents)

For each searchable unit (shabad / panktee / line), index multiple searchable fields, not just one.

Suggested fields (example names):

  • gurbani (original Gurmukhi text)
  • gurbani_first_letters (existing/derived acrostic)
  • transliteration (current transliteration)
  • transliteration_normalized (normalized roman text for fuzzy matching)
  • meaning / translation (English meaning / gloss)
  • (optional) transliteration_aliases (common spellings if available)

Why: Meilisearch relevancy depends heavily on which attributes are searchable and their order. Earlier searchable attributes are treated as more relevant. ([meilisearch.com]2)
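As a rough illustration of the fields above, the sketch below builds one index document per line and lists the fields in a searchableAttributes setting. The field names are the example names suggested above; the helper names, the id/shabad_id keys, and the attribute order are assumptions for illustration, not the current STTM schema. It also assumes Gurmukhi is stored as Unicode text with words separated by spaces.

```python
from typing import Dict, List

# One possible global order (assumption). Meilisearch gives more weight to
# attributes listed earlier in searchableAttributes.
SEARCHABLE_ATTRIBUTES: List[str] = [
    "transliteration",
    "transliteration_normalized",
    "gurbani",
    "gurbani_first_letters",
    "translation",
]


def first_letters(line: str) -> str:
    """Join the first character of every whitespace-separated word."""
    return "".join(word[0] for word in line.split())


def build_document(line_id: int, shabad_id: int, gurbani: str,
                   transliteration: str, translation: str) -> Dict[str, object]:
    """Build one searchable document (hypothetical shape)."""
    return {
        "id": line_id,
        "shabad_id": shabad_id,
        "gurbani": gurbani,
        "gurbani_first_letters": first_letters(gurbani),
        "transliteration": transliteration,
        # Placeholder normalization: lowercase and single spaces only.
        # The same function used for queries should be applied here.
        "transliteration_normalized": " ".join(transliteration.lower().split()),
        "translation": translation,
    }


# Settings payload that could be sent to the index settings endpoint.
index_settings = {"searchableAttributes": SEARCHABLE_ATTRIBUTES}
```

Whatever normalization is applied to user queries should also be applied when filling transliteration_normalized, otherwise the two sides will not line up.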


2) Add query normalization (app-side) before sending to Meilisearch

Roman-input search quality will improve significantly if we normalize user input before querying.

Examples of normalization:

  • trim extra spaces

  • lowercase

  • collapse repeated characters:

    • jooo → joo / jo (configurable)
  • normalize common phonetic variants:

    • thakoor → thakur
    • menge → mange (if rule-based mapping is safe)
  • remove punctuation/noise

This can be done in a conservative way (don’t over-normalize).

Note: Meilisearch typo tolerance helps, but by default it depends on word length: words shorter than 5 characters get no typo tolerance, and two typos are only allowed in words of 9 characters or more. Short words like jo and tu therefore get no fuzzy matching, so app-side normalization is important for roman transliteration use cases. ([meilisearch.com]1)
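A conservative normalizer along these lines might look like the sketch below. The function name and the two phonetic rules are illustrative assumptions, not an agreed rule set; anything the rules do not cover (for example menge vs mange) is left to Meilisearch typo tolerance.

```python
import re

# Illustrative digraph rules only (assumption): long-vowel spellings are
# folded to a single vowel. A real rule set would need review.
PHONETIC_RULES = [("oo", "u"), ("aa", "a")]


def normalize_roman_query(query: str) -> str:
    """Normalize a Latin-script query before sending it to the search engine."""
    text = query.lower()
    # Replace punctuation, digits and other noise with spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of three or more identical characters to one ("jooo" -> "jo").
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # Fold the remaining doubled vowels ("thakoor" -> "thakur").
    for source, target in PHONETIC_RULES:
        text = text.replace(source, target)
    # Trim and collapse whitespace.
    return " ".join(text.split())


print(normalize_roman_query("jooo menge thakoor"))  # jo menge thakur
print(normalize_roman_query("too data datar"))      # tu data datar
```

Collapsing long runs before applying the digraph rules is a deliberate ordering: it keeps "jooo" from turning into "ju" while still mapping "thakoor" to "thakur".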


3) Detect script type (Roman vs Gurmukhi) and run ranked search strategy

Add lightweight query classification:

  • Gurmukhi query
  • Roman query
  • Mixed query (edge case)
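The classification above can be sketched with a Unicode range check. This assumes Gurmukhi input arrives as Unicode (block U+0A00 to U+0A7F); queries typed in a legacy ASCII font encoding would look like Roman text to this check. The function name and the choice to treat anything else as Roman are assumptions.

```python
def detect_script(query: str) -> str:
    """Classify a query as 'gurmukhi', 'roman', or 'mixed'."""
    has_gurmukhi = any("\u0a00" <= ch <= "\u0a7f" for ch in query)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in query)
    if has_gurmukhi and has_latin:
        return "mixed"
    if has_gurmukhi:
        return "gurmukhi"
    # Digits-only or empty input falls through to 'roman' (assumption).
    return "roman"
```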

Then run search in priority order (multi-pass or weighted merge):

Option A (recommended): Multi-pass search + merge in app layer

Run multiple queries (or searches against different fields) and merge results with explicit priority buckets.

Example for Roman query:

  • Pass 1: transliteration + transliteration_normalized
  • Pass 2: gurbani_first_letters
  • Pass 3: meaning
  • Merge + dedupe + preserve priority

This gives us deterministic behavior and avoids fighting global index settings.
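A minimal sketch of Option A follows. It builds one search request per pass using the attributesToSearchOn search parameter, then merges the returned hits in pass order. The pass plan, the shabad_id dedupe key, and the function names are assumptions for illustration; the actual HTTP or SDK call is left out.

```python
from typing import Dict, List

# Fields searched in each pass, highest priority first (assumed plan).
PASS_PLAN: Dict[str, List[List[str]]] = {
    "roman": [
        ["transliteration", "transliteration_normalized"],
        ["gurbani_first_letters"],
        ["translation"],
    ],
    "gurmukhi": [
        ["gurbani"],
        ["gurbani_first_letters"],
        ["transliteration"],
        ["translation"],
    ],
}


def build_search_requests(query: str, script: str, limit: int = 20) -> List[dict]:
    """One search payload per pass; mixed queries reuse the roman plan here."""
    plan = PASS_PLAN.get(script, PASS_PLAN["roman"])
    return [{"q": query, "attributesToSearchOn": attrs, "limit": limit}
            for attrs in plan]


def merge_passes(passes: List[List[dict]], key: str = "shabad_id",
                 limit: int = 20) -> List[dict]:
    """Merge hits in pass order, keeping the first occurrence of each shabad."""
    seen = set()
    merged = []
    for bucket, hits in enumerate(passes):
        for hit in hits:
            if hit[key] in seen:
                continue
            seen.add(hit[key])
            # Record which priority bucket produced the hit.
            merged.append({**hit, "bucket": bucket})
    return merged[:limit]
```

Because a hit keeps the bucket of the first pass that returned it, a transliteration match always outranks a first-letter or meaning match for the same query, regardless of per-pass scores.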

Option B: Single-pass search with tuned searchable attribute order

Possible, but harder to make behave differently for Roman vs Gurmukhi queries.


4) Tune Meilisearch typo tolerance and relevancy settings

Meilisearch supports:

  • typo tolerance settings
  • ranking rules
  • searchable attribute ordering ([meilisearch.com]1)

We should evaluate:

  • enabling/tuning typo tolerance on transliteration fields
  • adjusting typo thresholds (minWordSizeForTypos) if needed
  • ensuring transliteration fields participate in search and ranking correctly
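As a starting point for that evaluation, a typo tolerance settings payload might look like the sketch below. The thresholds are placeholder values to experiment with (the Meilisearch defaults are 5 and 9), and disabling typos on the first-letter field is only one option worth testing, not a recommendation.

```python
import json

# Candidate index settings (values are assumptions to be tuned against the
# benchmark queries).
typo_tolerance_settings = {
    "typoTolerance": {
        "enabled": True,
        # Allow one typo from 4 characters, two typos from 8 characters.
        "minWordSizeForTypos": {"oneTypo": 4, "twoTypos": 8},
        # Acrostic strings are not real words, so fuzzy matching may add noise.
        "disableOnAttributes": ["gurbani_first_letters"],
    }
}

print(json.dumps(typo_tolerance_settings, indent=2))
```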

5) Add observability for search quality (debug mode / logging)

For QA and tuning:

  • log query
  • detected script
  • normalized query
  • search passes run
  • top N results returned
  • which field matched (if available)
  • ranking score (optional during dev/QA)

This will make it easier to iterate quickly and compare improvements.

(Meilisearch returns a ranking score for each hit when showRankingScore is set in the search parameters, which is useful for debugging relevance tuning.) ([meilisearch.com]3)
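The debug record described above could be assembled roughly as follows. The record layout, logger name, and function names are assumptions; _rankingScore is the per-hit field Meilisearch adds when showRankingScore is enabled, and bucket is a hypothetical field set by the app-side merge step.

```python
import json
import logging
from typing import List, Optional

logger = logging.getLogger("autodetect_search")  # hypothetical logger name


def build_debug_record(raw_query: str, script: str, normalized_query: str,
                       passes: List[List[str]], hits: List[dict],
                       top_n: int = 5) -> dict:
    """Collect the fields listed above into one JSON-serializable record."""
    return {
        "query": raw_query,
        "script": script,
        "normalized_query": normalized_query,
        "passes": passes,
        "top_hits": [
            {
                "id": hit.get("id"),
                "bucket": hit.get("bucket"),
                "ranking_score": hit.get("_rankingScore"),
            }
            for hit in hits[:top_n]
        ],
    }


def log_search(record: dict, enabled: Optional[bool] = True) -> None:
    """Emit the record at DEBUG level; keep Gurmukhi readable in the log."""
    if enabled:
        logger.debug(json.dumps(record, ensure_ascii=False))
```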


Acceptance Criteria

Functional

  • Searching jo mange thakur returns the expected shabad in top results (ideally top 1–3)
  • Searching tu data datar returns the expected shabad in top results
  • Searching typo-heavy roman input like jooo menge thakoor still returns the intended shabad in results
  • Searching a full or near-full panktee in Gurmukhi reliably returns correct result(s)
  • Existing first-letter/acrostic search continues to work
  • Meaning-based search still works as fallback and does not overpower direct Gurbani/transliteration matches

Ranking behavior

  • Roman queries prioritize transliteration matches over first-letter matches
  • Gurmukhi queries prioritize direct Gurbani matches over transliteration/meaning matches
  • Results are deduplicated when same shabad matches across multiple passes/fields

Quality / Regression

  • No major regression in search response time (define threshold)
  • No major regression in known existing STTM search flows
  • Test coverage added for representative Roman/Gurmukhi/fuzzy cases

Suggested Test Queries (Initial QA Set)

Roman exact/common spellings

  • jo mange thakur
  • tu data datar
  • har har naam nidhan hai

Roman fuzzy/mistyped

  • jooo menge thakoor
  • jo maange thakur
  • too data datar

Gurmukhi

  • (full panktee exact)
  • (partial panktee)
  • (first-letter style query)

Meaning fallback

  • those who ask from You
  • giver of gifts (or other known meaning phrases)
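To compare before and after on a fixed query set like the one above, a small harness could report how often the expected shabad appears in the top k results. This is a sketch: the search callable stands in for whatever function wraps the real search, and the shabad ids in the example are placeholders, not real STTM ids.

```python
from typing import Callable, List, Sequence, Tuple


def hit_rate_at_k(search: Callable[[str], Sequence[int]],
                  cases: List[Tuple[str, int]], k: int = 3) -> float:
    """Fraction of cases whose expected shabad id is in the top k results."""
    if not cases:
        return 0.0
    found = 0
    for query, expected_id in cases:
        top_ids = list(search(query))[:k]
        if expected_id in top_ids:
            found += 1
    return found / len(cases)


# Example with a stubbed search function and placeholder ids.
FAKE_RESULTS = {
    "jo mange thakur": [101, 205, 309],
    "jooo menge thakoor": [205, 309, 411, 101],
}
cases = [("jo mange thakur", 101), ("jooo menge thakoor", 101)]
print(hit_rate_at_k(lambda q: FAKE_RESULTS.get(q, []), cases, k=3))  # 0.5
```

Running the same cases against the current and the changed search gives a single number per run, which makes regressions easy to spot during tuning.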

Out of Scope (for this ticket)

  • Full semantic search / embeddings
  • ML-based phonetic transliteration correction
  • Personalized search ranking
  • Cross-language intent understanding beyond current indexed fields

Implementation Notes / Hints

  • Start with small controlled dataset (few known shabads) for tuning.

  • Compare before/after relevance using fixed benchmark queries.

  • Prefer incremental rollout:

    1. Add fields + indexing
    2. Add query normalization
    3. Add script-aware ranking / multi-pass merge
    4. Tune typo tolerance + thresholds

Why this matters

Users often remember:

  • a few words,
  • approximate roman spelling,
  • a sound-alike version,
  • or a meaning snippet.

Auto Detect should feel forgiving and intuitive — especially for Sangat searching from memory.
