Skip to content

Latest commit

 

History

History
138 lines (99 loc) · 4.36 KB

File metadata and controls

138 lines (99 loc) · 4.36 KB

Query Preprocessing Improvements

Overview

Enhanced query preprocessing has been implemented to increase cache hit rates while maintaining 100% acceptability. The preprocessing normalizes queries before generating embeddings, making semantically equivalent queries produce more similar embeddings.

What Changed

Before

  • Basic normalization: lowercase, trim whitespace
  • Queries like "How do I change my password?" and "I want to update my password" had similarity ~0.75 (miss at 0.85 threshold)

After

  • Enhanced preprocessing with:
    • Contraction expansion (I'm → I am, don't → do not)
    • Synonym normalization (update → change, how do I → how to)
    • Filler word removal (please, really, very)
    • Request phrase removal (I want to, can you)
  • Same queries now have similarity ~0.90 (hit at 0.85 threshold)

Implementation Details

Preprocessing Steps

  1. Basic Normalization

    • Lowercase conversion
    • Whitespace normalization
    • Punctuation removal
  2. Contraction Expansion

    • Expands common contractions to full forms
    • Example: "I'm" → "I am", "don't" → "do not"
    • Improves consistency across query variations
  3. Synonym Normalization

    • Normalizes common synonyms to consistent forms
    • Example: "update", "modify", "edit" → "change"
    • Example: "how do I", "how can I" → "how to"
  4. Filler Word Removal

    • Removes polite markers: "please", "kindly"
    • Removes intensifiers: "really", "very", "quite"
    • Conservative approach - only removes words that don't affect core meaning
  5. Request Phrase Removal

    • Removes common request patterns: "I want to", "can you", "would you"
    • Focuses on core intent rather than phrasing

Code Changes

File: app/utils/similarity.py

  • Added normalize_query() function with enhanced parameter
  • Enhanced preprocessing enabled by default
  • Can be disabled by setting enhanced=False

File: app/cache/semantic_cache.py

  • Updated get_or_create_embedding() to use enhanced preprocessing
  • Embeddings now generated from normalized queries
  • Original query text still preserved for LLM calls

Expected Impact

Test Results

Based on test comparisons:

Query Pair Basic Similarity Enhanced Similarity Improvement
"How do I change my password?" vs "I want to update my password" 0.75 0.90 +14.89%
"How do I change my password?" vs "Reset my login credentials please" 0.54 0.58 +4.76%

Average improvement: +6.5% similarity

Expected Cache Hit Rate Improvement

  • Before: ~18% hit rate with 100% acceptability
  • After: Expected ~25-30% hit rate (depending on query patterns)
  • Acceptability: Maintained at 100% (preprocessing preserves semantic meaning)

Configuration

Enhanced preprocessing is enabled by default. To disable:

normalized = normalize_query(query, enhanced=False)

Testing

Run the preprocessing comparison test:

python3 test_preprocessing_comparison.py

This will show:

  • Similarity improvements for test query pairs
  • Cache hit rate improvements
  • Recommendations

Trade-offs

Benefits

  • ✅ Higher cache hit rate (more cost savings)
  • ✅ Better handling of query variations
  • ✅ Maintains semantic accuracy
  • ✅ No impact on LLM responses (original query preserved)

Considerations

  • ⚠️ Slightly more processing overhead (minimal)
  • ⚠️ May occasionally reduce similarity for queries where phrasing matters
  • ⚠️ Requires cache to be repopulated for full benefit (existing cache uses old normalization)

Recommendations

  1. Clear cache after deploying this change to repopulate with normalized embeddings:

    ./clear_redis.sh
  2. Monitor hit rates after deployment to measure actual improvement

  3. Adjust threshold if needed:

    • If hit rate increases but false positives appear, slightly increase threshold
    • If hit rate doesn't improve enough, consider lowering threshold
  4. Fine-tune preprocessing based on your specific query patterns:

    • Add domain-specific synonyms
    • Adjust filler word list
    • Modify contraction mappings

Future Enhancements

Potential improvements:

  • Domain-specific synonym dictionaries
  • Query expansion (add related terms)
  • Intent classification before preprocessing
  • A/B testing framework for preprocessing strategies