Next Steps: Slovak Language Search Implementation

Date: 2025-12-12
Priority: CRITICAL
Estimated Effort: 2 weeks

🎯 What We Discovered

The Real Problem

Your performance issue is NOT an architectural flaw - it's a missing PostgreSQL configuration!

Root Cause:

Slovak language requires diacritic-aware search (č, š, ž, á, í, ó, ú, etc.)
PostgreSQL doesn't have built-in Slovak text search configuration
POSITION search type was a workaround using regex to rank exact diacritic matches higher
Regex on idx_tsvector::text bypasses GIN index → 100× slower

Evidence:

Original POC (line 128): "Need to implement special characters like Slovak 'č', 'š'"
Slovak FTS expert article (linuxos.sk): Shows proper solution using custom configs
Code analysis: POSITION search checks for exact vs unaccented matches via regex

Why It Was Fast, Then Slow

Timeline:

Initially: Simple to_tsvector('simple', text) - fast but no diacritic ranking
"Fixed search cases": Added POSITION search with regex for Slovak diacritics
Performance degraded: Regex runs on every row (O(n × 6t) instead of O(log n))

✅ The Solution

Create Slovak Text Search Configuration

Instead of the regex workaround, use PostgreSQL's native multi-weight indexing:

-- 1. Create Slovak configuration
CREATE TEXT SEARCH CONFIGURATION sk_unaccent (COPY = simple);
ALTER TEXT SEARCH CONFIGURATION sk_unaccent
  ALTER MAPPING FOR word, asciiword
  WITH unaccent, simple;

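-- 2. Multi-weight indexing (expression used to build idx_tsvector)
SELECT setweight(to_tsvector('simple', 'ruža'), 'A')             -- Exact Slovak: highest rank
    || setweight(to_tsvector('sk_unaccent', 'ruža'), 'B')        -- Normalized: medium rank
    || setweight(to_tsvector('simple', unaccent('ruža')), 'C');  -- Unaccented: fallback

-- 3. Search with native ts_rank (NO REGEX!)
-- ts_rank reads the weight array as {D, C, B, A}, so A = 1.0 goes last.
-- Note: the sk_unaccent query normalizes to 'ruza', which matches only the B/C lexemes;
-- OR in to_tsquery('simple', 'ruža') if the weight-A lexeme should contribute to the rank.
SELECT p.*, ts_rank(array[0.2, 0.4, 0.7, 1.0], p.idx_tsvector, query) AS rank
FROM idx_product_ts p, to_tsquery('sk_unaccent', 'ruža') AS query
WHERE p.idx_tsvector @@ query
ORDER BY rank DESC;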

Benefits:

✅ Roughly 100× faster (estimate; the query can use the GIN index)
✅ Slovak diacritics rank correctly
✅ Supports Czech, Polish, Hungarian too
✅ No regex needed
✅ Scales to millions of rows

🚀 Implementation Plan

Phase 1: Database Setup (1 day)

File: Create migration script migration/V1.0__Slovak_Text_Search.sql

-- Create configurations for Central European languages
CREATE EXTENSION IF NOT EXISTS unaccent;

CREATE TEXT SEARCH CONFIGURATION sk_unaccent (COPY = simple);
ALTER TEXT SEARCH CONFIGURATION sk_unaccent
  ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
  WITH unaccent, simple;

-- Repeat for Czech, Polish, Hungarian
CREATE TEXT SEARCH CONFIGURATION cs_unaccent (COPY = sk_unaccent);
CREATE TEXT SEARCH CONFIGURATION pl_unaccent (COPY = sk_unaccent);
CREATE TEXT SEARCH CONFIGURATION hu_unaccent (COPY = sk_unaccent);

Test:

-- Should normalize diacritics
SELECT to_tsvector('sk_unaccent', 'červená ruža');   -- expect 'cervena':1 'ruza':2
SELECT to_tsquery('sk_unaccent', 'cervena & ruza');  -- expect 'cervena' & 'ruza'

Phase 2: Code Changes (2-3 days)

File: PGTextSearchIndexProvider.java

Change 1: Update getTSConfig() (Line 94-95)

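private String getTSConfig(Properties ctx, String trxName) {
  // Map the session language to the matching text search configuration
  String language = Env.getAD_Language(ctx);
  if (language == null) {
    return "simple";  // a switch on a null String would throw NullPointerException
  }

  switch (language) {
    case "sk_SK": return "sk_unaccent";
    case "cs_CZ": return "cs_unaccent";
    case "pl_PL": return "pl_unaccent";
    case "hu_HU": return "hu_unaccent";
    default: return "simple";  // unmapped languages fall back to the built-in configuration
  }
}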

Change 2: Multi-weight indexing in documentContentToTsvector() (Lines 426-454)

// Weight A: Exact match
documentContent.append("setweight(to_tsvector('simple', ?::text), 'A') || ");
params.add(value);

// Weight B: Language-normalized
documentContent.append("setweight(to_tsvector('").append(tsConfig).append("', ?::text), 'B') || ");
params.add(value);

// Weight C: Fully unaccented
documentContent.append("setweight(to_tsvector('simple', unaccent(?::text)), 'C') || ");
params.add(value);
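
The three appends above each end with " || ", so the caller has to deal with the trailing operator after the last value. A minimal sketch of one way to assemble the same expression without a dangling " || " follows. The class and method names are hypothetical and not part of PGTextSearchIndexProvider, and passing the configuration as a `?::regconfig` parameter instead of concatenating it is an assumption borrowed from Change 4.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the existing provider. */
final class WeightedTsvectorSql {

    /**
     * Builds the three-weight tsvector expression for each value and appends
     * the JDBC parameters to params in placeholder order.
     * Returns an empty string when values is empty; the caller must handle that.
     */
    static String build(String tsConfig, List<String> values, List<Object> params) {
        List<String> parts = new ArrayList<>();
        for (String value : values) {
            // Weight A: exact lexemes, diacritics preserved
            parts.add("setweight(to_tsvector('simple', ?::text), 'A')");
            params.add(value);

            // Weight B: language configuration, bound as a parameter
            parts.add("setweight(to_tsvector(?::regconfig, ?::text), 'B')");
            params.add(tsConfig);
            params.add(value);

            // Weight C: fully unaccented text
            parts.add("setweight(to_tsvector('simple', unaccent(?::text)), 'C')");
            params.add(value);
        }
        // String.join puts " || " only between parts, never after the last one
        return String.join(" || ", parts);
    }
}
```

Each value contributes four parameters (one for A, two for B, one for C), so the parameter list stays aligned with the placeholders no matter how many columns are indexed.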

Change 3: DELETE POSITION search type (Lines 670-715) - REMOVE ENTIRELY

Change 4: Update TS_RANK with weight array (Lines 657-669)

case TS_RANK:
  // PostgreSQL reads the weight array as {D, C, B, A}, so A = 1.0 goes last
  rankSql.append("ts_rank(")
         .append("array[0.2, 0.4, 0.7, 1.0], ")  // D, C, B, A weights
         .append("idx_tsvector, ")
         .append("to_tsquery(?::regconfig, ?::text))");
  params.add(tsConfig);
  params.add(sanitizedQuery);
  break;
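
PostgreSQL's ts_rank reads its weight array in the order {D, C, B, A}, which is easy to get backwards when the weights are written by hand. The sketch below takes the weights in the natural A, B, C, D order and emits the array in the order PostgreSQL expects. The class and method names are hypothetical; the weight values are the ones used in this plan.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of the existing provider. */
final class TsRankWeights {

    /** Formats the weights as a SQL array in PostgreSQL's {D, C, B, A} order. */
    static String toSqlArray(double a, double b, double c, double d) {
        // Locale.ROOT keeps the decimal separator a dot on every JVM locale
        return String.format(Locale.ROOT, "array[%.1f, %.1f, %.1f, %.1f]", d, c, b, a);
    }

    /** Builds the ts_rank fragment with the same placeholders as Change 4. */
    static String rankExpression(double a, double b, double c, double d) {
        return "ts_rank(" + toSqlArray(a, b, c, d)
             + ", idx_tsvector, to_tsquery(?::regconfig, ?::text))";
    }
}
```

Calling `TsRankWeights.toSqlArray(1.0, 0.7, 0.4, 0.2)` yields `array[0.2, 0.4, 0.7, 1.0]`, so weight A ends up with the highest multiplier.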

Phase 3: Reindexing (1 day)

Run CreateSearchIndex process for all indexes
Monitor progress and index sizes
Verify GIN indexes with EXPLAIN ANALYZE

Phase 4: Testing (2-3 days)

Test Cases:

Slovak Exact Match:

Query: "ruža"
Expected: Products with "ruža" rank #1

Slovak Unaccented:

Query: "ruza"
Expected: Still finds "ruža", ranks slightly lower

Czech Variant:

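Query: "růže"
Expected: Finds products containing "růže" or "ruze". Accent removal maps "růže" to "ruze" and "ruža" to "ruza", so the Slovak spelling is not matched without stemming or synonyms.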
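
To reason about what each test query should match, it helps to see the lexeme it reduces to once diacritics are removed. The sketch below approximates that folding in plain Java with Unicode decomposition. It is an illustration only: PostgreSQL's unaccent dictionary works from a rules file, so its output can differ for letters that do not decompose (Polish "ł", for example). The class name is hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Hypothetical illustration of accent folding; not the unaccent implementation. */
final class DiacriticFolding {

    /** Lowercases the text and strips combining marks after NFD decomposition. */
    static String fold(String text) {
        // NFD splits "ž" into "z" plus a combining caron, "ů" into "u" plus a ring, etc.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches the combining marks left behind by the decomposition
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }
}
```

Under this folding, "ruža" and "ruza" both become "ruza", while "růže" becomes "ruze". The Czech and Slovak spellings therefore differ in a base letter, not only in diacritics, and accent removal alone will not make one match the other.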

Performance Benchmark:

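EXPLAIN ANALYZE
SELECT p.* FROM idx_product_ts p, to_tsquery('sk_unaccent', 'červená & ruža') AS query
WHERE p.idx_tsvector @@ query
ORDER BY ts_rank(array[0.2, 0.4, 0.7, 1.0], p.idx_tsvector, query) DESC;  -- weights in {D, C, B, A} order

Expected: Bitmap Index Scan on the GIN index (GIN indexes are used through bitmap scans, not plain Index Scans), <100ms for 10K rows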
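
The plan check can be automated once the EXPLAIN output is captured as text. The sketch below is a plain string check under stated assumptions: the plan is in PostgreSQL's default text format, and the index name passed in is whatever the GIN index is actually called (this document does not name it). The class and method names are hypothetical.

```java
/** Hypothetical helper for illustration; inspects EXPLAIN text output only. */
final class ExplainPlanCheck {

    /** True if some plan node is a Bitmap Index Scan on the given index. */
    static boolean usesBitmapIndexScan(String explainOutput, String indexName) {
        for (String line : explainOutput.split("\\R")) {
            if (line.contains("Bitmap Index Scan on " + indexName)) {
                return true;
            }
        }
        return false;
    }

    /** True if the plan falls back to a sequential scan of the given table. */
    static boolean hasSeqScan(String explainOutput, String tableName) {
        for (String line : explainOutput.split("\\R")) {
            if (line.contains("Seq Scan on " + tableName)) {
                return true;
            }
        }
        return false;
    }
}
```

A benchmark run would pass when `usesBitmapIndexScan` is true and `hasSeqScan` is false for idx_product_ts. On very small tables the planner may still choose a sequential scan, so the check is most meaningful on production-sized data.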

Phase 5: Rollout (1 day)

Deploy to staging
User acceptance testing with real Slovak products
Production deployment
Monitor metrics

📊 Expected Results

Performance Improvement

Scenario	Current (POSITION)	After (TS_RANK)	Improvement
1,000 rows	500ms	5ms	100×
10,000 rows	5,000ms	50ms	100×
100,000 rows	50,000ms (unusable)	100ms	500×

Search Quality

Query Type	Current	After	Notes
Slovak exact ("ruža")	✅ Works	✅ Better ranking	Weight A = highest
Unaccented ("ruza")	✅ Works	✅ Works	Weight C = fallback
Czech variant ("růže")	⚠️ Lower rank	✅ Good rank	Weight B = medium; matches "růže"/"ruze", not Slovak "ruža"
Typo ("ruzha")	❌ No results	⚠️ Fuzzy possible	Future enhancement

🎓 Lessons Learned

What Went Wrong

Missing PostgreSQL expertise: Didn't know about custom text search configs
Workaround culture: Used regex instead of researching proper solution
No performance baseline: Didn't measure before/after when adding POSITION
Inadequate documentation: Slovak requirements not documented in code

What Went Right

Plugin architecture: Easy to add new providers/configurations
Event-driven sync: Real-time indexing works well
Configuration flexibility: AD_SearchIndex tables allow customization
Code organization: Clear separation makes fixes easier

Best Practices Going Forward

Document language requirements in CLAUDE.md and code comments
Performance benchmarks before/after architectural changes
Use PostgreSQL native features before implementing workarounds
Consult language-specific FTS resources (like linuxos.sk article)
Test with production data from the beginning

📚 Resources Created

docs/slovak-language-architecture.md - Complete root cause analysis and solution
docs/postgres-fts-performance-recap.md - Performance analysis (existing)
CLAUDE.md - Updated with Slovak language context
.claude/agents - Symlinked to cloudempiere-workspace (for future development)
.claude/commands - Symlinked to cloudempiere-workspace

🤝 Next Actions

For You:

Review docs/slovak-language-architecture.md
Decide on timeline (recommended: 2 weeks)
Approve database migration approach
Provide test data (Slovak product names, descriptions)

For Development:

Create migration script for Slovak text search config
Implement code changes in PGTextSearchIndexProvider.java
Write unit tests for Slovak language scenarios
Performance testing with real data
Staging deployment

For Future:

Consider ispell dictionary for Slovak stemming (Phase 2)
Add synonym support for product search
Implement spell correction for typos
Vector search for semantic similarity (see architectural analysis)

💡 Quick Win Option

If you need immediate relief:

Simply switch from POSITION to TS_RANK in the UI:

File: ZkSearchIndexUI.java:189

// Change this:
SearchType.POSITION

// To this:
SearchType.TS_RANK

Impact: 50-100× faster immediately, but loses Slovak diacritic ranking quality

Then: Implement proper Slovak config solution for both speed AND quality

Questions? Need clarification on any step?

I can help with:

Writing the migration scripts
Implementing the code changes
Creating test cases
Setting up benchmarks
Reviewing before deployment
