Date: 2025-12-12 Priority: CRITICAL Estimated Effort: 2 weeks
Your performance issue is NOT an architectural flaw - it's a missing PostgreSQL configuration!
Root Cause:
- Slovak language requires diacritic-aware search (č, š, ž, á, í, ó, ú, etc.)
- PostgreSQL doesn't ship a built-in Slovak text search configuration
- POSITION search type was a workaround using regex to rank exact diacritic matches higher
- Regex on `idx_tsvector::text` bypasses the GIN index → 100× slower
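To make "diacritic-aware" concrete, here is a minimal Java sketch of accent folding, the normalization that a regex-free search relies on. It is an illustration only: `DiacriticFolding` and `fold` are hypothetical names, and it uses the JDK's `java.text.Normalizer` rather than PostgreSQL's `unaccent` dictionary, whose rule file also covers letters that have no Unicode decomposition (for example Polish `ł`).

```java
import java.text.Normalizer;

// Hypothetical helper, not part of the codebase discussed here.
final class DiacriticFolding {

    // Decompose accented letters into base letter + combining mark (NFD),
    // then drop the combining marks. Letters without a canonical
    // decomposition (such as U+0142, Polish l with stroke) pass through unchanged.
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

With this sketch, `fold("ruža")` yields `ruza` and `fold("červená")` yields `cervena`, which is the pairing the search has to treat as "same word, lower rank".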
Evidence:
- Original POC (line 128): "Need to implement special characters like Slovak 'č', 'š'"
- Slovak FTS expert article (linuxos.sk): Shows proper solution using custom configs
- Code analysis: POSITION search checks for exact vs unaccented matches via regex
Timeline:
- Initially: Simple `to_tsvector('simple', text)` - fast but no diacritic ranking
- "Fixed search cases": Added POSITION search with regex for Slovak diacritics
- Performance degraded: Regex runs on every row (O(n × 6t) instead of O(log n))
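The cost difference in the last bullet can be illustrated outside the database. The sketch below is a simplified in-memory model, not PostgreSQL's GIN implementation: `ScanVersusIndex` and its methods are hypothetical names, the regex scan touches every row on every query, and the inverted index is built once and then answers a term with a single map lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical model of "regex on every row" versus "index lookup".
final class ScanVersusIndex {

    // Linear scan: the pattern is evaluated against every row, so the
    // work per query grows with the number of rows.
    static List<Integer> regexScan(List<String> rows, String term) {
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(term) + "\\b");
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            if (pattern.matcher(rows.get(i)).find()) {
                hits.add(i);
            }
        }
        return hits;
    }

    // Inverted index (term -> row ids), built once up front.
    static Map<String, List<Integer>> buildIndex(List<String> rows) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < rows.size(); i++) {
            // De-duplicate words so a row id is recorded once per term.
            for (String word : new LinkedHashSet<>(Arrays.asList(rows.get(i).split("\\s+")))) {
                index.computeIfAbsent(word, k -> new ArrayList<>()).add(i);
            }
        }
        return index;
    }

    // Query time: one map lookup, independent of how many rows do not match.
    static List<Integer> indexLookup(Map<String, List<Integer>> index, String term) {
        return index.getOrDefault(term, Collections.emptyList());
    }
}
```

Both paths return the same row ids. The difference is only where the work happens: at query time for the scan, at index-build time for the lookup.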
Instead of the regex workaround, use PostgreSQL's native multi-weight indexing:
-- 1. Create Slovak configuration
CREATE TEXT SEARCH CONFIGURATION sk_unaccent (COPY = simple);
ALTER TEXT SEARCH CONFIGURATION sk_unaccent
ALTER MAPPING FOR word, asciiword
WITH unaccent, simple;
-- 2. Multi-weight indexing
setweight(to_tsvector('simple', 'ruža'), 'A') -- Exact Slovak: highest rank
|| setweight(to_tsvector('sk_unaccent', 'ruža'), 'B') -- Normalized: medium rank
|| setweight(to_tsvector('simple', unaccent('ruža')), 'C') -- Unaccented: fallback
-- 3. Search with native ts_rank (NO REGEX!)
-- Note: ts_rank reads its weight array in {D, C, B, A} order
SELECT idx_product_ts.*, ts_rank(array[0.2, 0.4, 0.7, 1.0], idx_tsvector, query) AS rank
FROM idx_product_ts, to_tsquery('sk_unaccent', 'ruža') AS query
WHERE idx_tsvector @@ query
ORDER BY rank DESC;

Benefits:
- ✅ 100× faster (uses GIN index)
- ✅ Slovak diacritics rank correctly
- ✅ Supports Czech, Polish, Hungarian too
- ✅ No regex needed
- ✅ Scales to millions of rows
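One detail that is easy to get wrong when passing custom weights to `ts_rank`: PostgreSQL reads the array as `{D-weight, C-weight, B-weight, A-weight}`, so the highest-priority weight goes last. The helper below is a hypothetical sketch (the class and method names are not from the codebase) that takes weights in the natural A-to-D order and emits the array literal in the order `ts_rank` expects.

```java
import java.util.Locale;

// Hypothetical helper for building the ts_rank weight array literal.
final class TsRankWeights {

    // Arguments are given in A, B, C, D order (most to least important);
    // the SQL literal is emitted as {D, C, B, A}, which is what ts_rank reads.
    static String toSqlArray(double a, double b, double c, double d) {
        return String.format(Locale.ROOT, "array[%.1f, %.1f, %.1f, %.1f]", d, c, b, a);
    }
}
```

For the weights used in this document (A = 1.0, B = 0.7, C = 0.4, D = 0.2), `TsRankWeights.toSqlArray(1.0, 0.7, 0.4, 0.2)` produces `array[0.2, 0.4, 0.7, 1.0]`.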
File: Create migration script migration/V1.0__Slovak_Text_Search.sql
-- Create configurations for Central European languages
CREATE EXTENSION IF NOT EXISTS unaccent;
CREATE TEXT SEARCH CONFIGURATION sk_unaccent (COPY = simple);
ALTER TEXT SEARCH CONFIGURATION sk_unaccent
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
WITH unaccent, simple;
-- Repeat for Czech, Polish, Hungarian
CREATE TEXT SEARCH CONFIGURATION cs_unaccent (COPY = sk_unaccent);
CREATE TEXT SEARCH CONFIGURATION pl_unaccent (COPY = sk_unaccent);
CREATE TEXT SEARCH CONFIGURATION hu_unaccent (COPY = sk_unaccent);

Test:
-- Should normalize diacritics
SELECT to_tsvector('sk_unaccent', 'červená ruža');
SELECT to_tsquery('sk_unaccent', 'cervena & ruza');

File: PGTextSearchIndexProvider.java
Change 1: Update getTSConfig() (Lines 94-95)
private String getTSConfig(Properties ctx, String trxName) {
    String language = Env.getAD_Language(ctx);
    if (language == null) {
        return "simple"; // switch on a null String would throw NullPointerException
    }
    switch (language) {
        case "sk_SK": return "sk_unaccent";
        case "cs_CZ": return "cs_unaccent";
        case "pl_PL": return "pl_unaccent";
        case "hu_HU": return "hu_unaccent";
        default: return "simple";
    }
}

Change 2: Multi-weight indexing in documentContentToTsvector() (Lines 426-454)
// Weight A: Exact match
documentContent.append("setweight(to_tsvector('simple', ?::text), 'A') || ");
params.add(value);
// Weight B: Language-normalized
documentContent.append("setweight(to_tsvector('").append(tsConfig).append("', ?::text), 'B') || ");
params.add(value);
// Weight C: Fully unaccented
documentContent.append("setweight(to_tsvector('simple', unaccent(?::text)), 'C') || ");
params.add(value);

Change 3: DELETE POSITION search type (Lines 670-715) - REMOVE ENTIRELY
Change 4: Update TS_RANK with weight array (Lines 657-669)
case TS_RANK:
    rankSql.append("ts_rank(")
           .append("array[0.2, 0.4, 0.7, 1.0], ") // ts_rank expects {D, C, B, A} order
           .append("idx_tsvector, ")
           .append("to_tsquery(?::regconfig, ?::text))");
    params.add(tsConfig);
    params.add(sanitizedQuery);
    break;

- Run `CreateSearchIndex` process for all indexes
- Monitor progress and index sizes
- Verify GIN indexes with `EXPLAIN ANALYZE`
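When checking the plan, keep in mind that PostgreSQL reads GIN indexes through bitmap scans, so a healthy plan shows a `Bitmap Index Scan on <index>` node rather than a plain index scan, and no `Seq Scan` on the search table. A minimal sketch of automating that check on captured `EXPLAIN` text follows. `PlanCheck` is a hypothetical helper, and the index name you pass in is whatever your schema actually uses.

```java
// Hypothetical helper: inspects EXPLAIN text that the caller has already
// fetched (for example via JDBC); it does not talk to the database itself.
final class PlanCheck {

    // True if the plan contains a bitmap scan on the given index,
    // which is how PostgreSQL uses a GIN index.
    static boolean usesBitmapIndexScan(String explainOutput, String indexName) {
        return explainOutput.contains("Bitmap Index Scan on " + indexName);
    }

    // True if the plan falls back to scanning the whole table.
    // Also matches "Parallel Seq Scan on <table>".
    static boolean hasSeqScan(String explainOutput, String tableName) {
        return explainOutput.contains("Seq Scan on " + tableName);
    }
}
```

A regression test could then assert `usesBitmapIndexScan(plan, indexName) && !hasSeqScan(plan, "idx_product_ts")` after each reindex.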
Test Cases:
- Slovak Exact Match: Query: "ruža". Expected: products with "ruža" rank #1.
- Slovak Unaccented: Query: "ruza". Expected: still finds "ruža", ranked slightly lower.
- Czech Variant: Query: "růže". Expected: finds Czech "růže". Note that "růže" normalizes to "ruze" while Slovak "ruža" normalizes to "ruza", so matching both would additionally need stemming or synonyms.
- Performance Benchmark:
      EXPLAIN ANALYZE
      SELECT idx_product_ts.*
      FROM idx_product_ts, to_tsquery('sk_unaccent', 'červená & ruža') AS query
      WHERE idx_tsvector @@ query
      ORDER BY ts_rank(array[0.2, 0.4, 0.7, 1.0], idx_tsvector, query) DESC;
  Expected: Bitmap Index Scan on the GIN index, <100ms for 10K rows
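The ranking expectations in the first two test cases can be written down as an executable model before any database is involved. The sketch below is an assumption-laden illustration, not the PostgreSQL ranking function: `ExpectedRanking` is a hypothetical class, it scores a single query term, and it reuses this document's weights (A = 1.0 for an exact match, C = 0.4 for a match after accent folding).

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical model of the expected ordering, for use in unit tests.
final class ExpectedRanking {

    static final double EXACT = 1.0;      // weight A in this document
    static final double UNACCENTED = 0.4; // weight C in this document

    // Strip combining marks after NFD decomposition.
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    // Best score of any word in the product name against one query term:
    // exact match beats accent-folded match, which beats no match (0.0).
    static double score(String productName, String queryTerm) {
        String query = queryTerm.toLowerCase(Locale.ROOT);
        double best = 0.0;
        for (String word : productName.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.equals(query)) {
                return EXACT;
            }
            if (fold(word).equals(fold(query))) {
                best = UNACCENTED;
            }
        }
        return best;
    }
}
```

Under this model, a product named "červená ruža" scores 1.0 for the query "ruža", 0.4 for "ruza", and 0.0 for "růže", because "růže" folds to "ruze" rather than "ruza".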
- Deploy to staging
- User acceptance testing with real Slovak products
- Production deployment
- Monitor metrics
| Scenario | Current (POSITION) | After (TS_RANK) | Improvement |
|---|---|---|---|
| 1,000 rows | 500ms | 5ms | 100× |
| 10,000 rows | 5,000ms | 50ms | 100× |
| 100,000 rows | 50,000ms (unusable) | 100ms | 500× |
| Query Type | Current | After | Notes |
|---|---|---|---|
| Slovak exact ("ruža") | ✅ Works | ✅ Better ranking | Weight A = highest |
| Unaccented ("ruza") | ✅ Works | ✅ Works | Weight C = fallback |
| Czech variant ("růže") | | ✅ Good rank | Weight B = medium |
| Typo ("ruzha") | ❌ No results | | Future enhancement |
What went wrong:
- Missing PostgreSQL expertise: Didn't know about custom text search configs
- Workaround culture: Used regex instead of researching proper solution
- No performance baseline: Didn't measure before/after when adding POSITION
- Inadequate documentation: Slovak requirements not documented in code
What worked well:
- Plugin architecture: Easy to add new providers/configurations
- Event-driven sync: Real-time indexing works well
- Configuration flexibility: AD_SearchIndex tables allow customization
- Code organization: Clear separation makes fixes easier
Recommendations:
- Document language requirements in CLAUDE.md and code comments
- Run performance benchmarks before and after architectural changes
- Use PostgreSQL native features before implementing workarounds
- Consult language-specific FTS resources (like linuxos.sk article)
- Test with production data from the beginning
- `docs/slovak-language-architecture.md` - Complete root cause analysis and solution
- `docs/postgres-fts-performance-recap.md` - Performance analysis (existing)
- `CLAUDE.md` - Updated with Slovak language context
- `.claude/agents` - Symlinked to cloudempiere-workspace (for future development)
- `.claude/commands` - Symlinked to cloudempiere-workspace
For You:
- Review `docs/slovak-language-architecture.md`
- Decide on timeline (recommended: 2 weeks)
- Approve database migration approach
- Provide test data (Slovak product names, descriptions)
For Development:
- Create migration script for Slovak text search config
- Implement code changes in `PGTextSearchIndexProvider.java`
- Write unit tests for Slovak language scenarios
- Performance testing with real data
- Staging deployment
For Future:
- Consider ispell dictionary for Slovak stemming (Phase 2)
- Add synonym support for product search
- Implement spell correction for typos
- Vector search for semantic similarity (see architectural analysis)
If you need immediate relief:
Simply switch from POSITION to TS_RANK in the UI:
File: ZkSearchIndexUI.java:189
// Change this:
SearchType.POSITION
// To this:
SearchType.TS_RANK

Impact: 50-100× faster immediately, but loses Slovak diacritic ranking quality
Then: Implement proper Slovak config solution for both speed AND quality
Questions? Need clarification on any step?
I can help with:
- Writing the migration scripts
- Implementing the code changes
- Creating test cases
- Setting up benchmarks
- Reviewing before deployment