Date: 2025-12-12 Context: Performance degradation after implementing Slovak language support Status: Critical architectural issue identified
The POSITION search type was introduced to handle Slovak language-specific requirements that standard PostgreSQL full-text search couldn't satisfy. This created a performance anti-pattern (regex on tsvector) that makes the system unusable at scale.
Root Cause: Lack of proper Slovak text search configuration in PostgreSQL Impact: 50-100× performance degradation Solution: Create Slovak-specific text search configuration with proper dictionaries
From docs/SearchIndex (iDempiere) e8f64240f05942388b4b5f89687657f5.md:
Initial Implementation:
- PostgreSQL triggers with
unaccentextension - Very primitive: "allow use like or ilike"
- Performance: Fast initially
- Limitation: "doesn't work for Slovak language (selected on ad_client)"
-- Early POC approach
UPDATE m_product_trl p SET globalsearchtermsvector =
to_tsvector(coalesce(v_VendorProductNo, ' ')) ||
to_tsvector(p.Name) ||
to_tsvector(unaccent_upper(p.Name)) ||
to_tsvector(v_value) ||
to_tsvector(COALESCE(v_upc,'')) ||
to_tsvector(COALESCE(v_sku,''))
WHERE p.m_product_id = NEW.m_product_id;Challenges Identified (Line 128):
"Need to understand and implement special characters like Slovak 'č', 'š', unaccent, and synonym support."
Timeline:
- Initially fast: Using simple
to_tsvector()withunaccent - Slovak requirements emerged: Exact diacritic matches needed higher ranking
- "Fixed some search cases": Added POSITION search type with regex
- Performance degraded: 50-100× slower due to regex on every row
What POSITION Search Actually Does:
// PGTextSearchIndexProvider.java:692-715
// Tries to differentiate:
// 1. Exact match with diacritics (e.g., "růžová") → score 0.5
// 2. Unaccented match (e.g., "ruzova") → score 1.0
// 3. Extract position weights from tsvector text representation
// Uses regex because PostgreSQL FTS doesn't have Slovak config
regexp_matches(idx_tsvector::text, E'\\yrůžová\\y') // ← Destroys GIN indexKey Insights:
-
Diacritics Are Critical:
- Slovak: á, ä, č, ď, é, í, ĺ, ľ, ň, ó, ô, ŕ, š, ť, ú, ý, ž
- "závislosť" ≠ "zavislost" (different meanings!)
- Users expect to find both when searching either variant
-
Solution: Custom Text Search Configuration:
CREATE TEXT SEARCH CONFIGURATION sk_unaccent (COPY = simple); ALTER TEXT SEARCH CONFIGURATION sk_unaccent ALTER MAPPING FOR word, asciiword WITH unaccent, simple;
-
ispell Dictionary for Slovak:
- Handles morphological variations (grammatical cases)
- Can be combined with unaccent for flexible matching
# Strip diacritics from dictionary unidecode < sk.dict > sk_unaccent.dict
-
Performance:
- GIN indexes: ~1ms for common words
- RUM indexes: 12-64× faster for ranked sorting
- Critical: Use native PostgreSQL functions, NOT regex!
-
Stop Words:
- Slovak stop words: "a", "alebo", "ani", "však", "že"
- Should be excluded from indexing
Real Reason (not documented in code):
POSITION search was created to solve this ranking problem:
User searches: "ruža" (rose)
Desired ranking:
1. "ruža" (exact Slovak match) ← HIGHEST
2. "růže" (Czech variant) ← MEDIUM
3. "ruza" (unaccented) ← LOWEST
Problem: PostgreSQL doesn't have Slovak config
Solution attempted: Regex to check for exact diacritics
Result: Performance catastrophe
The regex pattern:
// Line 692: Check if exact Slovak diacritics present
CASE WHEN EXISTS (
SELECT 1 FROM regexp_matches(idx_tsvector::text, E'\\yruža\\y')
) THEN 0.5 // Exact match bonus
ELSE ...Why this is wrong:
- Casting
idx_tsvector::textbypasses GIN index - Regex runs on EVERY row (O(n))
- Could be solved with proper Slovak text search config (O(log n))
Concept: Index same content with multiple text search configs, assign different weights
-- 1. Create Slovak text search configuration
CREATE TEXT SEARCH CONFIGURATION sk_unaccent (COPY = simple);
-- 2. Add unaccent dictionary
ALTER TEXT SEARCH CONFIGURATION sk_unaccent
ALTER MAPPING FOR asciiword, word, numword, asciihword, hword
WITH unaccent, slovak_ispell, simple;
-- 3. Multi-weight indexing
CREATE OR REPLACE FUNCTION build_slovak_tsvector(
text_original TEXT,
text_normalized TEXT
) RETURNS tsvector AS $$
BEGIN
RETURN
-- Weight A: Exact Slovak match (with diacritics)
setweight(to_tsvector('simple', text_original), 'A') ||
-- Weight B: Slovak stemmed/normalized
setweight(to_tsvector('sk_unaccent', text_normalized), 'B') ||
-- Weight C: Fully unaccented (fallback)
setweight(to_tsvector('simple', unaccent(text_original)), 'C');
END;
$$ LANGUAGE plpgsql IMMUTABLE;Index Creation:
-- Apply to idx_product_ts
UPDATE idx_product_ts
SET idx_tsvector = build_slovak_tsvector(
original_text,
normalize_slovak(original_text)
);
CREATE INDEX idx_product_ts_gin ON idx_product_ts USING gin(idx_tsvector);Search Query (NO REGEX!):
-- Multi-config search with ranking
SELECT
ad_table_id,
record_id,
ts_rank(
array[1.0, 0.7, 0.4, 0.2], -- Weight preferences: A=1.0, B=0.7, C=0.4, D=0.2
idx_tsvector,
to_tsquery('sk_unaccent', 'ruža')
) AS rank
FROM idx_product_ts
WHERE idx_tsvector @@ to_tsquery('sk_unaccent', 'ruža')
ORDER BY rank DESC;Why this works:
- ✅ Uses GIN index (fast)
- ✅ Native ts_rank() for scoring
- ✅ Exact Slovak matches rank highest (weight A)
- ✅ Unaccented fallback works (weight C)
- ✅ No regex needed
- ✅ Scales to millions of rows
File: PGTextSearchIndexProvider.java
Current getTSConfig() (Line 94-95):
private String getTSConfig(Properties ctx, String trxName) {
return "unaccent"; // Hardcoded - PROBLEM!
}Improved getTSConfig():
private String getTSConfig(Properties ctx, String trxName) {
// Get language from context or AD_Client
String language = Env.getAD_Language(ctx);
switch (language) {
case "sk_SK":
return "sk_unaccent"; // Slovak configuration
case "cs_CZ":
return "cs_unaccent"; // Czech configuration
case "pl_PL":
return "pl_unaccent"; // Polish configuration
case "hu_HU":
return "hu_unaccent"; // Hungarian configuration
default:
return "simple"; // Fallback
}
}Update documentContentToTsvector() (Lines 426-454):
private String documentContentToTsvector(
Map<String, SearchIndexColumnData> tableDataSet,
String tsConfig,
List<Object> params) {
StringBuilder documentContent = new StringBuilder();
for (Map.Entry<String, SearchIndexColumnData> entry : tableDataSet.entrySet()) {
SearchIndexColumnData columnData = entry.getValue();
String value = Objects.toString(columnData.getValue(), "");
// Weight A: Exact match with original diacritics
documentContent.append("setweight(")
.append("to_tsvector('simple', ?::text), 'A') || ");
params.add(value);
// Weight B: Language-specific normalized
documentContent.append("setweight(")
.append("to_tsvector('").append(tsConfig).append("', ?::text), 'B') || ");
params.add(value);
// Weight C: Fully unaccented (fallback for typos)
documentContent.append("setweight(")
.append("to_tsvector('simple', unaccent(?::text)), 'C') || ");
params.add(value);
}
// Remove trailing ' || '
if (documentContent.length() > 4) {
documentContent.setLength(documentContent.length() - 4);
}
return documentContent.toString();
}Delete POSITION search type entirely (Lines 670-715)
Update TS_RANK (Lines 657-669):
case TS_RANK:
// Use weight array for multi-config ranking
rankSql.append("ts_rank(")
.append("array[1.0, 0.7, 0.4, 0.2], ") // A, B, C, D weights
.append("idx_tsvector, ")
.append("to_tsquery(?::regconfig, ?::text))");
params.add(tsConfig);
params.add(sanitizedQuery);
break;Create Slovak text search configuration:
-- Migration script: V1.0__Slovak_Text_Search_Config.sql
-- 1. Ensure unaccent extension exists
CREATE EXTENSION IF NOT EXISTS unaccent;
-- 2. Create Slovak configuration
CREATE TEXT SEARCH CONFIGURATION sk_unaccent (COPY = simple);
-- 3. Map token types to unaccent + simple dictionary
ALTER TEXT SEARCH CONFIGURATION sk_unaccent
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
WITH unaccent, simple;
-- 4. Optional: Add Slovak stop words
CREATE TEXT SEARCH DICTIONARY slovak_stopwords (
TEMPLATE = pg_catalog.simple,
STOPWORDS = slovak
);
-- Create stopwords file: /usr/share/postgresql/tsearch_data/slovak.stop
-- Content: a, alebo, ale, ani, až, bez, či, do, ...
-- 5. Create similar configs for other Central European languages
CREATE TEXT SEARCH CONFIGURATION cs_unaccent (COPY = sk_unaccent);
CREATE TEXT SEARCH CONFIGURATION pl_unaccent (COPY = sk_unaccent);
CREATE TEXT SEARCH CONFIGURATION hu_unaccent (COPY = sk_unaccent);
-- 6. Reindex existing data
-- (This will be done by com.cloudempiere.searchindex CreateSearchIndex process)-- Current implementation (simplified)
SELECT ad_table_id, record_id,
CASE WHEN EXISTS (
SELECT 1 FROM regexp_matches(idx_tsvector::text, E'\\yruža\\y')
) THEN 0.5
ELSE 1.0 END *
COALESCE((regexp_match(idx_tsvector::text, E'ruža[^'']*:(\\d+)'))[1]::int, 1000)
FROM idx_product_ts
WHERE idx_tsvector @@ to_tsquery('simple', 'ruža')
ORDER BY rank ASC;Performance:
- 10,000 rows: ~5 seconds (full table scan + 60,000 regex ops)
- 100,000 rows: ~50 seconds (UNUSABLE)
-- Proposed implementation
SELECT ad_table_id, record_id,
ts_rank(
array[1.0, 0.7, 0.4, 0.2],
idx_tsvector,
to_tsquery('sk_unaccent', 'ruža')
) AS rank
FROM idx_product_ts
WHERE idx_tsvector @@ to_tsquery('sk_unaccent', 'ruža')
ORDER BY rank DESC;Performance:
- 10,000 rows: ~50ms (GIN index scan)
- 100,000 rows: ~100ms (scales logarithmically)
Improvement: 100× faster
Current POSITION Approach:
❌ "ruža" exact → rank 0.5 × position
❌ "růže" Czech → rank 1.0 × position
❌ "ruza" unaccented → rank 10.0 × position
Problem: Position numbers are arbitrary, not semantic
Proposed Weighted TS_RANK:
✅ "ruža" exact → weight A (1.0) → highest rank
✅ "růže" normalized → weight B (0.7) → medium rank
✅ "ruza" unaccented → weight C (0.4) → lower rank
Benefit: Semantic relevance scoring
Test queries (Slovak products):
- "červená ruža" (red rose)
- "modré ruže" (blue roses - Czech plural)
- "ruza cervena" (unaccented typo)
Expected behavior:
- Query 1: Slovak exact matches rank #1
- Query 2: Czech variants rank high, Slovak matches included
- Query 3: Finds results despite typos, ranks lower
Implementation: Use weight array [1.0, 0.7, 0.4, 0.2] to control preferences
- Run migration script to create Slovak text search config
- Test configuration manually:
SELECT to_tsvector('sk_unaccent', 'červená ruža'); SELECT to_tsquery('sk_unaccent', 'cervena & ruza');
- Update
getTSConfig()to return language-specific config - Modify
documentContentToTsvector()for multi-weight indexing - Remove POSITION search type code (670-715)
- Update TS_RANK to use weight array
- Add unit tests for Slovak language scenarios
- Run
CreateSearchIndexprocess for all indexes - Monitor index size (will increase ~3× due to multi-weight)
- Verify GIN indexes are used (EXPLAIN ANALYZE)
- Test Slovak queries with diacritics
- Test Czech/Polish queries (similar languages)
- Performance benchmarks (compare before/after)
- Search quality evaluation (precision/recall)
- Deploy to staging environment
- User acceptance testing
- Production deployment
- Monitor performance metrics
Total effort: ~10-12 days
Issue: Multi-weight indexing creates 3× larger indexes
Mitigation:
- Monitor disk space (estimate: 300MB → 900MB per 1M products)
- PostgreSQL handles this well (GIN indexes are compressed)
- Trade-off: 3× space for 100× speed is acceptable
Issue: Changing text search config may affect existing queries
Mitigation:
- Keep backward compatibility: Support both 'unaccent' and 'sk_unaccent'
- Gradual migration: Add new config, test, then switch default
- Fallback mechanism in
getTSConfig()
Issue: Slovak ispell dictionary may not exist in all PostgreSQL installations
Mitigation:
- Start with unaccent only (proven to work)
- ispell is optional enhancement (Phase 2)
- Document requirement in installation guide
-
Slovak FTS Article: https://linuxos.sk/blog/mirecove-dristy/detail/fulltext-v-databaze-prakticky-alebo-co-vam-na/
- Key insights: Custom configs, unaccent, RUM indexes, Slovak stop words
-
Original POC Docs:
docs/SearchIndex (iDempiere) e8f64240f05942388b4b5f89687657f5.md- Historical context, BX Omnisearch analysis, identified challenges
-
PostgreSQL FTS Documentation: https://www.postgresql.org/docs/current/textsearch.html
- Text search configurations, dictionaries, ranking functions
-
Unaccent Extension: https://www.postgresql.org/docs/current/unaccent.html
- Diacritic removal, custom rules, performance
The POSITION search type was a workaround for lack of Slovak language support in PostgreSQL text search configuration. By creating a proper sk_unaccent configuration with multi-weight indexing, we can:
✅ Eliminate regex operations (100× faster) ✅ Maintain Slovak language quality (exact diacritics rank higher) ✅ Support Central European languages (Czech, Polish, Hungarian) ✅ Use native PostgreSQL features (GIN indexes, ts_rank) ✅ Scale to millions of rows (logarithmic complexity)
Recommendation: Implement this solution immediately. The performance gain is critical for production use.
Next Steps:
- Review and approve this architecture
- Create database migration scripts
- Implement code changes in
PGTextSearchIndexProvider.java - Test with real Slovak product data
- Deploy to staging environment
Estimated Timeline: 2 weeks to production-ready implementation