Commit db98157

doc: added doc for trigram
1 parent 4791470 commit db98157

3 files changed, +277 -0 lines changed

Doc/trigram.md

Lines changed: 277 additions & 0 deletions
@@ -0,0 +1,277 @@
# 🧠 PostgreSQL Trigram (`pg_trgm`) Deep Dive

## Complete Mental Model & End-to-End Flow

---

### 🧩 User Journey

**Flow:**
`User Types "Barca" → Application → PostgreSQL Query → pg_trgm → GIN/GiST Index → Results`

---

## 1. What is pg_trgm?

`pg_trgm` is a PostgreSQL extension that enables **fuzzy string matching** using _trigrams_ (groups of 3 consecutive characters).

### 🔹 Core Concept: Trigrams

A **trigram** is a sequence of three consecutive characters extracted from a string.

**Example**

| String    | Trigrams                                       |
| --------- | ---------------------------------------------- |
| `"hello"` | `"  h"`, `" he"`, `hel`, `ell`, `llo`, `"lo "` |

> Each word is padded with two spaces at the start and one space at the end before trigrams are extracted, so matches at word boundaries count toward similarity.

```sql
-- Enable the extension
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- View trigrams for a string
SELECT show_trgm('hello');
-- Result: {"  h"," he",ell,hel,llo,"lo "}
```
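
The table above covers a single lowercase word. As a small sketch of two further behaviours, assuming a default `pg_trgm` build (which ignores case), the queries below check that case does not change the trigram set and look at a two-word string. The example strings are illustrative only.

```sql
-- Case is ignored in a default build, so both strings yield the same trigram set
SELECT show_trgm('Hello') = show_trgm('hello') AS same_trigrams;

-- Each word is padded separately, so every word boundary adds its own trigrams
SELECT show_trgm('hello world');

-- Number of distinct trigrams in 'hello'
SELECT array_length(show_trgm('hello'), 1) AS trigram_count;  -- 6
```
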

## 2. 🔢 Similarity Algorithms

### Key functions

```sql
-- Basic similarity (range: 0.0 - 1.0)
SELECT similarity('hello', 'hell');             -- 0.5714286
SELECT similarity('christopher', 'chris');      -- 0.3846154 (5 shared trigrams out of 13)

-- Distance (1 - similarity)
SELECT 'hello' <-> 'hell' AS distance;          -- 0.4285714

-- Word similarity (best match against a contiguous part of the second string)
SELECT word_similarity('chris', 'christopher'); -- 0.8333333
```
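
To see where a score such as 0.5714286 comes from: `similarity` is the number of trigrams the two strings share divided by the number of distinct trigrams found in either string. The sketch below recomputes both counts from `show_trgm`; the CTE and column names are arbitrary and chosen only for this illustration.

```sql
WITH a AS (SELECT unnest(show_trgm('hello')) AS trgm),
     b AS (SELECT unnest(show_trgm('hell'))  AS trgm)
SELECT
    -- trigrams present in both strings
    (SELECT count(*)
       FROM (SELECT trgm FROM a INTERSECT SELECT trgm FROM b) AS shared) AS shared_trigrams,
    -- distinct trigrams present in either string
    (SELECT count(*)
       FROM (SELECT trgm FROM a UNION SELECT trgm FROM b) AS either_set) AS total_trigrams,
    similarity('hello', 'hell') AS sim;
-- shared_trigrams = 4, total_trigrams = 7, and 4 / 7 ≈ 0.5714286
```
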

## 3. ⚙️ Index Types: GIN vs GiST

### 🔸 GIN (Generalized Inverted Index)

- ✅ Faster for reads, slower to build and update
- ✅ Better for multiple search terms
- ✅ Ideal for search-heavy applications
- ❌ Often larger on disk

```sql
CREATE INDEX CONCURRENTLY users_name_gin_idx
    ON users USING gin (name gin_trgm_ops);
```

### 🔸 GiST (Generalized Search Tree)

- ✅ Faster for writes, often a smaller disk footprint
- ✅ Better for mixed read/write workloads
- ❌ Slower than GIN for searches, especially on large tables

```sql
CREATE INDEX CONCURRENTLY users_name_gist_idx
    ON users USING gist (name gist_trgm_ops);
```
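
One practical difference the lists above do not mention: according to the `pg_trgm` documentation, ordering by the `<->` distance operator can be served efficiently by a GiST index but not by a GIN index. A minimal sketch, assuming the `users_name_gist_idx` index above exists:

```sql
-- Nearest-neighbour search: the 10 names closest to 'michal'.
-- With a gist_trgm_ops index the planner can walk the index in distance order
-- instead of scoring and sorting every row.
SELECT name, name <-> 'michal' AS distance
FROM users
ORDER BY name <-> 'michal'
LIMIT 10;
```

This form has no similarity cutoff, so it returns the closest rows even when none of them is a good match.
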

## 4. 🔁 Complete End-to-End Flow

- Step 1: User Input

  - User searches: "michal" (intended: "michael")

- Step 2: Application Query

```sql
SELECT
    name,
    similarity(name, 'michal') AS score
FROM users
WHERE name % 'michal'
ORDER BY score DESC
LIMIT 10;
```

- Step 3: PostgreSQL Execution Flow

  - Parse the query and recognize the `%` operator

  - Compute trigrams for 'michal':
    → {"  m"," mi","mic","ich","cha","hal","al "}

  - Look up those trigrams in the GIN/GiST trigram index to find candidate rows

  - Calculate similarity scores for the candidates and keep those that pass `pg_trgm.similarity_threshold` (default 0.3)

  - Return ranked results
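
Whether the steps above really go through the index can be checked with `EXPLAIN`. This is a sketch against the same `users` table; the plan you get depends on table size and statistics, so no output is shown here.

```sql
-- ANALYZE actually runs the query; BUFFERS adds I/O counters to the plan
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    name,
    similarity(name, 'michal') AS score
FROM users
WHERE name % 'michal'
ORDER BY score DESC
LIMIT 10;
```

With a GIN trigram index in place, one would typically look for a bitmap index scan on that index feeding a bitmap heap scan. On very small tables the planner may still prefer a sequential scan.
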

## 5. 🧩 Building Optimal Queries

### ✅ Basic Similarity Search

```sql
-- Simple fuzzy match (uses index)
SELECT name FROM users WHERE name % 'michal';

-- With scoring and ordering
SELECT
    name,
    similarity(name, 'michal') AS match_score
FROM users
WHERE name % 'michal'
ORDER BY match_score DESC
LIMIT 10;
```
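
The `%` operator has no explicit cutoff in the query text: it compares the similarity against the `pg_trgm.similarity_threshold` setting. A short sketch of how to inspect that setting and what the operator amounts to, using the same `users` table:

```sql
-- 'michal' and 'michael' share 5 of 10 distinct trigrams
SELECT similarity('michal', 'michael');  -- 0.5

-- Threshold applied by the % operator (the setting is registered once
-- pg_trgm has been used in the session, as in the statement above)
SHOW pg_trgm.similarity_threshold;       -- 0.3 by default

-- Roughly the same filter written out by hand. Unlike name % 'michal',
-- this function-call form cannot use the trigram index.
SELECT name
FROM users
WHERE similarity(name, 'michal')
      >= current_setting('pg_trgm.similarity_threshold')::real;
```
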

### ⚡ Advanced Multi-Strategy Search

```sql
SELECT
    name,
    similarity(name, 'michal') AS basic_score,
    word_similarity('michal', name) AS word_score,
    (similarity(name, 'michal') * 0.6 +
     word_similarity('michal', name) * 0.4) AS combined_score
FROM users
WHERE
    name % 'michal' OR        -- whole-string similarity
    name ILIKE '%michal%' OR  -- literal substring
    'michal' <% name          -- word similarity, the operator form of word_score
ORDER BY combined_score DESC
LIMIT 20;
```

## 6. 🛍 Real-World E-commerce Search Example

```sql
CREATE TABLE products (
    id BIGSERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    description TEXT,
    category TEXT,
    brand TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Trigram indexes
CREATE INDEX CONCURRENTLY products_name_trgm_idx
    ON products USING gin (name gin_trgm_ops);

CREATE INDEX CONCURRENTLY products_description_trgm_idx
    ON products USING gin (description gin_trgm_ops);

CREATE INDEX CONCURRENTLY products_brand_trgm_idx
    ON products USING gin (brand gin_trgm_ops);

-- Composite index (covers the same columns as the three indexes above;
-- keep whichever set your queries actually use)
CREATE INDEX CONCURRENTLY products_search_composite_idx
    ON products USING gin (
        name gin_trgm_ops,
        description gin_trgm_ops,
        brand gin_trgm_ops
    );
```
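
Because the composite index overlaps with the three single-column indexes, it can be worth checking which of them are actually scanned and how much space each takes. A sketch using the standard `pg_stat_user_indexes` statistics view; the column aliases are arbitrary.

```sql
SELECT
    indexrelname                                 AS index_name,
    idx_scan                                     AS scans,       -- index scans since stats were last reset
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'products'
ORDER BY pg_relation_size(indexrelid) DESC;
```

An index that stays at zero scans under a realistic workload is a candidate for removal, since every index adds write overhead.
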

## 7. Advanced Product Search Function

```sql
CREATE OR REPLACE FUNCTION search_products(
    search_query TEXT,
    category_filter TEXT DEFAULT NULL,
    min_similarity FLOAT DEFAULT 0.2,
    result_limit INT DEFAULT 50
)
RETURNS TABLE (
    product_id BIGINT,
    product_name TEXT,
    product_category TEXT,
    product_brand TEXT,
    relevance_score FLOAT,
    match_source TEXT
)
LANGUAGE plpgsql
STABLE
AS $$
BEGIN
    RETURN QUERY
    SELECT
        p.id,
        p.name,
        p.category,
        p.brand,
        -- Best score across fields; description and brand matches are down-weighted
        GREATEST(
            similarity(p.name, search_query),
            word_similarity(search_query, p.name),
            similarity(p.description, search_query) * 0.7,
            similarity(p.brand, search_query) * 0.9
        ) AS score,
        -- The '*_exact' labels mean a case-insensitive substring match
        CASE
            WHEN p.name ILIKE '%' || search_query || '%' THEN 'name_exact'
            WHEN p.description ILIKE '%' || search_query || '%' THEN 'desc_exact'
            WHEN p.brand ILIKE '%' || search_query || '%' THEN 'brand_exact'
            ELSE 'fuzzy_match'
        END AS source
    FROM products p
    WHERE
        (p.name % search_query OR
         p.description % search_query OR
         p.brand % search_query OR
         search_query <% p.name OR  -- word similarity, matches word_similarity() above
         p.name ILIKE '%' || search_query || '%' OR
         p.description ILIKE '%' || search_query || '%' OR
         p.brand ILIKE '%' || search_query || '%')
        AND (category_filter IS NULL OR p.category = category_filter)
        AND GREATEST(
            similarity(p.name, search_query),
            word_similarity(search_query, p.name),
            similarity(p.description, search_query) * 0.7,
            similarity(p.brand, search_query) * 0.9
        ) >= min_similarity
    ORDER BY score DESC
    LIMIT result_limit;
END;
$$;
```
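
A sketch of how the function might be called. The search terms and the `'Electronics'` category are made-up values for illustration; the column names come from the `RETURNS TABLE` clause above.

```sql
-- Defaults only: no category filter, min_similarity 0.2, up to 50 rows
SELECT product_name, product_brand, relevance_score, match_source
FROM search_products('wireles headphones');

-- Named arguments make it clear which optional parameters are overridden
SELECT *
FROM search_products(
    'wireles headphones',
    category_filter => 'Electronics',
    min_similarity  => 0.3,
    result_limit    => 10
);
```

Note that `search_query` is concatenated into the `ILIKE` patterns, so `%` and `_` typed by a user act as wildcards unless the application escapes them first.
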

```text
User Interface (Search Bar)
↓
Application Layer
↓ REST API: GET /search?q=Barca&limit=10&min_score=0.3
Backend Service
↓ Query Construction & Parameter Validation
PostgreSQL with pg_trgm
↓ Query: SELECT ... WHERE name % 'Barca' AND similarity(name, 'Barca') > 0.3
GIN Trigram Index Scan
↓ Index Lookup & Candidate Selection
Similarity Scoring & Ranking
↓ Result Filtering & Pagination
Ranked, Fuzzy Matched Results
↓ JSON Response to Client
```

## 🧭 Key Takeaways

- ✅ Filter with the `%` operator in `WHERE` so the trigram index can be used; a bare `similarity()` comparison cannot use it
- ✅ Tune similarity thresholds for your use case
- ✅ Prefer GIN indexes for read-heavy systems
- ✅ Combine multiple strategies for robust matching
- ✅ Monitor index usage and query performance regularly
- ✅ Use `SET LOCAL` inside a transaction block for temporary threshold overrides
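
The last takeaway can be sketched as follows, assuming the `users` table from the earlier sections. `SET LOCAL` changes `pg_trgm.similarity_threshold` only for the current transaction; the value 0.2 is an arbitrary example.

```sql
BEGIN;

-- Lasts until COMMIT or ROLLBACK; other sessions and later
-- transactions keep the previous threshold
SET LOCAL pg_trgm.similarity_threshold = 0.2;

SELECT
    name,
    similarity(name, 'michal') AS score
FROM users
WHERE name % 'michal'   -- now matches rows with similarity of about 0.2 or more
ORDER BY score DESC
LIMIT 10;

COMMIT;
```
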

cmd/migrate/migrations/000034_add_ecommerce_base_schema.down.sql

Whitespace-only changes.

cmd/migrate/migrations/000034_add_ecommerce_base_schema.up.sql

Whitespace-only changes.
