Commit d84ca6c

⚡️ (query/matching) Make weak signals opt-in
1 parent 07efca5 commit d84ca6c

5 files changed, +125 -49 lines changed

docs/matching.md

Lines changed: 45 additions & 16 deletions
@@ -23,6 +23,25 @@ The index stores multiple name representations to catch variations:
 - Name parts (partial matching) (`name_parts`)
 
 
+## Configuration
+
+Matching stages 1 (normalized keywords) and 2 (name keys) are always enabled. Stages 3-5 can be toggled via environment variables:
+
+| Setting | Default | Stage |
+|---------|---------|-------|
+| `OPENALEPH_SEARCH_MATCH_NAME_PARTS` | `false` | Name parts (partial token overlap) |
+| `OPENALEPH_SEARCH_MATCH_PHONETIC` | `false` | Phonetic encoding (sound-alike) |
+| `OPENALEPH_SEARCH_MATCH_SYMBOLS` | `false` | Name symbols (cross-language) |
+
+Enabling more stages improves recall (finding more potential matches) at the cost of query complexity and performance. For most use cases, stages 1 and 2 provide sufficient matching quality.
+
+```bash
+# Enable all matching stages
+export OPENALEPH_SEARCH_MATCH_NAME_PARTS=true
+export OPENALEPH_SEARCH_MATCH_PHONETIC=true
+export OPENALEPH_SEARCH_MATCH_SYMBOLS=true
+```
+
 ## Name matching strategies
 
 ### 1. Normalized keywords
@@ -42,7 +61,10 @@ Normalization:
 
 Exact name matches (with order preserved) receive the highest boost.
 
-### 2. Name symbols
+### 2. Name symbols {: #name-symbols }
+
+!!! note
+    Disabled by default. Enable with `OPENALEPH_SEARCH_MATCH_SYMBOLS=true`.
 
 Cross-language and cross-alphabet matching via symbolic representations. This can be considered as a synonyms search, but more precise and context specific than [a global synonyms file](https://www.elastic.co/docs/solutions/search/full-text/search-with-synonyms).
 
@@ -58,7 +80,10 @@ Example:
 
 Same symbol = same entity name (part) across languages.
 
-### 3. Phonetic encoding
+### 3. Phonetic encoding {: #phonetic }
+
+!!! note
+    Disabled by default. Enable with `OPENALEPH_SEARCH_MATCH_PHONETIC=true`.
 
 Sound-alike matching using Double Metaphone algorithm.
 
@@ -72,7 +97,10 @@ Example:
 
 Catches alternate spellings and transcription variations.
 
-### 4. Name parts
+### 4. Name parts {: #name-parts }
+
+!!! note
+    Disabled by default. Enable with `OPENALEPH_SEARCH_MATCH_NAME_PARTS=true`.
 
 Individual name components for partial matching.
 
@@ -143,16 +171,16 @@ Only compatible schema types can match each other.
 
 Match scores combine multiple factors:
 
-| Signal | Boost | Index field |
-|--------|-------|-------------|
-| Names (exact, order preserved) | 5.0 | `names` |
-| Name keys (order-independent) | 3.0 | `name_keys` |
-| Identifiers | 3.0 | `properties.*` (for group type "identifier") |
-| High-value properties | 2.0 | `properties.*` (ip, url, email, phone) |
-| Name parts | 1.0 | `name_parts` |
-| Other properties | 1.0 | `properties.*` |
-| Phonetic codes | 0.8 | `name_phonetics` |
-| Name symbols | 0.8 | `name_symbols` |
+| Signal | Boost | Index field | Default |
+|--------|-------|-------------|---------|
+| Names (exact, order preserved) | 5.0 | `names` | always |
+| Name keys (order-independent) | 3.0 | `name_keys` | always |
+| Identifiers | 3.0 | `properties.*` (for group type "identifier") | always |
+| High-value properties | 2.0 | `properties.*` (ip, url, email, phone) | always |
+| Name parts | 1.0 | `name_parts` | opt-in |
+| Other properties | 1.0 | `properties.*` | always |
+| Phonetic codes | 0.8 | `name_phonetics` | opt-in |
+| Name symbols | 0.8 | `name_symbols` | opt-in |
 
 Higher boost = more important for matching.
 
@@ -192,9 +220,10 @@ A match query combines multiple strategies:
 // Name matching clauses (using terms queries for efficiency)
 {"terms": {"names": ["john smith"], "boost": 5.0}},
 {"terms": {"name_keys": ["johnsmith"], "boost": 3.0}},
-{"terms_set": {"name_parts": {"terms": ["john", "smith"], "minimum_should_match_script": {...}}}},
-{"terms_set": {"name_phonetic": {"terms": ["JN", "SM0"], "minimum_should_match_script": {...}}}},
-{"terms_set": {"name_symbols": {"terms": ["[NAME:12345]"], "minimum_should_match_script": {...}}}}
+// Optional stages (disabled by default, enable via settings):
+{"terms_set": {"name_parts": {"terms": ["john", "smith"], "minimum_should_match_script": {...}}}}, // match_name_parts
+{"terms_set": {"name_phonetic": {"terms": ["JN", "SM0"], "minimum_should_match_script": {...}}}}, // match_phonetic
+{"terms_set": {"name_symbols": {"terms": ["[NAME:12345]"], "minimum_should_match_script": {...}}}} // match_symbols
 ],
 "minimum_should_match": 1
 }
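
The `terms_set` clauses in the query example accept a document only when enough of the query terms occur in the field. The pure-Python sketch below imitates that rule without Elasticsearch, as a rough illustration only. The function name `terms_set_matches` and the threshold `min(number of query terms, 2)` are assumptions: the docs elide the actual script as `{...}` and only state that name parts require 2+ matching tokens.

```python
def terms_set_matches(query_terms, doc_terms, required=2):
    """Imitate a terms_set clause on a single field.

    Counts the distinct query terms that occur in the document field and
    compares the count against a threshold.

    Assumption (not confirmed by the source): the threshold is
    min(number of query terms, required), so a one-token query can still
    match on its single token.
    """
    unique_terms = set(query_terms)
    if not unique_terms:
        return False
    needed = min(len(unique_terms), required)
    hits = len(unique_terms & set(doc_terms))
    return hits >= needed


if __name__ == "__main__":
    # Both tokens present: the clause matches.
    print(terms_set_matches(["john", "smith"], ["john", "a", "smith"]))
    # Only one of two tokens present: the clause does not match.
    print(terms_set_matches(["john", "smith"], ["john", "doe"]))
```

Under this assumed rule, a single shared token between two multi-token names is not enough to match, which is what keeps the weaker signals from firing on very common name parts.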

docs/reference/settings.md

Lines changed: 32 additions & 0 deletions
@@ -393,6 +393,38 @@ Maximum document frequency for MLT query terms. Common terms above this threshol
 - Type: `int`
 - Default: `500`
 
+## Entity matching
+
+[Read more](../matching.md)
+
+### `match_name_parts`
+
+Enable name parts matching (partial token overlap, requires 2+ matching tokens).
+
+- Type: `bool`
+- Default: `false`
+
+### `match_phonetic`
+
+Enable phonetic matching (sound-alike via Double Metaphone).
+
+- Type: `bool`
+- Default: `false`
+
+### `match_symbols`
+
+Enable name symbols matching (cross-language/alphabet via WikiData).
+
+- Type: `bool`
+- Default: `false`
+
+```bash
+# Enable all optional matching stages
+export OPENALEPH_SEARCH_MATCH_NAME_PARTS=true
+export OPENALEPH_SEARCH_MATCH_PHONETIC=true
+export OPENALEPH_SEARCH_MATCH_SYMBOLS=true
+```
+
 ## Authorization
 
 [Read more](./authorization.md)

openaleph_search/query/matching.py

Lines changed: 39 additions & 33 deletions
@@ -8,6 +8,7 @@
 
 from openaleph_search.index.mapping import Field, property_field_name
 from openaleph_search.query.util import BoolQuery, bool_query, none_query
+from openaleph_search.settings import Settings
 from openaleph_search.transform.util import (
     index_name_keys,
     index_name_parts,
@@ -93,50 +94,55 @@ def names_query(schema: Schema, names: list[str]) -> Clauses:
     if keys:
         shoulds.append({"terms": {Field.NAME_KEYS: keys, "boost": 3.0}})
 
+    settings = Settings()
+
     # 3. name_parts: partial token overlap (requires 2+ matching tokens)
-    parts = list(index_name_parts(schema, names))
-    if parts:
-        shoulds.append(
-            {
-                "terms_set": {
-                    Field.NAME_PARTS: {
-                        "terms": parts,
-                        "minimum_should_match_script": _min_should_match_script(2),
-                        "boost": 1.0,
+    if settings.match_name_parts:
+        parts = list(index_name_parts(schema, names))
+        if parts:
+            shoulds.append(
+                {
+                    "terms_set": {
+                        Field.NAME_PARTS: {
+                            "terms": parts,
+                            "minimum_should_match_script": _min_should_match_script(2),
+                            "boost": 1.0,
+                        }
                     }
                 }
-            }
-        )
+            )
 
     # 4. name_phonetic: spelling/transliteration variants
-    phonetics = list(phonetic_names(schema, names))
-    if phonetics:
-        shoulds.append(
-            {
-                "terms_set": {
-                    Field.NAME_PHONETIC: {
-                        "terms": phonetics,
-                        "minimum_should_match_script": _min_should_match_script(2),
-                        "boost": 0.8,
+    if settings.match_phonetic:
+        phonetics = list(phonetic_names(schema, names))
+        if phonetics:
+            shoulds.append(
+                {
+                    "terms_set": {
+                        Field.NAME_PHONETIC: {
+                            "terms": phonetics,
+                            "minimum_should_match_script": _min_should_match_script(2),
+                            "boost": 0.8,
+                        }
                     }
                 }
-            }
-        )
+            )
 
     # 5. name_symbols: synonyms, nicknames, company suffixes
-    symbols = [str(s) for s in get_name_symbols(schema, *names)]
-    if symbols:
-        shoulds.append(
-            {
-                "terms_set": {
-                    Field.NAME_SYMBOLS: {
-                        "terms": symbols,
-                        "minimum_should_match_script": _min_should_match_script(2),
-                        "boost": 0.8,
+    if settings.match_symbols:
+        symbols = [str(s) for s in get_name_symbols(schema, *names)]
+        if symbols:
+            shoulds.append(
+                {
+                    "terms_set": {
+                        Field.NAME_SYMBOLS: {
+                            "terms": symbols,
+                            "minimum_should_match_script": _min_should_match_script(2),
+                            "boost": 0.8,
+                        }
                     }
                 }
-            }
-        )
+            )
 
     return shoulds
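
The gating pattern in this hunk can be tried in isolation. The sketch below is a simplified, self-contained stand-in and not the project's code: `MatchFlags`, `min_match_script`, `terms_set_clause` and `build_shoulds` are hypothetical names, the script body is a guess (the real `_min_should_match_script` is not shown in the diff), and the field names are taken from the query example in `docs/matching.md`. Unlike the real function, it receives precomputed term lists, so it does not show the saving from skipping the term extraction when a stage is off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchFlags:
    """Hypothetical stand-in for the three boolean settings added in the diff."""

    match_name_parts: bool = False
    match_phonetic: bool = False
    match_symbols: bool = False


def min_match_script(minimum):
    # Assumed script body; the project's actual script is not part of the diff.
    return {"source": f"Math.min(params.num_terms, {minimum})"}


def terms_set_clause(field, terms, boost):
    """Build one terms_set clause in the shape used by the diff."""
    return {
        "terms_set": {
            field: {
                "terms": terms,
                "minimum_should_match_script": min_match_script(2),
                "boost": boost,
            }
        }
    }


def build_shoulds(flags, names, keys, parts, phonetics, symbols):
    """Stages 1 and 2 are always added; stages 3-5 only when their flag is set."""
    shoulds = []
    if names:
        shoulds.append({"terms": {"names": names, "boost": 5.0}})
    if keys:
        shoulds.append({"terms": {"name_keys": keys, "boost": 3.0}})
    optional = [
        (flags.match_name_parts, "name_parts", parts, 1.0),
        (flags.match_phonetic, "name_phonetic", phonetics, 0.8),
        (flags.match_symbols, "name_symbols", symbols, 0.8),
    ]
    for enabled, field, terms, boost in optional:
        # A stage contributes a clause only if it is enabled and has terms.
        if enabled and terms:
            shoulds.append(terms_set_clause(field, terms, boost))
    return shoulds


if __name__ == "__main__":
    terms = (["john smith"], ["johnsmith"], ["john", "smith"], ["JN", "SM0"], ["[NAME:12345]"])
    print(len(build_shoulds(MatchFlags(), *terms)))
    print(len(build_shoulds(MatchFlags(True, True, True), *terms)))
```

With default flags only the `names` and `name_keys` clauses are emitted, which mirrors the commit's intent of making the weaker signals opt-in.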

openaleph_search/settings.py

Lines changed: 6 additions & 0 deletions
@@ -98,6 +98,12 @@ class Settings(BaseSettings):
     mlt_min_word_length: int = 5
     mlt_max_doc_freq: int = 500
 
+    # Entity matching stages (names_query in query/matching.py)
+    # Stages 1 (names) and 2 (name_keys) are always enabled.
+    match_name_parts: bool = False
+    match_phonetic: bool = False
+    match_symbols: bool = False
+
     # Pre-build global ordinals on frequently-aggregated keyword fields
     # during refresh. Eliminates first-query latency spikes at the cost of
     # slightly slower refreshes.
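
The docs in this commit set the new flags with `=true`, while `pyproject.toml` sets them to `1`. As a rough illustration of how both spellings could resolve to the same boolean, here is a standard-library sketch. It does not reproduce the project's `BaseSettings` behavior: the helper `env_flag`, the accepted spellings, and the error handling are assumptions, and only the `OPENALEPH_SEARCH_` prefix and the field names come from the diff.

```python
import os

# Assumed spellings; the real settings class may accept a different set.
TRUE_VALUES = {"1", "true", "yes", "on"}
FALSE_VALUES = {"0", "false", "no", "off"}


def env_flag(name, default=False, prefix="OPENALEPH_SEARCH_", environ=None):
    """Read a boolean flag such as match_phonetic from the environment.

    `environ` can be any mapping, which keeps the function easy to test;
    it falls back to os.environ when omitted.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(prefix + name.upper())
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError(f"Cannot interpret {raw!r} as a boolean for {name}")


if __name__ == "__main__":
    fake_env = {
        "OPENALEPH_SEARCH_MATCH_NAME_PARTS": "true",
        "OPENALEPH_SEARCH_MATCH_PHONETIC": "1",
    }
    for field in ("match_name_parts", "match_phonetic", "match_symbols"):
        print(field, env_flag(field, environ=fake_env))
```

An unset variable falls back to the default of `False`, matching the opt-in behavior the settings fields declare.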

pyproject.toml

Lines changed: 3 additions & 0 deletions
@@ -70,3 +70,6 @@ OPENALEPH_SEARCH_SIGNIFICANT_TERMS_RANDOM_SAMPLER = 0
 OPENALEPH_SEARCH_MLT_MIN_WORD_LENGTH = 3
 OPENALEPH_SEARCH_MLT_MIN_DOC_FREQ = 1
 OPENALEPH_SEARCH_MLT_MIN_TERM_FREQ = 1
+OPENALEPH_SEARCH_MATCH_NAME_PARTS = 1
+OPENALEPH_SEARCH_MATCH_PHONETIC = 1
+OPENALEPH_SEARCH_MATCH_SYMBOLS = 1
