## v2.0.0
Major update with improved search capabilities and updated configuration.
### Added
- Substring search
- Prefix search
### Changed
- **BREAKING**: Adjusted indexing and query configuration. See also /src/default-config.ts for the new structure.
Related to #2 and #6.
@@ -169,7 +196,9 @@ The following parameters are available when creating a query:
| --------- | ---- | ------- | ----------- |
| string | string | - | The query string. |
| topN | number | 10 | The maximum number of matches to return. Provide Infinity to return all matches. |
-
| minQuality | number | 0.3 | The minimum quality of a match, ranging from 0 to 1. When set to zero, all terms that share at least one common n-gram with the query are considered a match. |
+
| searchers | SearcherSpec[] | [new FuzzySearcher(0.3), new SubstringSearcher(0), new PrefixSearcher(0)] | The searchers to use and the minimum quality thresholds for their matches. |
+
+
A fuzzy search minimum quality threshold below 0.3 is not recommended, as the respective matches are most likely irrelevant.
If the data terms contain characters and strings in non-Latin scripts (such as Arabic, Cyrillic, Greek, or Han; see also [ISO 15924](https://en.wikipedia.org/wiki/ISO_15924)), the default configuration must be adjusted before creating the searcher:
@@ -218,33 +247,29 @@ Query strings and data terms are normalized in the following normalization pipel
- Strings are normalized to NFKD.
- Space equivalent characters are replaced by a space.
- Surrogate characters, padding characters and other non-allowed characters are removed.
-
- Strings are padded to the left, right and in the middle (replacement of spaces).
>Normalization to NFKC decomposes characters by compatibility, then re-composes them by canonical equivalence. This ensures that the characters in the replacement table always match. Normalization to NFKD decomposes the characters by compatibility but does not re-compose them, allowing undesired characters to be removed thereafter.
The default normalizer config adopts the following values:
```js
-
let paddingLeft = '$$';
-
let paddingRight = '!';
-
let paddingMiddle = '!$$';
-
let replacements = [fuzzySearch.LatinReplacements.Value];
With this pipeline and configuration, the string `Thanh Việt Đoàn` is normalized to `thanh viet doan` before padding. With padding applied, it becomes `$$thanh!$$viet!$$doan!`. The choice of the padding is explained in the next section.
+
With this pipeline and configuration, the string `Thanh Việt Đoàn` is normalized to `thanh viet doan`.
-
## Sorted n-grams
+
## Fuzzy search: sorted n-grams
-
The general idea of n-grams and the sorting trick is outlined in this [blog post](https://www.m31coding.com/blog/fuzzy-search.html). In short, the data terms and the query string are broken down into 3-grams, e.g. the string `$$sarah!` becomes:
+
The general idea of n-grams and the sorting trick is outlined in this [blog post](https://www.m31coding.com/blog/fuzzy-search.html). In short, the data terms and the query string are padded on the left, on the right, and in the middle (where spaces are replaced) with `$$`, `!`, and `!$$`, respectively, before they are broken down into 3-grams. For example, the string `sarah` becomes `$$sarah!` after padding, and the resulting 3-grams are:
```text
$$s, $sa, sar, ara, rah, ah!
-
``````
+
```
The more 3-grams the query and the term have in common, the higher the quality of the match. Padding the front with two characters and the back with one character gives more weight to the beginning of the string.
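
The effect of the asymmetric padding can be illustrated with a minimal sketch. The helper names `pad`, `ngrams`, and `countCommon` are hypothetical and not part of the library's API, and counting shared 3-grams with a set is a simplifying assumption rather than the library's actual quality computation.

```js
// Hypothetical helpers for illustration only, not the library's API.
const PADDING_LEFT = '$$';
const PADDING_RIGHT = '!';
const PADDING_MIDDLE = '!$$';

// Pads a normalized string on the left and right, and replaces spaces in the middle.
function pad(text) {
  return PADDING_LEFT + text.split(' ').join(PADDING_MIDDLE) + PADDING_RIGHT;
}

// Returns all substrings of length n in order of appearance.
function ngrams(text, n = 3) {
  const result = [];
  for (let i = 0; i + n <= text.length; i++) {
    result.push(text.slice(i, i + n));
  }
  return result;
}

// Counts the query n-grams that also occur in the term (assumption: set semantics).
function countCommon(query, term) {
  const termNgrams = new Set(ngrams(pad(term)));
  return ngrams(pad(query)).filter((ngram) => termNgrams.has(ngram)).length;
}

console.log(ngrams(pad('sarah')).join(', ')); // $$s, $sa, sar, ara, rah, ah!
console.log(countCommon('sar', 'sarah')); // 3 ($$s, $sa, sar)
console.log(countCommon('rah', 'sarah')); // 2 (rah, ah!)
```

Both queries have three characters, but the one that matches the beginning of the term shares more 3-grams with it, because the left padding is longer than the right padding.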
@@ -269,18 +294,48 @@ The quality is then computed by dividing the number of common n-grams by the num
Padding strings in the middle allows for extending the algorithm across word boundaries. `sarah wolff` becomes `$$sarah!$$wolff!` and matches `wolff sarah` with a quality of 0.95, if 3-grams that end with a '\$' are discarded.
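
As a rough sketch of why discarding helps, the snippet below pads both strings, drops every 3-gram that ends with `$`, and compares what is left. The function name `wordBoundaryNgrams` is hypothetical, and the sketch does not model the sorting of n-grams or the quality formula, so it does not reproduce the value of 0.95.

```js
// Hypothetical helper for illustration only, not the library's API.
// Pads the text, splits it into 3-grams, and discards 3-grams that end with '$'.
function wordBoundaryNgrams(text) {
  const padded = '$$' + text.split(' ').join('!$$') + '!';
  const result = [];
  for (let i = 0; i + 3 <= padded.length; i++) {
    const ngram = padded.slice(i, i + 3);
    if (!ngram.endsWith('$')) {
      result.push(ngram);
    }
  }
  return result;
}

// Sorting the lists here only makes them comparable regardless of word order.
const forward = wordBoundaryNgrams('sarah wolff').sort();
const reversed = wordBoundaryNgrams('wolff sarah').sort();

console.log(forward.length); // 12
console.log(forward.join(' ') === reversed.join(' ')); // true
```

The only 3-grams that depend on the word order are those spanning a word boundary (`h!$`, `f!$`, `!$$`), and all of them end with `$`. Once they are discarded, both strings yield the same 3-grams.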
-
The overall approach outlined above can be summarized as: remove n-grams that end with '\$', sort n-grams that don't contain '\$'. The default configuration appears in the code as follows:
+
The overall approach outlined above can be summarized as: remove n-grams that end with '\$', sort n-grams that don't contain '\$'. The default fuzzy search configuration appears in the code as follows:
Substring and prefix search are realized with a single suffix array, which is built with the algorithm from [An efficient, versatile approach to suffix sorting](https://dl.acm.org/doi/10.1145/1227161.1278374).
+
+
The base quality of a prefix or substring match is simply computed by dividing the query length by the term length. For example, the query `sa` matches the term `sarah` with a quality of 2/5 = 0.4, and the query `ara` matches the same term with a quality of 3/5 = 0.6.
+
+
Quality offsets of +2 and +1 are added to prefix and substring matches, respectively, as explained in the next section.
+
+
The final qualities of the examples are:
+
+
| Query | Term | Searcher | Quality |
+
| ----- | ----- | ----------| ----------|
+
| sa | sarah | Prefix | 2 / 5 + 2 = 2.4 |
+
| ara | sarah | Substring | 3 / 5 + 1 = 1.6 |
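
The two table rows can be reproduced with a small sketch. The function names and the use of `startsWith` and `includes` are assumptions made for illustration; the library itself looks up matches in a suffix array.

```js
// Hypothetical helpers for illustration only, not the library's API.
const PREFIX_OFFSET = 2;
const SUBSTRING_OFFSET = 1;

// Base quality of a prefix or substring match: query length divided by term length.
function baseQuality(query, term) {
  return query.length / term.length;
}

// Returns the final quality of a prefix match, or null if the query is not a prefix.
function prefixQuality(query, term) {
  return term.startsWith(query) ? baseQuality(query, term) + PREFIX_OFFSET : null;
}

// Returns the final quality of a substring match, or null if the query does not occur.
function substringQuality(query, term) {
  return term.includes(query) ? baseQuality(query, term) + SUBSTRING_OFFSET : null;
}

console.log(prefixQuality('sa', 'sarah').toFixed(1)); // 2.4
console.log(substringQuality('ara', 'sarah').toFixed(1)); // 1.6
console.log(prefixQuality('ara', 'sarah')); // null
```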
+
+
The default configuration for the searchers is as follows:
The matches of the searchers are mixed with a simple approach. Prefix matches get a quality offset of +2, substring matches of +1, and fuzzy matches keep their original quality. The rationale is that, for the same query length, prefix matches are more relevant than substring matches. Additionally, fuzzy matches are only relevant if there are no prefix or substring matches.
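
A minimal sketch of this mixing step is shown below. The function `mixMatches`, the match shape `{ term, quality }`, and the rule of keeping only the best quality per term are assumptions for illustration. The fuzzy quality in the example is made up, while the prefix and substring qualities follow the query-length-over-term-length rule for the query `sa`.

```js
// Hypothetical sketch, not the library's implementation.
// Each match is assumed to have the shape { term, quality } with a base quality in [0, 1].
function mixMatches(fuzzyMatches, substringMatches, prefixMatches) {
  const best = new Map(); // term -> highest quality seen so far

  const add = (matches, offset) => {
    for (const { term, quality } of matches) {
      const shifted = quality + offset;
      if (!best.has(term) || best.get(term) < shifted) {
        best.set(term, shifted);
      }
    }
  };

  add(fuzzyMatches, 0); // fuzzy matches keep their original quality
  add(substringMatches, 1); // substring matches get an offset of +1
  add(prefixMatches, 2); // prefix matches get an offset of +2

  return [...best.entries()]
    .map(([term, quality]) => ({ term, quality }))
    .sort((a, b) => b.quality - a.quality);
}

const mixed = mixMatches(
  [{ term: 'sam', quality: 0.5 }], // made-up fuzzy quality
  [{ term: 'sarah', quality: 0.4 }, { term: 'lisa', quality: 0.5 }],
  [{ term: 'sarah', quality: 0.4 }]
);

console.log(mixed.map((match) => match.term).join(', ')); // sarah, lisa, sam
```

Because base qualities never exceed 1, the offsets place every prefix match above every substring match, and every substring match above every fuzzy match.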
+
+
## Changing the default configuration
+
+
The default configuration has been chosen carefully. There are only a few specific scenarios that require adjustments. Consult the file [default-config.ts](src/default-config.ts) for all configuration options and their default values.