Commit dd0cde7

Lucene Linguistics: multiple profiles sample app
1 parent ea0113c commit dd0cde7

7 files changed: +210 -0 lines changed

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
components
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
<!-- Copyright Vespa.ai. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root. -->

<picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://assets.vespa.ai/logos/Vespa-logo-green-RGB.svg">
  <source media="(prefers-color-scheme: light)" srcset="https://assets.vespa.ai/logos/Vespa-logo-dark-RGB.svg">
  <img alt="#Vespa" width="200" src="https://assets.vespa.ai/logos/Vespa-logo-dark-RGB.svg" style="margin-bottom: 25px;">
</picture>

# Vespa sample applications - Lucene Linguistics

This app demonstrates how to use multiple analyzer profiles in [Lucene Linguistics](https://docs.vespa.ai/en/linguistics/lucene-linguistics.html).

You can bind different fields to different analyzer profiles in the schema. Here, three analyzers are defined in [services.xml](app/services.xml):
- `lowerFolding`: [standard tokenizer](https://lucene.apache.org/core/9_11_1/core/org/apache/lucene/analysis/standard/StandardTokenizer.html) + [lowercase](https://lucene.apache.org/core/9_11_1/core/org/apache/lucene/analysis/LowerCaseFilter.html) and [ASCII folding](https://lucene.apache.org/core/9_11_1/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilterFactory.html) token filters
- `lowerFoldingStemming`: `lowerFolding` + [kStem for English](https://lucene.apache.org/core/9_11_1/analysis/common/org/apache/lucene/analysis/en/KStemFilterFactory.html)
- `lowerFoldingStemmingSynonyms`: `lowerFoldingStemming` + [synonym expansion](https://lucene.apache.org/core/9_11_1/analysis/common/org/apache/lucene/analysis/synonym/SynonymGraphFilterFactory.html)

The schema binds two text fields to these profiles (a third field, `language`, sets the document language):
- `title`: bound to `lowerFolding` for both indexing and search
- `description`: bound to `lowerFoldingStemming` at write time and to `lowerFoldingStemmingSynonyms` at search time. Synonyms are expanded at search time only, because there is no benefit to expanding them on both sides.

This example only uses English, but you can combine profiles with multiple languages. The steps are:
1. In `services.xml`, define an analyzer for each profile+language combination.
   - Use the `default` profile for fields that are not bound to a specific profile.
2. In the schema, use a `linguistics` block to bind the field to the profile (or profiles, if you need different profiles for index and search).
3. Use [language tags and detection](https://docs.vespa.ai/en/linguistics/linguistics.html#language-handling) as before.

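
Step 1 amounts to enumerating profile and language pairs. As a minimal sketch, assuming the `profile=<name>;language=<code>` key format used in this app's `services.xml`, the keys to define can be listed as follows. `analyzer_keys` is a hypothetical helper, and `de` is only an illustrative second language that is not part of this app.

```python
from itertools import product


def analyzer_keys(profiles, languages):
    """Return one analyzer key per profile+language combination.

    The key format mirrors the `<item key="...">` attributes in services.xml.
    """
    return [
        f"profile={profile};language={language}"
        for profile, language in product(profiles, languages)
    ]


if __name__ == "__main__":
    # "de" is a hypothetical extra language used for illustration only.
    for key in analyzer_keys(["lowerFolding", "lowerFoldingStemming"], ["en", "de"]):
        print(key)
```

Each printed key would need its own `<item>` with a tokenizer and token filters suited to that language.
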
## Deploy the application
Follow the [app deploy guide](https://docs.vespa.ai/en/basics/deploy-an-application)
through the `vespa deploy` step, cloning `examples/lucene-linguistics/multiple-profiles` instead of `album-recommendation`.

## Feed the sample document

```bash
vespa feed ext/*.json
```

## Run test queries

This query confirms that ASCII folding works on the `title` field, because `åao` matches `åäö`:
```bash
curl -s -X POST -d '{
  "yql":"select * from sources * where title contains \"åao\"",
  "presentation.summary": "debug-text-tokens",
  "model.locale": "en",
  "trace.level":2}' -H "Content-Type: application/json" 'http://localhost:8080/search/' | jq .
```

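
To see why the two strings match, here is a rough Python approximation of the lowercase and ASCII folding steps. It is not Lucene's implementation: `ascii_fold` is a hypothetical helper that strips combining marks after Unicode decomposition, whereas Lucene's ASCII folding filter uses its own mapping table and covers characters this sketch does not (for example `ø` or `ß`).

```python
import unicodedata


def ascii_fold(text):
    """Lowercase the text, then drop accents by removing combining marks.

    This is only an approximation of lowercase + ASCII folding for
    characters that decompose into a base letter plus an accent.
    """
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    indexed = ascii_fold("åäö")  # token from the sample document title
    queried = ascii_fold("åao")  # token from the query above
    print(indexed, queried, indexed == queried)
```

Both sides reduce to the same token, which is what lets the query term match the indexed one.
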
You can also force a different profile for the query via `model.type.profile`. This query matches "dubious" with "special" in the `title` field (our test synonym expansion):

```bash
curl -s -X POST -d '{
  "yql":"select * from sources * where title contains \"dubious\"",
  "model.type.profile": "lowerFoldingStemmingSynonyms",
  "presentation.summary": "debug-text-tokens",
  "model.locale": "en",
  "trace.level":2}' -H "Content-Type: application/json" 'http://localhost:8080/search/' | jq .
```

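
The test queries differ only in the field, the term, and the optional `model.type.profile` override. The sketch below builds the same request bodies in Python using only the standard library. `build_query` and `run_query` are hypothetical helpers, and `run_query` assumes the app is reachable at the `http://localhost:8080/search/` endpoint used by the curl examples.

```python
import json
import urllib.request

SEARCH_URL = "http://localhost:8080/search/"  # same endpoint as the curl examples


def build_query(field, term, profile=None):
    """Build a request body like the ones in the curl examples.

    The term is inserted as-is, so it must not contain double quotes.
    """
    body = {
        "yql": f'select * from sources * where {field} contains "{term}"',
        "presentation.summary": "debug-text-tokens",
        "model.locale": "en",
        "trace.level": 2,
    }
    if profile is not None:
        # Overrides the profile that the schema binds to the field.
        body["model.type.profile"] = profile
    return body


def run_query(body):
    """POST the body to the search endpoint; requires a running application."""
    request = urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    body = build_query("title", "dubious", profile="lowerFoldingStemmingSynonyms")
    print(json.dumps(body, indent=2))
```

Calling `run_query(body)` against a deployed app should return the same JSON that the curl command pipes into `jq`.
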
The `description` field already uses a search-time profile that does synonym expansion (as defined in [the schema](app/schemas/doc.sd)), so it matches "dubious" with "special" out of the box:

```bash
curl -s -X POST -d '{
  "yql":"select * from sources * where description contains \"dubious\"",
  "presentation.summary": "debug-text-tokens",
  "model.locale": "en",
  "trace.level":2}' -H "Content-Type: application/json" 'http://localhost:8080/search/' | jq .
```
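
The asymmetry between index-time and search-time analysis can be sketched in a few lines of Python. This is a toy model under loud assumptions: `toy_stem` is a hypothetical stand-in that only drops a plural "s" and is not the kStem algorithm, whitespace splitting stands in for the standard tokenizer, and `SYNONYMS` hardcodes the single rule from the app's synonyms file.

```python
# Mirrors the single rule in the synonyms file: "dubious" is rewritten to "special".
SYNONYMS = {"dubious": ["special"]}


def toy_stem(token):
    # Stand-in for kStem, NOT the real algorithm: drops a plural "s" only.
    if len(token) > 3 and token.endswith("s") and not token.endswith(("ss", "us")):
        return token[:-1]
    return token


def index_tokens(text):
    # Index side: lowercase + stemming, no synonyms.
    # Whitespace split approximates the standard tokenizer for this simple input.
    return [toy_stem(token.lower()) for token in text.split()]


def search_tokens(text):
    # Search side: the same chain, followed by the synonym rule.
    tokens = []
    for token in index_tokens(text):
        tokens.extend(SYNONYMS.get(token, [token]))
    return tokens


if __name__ == "__main__":
    indexed = index_tokens("No character specials here")
    query = search_tokens("dubious")
    print(indexed)
    print(query)
    print(any(token in indexed for token in query))
```

In this model the document side stores `special` (stemmed from "specials") and the query side rewrites "dubious" to `special`, so the synonym rule only needs to run at search time.
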
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
# This file excludes unnecessary files from the application package. See
# https://docs.vespa.ai/en/reference/vespaignore.html for more information.
.DS_Store
.gitignore
README.md
ext/
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
# using Solr synonyms format (default for synonymGraph token filter)
dubious =>special
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
schema doc {

    document doc {
        field language type string {
            indexing: set_language | summary | index
            match: word
        }

        field title type string {
            indexing: summary | index
            # use this when the profile (analyzer configuration) is the same for indexing and searching
            linguistics {
                profile: lowerFolding
            }
            index: enable-bm25
        }

        field description type string {
            indexing: summary | index
            # profile/analyzer can be different for index and search strings
            # typical use-case: synonym expansion (usually done at search time only)
            linguistics {
                profile {
                    index: lowerFoldingStemming
                    search: lowerFoldingStemmingSynonyms
                }
            }
            index: enable-bm25
        }
    }

    document-summary debug-text-tokens {
        summary documentid {}
        summary language {}
        summary title {}
        summary description {}
        summary title_tokens {
            source: title
            tokens
        }
        summary description_tokens {
            source: description
            tokens
        }
        from-disk
    }
}
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
<?xml version="1.0" encoding="utf-8" ?>
<services version="1.0" minimum-required-vespa-version="8.315.19">
    <container id="container" version="1.0">
        <component id="linguistics"
                   class="com.yahoo.language.lucene.LuceneLinguistics"
                   bundle="lucene-linguistics">
            <config name="com.yahoo.language.lucene.lucene-analysis">
                <!-- we store synonyms (and potentially other files) in this directory under the application package -->
                <configDir>lucene-linguistics</configDir>
                <analysis>
                    <!-- profile is essentially the name of the analyzer configuration; use it in the schema and at query time -->
                    <item key="profile=lowerFolding;language=en">
                        <tokenizer>
                            <name>standard</name>
                        </tokenizer>
                        <tokenFilters>
                            <item>
                                <name>lowercase</name>
                            </item>
                            <item>
                                <name>asciiFolding</name>
                            </item>
                        </tokenFilters>
                    </item>

                    <item key="profile=lowerFoldingStemming;language=en">
                        <tokenizer>
                            <name>standard</name>
                        </tokenizer>
                        <tokenFilters>
                            <item>
                                <name>lowercase</name>
                            </item>
                            <item>
                                <name>asciiFolding</name>
                            </item>
                            <item>
                                <name>kStem</name>
                            </item>
                        </tokenFilters>
                    </item>

                    <item key="profile=lowerFoldingStemmingSynonyms;language=en">
                        <tokenizer>
                            <name>standard</name>
                        </tokenizer>
                        <tokenFilters>
                            <item>
                                <name>lowercase</name>
                            </item>
                            <item>
                                <name>asciiFolding</name>
                            </item>
                            <item>
                                <name>kStem</name>
                            </item>
                            <item>
                                <name>synonymGraph</name>
                                <conf>
                                    <item key="synonyms">en/synonyms.txt</item>
                                </conf>
                            </item>
                        </tokenFilters>
                    </item>
                </analysis>
            </config>
        </component>
        <document-processing/>
        <document-api/>
        <search/>
    </container>
    <content id="content" version="1.0">
        <min-redundancy>1</min-redundancy>
        <documents>
            <document type="doc" mode="index"/>
        </documents>
    </content>
</services>
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
{
    "put": "id:en:doc::1",
    "fields": {
        "title": "Title with special characters åäö",
        "description": "No character specials here",
        "language": "en"
    }
}
