<!-- Copyright Vespa.ai. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root. -->

<picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://assets.vespa.ai/logos/Vespa-logo-green-RGB.svg">
  <source media="(prefers-color-scheme: light)" srcset="https://assets.vespa.ai/logos/Vespa-logo-dark-RGB.svg">
  <img alt="#Vespa" width="200" src="https://assets.vespa.ai/logos/Vespa-logo-dark-RGB.svg" style="margin-bottom: 25px;">
</picture>

# Vespa sample applications - Lucene Linguistics

This app demonstrates how to use multiple analyzer profiles in [Lucene Linguistics](https://docs.vespa.ai/en/linguistics/lucene-linguistics.html).

You can bind different fields to different analyzer profiles in the schema. Here, we have three analyzers in [services.xml](app/services.xml):
- `lowerFolding`: [standard tokenizer](https://lucene.apache.org/core/9_11_1/core/org/apache/lucene/analysis/standard/StandardTokenizer.html) + [lowercase](https://lucene.apache.org/core/9_11_1/core/org/apache/lucene/analysis/LowerCaseFilter.html) and [ASCII folding](https://lucene.apache.org/core/9_11_1/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilterFactory.html) token filters
- `lowerFoldingStemming`: lowerFolding + [kStem for English](https://lucene.apache.org/core/9_11_1/analysis/common/org/apache/lucene/analysis/en/KStemFilterFactory.html)
- `lowerFoldingStemmingSynonyms`: lowerFoldingStemming + [synonym expansion](https://lucene.apache.org/core/9_11_1/analysis/common/org/apache/lucene/analysis/synonym/SynonymGraphFilterFactory.html)

The schema binds two fields to these profiles:
- `title`: bound to `lowerFolding`
- `description`: bound to `lowerFoldingStemming` at write time and to `lowerFoldingStemmingSynonyms` at search time. We expand synonyms at search time only, because doing it on both sides doesn't make sense.

This example uses only English, but you can combine profiles with multiple languages. To do so:
1. In `services.xml`, define an analyzer for each profile+language combination.
   - Use the `default` profile for fields that are not bound to a specific profile.
2. In the schema, use a `linguistics` block to bind each field to a profile (or to separate profiles for index and search).
3. Use [language tags and detection](https://docs.vespa.ai/en/linguistics/linguistics.html#language-handling) as before.

## Deploy the application
Follow the [app deploy guide](https://docs.vespa.ai/en/basics/deploy-an-application)
through the `vespa deploy` step, cloning `examples/lucene-linguistics/multiple-profiles` instead of `album-recommendation`.

## Feed the sample document

```bash
vespa feed ext/*.json
```

## Run test queries

The following query confirms that ASCII folding works on the `title` field, because `åao` matches `åäö`:
```bash
curl -s -X POST -d '{
  "yql":"select * from sources * where title contains \"åao\"",
  "presentation.summary": "debug-text-tokens",
  "model.locale": "en",
  "trace.level":2}' -H "Content-Type: application/json" 'http://localhost:8080/search/' | jq .
```

You can also force a different profile for the query via `model.type.profile`. The following query matches "dubious" with "special" (our test synonym expansion):

```bash
curl -s -X POST -d '{
  "yql":"select * from sources * where title contains \"dubious\"",
  "model.type.profile": "lowerFoldingStemmingSynonyms",
  "presentation.summary": "debug-text-tokens",
  "model.locale": "en",
  "trace.level":2}' -H "Content-Type: application/json" 'http://localhost:8080/search/' | jq .
```

The `description` field already uses a search-time profile that does synonym expansion (as defined in [the schema](app/schemas/doc.sd)), so it matches "dubious" with "special" out of the box:

```bash
curl -s -X POST -d '{
  "yql":"select * from sources * where description contains \"dubious\"",
  "presentation.summary": "debug-text-tokens",
  "model.locale": "en",
  "trace.level":2}' -H "Content-Type: application/json" 'http://localhost:8080/search/' | jq .
```
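
The three queries above differ only in the field, the search term, and the optional profile override. As a rough sketch, that repetition could be factored into a small shell helper. `build_query` and `run_query` are hypothetical names, not part of the sample app, and the sketch assumes that its arguments contain no characters that need JSON escaping.

```bash
# Hypothetical helper (not part of the sample app): print the JSON body used by
# the test queries above.
# Usage: build_query FIELD TERM [PROFILE]
# Assumption: FIELD, TERM and PROFILE need no JSON escaping.
build_query() {
  local field="$1" term="$2" profile="${3:-}"
  local profile_part=""
  if [ -n "$profile" ]; then
    # Only override the query-time analyzer profile when one is given.
    profile_part=$(printf '"model.type.profile": "%s", ' "$profile")
  fi
  # \\" in the format string prints \" so the term stays quoted inside the YQL string.
  printf '{"yql": "select * from sources * where %s contains \\"%s\\"", %s"presentation.summary": "debug-text-tokens", "model.locale": "en", "trace.level": 2}\n' \
    "$field" "$term" "$profile_part"
}

# Hypothetical wrapper: POST the body to the local search endpoint used above
# and pretty-print the response with jq.
run_query() {
  curl -s -X POST -d "$(build_query "$@")" \
    -H "Content-Type: application/json" 'http://localhost:8080/search/' | jq .
}
```

With these definitions, `run_query title dubious lowerFoldingStemmingSynonyms` would send the second query above, and `run_query description dubious` the third.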