|
1 | 1 | ---
|
2 |
| -title: Configure the similarity algorithm |
| 2 | +title: Configure BM25 similarity algorithm |
3 | 3 | titleSuffix: Azure Cognitive Search
|
4 |
| -description: Learn how to enable BM25 on older search services, and how BM25 parameters can be modified to better accommodate the content of your indexes. |
| 4 | +description: Enable Okapi BM25 ranking to upgrade the search ranking and relevance behavior on older Azure Search services. |
5 | 5 |
|
6 |
| -author: nitinme |
7 |
| -ms.author: nitinme |
| 6 | +author: HeidiSteen |
| 7 | +ms.author: heidist |
8 | 8 | ms.service: cognitive-search
|
9 |
| -ms.topic: conceptual |
10 |
| -ms.date: 03/12/2021 |
| 9 | +ms.topic: how-to |
| 10 | +ms.date: 06/22/2022 |
11 | 11 | ---
|
12 | 12 |
|
13 | 13 | # Configure the similarity ranking algorithm in Azure Cognitive Search
|
14 | 14 |
|
15 |
| -Azure Cognitive Search supports two similarity ranking algorithms: |
| 15 | +Depending on the age of your search service, Azure Cognitive Search supports two [similarity ranking algorithms](index-similarity-and-scoring.md) for scoring relevance on full text search results: |
16 | 16 |
|
17 |
| -+ A *classic similarity* algorithm, used by all search services up until July 15, 2020. |
18 |
| -+ An implementation of the *Okapi BM25* algorithm, used in all search services created after July 15. |
| 17 | ++ An *Okapi BM25* algorithm, used in all search services created after July 15, 2020 |
| 18 | ++ A *classic similarity* algorithm, used by all search services created before July 15, 2020 |
19 | 19 |
|
20 |
| -BM25 ranking is the new default because it tends to produce search rankings that align better with user expectations. It comes with [parameters](#set-bm25-parameters) for tuning results based on factors such as document size. |
| 20 | +BM25 ranking is the default because it tends to produce search rankings that align better with user expectations. It includes [parameters](#set-bm25-parameters) for tuning results based on factors such as document size. |
21 | 21 |
|
22 |
| -For new services created after July 15, 2020, BM25 is used automatically and is the sole similarity algorithm. If you try to set similarity to ClassicSimilarity on a new service, an HTTP 400 error will be returned because that algorithm is not supported by the service. |
| 22 | +For search services created after July 2020, BM25 is the sole similarity algorithm. If you try to set similarity to ClassicSimilarity on a new service, an HTTP 400 error will be returned because that algorithm is not supported by the service. |
23 | 23 |
|
24 |
| -For older services created before July 15, 2020, classic similarity remains the default algorithm. Older services can upgrade to BM25 on a per-index basis, as explained below. If you are switching from classic to BM25, you can expect to see some differences how search results are ordered. |
| 24 | +For older services, classic similarity remains the default algorithm. Older services can [upgrade to BM25](#enable-bm25-scoring-on-older-services) on a per-index basis. When switching from classic to BM25, you can expect to see some differences in how search results are ordered. |
25 | 25 |
|
26 |
| -> [!NOTE] |
27 |
| -> Semantic ranking, currently in preview for standard services in selected regions, is an additional step forward in producing more relevant results. Unlike the other algorithms, it is an add-on feature that iterates over an existing result set. For more information, see [Semantic search overview](semantic-search-overview.md) and [Semantic ranking](semantic-ranking.md). |
| 26 | +## Set BM25 parameters |
| 27 | + |
| 28 | +BM25 similarity adds two parameters to control the relevance score calculation. To set "similarity" parameters, issue a [Create or Update Index](/rest/api/searchservice/create-index) request as illustrated by the following example. |
| 29 | + |
| 30 | +Because Cognitive Search won't allow updates to a live index, you'll need to take the index offline so that the parameters can be added. Indexing and query requests will fail while the index is offline. The duration of the outage is the amount of time it takes to update the index, usually no more than several seconds. When the update is complete, the index comes back automatically. To take the index offline, append the "allowIndexDowntime=true" URI parameter on the request that sets the "similarity" property: |
| 31 | + |
| 32 | +```http |
| 33 | +PUT https://[search service name].search.windows.net/indexes/[index name]?api-version=2020-06-30&allowIndexDowntime=true |
| 34 | +{ |
| 35 | + "similarity": { |
| 36 | + "@odata.type": "#Microsoft.Azure.Search.BM25Similarity", |
| 37 | + "b" : 0.5, |
| 38 | + "k1" : 1.3 |
| 39 | + } |
| 40 | +} |
| 41 | +``` |
| 42 | + |
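The request above can also be assembled programmatically. The following is a minimal Python sketch using only the standard library; it builds the request object but does not send it. The function name `build_similarity_update`, the `index_definition` argument, and the sample service, index, and key values are hypothetical. Merging the "similarity" object into a full index definition is an assumption about what the update body should contain, so check the Create or Update Index reference before relying on it.

```python
import json
import urllib.parse
import urllib.request


def build_similarity_update(service_name, index_name, api_key, index_definition,
                            k1=1.3, b=0.5, api_version="2020-06-30"):
    """Build (but do not send) a PUT request that sets BM25 parameters.

    index_definition is a hypothetical dict holding the existing index schema;
    the "similarity" object is merged into a copy of it (an assumption).
    """
    body = dict(index_definition)
    body["similarity"] = {
        "@odata.type": "#Microsoft.Azure.Search.BM25Similarity",
        "b": b,
        "k1": k1,
    }
    # allowIndexDowntime=true takes the index offline briefly during the update.
    query = urllib.parse.urlencode(
        {"api-version": api_version, "allowIndexDowntime": "true"}
    )
    url = (
        f"https://{service_name}.search.windows.net/indexes/"
        f"{urllib.parse.quote(index_name)}?{query}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json", "api-key": api_key},
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    request = build_similarity_update(
        "my-service", "hotels", "<admin-api-key>",
        {"name": "hotels", "fields": []},
    )
    print(request.get_method(), request.full_url)
    # Sending would be: urllib.request.urlopen(request)
```

Keeping the build step separate from the send step makes it easy to inspect the URL and body before the index is taken offline.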
| 43 | +### BM25 property reference |
| 44 | + |
| 45 | +| Property | Type | Description | |
| 46 | +|----------|------|-------------| |
| 47 | +| k1 | number | Controls the scaling function between the term frequency of each matching term and the final relevance score of a document-query pair. Values are usually 0.0 to 3.0, with 1.2 as the default. </br></br>A value of 0.0 represents a "binary model", where the contribution of a single matching term is the same for all matching documents, regardless of how many times that term appears in the text, while a larger k1 value allows the score to continue to increase as more instances of the same term are found in the document. </br></br>Using a higher k1 value can be important in cases where we expect multiple terms to be part of a search query. In those cases, we might want to favor documents that match many of the different query terms being searched over documents that only match a single one, multiple times. For example, when querying the index for documents containing the terms "Apollo Spaceflight", we might want to lower the score of an article about Greek Mythology that contains the term "Apollo" a few dozen times, without mentions of "Spaceflight", compared to another article that explicitly mentions both "Apollo" and "Spaceflight" a handful of times only. | |
| 48 | +| b | number | Controls how the length of a document affects the relevance score. Values are between 0 and 1, with 0.75 as the default. </br></br>A value of 0.0 means the length of the document will not influence the score, while a value of 1.0 means the impact of term frequency on relevance score will be normalized by the document's length. </br></br>Normalizing the term frequency by the document's length is useful in cases where we want to penalize longer documents. In some cases, longer documents (such as a complete novel) are more likely to contain many irrelevant terms, compared to much shorter documents. | |
28 | 49 |
|
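To make the roles of k1 and b in the property reference concrete, here is a small Python sketch of the textbook Okapi BM25 term-frequency component. It is an illustration under stated assumptions, not the service's actual scoring code: the inverse document frequency factor is omitted, and the function name `bm25_term_weight` and its arguments are hypothetical.

```python
def bm25_term_weight(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 term-frequency weight for one term in one document.

    tf          -- occurrences of the term in the document
    doc_len     -- length of the document (in terms)
    avg_doc_len -- average document length across the index
    The IDF factor is left out so only the effect of k1 and b is visible.
    """
    if tf <= 0:
        return 0.0
    # b controls how strongly document length rescales term frequency:
    # b = 0 ignores length, b = 1 applies full length normalization.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls saturation: with k1 = 0 every matching document gets the
    # same weight (the "binary model"); larger k1 lets repeated occurrences
    # keep raising the weight, bounded above by k1 + 1.
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


if __name__ == "__main__":
    for k1 in (0.0, 1.2, 3.0):
        weights = [bm25_term_weight(tf, 100, 100, k1=k1) for tf in (1, 5, 50)]
        print(f"k1={k1}: {weights}")
```

With b set to 0.0, a long and a short document that contain the term equally often receive the same weight; with b set to 1.0, the longer document is penalized in proportion to its length relative to the average.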
29 | 50 | ## Enable BM25 scoring on older services
|
30 | 51 |
|
31 |
| -If you are running a search service that was created prior to July 15, 2020, you can enable BM25 by setting a Similarity property on new indexes. The property is only exposed on new indexes, so if want BM25 on an existing index, you must drop and [rebuild the index](search-howto-reindex.md) with a new Similarity property set to "Microsoft.Azure.Search.BM25Similarity". |
| 52 | +If you are running a search service that was created from March 2014 through July 15, 2020, you can enable BM25 by setting a "similarity" property on new indexes. The property is only exposed on new indexes, so if you want BM25 on an existing index, you must drop and [rebuild the index](search-howto-reindex.md) with a "similarity" property set to "Microsoft.Azure.Search.BM25Similarity". |
32 | 53 |
|
33 |
| -Once an index exists with a Similarity property, you can switch between BM25Similarity or ClassicSimilarity. |
| 54 | +Once an index exists with a "similarity" property, you can switch between `BM25Similarity` and `ClassicSimilarity`. |
34 | 55 |
|
35 | 56 | The following links describe the Similarity property in the Azure SDKs.
|
36 | 57 |
|
@@ -69,32 +90,9 @@ PUT https://[search service name].search.windows.net/indexes/[index name]?api-ve
|
69 | 90 | }
|
70 | 91 | ```
|
71 | 92 |
|
72 |
| -## Set BM25 parameters |
73 |
| - |
74 |
| -BM25 similarity adds two user customizable parameters to control the calculated relevance score. You can set BM25 parameters during index creation, or as an index update if the BM25 algorithm was specified during index creation. |
75 |
| - |
76 |
| -| Property | Type | Description | |
77 |
| -|----------|------|-------------| |
78 |
| -| k1 | number | Controls the scaling function between the term frequency of each matching terms to the final relevance score of a document-query pair. Values are usually 0.0 to 3.0, with 1.2 as the default. </br></br>A value of 0.0 represents a "binary model", where the contribution of a single matching term is the same for all matching documents, regardless of how many times that term appears in the text, while a larger k1 value allows the score to continue to increase as more instances of the same term is found in the document. </br></br>Using a higher k1 value can be important in cases where we expect multiple terms to be part of a search query. In those cases, we might want to favor documents that match many of the different query terms being searched over documents that only match a single one, multiple times. For example, when querying the index for documents containing the terms "Apollo Spaceflight", we might want to lower the score of an article about Greek Mythology that contains the term "Apollo" a few dozen times, without mentions of "Spaceflight", compared to another article that explicitly mentions both "Apollo" and "Spaceflight" a handful of times only. | |
79 |
| -| b | number | Controls how the length of a document affects the relevance score. Values are between 0 and 1, with 0.75 as the default. </br></br>A value of 0.0 means the length of the document will not influence the score, while a value of 1.0 means the impact of term frequency on relevance score will be normalized by the document's length. </br></br>Normalizing the term frequency by the document's length is useful in cases where we want to penalize longer documents. In some cases, longer documents (such as a complete novel), are more likely to contain many irrelevant terms, compared to much shorter documents. | |
80 |
| - |
81 |
| -### Setting k1 and b parameters |
82 |
| - |
83 |
| -To set or modify b or k1 values, add them to the BM25 similarity object. Setting or changing these values on an existing index will take the index offline for at least a few seconds, causing active indexing and query requests to fail. Consequently, you should set the "allowIndexDowntime=true" parameter of the update request: |
84 |
| - |
85 |
| -```http |
86 |
| -PUT https://[search service name].search.windows.net/indexes/[index name]?api-version=2020-06-30&allowIndexDowntime=true |
87 |
| -{ |
88 |
| - "similarity": { |
89 |
| - "@odata.type": "#Microsoft.Azure.Search.BM25Similarity", |
90 |
| - "b" : 0.5, |
91 |
| - "k1" : 1.3 |
92 |
| - } |
93 |
| -} |
94 |
| -``` |
95 |
| - |
96 | 93 | ## See also
|
97 | 94 |
|
| 95 | ++ [Similarity and scoring in Azure Cognitive Search](index-similarity-and-scoring.md) |
98 | 96 | + [REST API Reference](/rest/api/searchservice/)
|
99 | 97 | + [Add scoring profiles to your index](index-add-scoring-profiles.md)
|
100 | 98 | + [Create Index API](/rest/api/searchservice/create-index)
|