Commit ae662cd

Merge pull request #252780 from HeidiSteen/heidist-freshness
BM25 edits
2 parents b02e987 + 98b3b43 commit ae662cd

5 files changed, +88 -54 lines changed

articles/search/index-ranking-similarity.md

Lines changed: 7 additions & 2 deletions
@@ -6,14 +6,19 @@ author: HeidiSteen
 ms.author: heidist
 ms.service: cognitive-search
 ms.topic: how-to
-ms.date: 09/07/2023
+ms.date: 09/25/2023
 ---

 # Configure BM25 relevance scoring

 In this article, learn how to configure the [BM25 relevance scoring algorithm](https://en.wikipedia.org/wiki/Okapi_BM25) used by Azure Cognitive Search for full text search queries. It also explains how to enable BM25 on older search services.

-BM25 applies to strings (text) on fields having a "searchable" attribution. At query time, the search engine uses BM25 to calculate a **@searchScore** for each match in a given query. Matching documents are ranked by their search score, with the top results returned in the query response.
+BM25 applies to:
+
++ Queries that use the `search` parameter for full text search, on text fields having a `searchable` attribution.
++ Scoring is scoped to `searchFields`, or to all `searchable` fields if `searchFields` is null.
+
+The search engine uses BM25 to calculate a **@searchScore** for each match in a given query. Matching documents are ranked by their search score, with the top results returned in the query response. It's possible to get some [score variation](index-similarity-and-scoring.md#score-variation) in results, even from the same query executing over the same search index, but usually these variations are small and don't change the overall ranking of results.

 BM25 has defaults for weighting term frequency and document length. You can customize these properties if the defaults aren't suited to your content. Configuration changes are scoped to individual indexes, which means you can adjust relevance scoring based on the characteristics of each index.
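
The last context line above refers to weights for term frequency and document length. As a rough illustration, the following is a minimal sketch of the textbook Okapi BM25 term score, not the service's implementation. The function name and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration (they are the values commonly cited for BM25); the hunk itself does not state them.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook Okapi BM25 contribution of one query term to one document.

    Hypothetical helper for illustration only; k1 and b are assumed values.
    """
    # Inverse document frequency: terms that are rare across the corpus weigh more.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b=0 ignores document length, b=1 normalizes fully.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # Term-frequency saturation: k1 controls how quickly repeated matches stop helping.
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)
```

In this sketch, raising `k1` lets repeated occurrences of a term keep adding to the score for longer, and lowering `b` reduces the penalty applied to long documents.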

articles/search/index-similarity-and-scoring.md

Lines changed: 19 additions & 19 deletions
@@ -1,39 +1,36 @@
 ---
-title: Relevance and scoring
+title: BM25 relevance scoring
 titleSuffix: Azure Cognitive Search
-description: Explains the concepts of relevance and scoring in Azure Cognitive Search, and what a developer can do to customize the scoring result.
+description: Explains the concepts of BM25 relevance and scoring in Azure Cognitive Search, and what a developer can do to customize the scoring result.
 author: HeidiSteen
 ms.author: heidist
 ms.service: cognitive-search
 ms.topic: conceptual
-ms.date: 08/31/2023
+ms.date: 09/25/2023
 ---

-# Relevance and scoring in Azure Cognitive Search
+# BM25 relevance and scoring for full text search

-This article explains the relevance and the scoring algorithms used to compute search scores in Azure Cognitive Search. A relevance score is computed for each match found in a [full text search](search-lucene-query-architecture.md), where the strongest matches are assigned higher search scores.
+This article explains the BM25 relevance scoring algorithm used to compute search scores for [full text search](search-lucene-query-architecture.md). BM25 relevance is exclusive to full text search. Filter queries, autocomplete and suggested queries, wildcard search or fuzzy search queries aren't scored or ranked for relevance.

-Relevance applies to full text search only. Filter queries, autocomplete and suggested queries, wildcard search or fuzzy search queries aren't scored or ranked for relevance.
-
-In Azure Cognitive Search, you can tune search relevance and boost search scores through these mechanisms:
+In Azure Cognitive Search, you can configure algorithm parameters, and tune search relevance and boost search scores through these mechanisms:

 + Scoring algorithm configuration
-+ Semantic ranking (in preview, described in [this article](semantic-search-overview.md))
 + Scoring profiles
++ [Semantic ranking](semantic-search-overview.md)
 + Custom scoring logic enabled through the *featuresMode* parameter

-> [!NOTE]
-> Matches are scored and ranked from high to low. The score is returned as "@search.score". By default, the top 50 are returned in the response, but you can use the **$top** parameter to return a smaller or larger number of items (up to 1000 in a single response), and **$skip** to get the next set of results.
-
 ## Relevance scoring

-Relevance scoring refers to the computation of a search score that serves as an indicator of an item's relevance in the context of the current query. The higher the score, the more relevant the item.
+Relevance scoring refers to the computation of a search score (**@search.score**) that serves as an indicator of an item's relevance in the context of the current query. The range is unbounded. However, the higher the score, the more relevant the item.
+
+By default, the top 50 highest scoring matches are returned in the response, but you can use the **$top** parameter to return a smaller or larger number of items (up to 1000 in a single response), and **$skip** to get the next set of results.

 The search score is computed based on statistical properties of the string input and the query itself. Azure Cognitive Search finds documents that match on search terms (some or all, depending on [searchMode](/rest/api/searchservice/search-documents#query-parameters)), favoring documents that contain many instances of the search term. The search score goes up even higher if the term is rare across the data index, but common within the document. The basis for this approach to computing relevance is known as *TF-IDF or* term frequency-inverse document frequency.

-Search scores can be repeated throughout a result set. When multiple hits have the same search score, the ordering of the same scored items is undefined and not stable. Run the query again, and you might see items shift position, especially if you are using the free service or a billable service with multiple replicas. Given two items with an identical score, there's no guarantee that one appears first.
+Search scores can be repeated throughout a result set. When multiple hits have the same search score, the ordering of the same scored items is undefined and not stable. Run the query again, and you might see items shift position, especially if you're using the free service or a billable service with multiple replicas. Given two items with an identical score, there's no guarantee that one appears first.

-If you want to break the tie among repeating scores, you can add an **$orderby** clause to first order by score, then order by another sortable field (for example, `$orderby=search.score() desc,Rating desc`). For more information, see [$orderby](search-query-odata-orderby.md).
+To break the tie among repeating scores, you can add an **$orderby** clause to first order by score, then order by another sortable field (for example, `$orderby=search.score() desc,Rating desc`). For more information, see [$orderby](search-query-odata-orderby.md).

 > [!NOTE]
 > A `@search.score = 1` indicates an un-scored or un-ranked result set. The score is uniform across all results. Un-scored results occur when the query form is fuzzy search, wildcard or regex queries, or an empty search (`search=*`, sometimes paired with filters, where the filter is the primary means for returning a match).
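
The paging and tie-breaking guidance in the hunk above can be sketched as a request body. This is a minimal sketch under stated assumptions: `build_ranked_query` is a hypothetical helper, the body keys `search`, `top`, `skip`, and `orderby` are assumed to be the POST-body counterparts of the `$top`, `$skip`, and `$orderby` parameters named in the text, and `Rating` is the sortable field from the article's own example.

```python
import json

MAX_TOP = 1000  # upper bound for a single response, as stated in the hunk above


def build_ranked_query(search_text, top=50, skip=0, tie_breaker="Rating desc"):
    """Build a search request body that pages results and breaks score ties.

    Hypothetical helper; key names are assumptions, not confirmed by the diff.
    """
    if top < 1 or top > MAX_TOP:
        raise ValueError(f"top must be between 1 and {MAX_TOP}")
    if skip < 0:
        raise ValueError("skip must not be negative")
    return {
        "search": search_text,
        "top": top,
        "skip": skip,
        # Order by score first, then by a sortable field so ties resolve the same way.
        "orderby": f"search.score() desc,{tie_breaker}",
    }


if __name__ == "__main__":
    # Second page of ten results for an illustrative query string.
    print(json.dumps(build_ranked_query("beach access", top=10, skip=10), indent=2))
```

Without the second sort key, two documents with the same score can swap positions between runs, which makes paging with `skip` unreliable.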
@@ -76,7 +73,7 @@ For scalability, Azure Cognitive Search distributes each index horizontally thro

 By default, the score of a document is calculated based on statistical properties of the data *within a shard*. This approach is generally not a problem for a large corpus of data, and it provides better performance than having to calculate the score based on information across all shards. That said, using this performance optimization could cause two very similar documents (or even identical documents) to end up with different relevance scores if they end up in different shards.

-If you prefer to compute the score based on the statistical properties across all shards, you can do so by adding *scoringStatistics=global* as a [query parameter](/rest/api/searchservice/search-documents) (or add *"scoringStatistics": "global"* as a body parameter of the [query request](/rest/api/searchservice/search-documents)).
+If you prefer to compute the score based on the statistical properties across all shards, you can do so by adding `scoringStatistics=global` as a [query parameter](/rest/api/searchservice/search-documents) (or add `"scoringStatistics": "global"` as a body parameter of the [query request](/rest/api/searchservice/search-documents)).

 ```http
 POST https://[service name].search.windows.net/indexes/hotels/docs/search?api-version=2020-06-30
@@ -86,7 +83,7 @@ POST https://[service name].search.windows.net/indexes/hotels/docs/search?api-ve
 }
 ```

-Using scoringStatistics will ensure that all shards in the same replica provide the same results. That said, different replicas may be slightly different from one another as they are always getting updated with the latest changes to your index. In some scenarios, you may want your users to get more consistent results during a "query session". In such scenarios, you can provide a `sessionId` as part of your queries. The `sessionId` is a unique string that you create to refer to a unique user session.
+Using `scoringStatistics` will ensure that all shards in the same replica provide the same results. That said, different replicas may be slightly different from one another as they're always getting updated with the latest changes to your index. In some scenarios, you may want your users to get more consistent results during a "query session". In such scenarios, you can provide a `sessionId` as part of your queries. The `sessionId` is a unique string that you create to refer to a unique user session.

 ```http
 POST https://[service name].search.windows.net/indexes/hotels/docs/search?api-version=2020-06-30
@@ -96,7 +93,7 @@ POST https://[service name].search.windows.net/indexes/hotels/docs/search?api-ve
 }
 ```

-As long as the same `sessionId` is used, a best-effort attempt will be made to target the same replica, increasing the consistency of results your users will see.
+As long as the same `sessionId` is used, a best-effort attempt is made to target the same replica, increasing the consistency of results your users will see.

 > [!NOTE]
 > Reusing the same `sessionId` values repeatedly can interfere with the load balancing of the requests across replicas and adversely affect the performance of the search service. The value used as sessionId cannot start with a '_' character.
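
The two consistency options covered in the preceding hunks, `scoringStatistics` and `sessionId`, can be combined in a single request body. The following is a minimal sketch, assuming a hypothetical helper named `build_consistent_query`. The `scoringStatistics` and `sessionId` keys and the leading-underscore rule come from the text above; the `search` key and the example session string are assumptions.

```python
def build_consistent_query(search_text, session_id=None, global_stats=True):
    """Build a search request body that favors consistent scores across runs.

    Hypothetical helper for illustration; it only assembles the body and does
    not send a request.
    """
    body = {"search": search_text}
    if global_stats:
        # Compute scoring statistics across all shards instead of per shard.
        body["scoringStatistics"] = "global"
    if session_id is not None:
        # The note above states that the value cannot start with '_'.
        if session_id.startswith("_"):
            raise ValueError("sessionId cannot start with '_'")
        # Best-effort routing of the session's queries to the same replica.
        body["sessionId"] = session_id
    return body


if __name__ == "__main__":
    # 'user-1234-session' is a made-up identifier for one user session.
    print(build_consistent_query("hotels with a view", session_id="user-1234-session"))
```

Per the note above, generating a fresh `sessionId` per user session rather than reusing one value for all traffic avoids interfering with load balancing across replicas.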
@@ -111,7 +108,7 @@ A scoring profile is part of the index definition, composed of weighted fields,

 ## featuresMode parameter (preview)

-[Search Documents](/rest/api/searchservice/preview-api/search-documents) requests have a new [featuresMode](/rest/api/searchservice/preview-api/search-documents#featuresmode) parameter that can provide additional detail about relevance at the field level. Whereas the `@searchScore` is calculated for the document all-up (how relevant is this document in the context of this query), through featuresMode you can get information about individual fields, as expressed in a `@search.features` structure. The structure contains all fields used in the query (either specific fields through **searchFields** in a query, or all fields attributed as **searchable** in an index). For each field, you get the following values:
+[Search Documents](/rest/api/searchservice/preview-api/search-documents) requests have a new [featuresMode](/rest/api/searchservice/preview-api/search-documents#featuresmode) parameter that can provide more detail about relevance at the field level. Whereas the `@searchScore` is calculated for the document all-up (how relevant is this document in the context of this query), through featuresMode you can get information about individual fields, as expressed in a `@search.features` structure. The structure contains all fields used in the query (either specific fields through **searchFields** in a query, or all fields attributed as **searchable** in an index). For each field, you get the following values:

 + Number of unique tokens found in the field
 + Similarity score, or a measure of how similar the content of the field is, relative to the query term
@@ -134,6 +131,9 @@ For a query that targets the "description" and "title" fields, a response that i
 "similarityScore": 1.75451557,
 "termFrequency" : 6
 }
+}
+}
+]
 ```

 You can consume these data points in [custom scoring solutions](https://github.com/Azure-Samples/search-ranking-tutorial) or use the information to debug search relevance problems.