Commit 2ff938c

Merge pull request #264811 from HeidiSteen/heidist-fix
[azure search] refactor vector concept articles
2 parents f1a5e46 + 50aba94 commit 2ff938c

7 files changed, +129 -36 lines changed

articles/search/TOC.yml

Lines changed: 4 additions & 4 deletions
@@ -132,7 +132,7 @@
  - name: Search index
  href: search-what-is-an-index.md
  - name: Vector store
- href: vector-search-overview.md
+ href: vector-store.md
  - name: Knowledge store
  href: knowledge-store-concept-intro.md
  - name: Data import strategies
@@ -163,11 +163,11 @@
  items:
  - name: Semantic ranking
  href: semantic-search-overview.md
- - name: Scoring in keyword queries (BM25)
+ - name: Keyword queries (BM25 ranking)
  href: index-similarity-and-scoring.md
- - name: Scoring in vector queries
+ - name: Vector queries
  href: vector-search-ranking.md
- - name: Scoring in hybrid queries (RRF)
+ - name: Hybrid queries (RRF)
  href: hybrid-search-ranking.md
  - name: Security
  items:

articles/search/index-similarity-and-scoring.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ ms.topic: conceptual
  ms.date: 09/27/2023
  ---

- # Relevance scoring for full text search (BM25)
+ # Relevance in keyword search (BM25 scoring)

  This article explains the BM25 relevance scoring algorithm used to compute search scores for [full text search](search-lucene-query-architecture.md). BM25 relevance is exclusive to full text search. Filter queries, autocomplete and suggested queries, wildcard search or fuzzy search queries aren't scored or ranked for relevance.

articles/search/search-lucene-query-architecture.md

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ ms.date: 10/09/2023

  # Full text search in Azure AI Search

- Full text search is an approach in information retrieval that matches on plain text content stored in an index. For example, given a query string "hotels in San Diego on the beach", the search engine looks for content containing those terms. To make scans more efficient, query strings undergo lexical analysis: lower-casing all terms, removing stop words like "the", and reducing terms to primitive root forms. When matching terms are found, the search engine retrieves documents, ranks them in order of relevance, and returns the top results.
+ Full text search is an approach in information retrieval that matches on plain text stored in an index. For example, given a query string "hotels in San Diego on the beach", the search engine looks for tokenized strings based on those terms. To make scans more efficient, query strings undergo lexical analysis: lower-casing all terms, removing stop words like "the", and reducing terms to primitive root forms. When matching terms are found, the search engine retrieves documents, ranks them in order of relevance, and returns the top results.

  Query execution can be complex. This article is for developers who need a deeper understanding of how full text search works in Azure AI Search. For text queries, Azure AI Search seamlessly delivers expected results in most scenarios, but occasionally you might get a result that seems "off" somehow. In these situations, having a background in the four stages of Lucene query execution (query parsing, lexical analysis, document matching, scoring) can help you identify specific changes to query parameters or index configuration that produce the desired outcome.
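
The lexical analysis steps named in the changed paragraph (lower-casing, stop-word removal, reduction to root forms) can be sketched in a few lines of Python. This is a toy under stated assumptions: `STOP_WORDS` is a hypothetical list and `naive_stem` is a crude plural stripper. Neither is the analyzer that Lucene or Azure AI Search actually runs.

```python
import re

# Hypothetical stop-word list for illustration only; real analyzers ship their own.
STOP_WORDS = {"the", "in", "on", "a", "an", "of"}


def naive_stem(term: str) -> str:
    """Strip a trailing plural 's'; a stand-in for a real stemmer."""
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return term


def analyze(query: str) -> list:
    """Lower-case the query, split it into tokens, drop stop words, and stem."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]


print(analyze("hotels in San Diego on the beach"))
# ['hotel', 'san', 'diego', 'beach']
```

The tokens that survive analysis, not the raw query string, are what the engine would try to match, which is the point of the "tokenized strings" wording in the revised paragraph.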

articles/search/vector-search-how-to-query.md

Lines changed: 12 additions & 6 deletions
@@ -44,7 +44,7 @@ All results are returned in plain text, including vectors in fields marked as `r

  If you aren't sure whether your search index already has vector fields, look for:

- + A non-empty `vectorSearch` property containing algorithms and other vector-related configurations embedded in the index schema.
+ + A nonempty `vectorSearch` property containing algorithms and other vector-related configurations embedded in the index schema.

  + In the fields collection, look for fields of type `Collection(Edm.Single)` with a `dimensions` attribute, and a `vectorSearch` section in the index.
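
The two checks in the list above can be applied to an index definition that has already been loaded as a Python dictionary. This is a minimal sketch: the function name `has_vector_fields` and the sample schema are hypothetical, and the sample is abbreviated, so the exact property names inside `vectorSearch` may differ by API version.

```python
def has_vector_fields(index_schema: dict) -> bool:
    """Return True if the schema shows both signals of vector support."""
    # Signal 1: a nonempty top-level vectorSearch section.
    has_config = bool(index_schema.get("vectorSearch"))
    # Signal 2: at least one Collection(Edm.Single) field with a dimensions attribute.
    has_field = any(
        field.get("type") == "Collection(Edm.Single)" and field.get("dimensions")
        for field in index_schema.get("fields", [])
    )
    return has_config and has_field


# Abbreviated, illustrative schema; not a complete index definition.
sample_schema = {
    "name": "example-index",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "contentVector", "type": "Collection(Edm.Single)", "dimensions": 1536},
    ],
    "vectorSearch": {
        "algorithmConfigurations": [{"name": "example-config", "kind": "hnsw"}]
    },
}

print(has_vector_fields(sample_schema))
```

Requiring both signals mirrors the second bullet, which pairs the field type with a `vectorSearch` section in the index.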

@@ -398,7 +398,7 @@ REST API version [**2023-07-01-Preview**](/rest/api/searchservice/index-preview)

  In the following example, the vector is a representation of this query string: "what Azure services support full text search". The query targets the "contentVector" field. The actual vector has 1536 embeddings, so it's trimmed in this example for readability.

- In this API version, there's no pre-filter support or `vectorFilterMode` parameter. The filter criteria are applied after the search engine executes the vector query. The set of `"k"` nearest neighbors is retrieved, and then combined with the set of filtered results. As such, the value of `"k"` predetermines the surface over which the filter is applied. For `"k": 10`, the filter is applied to 10 most similar documents. For `"k": 100`, the filter iterates over 100 documents (assuming the index contains 100 documents that are sufficiently similar to the query).
+ In this API version, there's no prefilter support or `vectorFilterMode` parameter. The filter criteria are applied after the search engine executes the vector query. The set of `"k"` nearest neighbors is retrieved, and then combined with the set of filtered results. As such, the value of `"k"` predetermines the surface over which the filter is applied. For `"k": 10`, the filter is applied to 10 most similar documents. For `"k": 100`, the filter iterates over 100 documents (assuming the index contains 100 documents that are sufficiently similar to the query).

  ```http
  POST https://{{search-service-name}}.search.windows.net/indexes/{{index-name}}/docs/search?api-version=2023-07-01-Preview
@@ -577,14 +577,20 @@ Search results are composed of "retrievable" fields from your search index. A re
  + All "retrievable" fields (a REST API default).
  + Fields explicitly listed in a "select" parameter on the query.

- The examples in this article used a "select" statement to specify text (non-vector) fields in the response.
+ The examples in this article used a "select" statement to specify text (nonvector) fields in the response.

  > [!NOTE]
  > Vectors aren't designed for readability, so avoid returning them in the response. Instead, choose non-vector fields that are representative of the search document. For example, if the query targets a "descriptionVector" field, return an equivalent text field if you have one ("description") in the response.

- ### Number of results
+ ### Number of ranked results in a vector query response

- A query might match to any number of documents, as many as all of them if the search criteria are weak (for example "search=*" for a null query). Because it's seldom practical to return unbounded results, you should specify a maximum for the response:
+ A vector query specifies the `k` parameter, which determines how many matches are returned in the results. The search engine always returns `k` number of matches. If `k` is larger than the number of documents in the index, then the number of documents determines the upper limit of what can be returned.
+
+ If you're familiar with full text search, you know to expect zero results if the index doesn't contain a term or phrase. However, in vector search, the search operation is identifying nearest neighbors, and it will always return `k` results even if the nearest neighbors aren't that similar. So, it's possible to get results for nonsensical or off-topic queries, especially if you aren't using prompts to set boundaries. Less relevant results have a worse similarity score, but they're still the "nearest" vectors if there isn't anything closer. As such, a response with no meaningful results can still return `k` results, but each result's similarity score would be low.
+
+ A [hybrid approach](hybrid-search-overview.md) that includes full text search can mitigate this problem. Another mitigation is to set a minimum threshold on the search score, but only if the query is a pure single vector query. Hybrid queries aren't conducive to minimum thresholds because the ranges are so much smaller and volatile.
+
+ Query parameters affecting result count include:

  + `"k": n` results for vector-only queries
  + `"top": n` results for hybrid queries that include a "search" parameter
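
The behavior described in the added paragraphs (a vector query returns `k` neighbors even when none is a good match, and a client-side minimum threshold can screen them out) can be sketched with an exhaustive search over toy vectors. Everything here is an assumption for illustration: the two-dimensional vectors, the helper names `nearest` and `apply_threshold`, and the `0.8` cutoff. The scores are raw cosine similarities, which aren't necessarily the values the service reports as its search score.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, docs, k):
    """Return up to k (doc_id, similarity) pairs, most similar first."""
    scored = [(doc_id, cosine(query, vector)) for doc_id, vector in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]  # never more than the number of documents


def apply_threshold(results, minimum):
    """Keep only results whose similarity meets the minimum."""
    return [(doc_id, score) for doc_id, score in results if score >= minimum]


# Toy index: three documents with made-up embeddings.
docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
off_topic_query = [0.0, -1.0]

results = nearest(off_topic_query, docs, k=2)
print(results)                        # two results, both with low similarity
print(apply_threshold(results, 0.8))  # nothing survives the cutoff
print(len(nearest(off_topic_query, docs, k=5)))  # capped at 3 documents
```

The threshold is applied after retrieval, so it only trims the `k` candidates already returned; it doesn't widen the search.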
@@ -595,7 +601,7 @@ Both "k" and "top" are optional. Unspecified, the default number of results in a

  Ranking of results is computed by either:

- + The similarity metric specified in the index `vectorSearch` section for a vector-only query. Valid values are `cosine` , `euclidean`, and `dotProduct`.
+ + The similarity metric specified in the index `vectorSearch` section for a vector-only query. Valid values are `cosine`, `euclidean`, and `dotProduct`.
  + Reciprocal Rank Fusion (RRF) if there are multiple sets of search results.

  Azure OpenAI embedding models use cosine similarity, so if you're using Azure OpenAI embedding models, `cosine` is the recommended metric. Other supported ranking metrics include `euclidean` and `dotProduct`.
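
Reciprocal Rank Fusion, named in the list above as the ranking method when there are multiple result sets, is commonly described as summing `1 / (c + rank)` for each document across the ranked lists. The sketch below follows that general description. The function name `rrf`, the document IDs, and the constant `c = 60` (a frequently cited default) are assumptions and aren't taken from this diff.

```python
from collections import defaultdict


def rrf(rankings, c=60):
    """Fuse ranked lists of document IDs into one list of (doc_id, score)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    # Highest fused score first; break ties by ID so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result sets from a keyword query and a vector query.
text_ranking = ["d1", "d2", "d3"]
vector_ranking = ["d3", "d1", "d4"]

for doc_id, score in rrf([text_ranking, vector_ranking]):
    print(doc_id, round(score, 4))
```

With two lists and this constant, even the top fused score is only about 0.03, which is consistent with the diff's remark that hybrid score ranges are small and poorly suited to minimum thresholds.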

articles/search/vector-search-overview.md

Lines changed: 4 additions & 4 deletions
@@ -1,7 +1,7 @@
  ---
  title: Vector search
  titleSuffix: Azure AI Search
- description: Describes concepts, scenarios, and availability of the vector search feature in Azure AI Search.
+ description: Describes concepts, scenarios, and availability of vector capabilities in Azure AI Search.

  author: robertklee
  ms.author: robertlee
@@ -12,11 +12,11 @@ ms.topic: conceptual
  ms.date: 01/29/2024
  ---

- # Vector stores and vector search in Azure AI Search
+ # Vectors in Azure AI Search

  Vector search is an approach in information retrieval that stores numeric representations of content for search scenarios. Because the content is numeric rather than plain text, the search engine matches on vectors that are the most similar to the query, with no requirement for matching on exact terms.

- This article is a high-level introduction to vector support in Azure AI Search. It also explains integration with other Azure services and covers [terminology and concepts](#vector-search-concepts) related to vector search development.
+ This article is a high-level introduction to vectors in Azure AI Search. It also explains integration with other Azure services and covers [terminology and concepts](#vector-search-concepts) related to vector search development.

  We recommend this article for background, but if you'd rather get started, follow these steps:

@@ -110,7 +110,7 @@ In order to create effective embeddings for vector search, it's important to tak

  ### What is the embedding space?

- *Embedding space* is the corpus for vector queries. Within a search index, it's all of the vector fields populated with embeddings from the same embedding model. Machine learning models create the embedding space by mapping individual words, phrases, or documents (for natural language processing), images, or other forms of data into a representation comprised of a vector of real numbers representing a coordinate in a high-dimensional space. In this embedding space, similar items are located close together, and dissimilar items are located farther apart.
+ *Embedding space* is the corpus for vector queries. Within a search index, an embedding space is all of the vector fields populated with embeddings from the same embedding model. Machine learning models create the embedding space by mapping individual words, phrases, or documents (for natural language processing), images, or other forms of data into a representation comprised of a vector of real numbers representing a coordinate in a high-dimensional space. In this embedding space, similar items are located close together, and dissimilar items are located farther apart.

  For example, documents that talk about different species of dogs would be clustered close together in the embedding space. Documents about cats would be close together, but farther from the dogs cluster while still being in the neighborhood for animals. Dissimilar concepts such as cloud computing would be much farther away. In practice, these embedding spaces are abstract and don't have well-defined, human-interpretable meanings, but the core idea stays the same.
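
The dogs, cats, and cloud computing example can be made concrete with a toy embedding space. The two-dimensional coordinates below are invented for illustration (real embeddings have hundreds or thousands of dimensions and come from a model), and `neighbors` is a hypothetical helper that orders the other items by Euclidean distance.

```python
import math

# Made-up 2-D coordinates standing in for model-generated embeddings.
embeddings = {
    "beagle": (1.0, 1.0),
    "poodle": (1.2, 0.9),
    "siamese cat": (2.0, 2.2),
    "tabby cat": (2.1, 2.0),
    "cloud computing": (9.0, 8.5),
}


def neighbors(name, space):
    """List every other item with its Euclidean distance, closest first."""
    origin = space[name]
    others = [
        (other, math.dist(origin, vector))
        for other, vector in space.items()
        if other != name
    ]
    return sorted(others, key=lambda pair: pair[1])


for other, distance in neighbors("beagle", embeddings):
    print(f"{other}: {distance:.2f}")
```

In this toy space the other dog is closest to "beagle", the cats come next, and "cloud computing" is far away, which is the clustering the paragraph describes.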