Commit cc8975c

Merge pull request #279809 from wmwxwa/patch-24
Doc campaign
2 parents 609cffd + 90adac0 commit cc8975c

8 files changed: +148 −5 lines changed

articles/cosmos-db/TOC.yml

Lines changed: 23 additions & 5 deletions
```diff
@@ -9,21 +9,39 @@
       href: faq.yml
     - name: Try Azure Cosmos DB free
       href: try-free.md
-    - name: Azure AI Advantage free trial
-      href: ai-advantage.md
     - name: Choose an API
       href: choose-api.md
     - name: Distributed NoSQL
       href: distributed-nosql.md
     - name: Distributed relational
       href: distributed-relational.md
-    - name: Vector databases in Azure Cosmos DB
+    - name: Integrated vector databases
       expanded: true
       items:
-        - name: Vector Databases
+        - name: What is a vector database
           href: vector-database.md
-        - name: Vector Search in Azure Cosmos DB NoSQL
+        - name: Vector database in Azure Cosmos DB NoSQL
           href: nosql/vector-search.md
+        - name: Vector database in Azure Cosmos DB for MongoDB
+          href: mongodb/vcore/vector-search.md
+        - name: Related concepts
+          items:
+            - name: Vector search overview
+              href: gen-ai/vector-search-overview.md
+            - name: Tokens
+              href: gen-ai/tokens.md
+            - name: Vector embeddings
+              href: gen-ai/vector-embeddings.md
+            - name: Distance functions
+              href: gen-ai/distance-functions.md
+            - name: kNN vs ANN
+              href: gen-ai/knn-vs-ann.md
+    - name: Unified AI database
+      items:
+        - name: AI agent
+          href: ai-agents.md
+        - name: Azure AI Advantage free trial
+          href: ai-advantage.md
     - name: NoSQL
       href: nosql/toc.yml
     - name: MongoDB
```
articles/cosmos-db/gen-ai/distance-functions.md

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@

---
title: Distance functions
description: Distance functions overview.
author: wmwxwa
ms.author: wangwilliam
ms.service: cosmos-db
ms.topic: conceptual
ms.date: 07/01/2024
---

# What are distance functions?

Distance functions are mathematical formulas used to measure the similarity or dissimilarity between vectors (see [vector search](vector-search-overview.md)). Common examples include Manhattan distance, Euclidean distance, cosine similarity, and dot product. These measurements are crucial for determining how closely related two pieces of data are.

## Manhattan distance

Manhattan distance measures the distance between two points by adding up the absolute differences of their coordinates. Imagine walking in a grid-like city, such as many neighborhoods in Manhattan: it is the total number of blocks you walk north-south and east-west.

## Euclidean distance

Euclidean distance measures the straight-line distance between two points. It is named after the ancient Greek mathematician Euclid, who is often referred to as the "father of geometry".

## Cosine similarity

Cosine similarity measures the cosine of the angle between two vectors projected in a multidimensional space. Two documents may be far apart by Euclidean distance because of differing document sizes, yet still have a small angle between them and therefore high cosine similarity.

## Dot product

The dot product multiplies two vectors to return a single number. It combines the two vectors' magnitudes with the cosine of the angle between them, showing how far one vector extends in the direction of the other.
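
To make these concrete, here is a minimal plain-Python sketch of all four measures (illustrative only; these helper functions are not part of any Azure Cosmos DB SDK):

```python
import math

def manhattan_distance(a, b):
    # Sum of absolute coordinate differences ("city blocks" walked).
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean_distance(a, b):
    # Straight-line distance between the two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot_product(a, b):
    # Single number combining magnitudes and the angle between the vectors.
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dot product normalized by magnitudes; depends only on the angle.
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))

v1, v2 = [1.0, 2.0, 3.0], [2.0, 2.0, 1.0]
print(manhattan_distance(v1, v2))  # 3.0
print(euclidean_distance(v1, v2))  # ~2.236
print(dot_product(v1, v2))         # 9.0
print(cosine_similarity(v1, v2))   # ~0.802
```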

## Related content

- [VectorDistance system function](../nosql/query/vectordistance.md) in Azure Cosmos DB NoSQL
articles/cosmos-db/gen-ai/knn-vs-ann.md

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@

---
title: kNN vs ANN
description: Explanation and comparison of kNN and ANN algorithms.
author: wmwxwa
ms.author: wangwilliam
ms.service: cosmos-db
ms.topic: conceptual
ms.date: 07/01/2024
---

# kNN vs ANN

Two popular vector search algorithms are k-Nearest Neighbors (kNN) and Approximate Nearest Neighbors (ANN, not to be confused with Artificial Neural Network). kNN is precise but computationally intensive, making it less suitable for large datasets. ANN, on the other hand, offers a balance between accuracy and efficiency, making it better suited for large-scale applications.

## How kNN works

1. Vectorization: Each data point in the dataset is represented as a vector in a multi-dimensional space.
2. Distance Calculation: To classify a new data point (query point), the algorithm calculates the distance between the query point and all other points in the dataset using a [distance function](distance-functions.md).
3. Finding Neighbors: The algorithm identifies the k closest data points (neighbors) to the query point based on the calculated distances. The value of k (the number of neighbors) is crucial: a small k can be sensitive to noise, while a large k can smooth out details.
4. Making Predictions:
   - Classification: For classification tasks, kNN assigns the query point the class label that is most common among the k neighbors. Essentially, it performs a "majority vote."
   - Regression: For regression tasks, kNN predicts the value for the query point as the average (or sometimes weighted average) of the values of the k neighbors.
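
As an illustration, here is a minimal brute-force kNN classifier in Python that follows the steps above (a toy sketch with made-up data, not a production implementation):

```python
import math
from collections import Counter

def knn_classify(query, dataset, k=3):
    """dataset is a list of (vector, label) pairs; returns the majority
    label among the k nearest neighbors of the query vector."""
    # Step 2: compute the distance from the query to every point.
    distances = [(math.dist(query, vector), label) for vector, label in dataset]
    # Step 3: keep the k closest points.
    neighbors = sorted(distances)[:k]
    # Step 4: majority vote over the neighbors' labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

data = [([1.0, 1.0], "A"), ([1.2, 0.8], "A"), ([4.0, 4.2], "B"), ([3.8, 4.0], "B")]
print(knn_classify([1.1, 0.9], data))  # "A"
```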

## How ANN works

1. Vectorization: Each data point in the dataset is represented as a vector in a multi-dimensional space.
2. Indexing and Data Structures: ANN algorithms use advanced data structures (for example, KD-trees, locality-sensitive hashing, or graph-based methods) to index the data points, allowing for faster searches.
3. Distance Calculation: Instead of calculating the exact distance to every point, ANN algorithms use heuristics to quickly identify regions of the space that are likely to contain the nearest neighbors.
4. Finding Neighbors: The algorithm identifies a set of data points that are likely to be close to the query point. These neighbors are not guaranteed to be the exact closest points but are close enough for practical purposes.
5. Making Predictions:
   - Classification: For classification tasks, ANN assigns the query point the class label that is most common among the identified neighbors, similar to kNN.
   - Regression: For regression tasks, ANN predicts the value for the query point as the average (or weighted average) of the values of the identified neighbors.
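
For intuition, the following toy sketch implements one classic ANN indexing technique, locality-sensitive hashing with random hyperplanes. It is illustrative only; production systems use far more sophisticated structures such as HNSW or DiskANN:

```python
import math
import random

random.seed(0)
DIM, NUM_PLANES = 8, 6

# Step 2 (indexing): hash each vector to a short bit signature. Vectors that
# fall on the same side of every random hyperplane share a bucket.
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PLANES)]

def signature(v):
    return tuple(sum(p * x for p, x in zip(plane, v)) >= 0 for plane in planes)

def build_index(vectors):
    index = {}
    for i, v in enumerate(vectors):
        index.setdefault(signature(v), []).append(i)
    return index

def ann_search(query, vectors, index, k=3):
    # Step 3: only examine the query's own bucket (a heuristic), instead of
    # scanning the full dataset the way exact kNN would.
    candidates = list(index.get(signature(query), []))
    # Step 4: rank the candidate set by exact distance.
    candidates.sort(key=lambda i: math.dist(query, vectors[i]))
    return candidates[:k]

vectors = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(200)]
index = build_index(vectors)
print(ann_search(vectors[0], vectors, index))  # likely, not guaranteed, nearest
```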

articles/cosmos-db/gen-ai/tokens.md

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@

---
title: Tokens
description: Overview of tokens in the context of large language models.
author: wmwxwa
ms.author: wangwilliam
ms.service: cosmos-db
ms.topic: conceptual
ms.date: 07/01/2024
---

# What are tokens?

Tokens are small chunks of text generated by splitting the input text into smaller segments. These segments can either be words or groups of characters, varying in length from a single character to an entire word. For instance, the word "hamburger" would be divided into tokens such as "ham", "bur", and "ger", while a short and common word like "pear" would be considered a single token. LLMs like GPT-3.5 or GPT-4 break words into tokens for processing.
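
You can inspect this splitting yourself with OpenAI's open-source `tiktoken` tokenizer. This is a minimal sketch assuming `pip install tiktoken`; the exact splits depend on the encoding used:

```python
import tiktoken

# cl100k_base is the encoding used by the GPT-3.5 and GPT-4 model families.
enc = tiktoken.get_encoding("cl100k_base")

for word in ["hamburger", "pear"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{word} -> {len(token_ids)} token(s): {pieces}")
```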
articles/cosmos-db/gen-ai/vector-embeddings.md

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@

---
title: Vector embeddings
description: Vector embeddings overview.
author: wmwxwa
ms.author: wangwilliam
ms.service: cosmos-db
ms.topic: conceptual
ms.date: 07/01/2024
---

# What are vector embeddings?

Vectors, also known as embeddings or vector embeddings, are mathematical representations of data in a high-dimensional space. They represent various types of information, such as text, images, and audio, in a format that machine learning models can process. When an AI model receives text input, it first splits the text into tokens. Each token is then converted into its corresponding embedding. The model processes these embeddings through multiple layers, capturing complex patterns and relationships within the text. The output embeddings can then be converted back into tokens if needed, generating readable text.

Each embedding is a vector of floating-point numbers, such that the distance between two embeddings in the vector space is correlated with the semantic similarity between the two inputs in the original format. For example, if two texts are similar, then their vector representations should also be similar. These high-dimensional representations capture semantic meaning, making it easier to perform tasks like searching, clustering, and classifying.
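
As a sketch of how embeddings are produced and compared in practice, the following uses the `openai` Python package against an Azure OpenAI resource; the endpoint, key, and deployment name (`text-embedding-ada-002` here) are placeholders for your own values:

```python
import math
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
    api_key="<your-api-key>",                                    # placeholder
    api_version="2024-02-01",
)

response = client.embeddings.create(
    model="text-embedding-ada-002",  # your embedding deployment name
    input=["The cat sat on the mat.", "A feline rested on the rug."],
)
a = response.data[0].embedding
b = response.data[1].embedding

# Cosine similarity approaches 1.0 for semantically similar texts.
dot = sum(x * y for x, y in zip(a, b))
cosine = dot / (math.hypot(*a) * math.hypot(*b))
print(f"cosine similarity: {cosine:.3f}")
```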

Here are two examples of texts represented as vectors:

:::image type="content" source="../media/gen-ai/concepts/vector-examples.png" lightbox="../media/gen-ai/concepts/vector-examples.png" alt-text="Screenshot of vector examples.":::

Image source: [OpenAI](https://openai.com/index/introducing-text-and-code-embeddings/)

Each box containing floating-point numbers corresponds to a dimension, and each dimension corresponds to a feature or attribute that may or may not be comprehensible to humans. Large language model text embeddings typically have a few thousand dimensions, while more complex data models may have tens of thousands of dimensions.

Between the two vectors in the above example, some dimensions are similar while other dimensions differ, reflecting the similarities and differences in the meaning of the two phrases.

This image shows the spatial closeness of vectors that are similar, contrasting with vectors that are drastically different:

:::image type="content" source="../media/gen-ai/concepts/vector-closeness.png" lightbox="../media/gen-ai/concepts/vector-closeness.png" alt-text="Screenshot of vector closeness.":::

Image source: [OpenAI](https://openai.com/index/introducing-text-and-code-embeddings/)

You can see more examples in this [interactive visualization](https://openai.com/index/introducing-text-and-code-embeddings/#_1Vr7cWWEATucFxVXbW465e) that transforms data into a three-dimensional space.
articles/cosmos-db/gen-ai/vector-search-overview.md

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@

---
title: Vector search concept overview
description: Vector search concept overview.
author: wmwxwa
ms.author: wangwilliam
ms.service: cosmos-db
ms.topic: conceptual
ms.date: 07/01/2024
---

# What is vector search?

Vector search is a method that helps you find similar items based on their data characteristics rather than by exact matches on a property field. This technique is useful in applications such as searching for similar text, finding related images, making recommendations, or even detecting anomalies. It works by taking the [vector embeddings](vector-embeddings.md) of your data, created by using an embedding model such as [Azure OpenAI Embeddings](../../ai-services/openai/how-to/embeddings.md) or [Hugging Face on Azure](https://azure.microsoft.com/solutions/hugging-face-on-azure), and measuring the [distance](distance-functions.md) between the data vectors and your query vector. The data vectors that are closest to your query vector are the most semantically similar. Some well-known vector search algorithms include Hierarchical Navigable Small World (HNSW), Inverted File (IVF), and the state-of-the-art DiskANN.
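
At its core, vector search is a nearest-neighbor ranking. The following minimal Python sketch does an exact, brute-force scan over a handful of in-memory vectors to show the idea; an integrated service instead indexes vectors with algorithms like HNSW or DiskANN, and the tiny three-dimensional embeddings here are made up for illustration:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def vector_search(query_vector, documents, top_k=2):
    """documents is a list of (text, embedding) pairs; returns the top_k
    items ranked by similarity to the query vector."""
    scored = [(cosine_similarity(query_vector, emb), text) for text, emb in documents]
    return sorted(scored, reverse=True)[:top_k]

docs = [
    ("wireless headphones", [0.9, 0.1, 0.2]),
    ("bluetooth speaker",   [0.8, 0.2, 0.3]),
    ("garden hose",         [0.1, 0.9, 0.7]),
]
for score, text in vector_search([0.85, 0.15, 0.25], docs):
    print(f"{score:.3f}  {text}")
```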

This [interactive visualization](https://openai.com/index/introducing-text-and-code-embeddings/#_1Vr7cWWEATucFxVXbW465e) shows some examples of closeness and distance between vectors.

Using an integrated vector search feature offers an efficient way to store, index, and search high-dimensional vector data directly alongside other application data. This approach removes the need to migrate your data to costlier alternative vector databases and enables seamless integration of your AI-driven applications.
articles/cosmos-db/media/gen-ai/concepts/vector-closeness.png

27.2 KB (binary file, not shown)

articles/cosmos-db/media/gen-ai/concepts/vector-examples.png

34.7 KB (binary file, not shown)
