Commit cc8975c

Merge pull request #279809 from wmwxwa/patch-24
Doc campaign
2 parents 609cffd + 90adac0 commit cc8975c

8 files changed: +148 −5 lines changed

articles/cosmos-db/TOC.yml

Lines changed: 23 additions & 5 deletions
```diff
@@ -9,21 +9,39 @@
       href: faq.yml
     - name: Try Azure Cosmos DB free
       href: try-free.md
-    - name: Azure AI Advantage free trial
-      href: ai-advantage.md
     - name: Choose an API
       href: choose-api.md
     - name: Distributed NoSQL
       href: distributed-nosql.md
     - name: Distributed relational
       href: distributed-relational.md
-    - name: Vector databases in Azure Cosmos DB
+    - name: Integrated vector databases
       expanded: true
       items:
-        - name: Vector Databases
+        - name: What is a vector database
           href: vector-database.md
-        - name: Vector Search in Azure Cosmos DB NoSQL
+        - name: Vector database in Azure Cosmos DB NoSQL
           href: nosql/vector-search.md
+        - name: Vector database in Azure Cosmos DB for MongoDB
+          href: mongodb/vcore/vector-search.md
+        - name: Related concepts
+          items:
+            - name: Vector search overview
+              href: gen-ai/vector-search-overview.md
+            - name: Tokens
+              href: gen-ai/tokens.md
+            - name: Vector embeddings
+              href: gen-ai/vector-embeddings.md
+            - name: Distance functions
+              href: gen-ai/distance-functions.md
+            - name: kNN vs ANN
+              href: gen-ai/knn-vs-ann.md
+    - name: Unified AI database
+      items:
+        - name: AI agent
+          href: ai-agents.md
+        - name: Azure AI Advantage free trial
+          href: ai-advantage.md
     - name: NoSQL
       href: nosql/toc.yml
     - name: MongoDB
```
articles/cosmos-db/gen-ai/distance-functions.md

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@

---
title: Distance functions
description: Distance functions overview.
author: wmwxwa
ms.author: wangwilliam
ms.service: cosmos-db
ms.topic: conceptual
ms.date: 07/01/2024
---

# What are distance functions?

Distance functions are mathematical formulas used to measure the similarity or dissimilarity between vectors (see [vector search](vector-search-overview.md)). Common examples include Manhattan distance, Euclidean distance, cosine similarity, and dot product. These measurements are crucial for determining how closely related two pieces of data are.

## Manhattan distance

Manhattan distance measures the distance between two points by adding up the absolute differences of their coordinates. Imagine walking in a grid-like city, such as many neighborhoods in Manhattan: it is the total number of blocks you walk north-south and east-west.

## Euclidean distance

Euclidean distance measures the straight-line distance between two points. It is named after the ancient Greek mathematician Euclid, who is often referred to as the "father of geometry".

## Cosine similarity

Cosine similarity measures the cosine of the angle between two vectors projected in a multidimensional space. Two documents may be far apart by Euclidean distance because of differing document sizes, yet still have a small angle between them and therefore high cosine similarity.

## Dot product

The dot product multiplies two vectors to return a single number. It combines the two vectors' magnitudes with the cosine of the angle between them, showing how far one vector extends in the direction of the other.
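
To make these concrete, here is a minimal plain-Python sketch of all four measures (illustrative only; these helper functions are not part of any Azure Cosmos DB SDK):

```python
import math

def manhattan_distance(a, b):
    # Sum of absolute coordinate differences ("city blocks" walked).
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean_distance(a, b):
    # Straight-line distance between the two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot_product(a, b):
    # Single number combining magnitudes and the angle between the vectors.
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dot product normalized by magnitudes; depends only on the angle.
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))

v1, v2 = [1.0, 2.0, 3.0], [2.0, 2.0, 1.0]
print(manhattan_distance(v1, v2))  # 3.0
print(euclidean_distance(v1, v2))  # ~2.236
print(dot_product(v1, v2))         # 9.0
print(cosine_similarity(v1, v2))   # ~0.802
```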

## Related content

- [VectorDistance system function](../nosql/query/vectordistance.md) in Azure Cosmos DB NoSQL
articles/cosmos-db/gen-ai/knn-vs-ann.md

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@

---
title: kNN vs ANN
description: Explanation and comparison of kNN and ANN algorithms.
author: wmwxwa
ms.author: wangwilliam
ms.service: cosmos-db
ms.topic: conceptual
ms.date: 07/01/2024
---

# kNN vs ANN

Two popular vector search algorithms are k-Nearest Neighbors (kNN) and Approximate Nearest Neighbors (ANN, not to be confused with Artificial Neural Network). kNN is precise but computationally intensive, making it less suitable for large datasets. ANN, on the other hand, offers a balance between accuracy and efficiency, making it better suited for large-scale applications.

## How kNN works

1. Vectorization: Each data point in the dataset is represented as a vector in a multi-dimensional space.
2. Distance Calculation: To classify a new data point (query point), the algorithm calculates the distance between the query point and all other points in the dataset using a [distance function](distance-functions.md).
3. Finding Neighbors: The algorithm identifies the k closest data points (neighbors) to the query point based on the calculated distances. The value of k (the number of neighbors) is crucial: a small k can be sensitive to noise, while a large k can smooth out details.
4. Making Predictions:
   - Classification: For classification tasks, kNN assigns the query point the class label that is most common among the k neighbors. Essentially, it performs a "majority vote."
   - Regression: For regression tasks, kNN predicts the value for the query point as the average (or sometimes weighted average) of the values of the k neighbors.
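
As an illustration, here is a minimal brute-force kNN classifier in Python that follows the steps above (a toy sketch with made-up data, not a production implementation):

```python
import math
from collections import Counter

def knn_classify(query, dataset, k=3):
    """dataset is a list of (vector, label) pairs; returns the majority
    label among the k nearest neighbors of the query vector."""
    # Step 2: compute the distance from the query to every point.
    distances = [(math.dist(query, vector), label) for vector, label in dataset]
    # Step 3: keep the k closest points.
    neighbors = sorted(distances)[:k]
    # Step 4: majority vote over the neighbors' labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

data = [([1.0, 1.0], "A"), ([1.2, 0.8], "A"), ([4.0, 4.2], "B"), ([3.8, 4.0], "B")]
print(knn_classify([1.1, 0.9], data))  # "A"
```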

## How ANN works

1. Vectorization: Each data point in the dataset is represented as a vector in a multi-dimensional space.
2. Indexing and Data Structures: ANN algorithms use advanced data structures (for example, KD-trees, locality-sensitive hashing, or graph-based methods) to index the data points, allowing for faster searches.
3. Distance Calculation: Instead of calculating the exact distance to every point, ANN algorithms use heuristics to quickly identify regions of the space that are likely to contain the nearest neighbors.
4. Finding Neighbors: The algorithm identifies a set of data points that are likely to be close to the query point. These neighbors are not guaranteed to be the exact closest points but are close enough for practical purposes.
5. Making Predictions:
   - Classification: For classification tasks, ANN assigns the query point the class label that is most common among the identified neighbors, similar to kNN.
   - Regression: For regression tasks, ANN predicts the value for the query point as the average (or weighted average) of the values of the identified neighbors.
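
For intuition, the following toy sketch implements one classic ANN indexing technique, locality-sensitive hashing with random hyperplanes. It is illustrative only; production systems use far more sophisticated structures such as HNSW or DiskANN:

```python
import math
import random

random.seed(0)
DIM, NUM_PLANES = 8, 6

# Step 2 (indexing): hash each vector to a short bit signature. Vectors that
# fall on the same side of every random hyperplane share a bucket.
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PLANES)]

def signature(v):
    return tuple(sum(p * x for p, x in zip(plane, v)) >= 0 for plane in planes)

def build_index(vectors):
    index = {}
    for i, v in enumerate(vectors):
        index.setdefault(signature(v), []).append(i)
    return index

def ann_search(query, vectors, index, k=3):
    # Step 3: only examine the query's own bucket (a heuristic), instead of
    # scanning the full dataset the way exact kNN would.
    candidates = list(index.get(signature(query), []))
    # Step 4: rank the candidate set by exact distance.
    candidates.sort(key=lambda i: math.dist(query, vectors[i]))
    return candidates[:k]

vectors = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(200)]
index = build_index(vectors)
print(ann_search(vectors[0], vectors, index))  # likely, not guaranteed, nearest
```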

articles/cosmos-db/gen-ai/tokens.md

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@

---
title: Tokens
description: Overview of tokens in the context of large language models.
author: wmwxwa
ms.author: wangwilliam
ms.service: cosmos-db
ms.topic: conceptual
ms.date: 07/01/2024
---

# What are tokens?

Tokens are small chunks of text generated by splitting the input text into smaller segments. These segments can either be words or groups of characters, varying in length from a single character to an entire word. For instance, the word "hamburger" would be divided into tokens such as "ham", "bur", and "ger", while a short and common word like "pear" would be considered a single token. LLMs like GPT-3.5 or GPT-4 break words into tokens for processing.
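
You can inspect this splitting yourself with OpenAI's open-source `tiktoken` tokenizer. This is a minimal sketch assuming `pip install tiktoken`; the exact splits depend on the encoding used:

```python
import tiktoken

# cl100k_base is the encoding used by the GPT-3.5 and GPT-4 model families.
enc = tiktoken.get_encoding("cl100k_base")

for word in ["hamburger", "pear"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{word} -> {len(token_ids)} token(s): {pieces}")
```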
articles/cosmos-db/gen-ai/vector-embeddings.md

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@

---
title: Vector embeddings
description: Vector embeddings overview.
author: wmwxwa
ms.author: wangwilliam
ms.service: cosmos-db
ms.topic: conceptual
ms.date: 07/01/2024
---

# What are vector embeddings?

Vectors, also known as embeddings or vector embeddings, are mathematical representations of data in a high-dimensional space. They represent various types of information, such as text, images, and audio, in a format that machine learning models can process. When an AI model receives text input, it first splits the text into tokens. Each token is then converted into its corresponding embedding. The model processes these embeddings through multiple layers, capturing complex patterns and relationships within the text. The output embeddings can then be converted back into tokens if needed, generating readable text.

Each embedding is a vector of floating-point numbers, such that the distance between two embeddings in the vector space is correlated with the semantic similarity between the two inputs in the original format. For example, if two texts are similar, then their vector representations should also be similar. These high-dimensional representations capture semantic meaning, making it easier to perform tasks like searching, clustering, and classifying.
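
As a sketch of how embeddings are produced and compared in practice, the following uses the `openai` Python package against an Azure OpenAI resource; the endpoint, key, and deployment name (`text-embedding-ada-002` here) are placeholders for your own values:

```python
import math
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
    api_key="<your-api-key>",                                    # placeholder
    api_version="2024-02-01",
)

response = client.embeddings.create(
    model="text-embedding-ada-002",  # your embedding deployment name
    input=["The cat sat on the mat.", "A feline rested on the rug."],
)
a = response.data[0].embedding
b = response.data[1].embedding

# Cosine similarity approaches 1.0 for semantically similar texts.
dot = sum(x * y for x, y in zip(a, b))
cosine = dot / (math.hypot(*a) * math.hypot(*b))
print(f"cosine similarity: {cosine:.3f}")
```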

Here are two examples of texts represented as vectors:

:::image type="content" source="../media/gen-ai/concepts/vector-examples.png" lightbox="../media/gen-ai/concepts/vector-examples.png" alt-text="Screenshot of vector examples.":::

Image source: [OpenAI](https://openai.com/index/introducing-text-and-code-embeddings/)

Each box containing floating-point numbers corresponds to a dimension, and each dimension corresponds to a feature or attribute that may or may not be comprehensible to humans. Large language model text embeddings typically have a few thousand dimensions, while more complex data models may have tens of thousands of dimensions.

Between the two vectors in the above example, some dimensions are similar while other dimensions differ, reflecting the similarities and differences in the meaning of the two phrases.

This image shows the spatial closeness of vectors that are similar, contrasting with vectors that are drastically different:

:::image type="content" source="../media/gen-ai/concepts/vector-closeness.png" lightbox="../media/gen-ai/concepts/vector-closeness.png" alt-text="Screenshot of vector closeness.":::

Image source: [OpenAI](https://openai.com/index/introducing-text-and-code-embeddings/)

You can see more examples in this [interactive visualization](https://openai.com/index/introducing-text-and-code-embeddings/#_1Vr7cWWEATucFxVXbW465e) that transforms data into a three-dimensional space.
articles/cosmos-db/gen-ai/vector-search-overview.md

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@

---
title: Vector search concept overview
description: Vector search concept overview.
author: wmwxwa
ms.author: wangwilliam
ms.service: cosmos-db
ms.topic: conceptual
ms.date: 07/01/2024
---

# What is vector search?

Vector search is a method that helps you find similar items based on their data characteristics rather than by exact matches on a property field. This technique is useful in applications such as searching for similar text, finding related images, making recommendations, or even detecting anomalies. It works by taking the [vector embeddings](vector-embeddings.md) of your data, created by using an embedding model such as [Azure OpenAI Embeddings](../../ai-services/openai/how-to/embeddings.md) or [Hugging Face on Azure](https://azure.microsoft.com/solutions/hugging-face-on-azure), and measuring the [distance](distance-functions.md) between the data vectors and your query vector. The data vectors that are closest to your query vector are the most semantically similar. Some well-known vector search algorithms include Hierarchical Navigable Small World (HNSW), Inverted File (IVF), and the state-of-the-art DiskANN.
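
At its core, vector search is a nearest-neighbor ranking. The following minimal Python sketch does an exact, brute-force scan over a handful of in-memory vectors to show the idea; an integrated service instead indexes vectors with algorithms like HNSW or DiskANN, and the tiny three-dimensional embeddings here are made up for illustration:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def vector_search(query_vector, documents, top_k=2):
    """documents is a list of (text, embedding) pairs; returns the top_k
    items ranked by similarity to the query vector."""
    scored = [(cosine_similarity(query_vector, emb), text) for text, emb in documents]
    return sorted(scored, reverse=True)[:top_k]

docs = [
    ("wireless headphones", [0.9, 0.1, 0.2]),
    ("bluetooth speaker",   [0.8, 0.2, 0.3]),
    ("garden hose",         [0.1, 0.9, 0.7]),
]
for score, text in vector_search([0.85, 0.15, 0.25], docs):
    print(f"{score:.3f}  {text}")
```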

This [interactive visualization](https://openai.com/index/introducing-text-and-code-embeddings/#_1Vr7cWWEATucFxVXbW465e) shows some examples of closeness and distance between vectors.

Using an integrated vector search feature offers an efficient way to store, index, and search high-dimensional vector data directly alongside other application data. This approach removes the need to migrate your data to costlier alternative vector databases and enables seamless integration of your AI-driven applications.
articles/cosmos-db/media/gen-ai/concepts/vector-closeness.png

27.2 KB (binary file, not shown)

articles/cosmos-db/media/gen-ai/concepts/vector-examples.png

34.7 KB (binary file, not shown)
