Commit c8b4094

Python: Introducing vector and text search (#9345)
### Motivation and Context

This PR does the following things:

- Introduces TextSearch abstractions, including an implementation for Bing.
  - This consists of the TextSearch class, which implements three public search methods and handles the internals. The search methods are: `search`, which returns a string; `get_text_search_results`, which returns a TextSearchResult object; and `get_search_results`, which returns an object native to the search service (e.g. BingWebPages for Bing).
  - It also has a method called `create_{search_method}`, which returns a KernelFunction based on the search method. This function can be adapted by setting its parameters, and it lets you create a RAG pipeline easily, with custom names and descriptions for both the functions and the parameters.
- Introduces VectorSearch abstractions, including an implementation for Azure AI Search.
  - This consists of a VectorStoreBase class, which handles all the internals, and three public interfaces: vectorized_search (supply a vector), vectorizable_text_search (supply a string that gets vectorized downstream) and vector_text_search (supply a string). Each vector store record collection can pick and choose which ones it needs to support by importing one or more next to the VectorSearchBase class.
- Introduces VectorStoreTextSearch as a way to leverage text search against vector stores.
  - Since this builds on TextSearch, this is now the best way to create a super powerful RAG setup with your own data model.
- Adds all the related classes, samples and tests for the above.
- Also reorders the data folder, which might cause some slight breaking changes for the few stores that have the new vector store model.
- Adds additional IndexKinds and DistanceFunctions to stay in sync with dotnet.
- Renames VolatileStore and VolatileCollection to InMemoryVectorStore and InMemoryVectorCollection.

Closes #6832 #6833

### Contribution Checklist

- [x] The code builds clean without any errors or warnings
- [x] The PR follows the [SK Contribution Guidelines](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md) and the [pre-submission formatting script](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md#development-scripts) raises no violations
- [x] All unit tests pass, and I have added new tests where possible
- [ ] I didn't break anyone 😄

---------

Co-authored-by: Tao Chen <[email protected]>
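The three-method layering described for TextSearch can be pictured with a small, self-contained sketch. Everything here is hypothetical: `ToyTextSearch`, `ToyTextSearchResult`, the field names and the `PAGES` data are stand-ins invented for illustration, not the semantic_kernel classes. The sketch is also synchronous and returns lists, and it makes no claim about the signatures of the real methods.

```python
from dataclasses import dataclass


@dataclass
class ToyTextSearchResult:
    """Hypothetical normalized result; field names are assumptions."""

    name: str
    value: str
    link: str


class ToyTextSearch:
    """Hypothetical in-memory search service, not the semantic_kernel TextSearch."""

    def __init__(self, pages: list[dict[str, str]]) -> None:
        self._pages = pages

    def get_search_results(self, query: str, top: int = 2) -> list[dict[str, str]]:
        """Return service-native objects (here, the raw page dicts)."""
        terms = query.lower().split()
        hits = [p for p in self._pages if any(t in p["snippet"].lower() for t in terms)]
        return hits[:top]

    def get_text_search_results(self, query: str, top: int = 2) -> list[ToyTextSearchResult]:
        """Return results normalized into a service-independent shape."""
        native = self.get_search_results(query, top)
        return [ToyTextSearchResult(p["name"], p["snippet"], p["url"]) for p in native]

    def search(self, query: str, top: int = 2) -> list[str]:
        """Return only the text of each result."""
        return [r.value for r in self.get_text_search_results(query, top)]


# Made-up sample data for the sketch.
PAGES = [
    {"name": "Pool hotels", "snippet": "Hotels with a swimming pool", "url": "https://example.com/pool"},
    {"name": "Wifi guide", "snippet": "Finding a good internet connection", "url": "https://example.com/wifi"},
    {"name": "Parking", "snippet": "Where to park downtown", "url": "https://example.com/parking"},
]
searcher = ToyTextSearch(PAGES)
```

Each method builds on the more specific one below it, so a caller can pick the level of detail it needs: plain text for a prompt, a normalized result for citations, or the native object for service-specific fields.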
1 parent 7ca11a9 commit c8b4094

128 files changed: +5020 -2064 lines changed

.github/workflows/python-integration-tests.yml

Lines changed: 140 additions & 90 deletions
Large diffs are not rendered by default.

python/.cspell.json

Lines changed: 10 additions & 2 deletions
@@ -63,6 +63,14 @@
         "vectorizer",
         "vectorstoremodel",
         "vertexai",
-        "Weaviate"
+        "Weaviate",
+        "qdrant",
+        "huggingface",
+        "pytestmark",
+        "contoso",
+        "opentelemetry",
+        "SEMANTICKERNEL",
+        "OTEL",
+        "vectorizable"
     ]
-}
+}

python/.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -49,7 +49,7 @@ repos:
       - id: mypy
         files: ^python/semantic_kernel/
         name: mypy
-        entry: uv run mypy -p semantic_kernel --config-file python/mypy.ini
+        entry: cd python && uv run mypy -p semantic_kernel --config-file mypy.ini
         language: system
         types: [python]
         pass_filenames: false

python/.vscode/launch.json

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@
             "request": "launch",
             "program": "${file}",
             "console": "integratedTerminal",
-            "justMyCode": true
+            "justMyCode": false
         }
     ]
 }

python/pyproject.toml

Lines changed: 3 additions & 2 deletions
@@ -56,7 +56,7 @@ azure = [
     "azure-cosmos ~= 4.7"
 ]
 chroma = [
-    "chromadb >= 0.4,<0.6"
+    "chromadb >= 0.5,<0.6"
 ]
 google = [
     "google-cloud-aiplatform ~= 1.60",
@@ -79,7 +79,7 @@ milvus = [
     "milvus >= 2.3,<2.3.8; platform_system != 'Windows'"
 ]
 mistralai = [
-    "mistralai >= 0.4,< 2.0"
+    "mistralai >= 0.4,< 1.0"
 ]
 ollama = [
     "ollama ~= 0.2"
@@ -140,6 +140,7 @@ environments = [
 
 [tool.pytest.ini_options]
 addopts = "-ra -q -r fEX"
+asyncio_default_fixture_loop_scope = "function"
 
 [tool.ruff]
 line-length = 120

python/samples/concepts/memory/azure_ai_search_hotel_samples/__init__.py

Whitespace-only changes.
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+# Copyright (c) Microsoft. All rights reserved.
+
+
+from typing import Annotated, Any
+
+from pydantic import BaseModel
+
+from semantic_kernel.connectors.ai.open_ai import OpenAIEmbeddingPromptExecutionSettings
+from semantic_kernel.data import (
+    VectorStoreRecordDataField,
+    VectorStoreRecordKeyField,
+    VectorStoreRecordVectorField,
+    vectorstoremodel,
+)
+
+###
+# The data model used for this sample is based on the hotel data model from the Azure AI Search samples.
+# When deploying a new index in Azure AI Search using the import wizard you can choose to deploy the 'hotel-samples'
+# dataset, see here: https://learn.microsoft.com/en-us/azure/search/search-get-started-portal.
+# This is the dataset used in this sample with some modifications.
+# This model adds vectors for the 2 descriptions in English and French.
+# Both are based on the 1536 dimensions of the OpenAI models.
+# You can adjust this at creation time and then make the change below as well.
+###
+
+
+@vectorstoremodel
+class HotelSampleClass(BaseModel):
+    hotel_id: Annotated[str, VectorStoreRecordKeyField]
+    hotel_name: Annotated[str | None, VectorStoreRecordDataField()] = None
+    description: Annotated[
+        str,
+        VectorStoreRecordDataField(
+            has_embedding=True, embedding_property_name="description_vector", is_full_text_searchable=True
+        ),
+    ]
+    description_vector: Annotated[
+        list[float] | None,
+        VectorStoreRecordVectorField(
+            dimensions=1536,
+            local_embedding=True,
+            embedding_settings={"embedding": OpenAIEmbeddingPromptExecutionSettings(dimensions=1536)},
+        ),
+    ] = None
+    description_fr: Annotated[
+        str, VectorStoreRecordDataField(has_embedding=True, embedding_property_name="description_fr_vector")
+    ]
+    description_fr_vector: Annotated[
+        list[float] | None,
+        VectorStoreRecordVectorField(
+            dimensions=1536,
+            local_embedding=True,
+            embedding_settings={"embedding": OpenAIEmbeddingPromptExecutionSettings(dimensions=1536)},
+        ),
+    ] = None
+    category: Annotated[str, VectorStoreRecordDataField()]
+    tags: Annotated[list[str], VectorStoreRecordDataField()]
+    parking_included: Annotated[bool | None, VectorStoreRecordDataField()] = None
+    last_renovation_date: Annotated[str | None, VectorStoreRecordDataField()] = None
+    rating: Annotated[float, VectorStoreRecordDataField()]
+    location: Annotated[dict[str, Any], VectorStoreRecordDataField()]
+    address: Annotated[dict[str, str | None], VectorStoreRecordDataField()]
+    rooms: Annotated[list[dict[str, Any]], VectorStoreRecordDataField()]
Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
+# Copyright (c) Microsoft. All rights reserved.
+
+import asyncio
+
+###
+# The data model used for this sample is based on the hotel data model from the Azure AI Search samples.
+# When deploying a new index in Azure AI Search using the import wizard you can choose to deploy the 'hotel-samples'
+# dataset, see here: https://learn.microsoft.com/en-us/azure/search/search-get-started-portal.
+# This is the dataset used in this sample with some modifications.
+# This model adds vectors for the 2 descriptions in English and French.
+# Both are based on the 1536 dimensions of the OpenAI models.
+# You can adjust this at creation time and then make the change below as well.
+# This sample assumes the index is deployed, the vector fields can be empty.
+# If the vector fields are empty, change the first_run parameter to True to add the vectors.
+###
+from step_0_data_model import HotelSampleClass
+
+from semantic_kernel import Kernel
+from semantic_kernel.connectors.ai.open_ai import OpenAITextEmbedding
+from semantic_kernel.connectors.memory.azure_ai_search import AzureAISearchCollection
+from semantic_kernel.data import (
+    VectorSearchOptions,
+    VectorStoreRecordUtils,
+)
+
+first_run = False
+
+
+async def add_vectors(collection: AzureAISearchCollection, vectorizer: VectorStoreRecordUtils):
+    """This is a simple function that uses the VectorStoreRecordUtils to add vectors to the records in the collection.
+
+    It first uses the search_client within the collection to get a list of ids,
+    and then uses the upsert to add the vectors to the records.
+    """
+    ids: list[str] = [res.get("hotel_id") async for res in await collection.search_client.search(select="hotel_id")]
+    print("sample id:", ids[0])
+
+    hotels = await collection.get_batch(ids)
+    if hotels is not None and isinstance(hotels, list):
+        for hotel in hotels:
+            if not hotel.description_vector or not hotel.description_fr_vector:
+                hotel = await vectorizer.add_vector_to_records(hotel, HotelSampleClass)
+                await collection.upsert(hotel)
+
+
+async def main(query: str, first_run: bool = False):
+    # Create the kernel
+    kernel = Kernel()
+    # Add the OpenAI text embedding service
+    embeddings = OpenAITextEmbedding(service_id="embedding", ai_model_id="text-embedding-3-small")
+    kernel.add_service(embeddings)
+    # Create the VectorStoreRecordUtils object
+    vectorizer = VectorStoreRecordUtils(kernel)
+    # Create the Azure AI Search collection
+    collection = AzureAISearchCollection[HotelSampleClass](
+        collection_name="hotels-sample-index", data_model_type=HotelSampleClass
+    )
+    # Check if the collection exists.
+    if not await collection.does_collection_exist():
+        raise ValueError(
+            "Collection does not exist, please create using the "
+            "Azure AI Search portal wizard -> Import Data -> Samples -> hotels-sample. "
+            "During creation adapt the schema to add the description_vector and description_fr_vector fields. "
+            "Then run this sample with `first_run=True` to add the vectors."
+        )
+
+    # If it is the first run and there are no vectors, add them.
+    if first_run:
+        await add_vectors(collection, vectorizer)
+
+    # Search using just text, by default this will search all the searchable text fields in the index.
+    results = await collection.text_search(search_text=query)
+    print("Search results using text: ")
+    async for result in results.results:
+        print(
+            f"    {result.record.hotel_id} (in {result.record.address['city']}, "
+            f"{result.record.address['country']}): {result.record.description} (score: {result.score})"
+        )
+
+    print("\n")
+
+    # Generate the vector for the query
+    query_vector = (await embeddings.generate_raw_embeddings([query]))[0]
+
+    print("Search results using vector: ")
+    # Use vectorized search to search using the vector.
+    results = await collection.vectorized_search(
+        vector=query_vector,
+        options=VectorSearchOptions(vector_field_name="description_vector"),
+    )
+    async for result in results.results:
+        print(
+            f"    {result.record.hotel_id} (in {result.record.address['city']}, "
+            f"{result.record.address['country']}): {result.record.description} (score: {result.score})"
+        )
+
+    # Delete the collection object so that the connection is closed.
+    del collection
+    await asyncio.sleep(2)
+
+
+if __name__ == "__main__":
+    query = "swimming pool and good internet connection"
+    asyncio.run(main(query=query, first_run=first_run))
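Conceptually, the vectorized search in the sample ranks stored vectors by their similarity to the query vector. The following is a minimal in-memory sketch of that idea using cosine similarity. It is an assumption-laden illustration: `cosine_similarity`, `toy_vectorized_search` and the three-dimensional `RECORDS` are invented here, and the scoring Azure AI Search actually applies depends on how the index is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_vectorized_search(
    vector: list[float], records: dict[str, list[float]], top: int = 2
) -> list[tuple[str, float]]:
    """Return the `top` (record id, score) pairs, highest similarity first."""
    scored = [(rid, cosine_similarity(vector, vec)) for rid, vec in records.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]


# Made-up 3-dimensional vectors; real embeddings in the sample have 1536 dimensions.
RECORDS = {
    "1": [1.0, 0.0, 0.0],
    "2": [0.0, 1.0, 0.0],
    "3": [0.7, 0.7, 0.0],
}

for record_id, score in toy_vectorized_search([1.0, 0.1, 0.0], RECORDS):
    print(f"{record_id} (score: {score:.3f})")
```

A brute-force scan like this is only practical for small collections; a vector store typically relies on an index to avoid comparing the query against every record.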
