
Commit 197d2bc

COH-31963 - Create example for using Vectors in Python Client (#233)
* COH-31963 - Create example for using Vectors in Python Client
1 parent f4f98bc commit 197d2bc

File tree

5 files changed: +284 -7 lines changed


.pre-commit-config.yaml

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@ repos:
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files
+        exclude: \.json.gzip

  - repo: https://github.com/PyCQA/flake8
    rev: 7.1.1

examples/README.md

Lines changed: 15 additions & 6 deletions
@@ -13,10 +13,19 @@ Be sure a Coherence gRPC proxy is available for the examples to work against.
docker run -d -p 1408:1408 ghcr.io/oracle/coherence-ce:22.06.11
```

+> [!NOTE]
+> The Coherence AI [vector_search.py](vector_search.py) example requires installation of the `sentence-transformers` package so that the example code can use the `all-MiniLM-L6-v2` model for generating text embeddings:
+>
+> ```bash
+> python3 -m pip install sentence-transformers
+> ```
+
+
### The Examples
-* basics.py - basic CRUD operations
-* python_object_keys_and_values.py - shows how to use standard Python objects as keys or values of a cache
-* filters.py - using filters to filter results
-* processors.py - using entry processors to mutate cache entries on the server without get/put
-* aggregators.py - using entry aggregators to query a subset of entries to produce a result
-* events.py - demonstrates cache lifecycle and cache entry events
+* [basics.py](basics.py) - basic CRUD operations
+* [python_object_keys_and_values.py](python_object_keys_and_values.py) - shows how to use standard Python objects as keys or values of a cache
+* [filters.py](filters.py) - using filters to filter results
+* [processors.py](processors.py) - using entry processors to mutate cache entries on the server without get/put
+* [aggregators.py](aggregators.py) - using entry aggregators to query a subset of entries to produce a result
+* [events.py](events.py) - demonstrates cache lifecycle and cache entry events
+* [vector_search.py](vector_search.py) - shows how to use some of the Coherence AI features to store vectors and perform a k-nearest neighbors (k-nn) search on those vectors.
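
Before running the new example, it can be worth confirming that the `sentence-transformers` dependency and the `all-MiniLM-L6-v2` model referenced in the note above load correctly. The short sketch below is illustrative only (it is not part of the commit); it encodes one sentence and prints the embedding length, which should be 384 for this model:

```python
# Quick sanity check that sentence-transformers is installed and the
# all-MiniLM-L6-v2 model can be loaded (the model is downloaded on first use).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("star travel and space ships").tolist()
print(f"embedding length: {len(embedding)}")  # expected: 384 for all-MiniLM-L6-v2
```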

examples/movies.json.gzip

629 KB
Binary file not shown.

examples/vector_search.py

Lines changed: 267 additions & 0 deletions
@@ -0,0 +1,267 @@
# Copyright (c) 2025, Oracle and/or its affiliates.
# Licensed under the Universal Permissive License v 1.0 as shown at
# https://oss.oracle.com/licenses/upl.

import asyncio
import gzip
import json
from typing import Final, List

from sentence_transformers import SentenceTransformer

from coherence import NamedMap, Session
from coherence.ai import FloatVector, QueryResult, SimilaritySearch, Vectors
from coherence.extractor import Extractors, ValueExtractor
from coherence.filter import Filter, Filters

"""
18+
This example shows how to use some of the Coherence AI features to store
19+
vectors and perform a k-nearest neighbors (k-nn) search on those vectors to
20+
find matches for search text.
21+
22+
Coherence includes an implementation of the HNSW index which can be used to
23+
index vectors to improve search times.
24+
25+
Coherence is only a vector store so in order to actually create vectors from
26+
text snippets this example uses the `sentence-transformers` package to
27+
integrate with a model and produce vector embeddings from text.
28+
29+
This example has shows how easy it is to add vector search capabilities to
30+
cache data in Coherence and how to easily add HNSW indexes to those searches.
31+
It has not been optimised at all for speed of loading vector data or searches.
32+
33+
Coherence Vectors
34+
=================
35+
36+
Coherence Python client can handle a few different types of vector,
37+
this example will use the FloatVector type.
38+
39+
Just like any other data type in Coherence, vectors are stored in normal
40+
Coherence caches. The vector may be stored as the actual cache value,
41+
or it may be in a field of another type that is the cache value. Vector data
42+
is then loaded into Coherence the same way that any other data is loaded
43+
using the NamedMap API.
44+
45+
Movie Database
46+
==============
47+
48+
This example is going to build a small database of movies. The database is
49+
small because the data used is stored in the source repository along with the
50+
code. The same techniques could be used to load any of the freely available
51+
much larger JSON datasets with the required field names.
52+
53+
The Data Model
54+
==============
55+
56+
This example is not going to use an specialized classes to store the data in
57+
the cache. The dataset is a json file and the example will use Coherence json
58+
support to read and store the data.
59+
60+
The schema of the JSON movie data looks like this:
61+
62+
+--------------------+-------------------------------------------------------+
63+
| Field Name | Description |
64+
+====================+=======================================================+
65+
| title + The title of the movie |
66+
+--------------------+-------------------------------------------------------+
67+
| plot | A short summary of the plot of the movie |
68+
+--------------------+-------------------------------------------------------+
69+
| fullplot | A longer summary of the plot of the movie |
70+
+--------------------+-------------------------------------------------------+
71+
| cast + A list of the names of the actors in the movie |
72+
+--------------------+-------------------------------------------------------+
73+
| genres | A list of string values representing the different |
74+
| | genres the movie belongs to |
75+
+--------------------+-------------------------------------------------------+
76+
| runtime | How long the move runs for in minutes |
77+
+--------------------+-------------------------------------------------------+
78+
| poster | A link to the poster for the movie |
79+
+--------------------+-------------------------------------------------------+
80+
| languages | A list of string values representing the different |
81+
| | languages for the movie |
82+
+--------------------+-------------------------------------------------------+
83+
| directors | A list of the names of the directors of the movie |
84+
+--------------------+-------------------------------------------------------+
85+
| writers | A list of the names of the writers of the movie |
86+
+--------------------+-------------------------------------------------------+
87+
88+
This example uses the fullplot to create the vector embeddings for each
89+
movie. Other fields can be used by normal Coherence filters to further narrow
90+
down vector searches.
91+
92+
Searching Vectors
93+
=================
94+
95+
A common way to search data in Coherence caches is to use Coherence
96+
aggregators. The aggregator feature has been used to implement k-nearest
97+
neighbour (k-nn) vector searching using a new built-in aggregator named
98+
SimilaritySearch. When invoking a SimilaritySearch aggregator on a cache
99+
the results are returned as a list of QueryResult instances.
100+
101+
The SimilaritySearch aggregator is used to perform a Knn vector search on a
102+
cache in the same way that normal Coherence aggregators are used.
103+
"""
104+
105+
106+
class MovieRepository:
    """This class represents the repository of movies. It contains all the
    code to load and search movie data."""

    MODEL_NAME: Final[str] = "all-MiniLM-L6-v2"
    """
    This is a sentence-transformers model used for generating text embeddings.
    See https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
    """

    EMBEDDING_DIMENSIONS: Final[int] = 384
    """Embedding dimension for all-MiniLM-L6-v2."""

    VECTOR_FIELD: Final[str] = "embeddings"
    """The name of the field in the JSON containing the embeddings."""

    VALUE_EXTRACTOR: Final[ValueExtractor] = Extractors.extract(VECTOR_FIELD)
    """The ValueExtractor to extract the embeddings vector from the JSON."""

    def __init__(self, movies: NamedMap) -> None:
        """
        Creates an instance of the MovieRepository.

        :param movies: the Coherence NamedMap (cache) used to store the
                       movie data.

        """
        self.movies = movies
        # embedding model used to generate embeddings
        self.model = SentenceTransformer(self.MODEL_NAME)

    async def load(self, filename: str) -> None:
        """
        Loads the movie data into the NamedMap using the specified gzip file.

        :param filename: name of the gzipped movies JSON file.
        :return: None.
        """
        try:
            with gzip.open(filename, "rt", encoding="utf-8") as f:
                # the JSON data should be a JSON list of movie objects
                # (dictionaries) in the format described above.
                data = json.load(f)
        except FileNotFoundError:
            print("Error: The file was not found.")
            return
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
            return

        # iterate over the list of movie objects (dictionaries) to load them
        # into the Coherence cache.
        for movie in data:
            # get the title of the movie
            title: str = movie.get("title")
            # get the full plot of the movie
            full_plot: str = movie.get("fullplot")
            key: str = title
            # text of the full_plot converted to a vector.
            vector: FloatVector = self.vectorize(full_plot)
            # vector is added to the movie object.
            movie[self.VECTOR_FIELD] = vector
            # The movie object is added to the cache using the "title" field
            # as the cache key.
            await self.movies.put(key, movie)

    def vectorize(self, input_string: str) -> FloatVector:
        """The vectorize method takes a string value and returns a FloatVector."""

        # the model used to create embeddings for the input_string;
        # in this example the model used is all-MiniLM-L6-v2.
        embeddings: List[float] = self.model.encode(input_string).tolist()

        # The vector returned is normalized, which makes future operations on
        # the vector more efficient.
        return FloatVector(Vectors.normalize(embeddings))

    async def search(self, search_text: str, count: int, filter: Filter = Filters.always()) -> List[QueryResult]:
        """
        Searches the movies cache by converting the search_text into a vector
        and then using SimilaritySearch to find the nearest matches to the
        embeddings vector in the cached object. The count parameter is the
        number of nearest neighbours to search for. An optional filter
        parameter can be supplied to reduce the cache entries used to perform
        the k-nn search.

        :param search_text: the text to match against the movie full plot.
        :param count: the count of the nearest matches to return.
        :param filter: an optional Filter used to further reduce the movies
                       to be queried.
        :return: a List of QueryResult objects.
        """

        # create a FloatVector from the search_text
        vector: FloatVector = self.vectorize(search_text)
        # create the SimilaritySearch aggregator using the above vector and count.
        search: SimilaritySearch = SimilaritySearch(self.VALUE_EXTRACTOR, vector, count)
        # perform the k-nn search using the above aggregator and optional filter
        # and return a list of QueryResults.
        return await self.movies.aggregate(search, filter=filter)


# Name of the compressed gzip json file that has data for the movies.
MOVIE_JSON_FILENAME: Final[str] = "movies.json.gzip"


async def do_run() -> None:

    # Create a new session to the Coherence server using the default host and
    # port, i.e. localhost:1408
    session: Session = await Session.create()
    # Create a NamedMap called movies with key of str and value of dict
    movie_db: NamedMap[str, dict] = await session.get_map("movies")
    try:
        # an instance of class MovieRepository is created, passing the above
        # NamedMap as a parameter
        movies_repo = MovieRepository(movie_db)

        # All of the movie data from the file MOVIE_JSON_FILENAME is
        # processed and loaded into the movies_repo
        await movies_repo.load(MOVIE_JSON_FILENAME)

        # The search method is called on the movies_repo instance of class
        # MovieRepository. It takes a search_text parameter, which is the
        # text to convert to a vector and use to search the movie plots for
        # the nearest matches. The second parameter is a count of the number
        # of nearest neighbours to search for.
        #
        # Below, a search for five movies roughly based on "star travel and
        # space ships" is being done.
        results = await movies_repo.search("star travel and space ships", 5)
        print("Search results:")
        print("================")
        for e in results:
            print(f"key = {e.key}, distance = {e.distance}, plot = {e.value.get('plot')}")

        # The search method on the movies_repo instance can also include a
        # filter to reduce the cache entries used to perform the nearest
        # neighbours (k-nn) search.
        #
        # Below, movies with a plot similar to "star travel and space ships"
        # are searched for. In addition, a Filter is used to narrow down the
        # search, i.e. to movies that starred "Harrison Ford". The filter
        # is applied to the cast field of the JSON object.
        cast_extractor = Extractors.extract("cast")
        filter = Filters.contains(cast_extractor, "Harrison Ford")
        results = await movies_repo.search("star travel and space ships", 2, filter)
        print("\nResults with a filter of movies with cast as Harrison Ford")
        print("===========================================================")
        for e in results:
            print(f"key = {e.key}, distance = {e.distance}, plot = {e.value.get('plot')}")

    finally:
        await session.close()


asyncio.run(do_run())
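
The filtered query in `do_run` above targets the `cast` field, but the same pattern applies to any field in the movie JSON. The sketch below is a follow-on illustration (not part of the commit) that reuses only calls already shown in this file — `Extractors.extract`, `Filters.contains`, and `MovieRepository.search` — to restrict the k-nn search by genre; the `"Sci-Fi"` value is an assumption and depends on the genre strings actually present in `movies.json.gzip`.

```python
# Hypothetical follow-on query, assuming a MovieRepository that has already
# been loaded as in do_run() above. It restricts the k-nn search to movies
# whose "genres" list contains "Sci-Fi" (an assumed value) and prints the
# three nearest matches.
async def search_by_genre(movies_repo: MovieRepository) -> None:
    genre_filter = Filters.contains(Extractors.extract("genres"), "Sci-Fi")
    results = await movies_repo.search("star travel and space ships", 3, genre_filter)
    for e in results:
        print(f"key = {e.key}, distance = {e.distance}, genres = {e.value.get('genres')}")
```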

tests/e2e/test_ai.py

Lines changed: 1 addition & 1 deletion
@@ -162,7 +162,7 @@ async def _run_similarity_search_with_index(test_session: Session, index_type: s
    hnsw_result = await cache.aggregate(ss)
    end_time = time.perf_counter()
    elapsed_time = end_time - start_time
-    COH_LOG.info("Results below for test_SimilaritySearch with HnswIndex:")
+    COH_LOG.info("Results below for test_SimilaritySearch with " + index_type + ":")
    for e in hnsw_result:
        COH_LOG.info(e)
    COH_LOG.info(f"Elapsed time: {elapsed_time} seconds")

0 commit comments

Comments
 (0)