bannerChildren: true
---

A Redis [vector set]({{< relref "/develop/data-types/vector-sets" >}}) lets
you store a set of unique keys, each with its own associated vector.
You can then retrieve keys from the set according to the similarity between
their stored vectors and a query vector that you specify.

You can use vector sets to store any type of numeric vector, but they are
particularly optimized to work with text embedding vectors (see
[Redis for AI]({{< relref "/develop/ai" >}}) to learn more about text
embeddings). The example below shows how to use the
[`sentence-transformers`](https://pypi.org/project/sentence-transformers/)
library to generate vector embeddings and then
store and retrieve them using a vector set with `redis-py`.

## Initialize

Start by installing the preview version of `redis-py` with the following
command:

```bash
pip install redis==6.0.0b2
```

Also, install `sentence-transformers`:

```bash
pip install sentence-transformers
```

In a new Python file, import the required classes:

```python
from sentence_transformers import SentenceTransformer

import redis
import numpy as np
```

The first of these imports is the
`SentenceTransformer` class, which generates an embedding from a section of text.
Here, we create an instance of `SentenceTransformer` that uses the
[`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
model for the embeddings. This model generates vectors with 384 dimensions, regardless
of the length of the input text, but note that the input is truncated to 256
tokens (see
[Word piece tokenization](https://huggingface.co/learn/nlp-course/en/chapter6/6)
at the [Hugging Face](https://huggingface.co/) docs to learn more about the way tokens
are related to the original text).

```python
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
```
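
To confirm the embedding size, a quick check like the one below works
(a minimal sketch, assuming the `model` object created above):

```python
# encode() returns a NumPy array with 384 dimensions, however long
# the input text is.
vec = model.encode("A short test sentence.")
print(vec.shape)  # >>> (384,)
```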

## Create the data

For the example, we will use a dictionary of data that contains brief
descriptions of some famous people:

```python
peopleData = {
    "Marie Curie": {
        "born": 1867, "died": 1934,
        "description": """
        Polish-French chemist and physicist. The only person ever to win
        two Nobel prizes for two different sciences.
        """
    },
    "Linus Pauling": {
        "born": 1901, "died": 1994,
        "description": """
        American chemist and peace activist. One of only two people to win two
        Nobel prizes in different fields (chemistry and peace).
        """
    },
    "Freddie Mercury": {
        "born": 1946, "died": 1991,
        "description": """
        British musician, best known as the lead singer of the rock band
        Queen.
        """
    },
    "Marie Fredriksson": {
        "born": 1958, "died": 2019,
        "description": """
        Swedish multi-instrumentalist, mainly known as the lead singer and
        keyboardist of the band Roxette.
        """
    },
    "Paul Erdos": {
        "born": 1913, "died": 1996,
        "description": """
        Hungarian mathematician, known for his eccentric personality almost
        as much as his contributions to many different fields of mathematics.
        """
    },
    "Maryam Mirzakhani": {
        "born": 1977, "died": 2017,
        "description": """
        Iranian mathematician. The first woman ever to win the Fields medal
        for her contributions to mathematics.
        """
    },
    "Masako Natsume": {
        "born": 1957, "died": 1985,
        "description": """
        Japanese actress. She was very famous in Japan but was primarily
        known elsewhere in the world for her portrayal of Tripitaka in the
        TV series Monkey.
        """
    },
    "Chaim Topol": {
        "born": 1935, "died": 2023,
        "description": """
        Israeli actor and singer, usually credited simply as 'Topol'. He was
        best known for his many appearances as Tevye in the musical Fiddler
        on the Roof.
        """
    }
}
```

## Add the data to a vector set

The next step is to connect to Redis and add the data to a new vector set.

The code below uses the dictionary's
[`items()`](https://docs.python.org/3/library/stdtypes.html#dict.items)
view to iterate through all the key-value pairs and add corresponding
elements to a vector set called `famousPeople`.

We use the
[`encode()`](https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer.encode)
method of `SentenceTransformer` to generate the
embedding as an array of `float32` values. The `tobytes()` method converts
the array to a byte string that we pass to the
[`vadd()`]({{< relref "/commands/vadd" >}}) command to set the embedding.
Note that `vadd()` can also accept a list of `float` values to set the
vector, but the byte string format is more compact and saves a little
transmission time. If you later use
[`vemb()`]({{< relref "/commands/vemb" >}}) to retrieve the embedding,
it will return the vector as an array rather than the original byte
string (note that this is different from the behavior of byte strings in
[hash vector indexing]({{< relref "/develop/interact/search-and-query/advanced-concepts/vectors" >}})).

The call to `vadd()` also adds the `born` and `died` values from the
original dictionary as attribute data. You can access this during a query
or by using the [`vgetattr()`]({{< relref "/commands/vgetattr" >}}) method.

```py
r = redis.Redis(decode_responses=True)

for name, details in peopleData.items():
    emb = model.encode(details["description"]).astype(np.float32).tobytes()

    r.vset().vadd(
        "famousPeople",
        emb,
        name,
        attributes={
            "born": details["born"],
            "died": details["died"]
        }
    )
```
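
As a quick check, you can read back the stored data for a single element.
This is just a sketch, assuming the elements were added as shown above; the
exact return format of the attribute data may vary between client versions:

```py
# vemb() returns the stored vector as an array of floats rather than
# the original byte string.
emb_back = r.vset().vemb("famousPeople", "Marie Curie")
print(len(emb_back))  # >>> 384

# vgetattr() returns the attribute data set by vadd().
attrs = r.vset().vgetattr("famousPeople", "Marie Curie")
print(attrs)
```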

## Query the vector set

We can now query the data in the set. The basic approach is to use the
`encode()` method to generate another embedding vector for the query text.
(This is the same method we used when we added the elements to the set.) Then, we pass
the query vector to [`vsim()`]({{< relref "/commands/vsim" >}}) to return elements
of the set, ranked in order of similarity to the query.

Start with a simple query for "actors":

```py
query_value = "actors"

actors_results = r.vset().vsim(
    "famousPeople",
    model.encode(query_value).astype(np.float32).tobytes(),
)

print(f"'actors': {actors_results}")
```

This returns the following list of elements (formatted slightly for clarity):

```
'actors': ['Masako Natsume', 'Chaim Topol', 'Linus Pauling',
'Marie Fredriksson', 'Maryam Mirzakhani', 'Marie Curie',
'Freddie Mercury', 'Paul Erdos']
```

The first two people in the list are the two actors, as expected, but none of the
people from Linus Pauling onward was especially well-known for acting (and we certainly
didn't include any information about that in the short description text).
As it stands, the search attempts to rank all the elements in the set, based
on the information contained in the embedding model.
You can use the `count` parameter of `vsim()` to limit the list of elements
to just the most relevant few items:

```py
query_value = "actors"

two_actors_results = r.vset().vsim(
    "famousPeople",
    model.encode(query_value).astype(np.float32).tobytes(),
    count=2
)

print(f"'actors (2)': {two_actors_results}")
# >>> 'actors (2)': ['Masako Natsume', 'Chaim Topol']
```
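
The `vsim()` method can also report how similar each result is to the
query. The sketch below assumes that your version of `redis-py` supports
the `with_scores` parameter (corresponding to the `WITHSCORES` option of
the underlying command):

```py
# A sketch: request similarity scores alongside the element names.
# Scores range from 0 to 1, where 1 means an identical vector.
scored_results = r.vset().vsim(
    "famousPeople",
    model.encode(query_value).astype(np.float32).tobytes(),
    count=2,
    with_scores=True
)

print(scored_results)
```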

The reason for using text embeddings rather than simple text search
is that the embeddings represent semantic information. This allows a query
to find elements with a similar meaning even if the text is
different. For example, we
don't use the word "entertainer" in any of the descriptions, but
if we use it as a query, the actors and musicians are ranked highest
in the results list:

```py
query_value = "entertainer"

entertainer_results = r.vset().vsim(
    "famousPeople",
    model.encode(query_value).astype(np.float32).tobytes()
)

print(f"'entertainer': {entertainer_results}")
# >>> 'entertainer': ['Chaim Topol', 'Freddie Mercury',
# >>> 'Marie Fredriksson', 'Masako Natsume', 'Linus Pauling',
# >>> 'Paul Erdos', 'Maryam Mirzakhani', 'Marie Curie']
```

Similarly, we can use "science" as the query text, following the same
pattern as the previous queries:

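```py
query_value = "science"

science_results = r.vset().vsim(
    "famousPeople",
    model.encode(query_value).astype(np.float32).tobytes()
)

print(f"'science': {science_results}")
```

This gives the following results:
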
```
'science': ['Marie Curie', 'Linus Pauling', 'Maryam Mirzakhani',
'Paul Erdos', 'Marie Fredriksson', 'Freddie Mercury', 'Masako Natsume',
'Chaim Topol']
```

The scientists are ranked highest, but they are then followed by the
mathematicians. This seems reasonable given the connection between mathematics
and science.

You can also use
[filter expressions]({{< relref "/develop/data-types/vector-sets/filtered-search" >}})
with `vsim()` to restrict the search further. For example,
repeat the "science" query, but this time limit the results to people
who died before the year 2000:

```py
query_value = "science"

science2000_results = r.vset().vsim(
    "famousPeople",
    model.encode(query_value).astype(np.float32).tobytes(),
    filter=".died < 2000"
)

print(f"'science2000': {science2000_results}")
# >>> 'science2000': ['Marie Curie', 'Linus Pauling',
# >>> 'Paul Erdos', 'Freddie Mercury', 'Masako Natsume']
```

Note that the boolean filter expression is applied to items in the list
before the vector distance calculation is performed. Items that don't
pass the filter test are removed from the results completely, rather
than just reduced in rank. This can help to improve the performance of the
search because there is no need to calculate the vector distance for
elements that have already been filtered out of the search.
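
Filter expressions can also combine several conditions. The sketch below
assumes the filter syntax supports the logical `and` operator, as described
in the filtered search docs linked above:

```py
# A sketch of a compound filter: only people born from 1940 onward who
# also died before 2000 (for this data, Freddie Mercury and
# Masako Natsume) can appear in the results.
query_value = "science"

combined_results = r.vset().vsim(
    "famousPeople",
    model.encode(query_value).astype(np.float32).tobytes(),
    filter=".born >= 1940 and .died < 2000"
)

print(f"'combined': {combined_results}")
```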

## More information

See the [vector sets]({{< relref "/develop/data-types/vector-sets" >}})
docs for more information and code examples. See the
[Redis for AI]({{< relref "/develop/ai" >}}) section for more details
about text embeddings and other AI techniques you can use with Redis.

You may also be interested in
[vector search]({{< relref "/develop/clients/redis-py/vecsearch" >}}).
This is a feature of the
[Redis query engine]({{< relref "/develop/interact/search-and-query" >}})
that lets you retrieve
[JSON]({{< relref "/develop/data-types/json" >}}) and
[hash]({{< relref "/develop/data-types/hashes" >}}) documents based on
vector data stored in their fields.