@@ -34,9 +34,6 @@ to generate the vector embeddings to store and index with Redis Query Engine.
 
 ## Initialize
 
-Install [`jedis`]({{< relref "/develop/clients/jedis" >}}) if you
-have not already done so.
-
 If you are using [Maven](https://maven.apache.org/), add the following
 dependencies to your `pom.xml` file:
 
@@ -83,6 +80,33 @@ import java.util.List;
 import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
 ```
 
+## Define a helper method
+
+Our embedding model represents the vectors as an array of `long` integer values,
+but Redis Query Engine expects the vector components to be `float` values.
+Also, when you store vectors in a hash object, you must encode the vector
+array as a `byte` string. To simplify this situation, we declare a helper
+method `longsToFloatsByteString()` that takes the `long` array that the
+embedding model returns, converts it to an array of `float` values, and
+then encodes the `float` array as a `byte` string:
+
+```java
+public static byte[] longsToFloatsByteString(long[] input) {
+    float[] floats = new float[input.length];
+    for (int i = 0; i < input.length; i++) {
+        floats[i] = input[i];
+    }
+
+    byte[] bytes = new byte[Float.BYTES * floats.length];
+    ByteBuffer
+        .wrap(bytes)
+        .order(ByteOrder.LITTLE_ENDIAN)
+        .asFloatBuffer()
+        .put(floats);
+    return bytes;
+}
+```
+
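As a quick illustration of what this helper produces (a hypothetical snippet, not part of the original example), each input `long` becomes one 4-byte little-endian `float`, so the returned `byte` string is always `Float.BYTES * input.length` bytes long:

```java
// Hypothetical usage of longsToFloatsByteString(); the token IDs are made up.
long[] ids = {101L, 2023L, 2003L};
byte[] blob = longsToFloatsByteString(ids);
System.out.println(blob.length); // 12 (3 floats * 4 bytes each)
```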
 ## Create a tokenizer instance
 
 We will use the
@@ -136,8 +160,7 @@ SchemaField[] schema = {
             Map.of(
                 "TYPE", "FLOAT32",
                 "DIM", 768,
-                "DISTANCE_METRIC", "L2",
-                "INITIAL_CAP", 3
+                "DISTANCE_METRIC", "L2"
             )
         )
         .build()
@@ -151,29 +174,6 @@ jedis.ftCreate("vector_idx",
 );
 ```
 
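One practical note, offered as an assumption rather than something stated on the original page: re-running the example fails at `ftCreate()` if `vector_idx` already exists. A minimal sketch of one way to handle this is to drop any previous index first (`JedisDataException` is in `redis.clients.jedis.exceptions`):

```java
// Sketch: drop a leftover "vector_idx" index before recreating it.
// Dropping the index does not delete the doc:* hashes themselves.
try {
    jedis.ftDropIndex("vector_idx");
} catch (JedisDataException e) {
    // No previous index; safe to ignore.
}
```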
-## Define a helper method
-
-The embedding model represents the vectors as an array of `long` integer values,
-but Redis Query Engine expects the vector components to be `float` values.
-Also, when you store vectors in a hash object, you must encode the vector
-array as a `byte` string. To simplify this situation, we declare a helper
-method `longsToFloatsByteString()` that takes the `long` array that the
-embedding model returns, converts it to an array of `float` values, and
-then encodes the `float` array as a `byte` string:
-
-```java
-public static byte[] longsToFloatsByteString(long[] input) {
-    float[] floats = new float[input.length];
-    for (int i = 0; i < input.length; i++) {
-        floats[i] = input[i];
-    }
-
-    byte[] bytes = new byte[Float.BYTES * floats.length];
-    ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().put(floats);
-    return bytes;
-}
-```
-
 ## Add data
 
 You can now supply the data objects, which will be indexed automatically
@@ -182,31 +182,33 @@ you use the `doc:` prefix specified in the index definition.
 
 Use the `encode()` method of the `sentenceTokenizer` object
 as shown below to create the embedding that represents the `content` field.
-The `getIds()` method that follows the `encode()` call obtains the vector
+The `getIds()` method that follows `encode()` obtains the vector
 of `long` values which we then convert to a `float` array stored as a `byte`
-string. Use the `byte` string representation when you are indexing hash
-objects (as we are here), but use the default list of `float` for JSON objects.
+string using our helper method. Use the `byte` string representation when you are
+indexing hash objects (as we are here), but use the default list of `float` for
+JSON objects. Note that when we set the `embedding` field, we must use an overload
+of `hset()` that requires `byte` arrays for each of the key, the field name, and
+the value, which is why we include the `getBytes()` calls on the strings.
 
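The hash examples below use the `byte` string approach. For the JSON alternative mentioned above, here is a rough sketch based on my own assumptions rather than the original page: it uses a hypothetical `longsToFloats()` helper that skips the `byte` encoding, assumes a separate index defined over JSON documents with a `jdoc:` prefix (not shown), and stores the document with the Jedis JSON API (`jsonSetWithEscape()` with `Path2` from `redis.clients.jedis.json`):

```java
// Hypothetical helper: convert token IDs to a plain float array (no byte encoding).
public static float[] longsToFloats(long[] input) {
    float[] floats = new float[input.length];
    for (int i = 0; i < input.length; i++) {
        floats[i] = input[i];
    }
    return floats;
}

// Sketch: store the same fields as a JSON document; the embedding becomes a
// JSON array of numbers instead of a byte string. Assumes a JSON index with a
// "jdoc:" prefix has been created separately.
String jsonSentence = "That is a very happy person";
jedis.jsonSetWithEscape("jdoc:1", Path2.ROOT_PATH, Map.of(
    "content", jsonSentence,
    "genre", "persons",
    "embedding", longsToFloats(sentenceTokenizer.encode(jsonSentence).getIds())
));
```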
 ```java
 String sentence1 = "That is a very happy person";
-jedis.hset("doc:1", Map.of( "content", sentence1, "genre", "persons"));
+jedis.hset("doc:1", Map.of("content", sentence1, "genre", "persons"));
 jedis.hset(
     "doc:1".getBytes(),
     "embedding".getBytes(),
     longsToFloatsByteString(sentenceTokenizer.encode(sentence1).getIds())
 );
 
 String sentence2 = "That is a happy dog";
-jedis.hset("doc:2", Map.of( "content", sentence2, "genre", "pets"));
+jedis.hset("doc:2", Map.of("content", sentence2, "genre", "pets"));
 jedis.hset(
     "doc:2".getBytes(),
     "embedding".getBytes(),
     longsToFloatsByteString(sentenceTokenizer.encode(sentence2).getIds())
 );
 
 String sentence3 = "Today is a sunny day";
-Map<String, String> doc3 = Map.of( "content", sentence3, "genre", "weather");
-jedis.hset("doc:3", doc3);
+jedis.hset("doc:3", Map.of("content", sentence3, "genre", "weather"));
 jedis.hset(
     "doc:3".getBytes(),
     "embedding".getBytes(),
@@ -218,53 +220,65 @@ jedis.hset(
 
 After you have created the index and added the data, you are ready to run a query.
 To do this, you must create another embedding vector from your chosen query
-text. Redis calculates the similarity between the query vector and each
-embedding vector in the index as it runs the query. It then ranks the
-results in order of this numeric similarity value.
+text. Redis calculates the vector distance between the query vector and each
+embedding vector in the index as it runs the query. We can request that the
+results be sorted in ascending order of distance.
 
 The code below creates the query embedding using the `encode()` method, as with
 the indexing, and passes it as a parameter when the query executes (see
 [Vector search]({{< relref "/develop/interact/search-and-query/query/vector-search" >}})
 for more information about using query parameters with embeddings).
+The query is a
+[K nearest neighbors (KNN)]({{< relref "/develop/interact/search-and-query/advanced-concepts/vectors#knn-vector-search" >}})
+search that sorts the results in order of vector distance from the query vector.
 
 ```java
 String sentence = "That is a happy person";
 
 int K = 3;
-Query q = new Query("*=>[KNN $K @embedding $BLOB AS score]").
-    returnFields("content", "score").
-    addParam("K", K).
-    addParam(
-        "BLOB",
-        longsToFloatsByteString(
-            sentenceTokenizer.encode(sentence).getIds()
-        )
-    ).
-    dialect(2);
+Query q = new Query("*=>[KNN $K @embedding $BLOB AS distance]")
+    .returnFields("content", "distance")
+    .addParam("K", K)
+    .addParam(
+        "BLOB",
+        longsToFloatsByteString(
+            sentenceTokenizer.encode(sentence).getIds()
+        )
+    )
+    .setSortBy("distance", true)
+    .dialect(2);
 
 List<Document> docs = jedis.ftSearch("vector_idx", q).getDocuments();
 for (Document doc: docs) {
 
-    System.out.println(doc);
+    System.out.println(
+        String.format(
+            "ID: %s, Distance: %s, Content: %s",
+            doc.getId(),
+            doc.get("distance"),
+            doc.get("content")
+        )
+    );
 }
 ```
 
-The code is now ready to run, but note that it may take a while to complete when
+Assuming you have added the code from the steps above to your source file,
+it is now ready to run, but note that it may take a while to complete when
 you run it for the first time (which happens because the tokenizer must download the
 `all-mpnet-base-v2` model data before it can
-generate the embeddings). When you run the code, it outputs the following result
-objects:
+generate the embeddings). When you run the code, it outputs the following result text:
 
 ```
-id:doc:1, score: 1.0, properties:[score=9301635, content=That is a very happy person]
-id:doc:2, score: 1.0, properties:[score=1411344, content=That is a happy dog]
-id:doc:3, score: 1.0, properties:[score=67178800, content=Today is a sunny day]
+Results:
+ID: doc:2, Distance: 1411344, Content: That is a happy dog
+ID: doc:1, Distance: 9301635, Content: That is a very happy person
+ID: doc:3, Distance: 67178800, Content: Today is a sunny day
 ```
 
-Note that the results are ordered according to the value of the `vector_distance`
+Note that the results are ordered according to the value of the `distance`
 field, with the lowest distance indicating the greatest similarity to the query.
-As you would expect, the result for `doc:0` with the content text *"That is a very happy person"*
-is the result that is most similar in meaning to the query text
+For this model, the text *"That is a happy dog"*
+is the result judged to be most similar in meaning to the query text
 *"That is a happy person"*.
 
 ## Learn more