@@ -31,6 +31,9 @@ of their meaning.
 In the example below, we use the [HuggingFace](https://huggingface.co/) model
 [`all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
 to generate the vector embeddings to store and index with Redis Query Engine.
+The code is first demonstrated for hash documents with a
+separate section to explain the
+[differences with JSON documents](#differences-with-json-documents).
 
 ## Initialize
 
@@ -75,6 +78,7 @@ import java.nio.ByteBuffer;
 import java.nio.ByteOrder;
 import java.util.Map;
 import java.util.List;
+import org.json.JSONObject;
 
 // Tokenizer to generate the vector embeddings.
 import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
@@ -185,8 +189,9 @@ as shown below to create the embedding that represents the `content` field.
 The `getIds()` method that follows `encode()` obtains the vector
 of `long` values which we then convert to a `float` array stored as a `byte`
 string using our helper method. Use the `byte` string representation when you are
-indexing hash objects (as we are here), but use the default list of `float` for
-JSON objects. Note that when we set the `embedding` field, we must use an overload
+indexing hash objects (as we are here), but use an array of `float` for
+JSON objects (see [Differences with JSON documents](#differences-with-json-documents)
+below). Note that when we set the `embedding` field, we must use an overload
 of `hset()` that requires `byte` arrays for each of the key, the field name, and
 the value, which is why we include the `getBytes()` calls on the strings.
 
@@ -281,6 +286,147 @@ For this model, the text *"That is a happy dog"*
 is the result judged to be most similar in meaning to the query text
 *"That is a happy person"*.
 
+## Differences with JSON documents
+
+Indexing JSON documents is similar to hash indexing, but there are some
+important differences. JSON allows much richer data modelling with nested fields, so
+you must supply a [path]({{< relref "/develop/data-types/json/path" >}}) in the schema
+to identify each field you want to index. However, you can declare a short alias for each
+of these paths (using the `as()` option) to avoid typing it in full for
+every query. Also, you must specify `IndexDataType.JSON` when you create the index.
+
+The code below shows these differences, but the index is otherwise very similar to
+the one created previously for hashes:
+
+```java
+SchemaField[] jsonSchema = {
+    TextField.of("$.content").as("content"),
+    TagField.of("$.genre").as("genre"),
+    VectorField.builder()
+        .fieldName("$.embedding").as("embedding")
+        .algorithm(VectorAlgorithm.HNSW)
+        .attributes(
+            Map.of(
+                "TYPE", "FLOAT32",
+                "DIM", 768,
+                "DISTANCE_METRIC", "L2"
+            )
+        )
+        .build()
+};
+
+jedis.ftCreate("vector_json_idx",
+    FTCreateParams.createParams()
+        .addPrefix("jdoc:")
+        .on(IndexDataType.JSON),
+    jsonSchema
+);
+```
+
+An important difference with JSON indexing is that the vectors are
+specified using arrays of `float` instead of binary strings. This requires
+a modified version of the `longsToFloatsByteString()` method
+used previously:
+
+```java
+public static float[] longArrayToFloatArray(long[] input) {
+    float[] floats = new float[input.length];
+    for (int i = 0; i < input.length; i++) {
+        floats[i] = input[i];
+    }
+    return floats;
+}
+```
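
The two vector representations mentioned above (an array of `float` for JSON documents, a binary string for hash documents and query parameters) can be compared side by side. The following is a minimal, self-contained sketch rather than the helper used in this page: the class name `EmbeddingFormats` and both method names are hypothetical, the token IDs are made up, and the little-endian byte order is an assumption that the text shown here does not confirm.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical class name, used only for this illustration.
final class EmbeddingFormats {

    // JSON form: one float per token ID, suitable for storing as a JSON array.
    static float[] toFloatArray(long[] ids) {
        float[] floats = new float[ids.length];
        for (int i = 0; i < ids.length; i++) {
            floats[i] = ids[i];
        }
        return floats;
    }

    // Hash/query form: the same floats packed as consecutive 4-byte values.
    // Little-endian order is an assumption, not confirmed by the source.
    static byte[] toFloatByteString(long[] ids) {
        ByteBuffer buffer = ByteBuffer
            .allocate(Float.BYTES * ids.length)
            .order(ByteOrder.LITTLE_ENDIAN);
        for (long id : ids) {
            buffer.putFloat((float) id);
        }
        return buffer.array();
    }

    public static void main(String[] args) {
        long[] ids = {101L, 2023L, 102L}; // made-up token IDs
        float[] asArray = toFloatArray(ids);
        byte[] asBytes = toFloatByteString(ids);
        // Each float occupies 4 bytes in the binary form.
        System.out.println(asArray.length + " floats, " + asBytes.length + " bytes");
    }
}
```

Both forms carry the same numbers; only the encoding differs, which is why a JSON index can store the array form while a query can still pass the binary form.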
+
+Use [`jsonSet()`]({{< relref "/commands/json.set" >}}) to add the data
+instead of [`hset()`]({{< relref "/commands/hset" >}}). Supply the data
+as instances of `JSONObject` instead of the `Map` you would use for
+hash objects.
+
+```java
+String jSentence1 = "That is a very happy person";
+
+JSONObject jdoc1 = new JSONObject()
+    .put("content", jSentence1)
+    .put("genre", "persons")
+    .put(
+        "embedding",
+        longArrayToFloatArray(
+            sentenceTokenizer.encode(jSentence1).getIds()
+        )
+    );
+
+jedis.jsonSet("jdoc:1", Path2.ROOT_PATH, jdoc1);
+
+String jSentence2 = "That is a happy dog";
+
+JSONObject jdoc2 = new JSONObject()
+    .put("content", jSentence2)
+    .put("genre", "pets")
+    .put(
+        "embedding",
+        longArrayToFloatArray(
+            sentenceTokenizer.encode(jSentence2).getIds()
+        )
+    );
+
+jedis.jsonSet("jdoc:2", Path2.ROOT_PATH, jdoc2);
+
+String jSentence3 = "Today is a sunny day";
+
+JSONObject jdoc3 = new JSONObject()
+    .put("content", jSentence3)
+    .put("genre", "weather")
+    .put(
+        "embedding",
+        longArrayToFloatArray(
+            sentenceTokenizer.encode(jSentence3).getIds()
+        )
+    );
+
+jedis.jsonSet("jdoc:3", Path2.ROOT_PATH, jdoc3);
+```
+
+The query is almost identical to the one for the hash documents. This
+demonstrates how the right choice of aliases for the JSON paths can
+save you having to write complex queries. An important thing to notice
+is that the vector parameter for the query is still specified as a
+binary string (using the `longsToFloatsByteString()` method), even though
+the data for the `embedding` field of the JSON was specified as an array.
+
+```java
+String jSentence = "That is a happy person";
+
+int jK = 3;
+Query jq = new Query("*=>[KNN $K @embedding $BLOB AS distance]")
+    .returnFields("content", "distance")
+    .addParam("K", jK)
+    .addParam(
+        "BLOB",
+        longsToFloatsByteString(
+            sentenceTokenizer.encode(jSentence).getIds()
+        )
+    )
+    .setSortBy("distance", true)
+    .dialect(2);
+
+// Execute the query
+List<Document> jDocs = jedis
+    .ftSearch("vector_json_idx", jq)
+    .getDocuments();
+
+```
+
+Apart from the `jdoc:` prefixes for the keys, the result from the JSON
+query is the same as for the hash query:
+
+```
+Results:
+ID: jdoc:2, Distance: 1411344, Content: That is a happy dog
+ID: jdoc:1, Distance: 9301635, Content: That is a very happy person
+ID: jdoc:3, Distance: 67178800, Content: Today is a sunny day
+```
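
The `Distance` values above come from the `L2` metric declared in the schema, where a smaller value means a closer match. As a rough illustration of how such a score can be computed, the sketch below sums the squared differences between two vectors. It assumes the reported score is the squared Euclidean distance, which the text here does not state; the class name `L2DistanceSketch`, the method name, and the sample vectors are all hypothetical.

```java
// Hypothetical class name, used only for this illustration.
final class L2DistanceSketch {

    // Sum of squared component differences between two equal-length vectors.
    // Assumption: the score is left squared (no square root is taken).
    static double squaredL2(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = (double) a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] query = {1f, 2f, 3f};     // made-up vectors
        float[] candidate = {1f, 4f, 6f};
        // (1-1)^2 + (2-4)^2 + (3-6)^2 = 0 + 4 + 9 = 13
        System.out.println(squaredL2(query, candidate)); // prints 13.0
    }
}
```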
+
 ## Learn more
 
 See