import java.util.List;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
```

## Create a tokenizer instance

We will use the
[`all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
tokenizer to generate the embeddings. The vectors that represent the
embeddings have 768 components, regardless of the length of the input
text.

```java
HuggingFaceTokenizer sentenceTokenizer = HuggingFaceTokenizer.newInstance(
    "sentence-transformers/all-mpnet-base-v2",
    Map.of("maxLength", "768", "modelMaxLength", "768")
);
```

## Create the index

Connect to Redis and delete any index previously created with the
name `vector_idx`. (The `ftDropIndex()` call throws an exception if
the index doesn't already exist, which is why you need the
`try...catch` block.)

```java
UnifiedJedis jedis = new UnifiedJedis("redis://localhost:6379");

try {
    jedis.ftDropIndex("vector_idx");
} catch (JedisDataException e) {
    // The index doesn't exist yet, so there is nothing to delete.
}
```

Next, we create the index.
The schema in the example below includes three fields: the text content to index, a
[tag]({{< relref "/develop/interact/search-and-query/advanced-concepts/tags" >}})
field to represent the "genre" of the text, and the embedding vector generated from
the original text content. The `embedding` field specifies
[HNSW]({{< relref "/develop/interact/search-and-query/advanced-concepts/vectors#hnsw-index" >}})
indexing, the
[L2]({{< relref "/develop/interact/search-and-query/advanced-concepts/vectors#distance-metrics" >}})
vector distance metric, `Float32` values to represent the vector's components,
and 768 dimensions, as required by the `all-mpnet-base-v2` embedding model.

The `FTCreateParams` object specifies hash objects for storage and a
prefix `doc:` that identifies the hash objects we want to index.

```java
SchemaField[] schema = {
    TextField.of("content"),
    TagField.of("genre"),
    VectorField.builder()
        .fieldName("embedding")
        .algorithm(VectorAlgorithm.HNSW)
        .attributes(
            Map.of(
                "TYPE", "FLOAT32",
                "DIM", 768,
                "DISTANCE_METRIC", "L2",
                "INITIAL_CAP", 3
            )
        )
        .build()
};

jedis.ftCreate("vector_idx",
    FTCreateParams.createParams()
        .addPrefix("doc:")
        .on(IndexDataType.HASH),
    schema
);
```

## Define some helper methods

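The indexing code below passes the tokenizer's `long[]` token IDs through a
`longArrayToByteArray()` helper before storing them in the `embedding` hash
field. The helper's body isn't shown in this section; the code below is a
minimal sketch, assuming each ID is converted to a `float` and written in
little-endian byte order to match the index's `FLOAT32` vector type:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class VectorHelpers {
    // Serialize an array of token IDs as little-endian FLOAT32 values,
    // the byte layout expected by the index's "TYPE FLOAT32" setting.
    // (Sketch only: the long-to-float cast is an assumption.)
    public static byte[] longArrayToByteArray(long[] input) {
        ByteBuffer buf = ByteBuffer.allocate(input.length * Float.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (long id : input) {
            buf.putFloat((float) id);
        }
        return buf.array();
    }
}
```

With a helper of this shape, `longArrayToByteArray(sentenceTokenizer.encode(sentence1).getIds())`
yields a binary string suitable for storing in a hash field.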
## Add data

You can now supply the data objects, which will be indexed automatically
when you add them with the `doc:` prefix specified in the index definition.
The embedding can be stored either as a binary string or as a plain list of
`float` values. Use the binary string representation when you are indexing
hash objects (as we are here), but use the list of `float` values for
JSON objects.

```java
String sentence1 = "That is a very happy person";
jedis.hset("doc:1", Map.of("content", sentence1, "genre", "persons"));
jedis.hset(
    "doc:1".getBytes(),
    "embedding".getBytes(),
    longArrayToByteArray(sentenceTokenizer.encode(sentence1).getIds())
);

String sentence2 = "That is a happy dog";
jedis.hset("doc:2", Map.of("content", sentence2, "genre", "pets"));
jedis.hset(
    "doc:2".getBytes(),
    "embedding".getBytes(),
    longArrayToByteArray(sentenceTokenizer.encode(sentence2).getIds())
);

String sentence3 = "Today is a sunny day";
jedis.hset("doc:3", Map.of("content", sentence3, "genre", "weather"));
jedis.hset(
    "doc:3".getBytes(),
    "embedding".getBytes(),
    longArrayToByteArray(sentenceTokenizer.encode(sentence3).getIds())
);
```

## Run a query