docs/tutorials/semantic_search/asymmetric_embedding_model.md
Lines changed: 280 additions & 7 deletions
@@ -1,6 +1,9 @@
1
-
# Tutorial: Generating Embeddings Using a Local Asymmetric Embedding Model in OpenSearch
1
+
# Tutorial: Running Asymmetric Semantic Search within OpenSearch
2
2
3
-
This tutorial demonstrates how to generate text embeddings using an asymmetric embedding model in OpenSearch, implemented within a Docker container. The example model used in this tutorial is the multilingual `intfloat/multilingual-e5-small` model from Hugging Face. You will learn how to prepare the model, register it in OpenSearch, and run inference to generate embeddings.
3
+
This tutorial demonstrates how to generate text embeddings using an asymmetric embedding model in OpenSearch; these embeddings are then used
4
+
to run semantic search. The setup runs within a Docker container, and the example model used in this tutorial is the multilingual
5
+
`intfloat/multilingual-e5-small` model from Hugging Face.
6
+
You will learn how to prepare the model, register it in OpenSearch, and run inference to generate embeddings.
4
7
5
8
> **Note**: Make sure to replace all placeholders (e.g., `your_`) with your specific values.
6
9
@@ -198,7 +201,7 @@ POST /_plugins/_ml/_predict/text_embedding/your_model_id
198
201
199
202
The response will include a sentence embedding of size 384:
200
203
201
-
```json
204
+
```
202
205
{
203
206
"inference_results": [
204
207
{
@@ -232,7 +235,7 @@ POST /_plugins/_ml/_predict/text_embedding/your_model_id
232
235
233
236
The response will look like this:
234
237
235
-
```json
238
+
```
236
239
{
237
240
"inference_results": [
238
241
{
@@ -251,11 +254,281 @@ The response will look like this:
251
254
252
255
---
253
256
254
-
## Next Steps
257
+
# Applying Semantic Search Using an ML Inference Processor
258
+
259
+
In this section, you will apply semantic search to facts about New York City. First, you will create an ingest pipeline
260
+
using the ML inference processor to create embeddings on ingestion. Then you will create a search pipeline to run a search using
POST /_ingest/pipeline/asymmetric_embedding_ingest_pipeline/_simulate
332
+
{
333
+
"docs": [
334
+
{
335
+
"_index": "my-index",
336
+
"_id": "1",
337
+
"_source": {
338
+
"title": "Central Park",
339
+
"description": "A large public park in the heart of New York City, offering a wide range of recreational activities."
340
+
}
341
+
}
342
+
]
343
+
}
344
+
```
345
+
Response:
346
+
```
347
+
{
348
+
"docs": [
349
+
{
350
+
"doc": {
351
+
"_index": "my-index",
352
+
"_id": "1",
353
+
"_source": {
354
+
"description": "A large public park in the heart of New York City, offering a wide range of recreational activities.",
355
+
"fact_embedding": [
356
+
[
357
+
0.06344555,
358
+
0.30067796,
359
+
...
360
+
0.014804064,
361
+
-0.022822019
362
+
]
363
+
],
364
+
"title": "Central Park",
365
+
"embedding_size": [
366
+
384.0
367
+
]
368
+
},
369
+
"_ingest": {
370
+
"timestamp": "2024-12-16T20:59:07.152169Z"
371
+
}
372
+
}
373
+
}
374
+
]
375
+
}
376
+
```
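
The simulated response above can be sanity-checked before ingesting real data. The following is a minimal sketch in Python, assuming the response has already been parsed into a dictionary; the helper name `embedding_sizes` is hypothetical, while the `fact_embedding` field and the size of 384 come from the sample response.

```python
def embedding_sizes(response: dict, expected_dim: int = 384) -> list[int]:
    """Return the embedding length of each simulated document.

    Hypothetical helper: raises ValueError when a document has no
    embedding or when the vector length differs from expected_dim.
    """
    sizes = []
    for entry in response["docs"]:
        doc = entry["doc"]
        # In the sample response the vector is nested one level deep: [[...]].
        vectors = doc["_source"].get("fact_embedding")
        if not vectors or not vectors[0]:
            raise ValueError(f"document {doc.get('_id')} has no embedding")
        dim = len(vectors[0])
        if dim != expected_dim:
            raise ValueError(
                f"document {doc.get('_id')} has dimension {dim}, expected {expected_dim}"
            )
        sizes.append(dim)
    return sizes
```

Calling `embedding_sizes(parsed_response)` on a well-formed simulation result returns one entry per document, which makes a mismatch between the model output and the index mapping easy to spot early.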
377
+
378
+
### 2.4 Test ingest data
379
+
Perform a bulk ingestion. This triggers the ingest pipeline, which generates an embedding for each document.
380
+
```
381
+
POST /_bulk
382
+
{ "index": { "_index": "nyc_facts" } }
383
+
{ "title": "Central Park", "description": "A large public park in the heart of New York City, offering a wide range of recreational activities." }
384
+
{ "index": { "_index": "nyc_facts" } }
385
+
{ "title": "Empire State Building", "description": "An iconic skyscraper in New York City offering breathtaking views from its observation deck." }
386
+
{ "index": { "_index": "nyc_facts" } }
387
+
{ "title": "Statue of Liberty", "description": "A colossal neoclassical sculpture on Liberty Island, symbolizing freedom and democracy in the United States." }
388
+
{ "index": { "_index": "nyc_facts" } }
389
+
{ "title": "Brooklyn Bridge", "description": "A historic suspension bridge connecting Manhattan and Brooklyn, offering pedestrian walkways with great views." }
390
+
{ "index": { "_index": "nyc_facts" } }
391
+
{ "title": "Times Square", "description": "A bustling commercial and entertainment hub in Manhattan, known for its neon lights and Broadway theaters." }
392
+
{ "index": { "_index": "nyc_facts" } }
393
+
{ "title": "Yankee Stadium", "description": "Home to the New York Yankees, this baseball stadium is a historic landmark in the Bronx." }
394
+
{ "index": { "_index": "nyc_facts" } }
395
+
{ "title": "The Bronx Zoo", "description": "One of the largest zoos in the world, located in the Bronx, featuring diverse animal exhibits and conservation efforts." }
396
+
{ "index": { "_index": "nyc_facts" } }
397
+
{ "title": "New York Botanical Garden", "description": "A large botanical garden in the Bronx, known for its diverse plant collections and stunning landscapes." }
398
+
{ "index": { "_index": "nyc_facts" } }
399
+
{ "title": "Flushing Meadows-Corona Park", "description": "A major park in Queens, home to the USTA Billie Jean King National Tennis Center and the Unisphere." }
400
+
{ "index": { "_index": "nyc_facts" } }
401
+
{ "title": "Citi Field", "description": "The home stadium of the New York Mets, located in Queens, known for its modern design and fan-friendly atmosphere." }
402
+
{ "index": { "_index": "nyc_facts" } }
403
+
{ "title": "Rockefeller Center", "description": "A famous complex of commercial buildings in Manhattan, home to the NBC studios and the annual ice skating rink." }
404
+
{ "index": { "_index": "nyc_facts" } }
405
+
{ "title": "Queens Botanical Garden", "description": "A peaceful, beautiful botanical garden located in Flushing, Queens, featuring seasonal displays and plant collections." }
406
+
{ "index": { "_index": "nyc_facts" } }
407
+
{ "title": "Arthur Ashe Stadium", "description": "The largest tennis stadium in the world, located in Flushing Meadows-Corona Park, Queens, hosting the U.S. Open." }
408
+
{ "index": { "_index": "nyc_facts" } }
409
+
{ "title": "Wave Hill", "description": "A public garden and cultural center in the Bronx, offering stunning views of the Hudson River and a variety of nature programs." }
410
+
{ "index": { "_index": "nyc_facts" } }
411
+
{ "title": "Louis Armstrong House", "description": "The former home of jazz legend Louis Armstrong, located in Corona, Queens, now a museum celebrating his life and music." }
412
+
413
+
```
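
If the documents live in application code rather than in a console, the bulk body can be generated programmatically. The following is a minimal sketch in Python, not part of the tutorial itself; the function name `build_bulk_body` is hypothetical, and sending the body to the cluster is left to whichever HTTP client you use.

```python
import json


def build_bulk_body(index_name: str, docs: list[dict]) -> str:
    """Build a newline-delimited JSON body for the _bulk endpoint.

    Each document is preceded by an action line naming the target index,
    mirroring the request shown above.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk API expects the final line to end with a newline character.
    return "\n".join(lines) + "\n"


facts = [
    {
        "title": "Central Park",
        "description": "A large public park in the heart of New York City, "
        "offering a wide range of recreational activities.",
    },
]
body = build_bulk_body("nyc_facts", facts)
```

The resulting string is what you would send as the body of `POST /_bulk`.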
414
+
415
+
## 3. Run Semantic Search
416
+
417
+
### 3.1 Create the Search Pipeline
418
+
Create the search pipeline, which converts your query into an embedding and runs a k-NN search on the index to return the best-matching documents.
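
To make the nearest-neighbor step concrete, here is a brute-force sketch in Python. It only illustrates the idea of ranking stored vectors by similarity to a query vector; it is not how the OpenSearch k-NN plugin is implemented. The use of cosine similarity, the function names, and the two-dimensional toy vectors are all assumptions made for illustration (the real embeddings in this tutorial have 384 dimensions).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_search(query_vec: list[float], indexed: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the titles of the k stored vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), title) for title, vec in indexed.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [title for _, title in scored[:k]]


# Toy vectors, not real model output.
toy_index = {
    "Central Park": [1.0, 0.0],
    "Citi Field": [0.0, 1.0],
    "Wave Hill": [0.9, 0.1],
}
nearest = knn_search([1.0, 0.0], toy_index, k=2)
```

In the search pipeline, the query vector comes from the embedding model and the ranking is performed by the index rather than by application code.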
255
419
256
-
- Create an ingest pipeline for processing documents using asymmetric embeddings.
257
-
- Run a query using KNN (k-nearest neighbors) to search with your asymmetric model.
420
+
```
421
+
PUT /_search/pipeline/asymmetric_embedding_search_pipeline
422
+
{
423
+
"description": "ingest passage text and generate a embedding using an asymmetric model",