Commit d48f75c

committed
adds Asymmetric semantic search
Signed-off-by: Brian Flores <[email protected]>
1 parent c792403 commit d48f75c

1 file changed

+280
-7
lines changed

docs/tutorials/semantic_search/asymmetric_embedding_model.md

Lines changed: 280 additions & 7 deletions
@@ -1,6 +1,9 @@
1-
# Tutorial: Generating Embeddings Using a Local Asymmetric Embedding Model in OpenSearch
1+
# Tutorial: Running Asymmetric Semantic Search within OpenSearch
22

3-
This tutorial demonstrates how to generate text embeddings using an asymmetric embedding model in OpenSearch, implemented within a Docker container. The example model used in this tutorial is the multilingual `intfloat/multilingual-e5-small` model from Hugging Face. You will learn how to prepare the model, register it in OpenSearch, and run inference to generate embeddings.
3+
This tutorial demonstrates how to generate text embeddings using an asymmetric embedding model in OpenSearch, which will be used
4+
to run semantic search. The setup runs within a Docker container, and the example model used in this tutorial is the multilingual
5+
`intfloat/multilingual-e5-small` model from Hugging Face.
6+
You will learn how to prepare the model, register it in OpenSearch, and run inference to generate embeddings.
47

58
> **Note**: Make sure to replace all placeholders (e.g., `your_`) with your specific values.
69
@@ -198,7 +201,7 @@ POST /_plugins/_ml/_predict/text_embedding/your_model_id
198201

199202
The response will include a sentence embedding of size 384:
200203

201-
```json
204+
```
202205
{
203206
"inference_results": [
204207
{
@@ -232,7 +235,7 @@ POST /_plugins/_ml/_predict/text_embedding/your_model_id
232235

233236
The response will look like this:
234237

235-
```json
238+
```
236239
{
237240
"inference_results": [
238241
{
@@ -251,11 +254,281 @@ The response will look like this:
251254

252255
---
253256

254-
## Next Steps
257+
# Applying Semantic Search using an ML Inference processor
258+
259+
In this section, you will apply semantic search to facts about New York City. First, you will create an ingest pipeline
260+
using the ML inference processor to create embeddings on ingestion. Then you will create a search pipeline to run a search using
261+
the same asymmetric embedding model.
262+
263+
264+
## 2. Create an ingest pipeline
265+
266+
### 2.1 Create the test KNN index
267+
```
268+
PUT nyc_facts
269+
{
270+
"settings": {
271+
"index": {
272+
"default_pipeline": "asymmetric_embedding_ingest_pipeline",
273+
"knn": true,
274+
"knn.algo_param.ef_search": 100
275+
}
276+
},
277+
"mappings": {
278+
"properties": {
279+
"fact_embedding": {
280+
"type": "knn_vector",
281+
"dimension": 384,
282+
"method": {
283+
"name": "hnsw",
284+
"space_type": "l2",
285+
"engine": "nmslib",
286+
"parameters": {
287+
"ef_construction": 128,
288+
"m": 24
289+
}
290+
}
291+
}
292+
}
293+
}
294+
}
295+
```
296+
297+
### 2.2 Create an ingest pipeline
298+
299+
```
300+
PUT _ingest/pipeline/asymmetric_embedding_ingest_pipeline
301+
{
302+
"description": "ingest passage text and generate an embedding using an asymmetric model",
303+
"processors": [
304+
{
305+
"ml_inference": {
306+
307+
"model_input": "{\"text_docs\":[\"${input_map.text_docs}\"],\"target_response\":[\"sentence_embedding\"],\"parameters\":{\"content_type\":\"query\"}}",
308+
"function_name": "text_embedding",
309+
"model_id": "{{ _.model_id }}",
310+
"input_map": [
311+
{
312+
"text_docs": "description"
313+
}
314+
],
315+
"output_map": [
316+
{
317+
"fact_embedding": "$.inference_results[0].output[0].data",
318+
"embedding_size": "$.inference_results.*.output.*.shape[0]"
319+
}
320+
]
321+
}
322+
}
323+
]
324+
}
325+
```
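The `model_input` value above is a JSON document escaped into a JSON string, which is easy to get wrong by hand. The following is a minimal sketch, not part of the tutorial itself, that produces the same string with Python's standard `json` module. The helper name `build_model_input` is hypothetical, and the sketch only builds the string; it does not call OpenSearch.

```python
import json


def build_model_input(content_type: str) -> str:
    """Return the JSON text that goes inside the pipeline's `model_input` field.

    `content_type` is passed through unchanged; the pipeline above uses "query".
    """
    payload = {
        # Placeholder resolved by the processor from `input_map` at run time.
        "text_docs": ["${input_map.text_docs}"],
        "target_response": ["sentence_embedding"],
        "parameters": {"content_type": content_type},
    }
    # Compact separators reproduce the spacing used in the pipeline definition.
    return json.dumps(payload, separators=(",", ":"))


model_input = build_model_input("query")

# Dumping a second time yields the escaped form to paste into the request body.
print(json.dumps({"model_input": model_input}))
```

Serializing twice (once for the inner document, once for the enclosing request) is what produces the `\"` escapes seen in the pipeline definition.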
326+
327+
### 2.3 Simulate pipeline
328+
329+
- Case 1: one document with a title and a description
330+
```
331+
POST /_ingest/pipeline/asymmetric_embedding_ingest_pipeline/_simulate
332+
{
333+
"docs": [
334+
{
335+
"_index": "my-index",
336+
"_id": "1",
337+
"_source": {
338+
"title": "Central Park",
339+
"description": "A large public park in the heart of New York City, offering a wide range of recreational activities."
340+
}
341+
}
342+
]
343+
}
344+
```
345+
Response:
346+
```
347+
{
348+
"docs": [
349+
{
350+
"doc": {
351+
"_index": "my-index",
352+
"_id": "1",
353+
"_source": {
354+
"description": "A large public park in the heart of New York City, offering a wide range of recreational activities.",
355+
"fact_embedding": [
356+
[
357+
0.06344555,
358+
0.30067796,
359+
...
360+
0.014804064,
361+
-0.022822019
362+
]
363+
],
364+
"title": "Central Park",
365+
"embedding_size": [
366+
384.0
367+
]
368+
},
369+
"_ingest": {
370+
"timestamp": "2024-12-16T20:59:07.152169Z"
371+
}
372+
}
373+
}
374+
]
375+
}
376+
```
377+
378+
### 2.4 Test ingest data
379+
Perform a bulk ingestion. This triggers the ingest pipeline, which generates an embedding for each document.
380+
```
381+
POST /_bulk
382+
{ "index": { "_index": "nyc_facts" } }
383+
{ "title": "Central Park", "description": "A large public park in the heart of New York City, offering a wide range of recreational activities." }
384+
{ "index": { "_index": "nyc_facts" } }
385+
{ "title": "Empire State Building", "description": "An iconic skyscraper in New York City offering breathtaking views from its observation deck." }
386+
{ "index": { "_index": "nyc_facts" } }
387+
{ "title": "Statue of Liberty", "description": "A colossal neoclassical sculpture on Liberty Island, symbolizing freedom and democracy in the United States." }
388+
{ "index": { "_index": "nyc_facts" } }
389+
{ "title": "Brooklyn Bridge", "description": "A historic suspension bridge connecting Manhattan and Brooklyn, offering pedestrian walkways with great views." }
390+
{ "index": { "_index": "nyc_facts" } }
391+
{ "title": "Times Square", "description": "A bustling commercial and entertainment hub in Manhattan, known for its neon lights and Broadway theaters." }
392+
{ "index": { "_index": "nyc_facts" } }
393+
{ "title": "Yankee Stadium", "description": "Home to the New York Yankees, this baseball stadium is a historic landmark in the Bronx." }
394+
{ "index": { "_index": "nyc_facts" } }
395+
{ "title": "The Bronx Zoo", "description": "One of the largest zoos in the world, located in the Bronx, featuring diverse animal exhibits and conservation efforts." }
396+
{ "index": { "_index": "nyc_facts" } }
397+
{ "title": "New York Botanical Garden", "description": "A large botanical garden in the Bronx, known for its diverse plant collections and stunning landscapes." }
398+
{ "index": { "_index": "nyc_facts" } }
399+
{ "title": "Flushing Meadows-Corona Park", "description": "A major park in Queens, home to the USTA Billie Jean King National Tennis Center and the Unisphere." }
400+
{ "index": { "_index": "nyc_facts" } }
401+
{ "title": "Citi Field", "description": "The home stadium of the New York Mets, located in Queens, known for its modern design and fan-friendly atmosphere." }
402+
{ "index": { "_index": "nyc_facts" } }
403+
{ "title": "Rockefeller Center", "description": "A famous complex of commercial buildings in Manhattan, home to the NBC studios and the annual ice skating rink." }
404+
{ "index": { "_index": "nyc_facts" } }
405+
{ "title": "Queens Botanical Garden", "description": "A peaceful, beautiful botanical garden located in Flushing, Queens, featuring seasonal displays and plant collections." }
406+
{ "index": { "_index": "nyc_facts" } }
407+
{ "title": "Arthur Ashe Stadium", "description": "The largest tennis stadium in the world, located in Flushing Meadows-Corona Park, Queens, hosting the U.S. Open." }
408+
{ "index": { "_index": "nyc_facts" } }
409+
{ "title": "Wave Hill", "description": "A public garden and cultural center in the Bronx, offering stunning views of the Hudson River and a variety of nature programs." }
410+
{ "index": { "_index": "nyc_facts" } }
411+
{ "title": "Louis Armstrong House", "description": "The former home of jazz legend Louis Armstrong, located in Corona, Queens, now a museum celebrating his life and music." }
412+
413+
```
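For larger datasets, the bulk body above can be generated rather than typed. This is a minimal sketch using only the standard library; the helper name `build_bulk_body` is hypothetical, and the two sample documents are copied from the request above. It only builds the request body and leaves sending it to whichever client you use.

```python
import json


def build_bulk_body(index_name: str, docs: list) -> str:
    """Build an NDJSON `_bulk` body: one action line followed by one source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


facts = [
    {
        "title": "Central Park",
        "description": "A large public park in the heart of New York City, "
                       "offering a wide range of recreational activities.",
    },
    {
        "title": "Yankee Stadium",
        "description": "Home to the New York Yankees, this baseball stadium "
                       "is a historic landmark in the Bronx.",
    },
]

bulk_body = build_bulk_body("nyc_facts", facts)
print(bulk_body, end="")
```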
414+
415+
## 3. Run Semantic Search
416+
417+
### 3.1 Create the Search Pipeline
418+
Create a search pipeline that converts your query into an embedding and runs a k-NN search on the index to return the closest matching documents.
255419

256-
- Create an ingest pipeline for processing documents using asymmetric embeddings.
257-
- Run a query using KNN (k-nearest neighbors) to search with your asymmetric model.
420+
```
421+
PUT /_search/pipeline/asymmetric_embedding_search_pipeline
422+
{
423+
"description": "generate an embedding for the query text using an asymmetric model and run a k-NN search",
424+
"request_processors": [
425+
{
426+
"ml_inference": {
427+
"query_template": "{\"size\": 3,\"query\": {\"knn\": {\"fact_embedding\": {\"vector\": ${query_embedding},\"k\": 4}}}}",
428+
"function_name": "text_embedding",
429+
"model_id": "{{ _.model_id }}",
430+
"model_input": "{ \"text_docs\": [\"${input_map.query}\"], \"target_response\": [\"sentence_embedding\"], \"parameters\" : {\"content_type\" : \"query\" } }",
431+
"input_map": [
432+
{
433+
"query": "query.term.fact_embedding.value"
434+
}
435+
],
436+
"output_map": [
437+
{
438+
"query_embedding": "$.inference_results[0].output[0].data",
439+
"embedding_size": "$.inference_results.*.output.*.shape[0]"
440+
}
441+
]
442+
}
443+
}
444+
]
445+
}
446+
447+
```
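Conceptually, the `query_template` above is a text template in which `${query_embedding}` is replaced by the vector the model returns, turning the incoming `term` query into a `knn` query. The sketch below imitates that substitution in plain Python to show the shape of the rewritten query. It illustrates the idea and is not the plugin's implementation; `render_query` is a hypothetical helper, and the three-element vector stands in for a real 384-dimensional embedding.

```python
import json

# Same template text as in the search pipeline, unescaped.
QUERY_TEMPLATE = (
    '{"size": 3,"query": {"knn": {"fact_embedding": '
    '{"vector": ${query_embedding},"k": 4}}}}'
)


def render_query(template: str, query_embedding: list) -> dict:
    """Substitute the embedding into the template and parse the result as JSON."""
    rendered = template.replace("${query_embedding}", json.dumps(query_embedding))
    return json.loads(rendered)


# Toy vector for illustration only.
rewritten = render_query(QUERY_TEMPLATE, [0.1, -0.2, 0.3])
print(json.dumps(rewritten, indent=2))
```

Note that the template requests `k` of 4 neighbors but sets `size` to 3, so at most three hits are returned.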
448+
449+
### 3.2 Run Semantic Search
450+
In this scenario, the query asks about sporting activities in New York City and returns the top 3 results.
451+
```
452+
GET /nyc_facts/_search?search_pipeline=asymmetric_embedding_search_pipeline
453+
{
454+
"query": {
455+
"term": {
456+
"fact_embedding": {
457+
"value": "What are some places for sports in NYC?",
458+
"boost": 1
459+
}
460+
}
461+
}
462+
}
463+
```
258464

465+
This yields the following response:
466+
```json
467+
{
468+
"took": 22,
469+
"timed_out": false,
470+
"_shards": {
471+
"total": 1,
472+
"successful": 1,
473+
"skipped": 0,
474+
"failed": 0
475+
},
476+
"hits": {
477+
"total": {
478+
"value": 4,
479+
"relation": "eq"
480+
},
481+
"max_score": 0.12496973,
482+
"hits": [
483+
{
484+
"_index": "nyc_facts",
485+
"_id": "hb9X0ZMBICPs-TP0ijZX",
486+
"_score": 0.12496973,
487+
"_source": {
488+
"fact_embedding": [
489+
...
490+
],
491+
"embedding_size": [
492+
384.0
493+
],
494+
"description": "A large public park in the heart of New York City, offering a wide range of recreational activities.",
495+
"title": "Central Park"
496+
}
497+
},
498+
{
499+
"_index": "nyc_facts",
500+
"_id": "ir9X0ZMBICPs-TP0ijZX",
501+
"_score": 0.114651985,
502+
"_source": {
503+
"fact_embedding": [
504+
...
505+
],
506+
"embedding_size": [
507+
384.0
508+
],
509+
"description": "Home to the New York Yankees, this baseball stadium is a historic landmark in the Bronx.",
510+
"title": "Yankee Stadium"
511+
}
512+
},
513+
{
514+
"_index": "nyc_facts",
515+
"_id": "j79X0ZMBICPs-TP0ijZX",
516+
"_score": 0.110090025,
517+
"_source": {
518+
"fact_embedding": [
519+
...
520+
],
521+
"embedding_size": [
522+
384.0
523+
],
524+
"description": "A famous complex of commercial buildings in Manhattan, home to the NBC studios and the annual ice skating rink.",
525+
"title": "Rockefeller Center"
526+
}
527+
}
528+
]
529+
}
530+
}
531+
```
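The `_score` values in the response come from the `l2` space type set in the index mapping. As a rough aid for reading them, the sketch below assumes the scoring rule the k-NN plugin documents for `l2`, namely `1 / (1 + d)` where `d` is the squared Euclidean distance. Treat that formula as an assumption to verify against the documentation for your OpenSearch version; the function names are hypothetical.

```python
def l2_squared(a: list, b: list) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_score_l2(a: list, b: list) -> float:
    """Score under the assumed l2 rule: identical vectors score 1.0, distant ones approach 0."""
    return 1.0 / (1.0 + l2_squared(a, b))


def squared_distance_from_score(score: float) -> float:
    """Invert the assumed rule to recover the squared distance from a `_score`."""
    return 1.0 / score - 1.0


print(knn_score_l2([1.0, 0.0], [0.0, 0.0]))
print(squared_distance_from_score(0.12496973))
```

Under this assumption, the top hit's score of 0.12496973 corresponds to a squared distance of roughly 7.0, and a higher score means a closer match.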
259532
---
260533

261534
## References
