Commit 6ad9ea0

Merge pull request opensearch-project#3771 from will-hwang/optimized_inference
[BLOG] Optimizing Inference Processors for Cost Efficiency and Performance
2 parents 59d4e9b + 406571f commit 6ad9ea0

File tree

5 files changed, +266 -0 lines changed


_community_members/will-hwang.md

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
---
name: Will Hwang
short_name: will-hwang
photo: "/assets/media/community/members/will-hwang.jpg"
title: 'OpenSearch Community Member: Will Hwang'
primary_title: Will Hwang
breadcrumbs:
  icon: community
  items:
    - title: Community
      url: /community/index.html
    - title: Members
      url: /community/members/index.html
    - title: "Will Hwang's Profile"
      url: '/community/members/will-hwang.html'
github: will-hwang
job_title_and_company: 'Software Engineer at Amazon Web Services'
personas:
  - author
permalink: '/community/members/will-hwang.html'
redirect_from: '/authors/will-hwang/'
---

**Will Hwang** is a Software Engineer at AWS who focuses on neural search development in OpenSearch.
Lines changed: 242 additions & 0 deletions
@@ -0,0 +1,242 @@
---
layout: post
title: "Optimizing inference processors for cost efficiency and performance"
authors:
  - will-hwang
  - heemin-kim
  - kolchfa
date: 2025-05-29
has_science_table: true
categories:
  - technical-posts
meta_keywords: inference processors, vector embeddings, OpenSearch text embedding, text image embedding, sparse encoding, caching mechanism, ingest pipeline, OpenSearch optimization
meta_description: Learn about a new OpenSearch optimization for inference processors that reduces redundant calls, lowering costs and improving performance in vector embedding generation.
---

Inference processors, such as `text_embedding`, `text_image_embedding`, and `sparse_encoding`, enable the generation of vector embeddings during document ingestion or updates. Today, these processors invoke model inference every time a document is ingested or updated, even if the embedding source fields remain unchanged. This can lead to unnecessary compute usage and increased costs.

This blog post introduces a new inference processor optimization that reduces redundant inference calls, lowering costs and improving overall performance.

## How the optimization works

The optimization adds a caching mechanism that compares the embedding source fields in the updated document against the existing document. If the embedding fields have not changed, the processor directly copies the existing embeddings into the updated document instead of triggering new inference. If the fields differ, the processor proceeds with inference as usual. The following diagram illustrates this workflow.

![Optimization workflow](/assets/media/blog-images/2025-05-15-optimized-inference-processors/diagram.png)

This approach minimizes redundant inference calls, significantly improving efficiency without impacting the accuracy or freshness of embeddings.
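
As a concrete illustration, assume a hypothetical index named `my-nlp-index` whose default ingest pipeline contains a `text_embedding` processor with `skip_existing` enabled (pipeline definitions are shown in the next section); the `category` field below is also just an illustrative metadata field:

```json
PUT /my-nlp-index/_doc/1
{
  "text": "Hello World"
}

PUT /my-nlp-index/_doc/1
{
  "text": "Hello World",
  "category": "greeting"
}
```

The first request generates an embedding as usual. In the second request, the `text` field matches the existing document, so the processor copies the stored vector into the updated document and skips the model call, even though the document itself is being replaced.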

## How to enable the optimization

To enable this optimization, set the `skip_existing` parameter to `true` in your ingest pipeline processor definition. This option is available for [`text_embedding`](#text-embedding-processor), [`text_image_embedding`](#textimage-embedding-processor), and [`sparse_encoding`](#sparse-encoding-processor) processors. By default, `skip_existing` is set to `false`.

### Text embedding processor

The [`text_embedding` processor](https://docs.opensearch.org/docs/latest/ingest-pipelines/processors/text-embedding/) generates vector embeddings for text fields, typically used in semantic search.

* **Optimization behavior**: If `skip_existing` is `true`, the processor checks whether the text fields mapped in `field_map` have changed. If they haven't, inference is skipped and the existing vector is reused.

**Example pipeline**:

```json
PUT /_ingest/pipeline/optimized-ingest-pipeline
{
  "description": "Optimized ingest pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "<model_id>",
        "field_map": {
          "text": "<vector_field>"
        },
        "skip_existing": true
      }
    }
  ]
}
```
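
One way to apply this pipeline during ingestion is to set it as the index's default pipeline (the index name below is a placeholder); alternatively, you can pass it per request using the `pipeline` query parameter:

```json
PUT /my-nlp-index/_settings
{
  "index.default_pipeline": "optimized-ingest-pipeline"
}
```

With the default pipeline in place, plain index requests such as `PUT /my-nlp-index/_doc/1` run the `text_embedding` processor, and re-ingesting a document with an unchanged `text` field reuses its existing embedding.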

### Text/image embedding processor

The [`text_image_embedding` processor](https://docs.opensearch.org/docs/latest/ingest-pipelines/processors/text-image-embedding/) generates combined embeddings from text and image fields for multimodal search use cases.

* **Optimization behavior**: Because embeddings are generated for combined text and image fields, inference is skipped only if **both** the text and image fields mapped in `field_map` are unchanged.

**Example pipeline**:

```json
PUT /_ingest/pipeline/optimized-ingest-pipeline
{
  "description": "Optimized ingest pipeline",
  "processors": [
    {
      "text_image_embedding": {
        "model_id": "<model_id>",
        "embedding": "<vector_field>",
        "field_map": {
          "text": "<input_text_field>",
          "image": "<input_image_field>"
        },
        "skip_existing": true
      }
    }
  ]
}
```
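
To see the "both fields unchanged" requirement in practice, consider re-indexing a document through this pipeline, assuming the pipeline's `field_map` points at document fields named `text` and `image` (the index name is a placeholder, and the Base64 image strings are truncated examples). The `text` value is identical in both requests, but because the `image` changes, the processor still calls the model to regenerate the combined embedding:

```json
PUT /my-multimodal-index/_doc/1?pipeline=optimized-ingest-pipeline
{
  "text": "Orange table",
  "image": "bGlkaHQtd29rfx43..."
}

PUT /my-multimodal-index/_doc/1?pipeline=optimized-ingest-pipeline
{
  "text": "Orange table",
  "image": "aFlkaHQtd29rfx43..."
}
```

Inference is skipped only when both mapped fields match the existing document.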

### Sparse encoding processor

The [`sparse_encoding` processor](https://docs.opensearch.org/docs/latest/ingest-pipelines/processors/sparse-encoding/) generates sparse vectors from text fields used in neural sparse retrieval.

* **Optimization behavior**: If the text fields in `field_map` are unchanged, the processor skips inference and reuses the existing sparse encoding.

**Example pipeline**:

```json
PUT /_ingest/pipeline/optimized-ingest-pipeline
{
  "description": "Optimized ingest pipeline",
  "processors": [
    {
      "sparse_encoding": {
        "model_id": "<model_id>",
        "prune_type": "max_ratio",
        "prune_ratio": 0.1,
        "field_map": {
          "text": "<vector_field>"
        },
        "skip_existing": true
      }
    }
  ]
}
```
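
After ingesting a document through this pipeline, you can retrieve it to confirm that the target field was populated (the index name is a placeholder):

```json
GET /test_index/_doc/1
```

The returned `_source` contains the original text plus the generated token-weight map in the mapped vector field; if you later re-index the document with the same `text`, that map is copied over rather than recomputed.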

## Performance results

In addition to reducing compute costs, skipping redundant inference significantly lowers latency. The following benchmarks compare processor performance with and without the `skip_existing` optimization.

### Test environment

We used the following cluster setup to run benchmarking tests.

![Cluster setup](/assets/media/blog-images/2025-05-15-optimized-inference-processors/cluster_setup.png)

### Text embedding processor

* **Model**: `huggingface/sentence-transformers/msmarco-distilbert-base-tas-b`
* **Dataset**: [Trec-Covid](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid.zip)

**Sample requests**

Single document:

```json
PUT /test_index/_doc/1
{
  "text": "Hello World"
}
```

Bulk update:

```json
POST _bulk
{ "index": { "_index": "test_index" } }
{ "text": "hello world" }
{ "index": { "_index": "test_index" } }
{ "text": "Hi World" }
```

The following table presents the benchmarking test results for the `text_embedding` processor.

| Operation type | Document count | Batch size | Baseline (`skip_existing`=false) | Updated (`skip_existing`=true) | Δ vs. baseline | Unchanged (`skip_existing`=true) | Δ vs. baseline |
| -------------- | -------------- | ---------- | -------------------------------- | ------------------------------ | -------------- | -------------------------------- | -------------- |
| Single update  | 3,000          | 1          | 1,400,710 ms                     | 1,401,216 ms                   | +0.04%         | 292,020 ms                       | -79.15%        |
| Batch update   | 171,332        | 200        | 2,247,191 ms                     | 2,192,883 ms                   | -2.42%         | 352,767 ms                       | -84.30%        |

### Text/image embedding processor

* **Model**: `amazon.titan-embed-image-v1`
* **Dataset**: [Flickr Image](https://www.kaggle.com/datasets/hsankesara/flickr-image-dataset)

**Sample requests**

Single document:

```json
PUT /test_index/_doc/1
{
  "text": "Orange table",
  "image": "bGlkaHQtd29rfx43..."
}
```

Bulk update:

```json
POST _bulk
{ "index": { "_index": "test_index" } }
{ "text": "Orange table", "image": "bGlkaHQtd29rfx43..." }
{ "index": { "_index": "test_index" } }
{ "text": "Red chair", "image": "aFlkaHQtd29rfx43..." }
```

The following table presents the benchmarking test results for the `text_image_embedding` processor.

| Operation type | Document count | Batch size | Baseline     | Updated      | Δ vs. baseline | Unchanged    | Δ vs. baseline |
| -------------- | -------------- | ---------- | ------------ | ------------ | -------------- | ------------ | -------------- |
| Single update  | 3,000          | 1          | 1,060,339 ms | 1,060,785 ms | +0.04%         | 465,771 ms   | -56.07%        |
| Batch update   | 31,783         | 200        | 1,809,299 ms | 1,662,389 ms | -8.12%         | 1,571,012 ms | -13.17%        |

### Sparse encoding processor

* **Model**: `huggingface/sentence-transformers/msmarco-distilbert-base-tas-b`
* **Dataset**: [Trec-Covid](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid.zip)
* **Prune method**: `max_ratio`, **ratio**: `0.1`

**Sample requests**

Single document:

```json
PUT /test_index/_doc/1
{
  "text": "Hello World"
}
```

Bulk update:

```json
POST _bulk
{ "index": { "_index": "test_index" } }
{ "text": "hello world" }
{ "index": { "_index": "test_index" } }
{ "text": "Hi World" }
```

The following table presents the benchmarking test results for the `sparse_encoding` processor.

| Operation type | Document count | Batch size | Baseline     | Updated      | Δ vs. baseline | Unchanged  | Δ vs. baseline |
| -------------- | -------------- | ---------- | ------------ | ------------ | -------------- | ---------- | -------------- |
| Single update  | 3,000          | 1          | 1,942,907 ms | 1,965,918 ms | +1.18%         | 306,766 ms | -84.21%        |
| Batch update   | 171,332        | 200        | 3,077,040 ms | 3,101,697 ms | +0.80%         | 475,197 ms | -84.56%        |

## Conclusion

As the performance results demonstrate, the `skip_existing` optimization significantly reduces redundant inference operations, which translates to lower costs and improved system performance. By reusing existing embeddings when input fields remain unchanged, ingest pipelines can process updates faster and more efficiently, enhancing scalability and delivering more cost-effective embedding generation at scale.

## What's next

If you use the Bulk API with ingest pipelines, it's important to understand how different operations behave.

The two Bulk API operations relevant here are `index` and `update`:

* The `index` operation replaces the entire document and **does** trigger ingest pipelines.
* The `update` operation modifies only the specified fields but **does not** currently trigger ingest pipelines (see the example below).
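
For example, in the following bulk request (the index name and document ID are placeholders, and the index is assumed to have a default ingest pipeline configured), only the first action runs the inference processors; the second action updates the document without invoking the pipeline:

```json
POST _bulk
{ "index": { "_index": "test_index", "_id": "1" } }
{ "text": "hello world" }
{ "update": { "_index": "test_index", "_id": "1" } }
{ "doc": { "text": "hello world" } }
```

This means that an `update` action can leave embedding fields stale until the document is re-indexed with an `index` action.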

If you'd like to see ingest pipeline support added to the `update` operation in Bulk API requests, consider supporting [this GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/17494) by adding a +1.