---
layout: post
title: "Optimizing inference processors for cost efficiency and performance"
authors:
  - will-hwang
  - heemin
  - kolchfa
date: 2025-05-15
has_science_table: true
categories:
  - technical-posts
meta_keywords: OpenSearch, inference processors, vector embeddings, text embedding, sparse encoding, image embedding, ingest pipeline optimization, skip_existing, performance tuning, semantic search, multimodal search, machine learning inference, cost reduction, bulk API, document updates
meta_description: Learn how to optimize inference processors in OpenSearch to reduce redundant model calls, lower costs, and improve ingestion performance.
---
Inference processors, such as `text_embedding`, `text_image_embedding`, and `sparse_encoding`, generate vector embeddings when documents are ingested or updated. Today, these processors invoke model inference every time a document is ingested or updated, even if the embedding source fields remain unchanged. This can lead to unnecessary compute usage and increased costs.

This blog post introduces a new inference processor optimization that eliminates redundant inference calls, lowering costs and improving overall performance.

## How the optimization works

The optimization uses the previously ingested document as a cache: the processor compares the embedding source fields in the updated document against those in the existing document. If the fields have not changed, the processor copies the existing embeddings into the updated document instead of triggering new inference. If the fields differ, the processor runs inference as usual. The following diagram illustrates this workflow.

![Workflow Diagram](/assets/media/blog-images/2025-05-15-optimized-inference-processors/workflow_diagram.png){:class="img-centered"}

This approach minimizes redundant inference calls, significantly improving efficiency without affecting the accuracy or freshness of embeddings.

## How to enable the optimization

To enable this optimization, set the `skip_existing` parameter to `true` in your ingest pipeline processor definition. This option is available for the [`text_embedding`](#text-embedding-processor), [`text_image_embedding`](#textimage-embedding-processor), and [`sparse_encoding`](#sparse-encoding-processor) processors. By default, `skip_existing` is set to `false`.
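
After you create a pipeline using one of the following examples, you can verify that the parameter is set by retrieving the pipeline definition:

```json
GET /_ingest/pipeline/optimized-ingest-pipeline
```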

### Text embedding processor

The [`text_embedding` processor](https://docs.opensearch.org/docs/latest/ingest-pipelines/processors/text-embedding/) generates vector embeddings for text fields, typically used in semantic search.

* **Optimization behavior**: If `skip_existing` is `true`, the processor checks whether the text fields mapped in `field_map` have changed. If they haven't, inference is skipped and the existing vectors are reused.

**Example pipeline**:

```json
PUT /_ingest/pipeline/optimized-ingest-pipeline
{
  "description": "Optimized ingest pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "<model_id>",
        "field_map": {
          "text": "<vector_field>"
        },
        "skip_existing": true
      }
    }
  ]
}
```
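
To try the optimization end to end, you can set this pipeline as an index's default pipeline. The following sketch assumes that `<vector_field>` in the pipeline is `passage_embedding`; the index name, field names, and dimension are illustrative placeholders, and the dimension must match your model's output:

```json
PUT /my-nlp-index
{
  "settings": {
    "index.knn": true,
    "index.default_pipeline": "optimized-ingest-pipeline"
  },
  "mappings": {
    "properties": {
      "text": { "type": "text" },
      "passage_embedding": {
        "type": "knn_vector",
        "dimension": 768
      }
    }
  }
}
```

With this setup, resending the same document with an unchanged `text` value copies the stored `passage_embedding` instead of calling the model again.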

### Text/image embedding processor

The [`text_image_embedding` processor](https://docs.opensearch.org/docs/latest/ingest-pipelines/processors/text-image-embedding/) generates a combined embedding from text and image fields for multimodal search use cases.

* **Optimization behavior**: Because a single embedding is generated from the combined text and image fields, inference is skipped only if **both** fields mapped in `field_map` are unchanged (see the illustration following the example pipeline).

**Example pipeline**:

```json
PUT /_ingest/pipeline/optimized-ingest-pipeline
{
  "description": "Optimized ingest pipeline",
  "processors": [
    {
      "text_image_embedding": {
        "model_id": "<model_id>",
        "embedding": "<vector_field>",
        "field_map": {
          "text": "<input_text_field>",
          "image": "<input_image_field>"
        },
        "skip_existing": true
      }
    }
  ]
}
```
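
The following hypothetical requests illustrate the both-fields rule, assuming the pipeline above is the default pipeline for an index named `my-multimodal-index`. The second request changes only `text`, so inference runs again even though `image` is identical:

```json
PUT /my-multimodal-index/_doc/1
{
  "text": "Orange table",
  "image": "bGlkaHQtd29rfx43..."
}

PUT /my-multimodal-index/_doc/1
{
  "text": "Orange dining table",
  "image": "bGlkaHQtd29rfx43..."
}
```

Resending the first request verbatim would skip inference because both mapped fields match the stored document.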

### Sparse encoding processor

The [`sparse_encoding` processor](https://docs.opensearch.org/docs/latest/ingest-pipelines/processors/sparse-encoding/) generates sparse vectors (token-weight pairs) from text fields for neural sparse search.

* **Optimization behavior**: If the text fields in `field_map` are unchanged, the processor skips inference and reuses the existing sparse encoding.

**Example pipeline**:

```json
PUT /_ingest/pipeline/optimized-ingest-pipeline
{
  "description": "Optimized ingest pipeline",
  "processors": [
    {
      "sparse_encoding": {
        "model_id": "<model_id>",
        "prune_type": "max_ratio",
        "prune_ratio": 0.1,
        "field_map": {
          "text": "<vector_field>"
        },
        "skip_existing": true
      }
    }
  ]
}
```
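
For reference, a minimal index to pair with this pipeline might look as follows. This sketch assumes `<vector_field>` in the pipeline is `passage_embedding`; the index and field names are illustrative, and `rank_features` is the field type used to store token-weight pairs for neural sparse search:

```json
PUT /my-sparse-index
{
  "settings": {
    "index.default_pipeline": "optimized-ingest-pipeline"
  },
  "mappings": {
    "properties": {
      "text": { "type": "text" },
      "passage_embedding": { "type": "rank_features" }
    }
  }
}
```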

## Performance results

In addition to reducing compute costs, skipping redundant inference significantly lowers update latency. The following benchmarks compare processor performance with and without the `skip_existing` optimization. In the results tables, "Updated fields" covers runs in which the embedding source fields changed (inference still runs), and "Unchanged fields" covers runs in which they did not (inference is skipped).

### Test environment

We used the following cluster setup to run the benchmarking tests.



### Text embedding processor

* **Model**: `huggingface/sentence-transformers/msmarco-distilbert-base-tas-b`
* **Dataset**: [Trec-Covid](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid.zip)

**Sample requests**:

Single document:

```json
PUT /test_index/_doc/1
{
  "text": "Hello World"
}
```

Bulk update:

```json
POST _bulk
{ "index": { "_index": "test_index" } }
{ "text": "hello world" }
{ "index": { "_index": "test_index" } }
{ "text": "Hi World" }
```

The following table presents the benchmarking results for the `text_embedding` processor.

| Operation type | Document count | Batch size | Baseline (`skip_existing` = `false`) | Updated fields (`skip_existing` = `true`) | Δ updated vs. baseline | Unchanged fields (`skip_existing` = `true`) | Δ unchanged vs. baseline |
| -------------- | -------------- | ---------- | ------------------------------------ | ----------------------------------------- | ---------------------- | ------------------------------------------- | ------------------------ |
| Single update  | 3,000          | 1          | 1,400,710 ms                         | 1,401,216 ms                               | +0.04%                 | 292,020 ms                                   | -79.15%                  |
| Batch update   | 171,332        | 200        | 2,247,191 ms                         | 2,192,883 ms                               | -2.42%                 | 352,767 ms                                   | -84.30%                  |

### Text/image embedding processor

* **Model**: `amazon.titan-embed-image-v1`
* **Dataset**: [Flickr Images](https://www.kaggle.com/datasets/hsankesara/flickr-image-dataset)

**Sample requests**:

Single document:

```json
PUT /test_index/_doc/1
{
  "text": "Orange table",
  "image": "bGlkaHQtd29rfx43..."
}
```

Bulk update:

```json
POST _bulk
{ "index": { "_index": "test_index" } }
{ "text": "Orange table", "image": "bGlkaHQtd29rfx43..." }
{ "index": { "_index": "test_index" } }
{ "text": "Red chair", "image": "aFlkaHQtd29rfx43..." }
```

The following table presents the benchmarking results for the `text_image_embedding` processor.

| Operation type | Document count | Batch size | Baseline (`skip_existing` = `false`) | Updated fields (`skip_existing` = `true`) | Δ updated vs. baseline | Unchanged fields (`skip_existing` = `true`) | Δ unchanged vs. baseline |
| -------------- | -------------- | ---------- | ------------------------------------ | ----------------------------------------- | ---------------------- | ------------------------------------------- | ------------------------ |
| Single update  | 3,000          | 1          | 1,060,339 ms                         | 1,060,785 ms                               | +0.04%                 | 465,771 ms                                   | -56.07%                  |
| Batch update   | 31,783         | 200        | 1,809,299 ms                         | 1,662,389 ms                               | -8.12%                 | 1,571,012 ms                                 | -13.17%                  |

### Sparse encoding processor

* **Model**: `huggingface/sentence-transformers/msmarco-distilbert-base-tas-b`
* **Dataset**: [Trec-Covid](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid.zip)
* **Prune method**: `max_ratio`, **ratio**: `0.1`

**Sample requests**:

Single document:

```json
PUT /test_index/_doc/1
{
  "text": "Hello World"
}
```

Bulk update:

```json
POST _bulk
{ "index": { "_index": "test_index" } }
{ "text": "hello world" }
{ "index": { "_index": "test_index" } }
{ "text": "Hi World" }
```

The following table presents the benchmarking results for the `sparse_encoding` processor.

| Operation type | Document count | Batch size | Baseline (`skip_existing` = `false`) | Updated fields (`skip_existing` = `true`) | Δ updated vs. baseline | Unchanged fields (`skip_existing` = `true`) | Δ unchanged vs. baseline |
| -------------- | -------------- | ---------- | ------------------------------------ | ----------------------------------------- | ---------------------- | ------------------------------------------- | ------------------------ |
| Single update  | 3,000          | 1          | 1,942,907 ms                         | 1,965,918 ms                               | +1.18%                 | 306,766 ms                                   | -84.21%                  |
| Batch update   | 171,332        | 200        | 3,077,040 ms                         | 3,101,697 ms                               | +0.80%                 | 475,197 ms                                   | -84.56%                  |

## Conclusion

As the benchmark results show, the `skip_existing` optimization significantly reduces redundant inference calls, lowering costs and latency with minimal impact on initial ingestion. By reusing existing embeddings when the input fields remain unchanged, ingest pipelines process document updates faster and more efficiently, improving scalability and delivering more cost-effective embedding generation at scale.

## What’s next

If you use the Bulk API with ingest pipelines, it's important to understand how different operations behave. When modifying documents in a bulk request, you can use either the `index` or the `update` operation:

* The `index` operation replaces the entire document and **does** trigger ingest pipelines.
* The `update` operation modifies only the specified fields but **does not** currently trigger ingest pipelines, as shown in the following example.
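
The following hypothetical bulk request illustrates the difference: the first action (`index`) replaces document `1` and runs the ingest pipeline (reusing embeddings if the mapped fields are unchanged), while the second action (`update`) patches document `2` directly and bypasses the pipeline, leaving its embeddings stale:

```json
POST _bulk
{ "index": { "_index": "test_index", "_id": "1" } }
{ "text": "hello world" }
{ "update": { "_index": "test_index", "_id": "2" } }
{ "doc": { "text": "hi world" } }
```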

If you'd like to see ingest pipeline support for the `update` operation in Bulk API requests, add a +1 to [this GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/17494).