Commit ebc64bc

Update UpdateForV9 in AttachmentProcessor
We are not going to make this change in V9. We may do it in V10. This change just bumps the annotation to remind us to revisit it.

Since we are living with this for a while, it seems worth improving the documentation. The documentation now encourages explicitly setting the option one way or the other, since you get a warning if you omit it. It also changes the existing examples to use `true` rather than `false`, as that's our recommendation. And it adds a new section with an example where it's `false`, and moves the content previously in a note into that section.
1 parent 7ffac3b commit ebc64bc
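
The warn-on-omission behaviour described in the commit message can be illustrated with a minimal sketch. This is not the Elasticsearch implementation: `RemoveBinarySketch`, `readOptionalBoolean`, `resolveRemoveBinary`, the `WARNINGS` list, and the warning text are all hypothetical stand-ins. It assumes only what the diff shows, namely that the option is read as a nullable `Boolean` and that a `null` value triggers a deprecation warning while the effective default stays `false`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the factory logic; not the Elasticsearch code.
final class RemoveBinarySketch {

    // Stand-in for a deprecation logger: collects warnings so they can be inspected.
    static final List<String> WARNINGS = new ArrayList<>();

    // Returns null when the key is absent, so "omitted" stays distinguishable from "false".
    static Boolean readOptionalBoolean(Map<String, Object> config, String key) {
        Object value = config.get(key);
        if (value == null) {
            return null;
        }
        if (value instanceof Boolean) {
            return (Boolean) value;
        }
        return Boolean.parseBoolean(value.toString());
    }

    // Resolves the effective setting and records a warning only when the option was omitted.
    static boolean resolveRemoveBinary(Map<String, Object> config) {
        Boolean removeBinary = readOptionalBoolean(config, "remove_binary");
        if (removeBinary == null) {
            // Assumed wording; the real message is not shown in the diff.
            WARNINGS.add("[remove_binary] is not set; falling back to the default of false");
            return false;
        }
        return removeBinary;
    }

    public static void main(String[] args) {
        Map<String, Object> omitted = new HashMap<>();
        omitted.put("field", "data");

        Map<String, Object> explicit = new HashMap<>(omitted);
        explicit.put("remove_binary", true);

        System.out.println(resolveRemoveBinary(omitted));  // false, and one warning is recorded
        System.out.println(resolveRemoveBinary(explicit)); // true, no additional warning
        System.out.println(WARNINGS.size());               // 1
    }
}
```

Keeping the value nullable is what lets an explicit `false` pass silently while an omitted option produces the warning, which is why the documentation encourages setting it either way.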

2 files changed

+73
-28
lines changed

docs/reference/ingest/processors/attachment.asciidoc

Lines changed: 68 additions & 23 deletions
@@ -19,15 +19,15 @@ representation. The processor will skip the base64 decoding then.
 .Attachment options
 [options="header"]
 |======
-| Name | Required | Default | Description
-| `field` | yes | - | The field to get the base64 encoded field from
-| `target_field` | no | attachment | The field that will hold the attachment information
-| `indexed_chars` | no | 100000 | The number of chars being used for extraction to prevent huge fields. Use `-1` for no limit.
-| `indexed_chars_field` | no | `null` | Field name from which you can overwrite the number of chars being used for extraction. See `indexed_chars`.
-| `properties` | no | all properties | Array of properties to select to be stored. Can be `content`, `title`, `name`, `author`, `keywords`, `date`, `content_type`, `content_length`, `language`
-| `ignore_missing` | no | `false` | If `true` and `field` does not exist, the processor quietly exits without modifying the document
-| `remove_binary` | no | `false` | If `true`, the binary `field` will be removed from the document
-| `resource_name` | no | | Field containing the name of the resource to decode. If specified, the processor passes this resource name to the underlying Tika library to enable https://tika.apache.org/1.24.1/detection.html#Resource_Name_Based_Detection[Resource Name Based Detection].
+| Name | Required | Default | Description
+| `field` | yes | - | The field to get the base64 encoded field from
+| `target_field` | no | attachment | The field that will hold the attachment information
+| `indexed_chars` | no | 100000 | The number of chars being used for extraction to prevent huge fields. Use `-1` for no limit.
+| `indexed_chars_field` | no | `null` | Field name from which you can overwrite the number of chars being used for extraction. See `indexed_chars`.
+| `properties` | no | all properties | Array of properties to select to be stored. Can be `content`, `title`, `name`, `author`, `keywords`, `date`, `content_type`, `content_length`, `language`
+| `ignore_missing` | no | `false` | If `true` and `field` does not exist, the processor quietly exits without modifying the document
+| `remove_binary` | encouraged | `false` | If `true`, the binary `field` will be removed from the document. This option is not required, but setting it explicitly is encouraged, and omitting it will result in a warning.
+| `resource_name` | no | | Field containing the name of the resource to decode. If specified, the processor passes this resource name to the underlying Tika library to enable https://tika.apache.org/1.24.1/detection.html#Resource_Name_Based_Detection[Resource Name Based Detection].
 |======

 [discrete]
@@ -58,7 +58,7 @@ PUT _ingest/pipeline/attachment
     {
       "attachment" : {
         "field" : "data",
-        "remove_binary": false
+        "remove_binary": true
       }
     }
   ]
@@ -82,7 +82,6 @@ The document's `attachment` object contains extracted properties for the file:
   "_seq_no": 22,
   "_primary_term": 1,
   "_source": {
-    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
     "attachment": {
       "content_type": "application/rtf",
       "language": "ro",
@@ -94,9 +93,6 @@ The document's `attachment` object contains extracted properties for the file:
 ----
 // TESTRESPONSE[s/"_seq_no": \d+/"_seq_no" : $body._seq_no/ s/"_primary_term" : 1/"_primary_term" : $body._primary_term/]

-NOTE: Keeping the binary as a field within the document might consume a lot of resources. It is highly recommended
-to remove that field from the document. Set `remove_binary` to `true` to automatically remove the field.
-
 [[attachment-fields]]
 ==== Exported fields

@@ -143,7 +139,7 @@ PUT _ingest/pipeline/attachment
       "attachment" : {
         "field" : "data",
         "properties": [ "content", "title" ],
-        "remove_binary": false
+        "remove_binary": true
       }
     }
   ]
@@ -154,6 +150,59 @@ NOTE: Extracting contents from binary data is a resource intensive operation and
 consumes a lot of resources. It is highly recommended to run pipelines
 using this processor in a dedicated ingest node.

+[[attachment-keep-binary]]
+==== Keeping the attachment binary
+
+Keeping the binary as a field within the document might consume a lot of resources. It is highly recommended to remove
+that field from the document, by setting `remove_binary` to `true` to automatically remove the field, as in the other
+examples shown on this page. If you _do_ want to keep the binary field, explicitly set `remove_binary` to `false` to
+avoid the warning you get from omitting it:
+
+[source,console]
+----
+PUT _ingest/pipeline/attachment
+{
+  "description" : "Extract attachment information including original binary",
+  "processors" : [
+    {
+      "attachment" : {
+        "field" : "data",
+        "remove_binary": false
+      }
+    }
+  ]
+}
+PUT my-index-000001/_doc/my_id?pipeline=attachment
+{
+  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
+}
+GET my-index-000001/_doc/my_id
+----
+
+The document's `_source` object includes the original binary field:
+
+[source,console-result]
+----
+{
+  "found": true,
+  "_index": "my-index-000001",
+  "_id": "my_id",
+  "_version": 1,
+  "_seq_no": 22,
+  "_primary_term": 1,
+  "_source": {
+    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
+    "attachment": {
+      "content_type": "application/rtf",
+      "language": "ro",
+      "content": "Lorem ipsum dolor sit amet",
+      "content_length": 28
+    }
+  }
+}
+----
+// TESTRESPONSE[s/"_seq_no": \d+/"_seq_no" : $body._seq_no/ s/"_primary_term" : 1/"_primary_term" : $body._primary_term/]
+
 [[attachment-cbor]]
 ==== Use the attachment processor with CBOR

@@ -170,7 +219,7 @@ PUT _ingest/pipeline/cbor-attachment
     {
       "attachment" : {
         "field" : "data",
-        "remove_binary": false
+        "remove_binary": true
       }
     }
   ]
@@ -226,7 +275,7 @@ PUT _ingest/pipeline/attachment
         "field" : "data",
         "indexed_chars" : 11,
         "indexed_chars_field" : "max_size",
-        "remove_binary": false
+        "remove_binary": true
       }
     }
   ]
@@ -250,7 +299,6 @@ Returns this:
   "_seq_no": 35,
   "_primary_term": 1,
   "_source": {
-    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
     "attachment": {
       "content_type": "application/rtf",
       "language": "is",
@@ -274,7 +322,7 @@ PUT _ingest/pipeline/attachment
         "field" : "data",
         "indexed_chars" : 11,
         "indexed_chars_field" : "max_size",
-        "remove_binary": false
+        "remove_binary": true
       }
     }
   ]
@@ -299,7 +347,6 @@ Returns this:
   "_seq_no": 40,
   "_primary_term": 1,
   "_source": {
-    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
     "max_size": 5,
     "attachment": {
       "content_type": "application/rtf",
@@ -358,7 +405,7 @@ PUT _ingest/pipeline/attachment
           "attachment": {
             "target_field": "_ingest._value.attachment",
             "field": "_ingest._value.data",
-            "remove_binary": false
+            "remove_binary": true
           }
         }
       }
@@ -396,7 +443,6 @@ Returns this:
     "attachments" : [
       {
         "filename" : "ipsum.txt",
-        "data" : "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo=",
         "attachment" : {
           "content_type" : "text/plain; charset=ISO-8859-1",
           "language" : "en",
@@ -406,7 +452,6 @@ Returns this:
       },
       {
         "filename" : "test.txt",
-        "data" : "VGhpcyBpcyBhIHRlc3QK",
         "attachment" : {
           "content_type" : "text/plain; charset=ISO-8859-1",
           "language" : "en",

modules/ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/AttachmentProcessor.java

Lines changed: 5 additions & 5 deletions
@@ -18,7 +18,7 @@
 import org.elasticsearch.common.Strings;
 import org.elasticsearch.common.logging.DeprecationCategory;
 import org.elasticsearch.common.logging.DeprecationLogger;
-import org.elasticsearch.core.UpdateForV9;
+import org.elasticsearch.core.UpdateForV10;
 import org.elasticsearch.ingest.AbstractProcessor;
 import org.elasticsearch.ingest.IngestDocument;
 import org.elasticsearch.ingest.Processor;
@@ -196,7 +196,7 @@ public IngestDocument execute(IngestDocument ingestDocument) {
      * @param property property to add
      * @param value value to add
      */
-    private <T> void addAdditionalField(Map<String, Object> additionalFields, Property property, String value) {
+    private void addAdditionalField(Map<String, Object> additionalFields, Property property, String value) {
         if (properties.contains(property) && Strings.hasLength(value)) {
             additionalFields.put(property.toLowerCase(), value);
         }
@@ -233,16 +233,16 @@ public AttachmentProcessor create(
             String processorTag,
             String description,
             Map<String, Object> config
-        ) throws Exception {
+        ) {
             String field = readStringProperty(TYPE, processorTag, config, "field");
             String resourceName = readOptionalStringProperty(TYPE, processorTag, config, "resource_name");
             String targetField = readStringProperty(TYPE, processorTag, config, "target_field", "attachment");
             List<String> propertyNames = readOptionalList(TYPE, processorTag, config, "properties");
             int indexedChars = readIntProperty(TYPE, processorTag, config, "indexed_chars", NUMBER_OF_CHARS_INDEXED);
             boolean ignoreMissing = readBooleanProperty(TYPE, processorTag, config, "ignore_missing", false);
             String indexedCharsField = readOptionalStringProperty(TYPE, processorTag, config, "indexed_chars_field");
-            @UpdateForV9(owner = UpdateForV9.Owner.DATA_MANAGEMENT)
-            // update the [remove_binary] default to be 'true' assuming enough time has passed. Deprecated in September 2022.
+            @UpdateForV10(owner = UpdateForV10.Owner.DATA_MANAGEMENT)
+            // Revisit whether we want to update the [remove_binary] default to be 'true' - would need to find a way to do this safely
             Boolean removeBinary = readOptionalBooleanProperty(TYPE, processorTag, config, "remove_binary");
             if (removeBinary == null) {
                 DEPRECATION_LOGGER.warn(
