
Commit 3a8b208

Update UpdateForV9 in AttachmentProcessor (elastic#118186) (elastic#118281)
Improve the docs around `remove_binary` in `attachment`

Since we are living with this for a while, it seems worth improving the documentation.

This now encourages explicitly setting the option one way or the other, since you get a warning if you omit it. It also changes the existing examples to use true rather than false, as that's our recommendation. And it adds a new section with an example where it's false, and moves the content previously in a note into that section.

(cherry picked from commit bc25a73)

# Conflicts:
# modules/ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/AttachmentProcessor.java
1 parent 88a724a commit 3a8b208
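
The warn-on-omission behaviour that the commit message describes can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in `AttachmentProcessor`: the class `RemoveBinaryOption`, the method `resolveRemoveBinary`, and the warning text are hypothetical. Only the option name `remove_binary` and its default of `false` are taken from the options table changed in this commit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RemoveBinaryOption {

    /**
     * Hypothetical helper: resolves `remove_binary` from a processor config map.
     * If the option is omitted, it records a warning and falls back to the
     * documented default of false. The real processor may report warnings differently.
     */
    static boolean resolveRemoveBinary(Map<String, Object> config, List<String> warnings) {
        Object raw = config.get("remove_binary");
        if (raw == null) {
            warnings.add("[remove_binary] is not set, defaulting to false; set it explicitly to true or false");
            return false;
        }
        if (raw instanceof Boolean) {
            return (Boolean) raw;
        }
        throw new IllegalArgumentException("[remove_binary] must be a boolean but was [" + raw + "]");
    }

    public static void main(String[] args) {
        List<String> warnings = new ArrayList<>();

        // Option omitted: the default applies and one warning is recorded.
        Map<String, Object> omitted = new HashMap<>();
        omitted.put("field", "data");
        System.out.println("omitted  -> " + resolveRemoveBinary(omitted, warnings) + ", warnings=" + warnings.size());

        // Option set explicitly: no further warning is recorded.
        Map<String, Object> explicit = new HashMap<>(omitted);
        explicit.put("remove_binary", true);
        System.out.println("explicit -> " + resolveRemoveBinary(explicit, warnings) + ", warnings=" + warnings.size());
    }
}
```

Setting the option either way avoids the warning, which is why the docs now list it as "encouraged" rather than simply optional.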

2 files changed, +70 -25 lines changed


docs/reference/ingest/processors/attachment.asciidoc

Lines changed: 68 additions & 23 deletions
@@ -19,15 +19,15 @@ representation. The processor will skip the base64 decoding then.
 .Attachment options
 [options="header"]
 |======
-| Name | Required | Default | Description
-| `field` | yes | - | The field to get the base64 encoded field from
-| `target_field` | no | attachment | The field that will hold the attachment information
-| `indexed_chars` | no | 100000 | The number of chars being used for extraction to prevent huge fields. Use `-1` for no limit.
-| `indexed_chars_field` | no | `null` | Field name from which you can overwrite the number of chars being used for extraction. See `indexed_chars`.
-| `properties` | no | all properties | Array of properties to select to be stored. Can be `content`, `title`, `name`, `author`, `keywords`, `date`, `content_type`, `content_length`, `language`
-| `ignore_missing` | no | `false` | If `true` and `field` does not exist, the processor quietly exits without modifying the document
-| `remove_binary` | no | `false` | If `true`, the binary `field` will be removed from the document
-| `resource_name` | no | | Field containing the name of the resource to decode. If specified, the processor passes this resource name to the underlying Tika library to enable https://tika.apache.org/1.24.1/detection.html#Resource_Name_Based_Detection[Resource Name Based Detection].
+| Name | Required | Default | Description
+| `field` | yes | - | The field to get the base64 encoded field from
+| `target_field` | no | attachment | The field that will hold the attachment information
+| `indexed_chars` | no | 100000 | The number of chars being used for extraction to prevent huge fields. Use `-1` for no limit.
+| `indexed_chars_field` | no | `null` | Field name from which you can overwrite the number of chars being used for extraction. See `indexed_chars`.
+| `properties` | no | all properties | Array of properties to select to be stored. Can be `content`, `title`, `name`, `author`, `keywords`, `date`, `content_type`, `content_length`, `language`
+| `ignore_missing` | no | `false` | If `true` and `field` does not exist, the processor quietly exits without modifying the document
+| `remove_binary` | encouraged | `false` | If `true`, the binary `field` will be removed from the document. This option is not required, but setting it explicitly is encouraged, and omitting it will result in a warning.
+| `resource_name` | no | | Field containing the name of the resource to decode. If specified, the processor passes this resource name to the underlying Tika library to enable https://tika.apache.org/1.24.1/detection.html#Resource_Name_Based_Detection[Resource Name Based Detection].
 |======

 [discrete]
@@ -58,7 +58,7 @@ PUT _ingest/pipeline/attachment
 {
 "attachment" : {
 "field" : "data",
-"remove_binary": false
+"remove_binary": true
 }
 }
 ]
@@ -82,7 +82,6 @@ The document's `attachment` object contains extracted properties for the file:
 "_seq_no": 22,
 "_primary_term": 1,
 "_source": {
-"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
 "attachment": {
 "content_type": "application/rtf",
 "language": "ro",
@@ -94,9 +93,6 @@ The document's `attachment` object contains extracted properties for the file:
 ----
 // TESTRESPONSE[s/"_seq_no": \d+/"_seq_no" : $body._seq_no/ s/"_primary_term" : 1/"_primary_term" : $body._primary_term/]

-NOTE: Keeping the binary as a field within the document might consume a lot of resources. It is highly recommended
-to remove that field from the document. Set `remove_binary` to `true` to automatically remove the field.
-
 [[attachment-fields]]
 ==== Exported fields

@@ -143,7 +139,7 @@ PUT _ingest/pipeline/attachment
 "attachment" : {
 "field" : "data",
 "properties": [ "content", "title" ],
-"remove_binary": false
+"remove_binary": true
 }
 }
 ]
@@ -154,6 +150,59 @@ NOTE: Extracting contents from binary data is a resource intensive operation and
 consumes a lot of resources. It is highly recommended to run pipelines
 using this processor in a dedicated ingest node.

+[[attachment-keep-binary]]
+==== Keeping the attachment binary
+
+Keeping the binary as a field within the document might consume a lot of resources. It is highly recommended to remove
+that field from the document, by setting `remove_binary` to `true` to automatically remove the field, as in the other
+examples shown on this page. If you _do_ want to keep the binary field, explicitly set `remove_binary` to `false` to
+avoid the warning you get from omitting it:
+
+[source,console]
+----
+PUT _ingest/pipeline/attachment
+{
+"description" : "Extract attachment information including original binary",
+"processors" : [
+{
+"attachment" : {
+"field" : "data",
+"remove_binary": false
+}
+}
+]
+}
+PUT my-index-000001/_doc/my_id?pipeline=attachment
+{
+"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
+}
+GET my-index-000001/_doc/my_id
+----
+
+The document's `_source` object includes the original binary field:
+
+[source,console-result]
+----
+{
+"found": true,
+"_index": "my-index-000001",
+"_id": "my_id",
+"_version": 1,
+"_seq_no": 22,
+"_primary_term": 1,
+"_source": {
+"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
+"attachment": {
+"content_type": "application/rtf",
+"language": "ro",
+"content": "Lorem ipsum dolor sit amet",
+"content_length": 28
+}
+}
+}
+----
+// TESTRESPONSE[s/"_seq_no": \d+/"_seq_no" : $body._seq_no/ s/"_primary_term" : 1/"_primary_term" : $body._primary_term/]
+
 [[attachment-cbor]]
 ==== Use the attachment processor with CBOR

@@ -170,7 +219,7 @@ PUT _ingest/pipeline/cbor-attachment
 {
 "attachment" : {
 "field" : "data",
-"remove_binary": false
+"remove_binary": true
 }
 }
 ]
@@ -226,7 +275,7 @@ PUT _ingest/pipeline/attachment
 "field" : "data",
 "indexed_chars" : 11,
 "indexed_chars_field" : "max_size",
-"remove_binary": false
+"remove_binary": true
 }
 }
 ]
@@ -250,7 +299,6 @@ Returns this:
 "_seq_no": 35,
 "_primary_term": 1,
 "_source": {
-"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
 "attachment": {
 "content_type": "application/rtf",
 "language": "is",
@@ -274,7 +322,7 @@ PUT _ingest/pipeline/attachment
 "field" : "data",
 "indexed_chars" : 11,
 "indexed_chars_field" : "max_size",
-"remove_binary": false
+"remove_binary": true
 }
 }
 ]
@@ -299,7 +347,6 @@ Returns this:
 "_seq_no": 40,
 "_primary_term": 1,
 "_source": {
-"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
 "max_size": 5,
 "attachment": {
 "content_type": "application/rtf",
@@ -358,7 +405,7 @@ PUT _ingest/pipeline/attachment
 "attachment": {
 "target_field": "_ingest._value.attachment",
 "field": "_ingest._value.data",
-"remove_binary": false
+"remove_binary": true
 }
 }
 }
@@ -396,7 +443,6 @@ Returns this:
 "attachments" : [
 {
 "filename" : "ipsum.txt",
-"data" : "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo=",
 "attachment" : {
 "content_type" : "text/plain; charset=ISO-8859-1",
 "language" : "en",
@@ -406,7 +452,6 @@ Returns this:
 },
 {
 "filename" : "test.txt",
-"data" : "VGhpcyBpcyBhIHRlc3QK",
 "attachment" : {
 "content_type" : "text/plain; charset=ISO-8859-1",
 "language" : "en",

modules/ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/AttachmentProcessor.java

Lines changed: 2 additions & 2 deletions
@@ -196,7 +196,7 @@ public IngestDocument execute(IngestDocument ingestDocument) {
 * @param property property to add
 * @param value value to add
 */
-private <T> void addAdditionalField(Map<String, Object> additionalFields, Property property, String value) {
+private void addAdditionalField(Map<String, Object> additionalFields, Property property, String value) {
 if (properties.contains(property) && Strings.hasLength(value)) {
 additionalFields.put(property.toLowerCase(), value);
 }
@@ -233,7 +233,7 @@ public AttachmentProcessor create(
 String processorTag,
 String description,
 Map<String, Object> config
-) throws Exception {
+) {
 String field = readStringProperty(TYPE, processorTag, config, "field");
 String resourceName = readOptionalStringProperty(TYPE, processorTag, config, "resource_name");
 String targetField = readStringProperty(TYPE, processorTag, config, "target_field", "attachment");
