@andrewstucki

This PR adds a processor that classifies MIME types based on magic bytes. It's essentially a port of the Beats processor. Additionally, to better support binary-type fields at ingest time, it has a flag indicating that the data is base64 encoded.
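As a rough illustration of what magic-byte classification with a base64 flag involves, here is a minimal Python sketch. It is not the PR's Java code and not how Tika works internally: the function name `detect_mime_type`, the `MAGIC_SIGNATURES` table, and its handful of entries are all hypothetical, chosen only to show the idea.

```python
import base64

# Hypothetical, deliberately tiny signature table (an assumption for this
# sketch); a real detector such as Apache Tika uses a much larger database.
MAGIC_SIGNATURES = [
    (b"\xcf\xfa\xed\xfe", "application/x-mach-binary"),  # 64-bit Mach-O, little-endian
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"GIF8", "image/gif"),
]


def detect_mime_type(value, base64_encoded=False):
    """Return a MIME type guessed from the leading bytes, or None if unknown."""
    # Decode first when the field carries base64 text instead of raw bytes.
    data = base64.b64decode(value) if base64_encoded else value
    if isinstance(data, str):
        data = data.encode("utf-8")
    # Compare the start of the payload against each known signature.
    for magic, mime in MAGIC_SIGNATURES:
        if data.startswith(magic):
            return mime
    return None
```

In this sketch, passing `base64_encoded=True` plays the same role as the flag described above: the value is decoded before its leading bytes are inspected.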

Here's an example of how it works:

```
➜  ~ file /bin/ls
/bin/ls: Mach-O 64-bit executable x86_64
➜  ~ curl --silent -H "Content-Type: application/json" -X POST -u elastic:password http://localhost:9200/_ingest/pipeline/_simulate\?verbose --data-binary @- << EOF | jq '.docs[0].processor_results[0].doc._source.mime'
{
  "pipeline": {
    "processors": [
      {
        "detect_mime_type": {
          "field": "data",
          "target_field": "mime",
          "base64_encoded": true
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "data": "$(base64 /bin/ls)"
      }
    }
  ]
}
EOF
"application/x-mach-binary"
```

To handle a wide variety of MIME types, I use Apache Tika (which I noticed is already used in the ingest-attachment plugin). Tika can't identify things like JSON, or XML without its declaration, both of which are pretty common in APIs, so I added some fallback parsing for fields identified as text/plain. To avoid allocations for what is essentially just JSON/XML syntax validation, I use streaming parsers and simply check for exceptions while iterating over the document. This is the same behavior as the Beats edge processor.
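The fallback step described above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the PR's Java implementation: the function name `refine_plain_text` and the returned type names are hypothetical. For XML it uses expat, a streaming parser, with no handlers registered, so nothing is built and only syntax errors matter. Python's standard library has no streaming JSON tokenizer, so `json.loads` stands in for the streaming JSON parser here, which means this sketch does allocate where the PR's approach would not.

```python
import json
import xml.parsers.expat


def refine_plain_text(data: bytes) -> str:
    """Refine a text/plain guess by checking whether the bytes parse as JSON or XML."""
    first = data.lstrip()[:1]
    if first in (b"{", b"["):
        try:
            # Stand-in for a streaming JSON parser: only success or failure is used.
            json.loads(data)
            return "application/json"
        except ValueError:
            pass  # Not valid JSON (or not decodable); fall through.
    if first == b"<":
        # No handlers are registered, so expat just walks the document.
        parser = xml.parsers.expat.ParserCreate()
        try:
            parser.Parse(data, True)
            return "text/xml"
        except xml.parsers.expat.ExpatError:
            pass  # Not well-formed XML; fall through.
    return "text/plain"
```

For example, `refine_plain_text(b'{"a": [1, 2]}')` yields `application/json`, while a truncated document such as `b'{"a": '` stays `text/plain`.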

@andrewstucki added the >feature, :Data Management/Ingest Node, v8.0.0, and Team:Data Management labels Jan 26, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

@danhermann
Contributor

@andrewstucki, a question and a couple of comments on this one. Is this processor intended for a particular solutions use case? Apart from that, it's relatively uncommon for binary blobs to be attached to documents. Also, the Tika library has a fairly large API surface, so we wouldn't want to add it as a dependency of the x-pack/ingest plugin, which is included in the default Elasticsearch distribution (unlike ingest-attachment, which is not included by default). There are ways we could address that, but we'd need to discuss them.

@andrewstucki
Author

So, we currently use Beats (edge) processing to fill in MIME data for HTTP request/response payloads, and we are generally trying to move more of that processing to the ingest node side of things.

While we don't use MIME detection in a ton of places currently (I believe Heartbeat and some Filebeat modules are the main users), I also recently opened an ECS PR that adds binary fields for file content, to support use cases such as sending the file samples that triggered security alerts. This processor would be a natural way to provide MIME classification for the file field set in that case.

However, since the main MIME classification for all of our applications happens in Beats, adoption of this processor isn't blocking anything. I'm more than happy to take this one slowly and even wait until we have more modules that can directly leverage it. I mainly opened this PR now to kick off a bit of the discussion.

@mark-vieira
Contributor

@elasticmachine update branch

@danhermann
Contributor

@andrewstucki, can you move this processor to the ingest-attachment module, which already depends on the Tika library, so that we can keep that dependency isolated from the other ingest processors?

@danhermann
Contributor

@andrewstucki, this processor will need to be moved as mentioned in the comment here: #67961 (comment)

@arteam added v8.1.0 and removed v8.0.0 labels Jan 12, 2022
@mark-vieira added v8.2.0 and removed v8.1.0 labels Feb 2, 2022
@danhermann removed their request for review March 15, 2022 16:29
@elasticsearchmachine changed the base branch from master to main July 22, 2022 23:12
@mark-vieira added v8.5.0 and removed v8.4.0 labels Jul 27, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@dakrone
Member

dakrone commented May 9, 2025

@andrewstucki is this still applicable, or should it be closed due to being older?

@andrewkroh closed this May 9, 2025

Labels

:Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP), >feature, Team:Data Management (Meta label for data/management team), v9.1.0
