Add Ingest Processor for Mime-Type Detection #67961
|
Pinging @elastic/es-core-features (Team:Core/Features) |
|
@andrewstucki, a question and a couple comments on this one. Is this processor intended for a particular solutions use case? Apart from that, it's relatively uncommon for binary blobs to be attached to documents. Also, the Tika library is one with a fairly large API surface so we wouldn't want to add it as a dependency in the |
|
So, we currently use beats (edge) processing to fill in mime data for http request/response payloads and we are generally trying to move a bit more towards processing on the ingest node side of things. While we don't use mime detection in a ton of places currently (I believe Currently however, since the main mime classification for any of our applications happens in beats, the adoption of this processor isn't currently blocking anything. I'm more than happy to take this one slow and even wait until we'd have more modules that can directly leverage it--mainly opened up this PR now to kick off a bit of the discussion. |
|
@elasticmachine update branch |
|
@andrewstucki, can you move this processor to the ingest-attachment module that already has a dependency on the Tika library so that we can keep that isolated from the other ingest processors? |
7cc98af to fb6d6e3
|
@andrewstucki, this processor will need to be moved as mentioned in the comment here: #67961 (comment) |
|
Pinging @elastic/es-data-management (Team:Data Management) |
|
@andrewstucki is this still applicable, or should it be closed due to being older? |
This PR adds support for a processor that classifies mime types based off of magic bytes. It's essentially a port of the beats processor. Additionally, in order to better support binary-type fields at ingest time, it has a flag for indicating that the data is base64 encoded.
Here's an example of how it works:
In order to handle a wide variety of mime types, I leverage Apache Tika (which I noticed was already used in the ingest-attachment plugin). Additionally, where Tika can't identify things like JSON, or XML without the declaration, both of which are pretty common in APIs, I added some fallback parsing of fields identified as text/plain. To avoid allocations in what is essentially just JSON/XML syntax validation, I use streaming parsers and only check for exceptions while iterating over the documents. This is the same behavior as the beats edge processor.
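The flow described above (optional base64 decoding, magic-byte classification, then a streaming well-formedness check as a fallback) can be sketched roughly as follows. This is not the PR's code and does not use Tika: the class name MimeSniffSketch, the method names, and the three-entry signature table are all hypothetical stand-ins for Tika's detector. Only the XML fallback is shown, using the JDK's StAX reader; the JDK ships no streaming JSON parser, so a JSON check would presumably rely on a library parser in the same "iterate and catch exceptions" style.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration only: not the processor from this PR, not Tika's API.
final class MimeSniffSketch {

    // Tiny hand-picked signature table (assumption: the real processor delegates to Tika).
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] PDF = {'%', 'P', 'D', 'F', '-'};
    private static final byte[] GIF = {'G', 'I', 'F', '8'};

    // Classifies a field value; base64Encoded mirrors the flag described in the PR.
    static String detect(String fieldValue, boolean base64Encoded) {
        byte[] data = base64Encoded
            ? Base64.getDecoder().decode(fieldValue)
            : fieldValue.getBytes(StandardCharsets.UTF_8);

        if (startsWith(data, PNG)) return "image/png";
        if (startsWith(data, PDF)) return "application/pdf";
        if (startsWith(data, GIF)) return "image/gif";

        // Nothing matched: treat the payload as text and try the XML fallback.
        return isWellFormedXml(data) ? "application/xml" : "text/plain";
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }

    // Walks the parser events without building a tree; a parse error means "not XML".
    static boolean isWellFormedXml(byte[] data) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Untrusted input: disable DTDs and external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new ByteArrayInputStream(data));
            try {
                while (reader.hasNext()) {
                    reader.next();
                }
                return true;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            return false;
        }
    }
}
```

For instance, MimeSniffSketch.detect("iVBORw0KGgo=", true) decodes to the eight-byte PNG signature and is classified as image/png, while an XML fragment without a declaration only reaches application/xml through the fallback check.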