FileTypeRouter Improvement #8439

jlonge4 · 2024-10-05T20:04:24Z

I have noticed that in a Linux environment, for instance an AWS Lambda (python3.12), the FileTypeRouter routes .docx and .pptx files (and other Microsoft Office formats) to unclassified unless you first run:

import mimetypes
# Add .docx MIME type
mimetypes.add_type("application/vnd.openxmlformats-officedocument.wordprocessingml.document", ".docx")

How could we make sure that the MIME types specified at init time are also registered with mimetypes at init time, so that files with legitimate MIME types no longer end up as unclassified?

Any ideas @anakin87 ?

anakin87 · 2024-10-07T05:42:47Z

anakin87
Oct 7, 2024
Maintainer

👋

On Ubuntu 22.04 with Python 3.8, this works:

from haystack.components.routers import FileTypeRouter

file_type_router = FileTypeRouter(mime_types=["text/plain", "application/pdf", "text/markdown", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"])

path = "MYFILE.docx"

print(file_type_router.run([path]))

>>> {'application/vnd.openxmlformats-officedocument.wordprocessingml.document': [PosixPath('MYFILE.docx')]}

Can you try it on your system? Am I missing something?

3 replies

jlonge4 Oct 7, 2024
Author

@anakin87 I would imagine Ubuntu has the mapping because of the LibreOffice packages. Inside a Python Lambda function, however:

import json
import mimetypes

def lambda_handler(event, context):
    # guess_type() expects a filename or URL, so pass a .docx name rather than the MIME type string
    guess = mimetypes.guess_type('MYFILE.docx')
    return {
        'statusCode': 200,
        'body': json.dumps(str(guess))
    }

>>> {"statusCode": 200, "body": "\"(None, None)\""}
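To compare environments, it can help to look at what the standard library's mimetypes module has to work with. The following is a minimal sketch, not part of Haystack: the helper name describe_mimetypes_environment is hypothetical, and it only relies on mimetypes.knownfiles (the system mapping files the module tries to read) and mimetypes.guess_type.

```python
import mimetypes
from pathlib import Path


def describe_mimetypes_environment(filename: str) -> dict:
    """Report what the stdlib mimetypes module resolves for `filename`.

    Hypothetical diagnostic helper, for illustration only.
    """
    # knownfiles lists the system-wide mapping files mimetypes tries to load;
    # on a minimal image, few or none of them may exist.
    present = [f for f in mimetypes.knownfiles if Path(f).is_file()]
    # guess_type() works from the file name's extension, not from file content.
    guessed_type, encoding = mimetypes.guess_type(filename)
    return {
        "filename": filename,
        "guessed_type": guessed_type,
        "encoding": encoding,
        "system_files_found": present,
    }


report = describe_mimetypes_environment("notes.txt")
print(report)
```

Running the same helper with "MYFILE.docx" on both machines should show whether the difference comes from the system mapping files or from Python's built-in defaults.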

anakin87 Oct 7, 2024
Maintainer

Ah...
@vblagoje you worked on this if I remember correctly.
Do you have any ideas/suggestions?

jlonge4 Oct 7, 2024
Author

@anakin87 @vblagoje

Just for good measure, I added a layer and tried your exact code:

{
  "statusCode": 200,
  "body": "\"{'unclassified': [PosixPath('MYFILE.docx')]}\""
}

vblagoje · 2024-10-08T11:41:58Z

vblagoje
Oct 8, 2024
Maintainer

Hey @jlonge4, @anakin87 and I spoke about this, and adding an additional_mimetypes: Dict[str, str] parameter would make the most sense, since there are no guarantees, I suppose, about how default mappings are resolved across runtime environments and Python versions.

Would you mind opening a PR for this, @jlonge4? We'll review and integrate it soon after.
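The proposed parameter could look roughly like the sketch below. This is not Haystack's FileTypeRouter and not the eventual PR: SimpleFileTypeRouter is a hypothetical stand-in, and the direction of the mapping (MIME type to file extension) is an assumption, since the thread only gives the type Dict[str, str].

```python
import mimetypes
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Optional, Union


class SimpleFileTypeRouter:
    """Hypothetical stand-in for illustration, not Haystack's FileTypeRouter."""

    def __init__(
        self,
        mime_types: List[str],
        additional_mimetypes: Optional[Dict[str, str]] = None,
    ):
        self.mime_types = list(mime_types)
        # Assumption: keys are MIME types, values are extensions such as ".docx".
        # Registering them at init time means routing no longer depends on
        # whatever mapping files the host environment happens to provide.
        for mime_type, extension in (additional_mimetypes or {}).items():
            mimetypes.add_type(mime_type, extension)

    def run(self, sources: List[Union[str, Path]]) -> Dict[str, List[Path]]:
        routed: Dict[str, List[Path]] = defaultdict(list)
        for source in sources:
            path = Path(source)
            guessed, _ = mimetypes.guess_type(path.name)
            # Anything not resolved to a requested MIME type is unclassified.
            key = guessed if guessed in self.mime_types else "unclassified"
            routed[key].append(path)
        return dict(routed)


DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
router = SimpleFileTypeRouter(
    mime_types=["text/plain", DOCX],
    additional_mimetypes={DOCX: ".docx"},
)
result = router.run(["MYFILE.docx", "archive.unknownext"])
print(result)
```

Note that mimetypes.add_type changes the process-wide mapping, so in this sketch the registration affects every later mimetypes lookup in the same interpreter, not just this router instance.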

1 reply

jlonge4 Oct 8, 2024
Author

@vblagoje @anakin87 Absolutely, will have it ready shortly!
