Skip to content

Add support for MarkItDown#89

Closed
HackyRoot wants to merge 49 commits intomozilla-ai:mainfrom
HackyRoot:markitdown_support
Closed

Add support for MarkItDown#89
HackyRoot wants to merge 49 commits intomozilla-ai:mainfrom
HackyRoot:markitdown_support

Conversation

@HackyRoot
Copy link

What's changing

  • Adds support for MarkItDown package to load and extract text from .pdf and .html files.
  • Adds 'load_filetodata_loaders and 'markdown_to_text to data_cleaners.
  • Adds temporary file to be able to pass the filepath to load_file which uses MarkItDown.
  • Updated tests to refelct the new data_loaders and data_cleaners functions.

If this PR is related to an issue or closes one, please link it here.

Closes #66

How to test it

Steps to test the changes:

  1. Run the Streamlit app:
  python -m streamlit run demo/app.py
  1. Upload a PDF or docx file on the StreamLit webpage.
  2. Check the generated and cleaned text.

Additional notes for reviewers

The changes includes support for the MarkItDown package. The Streamlit app has been updated to use the new load_file and markdown_to_text functions.

I already...

  • Tested the changes in a working environment to ensure they work as expected
  • Added some tests for any new functionality
  • Updated the documentation (both comments in code and under /docs)

@daavoo daavoo requested a review from a team January 14, 2025 09:58
Copy link
Contributor

@Kostis-S-Z Kostis-S-Z left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, @HackyRoot , highly appreciate your effort into this feature, it has great potential to add support for many file formats! I left some comments here and there, take a look and let me know what you think.

Oh also note that we are using Ruff for formatting the files, thats why the Lint action is failing. To resolve this, just do pip install pre-commit and then pre-commit install before commiting any changes

Edit: @daavoo is running some magic stuff to fix it

@daavoo
Copy link
Contributor

daavoo commented Jan 14, 2025

pre-commit.ci run

@Kostis-S-Z
Copy link
Contributor

@daavoo seems the pre-commit CI is not working properly?

@daavoo
Copy link
Contributor

daavoo commented Jan 17, 2025

@daavoo seems the pre-commit CI is not working properly?

It is because the remaining errors can't be auto-fixed. If load_pdf and load_docx are not used in DATA_LOADERS, they should not be imported

@daavoo daavoo force-pushed the markitdown_support branch from 814e263 to 5087aab Compare January 21, 2025 10:30
@HackyRoot
Copy link
Author

@daavoo any idea why do I have to merge the remote branch before pushing?

@daavoo
Copy link
Contributor

daavoo commented Jan 21, 2025

@daavoo any idea why do I have to merge the remote branch before pushing?

The branch in your fork is missing the latest changes from upstream/main. Before your last merge right now, I updated this PR to be up to date with main.

You can undo the last merge you pushed and then just git pull the remote changes

@stefanfrench
Copy link
Contributor

@HackyRoot - As discussed, going to close this PR and instead we will feature your fork as an extension on the Blueprints Hub. As you suggested, we're going to start tagging certain Issues as 'extensions' and will get feedback from users.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Data Cleaning with 'markitdown'

4 participants