AzureAIDocumentIntelligenceLoader not able to identify and extract Markdown hyperlinks #24980
Unanswered
DiazBejaranoD
asked this question in
Q&A
Replies: 1 comment
Hey there, @DiazBejaranoD! 👋 I'm here to help you with bugs, questions, and becoming a contributor. Let's squash those bugs together! For extracting hyperlinks, you might consider using alternative methods within the LangChain framework. Here are two recommended approaches:
These methods provide alternative ways to extract hyperlinks from PDF documents within the LangChain framework [1][2].
Example Code
Description
When the document at file_path contains hyperlinks, the Markdown produced by AzureAIDocumentIntelligenceLoader does not include them; only the link text survives.
If I parse the same file with pymupdf4llm, the hyperlinks are extracted as Markdown links more or less correctly.
Test file: test_doc_pdf.pdf
Parsing code (standard):
Result:
If I use pymupdf4llm, I get:
Is there any configuration or add-on that I am not aware of for this extraction to happen? Otherwise, can this be considered a bug?
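For anyone reproducing this, one way to compare the two loaders is to scan each Markdown output for inline links. The following is only a minimal sketch: the helper name `find_markdown_links` and both sample strings are hypothetical and are not taken from the original post or from either library.

```python
import re

# Matches inline Markdown links such as [label](https://example.com).
# The negative lookbehind skips images, which start with "!".
MD_LINK = re.compile(r"(?<!!)\[([^\]]+)\]\((https?://[^\s)]+)\)")


def find_markdown_links(markdown: str) -> list[tuple[str, str]]:
    """Return (label, url) pairs for every inline link in the text."""
    return MD_LINK.findall(markdown)


# Hypothetical outputs, for illustration only (not from the original post).
without_links = "Visit the LangChain docs for details."
with_links = "Visit the [LangChain docs](https://python.langchain.com) for details."

print(find_markdown_links(without_links))  # []
print(find_markdown_links(with_links))  # [('LangChain docs', 'https://python.langchain.com')]
```

Running the helper over the `page_content` returned by each loader would show whether any link targets made it into the Markdown, which makes the difference easy to report.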
System Info
langchain==0.2.6
langchain-community==0.2.6
langchain-core==0.2.10
langchain-google-community==1.0.6
langchain-openai==0.1.13
langchain-text-splitters==0.2.2
langchain-weaviate==0.0.2