AzureAIDocumentIntelligenceLoader not able to identify and extract Markdown hyperlinks #24980

DiazBejaranoD · 2024-08-02T15:30:37Z

Checked other resources

I added a very descriptive title to this question.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.

Commit to Help

I commit to help with one of those options 👆

Example Code

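from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader

# The endpoint, key, file_path and azure_mode values are defined elsewhere (see Description).
loader_azure = AzureAIDocumentIntelligenceLoader(
    api_endpoint=AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT,
    api_key=AZURE_DOCUMENT_INTELLIGENCE_API_KEY,
    file_path=file_path,
    api_model="prebuilt-layout",
    mode=azure_mode,
)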

Description

When the PDF at file_path contains hyperlinks, the Markdown produced by AzureAIDocumentIntelligenceLoader does not contain them: only the anchor text survives and the URL is lost.
If I parse the same file with pymupdf4llm, the hyperlink is extracted as a Markdown link more or less correctly.

Test file: test_doc_pdf.pdf

Parsing code (standard):

azure_mode = "markdown"  # ask the loader for Markdown output

loader_azure = AzureAIDocumentIntelligenceLoader(
    api_endpoint=AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT,
    api_key=AZURE_DOCUMENT_INTELLIGENCE_API_KEY,
    file_path=file_path,
    api_model="prebuilt-layout",
    mode=azure_mode,
)

# load() returns a list of Document objects; the Markdown is in page_content.
documents_azure = loader_azure.load()

Result:

1\. This is a header
===

This is a sample file pointing to GitHub website

This is a paragraph

If I use pymupdf4llm, I get:

# 1. This is a header

[This is a sample file pointing to GitHub website](https://github.com/)

This is a paragraph


-----

Is there any configuration or add-on that I am not aware of for this extraction to happen? Otherwise, can this be considered a bug?
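For anyone reproducing this, a quick way to confirm whether a loader's output contains any hyperlinks is to scan the returned text for inline Markdown links. This is a minimal sketch using only the standard library; `find_markdown_links` is a hypothetical helper name (not part of LangChain), and the two sample strings are the outputs shown above.

```python
import re

# Matches inline Markdown links of the form [text](url).
# The negative lookbehind skips images, which look like ![alt](src).
MD_LINK = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")


def find_markdown_links(markdown_text):
    """Return (anchor_text, url) pairs for every inline Markdown link."""
    return MD_LINK.findall(markdown_text)


# The two outputs reported in this post, copied as plain strings.
azure_output = (
    "1\\. This is a header\n===\n\n"
    "This is a sample file pointing to GitHub website\n\n"
    "This is a paragraph\n"
)
pymupdf4llm_output = (
    "# 1. This is a header\n\n"
    "[This is a sample file pointing to GitHub website](https://github.com/)\n\n"
    "This is a paragraph\n"
)

print(find_markdown_links(azure_output))        # no links found
print(find_markdown_links(pymupdf4llm_output))  # one (text, url) pair
```

With a real run, the same check could be applied to documents_azure[0].page_content instead of the hard-coded strings.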

System Info

langchain==0.2.6
langchain-community==0.2.6
langchain-core==0.2.10
langchain-google-community==1.0.6
langchain-openai==0.1.13
langchain-text-splitters==0.2.2
langchain-weaviate==0.0.2

@DiazBejaranoD · 2024-08-02T15:36:31Z

dosubot[bot]
bot Aug 2, 2024

Hey there, @DiazBejaranoD! 👋 I'm here to help you with bugs, questions, and becoming a contributor. Let's squash those bugs together!

The AzureAIDocumentIntelligenceLoader currently does not have specific parameters or configurations for handling hyperlinks in PDF documents. This might explain why the hyperlinks are not being extracted in the Markdown output.

For extracting hyperlinks, you might consider using alternative methods within the LangChain framework. Here are two recommended approaches:

PDFMiner:

from langchain_community.document_loaders import PDFMinerPDFasHTMLLoader
from bs4 import BeautifulSoup

file_path = "path/to/your/pdf.pdf"

# Render the whole PDF as a single HTML document.
loader = PDFMinerPDFasHTMLLoader(file_path)
data = loader.load()[0]

# Collect the href attribute of every <a> tag in the generated HTML.
soup = BeautifulSoup(data.page_content, "html.parser")
links = soup.find_all("a", href=True)
hyperlinks = [link["href"] for link in links]
print(hyperlinks)

Unstructured:

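from langchain_community.document_loaders import UnstructuredPDFLoader

file_path = "path/to/your/pdf.pdf"

# mode="elements" returns one Document per detected element, each with its own metadata.
loader = UnstructuredPDFLoader(file_path, mode="elements")
data = loader.load()

hyperlinks = []
for element in data:
    # Assumption: when Unstructured detects links, it stores them under the
    # "links" metadata key as a list of dicts that include a "url" field.
    for link in element.metadata.get("links") or []:
        if isinstance(link, dict) and link.get("url"):
            hyperlinks.append(link["url"])
print(hyperlinks)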

These methods provide alternative ways to extract hyperlinks from PDF documents within the LangChain framework ^[1]^[2].
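If the goal is to keep the Azure Markdown and only recover the links, one possible workaround is to extract (anchor text, URL) pairs with another tool and splice them back into the Azure output. The sketch below is an illustration under assumptions, not a LangChain feature: `inject_links` is a hypothetical helper, and it assumes each anchor text appears verbatim in the Azure Markdown, which may not hold if the two extractors split or normalize text differently.

```python
def inject_links(markdown_text, links):
    """Wrap the first verbatim occurrence of each anchor text in a Markdown link.

    links: iterable of (anchor_text, url) pairs obtained from another extractor.
    Returns the patched text and the pairs whose anchor text was not found.
    """
    missing = []
    for anchor_text, url in links:
        if not anchor_text or anchor_text not in markdown_text:
            # Nothing to wrap: report the pair so it can be handled manually.
            missing.append((anchor_text, url))
            continue
        linked = f"[{anchor_text}]({url})"
        # Replace only the first occurrence to avoid linking repeated phrases.
        markdown_text = markdown_text.replace(anchor_text, linked, 1)
    return markdown_text, missing


# Example with the Azure output from the original post.
azure_md = (
    "1\\. This is a header\n===\n\n"
    "This is a sample file pointing to GitHub website\n\n"
    "This is a paragraph\n"
)
links = [("This is a sample file pointing to GitHub website", "https://github.com/")]

patched, missing = inject_links(azure_md, links)
print(patched)
print(missing)
```

Replacing only the first occurrence is a deliberate simplification; a document that repeats the same anchor text would need position information from the extractor to place each link correctly.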

