RecursiveUrlLoader document metadata contains the content type of the response. #20597

coolbeevip · 2024-04-18T10:57:19Z

coolbeevip
Apr 18, 2024

Checked

I searched existing ideas and did not find a similar one
I added a very descriptive title
I've clearly described the feature request and motivation for it

Feature request

Many web resources exist, such as images, css, js, fonts, etc. All of these are included in the response header file. If the document's metadata contains response header data, it will be more beneficial for developers to choose which documents to use.

Motivation

I want to ignore documents with content-type = text/css...

Proposal (If applicable)

Include response headers or content-type in document metadata.

metadata = {
"source": URL,
"content-type": "text/html; charset=utf-8"
}

coolbeevip · 2024-04-20T16:24:07Z

coolbeevip
Apr 20, 2024
Author

Only small changes are needed

point 1

langchain/libs/community/langchain_community/document_loaders/recursive_url_loader.py

Line 29 in c909ae0

def _metadata_extractor(raw_html: str, url: str) -> dict:

def _metadata_extractor(raw_html: str, url: str, content_type: str) -> dict:
    """Extract metadata from raw html using BeautifulSoup."""
    metadata = {"source": url, "content_type": content_type}

point 2

langchain/libs/community/langchain_community/document_loaders/recursive_url_loader.py

Line 187 in c909ae0

metadata=self.metadata_extractor(response.text, url),

metadata=self.metadata_extractor(response.text, url, response.headers["Content-Type"]),

point 3

langchain/libs/community/langchain_community/document_loaders/recursive_url_loader.py

Line 273 in c909ae0

metadata=self.metadata_extractor(text, url),

metadata=self.metadata_extractor(text, url, response.headers["Content-Type"]),

0 replies

coolbeevip · 2024-04-26T00:55:57Z

coolbeevip
Apr 26, 2024
Author

related #20875

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

RecursiveUrlLoader document metadata contains the content type of the response. #20597

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 2 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

RecursiveUrlLoader document metadata contains the content type of the response. #20597

Uh oh!

Uh oh!

coolbeevip Apr 18, 2024

Checked

Feature request

Motivation

Proposal (If applicable)

Replies: 2 comments

Uh oh!

coolbeevip Apr 20, 2024 Author

point 1

point 2

point 3

Uh oh!

coolbeevip Apr 26, 2024 Author

coolbeevip
Apr 18, 2024

coolbeevip
Apr 20, 2024
Author

coolbeevip
Apr 26, 2024
Author