RecursiveUrlLoader document metadata contains the content type of the response. #20597
Closed
coolbeevip
announced in
Ideas
Replies: 2 comments
-
Only small changes are needed.

Point 1:

    def _metadata_extractor(raw_html: str, url: str, content_type: str) -> dict:
        """Extract metadata from raw html using BeautifulSoup."""
        metadata = {"source": url, "content_type": content_type}

Point 2:

    metadata=self.metadata_extractor(response.text, url, response.headers["Content-Type"]),

Point 3:

    metadata=self.metadata_extractor(text, url, response.headers["Content-Type"]),
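For illustration, the three points can be put together in a small standalone sketch. This is not the library's actual implementation. The names `metadata_extractor_sketch` and `content_type_of` are hypothetical, and a regex stands in for BeautifulSoup so the example has no dependencies. It also assumes the header may be missing, since indexing `headers["Content-Type"]` would raise a `KeyError` on a response without that header.

```python
import re
from typing import Mapping, Optional


def content_type_of(headers: Mapping[str, str]) -> Optional[str]:
    """Read Content-Type defensively; returns None when the header is absent."""
    return headers.get("Content-Type")


def metadata_extractor_sketch(
    raw_html: str, url: str, content_type: Optional[str]
) -> dict:
    """Build document metadata that also records the response content type."""
    metadata = {"source": url, "content_type": content_type}
    # Assumption: a regex replaces BeautifulSoup here to keep the sketch
    # dependency-free; it only picks up the page title.
    match = re.search(
        r"<title[^>]*>(.*?)</title>", raw_html, re.IGNORECASE | re.DOTALL
    )
    if match:
        metadata["title"] = match.group(1).strip()
    return metadata


if __name__ == "__main__":
    headers = {"Content-Type": "text/html; charset=utf-8"}
    html = "<html><head><title>Example</title></head></html>"
    print(metadata_extractor_sketch(html, "https://example.com", content_type_of(headers)))
```

Passing the content type as an argument keeps the extractor independent of the HTTP client, so the same function could serve both the synchronous and asynchronous call sites in points 2 and 3.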
-
Related: #20875
-
Feature request
A crawl returns many kinds of web resources, such as images, CSS, JS, and fonts. The type of each resource is reported in the response headers. If the document's metadata contained the response header data, developers could more easily choose which documents to use.
Motivation
I want to ignore documents with content-type = text/css...
Proposal (If applicable)
Include response headers or content-type in document metadata.
metadata = {
    "source": URL,
    "content-type": "text/html; charset=utf-8"
}
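With metadata shaped like the proposal above, the filtering described under Motivation could look roughly like the sketch below. The helper names `media_type` and `drop_by_media_type` are hypothetical, and plain dicts stand in for document metadata so the example runs without any library installed. The key defaults to `"content-type"` as written in the proposal. Entries without that key are kept, which is an assumption about the desired behaviour.

```python
from typing import Iterable, List, Mapping


def media_type(content_type: str) -> str:
    """Strip parameters such as '; charset=utf-8' and normalise case."""
    return content_type.split(";", 1)[0].strip().lower()


def drop_by_media_type(
    metadatas: Iterable[Mapping[str, str]],
    excluded: Iterable[str] = ("text/css",),
    key: str = "content-type",
) -> List[Mapping[str, str]]:
    """Keep only entries whose media type is not in `excluded`."""
    excluded_set = {e.lower() for e in excluded}
    kept = []
    for md in metadatas:
        value = md.get(key)
        # Assumption: entries with no recorded content type are kept.
        if value is not None and media_type(value) in excluded_set:
            continue
        kept.append(md)
    return kept


if __name__ == "__main__":
    docs = [
        {"source": "https://example.com/", "content-type": "text/html; charset=utf-8"},
        {"source": "https://example.com/site.css", "content-type": "text/css"},
    ]
    for md in drop_by_media_type(docs):
        print(md["source"])
```

Comparing only the media type, rather than the full header value, means a stylesheet served as `text/css; charset=utf-8` is filtered the same way as plain `text/css`.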