AsyncHtmlLoader - Metadata to have title of page #8785
Replies: 1 comment 1 reply
-
Answer generated by a 🤖

Hello,

You're correct that the current implementation of AsyncHtmlLoader in LangChain only provides the URL in the metadata. However, you can modify the load method so that it also stores the page title. Here's a simple example:

    import asyncio
    from typing import List

    from bs4 import BeautifulSoup
    from langchain.docstore.document import Document  # import path may differ across LangChain versions

    # Intended as a replacement for AsyncHtmlLoader.load (e.g. in a subclass).
    def load(self) -> List[Document]:
        """Load text from the url(s) in web_path."""
        results = asyncio.run(self.fetch_all(self.web_paths))
        docs = []
        for i, text in enumerate(results):
            # Parse the HTML and read the <title> tag, if there is one.
            soup = BeautifulSoup(text, 'html.parser')
            title = soup.title.string if soup.title else None
            metadata = {"source": self.web_paths[i], "title": title}
            docs.append(Document(page_content=text, metadata=metadata))
        return docs

This code uses BeautifulSoup to parse the HTML content and extract the title, which is then stored in the metadata of the Document object along with the URL. Please note that this is a simple example and may not work for all web pages, as the title might not always be present or correctly formatted.

I hope this helps! If you have any other questions or need further clarification, feel free to ask.
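Since the title may be missing or padded with stray whitespace, here is a minimal sketch of a more defensive title extractor. It swaps BeautifulSoup for the standard library's html.parser to avoid an extra dependency; the names `_TitleParser` and `extract_title` are hypothetical and not part of LangChain.

```python
from html.parser import HTMLParser
from typing import List, Optional


class _TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._in_title = False
        self._done = False
        self._parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones are ignored.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self) -> Optional[str]:
        # Collapse runs of whitespace; treat an empty title as missing.
        text = " ".join("".join(self._parts).split())
        return text or None


def extract_title(html: str) -> Optional[str]:
    """Return the whitespace-normalized page title, or None if there is none."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

Assuming the loader's documents keep the raw HTML in page_content, as in the example above, this could also be applied after loading without changing the loader, e.g. doc.metadata["title"] = extract_title(doc.page_content) for each returned doc.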
This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
-
AFAIK the metadata only contains the URL of the page. Is there any way to retrieve the title of the page? When displaying source references, just the URL is insufficient; a title would be great.