AsyncHtmlLoader - Metadata to have title of page #8785

srinisubramanian · 2023-08-05T04:59:16Z

AFAIK the metadata only has the URL of the page. Is there any way to retrieve the title of the page? When displaying source references, just the URL is insufficient; a title would be great.

srinisubramanian · 2023-08-05T13:57:05Z

dosubot[bot]
bot Aug 5, 2023

Answer generated by a 🤖

Answer

Hello,

You're correct that the current implementation of AsyncHtmlLoader in LangChain only provides the URL in the metadata. However, you can modify the load method to include the title in the metadata by parsing the HTML content. LangChain also ships a BS4HTMLParser, which relies on the BeautifulSoup4 library to parse HTML, so BeautifulSoup4 is used below as well.

Here's a simple example of how you could modify the load method:

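import asyncio
from typing import List

from bs4 import BeautifulSoup
from langchain.docstore.document import Document

# Intended as a replacement for AsyncHtmlLoader.load, e.g. in a subclass.
def load(self) -> List[Document]:
    """Load text from the url(s) in web_path."""

    # Fetch the raw HTML for every URL.
    results = asyncio.run(self.fetch_all(self.web_paths))
    docs = []
    for i, text in enumerate(results):
        soup = BeautifulSoup(text, 'html.parser')
        # soup.title is None when the page has no <title> tag.
        title = soup.title.string if soup.title else None
        metadata = {"source": self.web_paths[i], "title": title}
        docs.append(Document(page_content=text, metadata=metadata))

    return docs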

This code uses BeautifulSoup to parse the HTML content and extract the title, which is then stored in the metadata of the Document object along with the URL. Please note that this is a simple example and may not work for all web pages, as the title might not always be present or correctly formatted.
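
If you would rather not modify the loader itself, the same idea can be applied after loading. The following is only a rough sketch, not LangChain's own API: `extract_title` and `add_titles` are hypothetical helper names, and the sketch swaps BeautifulSoup for the standard library's `html.parser` so that it has no extra dependencies. It assumes each document exposes `page_content` (the raw HTML) and a `metadata` dict, as the Document objects above do.

```python
from html.parser import HTMLParser
from typing import Optional


class _TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is considered.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self) -> Optional[str]:
        # Collapse runs of whitespace; an empty title counts as missing.
        text = " ".join("".join(self._parts).split())
        return text or None


def extract_title(html: str) -> Optional[str]:
    """Return the page title, or None if the page has no usable <title>."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


def add_titles(docs):
    """Add a "title" key to each document's metadata (hypothetical helper)."""
    for doc in docs:
        doc.metadata["title"] = extract_title(doc.page_content)
    return docs
```

Assuming the loader is constructed from a list of URLs, usage would look roughly like `docs = add_titles(AsyncHtmlLoader(urls).load())`, after which `doc.metadata["title"]` is available next to `doc.metadata["source"]` when displaying source references.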

I hope this helps! If you have any other questions or need further clarification, feel free to ask.

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.

1 reply

srinisubramanian Aug 5, 2023
Author

Will try this for sure. But this might be something good to include in standard
