Integration of the advertools SEO crawler with LangChain #31384
eliasdabbas
announced in
Ideas
Feature request
I'd like to integrate the advertools crawler into LangChain, so the community can do larger-scale crawling and content extraction with a highly configurable crawler and get a rich representation of crawled websites.
I have already created and published langchain-advertools and implemented the Document class with load and lazy_load.
https://www.youtube.com/watch?v=SpsHZBuLypI
https://github.com/eliasdabbas/langchain-advertools/
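To illustrate the load / lazy_load pattern mentioned above, here is a minimal, standard-library-only sketch. It is not the code in langchain-advertools: the class name CrawlFileLoader, the text_key parameter, and the assumption that the crawl output is a JSON-lines file with a body_text field are all hypothetical, and the small Document dataclass is only a stand-in for LangChain's own Document.

```python
import json
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Document:
    """Stand-in for LangChain's Document: page text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class CrawlFileLoader:
    """Hypothetical loader: one Document per line of a JSON-lines crawl file."""

    def __init__(self, path: str, text_key: str = "body_text"):
        self.path = path
        # Assumed name of the field that holds the extracted page text.
        self.text_key = text_key

    def lazy_load(self) -> Iterator[Document]:
        # Yield documents one at a time so a large crawl need not fit in memory.
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                row = json.loads(line)
                text = row.pop(self.text_key, "") or ""
                # Every remaining field (URL, title, ...) becomes metadata.
                yield Document(page_content=text, metadata=row)

    def load(self) -> List[Document]:
        # Eager variant: materialize everything lazy_load yields.
        return list(self.lazy_load())
```

With this shape, load() suits small crawls, while iterating over lazy_load() lets a pipeline process pages as they are read.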
Motivation
In this integration, recursive crawling and scraping are separate tasks. Specialized tools are well-positioned for detailed and powerful content extraction, which lets developers separate these processes (intellectually, as well as in their code) from whatever workflows they want to integrate them into.
Killer feature 1: Special extractor for getting all the textual content on the web page, while excluding header, footer, nav, script, and many other irrelevant elements from the page. For example, the text of the home page of LangChain.com has been extracted here:
This is the default with no customization required. Of course custom extraction is supported with XPath/CSS selectors.
Otherwise, it's very tedious to figure out all the custom extractors for a page (and to do so for every page template on the website).
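As a rough illustration of the idea behind this kind of extraction (not the actual advertools implementation, which is more thorough), the sketch below collects a page's text while skipping anything nested inside a few boilerplate elements. The SKIP_TAGS set and the names MainTextExtractor and extract_main_text are assumptions made for this example.

```python
from html.parser import HTMLParser

# Elements whose text is treated as boilerplate (an illustrative, partial list).
SKIP_TAGS = {"header", "footer", "nav", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text nodes that are not inside any element in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Return the page text with boilerplate elements left out."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The appeal of a default like this is that it needs no per-template selectors: the same function can run over every crawled page, and XPath/CSS selectors remain available for the cases that need precise extraction.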
metadata attribute.
The community response seems to be positive:
https://www.linkedin.com/posts/eliasdabbas_advertools-activity-7328845355310043136-dzDz
I think the integration should be easy and straightforward.
Proposal (If applicable)
If you think this makes sense, I would love some guidance on how to make the integration, modify approach/code, add documentation, or anything else that might help.
If not, I'd love to know what it would take to do so.
Thanks a lot! :)