Add support for AI/LLM-based HTML parsing (selectors) #1593

@vdusek

At recent events I attended, I was asked about AI/LLM-based HTML parsing. I also found a few dedicated AI-based scraping frameworks, such as ScrapeGraphAI and Parsera, that appear to be gaining traction.

Right now, the only AI-selector workflow we provide is the guide on using PlaywrightCrawler with Stagehand.

This means:

  • AI-based selectors are supported only for Playwright, not for HTTP-based crawlers.
  • Even for PlaywrightCrawler, the integration is not very smooth compared to the tools mentioned above.

Example from ScrapeGraphAI:

from scrapegraphai.graphs import SmartScraperGraph

# graph_config (the LLM provider and model settings) is assumed to be defined earlier

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="Extract useful information from the webpage, including a description of what the company does, founders and social media links",
    source="https://scrapegraphai.com/",
    config=graph_config,
)

# Run the pipeline
result = smart_scraper_graph.run()

It might be worth exploring a more native solution:

  • Better Stagehand integration so that AI-based selectors in Playwright crawlers are as straightforward as in the dedicated AI-scraping libraries.
  • Introduce an AI/LLM-powered crawler built on top of AbstractHttpCrawler, enabling AI/LLM selectors for HTTP-based scraping as well.
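
To make the second idea more concrete, here is a rough sketch of what an LLM-based selector for HTTP-fetched HTML could look like. It is not an existing Crawlee API: `LlmSelector`, `html_to_text`, and `fake_llm` are hypothetical names, and the model call is injected as a plain callable so the example runs without any provider or network access. It uses only the standard library.

```python
import json
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Reduce raw HTML to newline-separated visible text to keep the prompt small."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


@dataclass
class LlmSelector:
    """Hypothetical helper: extracts fields from HTML via an injected LLM callable."""

    complete: Callable[[str], str]  # prompt -> raw model response (assumed JSON)
    max_chars: int = 8000  # arbitrary cap on page text sent to the model

    def extract(self, html: str, instruction: str) -> dict:
        page_text = html_to_text(html)[: self.max_chars]
        prompt = (
            f"{instruction}\n"
            "Respond with a single JSON object and nothing else.\n\n"
            f"Page text:\n{page_text}"
        )
        result = json.loads(self.complete(prompt))
        if not isinstance(result, dict):
            raise ValueError("Model response was not a JSON object")
        return result


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: returns the first line of page text as the title."""
    first_line = prompt.split("Page text:\n", 1)[1].splitlines()[0]
    return json.dumps({"title": first_line})


html = (
    "<html><head><style>h1{color:red}</style></head>"
    "<body><h1>Crawlee</h1><p>Web scraping library</p></body></html>"
)
selector = LlmSelector(complete=fake_llm)
data = selector.extract(html, "Extract the page title.")
print(data)
```

In a real integration, `html` would presumably come from the HTTP response inside the crawler's request handler, and `complete` would wrap whichever LLM client the user configures. Keeping the model behind a callable is just one possible design; it keeps the extraction logic testable without a provider.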

This could make Crawlee more useful for AI/LLM-based extraction and for quickly prototyping scrapers without hand-written CSS/XPath selectors.

Metadata

Labels: t-tooling (issues with this label are in the ownership of the tooling team)