
Crawler: URLs to normalized pages #6

@functor-flow


Full spec: spec

```mermaid
flowchart TD
  Task["SearchTask"]
  HistBefore["SearchHistory (before)"]
  Url["URL + source (serp/nav)"]
  Budget["ScraperAPI budget"]

  subgraph Crawler["Crawler module"]
    BuildReq["buildScraperApiUrl"]
    Fetch["fetchWithEscalation"]
    BlockCheck{"blocked or empty?"}
    Clean["stripHtml / chunkText / findDataDownloadLinks"]
    PageObj["build CrawledPage"]
    Visit["update PageVisit status + summary"]
  end

  HistAfter["SearchHistory (after)"]
  PageOut["CrawledPage (optional)"]

  Task --> BuildReq
  Url --> BuildReq
  Budget --> Fetch

  BuildReq --> Fetch
  Fetch --> BlockCheck
  BlockCheck -- "ok" --> Clean
  BlockCheck -- "blocked/empty" --> Visit

  Clean --> PageObj
  PageObj --> Visit

  Visit --> HistAfter
  PageObj --> PageOut
```

Status: Building blocks implemented; not yet wired as a single "crawl URL" step in the loop.

Role. For each URL chosen by Serper or Nav, fetch the page within the ScraperAPI budget and turn it into normalized page content that Embeddings can work with.
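For context, buildScraperApiUrl from the diagram above presumably just wraps the target URL in a ScraperAPI request. A minimal sketch, assuming the standard `api_key`/`url`/`render` query parameters; the env var name and option shape are guesses, not necessarily the existing helper:

```ts
// Sketch only: wrap a target URL in a ScraperAPI request URL.
// SCRAPERAPI_KEY and the opts shape are assumptions, not the real helper's API.
export function buildScraperApiUrl(
  targetUrl: string,
  opts: { render?: boolean } = {},
): string {
  const params = new URLSearchParams({
    api_key: process.env.SCRAPERAPI_KEY ?? "",
    url: targetUrl,
  });
  if (opts.render) params.set("render", "true");
  return `https://api.scraperapi.com/?${params.toString()}`;
}
```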

Responsibilities.

  • Take URLs from Serper (SERP picks) and Nav (on-site navigation) and try to fetch them.
  • Use the existing ScraperAPI + HTML utilities to fetch and clean pages, handling blocks and budget limits.
  • Produce CrawledPage objects (metadata + text chunks) for the Embeddings step (a possible shape is sketched after this list).
  • Update SearchHistory.pages so we know which URLs were tried, what happened (fetched / blocked / no_content), and where they came from (serp vs nav).
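
A possible shape for these objects, as a sketch only; field names are assumptions to be reconciled with the existing SearchHistory.PageVisit definition:

```ts
// Hypothetical shapes; not the final types.
export type PageSource = "serp" | "nav";
export type PageStatus = "fetched" | "blocked" | "no_content" | "skipped";

export interface CrawledPage {
  url: string;
  domain: string;
  chunks: string[];             // cleaned text chunks for the Embeddings step
  dataDownloadLinks: string[];  // from findDataDownloadLinks
  mainDate?: string;            // optional main date found on the page
  summary?: string;             // short summary reused in SearchHistory
}

export interface PageVisit {
  url: string;
  source: PageSource;
  status: PageStatus;
  summary?: string;
}
```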

TODO

  • Define a minimal CrawledPage type that carries the fields Embeddings and Decision need (URL, domain, text chunks, optional main date/summary) and lines up with SearchHistory.PageVisit.
  • Implement a single crawler function, e.g. crawlUrl(task, history, url, source, budget), that reuses fetchWithEscalation, stripHtml, chunkText, and findDataDownloadLinks and returns { page?: CrawledPage; history: SearchHistory } (see the sketch after this list).
  • Wire the crawler into the Serper step in the loop: for each CandidateUrl returned by Serper, call crawlUrl(...) and pass any CrawledPage into the Embeddings helpers before deciding what to do next.
  • In that function, always update SearchHistory for the URL: set the source (serp or nav), mark status (fetched, blocked, no_content, skipped), and attach a short summary when we have usable content.
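
Putting these items together, crawlUrl could look roughly like the sketch below. The signatures of the reused helpers (fetchWithEscalation, stripHtml, chunkText, findDataDownloadLinks) and the recordVisit helper are assumptions; only the overall flow follows the diagram above.

```ts
// Sketch of crawlUrl; helper signatures and recordVisit are hypothetical.
export async function crawlUrl(
  task: SearchTask,              // passed through for logging/escalation decisions
  history: SearchHistory,
  url: string,
  source: PageSource,
  budget: ScraperApiBudget,
): Promise<{ page?: CrawledPage; history: SearchHistory }> {
  // Fetch via ScraperAPI, escalating options while the budget allows.
  const html = await fetchWithEscalation(buildScraperApiUrl(url), budget);
  if (!html) {
    // Blocked or budget exhausted: record the attempt and return no page.
    return { history: recordVisit(history, { url, source, status: "blocked" }) };
  }

  const text = stripHtml(html);
  if (!text.trim()) {
    return { history: recordVisit(history, { url, source, status: "no_content" }) };
  }

  const page: CrawledPage = {
    url,
    domain: new URL(url).hostname,
    chunks: chunkText(text),
    dataDownloadLinks: findDataDownloadLinks(html),
    summary: text.slice(0, 280),
  };

  return {
    page,
    history: recordVisit(history, { url, source, status: "fetched", summary: page.summary }),
  };
}
```

Wiring into the Serper step would then be a loop over the CandidateUrls: call crawlUrl(task, history, candidate.url, "serp", budget), thread the returned history forward, and hand any returned page to the Embeddings helpers before the next decision.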
