
Crawler: URLs to normalized pages #6

@functor-flow


Full spec: spec

```mermaid
flowchart TD
  Task["SearchTask"]
  HistBefore["SearchHistory (before)"]
  Url["URL + source (serp/nav)"]
  Budget["ScraperAPI budget"]

  subgraph Crawler["Crawler module"]
    BuildReq["buildScraperApiUrl"]
    Fetch["fetchWithEscalation"]
    BlockCheck{"blocked or empty?"}
    Clean["stripHtml / chunkText / findDataDownloadLinks"]
    PageObj["build CrawledPage"]
    Visit["update PageVisit status + summary"]
  end

  HistAfter["SearchHistory (after)"]
  PageOut["CrawledPage (optional)"]

  Task --> BuildReq
  Url --> BuildReq
  Budget --> Fetch

  BuildReq --> Fetch
  Fetch --> BlockCheck
  BlockCheck -- "ok" --> Clean
  BlockCheck -- "blocked/empty" --> Visit

  Clean --> PageObj
  PageObj --> Visit

  Visit --> HistAfter
  PageObj --> PageOut
```

Status: Building blocks implemented; not yet wired as a single "crawl URL" step in the loop.

Role. For each URL chosen by Serper or Nav, fetch the page within the ScraperAPI budget and turn it into normalized page content that Embeddings can work with.
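For context, buildScraperApiUrl from the diagram above presumably just wraps the target URL in a ScraperAPI request. A minimal sketch, assuming the standard `api_key`/`url`/`render` query parameters; the env var name and option shape are guesses, not necessarily the existing helper:

```ts
// Sketch only: wrap a target URL in a ScraperAPI request URL.
// SCRAPERAPI_KEY and the opts shape are assumptions, not the real helper's API.
export function buildScraperApiUrl(
  targetUrl: string,
  opts: { render?: boolean } = {},
): string {
  const params = new URLSearchParams({
    api_key: process.env.SCRAPERAPI_KEY ?? "",
    url: targetUrl,
  });
  if (opts.render) params.set("render", "true");
  return `https://api.scraperapi.com/?${params.toString()}`;
}
```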

Responsibilities.

  • Take URLs from Serper (SERP picks) and Nav (on-site navigation) and try to fetch them.
  • Use the existing ScraperAPI + HTML utilities to fetch and clean pages, handling blocks and budget limits.
  • Produce CrawledPage objects (metadata + text chunks) for the Embeddings step (a possible shape is sketched after this list).
  • Update SearchHistory.pages so we know which URLs were tried, what happened (fetched / blocked / no_content), and where they came from (serp vs nav).
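
A possible shape for these objects, as a sketch only; field names are assumptions to be reconciled with the existing SearchHistory.PageVisit definition:

```ts
// Hypothetical shapes; not the final types.
export type PageSource = "serp" | "nav";
export type PageStatus = "fetched" | "blocked" | "no_content" | "skipped";

export interface CrawledPage {
  url: string;
  domain: string;
  chunks: string[];             // cleaned text chunks for the Embeddings step
  dataDownloadLinks: string[];  // from findDataDownloadLinks
  mainDate?: string;            // optional main date found on the page
  summary?: string;             // short summary reused in SearchHistory
}

export interface PageVisit {
  url: string;
  source: PageSource;
  status: PageStatus;
  summary?: string;
}
```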

TODO

  • Define a minimal CrawledPage type that carries the fields Embeddings and Decision need (URL, domain, text chunks, optional main date/summary) and lines up with SearchHistory.PageVisit.
  • Implement a single crawler function, e.g. crawlUrl(task, history, url, source, budget), that reuses fetchWithEscalation, stripHtml, chunkText, and findDataDownloadLinks and returns { page?: CrawledPage; history: SearchHistory } (see the sketch after this list).
  • Wire the crawler into the Serper step in the loop: for each CandidateUrl returned by Serper, call crawlUrl(...) and pass any CrawledPage into the Embeddings helpers before deciding what to do next.
  • In that function, always update SearchHistory for the URL: set the source (serp or nav), mark status (fetched, blocked, no_content, skipped), and attach a short summary when we have usable content.
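
Putting these items together, crawlUrl could look roughly like the sketch below. The signatures of the reused helpers (fetchWithEscalation, stripHtml, chunkText, findDataDownloadLinks) and the recordVisit helper are assumptions; only the overall flow follows the diagram above.

```ts
// Sketch of crawlUrl; helper signatures and recordVisit are hypothetical.
export async function crawlUrl(
  task: SearchTask,              // passed through for logging/escalation decisions
  history: SearchHistory,
  url: string,
  source: PageSource,
  budget: ScraperApiBudget,
): Promise<{ page?: CrawledPage; history: SearchHistory }> {
  // Fetch via ScraperAPI, escalating options while the budget allows.
  const html = await fetchWithEscalation(buildScraperApiUrl(url), budget);
  if (!html) {
    // Blocked or budget exhausted: record the attempt and return no page.
    return { history: recordVisit(history, { url, source, status: "blocked" }) };
  }

  const text = stripHtml(html);
  if (!text.trim()) {
    return { history: recordVisit(history, { url, source, status: "no_content" }) };
  }

  const page: CrawledPage = {
    url,
    domain: new URL(url).hostname,
    chunks: chunkText(text),
    dataDownloadLinks: findDataDownloadLinks(html),
    summary: text.slice(0, 280),
  };

  return {
    page,
    history: recordVisit(history, { url, source, status: "fetched", summary: page.summary }),
  };
}
```

Wiring into the Serper step would then be a loop over the CandidateUrls: call crawlUrl(task, history, candidate.url, "serp", budget), thread the returned history forward, and hand any returned page to the Embeddings helpers before the next decision.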
