# Crawler: URLs to normalized pages
Full spec at spec.
```mermaid
flowchart TD
    Task["SearchTask"]
    HistBefore["SearchHistory (before)"]
    Url["URL + source (serp/nav)"]
    Budget["ScraperAPI budget"]
    subgraph Crawler["Crawler module"]
        BuildReq["buildScraperApiUrl"]
        Fetch["fetchWithEscalation"]
        BlockCheck{"blocked or empty?"}
        Clean["stripHtml / chunkText / findDataDownloadLinks"]
        PageObj["build CrawledPage"]
        Visit["update PageVisit status + summary"]
    end
    HistAfter["SearchHistory (after)"]
    PageOut["CrawledPage (optional)"]
    Task --> BuildReq
    Url --> BuildReq
    Budget --> Fetch
    BuildReq --> Fetch
    Fetch --> BlockCheck
    BlockCheck -- "ok" --> Clean
    BlockCheck -- "blocked/empty" --> Visit
    Clean --> PageObj
    PageObj --> Visit
    Visit --> HistAfter
    PageObj --> PageOut
```
**Status.** Building blocks are implemented; they are not yet wired into the loop as a single "crawl URL" step.

**Role.** For each URL chosen by Serper or Nav, fetch the page within the ScraperAPI budget and turn it into normalized page content that Embeddings can work with.

**Responsibilities.**
- Take URLs from Serper (SERP picks) and Nav (on-site navigation) and try to fetch them.
- Use the existing ScraperAPI + HTML utilities to fetch and clean pages, handling blocks and budget limits.
- Produce `CrawledPage` objects (metadata + text chunks) for the Embeddings step (see the type sketch after this list).
- Update `SearchHistory.pages` so we know which URLs were tried, what happened (fetched / blocked / no_content), and where they came from (serp vs nav).
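For reference, a minimal sketch of what these shapes could look like. Only the URL, domain, text chunks, optional main date/summary, the serp/nav source, and the four statuses come from this issue; every other field name here is an assumption, not the final type:

```typescript
type PageSource = "serp" | "nav";
type VisitStatus = "fetched" | "blocked" | "no_content" | "skipped";

interface CrawledPage {
  url: string;
  domain: string;
  chunks: string[];               // normalized text chunks for Embeddings
  mainDate?: string;              // optional main date, if one was extracted
  summary?: string;               // optional short summary
  dataDownloadLinks?: string[];   // output of findDataDownloadLinks
}

// Lines up with SearchHistory.PageVisit: which URLs were tried,
// what happened, and where they came from.
interface PageVisit {
  url: string;
  source: PageSource;
  status: VisitStatus;
  summary?: string;               // attached when we have usable content
}
```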
TODO
- Define a minimal `CrawledPage` type that carries the fields Embeddings and Decision need (URL, domain, text chunks, optional main date/summary) and lines up with `SearchHistory.PageVisit`.
- Implement a single crawler function, e.g. `crawlUrl(task, history, url, source, budget)`, that reuses `fetchWithEscalation`, `stripHtml`, `chunkText`, and `findDataDownloadLinks` and returns `{ page?: CrawledPage; history: SearchHistory }` (a sketch follows this list).
- Wire the crawler into the Serper step in the loop: for each `CandidateUrl` returned by Serper, call `crawlUrl(...)` and pass any `CrawledPage` into the Embeddings helpers before deciding what to do next.
- In that function, always update `SearchHistory` for the URL: set the source (`serp` or `nav`), mark the status (`fetched`, `blocked`, `no_content`, `skipped`), and attach a short summary when we have usable content.
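A sketch of what `crawlUrl` could look like, reusing the types above. The helper signatures and placeholder types are assumptions (the real ones live in the existing ScraperAPI/HTML utilities), and `recordVisit` is a hypothetical helper, not an existing function:

```typescript
type ScraperApiBudget = { remaining: number };      // placeholder shape
interface SearchTask { query: string }              // placeholder shape
interface SearchHistory { pages: PageVisit[] }      // placeholder shape

// Assumed signatures for the existing utilities.
declare function fetchWithEscalation(url: string, budget: ScraperApiBudget): Promise<string | null>;
declare function stripHtml(html: string): string;
declare function chunkText(text: string): string[];
declare function findDataDownloadLinks(html: string): string[];

// Hypothetical helper: record (or overwrite) the PageVisit for this URL.
function recordVisit(history: SearchHistory, visit: PageVisit): SearchHistory {
  const others = history.pages.filter((p) => p.url !== visit.url);
  return { ...history, pages: [...others, visit] };
}

async function crawlUrl(
  task: SearchTask,            // threaded through per the proposed signature
  history: SearchHistory,
  url: string,
  source: PageSource,
  budget: ScraperApiBudget
): Promise<{ page?: CrawledPage; history: SearchHistory }> {
  // Assumed contract: a null/empty result means blocked or budget exhausted.
  const html = await fetchWithEscalation(url, budget);
  if (!html) {
    return { history: recordVisit(history, { url, source, status: "blocked" }) };
  }

  const text = stripHtml(html);
  if (!text.trim()) {
    return { history: recordVisit(history, { url, source, status: "no_content" }) };
  }

  const page: CrawledPage = {
    url,
    domain: new URL(url).hostname,
    chunks: chunkText(text),
    dataDownloadLinks: findDataDownloadLinks(html),
  };
  return {
    page,
    history: recordVisit(history, {
      url,
      source,
      status: "fetched",
      summary: page.summary,   // attach a short summary once one is generated
    }),
  };
}
```

In the loop, each `CandidateUrl` from Serper would then flow through this as `crawlUrl(task, history, candidate.url, "serp", budget)`, with the returned `history` threaded into the next iteration and any returned `page` handed to the Embeddings helpers.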