Open
Labels: enhancement (New feature or request)
Description
Embeddings: page scoring & evidence hits
full spec at spec
    flowchart TD
        Task["SearchTask"]
        Page["CrawledPage"]
        HistBefore["SearchHistory (before)"]
        subgraph Emb["Embeddings module"]
            Score["scorePageWithEmbeddings"]
            Check{"strong hits?"}
            Attach["attachEvidenceFromPage"]
            Summary["store page summary"]
        end
        HistAfter["SearchHistory (after)"]
        Task --> Score
        Page --> Score
        HistBefore --> Attach
        Score --> Check
        Check -- "yes" --> Attach
        Check -- "no" --> Summary
        Attach --> HistAfter
        Summary --> HistAfter
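The branch in the diagram can be sketched in a few lines. This is only an illustration: `ScoredChunk`, `HistorySketch`, `routePage`, and the `STRONG_HIT_THRESHOLD` value are hypothetical names and numbers chosen for the example, not part of the spec.

```typescript
// Hypothetical minimal shapes; the real types are defined by the spec.
interface ScoredChunk {
  text: string;
  score: number;
}

interface HistorySketch {
  evidence: ScoredChunk[];
  pageSummaries: string[];
}

// Assumed cutoff for a "strong" hit; the issue does not fix a value.
const STRONG_HIT_THRESHOLD = 0.75;

// Follows the diagram: strong hits are attached as evidence,
// otherwise a short page summary is stored instead.
function routePage(history: HistorySketch, ranked: ScoredChunk[]): HistorySketch {
  const strong = ranked.filter((c) => c.score >= STRONG_HIT_THRESHOLD);
  if (strong.length > 0) {
    // "yes" branch: attach evidence.
    return { ...history, evidence: [...history.evidence, ...strong] };
  }
  // "no" branch: keep the best-scoring chunk, truncated, as the page summary.
  const best = ranked.reduce<ScoredChunk | undefined>(
    (top, c) => (top === undefined || c.score > top.score ? c : top),
    undefined,
  );
  const summary = (best?.text ?? "").slice(0, 200);
  return { ...history, pageSummaries: [...history.pageSummaries, summary] };
}
```

The function returns a new history object rather than mutating its input, which keeps the "before" and "after" states in the diagram distinct.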
Status: Core primitives (proof of concept) implemented; not yet wired into the loop.
Role. For each CrawledPage, compare its text chunks to the SearchTask question and pick the snippets that look like real evidence.
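As a rough sketch of what "compare its text chunks to the question" could look like, the snippet below ranks chunks by cosine similarity between precomputed vectors. Cosine similarity is an assumption here; the existing `rankByEmbedding` may use a different metric. `cosine`, `rankChunks`, and `RankedChunk` are hypothetical names, and the vectors are passed in so the example does not depend on any embedding provider.

```typescript
// Cosine similarity between two vectors; returns 0 if either has zero length.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

interface RankedChunk {
  index: number; // position of the chunk within the page
  text: string;
  score: number;
}

// Ranks page chunks against the question vector, best match first.
function rankChunks(
  questionVec: number[],
  chunks: string[],
  chunkVecs: number[][],
): RankedChunk[] {
  return chunks
    .map((text, index) => ({
      index,
      text,
      score: cosine(questionVec, chunkVecs[index] ?? []),
    }))
    .sort((x, y) => y.score - x.score);
}
```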
Responsibilities.
- Take a `CrawledPage` (from Crawler) and the `SearchTask` question / targeted information.
- Use embeddings to score each text chunk against the question (reusing `embedTexts` / `rankByEmbedding`).
- Make sure evidence is tied to the right timeframe and other key factors from the `SearchTask` (e.g. date, location, asset), so we do not pick a nice-looking snippet that clearly talks about a different period or object (think: wrong day on a weather page).
- Turn the best chunks into `EvidenceHit` records (url, page index, snippet, stance, weight, date) that match the `SearchHistory` sketch.
- Even when a page does not produce any strong evidence hits, still extract one short, representative snippet or summary so `SearchHistory` remembers what was on the page for future query refinement.
- Add those hits into `SearchHistory.evidence` and update simple aggregates like total support / total refute.
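One possible shape for the records and aggregates described above, as a sketch only. The field list (url, page index, snippet, stance, weight, date) comes from this issue; the concrete field names, the `"neutral"` stance, the optional `date`, the `totalSupport` / `totalRefute` names, and the `addEvidence` helper are assumptions made for illustration.

```typescript
// "support" and "refute" follow the aggregates named in the issue;
// "neutral" is an assumed extra value.
type Stance = "support" | "refute" | "neutral";

interface EvidenceHit {
  url: string;
  pageIndex: number;
  snippet: string;
  stance: Stance;
  weight: number;
  date?: string; // assumed to be an ISO date string, and optional
}

interface SearchHistory {
  evidence: EvidenceHit[];
  totalSupport: number;
  totalRefute: number;
}

// Appends hits and updates the aggregates without mutating the input history.
function addEvidence(history: SearchHistory, hits: EvidenceHit[]): SearchHistory {
  let totalSupport = history.totalSupport;
  let totalRefute = history.totalRefute;
  for (const hit of hits) {
    if (hit.stance === "support") totalSupport += hit.weight;
    else if (hit.stance === "refute") totalRefute += hit.weight;
  }
  return { evidence: [...history.evidence, ...hits], totalSupport, totalRefute };
}
```

Summing weights per stance is the simplest aggregate that still lets a Decision step compare support against refutation; neutral hits are stored but do not move either total.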
TODO
- Finalize the `EvidenceHit` shape under the `SearchHistory` module (keeping it minimal but enough for Decision to work with).
- Implement a pure helper like `scorePageWithEmbeddings(task, page)` that returns ranked chunks for that page using the existing embedding functions.
- Implement a second helper like `attachEvidenceFromPage(task, history, page)` that calls the scorer, picks the top hits, writes them into `SearchHistory` (including aggregates), and returns the updated history.
- Make sure `attachEvidenceFromPage` (or a sibling helper) always writes some concise page summary into `SearchHistory` even when no evidence hits pass the threshold, so later Serper/Decision steps can see we already inspected that content.
- Revisit how we build embedding inputs: the current approach (using the whole `SearchTask` as one embedding query) is naive. Split out or otherwise highlight important pieces (timeframe, location, variable, model/asset) so chunks that match all of these are ranked above snippets that only share loose keywords.
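One way to approach the last item, sketched under assumptions: embed each facet of the task separately (question, timeframe, location, and so on), compute a per-facet similarity for every chunk, and combine them so that a weak match on any single facet pulls the overall score down. `FacetScores`, `combineFacetScores`, and the `minWeight` default are hypothetical, and the blend of mean and minimum is just one of several reasonable combination rules.

```typescript
// Per-facet similarity for one chunk, keyed by facet name
// (e.g. "question", "timeframe", "location"). Values are assumed to lie in [0, 1].
type FacetScores = Record<string, number>;

// Blends the mean facet score with the weakest facet score.
// minWeight = 0 gives a plain mean; minWeight = 1 scores by the weakest facet only.
function combineFacetScores(scores: FacetScores, minWeight = 0.5): number {
  const values = Object.values(scores);
  if (values.length === 0) return 0;
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const weakest = Math.min(...values);
  return minWeight * weakest + (1 - minWeight) * mean;
}

// A chunk that matches every facet reasonably well...
const consistent = combineFacetScores({ question: 0.8, timeframe: 0.8, location: 0.8 });
// ...outranks one that matches the question closely but talks about the wrong period.
const wrongDay = combineFacetScores({ question: 0.95, timeframe: 0.1, location: 0.9 });
console.log(consistent > wrongDay); // true
```

With a plain mean the second chunk would still score about 0.65; blending in the minimum is what penalizes the "wrong day on a weather page" case.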