add optional rawTextMatch and csrClues fields to page qa model #3236

Draft
emma-sg wants to merge 1 commit into main from add-raw-text-and-csr-detection-to-db

emma-sg (Member) commented Mar 26, 2026

Adds two new optional fields, `rawTextMatch` and `csrClues`, to the page QA model. They capture additional match data for raw text and for client-side rendering (CSR) clues, and will be used to store enhanced matching information for crawled page analysis.

To make use of these new fields, set `crawler_extra_args: "--qaDetectClientSideRendering"` and add `to-warc-from-raw` to `crawler_extract_full_text` in one of the YAML configs used, e.g.:

crawler_extract_full_text: to-pages,to-warc,to-warc-from-raw
crawler_extra_args: "--qaDetectClientSideRendering"
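
For reviewers who want a feel for what "optional" means here, the following is a minimal sketch and not the code in this PR. It uses a plain dataclass as a stand-in for the page QA model. Only the names `rawTextMatch` and `csrClues` come from this PR; the class name `PageQASketch`, the `textMatch` field, the field types, the helper `page_qa_from_record`, and the `usesFramework` key are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class PageQASketch:
    """Hypothetical stand-in for the page QA model (not the real class)."""

    # Assumed pre-existing field, included only to show an older record shape.
    textMatch: Optional[float] = None
    # New optional fields named in this PR; the types are assumptions.
    rawTextMatch: Optional[float] = None
    csrClues: Optional[Dict[str, Any]] = None


def page_qa_from_record(record: Dict[str, Any]) -> PageQASketch:
    """Build the model from a stored record, ignoring unknown keys.

    Because the new fields default to None, records written before
    the fields existed still load without errors.
    """
    known = {f.name for f in fields(PageQASketch)}
    return PageQASketch(**{k: v for k, v in record.items() if k in known})


# An older record without the new fields, and a newer one with them.
old_page = page_qa_from_record({"textMatch": 0.9})
new_page = page_qa_from_record(
    {"textMatch": 0.9, "rawTextMatch": 0.8, "csrClues": {"usesFramework": True}}
)
```

Defaulting both fields to `None` keeps the change backward compatible: crawls run without `--qaDetectClientSideRendering` or `to-warc-from-raw` simply leave them unset.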
