fix: limit scraped content size to prevent excessive token usage #1035
Open
VibhorGautam wants to merge 2 commits into ItzCrazyKns:master from
Conversation
Scraped web pages were being sent to the LLM in full, with no truncation. A single large page could produce 100K+ tokens of markdown, easily exceeding the model's context window. Use the existing splitText utility to cap scraped content at ~6000 tokens per page. Also add per-result and total character limits when assembling the final context for the writer prompt. Fixes ItzCrazyKns#1031
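The per-page cap described above can be sketched as follows. The repo's token-aware `splitText` utility isn't reproduced here, so this sketch approximates token counting with a rough 4-characters-per-token heuristic (an assumption for illustration, not the utility's actual logic):

```typescript
// Sketch: cap scraped markdown at ~6000 tokens, keeping only the first chunk.
// Token count is approximated as characters / 4 (a common rough rule for
// English text); the real pipeline uses the repo's splitText utility instead.
const MAX_TOKENS = 6000;
const CHARS_PER_TOKEN = 4;

function capMarkdown(markdown: string, maxTokens: number = MAX_TOKENS): string {
  const maxChars = maxTokens * CHARS_PER_TOKEN;
  if (markdown.length <= maxChars) return markdown;
  // Keep only the leading chunk: page header, intro and key sections
  // usually sit at the top of the converted markdown.
  return markdown.slice(0, maxChars);
}
```

Keeping only the first chunk trades completeness for a hard upper bound on prompt size, which is the point of the fix.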
Contributor
1 issue found across 3 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/lib/agents/search/researcher/actions/scrapeURL.ts">
<violation number="1" location="src/lib/agents/search/researcher/actions/scrapeURL.ts:117">
P2: Token limiting is applied after full Turndown conversion, leaving unbounded HTML conversion cost for large pages.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Truncate the HTML to 200K chars before passing it to Turndown so we don't waste CPU converting huge pages we mostly discard after tokenization anyway.
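The reviewer's suggestion could look roughly like this. The markdown converter is injected as a parameter so the sketch stays self-contained; in the PR it would be the Turndown service's `turndown` call, and the 200K limit comes from the review comment:

```typescript
// Sketch: bound raw HTML length BEFORE conversion, so we never pay
// HTML-to-markdown conversion cost for content that would be discarded
// after tokenization anyway.
const MAX_HTML_CHARS = 200_000;

function convertBounded(
  html: string,
  toMarkdown: (html: string) => string, // e.g. (h) => turndownService.turndown(h)
): string {
  const bounded =
    html.length > MAX_HTML_CHARS ? html.slice(0, MAX_HTML_CHARS) : html;
  return toMarkdown(bounded);
}
```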
Contributor
1 issue found across 1 file (changes from recent commits).
<file name="src/lib/agents/search/researcher/actions/scrapeURL.ts">
<violation number="1" location="src/lib/agents/search/researcher/actions/scrapeURL.ts:44">
P2: HTML size limiting is enforced only after full `res.text()` buffering/decoding, so very large responses can still cause significant memory/CPU pressure before the cap applies.</violation>
</file>
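The second finding (full `res.text()` buffering before the cap) could be addressed by reading the response body incrementally and stopping once the limit is reached. A minimal sketch using the standard fetch/ReadableStream APIs, with the 200K cap from the PR:

```typescript
// Sketch: read a fetch Response body in chunks and stop as soon as the
// character cap is reached, instead of buffering the entire body with
// res.text() and truncating afterwards.
const MAX_HTML_CHARS = 200_000;

async function readBounded(
  res: Response,
  maxChars: number = MAX_HTML_CHARS,
): Promise<string> {
  const reader = res.body?.getReader();
  // Fallback for environments without a readable body stream.
  if (!reader) return (await res.text()).slice(0, maxChars);
  const decoder = new TextDecoder();
  let out = '';
  while (out.length < maxChars) {
    const { done, value } = await reader.read();
    if (done) break;
    out += decoder.decode(value, { stream: true });
  }
  // Stop downloading the rest of a huge page.
  await reader.cancel().catch(() => {});
  return out.slice(0, maxChars);
}
```

This bounds memory roughly at the cap plus one chunk, and also avoids downloading the remainder of very large pages.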
Thank you for this fix!
Problem
When Perplexica scrapes a web page (e.g. from Discover or a user-provided URL), the entire HTML is converted to markdown and passed to the LLM with no size limit. A single large page can easily produce 100K+ tokens of markdown, leading to requests of 481K+ tokens that exceed the model's context window and cause silent failures.
Root cause
`scrapeURL.ts` calls `turndownService.turndown(text)` on the full page HTML and pushes the entire result into the context. The search agent (`index.ts`/`api.ts`) then concatenates all results into the writer prompt without any cap. The codebase already has a token-aware `splitText` utility in `src/lib/utils/splitText.ts`, but it was unused in this pipeline.
Fix
`scrapeURL.ts`: after converting HTML to markdown, truncate to ~6000 tokens using the existing `splitText` utility. Only the first chunk is kept, which preserves the most relevant content (page header, intro, key sections) while staying well within context limits.
`index.ts`/`api.ts`: added per-result (24000 chars) and total context (80000 chars) caps when assembling search results for the writer prompt. This acts as a safety net regardless of which research action produced the content.
Testing
- `npx tsc --noEmit`
- `npm run format:write`

Fixes #1031
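The per-result and total-context caps can be sketched like this. The 24K/80K limits come from the PR description; the result shape (a `content` string per search result) is a simplification for illustration, not the agent's actual types:

```typescript
// Sketch: safety-net caps when assembling search results into the writer
// prompt. Each result is capped individually, and assembly stops once the
// total context budget is exhausted.
const MAX_RESULT_CHARS = 24_000;
const MAX_CONTEXT_CHARS = 80_000;

function buildContext(results: { content: string }[]): string {
  let context = '';
  for (const r of results) {
    const capped = r.content.slice(0, MAX_RESULT_CHARS);
    if (context.length + capped.length > MAX_CONTEXT_CHARS) {
      // Take whatever budget remains, then stop.
      context += capped.slice(0, MAX_CONTEXT_CHARS - context.length);
      break;
    }
    context += capped + '\n\n';
  }
  return context;
}
```

Because the cap is applied at assembly time, it holds even if some upstream research action failed to truncate its own output.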
Summary by cubic
Limits scraped web content and caps the assembled search context to prevent token blowups and context window overflows. Fixes #1031.
- In `scrapeURL.ts`, cap raw HTML at 200k chars before Turndown, then limit markdown to ~6000 tokens via `splitText` (keep first chunk).
- In `search/index.ts` and `search/api.ts`, cap each result at 24k chars and total context at 80k chars when building the writer prompt.

Written for commit 1c66647. Summary will update on new commits.