
fix: limit scraped content size to prevent excessive token usage#1035

Open
VibhorGautam wants to merge 2 commits into ItzCrazyKns:master from VibhorGautam:fix/excessive-token-usage-1031

Conversation


@VibhorGautam VibhorGautam commented Mar 8, 2026

Problem

When Perplexica scrapes a web page (e.g. from Discover or a user-provided URL), the entire HTML is converted to markdown and passed to the LLM with no size limit. A single large page can easily produce 100K+ tokens of markdown, leading to requests of 481K+ tokens that exceed the model's context window and cause silent failures.

Root cause

scrapeURL.ts calls turndownService.turndown(text) on the full page HTML and pushes the entire result into the context. The search agent (index.ts / api.ts) then concatenates all results into the writer prompt without any cap. The codebase already has a token-aware splitText utility in src/lib/utils/splitText.ts, but it was unused in this pipeline.

Fix

  • scrapeURL.ts: After converting HTML to markdown, truncate to ~6000 tokens using the existing splitText utility. Only the first chunk is kept — this preserves the most relevant content (page header, intro, key sections) while staying well within context limits.
  • index.ts / api.ts: Added per-result (24000 chars) and total context (80000 chars) caps when assembling search results for the writer prompt. This acts as a safety net regardless of which research action produced the content.
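The first-chunk truncation described above can be sketched as follows. This is a hedged approximation: the PR uses the repo's existing `splitText` utility, whose exact signature is not shown here, so the sketch substitutes a simple chars-per-token heuristic. `truncateToTokenBudget` and `CHARS_PER_TOKEN` are illustrative names, not identifiers from the codebase.

```typescript
// Hedged sketch of the ~6000-token cap. The real code uses the repo's
// splitText utility; its signature is not shown in the PR, so this
// approximates "keep the first chunk" with a chars-per-token heuristic.
const MAX_TOKENS = 6000;
const CHARS_PER_TOKEN = 4; // rough heuristic, not a real tokenizer

function truncateToTokenBudget(
  markdown: string,
  maxTokens: number = MAX_TOKENS,
): string {
  const maxChars = maxTokens * CHARS_PER_TOKEN;
  if (markdown.length <= maxChars) return markdown;
  // Keeping only the leading slice mirrors the PR's choice to keep the
  // first chunk (page header, intro, key sections).
  return markdown.slice(0, maxChars);
}
```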

Testing

  • Verified the fix compiles cleanly (npx tsc --noEmit)
  • Formatted with npm run format:write

Fixes #1031


Summary by cubic

Limits scraped web content and caps the assembled search context to prevent token blowups and context window overflows. Fixes #1031.

  • Bug Fixes
    • In scrapeURL.ts, cap raw HTML at 200k chars before Turndown, then limit markdown to ~6000 tokens via splitText (keep first chunk).
    • In search/index.ts and search/api.ts, cap each result at 24k chars and total context at 80k chars when building the writer prompt.
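The two character caps can be sketched as a small assembly loop. The constant values (24k per result, 80k total) come from the PR, but `buildWriterContext` and its overall shape are assumptions for illustration, not the actual code in search/index.ts or search/api.ts.

```typescript
// Illustrative sketch of the two caps; the constant values are from the
// PR, but this function's name and shape are assumptions.
const PER_RESULT_CHAR_CAP = 24000;
const TOTAL_CONTEXT_CHAR_CAP = 80000;

function buildWriterContext(results: string[]): string {
  const parts: string[] = [];
  let total = 0;
  for (const result of results) {
    // Per-result cap: no single page can dominate the prompt.
    const clipped = result.slice(0, PER_RESULT_CHAR_CAP);
    const remaining = TOTAL_CONTEXT_CHAR_CAP - total;
    if (clipped.length >= remaining) {
      // Total cap reached: take what still fits, then stop.
      if (remaining > 0) parts.push(clipped.slice(0, remaining));
      break;
    }
    parts.push(clipped);
    total += clipped.length;
  }
  return parts.join('\n\n');
}
```

The total cap acts as a safety net regardless of how many results the per-result cap lets through.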

Written for commit 1c66647.

Scraped web pages were being sent to the LLM in full, with no
truncation. A single large page could produce 100K+ tokens of markdown,
easily exceeding the model's context window.

Use the existing splitText utility to cap scraped content at ~6000 tokens
per page. Also add per-result and total character limits when assembling
the final context for the writer prompt.

Fixes ItzCrazyKns#1031

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 3 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/lib/agents/search/researcher/actions/scrapeURL.ts">

<violation number="1" location="src/lib/agents/search/researcher/actions/scrapeURL.ts:117">
P2: Token limiting is applied after full Turndown conversion, leaving unbounded HTML conversion cost for large pages.</violation>
</file>


Truncate the HTML to 200K chars before passing it to Turndown so we
don't waste CPU converting huge pages we mostly discard after
tokenization anyway.
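This follow-up change amounts to a simple character cap applied to the raw HTML before conversion. A minimal sketch, assuming the 200K value from the commit; `capHtmlForTurndown` is an illustrative name, not the actual code:

```typescript
// Sketch of the follow-up commit's idea: cap raw HTML before Turndown so
// large pages don't burn CPU on conversion output that the token cap
// would discard anyway. capHtmlForTurndown is an illustrative name.
const MAX_HTML_CHARS = 200000; // value from the PR

function capHtmlForTurndown(html: string): string {
  return html.length > MAX_HTML_CHARS ? html.slice(0, MAX_HTML_CHARS) : html;
}

// Hypothetical usage at the existing conversion site:
// turndownService.turndown(capHtmlForTurndown(pageHtml));
```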

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 1 file (changes from recent commits).



<file name="src/lib/agents/search/researcher/actions/scrapeURL.ts">

<violation number="1" location="src/lib/agents/search/researcher/actions/scrapeURL.ts:44">
P2: HTML size limiting is enforced only after full `res.text()` buffering/decoding, so very large responses can still cause significant memory/CPU pressure before the cap applies.</violation>
</file>
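One possible way to address this reviewer point, not implemented in this PR: read the response body incrementally and stop once a byte budget is reached, instead of buffering the whole payload with `res.text()`. The sketch below is hypothetical; `readTextCapped` is an invented helper, and it assumes the WHATWG fetch API (`Response`, `ReadableStream`) as available in Node 18+.

```typescript
// Hypothetical sketch, NOT part of this PR: stop reading the response
// body once a byte budget is hit, instead of buffering it all.
async function readTextCapped(
  res: Response,
  maxBytes: number,
): Promise<string> {
  if (!res.body) return (await res.text()).slice(0, maxBytes);
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let received = 0;
  let out = '';
  while (received < maxBytes) {
    const { done, value } = await reader.read();
    if (done) break;
    received += value.byteLength;
    out += decoder.decode(value, { stream: true });
  }
  // Cancel so the remainder of the body is not downloaded.
  await reader.cancel();
  // Slicing by chars approximates the byte cap (it may split a multi-byte
  // sequence at the boundary; acceptable for a best-effort cap).
  return out.slice(0, maxBytes);
}
```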


@ampersandru

Thank you for this fix!


Successfully merging this pull request may close these issues.

Perplexica using an extreme amount of tokens, even when it is just summarizing one webpage
