fix: limit scraped content size to prevent excessive token usage #1035
Open
VibhorGautam wants to merge 2 commits into ItzCrazyKns:master from
Conversation
Scraped web pages were being sent to the LLM in full, with no truncation. A single large page could produce 100K+ tokens of markdown, easily exceeding the model's context window. Use the existing splitText utility to cap scraped content at ~6000 tokens per page. Also add per-result and total character limits when assembling the final context for the writer prompt. Fixes ItzCrazyKns#1031
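The per-page cap described above can be sketched as follows. The repo's token-aware `splitText` utility isn't reproduced here, so this sketch approximates token counting with a rough 4-characters-per-token heuristic (an assumption for illustration, not the utility's actual logic):

```typescript
// Sketch: cap scraped markdown at ~6000 tokens, keeping only the first chunk.
// Token count is approximated as characters / 4 (a common rough rule for
// English text); the real pipeline uses the repo's splitText utility instead.
const MAX_TOKENS = 6000;
const CHARS_PER_TOKEN = 4;

function capMarkdown(markdown: string, maxTokens: number = MAX_TOKENS): string {
  const maxChars = maxTokens * CHARS_PER_TOKEN;
  if (markdown.length <= maxChars) return markdown;
  // Keep only the leading chunk: page header, intro and key sections
  // usually sit at the top of the converted markdown.
  return markdown.slice(0, maxChars);
}
```

Keeping only the first chunk trades completeness for a hard upper bound on prompt size, which is the point of the fix.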
Contributor
1 issue found across 3 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/lib/agents/search/researcher/actions/scrapeURL.ts">
<violation number="1" location="src/lib/agents/search/researcher/actions/scrapeURL.ts:117">
P2: Token limiting is applied after full Turndown conversion, leaving unbounded HTML conversion cost for large pages.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Truncate the HTML to 200K chars before passing it to Turndown so we don't waste CPU converting huge pages we mostly discard after tokenization anyway.
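The reviewer's suggestion could look roughly like this. The markdown converter is injected as a parameter so the sketch stays self-contained; in the PR it would be the Turndown service's `turndown` call, and the 200K limit comes from the review comment:

```typescript
// Sketch: bound raw HTML length BEFORE conversion, so we never pay
// HTML-to-markdown conversion cost for content that would be discarded
// after tokenization anyway.
const MAX_HTML_CHARS = 200_000;

function convertBounded(
  html: string,
  toMarkdown: (html: string) => string, // e.g. (h) => turndownService.turndown(h)
): string {
  const bounded =
    html.length > MAX_HTML_CHARS ? html.slice(0, MAX_HTML_CHARS) : html;
  return toMarkdown(bounded);
}
```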
Contributor
1 issue found across 1 file (changes from recent commits).
<file name="src/lib/agents/search/researcher/actions/scrapeURL.ts">
<violation number="1" location="src/lib/agents/search/researcher/actions/scrapeURL.ts:44">
P2: HTML size limiting is enforced only after full `res.text()` buffering/decoding, so very large responses can still cause significant memory/CPU pressure before the cap applies.</violation>
</file>
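The second finding (full `res.text()` buffering before the cap) could be addressed by reading the response body incrementally and stopping once the limit is reached. A minimal sketch using the standard fetch/ReadableStream APIs, with the 200K cap from the PR:

```typescript
// Sketch: read a fetch Response body in chunks and stop as soon as the
// character cap is reached, instead of buffering the entire body with
// res.text() and truncating afterwards.
const MAX_HTML_CHARS = 200_000;

async function readBounded(
  res: Response,
  maxChars: number = MAX_HTML_CHARS,
): Promise<string> {
  const reader = res.body?.getReader();
  // Fallback for environments without a readable body stream.
  if (!reader) return (await res.text()).slice(0, maxChars);
  const decoder = new TextDecoder();
  let out = '';
  while (out.length < maxChars) {
    const { done, value } = await reader.read();
    if (done) break;
    out += decoder.decode(value, { stream: true });
  }
  // Stop downloading the rest of a huge page.
  await reader.cancel().catch(() => {});
  return out.slice(0, maxChars);
}
```

This bounds memory roughly at the cap plus one chunk, and also avoids downloading the remainder of very large pages.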
Thank you for this fix!
Problem
When Perplexica scrapes a web page (e.g. from Discover or a user-provided URL), the entire HTML is converted to markdown and passed to the LLM with no size limit. A single large page can easily produce 100K+ tokens of markdown, leading to requests of 481K+ tokens that exceed the model's context window and cause silent failures.
Root cause
`scrapeURL.ts` calls `turndownService.turndown(text)` on the full page HTML and pushes the entire result into the context. The search agent (`index.ts`/`api.ts`) then concatenates all results into the writer prompt without any cap. The codebase already has a token-aware `splitText` utility in `src/lib/utils/splitText.ts`, but it was unused in this pipeline.
Fix
`scrapeURL.ts`: after converting HTML to markdown, truncate to ~6000 tokens using the existing `splitText` utility. Only the first chunk is kept, which preserves the most relevant content (page header, intro, key sections) while staying well within context limits.
`index.ts`/`api.ts`: added per-result (24000 chars) and total context (80000 chars) caps when assembling search results for the writer prompt. This acts as a safety net regardless of which research action produced the content.
Testing
- `npx tsc --noEmit`
- `npm run format:write`

Fixes #1031
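The per-result and total-context caps can be sketched like this. The 24K/80K limits come from the PR description; the result shape (a `content` string per search result) is a simplification for illustration, not the agent's actual types:

```typescript
// Sketch: safety-net caps when assembling search results into the writer
// prompt. Each result is capped individually, and assembly stops once the
// total context budget is exhausted.
const MAX_RESULT_CHARS = 24_000;
const MAX_CONTEXT_CHARS = 80_000;

function buildContext(results: { content: string }[]): string {
  let context = '';
  for (const r of results) {
    const capped = r.content.slice(0, MAX_RESULT_CHARS);
    if (context.length + capped.length > MAX_CONTEXT_CHARS) {
      // Take whatever budget remains, then stop.
      context += capped.slice(0, MAX_CONTEXT_CHARS - context.length);
      break;
    }
    context += capped + '\n\n';
  }
  return context;
}
```

Because the cap is applied at assembly time, it holds even if some upstream research action failed to truncate its own output.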
Summary by cubic
Limits scraped web content and caps the assembled search context to prevent token blowups and context window overflows. Fixes #1031.
- In `scrapeURL.ts`, cap raw HTML at 200k chars before Turndown, then limit markdown to ~6000 tokens via `splitText` (keep first chunk).
- In `search/index.ts` and `search/api.ts`, cap each result at 24k chars and total context at 80k chars when building the writer prompt.

Written for commit 1c66647. Summary will update on new commits.