
Perplexica using an extreme amount of tokens, even when it is just summarizing one webpage #1031

@ampersandru

Description


Describe the bug
Specs:
Chat Model: Qwen3.5-9b (using Instruct parameters)
Embedding Model: qwen3-embedding:0.6b via ollama (it doesn't seem this is being loaded; I don't see my VRAM go up)
GPU: 5060 Ti 16 GB
llama-swap with llama.cpp, 32k context window
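For reference, the 32k window is set at llama-server launch time, and the error's own suggestion ("try increasing it") maps to the `--ctx-size` flag; raising it only buys headroom against a 481k-token request, though. A minimal llama-swap entry might look like the sketch below (the model name, file path, and exact llama-swap config keys are assumptions based on my setup, not verified against the llama-swap docs):

```yaml
# llama-swap config sketch -- model path is a placeholder
models:
  "qwen3.5-9b":
    cmd: >
      llama-server
      --model /models/qwen3.5-9b-instruct.gguf
      --ctx-size 32768
      --port ${PORT}
```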

When I click one of the news stories (https://www.wired.com/story/openai-fires-employee-insider-trading-polymarket-kalshi/) on the Perplexica Discover page, it often fails and just does nothing. Checking the Perplexica logs, I see that it tried to use 481k tokens:

```
 ⨯ unhandledRejection:  Error: 400 request (481909 tokens) exceeds the available context size (32768 tokens), try increasing it
    at f.generate (.next/server/chunks/607.js:20:19183)
    at cX.makeStatusError (.next/server/chunks/607.js:27:51395)
    at cX.makeRequest (.next/server/chunks/607.js:27:54864)
    at async q.streamText (.next/server/chunks/136.js:1:2480)
    at async i.research (.next/server/chunks/641.js:541:227)
    at async m.searchAsync (.next/server/app/api/chat/route.js:1:13218) {
  status: 400,
  headers: Headers {
    'access-control-allow-origin': '',
    'content-length': '201',
    'content-type': 'application/json; charset=utf-8',
    date: 'Fri, 06 Mar 2026 17:49:01 GMT',
    server: 'llama.cpp'
  },
  requestID: null,
  error: [Object],
  code: 400,
  param: undefined,
  type: 'exceed_context_size_error'
}
```
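At roughly 4 characters per token, a 481k-token request implies about 2 MB of text was packed into a single prompt, far beyond any single article. A mitigation on the summarizer side would be to clip each scraped page to a token budget before prompting; here is a minimal sketch (the 4-chars-per-token ratio is a rough heuristic I'm assuming, not Perplexica's actual tokenizer, and `truncate_to_budget` is a hypothetical helper, not existing Perplexica code):

```python
def truncate_to_budget(text: str, max_tokens: int, chars_per_token: float = 4.0) -> str:
    """Clip text to roughly max_tokens, using a crude ~4-chars-per-token
    heuristic (a real fix would count tokens with the model's tokenizer)."""
    max_chars = int(max_tokens * chars_per_token)
    return text if len(text) <= max_chars else text[:max_chars]

# A huge scraped page clipped to leave most of a 32k window for the
# prompt and the answer:
page_text = "lorem ipsum " * 160_000            # ~1.9M chars, like the failing request
clipped = truncate_to_budget(page_text, max_tokens=24_000)
print(len(clipped) // 4)  # ~24000 estimated tokens
```

Even this naive clipping would have kept the request under the 32k limit instead of sending 481k tokens.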

The red box shows Perplexica summarizing one page (https://techcrunch.com/2026/03/04/https-techcrunch-com-2026-03-04-google-search-rolls-out-geminis-canvas-in-ai-mode-to-all-us-users/); the green box shows OpenWebUI summarizing the same page using qwen3.5-thinking.

[Screenshot: token-usage comparison, Perplexica (red) vs. OpenWebUI (green)]

When running a balanced search with the query "Do research on benchmarks on is it worth upgrading my current PC that I use for gaming to an AMD 9800X3D with DDR5 memory? Specs: 49" 5120x1440 240hz monitor, RTX 5080 (to be reused), Intel i5-14600K overclocked to 5.6 GHz, 32 GB 3600 MHz DDR4, same 4x M.2", it attempted to use 483k tokens:

```
 ⨯ unhandledRejection:  Error: 400 request (483846 tokens) exceeds the available context size (32768 tokens), try increasing it
    at f.generate (.next/server/chunks/607.js:20:19183)
    at cX.makeStatusError (.next/server/chunks/607.js:27:51395)
    at cX.makeRequest (.next/server/chunks/607.js:27:54864)
    at async q.streamText (.next/server/chunks/136.js:1:2480)
    at async i.research (.next/server/chunks/641.js:541:227)
    at async m.searchAsync (.next/server/app/api/chat/route.js:1:13218) {
```
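The search case suggests the combined text of all retrieved results is being concatenated into the prompt unchecked. One possible shape of a fix is to divide a fixed context window across sources up front; the sketch below uses illustrative numbers (window split, reserved-token count, and source count are my assumptions, not Perplexica's actual values):

```python
def budget_per_source(context_window: int, reserved_tokens: int, num_sources: int) -> int:
    """Divide what's left of the context window evenly across retrieved
    pages, so their combined text can never overflow the window."""
    available = context_window - reserved_tokens
    return max(0, available // max(1, num_sources))

# Illustrative numbers: 32k window, ~4k reserved for the system prompt
# and the answer, 14 search results to summarize.
print(budget_per_source(32_768, 4_096, 14))  # 2048 tokens per page
```

With any such cap in place, the request size scales with the context window rather than with however much text the scraper happens to pull down.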

To Reproduce
Steps to reproduce the behavior:

  1. Go to Discover
  2. Click on 'any story'
  3. The summarization fails with a context-size error (see logs above)

Expected behavior
Perplexica should summarize the page (and complete searches) while keeping each request within the model's context window, e.g. by chunking or truncating retrieved content.

Screenshots
See the Perplexica vs. OpenWebUI token-usage comparison above.


Labels: bug