
Perplexica using an extreme amount of tokens, even when it is just summarizing one webpage #1031

@ampersandru

Description


Describe the bug
Specs:
Chat Model: Qwen3.5-9b (using Instruct parameters)
Embedding Model: qwen3-embedding:0.6b via ollama (it doesn't seem this is being loaded; I don't see my VRAM go up)
GPU: 5060 Ti 16 GB
llama-swap with llama.cpp, 32k context window
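For reference, the 32k window is set at llama-server launch time, and the error's own suggestion ("try increasing it") maps to the `--ctx-size` flag; raising it only buys headroom against a 481k-token request, though. A minimal llama-swap entry might look like the sketch below (the model name, file path, and exact llama-swap config keys are assumptions based on my setup, not verified against the llama-swap docs):

```yaml
# llama-swap config sketch -- model path is a placeholder
models:
  "qwen3.5-9b":
    cmd: >
      llama-server
      --model /models/qwen3.5-9b-instruct.gguf
      --ctx-size 32768
      --port ${PORT}
```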

When I click one of the news stories (https://www.wired.com/story/openai-fires-employee-insider-trading-polymarket-kalshi/) on the Perplexica Discover page, it often fails and just does nothing. Checking the Perplexica logs, I see that it tried to use 481k tokens:

```
 ⨯ unhandledRejection:  Error: 400 request (481909 tokens) exceeds the available context size (32768 tokens), try increasing it
    at f.generate (.next/server/chunks/607.js:20:19183)
    at cX.makeStatusError (.next/server/chunks/607.js:27:51395)
    at cX.makeRequest (.next/server/chunks/607.js:27:54864)
    at async q.streamText (.next/server/chunks/136.js:1:2480)
    at async i.research (.next/server/chunks/641.js:541:227)
    at async m.searchAsync (.next/server/app/api/chat/route.js:1:13218) {
  status: 400,
  headers: Headers {
    'access-control-allow-origin': '',
    'content-length': '201',
    'content-type': 'application/json; charset=utf-8',
    date: 'Fri, 06 Mar 2026 17:49:01 GMT',
    server: 'llama.cpp'
  },
  requestID: null,
  error: [Object],
  code: 400,
  param: undefined,
  type: 'exceed_context_size_error'
}
```
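At roughly 4 characters per token, a 481k-token request implies about 2 MB of text was packed into a single prompt, far beyond any single article. A mitigation on the summarizer side would be to clip each scraped page to a token budget before prompting; here is a minimal sketch (the 4-chars-per-token ratio is a rough heuristic I'm assuming, not Perplexica's actual tokenizer, and `truncate_to_budget` is a hypothetical helper, not existing Perplexica code):

```python
def truncate_to_budget(text: str, max_tokens: int, chars_per_token: float = 4.0) -> str:
    """Clip text to roughly max_tokens, using a crude ~4-chars-per-token
    heuristic (a real fix would count tokens with the model's tokenizer)."""
    max_chars = int(max_tokens * chars_per_token)
    return text if len(text) <= max_chars else text[:max_chars]

# A huge scraped page clipped to leave most of a 32k window for the
# prompt and the answer:
page_text = "lorem ipsum " * 160_000            # ~1.9M chars, like the failing request
clipped = truncate_to_budget(page_text, max_tokens=24_000)
print(len(clipped) // 4)  # ~24000 estimated tokens
```

Even this naive clipping would have kept the request under the 32k limit instead of sending 481k tokens.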

The red box shows Perplexica summarizing one page (https://techcrunch.com/2026/03/04/https-techcrunch-com-2026-03-04-google-search-rolls-out-geminis-canvas-in-ai-mode-to-all-us-users/); the green box shows OpenWebUI summarizing the same page using qwen3.5-thinking.

[Screenshot: token-usage comparison, Perplexica (red) vs. OpenWebUI (green)]

When running a balanced search with the query "Do research on benchmarks on is it worth upgrading my current PC that I use for gaming to an AMD 9800X3D with DDR5 memory? Specs: 49" 5120x1440 240hz monitor, RTX 5080 (to be reused), Intel i5-14600K overclocked to 5.6 GHz, 32 GB 3600 MHz DDR4, same 4x M.2", it attempted to use 483k tokens:

```
 ⨯ unhandledRejection:  Error: 400 request (483846 tokens) exceeds the available context size (32768 tokens), try increasing it
    at f.generate (.next/server/chunks/607.js:20:19183)
    at cX.makeStatusError (.next/server/chunks/607.js:27:51395)
    at cX.makeRequest (.next/server/chunks/607.js:27:54864)
    at async q.streamText (.next/server/chunks/136.js:1:2480)
    at async i.research (.next/server/chunks/641.js:541:227)
    at async m.searchAsync (.next/server/app/api/chat/route.js:1:13218) {
```
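The search case suggests the combined text of all retrieved results is being concatenated into the prompt unchecked. One possible shape of a fix is to divide a fixed context window across sources up front; the sketch below uses illustrative numbers (window split, reserved-token count, and source count are my assumptions, not Perplexica's actual values):

```python
def budget_per_source(context_window: int, reserved_tokens: int, num_sources: int) -> int:
    """Divide what's left of the context window evenly across retrieved
    pages, so their combined text can never overflow the window."""
    available = context_window - reserved_tokens
    return max(0, available // max(1, num_sources))

# Illustrative numbers: 32k window, ~4k reserved for the system prompt
# and the answer, 14 search results to summarize.
print(budget_per_source(32_768, 4_096, 14))  # 2048 tokens per page
```

With any such cap in place, the request size scales with the context window rather than with however much text the scraper happens to pull down.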

To Reproduce
Steps to reproduce the behavior:

  1. Go to Discover
  2. Click on 'any story'
  3. The summarization fails with a context-size error (see logs above)

Expected behavior
Perplexica should summarize the page (and complete searches) while keeping each request within the model's context window, e.g. by chunking or truncating retrieved content.

Screenshots
See the Perplexica vs. OpenWebUI token-usage comparison above.


Labels: bug