Severe latency before thinking/tool execution in follow-up chats with Ollama, even when model/context are configured correctly #12151
None of the "delay" behaviors described are typical, expected, or reproducible on my end. Can you export a chat so I can try to reproduce? Can you try a screen recording so I can see the described behaviors? Testing with Ollama specifically, I am also seeing warm hits on model loads, with the models staying loaded between turns. Each response has low latency between turns. In fact, even 50 turns later, with web search enabled, the latency is still near instant on my end.
In those apps, do you also enable tools when testing long conversations? Your librechat.yaml config would also help, to see how Ollama is set up on your end.
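For reference, an Ollama setup in librechat.yaml typically looks something like this. This is a hypothetical sketch only: the baseURL, model name, and option values are placeholders, not a confirmed or recommended configuration.

```yaml
# Hypothetical sketch -- not the reporter's actual config.
endpoints:
  custom:
    - name: "Ollama"
      apiKey: "ollama"          # Ollama ignores the value, but the field must be set
      baseURL: "http://host.docker.internal:11434/v1"
      models:
        default: ["Thinking3.5:latest"]
        fetch: true
      titleConvo: true
      titleModel: "current_model"
```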
Thanks, that helps narrow it down. https://opnviking.cloud/seafhttp/f/d18fd32eea0d40918b2a/?op=view
The interesting signal in your write-up is that the delay happens before visible tool execution, which points less to Ollama inference itself and more to the orchestration path around the request. When follow-up turns get slower while first turns stay fast, history preparation, retrieval planning, reranker setup, or internal agent state assembly become stronger suspects than the model host alone.
What happened?
Hi,
I am testing LibreChat as a local-only daily driver with Ollama, and I am seeing a consistent latency problem that I need help understanding.
Summary
LibreChat is fast on the first message in a new chat, but follow-up messages in the same chat become much slower.
The delay happens before visible thinking / tool execution starts.
This is not an Ollama host performance problem, because the same model on the same dedicated Ollama host responds quickly in AnythingLLM and Open WebUI.
At this point I need to understand whether this is:
Environment
Model / endpoint
Using Ollama with:
Thinking3.5:latest (qwen3.5:122b-a10b-q4_K_M). I also tested a larger context preset in librechat.yaml.
What I observe
Behavior
Important detail
This happens even when:
the modelSpecs preset is loaded correctly
maxContextTokens is increased to 262144
Why I do not think this is Ollama
Relevant logs
Case 1: smaller effective context in earlier test
LibreChat showed:
This suggested context pressure in follow-up chats.
Case 2: after creating a preset with large context
LibreChat debug log shows the preset is loaded:
And later during the actual request:
So the larger context is now genuinely in use, but the latency remains.
Runtime path seems agent/tool-based even though I did not create an Agent
I did not manually create an Agent, but the logs still show:
So it looks like this chat flow is internally going through the agent/tool orchestration path.
Timing from logs
Example from one run:
This means:
So a major part of the delay seems to happen before actual search/tool execution.
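To put numbers on that pre-tool gap, the ISO timestamps in the debug log can be diffed directly. A minimal sketch, assuming log lines that merely start with a LibreChat-style timestamp; the two sample lines below are invented stand-ins, not real output:

```python
import re
from datetime import datetime

# Invented stand-in lines shaped like LibreChat debug output;
# substitute real lines from the debug log files.
lines = [
    '2026-03-09T05:02:43.712Z debug: [BaseClient] userMessage saved',
    '2026-03-09T05:03:19.004Z debug: ON_TOOL_EXECUTE web_search',
]

# Match the leading "YYYY-MM-DDTHH:MM:SS.mmmZ" timestamp.
TS = re.compile(r'^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z')

def parse_ts(line):
    m = TS.match(line)
    return datetime.strptime(m.group(1), '%Y-%m-%dT%H:%M:%S.%f') if m else None

first, second = (parse_ts(l) for l in lines)
delay = (second - first).total_seconds()
print(f'gap before tool execution: {delay:.3f}s')
# prints: gap before tool execution: 35.292s
```

Running this over consecutive turns of the same conversation would show whether the pre-tool gap grows with thread length.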
Then after that, there is heavy reranking:
with very large Jina token usage.
Main question
Can you clarify:
Why follow-up chats become dramatically slower than the first message?
Why this flow appears to use the internal agent/runtime path even without an explicitly created Agent?
Which component is responsible for the long pre-tool delay:
Is this expected behavior when web search/tools are enabled?
Is there a way to force a simpler / more direct chat path for Ollama-based chats?
Why this matters
I am evaluating LibreChat as a production daily driver for:
If this latency is expected and not fixable, LibreChat is probably not the right fit for my workflow.
I would really appreciate a developer-level explanation instead of guesswork, because I have now tested:
Thanks.
Version Information
westcoast-dk@Docker01 LibreChat % docker images | grep librechat
registry.librechat.ai/danny-avila/librechat-dev latest 9f4f6ecc06bd 3 hours ago 2.02GB
registry.librechat.ai/danny-avila/librechat-dev eb0ab169c598 27 hours ago 2.02GB
librechat-jina-reranker-api-jina-api latest 92d953c647cf 6 days ago 1.56GB
ghcr.io/danny-avila/librechat-rag-api-dev latest 880c859fa1d7 7 days ago 3.37GB
registry.librechat.ai/danny-avila/librechat-rag-api-dev-lite latest c632f3ecb6c9 7 days ago 1.52GB
westcoast-dk@Docker01 LibreChat %
32cadb1
Steps to Reproduce
Run LibreChat with:
Configure the model used in LibreChat as:
Thinking3.5:latest
Start a new chat in LibreChat.
Send an initial prompt that triggers web search or tool usage.
Observe that the first message starts relatively quickly.
In the same chat/thread, send one or more follow-up prompts.
Observe that each follow-up message becomes significantly slower before “thinking” / tool execution begins.
Check LibreChat debug logs during and after the request, for example:
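A hedged illustration of such a check; the sample log lines here are fabricated to show the technique, and the real format is whatever the debug log files contain:

```shell
# Build a stand-in log; in practice, grep the real debug log files instead.
log=$(mktemp)
cat > "$log" <<'EOF'
2026-03-09T05:02:43.712Z debug: [saveConvo] user message saved
2026-03-09T05:03:19.004Z debug: ON_TOOL_EXECUTE web_search
2026-03-09T05:03:22.118Z debug: onSearchResults received results
EOF

# Print only the marker lines, with line numbers, so the timestamps line up.
grep -nE 'saveConvo|ON_TOOL_EXECUTE|onSearchResults' "$log"
rm -f "$log"
```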
Compare timestamps around:
saveConvo, ON_TOOL_EXECUTE, onSearchResults
Actual Result
First message in a new chat starts in about 5–6 seconds.
Follow-up messages in the same thread become much slower.
There can be ~30s to ~56s of silence before visible thinking/tool execution starts.
Logs show the request going through:
ResumableAgentController, AgentContext, AgentStream, ON_TOOL_EXECUTE
A noticeable delay occurs before ON_TOOL_EXECUTE.
In some runs, the logs also show heavy reranking after search results.
Expected Result
Additional Notes
maxContextTokens in librechat.yaml was reflected in logs, but did not resolve the latency.
What browsers are you seeing the problem on?
No response
Relevant log output
From /app/logs/debug-2026-03-09.log (grep context; "--" marks an elided gap):

2026-03-09T05:02:43.712Z debug: [BaseClient] Truncated tool call outputs:
[1]
2026-03-09T05:02:43.713Z debug: [BaseClient] Context Count (1/2)
{
  remainingContextTokens: 12332,
  maxContextTokens: 16200,
}
2026-03-09T05:02:43.713Z debug: [BaseClient] Context Count (2/2)
{
  remainingContextTokens: 12332,
  maxContextTokens: 16200,
}
2026-03-09T05:02:43.713Z debug: [BaseClient] tokenCountMap:
{
  8bed2d56-4269-4710-9c0b-beaa00936239: 85,
  5f387297-6a5e-432d-926e-b0111fbb3bdc: 1486,
--
2026-03-09T05:02:43.713Z debug: [BaseClient]
{
  promptTokens: 3868,
  remainingContextTokens: 12332,
  payloadSize: 7,
  maxContextTokens: 16200,
}
2026-03-09T05:02:43.713Z debug: [AgentContext] Applied context to agent: ollama__Thinking3.5__latest___Ollama
2026-03-09T05:02:43.713Z debug: [BaseClient] userMessage
{
  messageId: "905801ce-2be0-4967-91a5-e8039c3f333a",
Screenshots
No response