Severe latency before thinking/tool execution in follow-up chats with Ollama, even when model/context are configured correctly #12151
None of the "delay" behaviors described are typical, expected, or reproducible on my end. Can you export a chat so I can try to reproduce? Can you try a screen recording so I can see the described behaviors? Testing with Ollama specifically, I am also seeing warm hits on model loads, with the models staying loaded between turns. Each response has low latency between turns. In fact, even 50 turns later, with web search enabled, the latency is still near instant on my end.
In those apps, do you also enable tools when testing long conversations? Your librechat.yaml config would also help, to see how Ollama is set up on your end.
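For reference, an Ollama setup in librechat.yaml typically looks something like this. This is a hypothetical sketch only: the baseURL, model name, and option values are placeholders, not a confirmed or recommended configuration.

```yaml
# Hypothetical sketch -- not the reporter's actual config.
endpoints:
  custom:
    - name: "Ollama"
      apiKey: "ollama"          # Ollama ignores the value, but the field must be set
      baseURL: "http://host.docker.internal:11434/v1"
      models:
        default: ["Thinking3.5:latest"]
        fetch: true
      titleConvo: true
      titleModel: "current_model"
```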
Thanks, that helps narrow it down. https://opnviking.cloud/seafhttp/f/d18fd32eea0d40918b2a/?op=view
The interesting signal in your write-up is that the delay happens before visible tool execution, which points less to Ollama inference itself and more to the orchestration path around the request. When follow-up turns get slower while first turns stay fast, history preparation, retrieval planning, reranker setup, or internal agent state assembly become stronger suspects than the model host alone.
What happened?
Hi,
I am testing LibreChat as a local-only daily driver with Ollama, and I am seeing a consistent latency problem that I need help understanding.
Summary
LibreChat is fast on the first message in a new chat, but follow-up messages in the same chat become much slower.
The delay happens before visible thinking / tool execution starts.
This is not an Ollama host performance problem, because the same model on the same dedicated Ollama host responds quickly in AnythingLLM and Open WebUI.
At this point I need to understand whether this is:
Environment
Model / endpoint
Using Ollama with:
Thinking3.5:latest (qwen3.5:122b-a10b-q4_K_M). I also tested a larger context preset in librechat.yaml.
What I observe
Behavior
Important detail
This happens even when:
the modelSpecs preset is loaded correctly
maxContextTokens is increased to 262144
Why I do not think this is Ollama
Relevant logs
Case 1: smaller effective context in earlier test
LibreChat showed:
This suggested context pressure in follow-up chats.
Case 2: after creating a preset with large context
LibreChat debug log shows the preset is loaded:
And later during the actual request:
So the larger context is now genuinely in use, but the latency remains.
Runtime path seems agent/tool-based even though I did not create an Agent
I did not manually create an Agent, but the logs still show:
So it looks like this chat flow is internally going through the agent/tool orchestration path.
Timing from logs
Example from one run:
This means:
So a major part of the delay seems to happen before actual search/tool execution.
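To put numbers on that pre-tool gap, the ISO timestamps in the debug log can be diffed directly. A minimal sketch, assuming log lines that merely start with a LibreChat-style timestamp; the two sample lines below are invented stand-ins, not real output:

```python
import re
from datetime import datetime

# Invented stand-in lines shaped like LibreChat debug output;
# substitute real lines from the debug log files.
lines = [
    '2026-03-09T05:02:43.712Z debug: [BaseClient] userMessage saved',
    '2026-03-09T05:03:19.004Z debug: ON_TOOL_EXECUTE web_search',
]

# Match the leading "YYYY-MM-DDTHH:MM:SS.mmmZ" timestamp.
TS = re.compile(r'^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z')

def parse_ts(line):
    m = TS.match(line)
    return datetime.strptime(m.group(1), '%Y-%m-%dT%H:%M:%S.%f') if m else None

first, second = (parse_ts(l) for l in lines)
delay = (second - first).total_seconds()
print(f'gap before tool execution: {delay:.3f}s')
# prints: gap before tool execution: 35.292s
```

Running this over consecutive turns of the same conversation would show whether the pre-tool gap grows with thread length.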
Then after that, there is heavy reranking:
with very large Jina token usage.
Main question
Can you clarify:
Why follow-up chats become dramatically slower than the first message?
Why this flow appears to use the internal agent/runtime path even without an explicitly created Agent?
Which component is responsible for the long pre-tool delay:
Is this expected behavior when web search/tools are enabled?
Is there a way to force a simpler / more direct chat path for Ollama-based chats?
Why this matters
I am evaluating LibreChat as a production daily driver for:
If this latency is expected and not fixable, LibreChat is probably not the right fit for my workflow.
I would really appreciate a developer-level explanation instead of guesswork, because I have now tested:
Thanks.
Version Information
westcoast-dk@Docker01 LibreChat % docker images | grep librechat
registry.librechat.ai/danny-avila/librechat-dev latest 9f4f6ecc06bd 3 hours ago 2.02GB
registry.librechat.ai/danny-avila/librechat-dev eb0ab169c598 27 hours ago 2.02GB
librechat-jina-reranker-api-jina-api latest 92d953c647cf 6 days ago 1.56GB
ghcr.io/danny-avila/librechat-rag-api-dev latest 880c859fa1d7 7 days ago 3.37GB
registry.librechat.ai/danny-avila/librechat-rag-api-dev-lite latest c632f3ecb6c9 7 days ago 1.52GB
westcoast-dk@Docker01 LibreChat %
32cadb1
Steps to Reproduce
Run LibreChat with:
Configure the model used in LibreChat as:
Thinking3.5:latest
Start a new chat in LibreChat.
Send an initial prompt that triggers web search or tool usage.
Observe that the first message starts relatively quickly.
In the same chat/thread, send one or more follow-up prompts.
Observe that each follow-up message becomes significantly slower before “thinking” / tool execution begins.
Check LibreChat debug logs during and after the request, for example:
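A hedged illustration of such a check; the sample log lines here are fabricated to show the technique, and the real format is whatever the debug log files contain:

```shell
# Build a stand-in log; in practice, grep the real debug log files instead.
log=$(mktemp)
cat > "$log" <<'EOF'
2026-03-09T05:02:43.712Z debug: [saveConvo] user message saved
2026-03-09T05:03:19.004Z debug: ON_TOOL_EXECUTE web_search
2026-03-09T05:03:22.118Z debug: onSearchResults received results
EOF

# Print only the marker lines, with line numbers, so the timestamps line up.
grep -nE 'saveConvo|ON_TOOL_EXECUTE|onSearchResults' "$log"
rm -f "$log"
```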
Compare timestamps around:
saveConvo, ON_TOOL_EXECUTE, onSearchResults
Actual Result
First message in a new chat starts in about 5–6 seconds.
Follow-up messages in the same thread become much slower.
There can be ~30s to ~56s of silence before visible thinking/tool execution starts.
Logs show the request going through:
ResumableAgentController, AgentContext, AgentStream, ON_TOOL_EXECUTE
A noticeable delay occurs before ON_TOOL_EXECUTE.
In some runs, the logs also show heavy reranking after search results.
Expected Result
Additional Notes
maxContextTokens in librechat.yaml was reflected in logs, but did not resolve the latency.
What browsers are you seeing the problem on?
No response
Relevant log output
From /app/logs/debug-2026-03-09.log (grep context; "--" marks an elided gap):

2026-03-09T05:02:43.712Z debug: [BaseClient] Truncated tool call outputs:
[1]
2026-03-09T05:02:43.713Z debug: [BaseClient] Context Count (1/2)
{
  remainingContextTokens: 12332,
  maxContextTokens: 16200,
}
2026-03-09T05:02:43.713Z debug: [BaseClient] Context Count (2/2)
{
  remainingContextTokens: 12332,
  maxContextTokens: 16200,
}
2026-03-09T05:02:43.713Z debug: [BaseClient] tokenCountMap:
{
  8bed2d56-4269-4710-9c0b-beaa00936239: 85,
  5f387297-6a5e-432d-926e-b0111fbb3bdc: 1486,
--
2026-03-09T05:02:43.713Z debug: [BaseClient]
{
  promptTokens: 3868,
  remainingContextTokens: 12332,
  payloadSize: 7,
  maxContextTokens: 16200,
}
2026-03-09T05:02:43.713Z debug: [AgentContext] Applied context to agent: ollama__Thinking3.5__latest___Ollama
2026-03-09T05:02:43.713Z debug: [BaseClient] userMessage
{
  messageId: "905801ce-2be0-4967-91a5-e8039c3f333a",
Screenshots
No response