
Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17506

Updated tools/server/public_simplechat additionally with an initial go at a simple-minded, minimal markdown-to-HTML logic, so that if the AI model outputs markdown instead of plain text, the user gets a basic formatted view of it. If things don't seem right, the user can disable markdown processing from the settings in the UI.

Look into the previous PR #17451 in this series for details on the other features added to tools/server/public_simplechat,
like peeking into reasoning, working with vision models, as well as built-in support for a bunch of useful tool calls on the client side with minimal to no setup.

All features (except for PDF, which has a pypdf dependency) are implemented internally without depending on any external libraries, and in turn should fit within 50KB compressed. Created using pure HTML+CSS+JS in general, with Python additionally for simpleproxy, to bypass the CORS and related restrictions of the browser environment for direct web access.

Mimicking the received request in the generated request helps with DuckDuckGo also, and
not just Yahoo.

Also update allowed.domains to allow a URL generated by the AI when
trying to access Bing's news aggregation URL.
Use DOMParser's parseFromString in text/html mode rather than text/xml,
as it is more relaxed and one needn't worry about XML special characters
like & et al.

Instead of simply concatenating the tool call id, name and result,
now use the browser's DOM logic to create the XML structure used, for
now, to store these within the content field.

This should take care of transforming / escaping any XML special
characters in the result, so that extracting them later for placing
into different fields in the server handshake doesn't cause any
problems.
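A minimal sketch of this approach, with illustrative element names (the actual tags used by simplechat may differ):

```js
// Hypothetical sketch: wrap a tool call id, name and result in XML via the DOM,
// so special characters like & and < get escaped automatically on serialization.
function toolcall_result_to_xml(id, name, result) {
    const doc = document.implementation.createDocument(null, "tool_response", null);
    for (const [tag, value] of [["id", id], ["name", name], ["result", result]]) {
        const el = doc.createElement(tag);
        el.textContent = value; // the DOM handles escaping of XML special chars
        doc.documentElement.appendChild(el);
    }
    return new XMLSerializer().serializeToString(doc);
}

// Extraction uses DOMParser in text/html mode, which is more forgiving than text/xml.
function toolcall_result_from_xml(content) {
    const doc = new DOMParser().parseFromString(content, "text/html");
    return {
        id: doc.querySelector("id")?.textContent ?? "",
        name: doc.querySelector("name")?.textContent ?? "",
        result: doc.querySelector("result")?.textContent ?? "",
    };
}
```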
Bing raised a challenge for Chrome-triggered search requests after a
few requests, which were spread a few minutes apart, while still
seemingly allowing wget-based searches to continue (again spread a
few minutes apart).

Added a simple helper to trace this; use --debug True to enable it.

Avoid a logically duplicate debug log.
Instead of always enforcing explicit user-triggered tool calling,
the user is now given the option of either explicit user-triggered
tool calling or auto-triggering after showing the tool details
for a user-specified number of seconds.

NOTE: The current logic doesn't account for the user clicking the buttons
before the auto-click triggers; the auto-clicks need to be cancelled if the
user triggers first, i.e. in future.
Also clean up the existing toolResponseTimeout timer to use the
same structure and a similar flow convention.
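A minimal sketch of the intended auto-trigger flow; the names used here (schedule_autoclick, gAutoClickTimer, btnSubmit) are illustrative, not the actual identifiers:

```js
// Hypothetical sketch: schedule an automatic click on the tool-call submit button after a
// user-specified delay; keeping the timer id around is what would let a future change
// cancel the auto-click if the user clicks first.
let gAutoClickTimer = null;

function schedule_autoclick(btnSubmit, delaySecs) {
    if (gAutoClickTimer !== null) {
        clearTimeout(gAutoClickTimer);
    }
    gAutoClickTimer = setTimeout(() => {
        gAutoClickTimer = null;
        btnSubmit.click();
    }, delaySecs * 1000);
}
```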
Identified by the llama.cpp editorconfig check:

* Convert tabs to spaces in the JSON config file
* Remove the extra space at end of line

Add the missing newline at the end of the closing-bracket line of the JSON config file.
Include info about the auto option within tools.

Use non-wrapped text for certain sections, so that the markdown
README can be viewed properly with respect to the structure of its
content.
Split the browser JS web-worker-based tool calls from the web-related
tool calls.
Remove the unneeded stuff (belonging to the other file) from the tooljs
and toolweb files.

Update the tools manager to make use of the new toolweb module.
Initial go at implementing a web search tool call, which uses the
existing UrlText support of the bundled simpleproxy.py.

It allows the user to control which search engine is used, by letting
them set the search engine URL template.

The logic comes with search engine URL template strings for
DuckDuckGo, Brave, Bing and Google, with DuckDuckGo set as the default.
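A rough sketch of the template mechanism; the placeholder syntax and exact default URLs here are assumptions, not necessarily what the code ships with:

```js
// Illustrative search engine URL templates; ${QUERY} is a placeholder convention assumed here.
const gSearchUrlTemplates = {
    duckduckgo: "https://duckduckgo.com/html/?q=${QUERY}",
    brave: "https://search.brave.com/search?q=${QUERY}",
    bing: "https://www.bing.com/search?q=${QUERY}",
    google: "https://www.google.com/search?q=${QUERY}",
};

// Substitute the user's query into the selected template before handing the URL
// to the simpleproxy.py UrlText path.
function build_search_url(template, query) {
    return template.replace("${QUERY}", encodeURIComponent(query));
}
```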
Avoid code duplication by creating helpers for setup and tool call.

Also send an indication of the path that will be used, when checking
at runtime setup whether the simpleproxy.py server is running.
If using Wikipedia or the like, remember to have a sufficient context
window in general, for the AI engine as well as for the handshake / chat
endpoint.
Moved it into Me->tools, so that the end user can modify it as
required from the settings UI.

TODO: Currently, if a tool call response arrives after the tool call has
timed out and the user has submitted the default timed-out error response,
the delayed actual response may overwrite any new content in the user
query box when it arrives; this needs to be tackled.
Now both follow a similar mechanism and do the following:

* Exit on finding any issue, so that things are in a known
  state from a usage perspective, without any confusion or oversight.

* Check whether the cmdlineArgCmd/configCmd being processed is a known
  one or not.

* Check that the value of the cmd is of the expected type.

* Have a generic flow which can accommodate more cmds in future
  in a simple way.
Ensure load_config gets called on encountering --config on the cmdline,
so that the user has control over whether the cmdline or the config file
decides the final value of any given parameter.

Ensure that str-type values on the cmdline are picked up directly, without
running them through ast.literal_eval, because otherwise one would have to
ensure, throughout the cmdline arg mechanism, that the string quotes are
retained for literal_eval.

Have the """ function note/description below def line immidiately
so that it is interpreted as a function description.
Add a config entry called bearer.insecure, which will contain a
token used for bearer auth of HTTP requests.

Make bearer.insecure and allowed.domains required configs, and
exit the program if they aren't obtained through the cmdline or
config file.
As noted in the comments in the code, this is a very insecure flow
for now.
Next will be adding a proxyAuth field to tools as well.
The user can configure the bearer token to send.

Instead of using the shared bearer token as-is, hash it with the
current year and use the hash.

Keep the /aum path out of the auth check.

In future, the bearer token could be transformed more often, as well as
with an additional nonce/dynamic token obtained from the server during the
initial /aum handshake, as also a running counter and so on ...

NOTE: All this circus is not good enough, given that currently the
simpleproxy.py handshakes work over HTTP. However, these skeletons are
put in place for the future, if needed.

TODO: There is a once-in-a-blue-moon race when the year transitions
between the client generating the request and the server handling it.
But otherwise year transitions don't matter, because the client always
creates a fresh token, and the server checks for a year change to
generate a fresh token if required.
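A client-side sketch of such a year-hashed token, assuming SHA-256 and hex encoding (the actual transform may differ); note that crypto.subtle is only available in a secure context (https or localhost):

```js
// Hypothetical sketch: derive the bearer token sent to simpleproxy.py by hashing the
// shared secret together with the current year, instead of sending the secret as-is.
async function bearer_token_for_year(sharedSecret) {
    const year = new Date().getFullYear();
    const data = new TextEncoder().encode(`${sharedSecret}:${year}`);
    const digest = await crypto.subtle.digest("SHA-256", data);
    return Array.from(new Uint8Array(digest))
        .map((b) => b.toString(16).padStart(2, "0"))
        .join("");
}
```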
Add a new role, ToolTemp, which is used to maintain any tool call
response on the client UI side without submitting it to the server,
i.e. until the user or the auto-submit triggers the submission of that
tool call response.

Whenever a tool call response is received, create a ToolTemp-role-based
message in the corresponding chat session. And don't directly update
the user query input area; rather, leave it to the updated SimpleChat
show and the new MultiChatUI chat_show helper, and in turn to whether
the chat session currently active in the UI is the same as the one for
which the tool call response has been received.

TODO: Currently the response message is added to the currently
active chat session, but this needs to change: track the
chatId/session through the full tool call cycle, add the tool call
response to the related chat session, and in turn update the UI or not
based on whether that chat session is still the active chat session
in the UI, given that the tool call gets handled asynchronously.

Now, when that tool call response is submitted, promote the equivalent
ToolTemp-role-based message, which should be the last message in the
session's chat history, into a normal tool response message.

SimpleChat.show has been updated to take care of showing any
ToolTemp role message in the user query input area.

A newer chat_show helper has been added to MultiChatUI, which takes care
of calling SimpleChat.show, provided chat_show is being requested
for the chat session currently active in the UI, as well as of
passing both the ChatDiv and elInUser. Users of
SimpleChat.show have been converted to use MultiChatUI.chat_show.
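A simplified sketch of the chat_show idea; member names like curChatId, simpleChats, elDivChat and elInUser are assumed here for illustration:

```js
// Hypothetical sketch: only call SimpleChat.show when the requested session is the one
// currently active in the UI, passing along the chat div and the user input element.
class MultiChatUI {
    chat_show(chatId) {
        if (chatId !== this.curChatId) {
            return; // not the session shown in the UI; skip redrawing
        }
        this.simpleChats[chatId].show(this.elDivChat, this.elInUser);
    }
}
```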
Update the immediate tool-call-triggering-failure and tool-call-response
timeout paths to use the new ToolTemp and MultiChatUI
based chat show logic.

Errors generated by the actual tool call itself are already handled
by the previous commit's changes.
Pass the chatId to the tool call, and use the chatId in the received
tool call response to decide which chat session the async tool call
response belongs to, and in turn whether the auto-submit timer should
be started, if auto is enabled.
This should ensure that tool call responses can be mapped back to
the chat session for which they were triggered.
Avoid separate, duplicated logic for creating the div+label+el based
element, so there is slightly better type checking and less extra code.
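A minimal sketch of such a shared helper (name and signature are illustrative):

```js
// Hypothetical helper: wrap a label and an arbitrary input-like element in a single div,
// so the settings UI builders don't each repeat the same DOM boilerplate.
function ui_div_label_el(labelText, el) {
    const div = document.createElement("div");
    const label = document.createElement("label");
    label.textContent = labelText;
    div.appendChild(label);
    div.appendChild(el);
    return div;
}
```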
Try to identify headings and blocks in markdown and convert them
into their HTML equivalents.

Show the result in the chat message blocks.
Remove the markdown heading markers.

Fix the pre-equivalent blocks of markdown, given that they can have
the block type following the ``` marker.

Remember to add a line break at the end of each line within a pre block.
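A sketch of the heading handling, assuming the usual # markers map to h1..h6 (the actual regex and output may differ):

```js
// Convert a markdown heading line into an HTML heading, dropping the # markers.
function markdown_heading_to_html(line) {
    const m = line.match(/^(#{1,6})\s+(.*)$/);
    if (m === null) {
        return null; // not a heading line
    }
    const level = m[1].length;
    return `<h${level}>${m[2]}</h${level}>`;
}
```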
Ensure '---' is treated as a horizontal rule and doesn't mess with
the unordered list handling.

Take care of unwinding the unordered list everywhere it is needed;
also simplify the flow by using the same logic for emitting the list
content.
Allow for the other valid character-based markers for horizontal rules
and unordered lists.

?Also allow for spaces after the horizontal rule marker, on the same line?
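A sketch of what the marker matching could look like; the exact regexes in the actual code may differ:

```js
// Horizontal rules can use -, * or _ repeated at least three times, optionally space separated.
const reHorizontalRule = /^ {0,3}([-*_])( *\1){2,} *$/;
// Unordered list items can start with -, * or + followed by at least one space.
const reUnorderedItem = /^ {0,3}([-*+]) +(.*)$/;
```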
Allow a fenced code block / pre to be demarcated using either ```
or ~~~.

Ensure the termination line of a fenced block doesn't contain anything
else.

The same starting marker needs to be present at the end as well.
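A sketch of fence handling under those rules (function names are illustrative):

```js
// A fenced block starts with ``` or ~~~, optionally followed by a language/block type.
function fence_start(line) {
    const m = line.match(/^(```|~~~)(.*)$/);
    return m ? { marker: m[1], lang: m[2].trim() } : null;
}

// The terminating line must repeat the same marker and contain nothing else.
function fence_end(line, marker) {
    return line.trim() === marker;
}
```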
Rather, this won't work; need to refresh on regex, it's been too long.

Rather, using split should be simpler.

However, the extraction of the head and body parts, with the separator
in between for the transition, should work.

Rather, the separator is blindly assumed and the corresponding line is
discarded for now.
Switch to the simpler split-based flow.

Include a tr in the table head block also.

Add a CSS entry to have the header cell text align
to the left for now, given that there are no borders, colour shading
or other distinguishing characteristics for the table cells yet.
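A rough sketch of that split-based flow (structure assumed; the real implementation may differ in details such as empty-cell handling):

```js
// Convert a block of markdown table lines: first line is the header row, second is the
// separator (blindly assumed and discarded), the rest are body rows. Cells split on "|".
function markdown_table_to_html(lines) {
    const [head, _sep, ...body] = lines;
    const row_html = (line, cellTag) =>
        "<tr>" +
        line.split("|")
            .map((c) => c.trim())
            .filter((c) => c.length > 0)
            .map((c) => `<${cellTag}>${c}</${cellTag}>`)
            .join("") +
        "</tr>";
    let html = "<table><thead>" + row_html(head, "th") + "</thead><tbody>";
    for (const line of body) {
        html += row_html(line, "td");
    }
    return html + "</tbody></table>";
}
```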
The user can enable or disable the simple-minded brute-force markdown
parsing from the per-session settings.

Add grey shading and left-aligned text for the table headings of
markdown-to-HTML-converted tables.
Save a copy of the data being processed.

Try to sanitize the data passed for markdown-to-HTML conversion,
so that any HTML special characters in the passed markdown content
get translated into harmless text.

This also ensures that such text doesn't disappear because of the
browser trying to interpret it as HTML-tagged content.

Trap any errors during sanitizing and/or processing of the lines
in general, and push them into an errors array. Callers of this
markdown class can decide whether to use the converted HTML or
not based on whether the errors array is empty, or ...
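A minimal sketch of that sanitization step (the actual set of characters handled may differ):

```js
// Replace characters the browser would otherwise treat as HTML markup, so that
// markdown text containing <, >, & or quotes survives as visible text.
function html_escape(text) {
    return text
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;")
        .replace(/'/g, "&#39;");
}
```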

Move the processing of unordered lists into a function of its own.
The ordered list can also use the same flow in general, except
for some tiny changes, potentially including to the regex.
@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary

Project: llama.cpp
Versions Compared: acf457e6-fec8-49e0-af2c-73481a3746f2 vs aab9b31c-ad35-48ba-b9fe-4c0fd3dc2df2


Summary

This version introduces extensive web UI enhancements to the SimpleChat frontend without modifying core llama.cpp inference engine code. Three utility binaries were removed from the build configuration. No function-level performance changes were detected in core libraries. The changes are confined to the tools/server/public_simplechat/ directory, implementing tool calling, vision support, reasoning display, and markdown rendering capabilities entirely client-side.

Power Consumption Changes:

  • build.bin.llama-cvector-generator: Reduced by 278999 nJ (removed from build)
  • build.bin.llama-run: Reduced by 245370 nJ (removed from build)
  • build.bin.llama-tts: Reduced by 285154 nJ (removed from build)
  • build.bin.libllama.so: Changed by -0.35 nJ (negligible)
  • All other binaries: No measurable change

Inference Performance Impact:
No functions in the tokenization or inference paths were modified. Functions llama_decode, llama_encode, llama_tokenize, llama_model_load_from_file, and other performance-critical components show zero change in response time and throughput. Tokens per second remains unaffected as no inference engine modifications occurred.

The removed binaries represent standalone utilities for control vector generation, inference running, and text-to-speech functionality, not core inference components.

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #327

Overview

This PR introduces comprehensive web UI enhancements to tools/server/public_simplechat without modifying the core llama.cpp inference engine. Analysis confirms zero impact on performance-critical paths.

Performance Impact Assessment

Core Inference Functions: No changes detected

  • llama_decode: 44,752,344 ns response time (0% change, 0 ns delta)
  • llama_tokenize: 899,200 ns response time (0% change, 0 ns delta)
  • llama_encode: Not modified
  • llama_batch_init: 252 ns response time (0% change, 0 ns delta)

Tokens Per Second Impact: None. The reference model (smollm:135m on 12th Gen Intel i7-1255U) maintains baseline performance as no tokenization or inference functions were modified.

Power Consumption Analysis: Negligible changes across all binaries

  • build.bin.libllama.so: 228,844 nJ (delta: -0.45 nJ, -0.0% change)
  • build.bin.llama-cvector-generator: 278,999 nJ (delta: -0.30 nJ, -0.0% change)
  • build.bin.llama-run: 245,370 nJ (delta: +0.13 nJ, +0.0% change)
  • All other binaries (libggml-base.so, libggml-cpu.so, libggml.so, libmtmd.so, llama-bench, llama-quantize, etc.): 0 nJ change

Code Changes Summary

Scope: 380 commits, 6,395 additions, 799 deletions across 29 files in tools/server/public_simplechat/

Implementation: Client-side web interface with:

  • Multi-session chat management with per-session configuration
  • Tool calling framework using Web Workers for isolation
  • Vision support via base64-encoded image data URLs
  • Markdown rendering and reasoning display
  • IndexedDB persistence for chat history
  • Optional Python proxy server for web access tools

Architecture: Modular JavaScript implementation with classes for message handling (NSChatMessage, ChatMessageEx), session management (SimpleChat), UI orchestration (MultiChatUI), and tool coordination (ToolsManager).

The changes are entirely isolated to the web UI layer, utilizing existing /chat/completions and /completions HTTP endpoints without modifications to request handling or server binary.

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #327

Overview

This PR introduces 380 commits across 29 files, adding 6,395 lines and removing 799 lines. All changes are confined to the tools/server/public_simplechat/ directory, consisting entirely of JavaScript, HTML, CSS, and Python proxy server modifications. No C++ source files, build system configurations, or core llama.cpp inference components were modified.

Performance Impact

Zero measurable performance impact on core llama.cpp binaries.

All performance-critical functions show no change:

  • llama_decode: 44,752,504 ns (0% change)
  • llama_encode: 11,254,049 ns (0% change)
  • llama_tokenize: 899,206 ns (0% change)
  • ggml_graph_compute: 1,358,852 ns (0% change)

Power consumption analysis across all binaries shows variations within compiler optimization noise:

  • libllama.so: +0.355 nJ
  • llama-cvector-generator: -0.274 nJ
  • llama-run: +0.158 nJ
  • llama-tts: -0.0003 nJ

Tokens per second: No impact. Since llama_decode, llama_encode, and llama_tokenize response times remain unchanged, inference throughput is unaffected.

Code Changes

The PR transforms the SimpleChat web UI from a basic interface into a feature-rich client supporting tool calling, vision models, reasoning display, markdown rendering, and multi-session management. Changes include:

  • Class-based JavaScript architecture replacing functional approach
  • Tool calling system with Web Worker isolation
  • Python proxy server for CORS bypass
  • IndexedDB-based session persistence
  • Client-side markdown rendering
  • Vision support with base64 image handling

All functionality operates in the browser client layer with no modifications to server-side inference paths.

@loci-dev force-pushed the main branch 4 times, most recently from 92ef8cd to 7dd50b8 on November 26, 2025 at 16:10.