From a1ebbe219dbf6a88d035b70b90fb05766488113c Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Mon, 31 Mar 2025 12:31:05 -0700 Subject: [PATCH 01/73] Add files via upload Signed-off-by: Brad Hutchings --- docs/Building-ls1.md | 133 +++++++++++++++++++++++++++ docs/Configuring-ls1.md | 195 ++++++++++++++++++++++++++++++++++++++++ docs/Packaging-ls1.md | 131 +++++++++++++++++++++++++++ 3 files changed, 459 insertions(+) create mode 100644 docs/Building-ls1.md create mode 100644 docs/Configuring-ls1.md create mode 100644 docs/Packaging-ls1.md diff --git a/docs/Building-ls1.md b/docs/Building-ls1.md new file mode 100644 index 0000000000000..3585f42bdc60f --- /dev/null +++ b/docs/Building-ls1.md @@ -0,0 +1,133 @@ +## Building llama-server + +Brad Hutchings
+brad@bradhutchings.com + +This file contains instructions for building `llama.cpp` with `cosmocc` to yield a `llama-server` executable that will run on multiple platforms. + +### Environment Variables + +Let's define some environment variables: +``` +BUILDING_DIR="1-BUILDING-llama.cpp" +printf "\n**********\n*\n* FINISHED: Environment Variables.\n*\n**********\n\n" +``` + +_Note that if you copy each code block from the guide and paste it into your terminal, each block ends with a message so you won't lose your place in this guide._ + +--- +### Build Dependencies +I build with a freshly installed Ubuntu 24.04 VM. Here are some packages that are helpful in creating a working build system. You may need to install more. +``` +sudo apt install -y git python3-pip build-essential zlib1g-dev \ + libffi-dev libssl-dev libbz2-dev libreadline-dev libsqlite3-dev \ + liblzma-dev tk-dev python3-tk cmake zip +printf "\n**********\n*\n* FINISHED: Build Dependencies.\n*\n**********\n\n" +``` + +--- +### Clone this Repo Locally +Clone this repo into a `~\llama.cpp` directory. +``` +cd ~ +git clone https://github.com/BradHutchings/llama-server-one.git $BUILDING_DIR +printf "\n**********\n*\n* FINISHED: Clone this Repo Locally.\n*\n**********\n\n" +``` + +**Optional:** Use the `work-in-progress` branch where I implement and test my own changes and where I test upstream changes from `llama.cpp`. +``` +cd ~/$BUILDING_DIR +git checkout work-in-progress +printf "\n**********\n*\n* FINISHED: Checkout work-in-progress.\n*\n**********\n\n" +``` + +--- +### Make llama.cpp +We use the old `Makefile` rather than CMake. We've updated the `Makefile` in this repo to build llama.cpp correctly. +``` +cd ~/$BUILDING_DIR +export LLAMA_MAKEFILE=1 +make clean +make +printf "\n**********\n*\n* FINISHED: Make llama.cpp.\n*\n**********\n\n" +``` + +If the build is successful, it will end with this message: + +    **NOTICE: The 'server' binary is deprecated. Please use 'llama-server' instead.** + +If the build fails and you've checked out the `work-in-progress` branch, well, it's in progess, so switch back to the `master` branch and build that. + +If the build fails on the `master` branch, please post a note in the [Discussions](https://github.com/BradHutchings/llama-server-one/discussions) area. + +#### List Directory + +At this point, you should see `llama-server` and other built binaries in the directory listing. +``` +ls -al +printf "\n**********\n*\n* FINISHED: List Directory.\n*\n**********\n\n" +``` + +--- +### Install Cosmo +``` +mkdir -p cosmocc +cd cosmocc +wget https://cosmo.zip/pub/cosmocc/cosmocc.zip +unzip cosmocc.zip +rm cosmocc.zip +cd .. +printf "\n**********\n*\n* FINISHED: Install Cosmo.\n*\n**********\n\n" +``` + +--- +### Prepare to make llama.cpp with Cosmo +``` +export PATH="$(pwd)/cosmocc/bin:$PATH" +export CC="cosmocc -I$(pwd)/cosmocc/include -L$(pwd)/cosmocc/lib" +export CXX="cosmocc -I$(pwd)/cosmocc/include \ + -I$(pwd)/cosmocc/include/third_party/libcxx \ + -L$(pwd)/cosmocc/lib" +export UNAME_S="cosmocc" +export UNAME_P="cosmocc" +export UNAME_M="cosmocc" +printf "\n**********\n*\n* FINISHED: Prepare to make llama.cpp with Cosmo.\n*\n**********\n\n" +``` + +--- +### Make llama.cpp with Cosmo +``` +make clean +make +printf "\n**********\n*\n* FINISHED: Make llama.cpp with Cosmo\n*\n**********\n\n" +``` + +If the build is successful, it will end with this message: + +    **NOTICE: The 'server' binary is deprecated. 
Please use 'llama-server' instead.** + +If the build fails and you've checked out the `work-in-progress` branch, well, it's in progress, so switch back to the `master` branch and build that. + +If the build fails on the `master` branch, please post a note in the [Discussions](https://github.com/BradHutchings/llama-server-one/discussions) area. + +#### List Directory + +At this point, you should see `llama-server` and other built binaries in the directory listing. +``` +ls -al +printf "\n**********\n*\n* FINISHED: List Directory.\n*\n**********\n\n" +``` + +#### Verify Zip Archive + +`llama-server` is actually a zip archive with an "Actually Portable Executable" (APE) loader prefix. Let's verify the zip archive part: +``` +unzip -l llama-server +printf "\n**********\n*\n* FINISHED: Verify Zip Archive.\n*\n**********\n\n" +``` + +--- +### Configuring llama-server-one + +Now that you've built `llama-server`, you're ready to configure it as `llama-server-one`. Follow instructions in [Configuring-ls1.md](Configuring-ls1.md). + diff --git a/docs/Configuring-ls1.md b/docs/Configuring-ls1.md new file mode 100644 index 0000000000000..bacf2fac0dccf --- /dev/null +++ b/docs/Configuring-ls1.md @@ -0,0 +1,195 @@ +## Configuring llama-server-one + +Brad Hutchings
+brad@bradhutchings.com + +This file contains instructions for configuring the `llama-server-one` executable to make it ready to package for multiple platforms. + +--- +### Environment Variables + +Let's define some environment variables: +``` +BUILDING_DIR="1-BUILDING-llama.cpp" +CONFIGURING_DIR="2-CONFIGURING-llama-server-one" + +LLAMA_SERVER="llama-server" +LLAMA_SERVER_ONE="llama-server-one" +LLAMA_SERVER_ONE_ZIP="llama-server-one.zip" +DEFAULT_ARGS="default-args" +printf "\n**********\n*\n* FINISHED: Environment Variables.\n*\n**********\n\n" +``` + +--- +### Create Configuration Directory + +Next, let's create a directory where we'll configure `llama-server-one`: +``` +cd ~ +rm -r -f ~/$CONFIGURING_DIR +mkdir -p $CONFIGURING_DIR +cp ~/$BUILDING_DIR/$LLAMA_SERVER \ + ~/$CONFIGURING_DIR/$LLAMA_SERVER_ONE_ZIP + +cd ~/$CONFIGURING_DIR +printf "\n**********\n*\n* FINISHED: Create Configuration Directory.\n*\n**********\n\n" +``` + +--- +### Examine Contents of Zip Archive + +Look at the contents of the `llama-server-one` zip archive: +``` +unzip -l $LLAMA_SERVER_ONE_ZIP +printf "\n**********\n*\n* FINISHED: Examine Contents of Zip Archive.\n*\n**********\n\n" +``` + +--- +### Delete Extraneous Timezone Files + +You should notice a bunch of extraneous timezone related files in `/usr/*`. Let's get rid of those: +``` +zip -d $LLAMA_SERVER_ONE_ZIP "/usr/*" +printf "\n**********\n*\n* FINISHED: Delete Extraneous Timezone Files.\n*\n**********\n\n" +``` + +--- +### Verify Contents of Zip Archive + +Verify that these files are no longer in the archive: +``` +unzip -l $LLAMA_SERVER_ONE_ZIP +printf "\n**********\n*\n* FINISHED: Verify Contents of Zip Archive.\n*\n**********\n\n" +``` + +--- +### OPTIONAL: Create website Directory in Archive + +`llama.cpp` has a built in chat UI. If you'd like to provide a custom UI, you should add a `website` directory to the `llama-server-one` archive. `llama.cpp`'s chat UI is optimized for serving inside the project's source code. But we can copy the unoptimized source: +``` +mkdir -p website +cp -r ~/$BUILDING_DIR/examples/server/public_legacy/* website +zip -0 -r $LLAMA_SERVER_ONE_ZIP website/* +printf "\n**********\n*\n* FINISHED: Create website Directory in Archive.\n*\n**********\n\n" +``` + +#### OPTONAL: Verify website Directory in Archive + +Verify that the archive has your website: +``` +unzip -l $LLAMA_SERVER_ONE_ZIP +printf "\n**********\n*\n* FINISHED: Verify website Directory in Archive.\n*\n**********\n\n" +``` +--- +### Create default-args File + +A `default-args` file in the archive can specify sane default parameters. The format of the file is parameter name on a line, parameter value on a line, rinse, repeat. End the file with a `...` line to include user specified parameters. + +We don't yet support including the model inside the zip archive (yet). That has a 4GB size limitation on Windows anyway, as `.exe` files cannot exceed 4GB. So let's use an adjacent file called `model.gguf`. + +We will serve on localhost, port 8080 by default for safety. The `--ctx-size` parameter is the size of the context window. This is kinda screwy to have as a set size rather than a maximum because the `.gguf` files now have the training context size in metadata. We set it to 8192 to be sensible. +``` +cat << EOF > $DEFAULT_ARGS +-m +model.gguf +--host +127.0.0.1 +--port +8080 +--ctx-size +8192 +... 
+EOF +printf "\n**********\n*\n* FINISHED: Create Default args File.\n*\n**********\n\n" +``` + +#### OPTIONAL: Create default-args File with Website + +If you added a website to the archive, use this instead: +``` +cat << EOF > $DEFAULT_ARGS +-m +model.gguf +--host +127.0.0.1 +--port +8080 +--ctx-size +8192 +--path +/zip/website +... +EOF +printf "\n**********\n*\n* FINISHED: Create Default args File with Website.\n*\n**********\n\n" +``` + +--- +### Add default-args File to Archive + +Add the `default-args` file to the archive: +``` +zip -0 -r $LLAMA_SERVER_ONE_ZIP $DEFAULT_ARGS +printf "\n**********\n*\n* FINISHED: Add default-args File to Archive.\n*\n**********\n\n" +``` + +--- +### Verify default-args File in Archive + +Verify that the archive contains the `default-args` file: +``` +unzip -l $LLAMA_SERVER_ONE_ZIP +printf "\n**********\n*\n* FINISHED: Verify default-args File in Archive.\n*\n**********\n\n" +``` + +--- +### Remove .zip Extension + +Remove the `.zip` from our working file: +``` +mv $LLAMA_SERVER_ONE_ZIP $LLAMA_SERVER_ONE +printf "\n**********\n*\n* FINISHED: Remove .zip Extension.\n*\n**********\n\n" +``` + +--- +### Download Model + +Let's download a small model. We'll use Google Gemma 1B Instruct v3, a surprisingly capable tiny model. +``` +MODEL_FILE="Google-Gemma-1B-Instruct-v3-q8_0.gguf" +wget https://huggingface.co/bradhutchings/Brads-LLMs/resolve/main/models/$MODEL_FILE?download=true \ + --show-progress --quiet -O model.gguf +printf "\n**********\n*\n* FINISHED: Download Model.\n*\n**********\n\n" +``` + +--- +### Test Run + +Now we can test run `llama-server-one`, listening on localhost:8080. +``` +./$LLAMA_SERVER_ONE +``` + +After starting up and loading the model, it should display: + +**main: server is listening on http://127.0.0.1:8080 - starting the main loop**
+**srv update_slots: all slots are idle** + +Hit `ctrl-C` on your keyboard to stop it. + +--- +### Test Run on Public Interfaces + +If you'd like it to listen on all available interfaces so that you can connect from a browser on another computer, run: +``` +./$LLAMA_SERVER_ONE --host 0.0.0.0 +``` + +After starting up and loading the model, it should display: + +**main: server is listening on http://0.0.0.0:8080 - starting the main loop**
+**srv update_slots: all slots are idle** + +Hit `ctrl-C` on your keyboard to stop it. + +--- +Congratulations! You are ready to package your `llama-server-one` executable for deployment. Follow instructions in [Packaging-ls1.md](Packaging-ls1.md). diff --git a/docs/Packaging-ls1.md b/docs/Packaging-ls1.md new file mode 100644 index 0000000000000..8ce758783a302 --- /dev/null +++ b/docs/Packaging-ls1.md @@ -0,0 +1,131 @@ +## Packaging llama-server-one + +Brad Hutchings
+brad@bradhutchings.com + +This file contains instructions for packaging the `llama-server-one` executable for deployment. I'm using Ubuntu 24.04. + +--- +### Packaging Folder +Assuming you configured as instructed in [Configuring-ls1.md](Configuring-ls1.md), let's create a folder with everything you need to package for deployment. You can zip this folder to distribute your `llama-server-one`, model, and arguments file for use on any platform. + +--- +### Environment Variables +Let's define some environment variables: +``` +BUILDING_DIR="1-BUILDING-llama.cpp" +CONFIGURING_DIR="2-CONFIGURING-llama-server-one" +PACKAGING_DIR="3-PACKAGING-llama-server-one-deploy" +DEPLOY_ZIP="llama-server-one-deploy.zip" + +LLAMA_SERVER="llama-server" +LLAMA_SERVER_ONE="llama-server-one" +LLAMA_SERVER_ONE_EXE="llama-server-one.exe" +LLAMA_SERVER_ONE_ARGS="llama-server-one-args" +printf "\n**********\n*\n* FINISHED: Environment Variables.\n*\n**********\n\n" +``` + +--- +### Create Packaging Directory +Create a folder and copy `llama-server-one` into the new folder. +``` +# This should use variables for paths and filenames. So should the packaging instructions. +cd ~ +rm -r -f $PACKAGING_DIR $DEPLOY_ZIP +mkdir -p $PACKAGING_DIR +cd $PACKAGING_DIR +cp ~/$CONFIGURING_DIR/$LLAMA_SERVER_ONE . +printf "\n**********\n*\n* FINISHED: Create Packaging Directory.\n*\n**********\n\n" +``` + +--- +### Copy llama-server-one as .exe + +On Windows, this executable will need to be renamed to a `.exe` file. Since our executable is small, let's just make a copy of `llama-server-one` with the `.exe` extension. + +``` +cp $LLAMA_SERVER_ONE $LLAMA_SERVER_ONE_EXE +printf "\n**********\n*\n* FINISHED: Copy llama-server-one as .exe.\n*\n**********\n\n" +``` + +--- +### Copy Model File + +We have already downloaded a model in the [Configuring steps](Configuring-ls1.md). Let's copy that into our deploy directory. We'll use the model's original filename and make that work with the `llama-server-one-args` file (below). +``` +MODEL_FILE="Google-Gemma-1B-Instruct-v3-q8_0.gguf" +cp ~/$CONFIGURING_DIR/model.gguf $MODEL_FILE +printf "\n**********\n*\n* FINISHED: Copy Model File.\n*\n**********\n\n" +``` + +#### OPTIONAL: Download Model File Again + +If you would rather download it again and save it under the original name, here are the commands: +``` +MODEL_FILE="Google-Gemma-1B-Instruct-v3-q8_0.gguf" +wget https://huggingface.co/bradhutchings/Brads-LLMs/resolve/main/models/$MODEL_FILE?download=true \ + --show-progress --quiet -O $MODEL_FILE +printf "\n**********\n*\n* FINISHED: Download Model File Again.\n*\n**********\n\n" +``` + +--- +### Create llama-server-one-args File + +Let's create a `llama-server-one-args` file. These parameters can override or augment the parameters you previously embedded in your `llama-server-one` archive. This file can be edited by the end user to configure `llama-server-one` without having to construct and type a long command line. Notice that we've overridden the `-m`, `--host`, and `--port` parameters. +``` +cat << EOF > $LLAMA_SERVER_ONE_ARGS +-m +$MODEL_FILE +--host +0.0.0.0 +--port +8888 +... +EOF +printf "\n**********\n*\n* FINISHED: Create llama-server-one-args File.\n*\n**********\n\n" +``` + +--- +### Test Run + +Now we can test run `llama-server-one`, listening on all network interfaces, port 8888. Note that these are different from the default args you built into `llama-server-one`. You can connect to it from a web browser on another computer. 
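+For example, once it is running (start it with the command below), you can also do a quick reachability check from another machine with `curl` (if you have it installed) against the server's `/health` endpoint, which `llama-server` registers as a public endpoint; the address here is only a placeholder, so substitute the serving machine's actual IP: + +``` +# Placeholder address - replace 192.168.1.50 with the serving machine's IP address. +curl http://192.168.1.50:8888/health +``` +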
+``` +./$LLAMA_SERVER_ONE +``` + +After starting up and loading the model, it should display: + +**main: server is listening on http://0.0.0.0:8888 - starting the main loop**
+**srv update_slots: all slots are idle** + +Hit `ctrl-C` on your keyboard to stop it. + +--- +### Make .zip Archive + +Let's zip up the files into a `.zip` file you can share and move it to your home directory. The model won't compress much, so we're turning compression off with the `-0` parameter. + +``` +zip -0 $DEPLOY_ZIP * +mv $DEPLOY_ZIP ~ +cd ~ +printf "\n**********\n*\n* FINISHED: Make .zip Archive.\n*\n**********\n\n" +``` + +--- +### Review What You Created +Finally, let's review what you created in building, packaging, and deploying `llama-server-one`: +``` +ls -aldh *llama* +printf "\n**********\n*\n* FINISHED: Review What You Created.\n*\n**********\n\n" +``` + +You should see three directories and a `.zip` file. The `llama-server-one-deploy.zip` file is ready to upload and share. + +--- +### Congratulations! + +Congratulations! You did it. You built a `llama-server-one` executable that runs on two different CPU architectures and several popular operating systems. If you had any trouble in this process, please post a question in the [Discussions section](https://github.com/BradHutchings/llama-server-one/discussions). I'm happy to help! + +-Brad + From d195986083cdbae529c39c3dc8ef22d84188c25c Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Mon, 31 Mar 2025 12:35:27 -0700 Subject: [PATCH 02/73] Update server.cpp defer() --> defer_task() Signed-off-by: Brad Hutchings --- examples/server/server.cpp | 81 +++++++++++++++++++++++++++++++++++--- 1 file changed, 75 insertions(+), 6 deletions(-) diff --git a/examples/server/server.cpp b/examples/server/server.cpp index 17a292da153c1..108723d70e470 100644 --- a/examples/server/server.cpp +++ b/examples/server/server.cpp @@ -31,6 +31,12 @@ #include #include +// llama-server-one START +#ifdef COSMOCC +#include +#endif +// llama-server-one END + using json = nlohmann::ordered_json; constexpr int HTTP_POLLING_SECONDS = 1; @@ -1594,13 +1600,15 @@ struct server_queue { return 0; } + // llama-server-one START - defer() --> defer_task() to make Cosmo STL happy. 
// Add a new task, but defer until one slot is available - void defer(server_task task) { + void defer_task(server_task task) { std::unique_lock lock(mutex_tasks); QUE_DBG("defer task, id = %d\n", task.id); queue_tasks_deferred.push_back(std::move(task)); condition_tasks.notify_one(); } + // llama-server-one END // Get the next id for creating a new task int get_new_id() { @@ -2637,13 +2645,17 @@ struct server_context { if (slot == nullptr) { // if no slot is available, we defer this task for processing later SRV_DBG("no slot is available, defer task, id_task = %d\n", task.id); - queue_tasks.defer(task); + // llama-server-one START + queue_tasks.defer_task(task); + // llama-server-one END break; } if (slot->is_processing()) { // if requested slot is unavailable, we defer this task for processing later SRV_DBG("requested slot is unavailable, defer task, id_task = %d\n", task.id); - queue_tasks.defer(task); + // llama-server-one START + queue_tasks.defer_task(task); + // llama-server-one END break; } @@ -2726,7 +2738,9 @@ struct server_context { if (slot->is_processing()) { // if requested slot is unavailable, we defer this task for processing later SRV_DBG("requested slot is unavailable, defer task, id_task = %d\n", task.id); - queue_tasks.defer(task); + // llama-server-one START + queue_tasks.defer_task(task); + // llama-server-one END break; } @@ -2762,7 +2776,9 @@ struct server_context { if (slot->is_processing()) { // if requested slot is unavailable, we defer this task for processing later SRV_DBG("requested slot is unavailable, defer task, id_task = %d\n", task.id); - queue_tasks.defer(task); + // llama-server-one START + queue_tasks.defer_task(task); + // llama-server-one END break; } @@ -2805,7 +2821,9 @@ struct server_context { if (slot->is_processing()) { // if requested slot is unavailable, we defer this task for processing later SRV_DBG("requested slot is unavailable, defer task, id_task = %d\n", task.id); - queue_tasks.defer(task); + // llama-server-one START + queue_tasks.defer_task(task); + // llama-server-one END break; } @@ -3427,6 +3445,37 @@ inline void signal_handler(int signal) { } int main(int argc, char ** argv) { + // llama-server-one START + // This implements an args file feature inspired by llamafile's. + #ifdef COSMOCC + // Args files if present. The names are different to remove confusion during packaging. + const std::string& argsFilename = "llama-server-one-args"; + const std::string& zipArgsFilename = "/zip/default-args"; + struct stat buffer; + + // At this point, argc, argv represent: + // command (User supplied args) + + if (stat (argsFilename.c_str(), &buffer) == 0) { + argc = cosmo_args(argsFilename.c_str(), &argv); + } + + // At this point, argc, argv represent: + // command (argsFilename args) (User supplied args) + + if (stat (zipArgsFilename.c_str(), &buffer) == 0) { + argc = cosmo_args(zipArgsFilename.c_str(), &argv); + } + + // At this point, argc, argv represent: + // command (zipArgsFilename args) (argsFilename args) (User supplied args) + + // Yep, this is counterintuitive, but how the cosmo_args command works. + // argsFilename args override zipArgsFilename file args. + // User supplied args override argsFilename and zipArgsFilename args. 
+ #endif + // llama-server-one END + // own arguments required by this example common_params params; @@ -4452,6 +4501,26 @@ int main(int argc, char ** argv) { } } + // llama-server-one START + svr->Get("/chat", [](const httplib::Request & req, httplib::Response & res) { + if (req.get_header_value("Accept-Encoding").find("gzip") == std::string::npos) { + res.set_content("Error: gzip is not supported by this browser", "text/plain"); + } else { + res.set_header("Content-Encoding", "gzip"); + // COEP and COOP headers, required by pyodide (python interpreter) + res.set_header("Cross-Origin-Embedder-Policy", "require-corp"); + res.set_header("Cross-Origin-Opener-Policy", "same-origin"); + res.set_content(reinterpret_cast(index_html_gz), index_html_gz_len, "text/html; charset=utf-8"); + } + return false; + }); + + svr->Get("/chat/", [](const httplib::Request & req, httplib::Response & res) { + res.set_redirect("/chat"); + return false; + }); + // llama-server-one END + // register API routes svr->Get ("/health", handle_health); // public endpoint (no API key check) svr->Get ("/metrics", handle_metrics); From 71c6a0393c093e0f3e883bf2133fa9e3fed8f973 Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Mon, 31 Mar 2025 12:37:17 -0700 Subject: [PATCH 03/73] Update llama-context.cpp Cosmo STL doesn't have std:fill. Signed-off-by: Brad Hutchings --- src/llama-context.cpp | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/src/llama-context.cpp b/src/llama-context.cpp index 3479a8cca3d64..9e1a7c607e989 100644 --- a/src/llama-context.cpp +++ b/src/llama-context.cpp @@ -1583,7 +1583,15 @@ int32_t llama_context::output_reserve(int32_t n_outputs) { embd = has_embd ? output_base + logits_size : nullptr; // set all ids as invalid (negative) - std::fill(output_ids.begin(), output_ids.end(), -1); + // llama-server-one START + #ifndef COSMOCC + std::fill(output_ids.begin(), output_ids.end(), -1); + #else + for (auto iii = output_ids.begin(); iii != output_ids.end(); iii++) { + *iii = -1; + } + #endif + // llama-server-one END ggml_backend_buffer_clear(buf_output.get(), 0); @@ -1623,7 +1631,17 @@ void llama_context::output_reorder() { } } } + + // llama-server-one START + #ifndef COSMOCC std::fill(output_ids.begin(), output_ids.end(), -1); + #else + for (auto iii = output_ids.begin(); iii != output_ids.end(); iii++) { + *iii = -1; + } + #endif + // llama-server-one END + for (int32_t i = 0; i < n_outputs; ++i) { output_ids[out_ids[i]] = i; } From a4f2d58a2587b9ef8fc97cb0ec7fe4ea4cbe7cbe Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Mon, 31 Mar 2025 12:38:26 -0700 Subject: [PATCH 04/73] Rename Makefile to Makefile-llama-cpp-original Signed-off-by: Brad Hutchings --- Makefile => Makefile-llama-cpp-original | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename Makefile => Makefile-llama-cpp-original (100%) diff --git a/Makefile b/Makefile-llama-cpp-original similarity index 100% rename from Makefile rename to Makefile-llama-cpp-original From 5e9c36ff4b645376b624232303a2339fd45d348f Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Mon, 31 Mar 2025 12:38:41 -0700 Subject: [PATCH 05/73] Rename README.md to README-llama.cpp.md Signed-off-by: Brad Hutchings --- README.md => README-llama.cpp.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename README.md => README-llama.cpp.md (100%) diff --git a/README.md b/README-llama.cpp.md similarity index 100% rename from README.md rename to README-llama.cpp.md From 56615005cd33292c1f68d30a14c2c2ce1a84539d Mon Sep 17 00:00:00 
2001 From: Brad Hutchings Date: Mon, 31 Mar 2025 12:40:27 -0700 Subject: [PATCH 06/73] Add files via upload Signed-off-by: Brad Hutchings --- Makefile | 1704 +++++++++++++++++++++++++++++++++++++++++++++++++++++ README.md | 75 +++ 2 files changed, 1779 insertions(+) create mode 100644 Makefile create mode 100644 README.md diff --git a/Makefile b/Makefile new file mode 100644 index 0000000000000..35e3d21234a48 --- /dev/null +++ b/Makefile @@ -0,0 +1,1704 @@ +ifndef LLAMA_MAKEFILE +$(error The Makefile build is deprecated. Use the CMake build instead. For more details, see https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md) +endif + +# Modified by Brad Hutchings to build llama.cpp targets correctly and build with cosmocc. + +# Define the default target now so that it is always the first target +BUILD_TARGETS = \ + libllava.a \ + llama-server \ + llama-batched \ + llama-batched-bench \ + llama-bench \ + llama-cli \ + llama-convert-llama2c-to-ggml \ + llama-embedding \ + llama-eval-callback \ + llama-export-lora \ + llama-gbnf-validator \ + llama-gguf \ + llama-gguf-hash \ + llama-gguf-split \ + llama-gritlm \ + llama-imatrix \ + llama-infill \ + llama-llava-cli \ + llama-minicpmv-cli\ + llama-qwen2vl-cli\ + llama-lookahead \ + llama-lookup \ + llama-lookup-create \ + llama-lookup-merge \ + llama-lookup-stats \ + llama-parallel \ + llama-passkey \ + llama-perplexity \ + llama-q8dot \ + llama-quantize \ + llama-quantize-stats \ + llama-retrieval \ + llama-save-load-state \ + llama-simple \ + llama-simple-chat \ + llama-run \ + llama-speculative \ + llama-tokenize \ + llama-vdot \ + llama-cvector-generator \ + llama-gen-docs \ + tests/test-c.o + +# Binaries only useful for tests +TEST_TARGETS = \ + tests/test-arg-parser \ + tests/test-autorelease \ + tests/test-backend-ops \ + tests/test-chat \ + tests/test-chat-template \ + tests/test-double-float \ + tests/test-grammar-integration \ + tests/test-grammar-parser \ + tests/test-json-schema-to-grammar \ + tests/test-llama-grammar \ + tests/test-log \ + tests/test-model-load-cancel \ + tests/test-quantize-fns \ + tests/test-quantize-perf \ + tests/test-rope \ + tests/test-sampling \ + tests/test-tokenizer-0 \ + tests/test-tokenizer-1-bpe \ + tests/test-tokenizer-1-spm +# tests/test-opt \ + +# Legacy build targets that were renamed in #7809, but should still be removed when the project is cleaned +LEGACY_TARGETS_CLEAN = main quantize quantize-stats perplexity imatrix embedding vdot q8dot convert-llama2c-to-ggml \ + simple batched batched-bench save-load-state server gguf gguf-split eval-callback llama-bench libllava.a llava-cli baby-llama \ + retrieval speculative infill tokenize parallel export-lora lookahead lookup passkey gritlm + +# Legacy build targets that were renamed in #7809, but we want to build binaries that for them that output a deprecation warning if people try to use them. +# We don't want to clutter things too much, so we only build replacements for the most commonly used binaries. +LEGACY_TARGETS_BUILD = main quantize perplexity embedding server + +# Deprecation aliases +ifdef LLAMA_CUBLAS +$(error LLAMA_CUBLAS is removed. Use GGML_CUDA instead.) 
+endif + +ifdef LLAMA_CUDA +GGML_CUDA := 1 +DEPRECATE_WARNING := 1 +endif + +ifdef LLAMA_KOMPUTE +GGML_KOMPUTE := 1 +DEPRECATE_WARNING := 1 +endif + +ifdef LLAMA_METAL +GGML_METAL := 1 +DEPRECATE_WARNING := 1 +endif + +ifdef LLAMA_RPC +GGML_RPC := 1 +DEPRECATE_WARNING := 1 +endif + +ifdef LLAMA_SYCL +GGML_SYCL := 1 +DEPRECATE_WARNING := 1 +endif + +ifdef LLAMA_SYCL_F16 +GGML_SYCL_F16 := 1 +DEPRECATE_WARNING := 1 +endif + +ifdef LLAMA_OPENBLAS +GGML_OPENBLAS := 1 +DEPRECATE_WARNING := 1 +endif + +ifdef LLAMA_OPENBLAS64 +GGML_OPENBLAS64 := 1 +DEPRECATE_WARNING := 1 +endif + +ifdef LLAMA_BLIS +GGML_BLIS := 1 +DEPRECATE_WARNING := 1 +endif + +ifdef LLAMA_NO_LLAMAFILE +GGML_NO_LLAMAFILE := 1 +DEPRECATE_WARNING := 1 +endif + +ifdef LLAMA_NO_ACCELERATE +GGML_NO_ACCELERATE := 1 +DEPRECATE_WARNING := 1 +endif + +ifdef LLAMA_NO_OPENMP +GGML_NO_OPENMP := 1 +DEPRECATE_WARNING := 1 +endif + +ifdef LLAMA_NO_METAL +GGML_NO_METAL := 1 +DEPRECATE_WARNING := 1 +endif + +ifdef LLAMA_DISABLE_LOGS +REMOVE_WARNING := 1 +endif + +ifdef LLAMA_SERVER_VERBOSE +REMOVE_WARNING := 1 +endif + +ifndef UNAME_S +UNAME_S := $(shell uname -s) +endif + +ifndef UNAME_P +UNAME_P := $(shell uname -p) +endif + +ifndef UNAME_M +UNAME_M := $(shell uname -m) +endif + +# In GNU make default CXX is g++ instead of c++. Let's fix that so that users +# of non-gcc compilers don't have to provide g++ alias or wrapper. +DEFCC := cc +DEFCXX := c++ +ifeq ($(origin CC),default) +CC := $(DEFCC) +endif +ifeq ($(origin CXX),default) +CXX := $(DEFCXX) +endif + +# Mac OS + Arm can report x86_64 +# ref: https://github.com/ggerganov/whisper.cpp/issues/66#issuecomment-1282546789 +ifeq ($(UNAME_S),Darwin) + ifndef GGML_NO_METAL + GGML_METAL := 1 + endif + + GGML_NO_OPENMP := 1 + + ifneq ($(UNAME_P),arm) + SYSCTL_M := $(shell sysctl -n hw.optional.arm64 2>/dev/null) + ifeq ($(SYSCTL_M),1) + # UNAME_P := arm + # UNAME_M := arm64 + warn := $(warning Your arch is announced as x86_64, but it seems to actually be ARM64. Not fixing that can lead to bad performance. For more info see: https://github.com/ggerganov/whisper.cpp/issues/66\#issuecomment-1282546789) + endif + endif +endif + +ifdef GGML_METAL + GGML_METAL_EMBED_LIBRARY := 1 +endif + +ifdef GGML_RPC + BUILD_TARGETS += rpc-server +endif + +ifdef GGML_VULKAN + BUILD_TARGETS += vulkan-shaders-gen +endif + +default: $(BUILD_TARGETS) $(LEGACY_TARGETS_BUILD) + +test: $(TEST_TARGETS) + @failures=0; \ + for test_target in $(TEST_TARGETS); do \ + if [ "$$test_target" = "tests/test-tokenizer-0" ]; then \ + ./$$test_target $(CURDIR)/models/ggml-vocab-llama-spm.gguf; \ + ./$$test_target $(CURDIR)/models/ggml-vocab-llama-bpe.gguf; \ + ./$$test_target $(CURDIR)/models/ggml-vocab-phi-3.gguf; \ + ./$$test_target $(CURDIR)/models/ggml-vocab-falcon.gguf; \ + ./$$test_target $(CURDIR)/models/ggml-vocab-bert-bge.gguf; \ + ./$$test_target $(CURDIR)/models/ggml-vocab-starcoder.gguf; \ + ./$$test_target $(CURDIR)/models/ggml-vocab-gpt-2.gguf; \ + ./$$test_target $(CURDIR)/models/ggml-vocab-refact.gguf; \ + elif [ "$$test_target" = "tests/test-tokenizer-1-spm" ]; then \ + continue; \ + elif [ "$$test_target" = "tests/test-tokenizer-1-bpe" ]; then \ + continue; \ + else \ + echo "Running test $$test_target..."; \ + ./$$test_target; \ + fi; \ + if [ $$? 
-ne 0 ]; then \ + printf 'Test %s FAILED!\n\n' $$test_target; \ + failures=$$(( failures + 1 )); \ + else \ + printf 'Test %s passed.\n\n' $$test_target; \ + fi; \ + done; \ + if [ $$failures -gt 0 ]; then \ + printf '\n%s tests failed.\n' $$failures; \ + exit 1; \ + fi + @echo 'All tests passed.' + +all: $(BUILD_TARGETS) $(TEST_TARGETS) $(LEGACY_TARGETS_BUILD) + +ifdef RISCV_CROSS_COMPILE +CC := riscv64-unknown-linux-gnu-gcc +CXX := riscv64-unknown-linux-gnu-g++ +endif + +# +# Compile flags +# + +# keep standard at C11 and C++17 +MK_CPPFLAGS = -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -I. -DGGML_USE_CPU +MK_CFLAGS = -std=c11 -fPIC +MK_CXXFLAGS = -std=c++17 -fPIC +MK_NVCCFLAGS = -std=c++17 + +ifdef LLAMA_NO_CCACHE +GGML_NO_CCACHE := 1 +DEPRECATE_WARNING := 1 +endif + +ifndef GGML_NO_CCACHE +CCACHE := $(shell which ccache) +ifdef CCACHE +export CCACHE_SLOPPINESS = time_macros +$(info I ccache found, compilation results will be cached. Disable with GGML_NO_CCACHE.) +CC := $(CCACHE) $(CC) +CXX := $(CCACHE) $(CXX) +else +$(info I ccache not found. Consider installing it for faster compilation.) +endif # CCACHE +endif # GGML_NO_CCACHE + +# clock_gettime came in POSIX.1b (1993) +# CLOCK_MONOTONIC came in POSIX.1-2001 / SUSv3 as optional +# posix_memalign came in POSIX.1-2001 / SUSv3 +# M_PI is an XSI extension since POSIX.1-2001 / SUSv3, came in XPG1 (1985) +MK_CPPFLAGS += -D_XOPEN_SOURCE=600 + +# Somehow in OpenBSD whenever POSIX conformance is specified +# some string functions rely on locale_t availability, +# which was introduced in POSIX.1-2008, forcing us to go higher +ifeq ($(UNAME_S),OpenBSD) + MK_CPPFLAGS += -U_XOPEN_SOURCE -D_XOPEN_SOURCE=700 +endif + +# Data types, macros and functions related to controlling CPU affinity and +# some memory allocation are available on Linux through GNU extensions in libc +ifeq ($(UNAME_S),Linux) + MK_CPPFLAGS += -D_GNU_SOURCE + MK_LDFLAGS += -ldl +endif + +# RLIMIT_MEMLOCK came in BSD, is not specified in POSIX.1, +# and on macOS its availability depends on enabling Darwin extensions +# similarly on DragonFly, enabling BSD extensions is necessary +ifeq ($(UNAME_S),Darwin) + MK_CPPFLAGS += -D_DARWIN_C_SOURCE +endif +ifeq ($(UNAME_S),DragonFly) + MK_CPPFLAGS += -D__BSD_VISIBLE +endif + +# alloca is a non-standard interface that is not visible on BSDs when +# POSIX conformance is specified, but not all of them provide a clean way +# to enable it in such cases +ifeq ($(UNAME_S),FreeBSD) + MK_CPPFLAGS += -D__BSD_VISIBLE +endif +ifeq ($(UNAME_S),NetBSD) + MK_CPPFLAGS += -D_NETBSD_SOURCE +endif +ifeq ($(UNAME_S),OpenBSD) + MK_CPPFLAGS += -D_BSD_SOURCE +endif + +ifdef GGML_SCHED_MAX_COPIES + MK_CPPFLAGS += -DGGML_SCHED_MAX_COPIES=$(GGML_SCHED_MAX_COPIES) +endif + +ifdef LLAMA_DEBUG + MK_CFLAGS += -O0 -g + MK_CXXFLAGS += -O0 -g + MK_LDFLAGS += -g + MK_NVCCFLAGS += -O0 -g + + ifeq ($(UNAME_S),Linux) + MK_CPPFLAGS += -D_GLIBCXX_ASSERTIONS + endif +else + MK_CPPFLAGS += -DNDEBUG + MK_CFLAGS += -O3 -g + MK_CXXFLAGS += -O3 -g + MK_NVCCFLAGS += -O3 -g +endif + +ifdef LLAMA_SANITIZE_THREAD + MK_CFLAGS += -fsanitize=thread -g + MK_CXXFLAGS += -fsanitize=thread -g + MK_LDFLAGS += -fsanitize=thread -g +endif + +ifdef LLAMA_SANITIZE_ADDRESS + MK_CFLAGS += -fsanitize=address -fno-omit-frame-pointer -g + MK_CXXFLAGS += -fsanitize=address -fno-omit-frame-pointer -g + MK_LDFLAGS += -fsanitize=address -fno-omit-frame-pointer -g +endif + +ifdef LLAMA_SANITIZE_UNDEFINED + MK_CFLAGS += -fsanitize=undefined -g + MK_CXXFLAGS += -fsanitize=undefined -g + MK_LDFLAGS += 
-fsanitize=undefined -g +endif + +ifdef LLAMA_SERVER_SSL + MK_CPPFLAGS += -DCPPHTTPLIB_OPENSSL_SUPPORT + MK_LDFLAGS += -lssl -lcrypto +endif + +ifndef GGML_NO_CPU_AARCH64 + MK_CPPFLAGS += -DGGML_USE_CPU_AARCH64 +endif + +ifeq ($(UNAME_S),cosmocc) +$(info Setting MK_CFLAGS and MK_CXXFLAGS flags for cosmocc.) + + WARN_FLAGS_ORIG = \ + -Wall \ + -Wextra \ + -Wpedantic \ + -Wcast-qual \ + -Wno-unused-function + + WARN_FLAGS = \ + -Wcast-qual \ + -Wno-unused-function + + MK_CFLAGS += \ + $(WARN_FLAGS) \ + -Wshadow \ + -Wstrict-prototypes \ + -Wpointer-arith \ + -Wmissing-prototypes \ + -Werror=implicit-function-declaration \ + -Wno-implicit-int \ + -DCOSMOCC=1 + + MK_CXXFLAGS += \ + $(WARN_FLAGS) \ + -Wmissing-declarations \ + -Wmissing-noreturn \ + -Wno-literal-suffix \ + -DCOSMOCC=1 + + +else +$(info Using default MK_CFLAGS and MK_CXXFLAGS flags.) + + # warnings + WARN_FLAGS = \ + -Wall \ + -Wextra \ + -Wpedantic \ + -Wcast-qual \ + -Wno-unused-function + + MK_CFLAGS += \ + $(WARN_FLAGS) \ + -Wshadow \ + -Wstrict-prototypes \ + -Wpointer-arith \ + -Wmissing-prototypes \ + -Werror=implicit-int \ + -Werror=implicit-function-declaration + + MK_CXXFLAGS += \ + $(WARN_FLAGS) \ + -Wmissing-declarations \ + -Wmissing-noreturn + +endif + + + +ifeq ($(LLAMA_FATAL_WARNINGS),1) + MK_CFLAGS += -Werror + MK_CXXFLAGS += -Werror +endif + +# this version of Apple ld64 is buggy +ifneq ($(UNAME_S),cosmocc) +ifneq '' '$(findstring dyld-1015.7,$(shell $(CC) $(LDFLAGS) -Wl,-v 2>&1))' + MK_CPPFLAGS += -DHAVE_BUGGY_APPLE_LINKER +endif +endif + +# OS specific +# TODO: support Windows +ifneq '' '$(filter $(UNAME_S),Linux Darwin FreeBSD NetBSD OpenBSD Haiku)' + MK_CFLAGS += -pthread + MK_CXXFLAGS += -pthread +endif + +# detect Windows +ifneq ($(findstring _NT,$(UNAME_S)),) + _WIN32 := 1 +endif + +# library name prefix +ifneq ($(_WIN32),1) + LIB_PRE := lib +endif + +# Dynamic Shared Object extension +ifneq ($(_WIN32),1) + DSO_EXT := .so +else + DSO_EXT := .dll +endif + +# Windows Sockets 2 (Winsock) for network-capable apps +ifeq ($(_WIN32),1) + LWINSOCK2 := -lws2_32 +endif + +ifdef LLAMA_GPROF + MK_CFLAGS += -pg + MK_CXXFLAGS += -pg +endif + +# Architecture specific +# TODO: probably these flags need to be tweaked on some architectures +# feel free to update the Makefile for your architecture and send a pull request or issue + +ifndef RISCV_CROSS_COMPILE + +ifeq ($(UNAME_M),$(filter $(UNAME_M),x86_64 i686 amd64)) + # Use all CPU extensions that are available: + MK_CFLAGS += -march=native -mtune=native + HOST_CXXFLAGS += -march=native -mtune=native + + # Usage AMX build test + #MK_CFLAGS += -march=graniterapids -mtune=graniterapids + #HOST_CXXFLAGS += -march=graniterapids -mtune=graniterapids + + # Usage AVX-only + #MK_CFLAGS += -mfma -mf16c -mavx + #MK_CXXFLAGS += -mfma -mf16c -mavx + + # Usage SSSE3-only (Not is SSE3!) + #MK_CFLAGS += -mssse3 + #MK_CXXFLAGS += -mssse3 +endif + +ifneq ($(UNAME_S),cosmocc) +ifneq '' '$(findstring mingw,$(shell $(CC) -dumpmachine))' + # The stack is only 16-byte aligned on Windows, so don't let gcc emit aligned moves. + # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412 + # https://github.com/ggml-org/llama.cpp/issues/2922 + MK_CFLAGS += -Xassembler -muse-unaligned-vector-move + MK_CXXFLAGS += -Xassembler -muse-unaligned-vector-move + + # Target Windows 8 for PrefetchVirtualMemory + MK_CPPFLAGS += -D_WIN32_WINNT=0x602 +endif +endif + +ifneq ($(filter aarch64%,$(UNAME_M)),) + # Apple M1, M2, etc. 
+ # Raspberry Pi 3, 4, Zero 2 (64-bit) + # Nvidia Jetson + MK_CFLAGS += -mcpu=native + MK_CXXFLAGS += -mcpu=native + JETSON_RELEASE_INFO = $(shell jetson_release) + ifdef JETSON_RELEASE_INFO + ifneq ($(filter TX2%,$(JETSON_RELEASE_INFO)),) + JETSON_EOL_MODULE_DETECT = 1 + CC = aarch64-unknown-linux-gnu-gcc + cxx = aarch64-unknown-linux-gnu-g++ + endif + endif +endif + +ifneq ($(filter armv6%,$(UNAME_M)),) + # Raspberry Pi 1, Zero + MK_CFLAGS += -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access + MK_CXXFLAGS += -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access +endif + +ifneq ($(filter armv7%,$(UNAME_M)),) + # Raspberry Pi 2 + MK_CFLAGS += -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access -funsafe-math-optimizations + MK_CXXFLAGS += -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access -funsafe-math-optimizations +endif + +ifneq ($(filter armv8%,$(UNAME_M)),) + # Raspberry Pi 3, 4, Zero 2 (32-bit) + MK_CFLAGS += -mfp16-format=ieee -mno-unaligned-access + MK_CXXFLAGS += -mfp16-format=ieee -mno-unaligned-access +endif + +ifneq ($(filter ppc64%,$(UNAME_M)),) + POWER9_M := $(shell grep "POWER9" /proc/cpuinfo) + ifneq (,$(findstring POWER9,$(POWER9_M))) + MK_CFLAGS += -mcpu=power9 + MK_CXXFLAGS += -mcpu=power9 + endif +endif + +ifneq ($(filter ppc64le%,$(UNAME_M)),) + MK_CFLAGS += -mcpu=powerpc64le + MK_CXXFLAGS += -mcpu=powerpc64le + CUDA_POWER_ARCH = 1 +endif + +ifneq ($(filter loongarch64%,$(UNAME_M)),) + MK_CFLAGS += -mlasx + MK_CXXFLAGS += -mlasx +endif + +ifneq ($(filter riscv64%,$(UNAME_M)),) + MK_CFLAGS += -march=rv64gcv -mabi=lp64d + MK_CXXFLAGS += -march=rv64gcv -mabi=lp64d +endif + +else # RISC-V CROSS COMPILATION + MK_CFLAGS += -march=rv64gcv -mabi=lp64d + MK_CXXFLAGS += -march=rv64gcv -mabi=lp64d +endif + +ifndef GGML_NO_ACCELERATE + # Mac OS - include Accelerate framework. 
+ # `-framework Accelerate` works both with Apple Silicon and Mac Intel + ifeq ($(UNAME_S),Darwin) + MK_CPPFLAGS += -DGGML_USE_ACCELERATE -DGGML_USE_BLAS -DGGML_BLAS_USE_ACCELERATE + MK_CPPFLAGS += -DACCELERATE_NEW_LAPACK + MK_CPPFLAGS += -DACCELERATE_LAPACK_ILP64 + MK_LDFLAGS += -framework Accelerate + OBJ_GGML_EXT += ggml/src/ggml-blas/ggml-blas.o + endif +endif # GGML_NO_ACCELERATE + +ifndef GGML_NO_OPENMP + MK_CPPFLAGS += -DGGML_USE_OPENMP + MK_CFLAGS += -fopenmp + MK_CXXFLAGS += -fopenmp +endif # GGML_NO_OPENMP + +ifdef GGML_OPENBLAS + MK_CPPFLAGS += -DGGML_USE_BLAS $(shell pkg-config --cflags-only-I openblas) + MK_CFLAGS += $(shell pkg-config --cflags-only-other openblas) + MK_LDFLAGS += $(shell pkg-config --libs openblas) + OBJ_GGML_EXT += ggml/src/ggml-blas/ggml-blas.o +endif # GGML_OPENBLAS + +ifdef GGML_OPENBLAS64 + MK_CPPFLAGS += -DGGML_USE_BLAS $(shell pkg-config --cflags-only-I openblas64) + MK_CFLAGS += $(shell pkg-config --cflags-only-other openblas64) + MK_LDFLAGS += $(shell pkg-config --libs openblas64) + OBJ_GGML_EXT += ggml/src/ggml-blas/ggml-blas.o +endif # GGML_OPENBLAS64 + +ifdef GGML_BLIS + MK_CPPFLAGS += -DGGML_USE_BLAS -DGGML_BLAS_USE_BLIS -I/usr/local/include/blis -I/usr/include/blis + MK_LDFLAGS += -lblis -L/usr/local/lib + OBJ_GGML_EXT += ggml/src/ggml-blas/ggml-blas.o +endif # GGML_BLIS + +ifdef GGML_NVPL + MK_CPPFLAGS += -DGGML_USE_BLAS -DGGML_BLAS_USE_NVPL -DNVPL_ILP64 -I/usr/local/include/nvpl_blas -I/usr/include/nvpl_blas + MK_LDFLAGS += -L/usr/local/lib -lnvpl_blas_core -lnvpl_blas_ilp64_gomp + OBJ_GGML_EXT += ggml/src/ggml-blas/ggml-blas.o +endif # GGML_NVPL + +ifndef GGML_NO_LLAMAFILE + MK_CPPFLAGS += -DGGML_USE_LLAMAFILE + OBJ_GGML_EXT += ggml/src/ggml-cpu/llamafile/sgemm.o +endif + +ifndef GGML_NO_AMX + MK_CPPFLAGS += -DGGML_USE_AMX + OBJ_GGML_EXT += ggml/src/ggml-cpu/amx/amx.o ggml/src/ggml-cpu/amx/mmq.o +endif + +# only necessary for the CPU backend files +MK_CPPFLAGS += -Iggml/src/ggml-cpu + +ifdef GGML_RPC + MK_CPPFLAGS += -DGGML_USE_RPC + OBJ_GGML_EXT += ggml/src/ggml-rpc.o +endif # GGML_RPC + +OBJ_CUDA_TMPL = $(patsubst %.cu,%.o,$(wildcard ggml/src/ggml-cuda/template-instances/fattn-mma*.cu)) +OBJ_CUDA_TMPL += $(patsubst %.cu,%.o,$(wildcard ggml/src/ggml-cuda/template-instances/mmq*.cu)) + +ifdef GGML_CUDA_FA_ALL_QUANTS + OBJ_CUDA_TMPL += $(patsubst %.cu,%.o,$(wildcard ggml/src/ggml-cuda/template-instances/fattn-vec*.cu)) +else + OBJ_CUDA_TMPL += $(patsubst %.cu,%.o,$(wildcard ggml/src/ggml-cuda/template-instances/fattn-vec*q4_0-q4_0.cu)) + OBJ_CUDA_TMPL += $(patsubst %.cu,%.o,$(wildcard ggml/src/ggml-cuda/template-instances/fattn-vec*q8_0-q8_0.cu)) + OBJ_CUDA_TMPL += $(patsubst %.cu,%.o,$(wildcard ggml/src/ggml-cuda/template-instances/fattn-vec*f16-f16.cu)) +endif # GGML_CUDA_FA_ALL_QUANTS + +ifdef GGML_CUDA + ifneq ('', '$(wildcard /opt/cuda)') + CUDA_PATH ?= /opt/cuda + else + CUDA_PATH ?= /usr/local/cuda + endif + + MK_CPPFLAGS += -DGGML_USE_CUDA -DGGML_CUDA_USE_GRAPHS -I$(CUDA_PATH)/include -I$(CUDA_PATH)/targets/$(UNAME_M)-linux/include + MK_LDFLAGS += -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L$(CUDA_PATH)/lib64 -L/usr/lib64 -L$(CUDA_PATH)/targets/$(UNAME_M)-linux/lib -L$(CUDA_PATH)/lib64/stubs -L/usr/lib/wsl/lib + MK_NVCCFLAGS += -use_fast_math + + OBJ_GGML_EXT += ggml/src/ggml-cuda/ggml-cuda.o + OBJ_GGML_EXT += $(patsubst %.cu,%.o,$(wildcard ggml/src/ggml-cuda/*.cu)) + OBJ_GGML_EXT += $(OBJ_CUDA_TMPL) + +ifdef LLAMA_FATAL_WARNINGS + MK_NVCCFLAGS += -Werror all-warnings +endif # LLAMA_FATAL_WARNINGS + +ifndef 
JETSON_EOL_MODULE_DETECT + MK_NVCCFLAGS += --forward-unknown-to-host-compiler +endif # JETSON_EOL_MODULE_DETECT + +ifdef LLAMA_DEBUG + MK_NVCCFLAGS += -lineinfo +endif # LLAMA_DEBUG + +ifdef GGML_CUDA_DEBUG + MK_NVCCFLAGS += --device-debug +endif # GGML_CUDA_DEBUG + +ifdef GGML_CUDA_NVCC + NVCC = $(CCACHE) $(GGML_CUDA_NVCC) +else + NVCC = $(CCACHE) nvcc +endif # GGML_CUDA_NVCC + +ifdef CUDA_DOCKER_ARCH + MK_NVCCFLAGS += -Wno-deprecated-gpu-targets -arch=$(CUDA_DOCKER_ARCH) +else ifndef CUDA_POWER_ARCH + MK_NVCCFLAGS += -arch=native +endif # CUDA_DOCKER_ARCH + +ifdef GGML_CUDA_FORCE_MMQ + MK_NVCCFLAGS += -DGGML_CUDA_FORCE_MMQ +endif # GGML_CUDA_FORCE_MMQ + +ifdef GGML_CUDA_FORCE_CUBLAS + MK_NVCCFLAGS += -DGGML_CUDA_FORCE_CUBLAS +endif # GGML_CUDA_FORCE_CUBLAS + +ifdef GGML_CUDA_F16 + MK_NVCCFLAGS += -DGGML_CUDA_F16 +endif # GGML_CUDA_F16 + +ifdef GGML_CUDA_DMMV_F16 + MK_NVCCFLAGS += -DGGML_CUDA_F16 +endif # GGML_CUDA_DMMV_F16 + +ifdef GGML_CUDA_PEER_MAX_BATCH_SIZE + MK_NVCCFLAGS += -DGGML_CUDA_PEER_MAX_BATCH_SIZE=$(GGML_CUDA_PEER_MAX_BATCH_SIZE) +else + MK_NVCCFLAGS += -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 +endif # GGML_CUDA_PEER_MAX_BATCH_SIZE + +ifdef GGML_CUDA_NO_PEER_COPY + MK_NVCCFLAGS += -DGGML_CUDA_NO_PEER_COPY +endif # GGML_CUDA_NO_PEER_COPY + +ifdef GGML_CUDA_CCBIN + MK_NVCCFLAGS += -ccbin $(GGML_CUDA_CCBIN) +endif # GGML_CUDA_CCBIN + +ifdef GGML_CUDA_NO_FA + MK_NVCCFLAGS += -DGGML_CUDA_NO_FA +endif # GGML_CUDA_NO_FA + +ifdef GGML_CUDA_FA_ALL_QUANTS + MK_NVCCFLAGS += -DGGML_CUDA_FA_ALL_QUANTS +endif # GGML_CUDA_FA_ALL_QUANTS + +ifdef JETSON_EOL_MODULE_DETECT +define NVCC_COMPILE + $(NVCC) -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUDA -I/usr/local/cuda/include -I/opt/cuda/include -I/usr/local/cuda/targets/aarch64-linux/include -std=c++11 -O3 $(NVCCFLAGS) $(CPPFLAGS) -Xcompiler "$(CUDA_CXXFLAGS)" -c $< -o $@ +endef # NVCC_COMPILE +else +define NVCC_COMPILE + $(NVCC) $(NVCCFLAGS) $(CPPFLAGS) -Xcompiler "$(CUDA_CXXFLAGS)" -c $< -o $@ +endef # NVCC_COMPILE +endif # JETSON_EOL_MODULE_DETECT + +ggml/src/ggml-cuda/%.o: \ + ggml/src/ggml-cuda/%.cu \ + ggml/include/ggml.h \ + ggml/src/ggml-common.h \ + ggml/src/ggml-cuda/common.cuh + $(NVCC_COMPILE) + +ggml/src/ggml-cuda/ggml-cuda.o: \ + ggml/src/ggml-cuda/ggml-cuda.cu \ + ggml/include/ggml-cuda.h \ + ggml/include/ggml.h \ + ggml/include/ggml-backend.h \ + ggml/src/ggml-backend-impl.h \ + ggml/src/ggml-common.h \ + $(wildcard ggml/src/ggml-cuda/*.cuh) + $(NVCC_COMPILE) +endif # GGML_CUDA + +ifdef GGML_VULKAN + MK_CPPFLAGS += -DGGML_USE_VULKAN + MK_LDFLAGS += $(shell pkg-config --libs vulkan) + OBJ_GGML_EXT += ggml/src/ggml-vulkan.o ggml/src/ggml-vulkan-shaders.o + +ifdef GGML_VULKAN_CHECK_RESULTS + MK_CPPFLAGS += -DGGML_VULKAN_CHECK_RESULTS +endif + +ifdef GGML_VULKAN_DEBUG + MK_CPPFLAGS += -DGGML_VULKAN_DEBUG +endif + +ifdef GGML_VULKAN_MEMORY_DEBUG + MK_CPPFLAGS += -DGGML_VULKAN_MEMORY_DEBUG +endif + +ifdef GGML_VULKAN_PERF + MK_CPPFLAGS += -DGGML_VULKAN_PERF +endif + +ifdef GGML_VULKAN_VALIDATE + MK_CPPFLAGS += -DGGML_VULKAN_VALIDATE +endif + +ifdef GGML_VULKAN_RUN_TESTS + MK_CPPFLAGS += -DGGML_VULKAN_RUN_TESTS +endif + +GLSLC_CMD = glslc +_ggml_vk_genshaders_cmd = $(shell pwd)/vulkan-shaders-gen +_ggml_vk_header = ggml/src/ggml-vulkan-shaders.hpp +_ggml_vk_source = ggml/src/ggml-vulkan-shaders.cpp +_ggml_vk_input_dir = ggml/src/ggml-vulkan/vulkan-shaders +_ggml_vk_shader_deps = $(echo $(_ggml_vk_input_dir)/*.comp) + +ggml/src/ggml-vulkan.o: ggml/src/ggml-vulkan/ggml-vulkan.cpp ggml/include/ggml-vulkan.h 
$(_ggml_vk_header) $(_ggml_vk_source) + $(CXX) $(CXXFLAGS) $(shell pkg-config --cflags vulkan) -c $< -o $@ + +$(_ggml_vk_header): $(_ggml_vk_source) + +$(_ggml_vk_source): $(_ggml_vk_shader_deps) vulkan-shaders-gen + $(_ggml_vk_genshaders_cmd) \ + --glslc $(GLSLC_CMD) \ + --input-dir $(_ggml_vk_input_dir) \ + --target-hpp $(_ggml_vk_header) \ + --target-cpp $(_ggml_vk_source) + +vulkan-shaders-gen: ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp + $(CXX) $(CXXFLAGS) -o $@ $(LDFLAGS) ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp + +endif # GGML_VULKAN + +ifdef GGML_HIP + ifeq ($(wildcard /opt/rocm),) + ROCM_PATH ?= /usr + AMDGPU_TARGETS ?= $(shell $(shell which amdgpu-arch)) + else + ROCM_PATH ?= /opt/rocm + AMDGPU_TARGETS ?= $(shell $(ROCM_PATH)/llvm/bin/amdgpu-arch) + endif + + MK_CPPFLAGS += -DGGML_USE_HIP -DGGML_USE_CUDA + +ifdef GGML_HIP_UMA + MK_CPPFLAGS += -DGGML_HIP_UMA +endif # GGML_HIP_UMA + + MK_LDFLAGS += -L$(ROCM_PATH)/lib -Wl,-rpath=$(ROCM_PATH)/lib + MK_LDFLAGS += -L$(ROCM_PATH)/lib64 -Wl,-rpath=$(ROCM_PATH)/lib64 + MK_LDFLAGS += -lhipblas -lamdhip64 -lrocblas + + HIPCC ?= $(CCACHE) $(ROCM_PATH)/bin/hipcc + + HIPFLAGS += $(addprefix --offload-arch=,$(AMDGPU_TARGETS)) + +ifdef GGML_CUDA_FORCE_MMQ + HIPFLAGS += -DGGML_CUDA_FORCE_MMQ +endif # GGML_CUDA_FORCE_MMQ + +ifdef GGML_CUDA_FORCE_CUBLAS + HIPFLAGS += -DGGML_CUDA_FORCE_CUBLAS +endif # GGML_CUDA_FORCE_CUBLAS + +ifdef GGML_CUDA_NO_PEER_COPY + HIPFLAGS += -DGGML_CUDA_NO_PEER_COPY +endif # GGML_CUDA_NO_PEER_COPY + +ifdef GGML_CUDA_NO_FA + HIPFLAGS += -DGGML_CUDA_NO_FA +endif # GGML_CUDA_NO_FA + + OBJ_GGML_EXT += ggml/src/ggml-cuda/ggml-cuda.o + OBJ_GGML_EXT += $(patsubst %.cu,%.o,$(wildcard ggml/src/ggml-cuda/*.cu)) + OBJ_GGML_EXT += $(OBJ_CUDA_TMPL) + +ggml/src/ggml-cuda/ggml-cuda.o: \ + ggml/src/ggml-cuda/ggml-cuda.cu \ + ggml/include/ggml-cuda.h \ + ggml/include/ggml.h \ + ggml/include/ggml-backend.h \ + ggml/src/ggml-backend-impl.h \ + ggml/src/ggml-common.h \ + $(wildcard ggml/src/ggml-cuda/*.cuh) + $(HIPCC) $(CXXFLAGS) $(HIPFLAGS) -x hip -c -o $@ $< + +ggml/src/ggml-cuda/%.o: \ + ggml/src/ggml-cuda/%.cu \ + ggml/include/ggml.h \ + ggml/src/ggml-common.h \ + ggml/src/ggml-cuda/common.cuh + $(HIPCC) $(CXXFLAGS) $(HIPFLAGS) -x hip -c -o $@ $< +endif # GGML_HIP + +ifdef GGML_MUSA + ifeq ($(wildcard /opt/musa),) + MUSA_PATH ?= /usr/local/musa + else + MUSA_PATH ?= /opt/musa + endif + MUSA_ARCHITECTURES ?= 21;22;31 + + MK_CPPFLAGS += -DGGML_USE_MUSA -DGGML_USE_CUDA + MK_LDFLAGS += -L$(MUSA_PATH)/lib -Wl,-rpath=$(MUSA_PATH)/lib + MK_LDFLAGS += -lmusa -lmusart -lmublas + + ifndef GGML_NO_OPENMP + # For Ubuntu Focal + MK_CPPFLAGS += -I/usr/lib/llvm-10/include/openmp + MK_LDFLAGS += -L/usr/lib/llvm-10/lib + # For Ubuntu Jammy + MK_CPPFLAGS += -I/usr/lib/llvm-14/lib/clang/14.0.0/include + MK_LDFLAGS += -L/usr/lib/llvm-14/lib + endif # GGML_NO_OPENMP + + CC := $(MUSA_PATH)/bin/clang + CXX := $(MUSA_PATH)/bin/clang++ + MCC := $(CCACHE) $(MUSA_PATH)/bin/mcc + + MUSAFLAGS = -fsigned-char -x musa -mtgpu + MUSAFLAGS += $(foreach arch,$(subst ;, ,$(MUSA_ARCHITECTURES)),--cuda-gpu-arch=mp_$(arch)) + +ifdef GGML_CUDA_FORCE_MMQ + MUSAFLAGS += -DGGML_CUDA_FORCE_MMQ +endif # GGML_CUDA_FORCE_MMQ + +ifdef GGML_CUDA_FORCE_CUBLAS + MUSAFLAGS += -DGGML_CUDA_FORCE_CUBLAS +endif # GGML_CUDA_FORCE_CUBLAS + +ifdef GGML_CUDA_F16 + MUSAFLAGS += -DGGML_CUDA_F16 +endif # GGML_CUDA_F16 + +ifdef GGML_CUDA_DMMV_F16 + MUSAFLAGS += -DGGML_CUDA_F16 +endif # GGML_CUDA_DMMV_F16 + +ifdef GGML_CUDA_PEER_MAX_BATCH_SIZE + MUSAFLAGS += 
-DGGML_CUDA_PEER_MAX_BATCH_SIZE=$(GGML_CUDA_PEER_MAX_BATCH_SIZE) +else + MUSAFLAGS += -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 +endif # GGML_CUDA_PEER_MAX_BATCH_SIZE + +ifdef GGML_CUDA_NO_PEER_COPY + MUSAFLAGS += -DGGML_CUDA_NO_PEER_COPY +endif # GGML_CUDA_NO_PEER_COPY + +ifdef GGML_CUDA_NO_FA + MUSAFLAGS += -DGGML_CUDA_NO_FA +endif # GGML_CUDA_NO_FA + +ifdef GGML_CUDA_FA_ALL_QUANTS + MUSAFLAGS += -DGGML_CUDA_FA_ALL_QUANTS +endif # GGML_CUDA_FA_ALL_QUANTS + + OBJ_GGML_EXT += ggml/src/ggml-cuda/ggml-cuda.o + OBJ_GGML_EXT += $(patsubst %.cu,%.o,$(wildcard ggml/src/ggml-cuda/*.cu)) + OBJ_GGML_EXT += $(OBJ_CUDA_TMPL) + +ggml/src/ggml-cuda/ggml-cuda.o: \ + ggml/src/ggml-cuda/ggml-cuda.cu \ + ggml/include/ggml-cuda.h \ + ggml/include/ggml.h \ + ggml/include/ggml-backend.h \ + ggml/src/ggml-backend-impl.h \ + ggml/src/ggml-common.h \ + $(wildcard ggml/src/ggml-cuda/*.cuh) + $(MCC) $(CXXFLAGS) $(MUSAFLAGS) -c -o $@ $< + +ggml/src/ggml-cuda/%.o: \ + ggml/src/ggml-cuda/%.cu \ + ggml/include/ggml.h \ + ggml/src/ggml-common.h \ + ggml/src/ggml-cuda/common.cuh + $(MCC) $(CXXFLAGS) $(MUSAFLAGS) -c -o $@ $< +endif # GGML_MUSA + +ifdef GGML_METAL + MK_CPPFLAGS += -DGGML_USE_METAL + MK_LDFLAGS += -framework Foundation -framework Metal -framework MetalKit + OBJ_GGML_EXT += ggml/src/ggml-metal/ggml-metal.o + +ifdef GGML_METAL_USE_BF16 + MK_CPPFLAGS += -DGGML_METAL_USE_BF16 +endif # GGML_METAL_USE_BF16 +ifdef GGML_METAL_NDEBUG + MK_CPPFLAGS += -DGGML_METAL_NDEBUG +endif +ifdef GGML_METAL_EMBED_LIBRARY + MK_CPPFLAGS += -DGGML_METAL_EMBED_LIBRARY + OBJ_GGML_EXT += ggml/src/ggml-metal-embed.o +endif +endif # GGML_METAL + +ifdef GGML_METAL +ggml/src/ggml-metal/ggml-metal.o: \ + ggml/src/ggml-metal/ggml-metal.m \ + ggml/src/ggml-metal/ggml-metal-impl.h \ + ggml/include/ggml-metal.h \ + ggml/include/ggml.h + $(CC) $(CFLAGS) -c $< -o $@ + +ifdef GGML_METAL_EMBED_LIBRARY +ggml/src/ggml-metal-embed.o: \ + ggml/src/ggml-metal/ggml-metal.metal \ + ggml/src/ggml-metal/ggml-metal-impl.h \ + ggml/src/ggml-common.h + @echo "Embedding Metal library" + @sed -e '/__embed_ggml-common.h__/r ggml/src/ggml-common.h' -e '/__embed_ggml-common.h__/d' < ggml/src/ggml-metal/ggml-metal.metal > ggml/src/ggml-metal/ggml-metal-embed.metal.tmp + @sed -e '/#include "ggml-metal-impl.h"/r ggml/src/ggml-metal/ggml-metal-impl.h' -e '/#include "ggml-metal-impl.h"/d' < ggml/src/ggml-metal/ggml-metal-embed.metal.tmp > ggml/src/ggml-metal/ggml-metal-embed.metal + $(eval TEMP_ASSEMBLY=$(shell mktemp -d)) + @echo ".section __DATA, __ggml_metallib" > $(TEMP_ASSEMBLY)/ggml-metal-embed.s + @echo ".globl _ggml_metallib_start" >> $(TEMP_ASSEMBLY)/ggml-metal-embed.s + @echo "_ggml_metallib_start:" >> $(TEMP_ASSEMBLY)/ggml-metal-embed.s + @echo ".incbin \"ggml/src/ggml-metal/ggml-metal-embed.metal\"" >> $(TEMP_ASSEMBLY)/ggml-metal-embed.s + @echo ".globl _ggml_metallib_end" >> $(TEMP_ASSEMBLY)/ggml-metal-embed.s + @echo "_ggml_metallib_end:" >> $(TEMP_ASSEMBLY)/ggml-metal-embed.s + $(CC) $(CFLAGS) -c $(TEMP_ASSEMBLY)/ggml-metal-embed.s -o $@ + @rm -f ${TEMP_ASSEMBLY}/ggml-metal-embed.s + @rmdir ${TEMP_ASSEMBLY} +endif +endif # GGML_METAL + +DIR_GGML = ggml +DIR_LLAMA = src +DIR_COMMON = common + +OBJ_GGML = \ + $(DIR_GGML)/src/ggml.o \ + $(DIR_GGML)/src/ggml-alloc.o \ + $(DIR_GGML)/src/ggml-backend.o \ + $(DIR_GGML)/src/ggml-backend-reg.o \ + $(DIR_GGML)/src/ggml-opt.o \ + $(DIR_GGML)/src/ggml-quants.o \ + $(DIR_GGML)/src/ggml-threading.o \ + $(DIR_GGML)/src/ggml-cpu/ggml-cpu.o \ + $(DIR_GGML)/src/ggml-cpu/ggml-cpu_cpp.o \ + 
$(DIR_GGML)/src/ggml-cpu/ggml-cpu-aarch64.o \ + $(DIR_GGML)/src/ggml-cpu/ggml-cpu-hbm.o \ + $(DIR_GGML)/src/ggml-cpu/ggml-cpu-quants.o \ + $(DIR_GGML)/src/ggml-cpu/ggml-cpu-traits.o \ + $(DIR_GGML)/src/ggml-cpu/binary-ops.o \ + $(DIR_GGML)/src/ggml-cpu/unary-ops.o \ + $(DIR_GGML)/src/gguf.o \ + $(OBJ_GGML_EXT) + +OBJ_LLAMA = \ + $(DIR_LLAMA)/llama.o \ + $(DIR_LLAMA)/llama-vocab.o \ + $(DIR_LLAMA)/llama-grammar.o \ + $(DIR_LLAMA)/llama-sampling.o \ + $(DIR_LLAMA)/llama-adapter.o \ + $(DIR_LLAMA)/llama-arch.o \ + $(DIR_LLAMA)/llama-batch.o \ + $(DIR_LLAMA)/llama-chat.o \ + $(DIR_LLAMA)/llama-context.o \ + $(DIR_LLAMA)/llama-graph.o \ + $(DIR_LLAMA)/llama-hparams.o \ + $(DIR_LLAMA)/llama-impl.o \ + $(DIR_LLAMA)/llama-io.o \ + $(DIR_LLAMA)/llama-kv-cache.o \ + $(DIR_LLAMA)/llama-mmap.o \ + $(DIR_LLAMA)/llama-model.o \ + $(DIR_LLAMA)/llama-model-loader.o \ + $(DIR_LLAMA)/llama-quant.o \ + $(DIR_LLAMA)/unicode.o \ + $(DIR_LLAMA)/unicode-data.o + +# $(DIR_LLAMA)/llama-context.o \ + +OBJ_COMMON = \ + $(DIR_COMMON)/common.o \ + $(DIR_COMMON)/arg.o \ + $(DIR_COMMON)/log.o \ + $(DIR_COMMON)/console.o \ + $(DIR_COMMON)/ngram-cache.o \ + $(DIR_COMMON)/sampling.o \ + $(DIR_COMMON)/speculative.o \ + $(DIR_COMMON)/chat.o \ + $(DIR_COMMON)/build-info.o \ + $(DIR_COMMON)/json-schema-to-grammar.o + +OBJ_ALL = $(OBJ_GGML) $(OBJ_LLAMA) $(OBJ_COMMON) + +LIB_GGML = $(LIB_PRE)ggml$(DSO_EXT) +LIB_GGML_S = $(LIB_PRE)ggml.a + +LIB_LLAMA = $(LIB_PRE)llama$(DSO_EXT) +LIB_LLAMA_S = $(LIB_PRE)llama.a + +LIB_COMMON = $(LIB_PRE)common$(DSO_EXT) +LIB_COMMON_S = $(LIB_PRE)common.a + +LIB_ALL = $(LIB_GGML) $(LIB_LLAMA) $(LIB_COMMON) +LIB_ALL_S = $(LIB_GGML_S) $(LIB_LLAMA_S) $(LIB_COMMON_S) + +GF_CC := $(CC) +include scripts/get-flags.mk + +# combine build flags with cmdline overrides +override CPPFLAGS := $(MK_CPPFLAGS) $(CPPFLAGS) +override CFLAGS := $(CPPFLAGS) $(MK_CFLAGS) $(GF_CFLAGS) $(CFLAGS) +BASE_CXXFLAGS := $(MK_CXXFLAGS) $(CXXFLAGS) +override CXXFLAGS := $(BASE_CXXFLAGS) $(HOST_CXXFLAGS) $(GF_CXXFLAGS) $(CPPFLAGS) +override NVCCFLAGS := $(MK_NVCCFLAGS) $(NVCCFLAGS) +override LDFLAGS := $(MK_LDFLAGS) $(LDFLAGS) + +# identify CUDA host compiler +ifdef GGML_CUDA +GF_CC := $(NVCC) $(NVCCFLAGS) 2>/dev/null .c -Xcompiler +include scripts/get-flags.mk +CUDA_CXXFLAGS := $(BASE_CXXFLAGS) $(GF_CXXFLAGS) -Wno-pedantic +endif + +ifdef LLAMA_CURL +override CXXFLAGS := $(CXXFLAGS) -DLLAMA_USE_CURL +override LDFLAGS := $(LDFLAGS) -lcurl +endif + +# +# Print build information +# + +$(info I llama.cpp build info: ) +$(info I UNAME_S: $(UNAME_S)) +$(info I UNAME_P: $(UNAME_P)) +$(info I UNAME_M: $(UNAME_M)) +$(info I CFLAGS: $(CFLAGS)) +$(info I CXXFLAGS: $(CXXFLAGS)) +$(info I NVCCFLAGS: $(NVCCFLAGS)) +$(info I LDFLAGS: $(LDFLAGS)) +ifneq ($(UNAME_S),cosmocc) +$(info I CC: $(shell $(CC) --version | head -n 1)) +$(info I CXX: $(shell $(CXX) --version | head -n 1)) +endif +ifdef GGML_CUDA +$(info I NVCC: $(shell $(NVCC) --version | tail -n 1)) +CUDA_VERSION := $(shell $(NVCC) --version | grep -oP 'release (\K[0-9]+\.[0-9])') +ifeq ($(shell awk -v "v=$(CUDA_VERSION)" 'BEGIN { print (v < 11.7) }'),1) + +ifndef CUDA_DOCKER_ARCH +ifndef CUDA_POWER_ARCH +$(error I ERROR: For CUDA versions < 11.7 a target CUDA architecture must be explicitly provided via environment variable CUDA_DOCKER_ARCH, e.g. by running "export CUDA_DOCKER_ARCH=compute_XX" on Unix-like systems, where XX is the minimum compute capability that the code needs to run on. 
A list with compute capabilities can be found here: https://developer.nvidia.com/cuda-gpus ) +endif # CUDA_POWER_ARCH +endif # CUDA_DOCKER_ARCH + +endif # eq ($(shell echo "$(CUDA_VERSION) < 11.7" | bc),1) +endif # GGML_CUDA +$(info ) + +ifdef DEPRECATE_WARNING +$(info !!! DEPRECATION WARNING !!!) +$(info The following LLAMA_ options are deprecated and will be removed in the future. Use the GGML_ prefix instead) +$(info - LLAMA_CUDA) +$(info - LLAMA_METAL) +$(info - LLAMA_METAL_EMBED_LIBRARY) +$(info - LLAMA_OPENMP) +$(info - LLAMA_RPC) +$(info - LLAMA_SYCL) +$(info - LLAMA_SYCL_F16) +$(info - LLAMA_OPENBLAS) +$(info - LLAMA_OPENBLAS64) +$(info - LLAMA_BLIS) +$(info - LLAMA_NO_LLAMAFILE) +$(info - LLAMA_NO_ACCELERATE) +$(info - LLAMA_NO_OPENMP) +$(info - LLAMA_NO_METAL) +$(info - LLAMA_NO_CCACHE) +$(info ) +endif + +ifdef REMOVE_WARNING +$(info !!! REMOVAL WARNING !!!) +$(info The following LLAMA_ options have been removed and are no longer supported) +$(info - LLAMA_DISABLE_LOGS (https://github.com/ggml-org/llama.cpp/pull/9418)) +$(info - LLAMA_SERVER_VERBOSE (https://github.com/ggml-org/llama.cpp/pull/9418)) +$(info ) +endif + +# +# Build libraries +# + +# Libraries +LIB_GGML = libggml.so +LIB_GGML_S = libggml.a + +LIB_LLAMA = libllama.so +LIB_LLAMA_S = libllama.a + +LIB_COMMON = libcommon.so +LIB_COMMON_S = libcommon.a + +# Targets +BUILD_TARGETS += $(LIB_GGML) $(LIB_GGML_S) $(LIB_LLAMA) $(LIB_LLAMA_S) $(LIB_COMMON) $(LIB_COMMON_S) + +# Dependency files +DEP_FILES = $(OBJ_GGML:.o=.d) $(OBJ_LLAMA:.o=.d) $(OBJ_COMMON:.o=.d) + +# Default target +all: $(BUILD_TARGETS) + +# force c++ build for source file that have same name as c file +# Note: need this exception because `ggml-cpu.c` and `ggml-cpu.cpp` both produce the same obj/dep files +$(DIR_GGML)/%_cpp.o: $(DIR_GGML)/%.cpp + $(CXX) $(CXXFLAGS) -MMD -c $< -o $@ + +# Rules for building object files +$(DIR_GGML)/%.o: $(DIR_GGML)/%.c + $(CC) $(CFLAGS) -MMD -c $< -o $@ + +$(DIR_GGML)/%.o: $(DIR_GGML)/%.cpp + $(CXX) $(CXXFLAGS) -MMD -c $< -o $@ + +$(DIR_LLAMA)/%.o: $(DIR_LLAMA)/%.cpp + $(CXX) $(CXXFLAGS) -MMD -c $< -o $@ + +$(DIR_COMMON)/%.o: $(DIR_COMMON)/%.cpp + $(CXX) $(CXXFLAGS) -MMD -c $< -o $@ + +# Rules for building libraries +$(LIB_GGML): $(OBJ_GGML) + $(CXX) $(CXXFLAGS) -shared -fPIC -o $@ $^ $(LDFLAGS) + +$(LIB_GGML_S): $(OBJ_GGML) + ar rcs $(LIB_GGML_S) $^ + +$(LIB_LLAMA): $(OBJ_LLAMA) $(LIB_GGML) + $(CXX) $(CXXFLAGS) -shared -fPIC -o $@ $^ $(LDFLAGS) + +$(LIB_LLAMA_S): $(OBJ_LLAMA) + ar rcs $(LIB_LLAMA_S) $^ + +$(LIB_COMMON): $(OBJ_COMMON) $(LIB_LLAMA) $(LIB_GGML) + $(CXX) $(CXXFLAGS) -shared -fPIC -o $@ $^ $(LDFLAGS) + +$(LIB_COMMON_S): $(OBJ_COMMON) + ar rcs $(LIB_COMMON_S) $^ + +# Include dependency files +-include $(DEP_FILES) + +# Clean generated server assets +clean-server-assets: + find examples/server -type f -name "*.js.hpp" -delete + find examples/server -type f -name "*.mjs.hpp" -delete + find examples/server -type f -name "*.css.hpp" -delete + find examples/server -type f -name "*.html.hpp" -delete + +# Clean rule +clean: clean-server-assets + rm -vrf $(BUILD_TARGETS) $(TEST_TARGETS) + rm -rvf *.a *.dll *.so *.dot + find ggml src common tests examples pocs -type f -name "*.o" -delete + find ggml src common tests examples pocs -type f -name "*.d" -delete + +# +# Examples +# + +# $< is the first prerequisite, i.e. the source file. +# Explicitly compile this to an object file so that it can be cached with ccache. 
+# The source file is then filtered out from $^ (the list of all prerequisites) and the object file is added instead. + +# Helper function that replaces .c, .cpp, and .cu file endings with .o: +GET_OBJ_FILE = $(patsubst %.c,%.o,$(patsubst %.cpp,%.o,$(patsubst %.cu,%.o,$(1)))) + +llama-cli: examples/main/main.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + @echo + @echo '==== Run ./llama-cli -h for help. ====' + @echo + +llama-infill: examples/infill/infill.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-run: \ + examples/run/run.cpp \ + examples/run/linenoise.cpp/linenoise.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-run-orig: examples/run/run.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-simple: examples/simple/simple.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-simple-chat: examples/simple-chat/simple-chat.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-tokenize: examples/tokenize/tokenize.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-batched: examples/batched/batched.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-batched-bench: examples/batched-bench/batched-bench.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-quantize: examples/quantize/quantize.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-quantize-stats: examples/quantize-stats/quantize-stats.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-perplexity: examples/perplexity/perplexity.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-imatrix: examples/imatrix/imatrix.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-embedding: examples/embedding/embedding.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-gritlm: examples/gritlm/gritlm.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-save-load-state: 
examples/save-load-state/save-load-state.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-gguf: examples/gguf/gguf.cpp \ + $(OBJ_GGML) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +examples/gguf-hash/deps/sha1/sha1.o: \ + examples/gguf-hash/deps/sha1/sha1.c + $(CC) $(CFLAGS) -Iexamples/gguf-hash/deps -c $< -o $@ + +examples/gguf-hash/deps/xxhash/xxhash.o: \ + examples/gguf-hash/deps/xxhash/xxhash.c + $(CC) $(CFLAGS) -Iexamples/gguf-hash/deps -c $< -o $@ + +examples/gguf-hash/deps/sha256/sha256.o: \ + examples/gguf-hash/deps/sha256/sha256.c + $(CC) $(CFLAGS) -Iexamples/gguf-hash/deps -c $< -o $@ + +llama-gguf-hash: examples/gguf-hash/gguf-hash.cpp examples/gguf-hash/deps/sha1/sha1.o examples/gguf-hash/deps/xxhash/xxhash.o examples/gguf-hash/deps/sha256/sha256.o\ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -Iexamples/gguf-hash/deps -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-gguf-split: examples/gguf-split/gguf-split.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-eval-callback: examples/eval-callback/eval-callback.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-cvector-generator: examples/cvector-generator/cvector-generator.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-convert-llama2c-to-ggml: examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-bench: examples/llama-bench/llama-bench.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-export-lora: examples/export-lora/export-lora.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-retrieval: examples/retrieval/retrieval.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-speculative: examples/speculative/speculative.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-parallel: examples/parallel/parallel.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-lookahead: examples/lookahead/lookahead.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-lookup: examples/lookup/lookup.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ 
$(LDFLAGS) + +llama-lookup-create: examples/lookup/lookup-create.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-lookup-merge: examples/lookup/lookup-merge.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-lookup-stats: examples/lookup/lookup-stats.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-passkey: examples/passkey/passkey.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-gbnf-validator: examples/gbnf-validator/gbnf-validator.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +ifdef GGML_RPC +rpc-server: examples/rpc/rpc-server.cpp \ + $(OBJ_GGML) + $(CXX) $(CXXFLAGS) $^ -o $@ $(LDFLAGS) +endif # GGML_RPC + +llama-server: \ + examples/server/server.cpp \ + examples/server/httplib.h \ + common/chat.h \ + common/minja/chat-template.hpp \ + common/json.hpp \ + common//minja/minja.hpp \ + $(OBJ_ALL) + cmake -DINPUT=examples/server/public/index.html.gz -DOUTPUT=examples/server/index.html.gz.hpp -P scripts/xxd.cmake + cmake -DINPUT=examples/server/public_legacy/index.html -DOUTPUT=examples/server/index.html.hpp -P scripts/xxd.cmake + cmake -DINPUT=examples/server/public_legacy/loading.html -DOUTPUT=examples/server/loading.html.hpp -P scripts/xxd.cmake + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h %.hpp $<,$^) -Iexamples/server $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) $(LWINSOCK2) + +llama-server-orig: \ + examples/server/server.cpp \ + examples/server/utils.hpp \ + examples/server/httplib.h \ + examples/server/index.html.hpp \ + examples/server/loading.html.hpp \ + common/chat.cpp \ + common/chat.h \ + common/chat-template.hpp \ + common/json.hpp \ + common/minja.hpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h %.hpp $<,$^) -Iexamples/server $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) $(LWINSOCK2) + +# Portable equivalent of `cd examples/server/public && xxd -i $(notdir $<) ../$(notdir $<).hpp`: +examples/server/%.hpp: examples/server/public/% FORCE Makefile + @( export NAME=$(subst .,_,$(subst -,_,$(notdir $<))) && \ + echo "unsigned char $${NAME}[] = {" && \ + cat $< | od -v -t x1 -An | sed -E 's/([0-9a-fA-F]+)/0x\1, /g' && \ + echo "};" && \ + echo "unsigned int $${NAME}_len = $(shell cat $< | wc -c );" \ + ) > $@ + +llama-gen-docs: examples/gen-docs/gen-docs.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +libllava.a: examples/llava/llava.cpp \ + examples/llava/llava.h \ + examples/llava/clip.cpp \ + examples/llava/clip.h \ + common/stb_image.h \ + common/base64.hpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -static -fPIC -c $< -o $@ -Wno-cast-qual + +llama-llava-cli: examples/llava/llava-cli.cpp \ + examples/llava/llava.cpp \ + examples/llava/llava.h \ + examples/llava/clip.cpp \ + examples/llava/clip.h \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) $< $(filter-out %.h $<,$^) -o $@ $(LDFLAGS) 
-Wno-cast-qual + +llama-minicpmv-cli: examples/llava/minicpmv-cli.cpp \ + examples/llava/llava.cpp \ + examples/llava/llava.h \ + examples/llava/clip.cpp \ + examples/llava/clip.h \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) $< $(filter-out %.h $<,$^) -o $@ $(LDFLAGS) -Wno-cast-qual + +llama-qwen2vl-cli: examples/llava/qwen2vl-cli.cpp \ + examples/llava/llava.cpp \ + examples/llava/llava.h \ + examples/llava/clip.cpp \ + examples/llava/clip.h \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) $< $(filter-out %.h $<,$^) -o $@ $(LDFLAGS) -Wno-cast-qual + +ifeq ($(UNAME_S),Darwin) +swift: examples/batched.swift + (cd examples/batched.swift; make build) +endif + +common/build-info.cpp: $(wildcard .git/index) scripts/build-info.sh + @sh scripts/build-info.sh "$(CC)" > $@.tmp + @if ! cmp -s $@.tmp $@; then \ + mv $@.tmp $@; \ + else \ + rm $@.tmp; \ + fi + +common/build-info.o: common/build-info.cpp + $(CXX) $(CXXFLAGS) -c $(filter-out %.h,$^) -o $@ + +# +# Tests +# + +tests: $(TEST_TARGETS) + +tests/test-arg-parser: tests/test-arg-parser.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +tests/test-llama-grammar: tests/test-llama-grammar.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +tests/test-log: tests/test-log.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +tests/test-grammar-parser: tests/test-grammar-parser.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +tests/test-grammar-integration: tests/test-grammar-integration.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +tests/test-double-float: tests/test-double-float.cpp + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +tests/test-json-schema-to-grammar: tests/test-json-schema-to-grammar.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -Iexamples/server -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +tests/test-chat: tests/test-chat.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -Iexamples/server -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +tests/test-opt: tests/test-opt.cpp \ + $(OBJ_GGML) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +tests/test-quantize-fns: tests/test-quantize-fns.cpp \ + $(OBJ_GGML) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +tests/test-quantize-perf: tests/test-quantize-perf.cpp \ + $(OBJ_GGML) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +tests/test-sampling: tests/test-sampling.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + 
+tests/test-tokenizer-0: tests/test-tokenizer-0.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +tests/test-tokenizer-1-bpe: tests/test-tokenizer-1-bpe.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +tests/test-tokenizer-1-spm: tests/test-tokenizer-1-spm.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +tests/test-rope: tests/test-rope.cpp ggml/src/ggml.o \ + $(OBJ_GGML) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +tests/test-c.o: tests/test-c.c include/llama.h + $(CC) $(CFLAGS) -c $(filter-out %.h,$^) -o $@ + +tests/test-backend-ops: tests/test-backend-ops.cpp \ + $(OBJ_GGML) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +tests/test-model-load-cancel: tests/test-model-load-cancel.cpp tests/get-model.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +tests/test-autorelease: tests/test-autorelease.cpp tests/get-model.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +tests/test-chat-template: tests/test-chat-template.cpp \ + $(OBJ_ALL) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +# +# PoCs +# + +llama-vdot: pocs/vdot/vdot.cpp ggml/src/ggml.o \ + $(OBJ_GGML) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +llama-q8dot: pocs/vdot/q8dot.cpp ggml/src/ggml.o \ + $(OBJ_GGML) + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<) + $(CXX) $(CXXFLAGS) $(filter-out $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) + +# +# Deprecated binaries that we want to keep around long enough for people to migrate to the new filenames, then these can be removed. +# +# Mark legacy binary targets as .PHONY so that they are always checked. +.PHONY: FORCE main quantize perplexity embedding server + +# Define the object file target +examples/deprecation-warning/deprecation-warning.o: examples/deprecation-warning/deprecation-warning.cpp + $(CXX) $(CXXFLAGS) -c $< -o $@ + +# NOTE: We currently will always build the deprecation-warning `main` and `server` binaries to help users migrate. +# Eventually we will want to remove these target from building all the time. +main: examples/deprecation-warning/deprecation-warning.o + $(CXX) $(CXXFLAGS) $< -o $@ $(LDFLAGS) + @echo "NOTICE: The 'main' binary is deprecated. Please use 'llama-cli' instead." + +server: examples/deprecation-warning/deprecation-warning.o + $(CXX) $(CXXFLAGS) $< -o $@ $(LDFLAGS) + @echo "NOTICE: The 'server' binary is deprecated. Please use 'llama-server' instead." + +quantize: examples/deprecation-warning/deprecation-warning.o +ifneq (,$(wildcard quantize)) + $(CXX) $(CXXFLAGS) $< -o $@ $(LDFLAGS) + @echo "#########" + @echo "WARNING: The 'quantize' binary is deprecated. Please use 'llama-quantize' instead." 
+ @echo " Remove the 'quantize' binary to remove this warning." + @echo "#########" +endif + +perplexity: examples/deprecation-warning/deprecation-warning.o +ifneq (,$(wildcard perplexity)) + $(CXX) $(CXXFLAGS) $< -o $@ $(LDFLAGS) + @echo "#########" + @echo "WARNING: The 'perplexity' binary is deprecated. Please use 'llama-perplexity' instead." + @echo " Remove the 'perplexity' binary to remove this warning." + @echo "#########" +endif + +embedding: examples/deprecation-warning/deprecation-warning.o +ifneq (,$(wildcard embedding)) + $(CXX) $(CXXFLAGS) $< -o $@ $(LDFLAGS) + @echo "#########" + @echo "WARNING: The 'embedding' binary is deprecated. Please use 'llama-embedding' instead." + @echo " Remove the 'embedding' binary to remove this warning." + @echo "#########" +endif diff --git a/README.md b/README.md new file mode 100644 index 0000000000000..8217d819a7848 --- /dev/null +++ b/README.md @@ -0,0 +1,75 @@ +## llama-server-one +Based on [llama.cpp](https://github.com/ggml-org/llama.cpp). + +Brad Hutchings
+brad@bradhutchings.com + + + +--- +### Project Goals + +The goal of this project is to build a single `llama-server-one executable` file that can run "anywhere": +- x86_64 Windows +- x86_64 Linux +- ARM Windows +- ARM Linux +- ARM MacOS + +I am inspired by the [llamafile project](https://github.com/Mozilla-Ocho/llamafile). The main drawback of that project is that it has not kept up-to-date with llama.cpp and therefore, does not always support the latest models when llama.cpp supports them. Support for new models in llamafile takes work and time. + +I want to use the MIT license as used by llama.cpp. + +GPU support is not important to me and can be handled by platform specific builds of llama.cpp. CPU inference is quite adequate for many private end-user applications. + +The ability to package support files, such as a custom web, UI into the executable file is important to me. This is implemented. + +The ability to package default arguments, in an "args" file, into the executable file is important to me. This is implemented. + +The ability to read arguments from a file adjacent to the executable file is important to me. This is implemented. + +The ability to package a gguf model into the executable file is important to me. This is not implemented yet. + +I welcome any of my changes being implemented in the official llama.cpp. + +--- +### Documentation +Follow these guides in order to build, package, and deploy `llama-server-one`: +- My start-to-finish guide for building `llama-server` with Cosmo is in the [Building-ls1.md](docs/Building-ls1.md) file. +- My guide for configuring a `llama-server-one` executable is in the [Configuring-ls1.md](docs/Configuring-ls1.md) file. +- My guide for packaging a `llama-server-one` executable for deployment is in the [Packaging-ls1.md](docs/Packaging-ls1.md) file. + +--- +### Modifications to llama.cpp + +To get this from the llama.cpp source base, there are few files that need to be modified: + +1. [Makefile](Makefile) -- Extensive modifications to bring up to date, as it is deprecated in favor of a CMake system, and to support COSMOCC. + +2. [src/llama-context.cpp](src/llama-context.cpp) -- COSMOCC doesn't have std::fill in its Standard Templates Library. + +3. [examples/server/server.cpp](examples/server/server.cpp) -- Support embedded or adjacent "args" file, fix Cosmo name conflict with "defer" task member, add additional meta data to `model_meta`. + +--- +### Reference + +Here are some projects and pages you should be familiar with if you want to get the most out of `llama-server-one`: +- [llama.cpp](https://github.com/ggml-org/llama.cpp) - Georgi Gerganov and his team are the rock stars who are making the plumbing so LLMs can be available for developers of all kinds. The `llama.cpp` project is the industry standard for inference. I only fork it here because I want to make it a little better for my applications while preserving all its goodness. +- [llamafile](https://github.com/Mozilla-Ocho/llamafile) - `Llamafile` lets you distribute and run LLMs with a single file. It is a Mozilla Foundation project that brough the Cosmopolitan C Library and llama.cpp together. It has some popular GPU support. It is based on an older version of llama.cpp and does not support all of the latest models supported by llama.cpp. Llamafile is an inspiration for this project. 
+- [Cosmopolitan Libc](https://github.com/jart/cosmopolitan) - `Cosmopolitan` is a project for building cross-platform binaries that run on x86_64 and ARM architectures, supporting Linux, Windows, macOS, and other operating systems. Like `llamafile`, I use Cosmo compile cross-platform executables of `llama.cpp` targets, including `llama-server`. +- [Actually Portable Executable (APE) Specification](https://github.com/jart/cosmopolitan/blob/master/ape/specification.md) - Within the Cosmopolitan Libc repo is documentation about how the cross CPU, cross platform executable works. +- [Brad's LLMs](https://huggingface.co/bradhutchings/Brads-LLMs) - I share private local LLMs built with `llamafile` in a Hugging Face repo. + +--- +### To Do List + +In no particular order of importance, these are the things that bother me: +- Package gguf file into executable file. The zip item needs to be aligned for mmap. There is a zipalign.c tool source in llamafile that seems loosely inspired by the Android zipalign too. I feel like there should be a more generic solution for this problem. +- GPU support without a complicated kludge, and that can support all supported platform / CPU / GPU triads. Perhaps a plugin system with shared library dispatch? Invoking dev tools on Apple Metal like llamafile does is "complicated". +- Code signing instructions. Might have to sign executables within the zip package, plus the package itself. +- Clean up remaining build warnings, either by fixing source (i.e. Cosmo) or finding the magical compiler flags. +- Copy the `cosmo_args` function into `server.cpp` so it could potentially be incorporated upstream in non-Cosmo builds. `common/arg2.cpp` might be a good landing spot. License in [Cosmo source code](https://github.com/jart/cosmopolitan/blob/master/tool/args/args2.c) appears to be MIT compatible with attribution. +- The `--ctx-size` parameter doesn't seem quite right given that new models have the training (or max) context size in their metadata. That size should be used subject to a maximum in a passed parameter. E.g. So a 128K model can run comfortably on a smaller device. +- Write docs for a Deploying step. It should address the args file, removing the extra executable depending on platform, models, host, port. context size. From 30d11b3790b65e205faa6c26af47e385f278908e Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Mon, 31 Mar 2025 14:15:54 -0700 Subject: [PATCH 07/73] Create Configuring-ls1-Brads-Env.md Signed-off-by: Brad Hutchings --- docs/Configuring-ls1-Brads-Env.md | 204 ++++++++++++++++++++++++++++++ 1 file changed, 204 insertions(+) create mode 100644 docs/Configuring-ls1-Brads-Env.md diff --git a/docs/Configuring-ls1-Brads-Env.md b/docs/Configuring-ls1-Brads-Env.md new file mode 100644 index 0000000000000..0399e0fef1884 --- /dev/null +++ b/docs/Configuring-ls1-Brads-Env.md @@ -0,0 +1,204 @@ +## Configuring llama-server-one in my Environment + +Brad Hutchings
+brad@bradhutchings.com + +This file contains instructions for configuring the `llama-server-one` executable to make it ready to package for multiple platforms. +Instructioons have been customized for my environment. You should use these [Configuring Instructions](Configuring-ls1.md). + +--- +### Environment Variables + +Let's define some environment variables: +``` +BUILDING_DIR="1-BUILDING-llama.cpp" +CONFIGURING_DIR="2-CONFIGURING-llama-server-one" + +LLAMA_SERVER="llama-server" +LLAMA_SERVER_ONE="llama-server-one" +LLAMA_SERVER_ONE_ZIP="llama-server-one.zip" +DEFAULT_ARGS="default-args" +printf "\n**********\n*\n* FINISHED: Environment Variables.\n*\n**********\n\n" +``` + +--- +### Create Configuration Directory + +Next, let's create a directory where we'll configure `llama-server-one`: +``` +cd ~ +rm -r -f ~/$CONFIGURING_DIR +mkdir -p $CONFIGURING_DIR +cp ~/$BUILDING_DIR/$LLAMA_SERVER \ + ~/$CONFIGURING_DIR/$LLAMA_SERVER_ONE_ZIP + +cd ~/$CONFIGURING_DIR +printf "\n**********\n*\n* FINISHED: Create Configuration Directory.\n*\n**********\n\n" +``` + +--- +### Examine Contents of Zip Archive + +Look at the contents of the `llama-server-one` zip archive: +``` +unzip -l $LLAMA_SERVER_ONE_ZIP +printf "\n**********\n*\n* FINISHED: Examine Contents of Zip Archive.\n*\n**********\n\n" +``` + +--- +### Delete Extraneous Timezone Files + +You should notice a bunch of extraneous timezone related files in `/usr/*`. Let's get rid of those: +``` +zip -d $LLAMA_SERVER_ONE_ZIP "/usr/*" +printf "\n**********\n*\n* FINISHED: Delete Extraneous Timezone Files.\n*\n**********\n\n" +``` + +--- +### Verify Contents of Zip Archive + +Verify that these files are no longer in the archive: +``` +unzip -l $LLAMA_SERVER_ONE_ZIP +printf "\n**********\n*\n* FINISHED: Verify Contents of Zip Archive.\n*\n**********\n\n" +``` + +--- +### OPTIONAL: Create website Directory in Archive + +`llama.cpp` has a built in chat UI. If you'd like to provide a custom UI, you should add a `website` directory to the `llama-server-one` archive. `llama.cpp`'s chat UI is optimized for serving inside the project's source code. But we can copy the unoptimized source: +``` +mkdir website +cp -r /mnt/hyperv/web-apps/completion-tool/* website +rm website/*.txt +rm website/images/*.svg +rm website/images/*.psd +zip -0 -r $LLAMA_SERVER_ONE_ZIP website/* +printf "\n**********\n*\n* FINISHED: Create website Directory in Archive.\n*\n**********\n\n" +``` + +#### OPTONAL: Verify website Directory in Archive + +Verify that the archive has your website: +``` +unzip -l $LLAMA_SERVER_ONE_ZIP +printf "\n**********\n*\n* FINISHED: Verify website Directory in Archive.\n*\n**********\n\n" +``` +--- +### Create default-args File + +A `default-args` file in the archive can specify sane default parameters. The format of the file is parameter name on a line, parameter value on a line, rinse, repeat. End the file with a `...` line to include user specified parameters. + +We don't yet support including the model inside the zip archive (yet). That has a 4GB size limitation on Windows anyway, as `.exe` files cannot exceed 4GB. So let's use an adjacent file called `model.gguf`. + +We will serve on localhost, port 8080 by default for safety. The `--ctx-size` parameter is the size of the context window. This is kinda screwy to have as a set size rather than a maximum because the `.gguf` files now have the training context size in metadata. We set it to 8192 to be sensible. 
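
_Aside (illustrative only): once the file exists (it is created by the next block), you can preview how its name/value lines collapse into an ordinary command line. The executable performs the equivalent expansion itself; the expected output shown here assumes the plain (no website) variant of the file._
```
# Rough illustration of the format: flatten the lines and drop the trailing "..." marker.
tr '\n' ' ' < default-args | sed 's/ *\.\.\..*$//'
# Expected: -m model.gguf --host 127.0.0.1 --port 8080 --ctx-size 8192
```
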
+``` +cat << EOF > $DEFAULT_ARGS +-m +model.gguf +--host +127.0.0.1 +--port +8080 +--ctx-size +8192 +... +EOF +printf "\n**********\n*\n* FINISHED: Create Default args File.\n*\n**********\n\n" +``` + +#### OPTIONAL: Create default-args File with Website + +If you added a website to the archive, use this instead: +``` +cat << EOF > $DEFAULT_ARGS +-m +model.gguf +--host +127.0.0.1 +--port +8080 +--ctx-size +8192 +--path +/zip/website +... +EOF +printf "\n**********\n*\n* FINISHED: Create Default args File with Website.\n*\n**********\n\n" +``` + +--- +### Add default-args File to Archive + +Add the `default-args` file to the archive: +``` +zip -0 -r $LLAMA_SERVER_ONE_ZIP $DEFAULT_ARGS +printf "\n**********\n*\n* FINISHED: Add default-args File to Archive.\n*\n**********\n\n" +``` + +--- +### Verify default-args File in Archive + +Verify that the archive contains the `default-args` file: +``` +unzip -l $LLAMA_SERVER_ONE_ZIP +printf "\n**********\n*\n* FINISHED: Verify default-args File in Archive.\n*\n**********\n\n" +``` + +--- +### Remove .zip Extension + +Remove the `.zip` from our working file: +``` +mv $LLAMA_SERVER_ONE_ZIP $LLAMA_SERVER_ONE +printf "\n**********\n*\n* FINISHED: Remove .zip Extension.\n*\n**********\n\n" +``` + +--- +### Download Model + +Let's download a small model. We'll use Google Gemma 1B Instruct v3, a surprisingly capable tiny model. +``` +MODEL_FILE="Google-Gemma-1B-Instruct-v3-q8_0.gguf" +cp /mnt/hyperv/$MODEL_FILE model.gguf +printf "\n**********\n*\n* FINISHED: Download Model.\n*\n**********\n\n" +``` + +--- +### Test Run + +Now we can test run `llama-server-one`, listening on localhost:8080. +``` +./$LLAMA_SERVER_ONE +``` + +After starting up and loading the model, it should display: + +**main: server is listening on http://127.0.0.1:8080 - starting the main loop**
+**srv update_slots: all slots are idle** + +Hit `ctrl-C` on your keyboard to stop it. + +--- +### Test Run on Public Interfaces + +If you'd like it to listen on all available interfaces, so you can connect from a browser on another computer: +``` +./$LLAMA_SERVER_ONE --host 0.0.0.0 +``` + +After starting up and loading the model, it should display: + +**main: server is listening on http://0.0.0.0:8080 - starting the main loop**
+**srv update_slots: all slots are idle** + +Hit `ctrl-C` on your keyboard to stop it. + +--- +### Copy llama-server-one for Deployment +Congratulations! You are ready to copy `llams-server-one` executable to the share for deployment. + +``` +sudo cp llama-server-one /mnt/hyperv/Mmojo-Raspberry-Pi/Mmojo-LLMs +printf "\n**********\n*\n* FINISHED: Copy llama-server-one for Deployment.\n*\n**********\n\n" +``` From 1ac6ea9c950a787d989b6ffb4b30ce7ad31efc87 Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Mon, 31 Mar 2025 14:42:49 -0700 Subject: [PATCH 08/73] Delete docs/Configuring-ls1-Brads-Env.md Signed-off-by: Brad Hutchings --- docs/Configuring-ls1-Brads-Env.md | 204 ------------------------------ 1 file changed, 204 deletions(-) delete mode 100644 docs/Configuring-ls1-Brads-Env.md diff --git a/docs/Configuring-ls1-Brads-Env.md b/docs/Configuring-ls1-Brads-Env.md deleted file mode 100644 index 0399e0fef1884..0000000000000 --- a/docs/Configuring-ls1-Brads-Env.md +++ /dev/null @@ -1,204 +0,0 @@ -## Configuring llama-server-one in my Environment - -Brad Hutchings
-brad@bradhutchings.com - -This file contains instructions for configuring the `llama-server-one` executable to make it ready to package for multiple platforms. -Instructioons have been customized for my environment. You should use these [Configuring Instructions](Configuring-ls1.md). - ---- -### Environment Variables - -Let's define some environment variables: -``` -BUILDING_DIR="1-BUILDING-llama.cpp" -CONFIGURING_DIR="2-CONFIGURING-llama-server-one" - -LLAMA_SERVER="llama-server" -LLAMA_SERVER_ONE="llama-server-one" -LLAMA_SERVER_ONE_ZIP="llama-server-one.zip" -DEFAULT_ARGS="default-args" -printf "\n**********\n*\n* FINISHED: Environment Variables.\n*\n**********\n\n" -``` - ---- -### Create Configuration Directory - -Next, let's create a directory where we'll configure `llama-server-one`: -``` -cd ~ -rm -r -f ~/$CONFIGURING_DIR -mkdir -p $CONFIGURING_DIR -cp ~/$BUILDING_DIR/$LLAMA_SERVER \ - ~/$CONFIGURING_DIR/$LLAMA_SERVER_ONE_ZIP - -cd ~/$CONFIGURING_DIR -printf "\n**********\n*\n* FINISHED: Create Configuration Directory.\n*\n**********\n\n" -``` - ---- -### Examine Contents of Zip Archive - -Look at the contents of the `llama-server-one` zip archive: -``` -unzip -l $LLAMA_SERVER_ONE_ZIP -printf "\n**********\n*\n* FINISHED: Examine Contents of Zip Archive.\n*\n**********\n\n" -``` - ---- -### Delete Extraneous Timezone Files - -You should notice a bunch of extraneous timezone related files in `/usr/*`. Let's get rid of those: -``` -zip -d $LLAMA_SERVER_ONE_ZIP "/usr/*" -printf "\n**********\n*\n* FINISHED: Delete Extraneous Timezone Files.\n*\n**********\n\n" -``` - ---- -### Verify Contents of Zip Archive - -Verify that these files are no longer in the archive: -``` -unzip -l $LLAMA_SERVER_ONE_ZIP -printf "\n**********\n*\n* FINISHED: Verify Contents of Zip Archive.\n*\n**********\n\n" -``` - ---- -### OPTIONAL: Create website Directory in Archive - -`llama.cpp` has a built in chat UI. If you'd like to provide a custom UI, you should add a `website` directory to the `llama-server-one` archive. `llama.cpp`'s chat UI is optimized for serving inside the project's source code. But we can copy the unoptimized source: -``` -mkdir website -cp -r /mnt/hyperv/web-apps/completion-tool/* website -rm website/*.txt -rm website/images/*.svg -rm website/images/*.psd -zip -0 -r $LLAMA_SERVER_ONE_ZIP website/* -printf "\n**********\n*\n* FINISHED: Create website Directory in Archive.\n*\n**********\n\n" -``` - -#### OPTONAL: Verify website Directory in Archive - -Verify that the archive has your website: -``` -unzip -l $LLAMA_SERVER_ONE_ZIP -printf "\n**********\n*\n* FINISHED: Verify website Directory in Archive.\n*\n**********\n\n" -``` ---- -### Create default-args File - -A `default-args` file in the archive can specify sane default parameters. The format of the file is parameter name on a line, parameter value on a line, rinse, repeat. End the file with a `...` line to include user specified parameters. - -We don't yet support including the model inside the zip archive (yet). That has a 4GB size limitation on Windows anyway, as `.exe` files cannot exceed 4GB. So let's use an adjacent file called `model.gguf`. - -We will serve on localhost, port 8080 by default for safety. The `--ctx-size` parameter is the size of the context window. This is kinda screwy to have as a set size rather than a maximum because the `.gguf` files now have the training context size in metadata. We set it to 8192 to be sensible. 
-``` -cat << EOF > $DEFAULT_ARGS --m -model.gguf ---host -127.0.0.1 ---port -8080 ---ctx-size -8192 -... -EOF -printf "\n**********\n*\n* FINISHED: Create Default args File.\n*\n**********\n\n" -``` - -#### OPTIONAL: Create default-args File with Website - -If you added a website to the archive, use this instead: -``` -cat << EOF > $DEFAULT_ARGS --m -model.gguf ---host -127.0.0.1 ---port -8080 ---ctx-size -8192 ---path -/zip/website -... -EOF -printf "\n**********\n*\n* FINISHED: Create Default args File with Website.\n*\n**********\n\n" -``` - ---- -### Add default-args File to Archive - -Add the `default-args` file to the archive: -``` -zip -0 -r $LLAMA_SERVER_ONE_ZIP $DEFAULT_ARGS -printf "\n**********\n*\n* FINISHED: Add default-args File to Archive.\n*\n**********\n\n" -``` - ---- -### Verify default-args File in Archive - -Verify that the archive contains the `default-args` file: -``` -unzip -l $LLAMA_SERVER_ONE_ZIP -printf "\n**********\n*\n* FINISHED: Verify default-args File in Archive.\n*\n**********\n\n" -``` - ---- -### Remove .zip Extension - -Remove the `.zip` from our working file: -``` -mv $LLAMA_SERVER_ONE_ZIP $LLAMA_SERVER_ONE -printf "\n**********\n*\n* FINISHED: Remove .zip Extension.\n*\n**********\n\n" -``` - ---- -### Download Model - -Let's download a small model. We'll use Google Gemma 1B Instruct v3, a surprisingly capable tiny model. -``` -MODEL_FILE="Google-Gemma-1B-Instruct-v3-q8_0.gguf" -cp /mnt/hyperv/$MODEL_FILE model.gguf -printf "\n**********\n*\n* FINISHED: Download Model.\n*\n**********\n\n" -``` - ---- -### Test Run - -Now we can test run `llama-server-one`, listening on localhost:8080. -``` -./$LLAMA_SERVER_ONE -``` - -After starting up and loading the model, it should display: - -**main: server is listening on http://127.0.0.1:8080 - starting the main loop**
-**srv update_slots: all slots are idle** - -Hit `ctrl-C` on your keyboard to stop it. - ---- -### Test Run on Public Interfaces - -If you'd like it to listen on all available interfaces, so you can connect from a browser on another computer: -``` -./$LLAMA_SERVER_ONE --host 0.0.0.0 -``` - -After starting up and loading the model, it should display: - -**main: server is listening on http://0.0.0.0:8080 - starting the main loop**
-**srv update_slots: all slots are idle** - -Hit `ctrl-C` on your keyboard to stop it. - ---- -### Copy llama-server-one for Deployment -Congratulations! You are ready to copy `llams-server-one` executable to the share for deployment. - -``` -sudo cp llama-server-one /mnt/hyperv/Mmojo-Raspberry-Pi/Mmojo-LLMs -printf "\n**********\n*\n* FINISHED: Copy llama-server-one for Deployment.\n*\n**********\n\n" -``` From 6b2862b6cd73bdc3384cc790c1d78baeef711f87 Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Mon, 31 Mar 2025 13:23:32 -0700 Subject: [PATCH 09/73] Update server.cpp Removed unused parameter name. Signed-off-by: Brad Hutchings --- examples/server/server.cpp | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/server/server.cpp b/examples/server/server.cpp index 108723d70e470..b96c850d45bb1 100644 --- a/examples/server/server.cpp +++ b/examples/server/server.cpp @@ -4515,7 +4515,7 @@ int main(int argc, char ** argv) { return false; }); - svr->Get("/chat/", [](const httplib::Request & req, httplib::Response & res) { + svr->Get("/chat/", [](const httplib::Request &, httplib::Response & res) { res.set_redirect("/chat"); return false; }); From 7939706d48a5114eb586acc5144b041806a868f1 Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Mon, 31 Mar 2025 14:16:51 -0700 Subject: [PATCH 10/73] Create Configuring-ls1-Brads-Env.md Signed-off-by: Brad Hutchings --- docs/Configuring-ls1-Brads-Env.md | 204 ++++++++++++++++++++++++++++++ 1 file changed, 204 insertions(+) create mode 100644 docs/Configuring-ls1-Brads-Env.md diff --git a/docs/Configuring-ls1-Brads-Env.md b/docs/Configuring-ls1-Brads-Env.md new file mode 100644 index 0000000000000..0399e0fef1884 --- /dev/null +++ b/docs/Configuring-ls1-Brads-Env.md @@ -0,0 +1,204 @@ +## Configuring llama-server-one in my Environment + +Brad Hutchings
+brad@bradhutchings.com + +This file contains instructions for configuring the `llama-server-one` executable to make it ready to package for multiple platforms. +Instructioons have been customized for my environment. You should use these [Configuring Instructions](Configuring-ls1.md). + +--- +### Environment Variables + +Let's define some environment variables: +``` +BUILDING_DIR="1-BUILDING-llama.cpp" +CONFIGURING_DIR="2-CONFIGURING-llama-server-one" + +LLAMA_SERVER="llama-server" +LLAMA_SERVER_ONE="llama-server-one" +LLAMA_SERVER_ONE_ZIP="llama-server-one.zip" +DEFAULT_ARGS="default-args" +printf "\n**********\n*\n* FINISHED: Environment Variables.\n*\n**********\n\n" +``` + +--- +### Create Configuration Directory + +Next, let's create a directory where we'll configure `llama-server-one`: +``` +cd ~ +rm -r -f ~/$CONFIGURING_DIR +mkdir -p $CONFIGURING_DIR +cp ~/$BUILDING_DIR/$LLAMA_SERVER \ + ~/$CONFIGURING_DIR/$LLAMA_SERVER_ONE_ZIP + +cd ~/$CONFIGURING_DIR +printf "\n**********\n*\n* FINISHED: Create Configuration Directory.\n*\n**********\n\n" +``` + +--- +### Examine Contents of Zip Archive + +Look at the contents of the `llama-server-one` zip archive: +``` +unzip -l $LLAMA_SERVER_ONE_ZIP +printf "\n**********\n*\n* FINISHED: Examine Contents of Zip Archive.\n*\n**********\n\n" +``` + +--- +### Delete Extraneous Timezone Files + +You should notice a bunch of extraneous timezone related files in `/usr/*`. Let's get rid of those: +``` +zip -d $LLAMA_SERVER_ONE_ZIP "/usr/*" +printf "\n**********\n*\n* FINISHED: Delete Extraneous Timezone Files.\n*\n**********\n\n" +``` + +--- +### Verify Contents of Zip Archive + +Verify that these files are no longer in the archive: +``` +unzip -l $LLAMA_SERVER_ONE_ZIP +printf "\n**********\n*\n* FINISHED: Verify Contents of Zip Archive.\n*\n**********\n\n" +``` + +--- +### OPTIONAL: Create website Directory in Archive + +`llama.cpp` has a built in chat UI. If you'd like to provide a custom UI, you should add a `website` directory to the `llama-server-one` archive. `llama.cpp`'s chat UI is optimized for serving inside the project's source code. But we can copy the unoptimized source: +``` +mkdir website +cp -r /mnt/hyperv/web-apps/completion-tool/* website +rm website/*.txt +rm website/images/*.svg +rm website/images/*.psd +zip -0 -r $LLAMA_SERVER_ONE_ZIP website/* +printf "\n**********\n*\n* FINISHED: Create website Directory in Archive.\n*\n**********\n\n" +``` + +#### OPTONAL: Verify website Directory in Archive + +Verify that the archive has your website: +``` +unzip -l $LLAMA_SERVER_ONE_ZIP +printf "\n**********\n*\n* FINISHED: Verify website Directory in Archive.\n*\n**********\n\n" +``` +--- +### Create default-args File + +A `default-args` file in the archive can specify sane default parameters. The format of the file is parameter name on a line, parameter value on a line, rinse, repeat. End the file with a `...` line to include user specified parameters. + +We don't yet support including the model inside the zip archive (yet). That has a 4GB size limitation on Windows anyway, as `.exe` files cannot exceed 4GB. So let's use an adjacent file called `model.gguf`. + +We will serve on localhost, port 8080 by default for safety. The `--ctx-size` parameter is the size of the context window. This is kinda screwy to have as a set size rather than a maximum because the `.gguf` files now have the training context size in metadata. We set it to 8192 to be sensible. 
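
_Aside: the trailing `...` line marks where arguments typed on the command line are spliced in, so everything packaged here acts as a default rather than a hard setting. The "Test Run on Public Interfaces" step later in this guide relies on exactly that, for example:_
```
# Example from later in this guide: the packaged --host 127.0.0.1 is only a
# default; a --host flag given on the command line is spliced in at the `...`
# position and takes effect instead.
./llama-server-one --host 0.0.0.0
```
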
+``` +cat << EOF > $DEFAULT_ARGS +-m +model.gguf +--host +127.0.0.1 +--port +8080 +--ctx-size +8192 +... +EOF +printf "\n**********\n*\n* FINISHED: Create Default args File.\n*\n**********\n\n" +``` + +#### OPTIONAL: Create default-args File with Website + +If you added a website to the archive, use this instead: +``` +cat << EOF > $DEFAULT_ARGS +-m +model.gguf +--host +127.0.0.1 +--port +8080 +--ctx-size +8192 +--path +/zip/website +... +EOF +printf "\n**********\n*\n* FINISHED: Create Default args File with Website.\n*\n**********\n\n" +``` + +--- +### Add default-args File to Archive + +Add the `default-args` file to the archive: +``` +zip -0 -r $LLAMA_SERVER_ONE_ZIP $DEFAULT_ARGS +printf "\n**********\n*\n* FINISHED: Add default-args File to Archive.\n*\n**********\n\n" +``` + +--- +### Verify default-args File in Archive + +Verify that the archive contains the `default-args` file: +``` +unzip -l $LLAMA_SERVER_ONE_ZIP +printf "\n**********\n*\n* FINISHED: Verify default-args File in Archive.\n*\n**********\n\n" +``` + +--- +### Remove .zip Extension + +Remove the `.zip` from our working file: +``` +mv $LLAMA_SERVER_ONE_ZIP $LLAMA_SERVER_ONE +printf "\n**********\n*\n* FINISHED: Remove .zip Extension.\n*\n**********\n\n" +``` + +--- +### Download Model + +Let's download a small model. We'll use Google Gemma 1B Instruct v3, a surprisingly capable tiny model. +``` +MODEL_FILE="Google-Gemma-1B-Instruct-v3-q8_0.gguf" +cp /mnt/hyperv/$MODEL_FILE model.gguf +printf "\n**********\n*\n* FINISHED: Download Model.\n*\n**********\n\n" +``` + +--- +### Test Run + +Now we can test run `llama-server-one`, listening on localhost:8080. +``` +./$LLAMA_SERVER_ONE +``` + +After starting up and loading the model, it should display: + +**main: server is listening on http://127.0.0.1:8080 - starting the main loop**
+**srv update_slots: all slots are idle** + +Hit `ctrl-C` on your keyboard to stop it. + +--- +### Test Run on Public Interfaces + +If you'd like it to listen on all available interfaces, so you can connect from a browser on another computer: +``` +./$LLAMA_SERVER_ONE --host 0.0.0.0 +``` + +After starting up and loading the model, it should display: + +**main: server is listening on http://0.0.0.0:8080 - starting the main loop**
+**srv update_slots: all slots are idle** + +Hit `ctrl-C` on your keyboard to stop it. + +--- +### Copy llama-server-one for Deployment +Congratulations! You are ready to copy `llams-server-one` executable to the share for deployment. + +``` +sudo cp llama-server-one /mnt/hyperv/Mmojo-Raspberry-Pi/Mmojo-LLMs +printf "\n**********\n*\n* FINISHED: Copy llama-server-one for Deployment.\n*\n**********\n\n" +``` From 04b231064629e1135f2c76fb765daadb2ed74181 Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Mon, 31 Mar 2025 14:38:01 -0700 Subject: [PATCH 11/73] Update Configuring-ls1-Brads-Env.md Signed-off-by: Brad Hutchings --- docs/Configuring-ls1-Brads-Env.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Configuring-ls1-Brads-Env.md b/docs/Configuring-ls1-Brads-Env.md index 0399e0fef1884..a8ef579475cd8 100644 --- a/docs/Configuring-ls1-Brads-Env.md +++ b/docs/Configuring-ls1-Brads-Env.md @@ -160,7 +160,7 @@ printf "\n**********\n*\n* FINISHED: Remove .zip Extension.\n*\n**********\n\n" Let's download a small model. We'll use Google Gemma 1B Instruct v3, a surprisingly capable tiny model. ``` MODEL_FILE="Google-Gemma-1B-Instruct-v3-q8_0.gguf" -cp /mnt/hyperv/$MODEL_FILE model.gguf +cp /mnt/hyperv/models/$MODEL_FILE model.gguf printf "\n**********\n*\n* FINISHED: Download Model.\n*\n**********\n\n" ``` From 33953bb1e6d28585ffdc84a827658cde3c49be40 Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Mon, 31 Mar 2025 15:03:04 -0700 Subject: [PATCH 12/73] Update server.cpp model_meta addition. Signed-off-by: Brad Hutchings --- examples/server/server.cpp | 49 +++++++++++++++++++++++++++++++++----- 1 file changed, 43 insertions(+), 6 deletions(-) diff --git a/examples/server/server.cpp b/examples/server/server.cpp index b96c850d45bb1..bdbd6c453ec0e 100644 --- a/examples/server/server.cpp +++ b/examples/server/server.cpp @@ -3404,16 +3404,53 @@ struct server_context { SRV_DBG("%s", "run slots completed\n"); } + // llama-server-one START json model_meta() const { + char general_architecture[64]; + char general_type[64]; + char general_name[64]; + char general_version[64]; + char general_finetune[64]; + char general_basename[64]; + char general_size_label[64]; + char general_license[64]; + + general_architecture[0] = 0; + general_type[0] = 0; + general_name[0] = 0; + general_version[0] = 0; + general_finetune[0] = 0; + general_basename[0] = 0; + general_size_label[0] = 0; + general_license[0] = 0; + + llama_model_meta_val_str(model, "general.architecture", general_architecture, 64); + llama_model_meta_val_str(model, "general.type", general_type, 64); + llama_model_meta_val_str(model, "general.name", general_name, 64); + llama_model_meta_val_str(model, "general.version", general_version, 64); + llama_model_meta_val_str(model, "general.finetune", general_finetune, 64); + llama_model_meta_val_str(model, "general.basename", general_basename, 64); + llama_model_meta_val_str(model, "general.size_label", general_size_label, 64); + llama_model_meta_val_str(model, "general.license", general_license, 64); + return json { - {"vocab_type", llama_vocab_type (vocab)}, - {"n_vocab", llama_vocab_n_tokens (vocab)}, - {"n_ctx_train", llama_model_n_ctx_train(model)}, - {"n_embd", llama_model_n_embd (model)}, - {"n_params", llama_model_n_params (model)}, - {"size", llama_model_size (model)}, + {"vocab_type", llama_vocab_type (vocab)}, + {"n_vocab", llama_vocab_n_tokens (vocab)}, + {"n_ctx_train", llama_n_ctx_train (model)}, + {"n_embd", llama_n_embd (model)}, + {"n_params", llama_model_n_params 
(model)}, + {"size", llama_model_size (model)}, + {"general.architecture", general_architecture }, + {"general.type", general_type }, + {"general.name", general_name }, + {"general.version", general_version }, + {"general.finetune", general_finetune }, + {"general.basename", general_basename }, + {"general.size_label", general_size_label }, + {"general.license", general_license }, }; } + // llama-server-one END }; static void log_server_request(const httplib::Request & req, const httplib::Response & res) { From 2cab5711ab7e08628e41d747edc3c983c98ae582 Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Mon, 31 Mar 2025 15:41:25 -0700 Subject: [PATCH 13/73] Update Configuring-ls1-Brads-Env.md Signed-off-by: Brad Hutchings --- docs/Configuring-ls1-Brads-Env.md | 24 +++--------------------- 1 file changed, 3 insertions(+), 21 deletions(-) diff --git a/docs/Configuring-ls1-Brads-Env.md b/docs/Configuring-ls1-Brads-Env.md index a8ef579475cd8..24920918d16ae 100644 --- a/docs/Configuring-ls1-Brads-Env.md +++ b/docs/Configuring-ls1-Brads-Env.md @@ -64,7 +64,7 @@ printf "\n**********\n*\n* FINISHED: Verify Contents of Zip Archive.\n*\n******* ``` --- -### OPTIONAL: Create website Directory in Archive +### Create website Directory in Archive `llama.cpp` has a built in chat UI. If you'd like to provide a custom UI, you should add a `website` directory to the `llama-server-one` archive. `llama.cpp`'s chat UI is optimized for serving inside the project's source code. But we can copy the unoptimized source: ``` @@ -77,7 +77,7 @@ zip -0 -r $LLAMA_SERVER_ONE_ZIP website/* printf "\n**********\n*\n* FINISHED: Create website Directory in Archive.\n*\n**********\n\n" ``` -#### OPTONAL: Verify website Directory in Archive +#### Verify website Directory in Archive Verify that the archive has your website: ``` @@ -102,29 +102,11 @@ model.gguf 8080 --ctx-size 8192 -... -EOF -printf "\n**********\n*\n* FINISHED: Create Default args File.\n*\n**********\n\n" -``` - -#### OPTIONAL: Create default-args File with Website - -If you added a website to the archive, use this instead: -``` -cat << EOF > $DEFAULT_ARGS --m -model.gguf ---host -127.0.0.1 ---port -8080 ---ctx-size -8192 --path /zip/website ... EOF -printf "\n**********\n*\n* FINISHED: Create Default args File with Website.\n*\n**********\n\n" +printf "\n**********\n*\n* FINISHED: Create Default args File.\n*\n**********\n\n" ``` --- From 83c8bdd9c070109bafa9b05a96f9880a617589b9 Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Mon, 31 Mar 2025 16:45:44 -0700 Subject: [PATCH 14/73] Update server.cpp Change process name for Cosmo. Signed-off-by: Brad Hutchings --- examples/server/server.cpp | 3 +++ 1 file changed, 3 insertions(+) diff --git a/examples/server/server.cpp b/examples/server/server.cpp index bdbd6c453ec0e..a640727148aa6 100644 --- a/examples/server/server.cpp +++ b/examples/server/server.cpp @@ -3485,6 +3485,9 @@ int main(int argc, char ** argv) { // llama-server-one START // This implements an args file feature inspired by llamafile's. #ifdef COSMOCC + // Keep the build from showing up as ape in the process list. + pthread_setname_np(pthread_self(), "llama-server-one"); + // Args files if present. The names are different to remove confusion during packaging. 
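    // Per the packaging docs in this repo: each args file holds one argument
    // name or value per line, and a trailing "..." line marks where arguments
    // from the actual command line are spliced in.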
const std::string& argsFilename = "llama-server-one-args"; const std::string& zipArgsFilename = "/zip/default-args"; From 35f9c7df8618151897d13534f363eb6f607582cd Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Fri, 4 Apr 2025 10:24:38 -0700 Subject: [PATCH 15/73] Update and rename README.md to README-LS1.md Signed-off-by: Brad Hutchings --- README.md => README-LS1.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename README.md => README-LS1.md (100%) diff --git a/README.md b/README-LS1.md similarity index 100% rename from README.md rename to README-LS1.md From 5f96ceca69e6cd816bdd20fc41ceee3df6f66160 Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Fri, 4 Apr 2025 10:24:52 -0700 Subject: [PATCH 16/73] Update and rename README-LS1.md to README.md Signed-off-by: Brad Hutchings --- README-LS1.md => README.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename README-LS1.md => README.md (100%) diff --git a/README-LS1.md b/README.md similarity index 100% rename from README-LS1.md rename to README.md From aa907edf38bf180f3cb5987bfb874fb8807d614b Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Fri, 4 Apr 2025 10:25:08 -0700 Subject: [PATCH 17/73] Update and rename README.md to README-LS1.md Signed-off-by: Brad Hutchings --- README.md => README-LS1.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename README.md => README-LS1.md (100%) diff --git a/README.md b/README-LS1.md similarity index 100% rename from README.md rename to README-LS1.md From ab621d2023de1e10ffe7bc1a0735562c8842a34d Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Fri, 4 Apr 2025 10:25:25 -0700 Subject: [PATCH 18/73] Rename README-llama.cpp.md to README.md Signed-off-by: Brad Hutchings --- README-llama.cpp.md => README.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename README-llama.cpp.md => README.md (100%) diff --git a/README-llama.cpp.md b/README.md similarity index 100% rename from README-llama.cpp.md rename to README.md From db1564e8a289c56a922d690a247ac9276298774c Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Fri, 4 Apr 2025 10:27:36 -0700 Subject: [PATCH 19/73] Rename README.md to README-llama.cpp.md Signed-off-by: Brad Hutchings --- README.md => README-llama.cpp.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename README.md => README-llama.cpp.md (100%) diff --git a/README.md b/README-llama.cpp.md similarity index 100% rename from README.md rename to README-llama.cpp.md From f2a4d28eb232b90e26afb357afefe510aca57e8a Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Fri, 4 Apr 2025 10:27:51 -0700 Subject: [PATCH 20/73] Update and rename README-LS1.md to README.md Signed-off-by: Brad Hutchings --- README-LS1.md => README.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename README-LS1.md => README.md (100%) diff --git a/README-LS1.md b/README.md similarity index 100% rename from README-LS1.md rename to README.md From 582dcfc4929f9d4b4fe78bca77477b6bbf810d57 Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Fri, 4 Apr 2025 10:54:52 -0700 Subject: [PATCH 21/73] Update README.md Updated to-do list. Signed-off-by: Brad Hutchings --- README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 8217d819a7848..42f8d0812dd8e 100644 --- a/README.md +++ b/README.md @@ -70,6 +70,8 @@ In no particular order of importance, these are the things that bother me: - GPU support without a complicated kludge, and that can support all supported platform / CPU / GPU triads. 
Perhaps a plugin system with shared library dispatch? Invoking dev tools on Apple Metal like llamafile does is "complicated". - Code signing instructions. Might have to sign executables within the zip package, plus the package itself. - Clean up remaining build warnings, either by fixing source (i.e. Cosmo) or finding the magical compiler flags. -- Copy the `cosmo_args` function into `server.cpp` so it could potentially be incorporated upstream in non-Cosmo builds. `common/arg2.cpp` might be a good landing spot. License in [Cosmo source code](https://github.com/jart/cosmopolitan/blob/master/tool/args/args2.c) appears to be MIT compatible with attribution. +- Copy the `cosmo_args` function into `server.cpp` so it could potentially be incorporated upstream in non-Cosmo builds. `common/arg2.cpp` might be a good landing spot. License in [Cosmo source code](https://github.com/jart/cosmopolitan/blob/master/tool/args/args2.c) appears to be MIT compatible with attribution. + - The args thing is cute, but it might be easier as a yaml file. Key value pairs. Flags can be keys with null values. - The `--ctx-size` parameter doesn't seem quite right given that new models have the training (or max) context size in their metadata. That size should be used subject to a maximum in a passed parameter. E.g. So a 128K model can run comfortably on a smaller device. - Write docs for a Deploying step. It should address the args file, removing the extra executable depending on platform, models, host, port. context size. +- Make a '.gitattributes' file so we can set the default file to be displayed and keep the README.md from llama.cpp. This will help in syncing changes continually from upstream. Reference: https://git-scm.com/docs/gitattributes From 2cb14dfea7684115ef18a63ce6dc762e8d411d5e Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Fri, 4 Apr 2025 11:06:08 -0700 Subject: [PATCH 22/73] Update Configuring-ls1-Brads-Env.md --threads-http Signed-off-by: Brad Hutchings --- docs/Configuring-ls1-Brads-Env.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/Configuring-ls1-Brads-Env.md b/docs/Configuring-ls1-Brads-Env.md index 24920918d16ae..d7882023198dc 100644 --- a/docs/Configuring-ls1-Brads-Env.md +++ b/docs/Configuring-ls1-Brads-Env.md @@ -91,7 +91,7 @@ A `default-args` file in the archive can specify sane default parameters. The fo We don't yet support including the model inside the zip archive (yet). That has a 4GB size limitation on Windows anyway, as `.exe` files cannot exceed 4GB. So let's use an adjacent file called `model.gguf`. -We will serve on localhost, port 8080 by default for safety. The `--ctx-size` parameter is the size of the context window. This is kinda screwy to have as a set size rather than a maximum because the `.gguf` files now have the training context size in metadata. We set it to 8192 to be sensible. +We will serve on localhost, port 8080 by default for safety. The `--ctx-size` parameter is the size of the context window. This is kinda screwy to have as a set size rather than a maximum because the `.gguf` files now have the training context size in metadata. We set it to 8192 to be sensible. The `--threads-http` parameter ensures that the browser can ask for all the image files in our default UI at once. ``` cat << EOF > $DEFAULT_ARGS -m @@ -102,6 +102,8 @@ model.gguf 8080 --ctx-size 8192 +--threads-http +8 --path /zip/website ... 
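A quick way to confirm that these packaged defaults are actually being picked up is to launch the binary with no command-line arguments and query the built-in HTTP endpoints. This is only a sketch, assuming the archive has been assembled as `llama-server-one` with an adjacent `model.gguf` as described above, so the defaults (127.0.0.1, port 8080, ctx-size 8192) should be in effect:
```
# Start with no arguments so only the packaged default-args apply.
./llama-server-one &

# Give the model a few seconds to load (longer for a large model),
# then /health should report ok.
sleep 5
curl http://127.0.0.1:8080/health

# /props echoes the effective server settings, so the values taken from
# the args file (context size, etc.) should be visible in its output.
curl http://127.0.0.1:8080/props
```
If both requests fail, the usual suspect is the args file itself, e.g. a `default-args` entry that did not make it into the root of the zip archive, in which case the server falls back to plain command-line parsing.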
From b36532cd21ca4f7fcbb6dff7a252fb55edbcb038 Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Fri, 4 Apr 2025 11:06:36 -0700 Subject: [PATCH 23/73] Update README.md Misc. Signed-off-by: Brad Hutchings --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 42f8d0812dd8e..7e33723360242 100644 --- a/README.md +++ b/README.md @@ -74,4 +74,4 @@ In no particular order of importance, these are the things that bother me: - The args thing is cute, but it might be easier as a yaml file. Key value pairs. Flags can be keys with null values. - The `--ctx-size` parameter doesn't seem quite right given that new models have the training (or max) context size in their metadata. That size should be used subject to a maximum in a passed parameter. E.g. So a 128K model can run comfortably on a smaller device. - Write docs for a Deploying step. It should address the args file, removing the extra executable depending on platform, models, host, port. context size. -- Make a '.gitattributes' file so we can set the default file to be displayed and keep the README.md from llama.cpp. This will help in syncing changes continually from upstream. Reference: https://git-scm.com/docs/gitattributes +- Make a `.gitattributes` file so we can set the default file to be displayed and keep the README.md from llama.cpp. This will help in syncing changes continually from upstream. Reference: https://git-scm.com/docs/gitattributes From c0dfba3cdb39f9cd02f41558ad2083b8c798069e Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Wed, 16 Apr 2025 10:53:01 -0700 Subject: [PATCH 24/73] Update and rename Makefile to Makefile-LS1 Signed-off-by: Brad Hutchings --- Makefile => Makefile-LS1 | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename Makefile => Makefile-LS1 (100%) diff --git a/Makefile b/Makefile-LS1 similarity index 100% rename from Makefile rename to Makefile-LS1 From 871a988ad4c44a4f35ee43bf4fa6ac02a0b6f7b7 Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Wed, 16 Apr 2025 10:53:14 -0700 Subject: [PATCH 25/73] Rename Makefile-llama-cpp-original to Makefile Signed-off-by: Brad Hutchings --- Makefile-llama-cpp-original => Makefile | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename Makefile-llama-cpp-original => Makefile (100%) diff --git a/Makefile-llama-cpp-original b/Makefile similarity index 100% rename from Makefile-llama-cpp-original rename to Makefile From 0899a81259f62d9580547fa21aaa8cfd81d7db84 Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Wed, 16 Apr 2025 10:53:27 -0700 Subject: [PATCH 26/73] Update and rename README.md to README-LS1.md Signed-off-by: Brad Hutchings --- README.md => README-LS1.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename README.md => README-LS1.md (100%) diff --git a/README.md b/README-LS1.md similarity index 100% rename from README.md rename to README-LS1.md From 7e2be1be573fdf8bae03f4cc9636f39140fd0252 Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Wed, 16 Apr 2025 10:53:38 -0700 Subject: [PATCH 27/73] Rename README-llama.cpp.md to README.md Signed-off-by: Brad Hutchings --- README-llama.cpp.md => README.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename README-llama.cpp.md => README.md (100%) diff --git a/README-llama.cpp.md b/README.md similarity index 100% rename from README-llama.cpp.md rename to README.md From a217dab44a73775ea1b7beef07b8a1dccff60ba8 Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Wed, 16 Apr 2025 10:55:22 -0700 Subject: [PATCH 28/73] Rename README.md to 
README-llama-cpp.md Signed-off-by: Brad Hutchings --- README.md => README-llama-cpp.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename README.md => README-llama-cpp.md (100%) diff --git a/README.md b/README-llama-cpp.md similarity index 100% rename from README.md rename to README-llama-cpp.md From 94c5915fd55084697e6e5f57ec679993fec76e2c Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Wed, 16 Apr 2025 10:55:30 -0700 Subject: [PATCH 29/73] Update and rename README-LS1.md to README.md Signed-off-by: Brad Hutchings --- README-LS1.md => README.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename README-LS1.md => README.md (100%) diff --git a/README-LS1.md b/README.md similarity index 100% rename from README-LS1.md rename to README.md From e9d662a346019c5bbc938be1560db35d93ecdf86 Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Wed, 16 Apr 2025 10:55:57 -0700 Subject: [PATCH 30/73] Rename Makefile to Makefile-llama-cpp Signed-off-by: Brad Hutchings --- Makefile => Makefile-llama-cpp | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename Makefile => Makefile-llama-cpp (100%) diff --git a/Makefile b/Makefile-llama-cpp similarity index 100% rename from Makefile rename to Makefile-llama-cpp From 6a44b2a6f65d77f810e3dc671cbe917d7254ff55 Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Wed, 16 Apr 2025 10:56:09 -0700 Subject: [PATCH 31/73] Update and rename Makefile-LS1 to Makefile Signed-off-by: Brad Hutchings --- Makefile-LS1 => Makefile | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename Makefile-LS1 => Makefile (100%) diff --git a/Makefile-LS1 b/Makefile similarity index 100% rename from Makefile-LS1 rename to Makefile From 59707a8ffaf943127d4c38be860a1780f57f48f5 Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Wed, 16 Apr 2025 11:56:48 -0700 Subject: [PATCH 32/73] Update Makefile Signed-off-by: Brad Hutchings --- Makefile | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/Makefile b/Makefile index 35e3d21234a48..e3ad8d0f98482 100644 --- a/Makefile +++ b/Makefile @@ -1021,10 +1021,14 @@ OBJ_GGML = \ $(DIR_GGML)/src/ggml-cpu/ggml-cpu_cpp.o \ $(DIR_GGML)/src/ggml-cpu/ggml-cpu-aarch64.o \ $(DIR_GGML)/src/ggml-cpu/ggml-cpu-hbm.o \ + $(DIR_GGML)/src/ggml-cpu/ggml-cpu-impl.o \ $(DIR_GGML)/src/ggml-cpu/ggml-cpu-quants.o \ $(DIR_GGML)/src/ggml-cpu/ggml-cpu-traits.o \ $(DIR_GGML)/src/ggml-cpu/binary-ops.o \ + $(DIR_GGML)/src/ggml-cpu/cpu-feats-x86.o \ + $(DIR_GGML)/src/ggml-cpu/ops.o \ $(DIR_GGML)/src/ggml-cpu/unary-ops.o \ + $(DIR_GGML)/src/ggml-cpu/vec.o \ $(DIR_GGML)/src/gguf.o \ $(OBJ_GGML_EXT) From 9b230e3aaa0848c3c9423f7d3966b4eb04f086e7 Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Wed, 16 Apr 2025 12:00:54 -0700 Subject: [PATCH 33/73] Update Makefile Signed-off-by: Brad Hutchings --- Makefile | 1 - 1 file changed, 1 deletion(-) diff --git a/Makefile b/Makefile index e3ad8d0f98482..cfe3f24dce1ab 100644 --- a/Makefile +++ b/Makefile @@ -1021,7 +1021,6 @@ OBJ_GGML = \ $(DIR_GGML)/src/ggml-cpu/ggml-cpu_cpp.o \ $(DIR_GGML)/src/ggml-cpu/ggml-cpu-aarch64.o \ $(DIR_GGML)/src/ggml-cpu/ggml-cpu-hbm.o \ - $(DIR_GGML)/src/ggml-cpu/ggml-cpu-impl.o \ $(DIR_GGML)/src/ggml-cpu/ggml-cpu-quants.o \ $(DIR_GGML)/src/ggml-cpu/ggml-cpu-traits.o \ $(DIR_GGML)/src/ggml-cpu/binary-ops.o \ From c543fd0949e9f951fea5c27e699922ba0f059add Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Wed, 16 Apr 2025 13:07:38 -0700 Subject: [PATCH 34/73] Update common.cpp Cosmo - Find the user's cache directory when we don't know what OS they are on at compile time. 
Signed-off-by: Brad Hutchings --- common/common.cpp | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/common/common.cpp b/common/common.cpp index 94f545f815c27..52c6b0bbdbea4 100644 --- a/common/common.cpp +++ b/common/common.cpp @@ -840,6 +840,21 @@ std::string fs_get_cache_directory() { cache_directory = std::getenv("HOME") + std::string("/Library/Caches/"); #elif defined(_WIN32) cache_directory = std::getenv("LOCALAPPDATA"); + +// llama-server-one START +#elif defined(COSMOCC) + // We don't know what OS we are running on at compile time, just CPU architecture. + // try various environment variables, fall back to ~/.cache. + // FUTURE: Checkj if the directories actually exist. + cache_directory = std::getenv("LOCALAPPDATA"); + if (cache_directory == "") { + cache_directory = std::getenv("XDG_CACHE_HOME"); + } + if (cache_directory == "") { + cache_directory = std::getenv("HOME") + std::string("/.cache/"); + } + +// llama-server-one END #else # error Unknown architecture #endif From 68355a50fe3678017e9db2ec8241270cfd16de15 Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Wed, 16 Apr 2025 16:23:53 -0700 Subject: [PATCH 35/73] Create index.md Signed-off-by: Brad Hutchings --- index.md | 1 + 1 file changed, 1 insertion(+) create mode 100644 index.md diff --git a/index.md b/index.md new file mode 100644 index 0000000000000..1b7c81cbb68b0 --- /dev/null +++ b/index.md @@ -0,0 +1 @@ +Seeing if the index.md file overrides??? From 5a54935bc3249e32c85410112cccda8d29b77df8 Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Wed, 16 Apr 2025 16:25:51 -0700 Subject: [PATCH 36/73] Update and rename index.md to .gitattributes Signed-off-by: Brad Hutchings --- .gitattributes | 1 + index.md | 1 - 2 files changed, 1 insertion(+), 1 deletion(-) create mode 100644 .gitattributes delete mode 100644 index.md diff --git a/.gitattributes b/.gitattributes new file mode 100644 index 0000000000000..79e9525952e75 --- /dev/null +++ b/.gitattributes @@ -0,0 +1 @@ +README.md README-llama-server-one.md diff --git a/index.md b/index.md deleted file mode 100644 index 1b7c81cbb68b0..0000000000000 --- a/index.md +++ /dev/null @@ -1 +0,0 @@ -Seeing if the index.md file overrides??? From 92aa15e6cf1cf5479f8646a279cc16354427ca75 Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Wed, 16 Apr 2025 16:27:23 -0700 Subject: [PATCH 37/73] Delete .gitattributes Signed-off-by: Brad Hutchings --- .gitattributes | 1 - 1 file changed, 1 deletion(-) delete mode 100644 .gitattributes diff --git a/.gitattributes b/.gitattributes deleted file mode 100644 index 79e9525952e75..0000000000000 --- a/.gitattributes +++ /dev/null @@ -1 +0,0 @@ -README.md README-llama-server-one.md From 6afeac8811940283aff43dc75120d9f71891bde7 Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Fri, 18 Apr 2025 15:00:32 -0700 Subject: [PATCH 38/73] Update Building-ls1.md Signed-off-by: Brad Hutchings --- docs/Building-ls1.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Building-ls1.md b/docs/Building-ls1.md index 3585f42bdc60f..09b07cb45e400 100644 --- a/docs/Building-ls1.md +++ b/docs/Building-ls1.md @@ -21,7 +21,7 @@ I build with a freshly installed Ubuntu 24.04 VM. 
Here are some packages that ar ``` sudo apt install -y git python3-pip build-essential zlib1g-dev \ libffi-dev libssl-dev libbz2-dev libreadline-dev libsqlite3-dev \ - liblzma-dev tk-dev python3-tk cmake zip + liblzma-dev tk-dev python3-tk cmake zip npm printf "\n**********\n*\n* FINISHED: Build Dependencies.\n*\n**********\n\n" ``` From 0442ee401474c0bd46993a86383edde33e2c66cc Mon Sep 17 00:00:00 2001 From: Brad Hutchings Date: Fri, 18 Apr 2025 15:04:52 -0700 Subject: [PATCH 39/73] Create Buidling-ls1-Brads-Env.md Signed-off-by: Brad Hutchings --- docs/Buidling-ls1-Brads-Env.md | 145 +++++++++++++++++++++++++++++++++ 1 file changed, 145 insertions(+) create mode 100644 docs/Buidling-ls1-Brads-Env.md diff --git a/docs/Buidling-ls1-Brads-Env.md b/docs/Buidling-ls1-Brads-Env.md new file mode 100644 index 0000000000000..35bc46cb4d723 --- /dev/null +++ b/docs/Buidling-ls1-Brads-Env.md @@ -0,0 +1,145 @@ +## Building llama-server + +Brad Hutchings
+brad@bradhutchings.com + +This file contains instructions for building `llama.cpp` with `cosmocc` to yield a `llama-server` executable that will run on multiple platforms. + +### Environment Variables + +Let's define some environment variables: +``` +BUILDING_DIR="1-BUILDING-llama.cpp" +printf "\n**********\n*\n* FINISHED: Environment Variables.\n*\n**********\n\n" +``` + +_Note that if you copy each code block from the guide and paste it into your terminal, each block ends with a message so you won't lose your place in this guide._ + +--- +### Build Dependencies +I build with a freshly installed Ubuntu 24.04 VM. Here are some packages that are helpful in creating a working build system. You may need to install more. +``` +sudo apt install -y git python3-pip build-essential zlib1g-dev \ + libffi-dev libssl-dev libbz2-dev libreadline-dev libsqlite3-dev \ + liblzma-dev tk-dev python3-tk cmake zip npm +printf "\n**********\n*\n* FINISHED: Build Dependencies.\n*\n**********\n\n" +``` + +--- +### Clone this Repo Locally +Clone this repo into a `~\llama.cpp` directory. +``` +cd ~ +git clone https://github.com/BradHutchings/llama-server-one.git $BUILDING_DIR +printf "\n**********\n*\n* FINISHED: Clone this Repo Locally.\n*\n**********\n\n" +``` + +**Optional:** Use the `work-in-progress` branch where I implement and test my own changes and where I test upstream changes from `llama.cpp`. +``` +cd ~/$BUILDING_DIR +git checkout work-in-progress +printf "\n**********\n*\n* FINISHED: Checkout work-in-progress.\n*\n**********\n\n" +``` + +--- +### Customize WebUI +``` +APP_NAME='Mmojo Chat' +sed -i -e 's/.*<\/title>/<title>$APP_NAME<\/title>/g' examples/server/webui/index.html +sed -i -e 's/>llama.cpp<\/div>/>$APP_NAME<\/div>/g' examples/server/webui/src/components/Header.tsx + +cd examples/server/webui +npm i +npm run build +printf "\n**********\n*\n* FINISHED: Customize WebUI.\n*\n**********\n\n" +``` + +--- +### Make llama.cpp +We use the old `Makefile` rather than CMake. We've updated the `Makefile` in this repo to build llama.cpp correctly. +``` +cd ~/$BUILDING_DIR +export LLAMA_MAKEFILE=1 +make clean +make +printf "\n**********\n*\n* FINISHED: Make llama.cpp.\n*\n**********\n\n" +``` + +If the build is successful, it will end with this message: + +    **NOTICE: The 'server' binary is deprecated. Please use 'llama-server' instead.** + +If the build fails and you've checked out the `work-in-progress` branch, well, it's in progess, so switch back to the `master` branch and build that. + +If the build fails on the `master` branch, please post a note in the [Discussions](https://github.com/BradHutchings/llama-server-one/discussions) area. + +#### List Directory + +At this point, you should see `llama-server` and other built binaries in the directory listing. +``` +ls -al +printf "\n**********\n*\n* FINISHED: List Directory.\n*\n**********\n\n" +``` + +--- +### Install Cosmo +``` +mkdir -p cosmocc +cd cosmocc +wget https://cosmo.zip/pub/cosmocc/cosmocc.zip +unzip cosmocc.zip +rm cosmocc.zip +cd .. 
+printf "\n**********\n*\n* FINISHED: Install Cosmo.\n*\n**********\n\n" +``` + +--- +### Prepare to make llama.cpp with Cosmo +``` +export PATH="$(pwd)/cosmocc/bin:$PATH" +export CC="cosmocc -I$(pwd)/cosmocc/include -L$(pwd)/cosmocc/lib" +export CXX="cosmocc -I$(pwd)/cosmocc/include \ + -I$(pwd)/cosmocc/include/third_party/libcxx \ + -L$(pwd)/cosmocc/lib" +export UNAME_S="cosmocc" +export UNAME_P="cosmocc" +export UNAME_M="cosmocc" +printf "\n**********\n*\n* FINISHED: Prepare to make llama.cpp with Cosmo.\n*\n**********\n\n" +``` + +--- +### Make llama.cpp with Cosmo +``` +make clean +make +printf "\n**********\n*\n* FINISHED: Make llama.cpp with Cosmo\n*\n**********\n\n" +``` + +If the build is successful, it will end with this message: + +    **NOTICE: The 'server' binary is deprecated. Please use 'llama-server' instead.** + +If the build fails and you've checked out the `work-in-progress` branch, well, it's in progess, so switch back to the `master` branch and build that. + +If the build fails on the `master` branch, please post a note in the [Discussions](https://github.com/BradHutchings/llama-server-one/discussions) area. + +#### List Directory + +At this point, you should see `llama-server` and other built binaries in the directory listing. +``` +ls -al +printf "\n**********\n*\n* FINISHED: List Directory.\n*\n**********\n\n" +``` + +#### Verify Zip Archive + +`llama-server` is actually a zip acrhive with an "Actually Portable Executable" (APE) loader prefix. Let's verify the zip archive part: +``` +unzip -l llama-server +printf "\n**********\n*\n* FINISHED: Verify Zip Archive.\n*\n**********\n\n" +``` + +--- +### Configuring llama-server-one + +Now that you've built `llama-server`, you're ready to configure it as `llama-server-one`. Follow instructions in [Configuring-ls1-Brads-Env.md](Configuring-ls1-Brads-Env.md). 
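Before running `npm run build` in the Customize WebUI step above, it can be worth checking that the title substitution actually landed in the source files. A minimal sketch, using the same paths and example app name as the guide; if the literal string `$APP_NAME` shows up in the output, the sed quoting (single versus double quotes) is what needs fixing:
```
cd ~/$BUILDING_DIR

# Either the expanded app name or a leftover literal $APP_NAME will show here.
grep -nE "Mmojo Chat|APP_NAME" examples/server/webui/index.html \
    examples/server/webui/src/components/Header.tsx
printf "\n**********\n*\n* FINISHED: Check WebUI customization.\n*\n**********\n\n"
```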
From c13ceb33617263352192d01b0c862605efaf263b Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Fri, 18 Apr 2025 15:05:42 -0700 Subject: [PATCH 40/73] Update Buidling-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Buidling-ls1-Brads-Env.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Buidling-ls1-Brads-Env.md b/docs/Buidling-ls1-Brads-Env.md index 35bc46cb4d723..21ee19e26a7f2 100644 --- a/docs/Buidling-ls1-Brads-Env.md +++ b/docs/Buidling-ls1-Brads-Env.md @@ -47,10 +47,10 @@ printf "\n**********\n*\n* FINISHED: Checkout work-in-progress.\n*\n**********\n APP_NAME='Mmojo Chat' sed -i -e 's/<title>.*<\/title>/<title>$APP_NAME<\/title>/g' examples/server/webui/index.html sed -i -e 's/>llama.cpp<\/div>/>$APP_NAME<\/div>/g' examples/server/webui/src/components/Header.tsx - cd examples/server/webui npm i npm run build +cd ~/$BUILDING_DIR printf "\n**********\n*\n* FINISHED: Customize WebUI.\n*\n**********\n\n" ``` From 4f23a5646276197f1895795168a436543e67b703 Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Fri, 18 Apr 2025 15:27:27 -0700 Subject: [PATCH 41/73] Update Buidling-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Buidling-ls1-Brads-Env.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/Buidling-ls1-Brads-Env.md b/docs/Buidling-ls1-Brads-Env.md index 21ee19e26a7f2..aa0a8c59230d5 100644 --- a/docs/Buidling-ls1-Brads-Env.md +++ b/docs/Buidling-ls1-Brads-Env.md @@ -45,8 +45,8 @@ printf "\n**********\n*\n* FINISHED: Checkout work-in-progress.\n*\n**********\n ### Customize WebUI ``` APP_NAME='Mmojo Chat' -sed -i -e 's/<title>.*<\/title>/<title>$APP_NAME<\/title>/g' examples/server/webui/index.html -sed -i -e 's/>llama.cpp<\/div>/>$APP_NAME<\/div>/g' examples/server/webui/src/components/Header.tsx +sed -i -e "s/<title>.*<\/title>/<title>$APP_NAME<\/title>/g" examples/server/webui/index.html +sed -i -e "s/>llama.cpp<\/div>/>$APP_NAME<\/div>/g" examples/server/webui/src/components/Header.tsx cd examples/server/webui npm i npm run build From f557dc5f2844f370a2997b1c1291aa051900ffdb Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sat, 19 Apr 2025 11:56:58 -0700 Subject: [PATCH 42/73] Rename server.cpp to server-ls1.cpp Signed-off-by: Brad Hutchings <brad@componentx.com> --- examples/server/{server.cpp => server-ls1.cpp} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename examples/server/{server.cpp => server-ls1.cpp} (100%) diff --git a/examples/server/server.cpp b/examples/server/server-ls1.cpp similarity index 100% rename from examples/server/server.cpp rename to examples/server/server-ls1.cpp From a7cea41ddafb39d98c5fafe302159bda2610519d Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sat, 19 Apr 2025 12:01:17 -0700 Subject: [PATCH 43/73] Create server.cpp Signed-off-by: Brad Hutchings <brad@componentx.com> --- examples/server/server.cpp | 1 + 1 file changed, 1 insertion(+) create mode 100644 examples/server/server.cpp diff --git a/examples/server/server.cpp b/examples/server/server.cpp new file mode 100644 index 0000000000000..8b137891791fe --- /dev/null +++ b/examples/server/server.cpp @@ -0,0 +1 @@ + From 3b8f05aa8fc01202f95761233b70475cde91a33c Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sat, 19 Apr 2025 12:01:52 -0700 Subject: [PATCH 44/73] Update server.cpp Signed-off-by: Brad Hutchings <brad@componentx.com> --- examples/server/server.cpp 
| 4639 ++++++++++++++++++++++++++++++++++++ 1 file changed, 4639 insertions(+) diff --git a/examples/server/server.cpp b/examples/server/server.cpp index 8b137891791fe..c580ec123299c 100644 --- a/examples/server/server.cpp +++ b/examples/server/server.cpp @@ -1 +1,4640 @@ +#include "utils.hpp" +#include "arg.h" +#include "common.h" +#include "json-schema-to-grammar.h" +#include "llama.h" +#include "log.h" +#include "sampling.h" +#include "speculative.h" + +// Change JSON_ASSERT from assert() to GGML_ASSERT: +#define JSON_ASSERT GGML_ASSERT +#include "json.hpp" +// mime type for sending response +#define MIMETYPE_JSON "application/json; charset=utf-8" + +// auto generated files (see README.md for details) +#include "index.html.gz.hpp" +#include "loading.html.hpp" + +#include <atomic> +#include <chrono> +#include <condition_variable> +#include <cstddef> +#include <cinttypes> +#include <deque> +#include <memory> +#include <mutex> +#include <signal.h> +#include <thread> +#include <unordered_map> +#include <unordered_set> + +using json = nlohmann::ordered_json; + +constexpr int HTTP_POLLING_SECONDS = 1; + +enum stop_type { + STOP_TYPE_NONE, + STOP_TYPE_EOS, + STOP_TYPE_WORD, + STOP_TYPE_LIMIT, +}; + +// state diagram: https://github.com/ggml-org/llama.cpp/pull/9283 +enum slot_state { + SLOT_STATE_IDLE, + SLOT_STATE_STARTED, // TODO: this state is only used for setting up the initial prompt processing; maybe merge it with launch_slot_with_task in the future + SLOT_STATE_PROCESSING_PROMPT, + SLOT_STATE_DONE_PROMPT, + SLOT_STATE_GENERATING, +}; + +enum server_state { + SERVER_STATE_LOADING_MODEL, // Server is starting up, model not fully loaded yet + SERVER_STATE_READY, // Server is ready and model is loaded +}; + +enum server_task_type { + SERVER_TASK_TYPE_COMPLETION, + SERVER_TASK_TYPE_EMBEDDING, + SERVER_TASK_TYPE_RERANK, + SERVER_TASK_TYPE_INFILL, + SERVER_TASK_TYPE_CANCEL, + SERVER_TASK_TYPE_NEXT_RESPONSE, + SERVER_TASK_TYPE_METRICS, + SERVER_TASK_TYPE_SLOT_SAVE, + SERVER_TASK_TYPE_SLOT_RESTORE, + SERVER_TASK_TYPE_SLOT_ERASE, + SERVER_TASK_TYPE_SET_LORA, +}; + +enum oaicompat_type { + OAICOMPAT_TYPE_NONE, + OAICOMPAT_TYPE_CHAT, + OAICOMPAT_TYPE_COMPLETION, + OAICOMPAT_TYPE_EMBEDDING, +}; + +// https://community.openai.com/t/openai-chat-list-of-error-codes-and-types/357791/11 +enum error_type { + ERROR_TYPE_INVALID_REQUEST, + ERROR_TYPE_AUTHENTICATION, + ERROR_TYPE_SERVER, + ERROR_TYPE_NOT_FOUND, + ERROR_TYPE_PERMISSION, + ERROR_TYPE_UNAVAILABLE, // custom error + ERROR_TYPE_NOT_SUPPORTED, // custom error +}; + +struct slot_params { + bool stream = true; + bool cache_prompt = true; // remember the prompt to avoid reprocessing all prompt + bool return_tokens = false; + + int32_t n_keep = 0; // number of tokens to keep from initial prompt + int32_t n_discard = 0; // number of tokens after n_keep that may be discarded when shifting context, 0 defaults to half + int32_t n_predict = -1; // new tokens to predict + int32_t n_indent = 0; // mininum line indentation for the generated text in number of whitespace characters + + int64_t t_max_prompt_ms = -1; // TODO: implement + int64_t t_max_predict_ms = -1; // if positive, limit the generation phase to this time limit + + std::vector<common_adapter_lora_info> lora; + + std::vector<std::string> antiprompt; + std::vector<std::string> response_fields; + bool timings_per_token = false; + bool post_sampling_probs = false; + bool ignore_eos = false; + + struct common_params_sampling sampling; + struct common_params_speculative speculative; + + // OAI-compat 
fields + bool verbose = false; + oaicompat_type oaicompat = OAICOMPAT_TYPE_NONE; + std::string oaicompat_model; + std::string oaicompat_cmpl_id; + common_chat_format oaicompat_chat_format = COMMON_CHAT_FORMAT_CONTENT_ONLY; + + json to_json() const { + std::vector<std::string> samplers; + samplers.reserve(sampling.samplers.size()); + for (const auto & sampler : sampling.samplers) { + samplers.emplace_back(common_sampler_type_to_str(sampler)); + } + + json lora = json::array(); + for (size_t i = 0; i < this->lora.size(); ++i) { + lora.push_back({{"id", i}, {"scale", this->lora[i].scale}}); + } + + auto grammar_triggers = json::array(); + for (const auto & trigger : sampling.grammar_triggers) { + server_grammar_trigger ct(std::move(trigger)); + grammar_triggers.push_back(ct.to_json()); + } + + return json { + {"n_predict", n_predict}, // Server configured n_predict + {"seed", sampling.seed}, + {"temperature", sampling.temp}, + {"dynatemp_range", sampling.dynatemp_range}, + {"dynatemp_exponent", sampling.dynatemp_exponent}, + {"top_k", sampling.top_k}, + {"top_p", sampling.top_p}, + {"min_p", sampling.min_p}, + {"xtc_probability", sampling.xtc_probability}, + {"xtc_threshold", sampling.xtc_threshold}, + {"typical_p", sampling.typ_p}, + {"repeat_last_n", sampling.penalty_last_n}, + {"repeat_penalty", sampling.penalty_repeat}, + {"presence_penalty", sampling.penalty_present}, + {"frequency_penalty", sampling.penalty_freq}, + {"dry_multiplier", sampling.dry_multiplier}, + {"dry_base", sampling.dry_base}, + {"dry_allowed_length", sampling.dry_allowed_length}, + {"dry_penalty_last_n", sampling.dry_penalty_last_n}, + {"dry_sequence_breakers", sampling.dry_sequence_breakers}, + {"mirostat", sampling.mirostat}, + {"mirostat_tau", sampling.mirostat_tau}, + {"mirostat_eta", sampling.mirostat_eta}, + {"stop", antiprompt}, + {"max_tokens", n_predict}, // User configured n_predict + {"n_keep", n_keep}, + {"n_discard", n_discard}, + {"ignore_eos", sampling.ignore_eos}, + {"stream", stream}, + {"logit_bias", format_logit_bias(sampling.logit_bias)}, + {"n_probs", sampling.n_probs}, + {"min_keep", sampling.min_keep}, + {"grammar", sampling.grammar}, + {"grammar_lazy", sampling.grammar_lazy}, + {"grammar_triggers", grammar_triggers}, + {"preserved_tokens", sampling.preserved_tokens}, + {"chat_format", common_chat_format_name(oaicompat_chat_format)}, + {"samplers", samplers}, + {"speculative.n_max", speculative.n_max}, + {"speculative.n_min", speculative.n_min}, + {"speculative.p_min", speculative.p_min}, + {"timings_per_token", timings_per_token}, + {"post_sampling_probs", post_sampling_probs}, + {"lora", lora}, + }; + } +}; + +struct server_task { + int id = -1; // to be filled by server_queue + int index = -1; // used when there are multiple prompts (batch request) + + server_task_type type; + + // used by SERVER_TASK_TYPE_CANCEL + int id_target = -1; + + // used by SERVER_TASK_TYPE_INFERENCE + slot_params params; + llama_tokens prompt_tokens; + int id_selected_slot = -1; + + // used by SERVER_TASK_TYPE_SLOT_SAVE, SERVER_TASK_TYPE_SLOT_RESTORE, SERVER_TASK_TYPE_SLOT_ERASE + struct slot_action { + int slot_id; + std::string filename; + std::string filepath; + }; + slot_action slot_action; + + // used by SERVER_TASK_TYPE_METRICS + bool metrics_reset_bucket = false; + + // used by SERVER_TASK_TYPE_SET_LORA + std::vector<common_adapter_lora_info> set_lora; + + server_task(server_task_type type) : type(type) {} + + static slot_params params_from_json_cmpl( + const llama_context * ctx, + const common_params & 
params_base, + const json & data) { + const llama_model * model = llama_get_model(ctx); + const llama_vocab * vocab = llama_model_get_vocab(model); + + slot_params params; + + // Sampling parameter defaults are loaded from the global server context (but individual requests can still override them) + slot_params defaults; + defaults.sampling = params_base.sampling; + defaults.speculative = params_base.speculative; + + // enabling this will output extra debug information in the HTTP responses from the server + params.verbose = params_base.verbosity > 9; + params.timings_per_token = json_value(data, "timings_per_token", false); + + params.stream = json_value(data, "stream", false); + params.cache_prompt = json_value(data, "cache_prompt", true); + params.return_tokens = json_value(data, "return_tokens", false); + params.n_predict = json_value(data, "n_predict", json_value(data, "max_tokens", defaults.n_predict)); + params.n_indent = json_value(data, "n_indent", defaults.n_indent); + params.n_keep = json_value(data, "n_keep", defaults.n_keep); + params.n_discard = json_value(data, "n_discard", defaults.n_discard); + //params.t_max_prompt_ms = json_value(data, "t_max_prompt_ms", defaults.t_max_prompt_ms); // TODO: implement + params.t_max_predict_ms = json_value(data, "t_max_predict_ms", defaults.t_max_predict_ms); + params.response_fields = json_value(data, "response_fields", std::vector<std::string>()); + + params.sampling.top_k = json_value(data, "top_k", defaults.sampling.top_k); + params.sampling.top_p = json_value(data, "top_p", defaults.sampling.top_p); + params.sampling.min_p = json_value(data, "min_p", defaults.sampling.min_p); + params.sampling.xtc_probability = json_value(data, "xtc_probability", defaults.sampling.xtc_probability); + params.sampling.xtc_threshold = json_value(data, "xtc_threshold", defaults.sampling.xtc_threshold); + params.sampling.typ_p = json_value(data, "typical_p", defaults.sampling.typ_p); + params.sampling.temp = json_value(data, "temperature", defaults.sampling.temp); + params.sampling.dynatemp_range = json_value(data, "dynatemp_range", defaults.sampling.dynatemp_range); + params.sampling.dynatemp_exponent = json_value(data, "dynatemp_exponent", defaults.sampling.dynatemp_exponent); + params.sampling.penalty_last_n = json_value(data, "repeat_last_n", defaults.sampling.penalty_last_n); + params.sampling.penalty_repeat = json_value(data, "repeat_penalty", defaults.sampling.penalty_repeat); + params.sampling.penalty_freq = json_value(data, "frequency_penalty", defaults.sampling.penalty_freq); + params.sampling.penalty_present = json_value(data, "presence_penalty", defaults.sampling.penalty_present); + params.sampling.dry_multiplier = json_value(data, "dry_multiplier", defaults.sampling.dry_multiplier); + params.sampling.dry_base = json_value(data, "dry_base", defaults.sampling.dry_base); + params.sampling.dry_allowed_length = json_value(data, "dry_allowed_length", defaults.sampling.dry_allowed_length); + params.sampling.dry_penalty_last_n = json_value(data, "dry_penalty_last_n", defaults.sampling.dry_penalty_last_n); + params.sampling.mirostat = json_value(data, "mirostat", defaults.sampling.mirostat); + params.sampling.mirostat_tau = json_value(data, "mirostat_tau", defaults.sampling.mirostat_tau); + params.sampling.mirostat_eta = json_value(data, "mirostat_eta", defaults.sampling.mirostat_eta); + params.sampling.seed = json_value(data, "seed", defaults.sampling.seed); + params.sampling.n_probs = json_value(data, "n_probs", defaults.sampling.n_probs); + 
params.sampling.min_keep = json_value(data, "min_keep", defaults.sampling.min_keep); + params.post_sampling_probs = json_value(data, "post_sampling_probs", defaults.post_sampling_probs); + + params.speculative.n_min = json_value(data, "speculative.n_min", defaults.speculative.n_min); + params.speculative.n_max = json_value(data, "speculative.n_max", defaults.speculative.n_max); + params.speculative.p_min = json_value(data, "speculative.p_min", defaults.speculative.p_min); + + params.speculative.n_min = std::min(params.speculative.n_max, params.speculative.n_min); + params.speculative.n_min = std::max(params.speculative.n_min, 0); + params.speculative.n_max = std::max(params.speculative.n_max, 0); + + // Use OpenAI API logprobs only if n_probs wasn't provided + if (data.contains("logprobs") && params.sampling.n_probs == defaults.sampling.n_probs){ + params.sampling.n_probs = json_value(data, "logprobs", defaults.sampling.n_probs); + } + + if (data.contains("lora")) { + if (data.at("lora").is_array()) { + params.lora = parse_lora_request(params_base.lora_adapters, data.at("lora")); + } else { + throw std::runtime_error("Error: 'lora' must be an array of objects with 'id' and 'scale' fields"); + } + } else { + params.lora = params_base.lora_adapters; + } + + // TODO: add more sanity checks for the input parameters + + if (params.sampling.penalty_last_n < -1) { + throw std::runtime_error("Error: repeat_last_n must be >= -1"); + } + + if (params.sampling.dry_penalty_last_n < -1) { + throw std::runtime_error("Error: dry_penalty_last_n must be >= -1"); + } + + if (params.sampling.penalty_last_n == -1) { + // note: should be the slot's context and not the full context, but it's ok + params.sampling.penalty_last_n = llama_n_ctx(ctx); + } + + if (params.sampling.dry_penalty_last_n == -1) { + params.sampling.dry_penalty_last_n = llama_n_ctx(ctx); + } + + if (params.sampling.dry_base < 1.0f) { + params.sampling.dry_base = defaults.sampling.dry_base; + } + + // sequence breakers for DRY + { + // Currently, this is not compatible with TextGen WebUI, Koboldcpp and SillyTavern format + // Ref: https://github.com/oobabooga/text-generation-webui/blob/d1af7a41ade7bd3c3a463bfa640725edb818ebaf/extensions/openai/typing.py#L39 + + if (data.contains("dry_sequence_breakers")) { + params.sampling.dry_sequence_breakers = json_value(data, "dry_sequence_breakers", std::vector<std::string>()); + if (params.sampling.dry_sequence_breakers.empty()) { + throw std::runtime_error("Error: dry_sequence_breakers must be a non-empty array of strings"); + } + } + } + + // process "json_schema" and "grammar" + if (data.contains("json_schema") && !data.contains("grammar")) { + try { + auto schema = json_value(data, "json_schema", json::object()); + SRV_DBG("JSON schema: %s\n", schema.dump(2).c_str()); + params.sampling.grammar = json_schema_to_grammar(schema); + SRV_DBG("Converted grammar: %s\n", params.sampling.grammar.c_str()); + } catch (const std::exception & e) { + throw std::runtime_error(std::string("\"json_schema\": ") + e.what()); + } + } else { + params.sampling.grammar = json_value(data, "grammar", defaults.sampling.grammar); + SRV_DBG("Grammar: %s\n", params.sampling.grammar.c_str()); + params.sampling.grammar_lazy = json_value(data, "grammar_lazy", defaults.sampling.grammar_lazy); + SRV_DBG("Grammar lazy: %s\n", params.sampling.grammar_lazy ? 
"true" : "false"); + } + + { + auto it = data.find("chat_format"); + if (it != data.end()) { + params.oaicompat_chat_format = static_cast<common_chat_format>(it->get<int>()); + SRV_INF("Chat format: %s\n", common_chat_format_name(params.oaicompat_chat_format).c_str()); + } else { + params.oaicompat_chat_format = defaults.oaicompat_chat_format; + } + } + + { + const auto preserved_tokens = data.find("preserved_tokens"); + if (preserved_tokens != data.end()) { + for (const auto & t : *preserved_tokens) { + auto ids = common_tokenize(vocab, t.get<std::string>(), /* add_special= */ false, /* parse_special= */ true); + if (ids.size() == 1) { + SRV_DBG("Preserved token: %d\n", ids[0]); + params.sampling.preserved_tokens.insert(ids[0]); + } else { + // This may happen when using a tool call style meant for a model with special tokens to preserve on a model without said tokens. + SRV_DBG("Not preserved because more than 1 token: %s\n", t.get<std::string>().c_str()); + } + } + } + const auto grammar_triggers = data.find("grammar_triggers"); + if (grammar_triggers != data.end()) { + for (const auto & t : *grammar_triggers) { + server_grammar_trigger ct(t); + if (ct.value.type == COMMON_GRAMMAR_TRIGGER_TYPE_WORD) { + const auto & word = ct.value.value; + auto ids = common_tokenize(vocab, word, /* add_special= */ false, /* parse_special= */ true); + if (ids.size() == 1) { + auto token = ids[0]; + if (std::find(params.sampling.preserved_tokens.begin(), params.sampling.preserved_tokens.end(), (llama_token) token) == params.sampling.preserved_tokens.end()) { + throw std::runtime_error("Grammar trigger word should be marked as preserved token: " + word); + } + SRV_DBG("Grammar trigger token: %d (`%s`)\n", token, word.c_str()); + common_grammar_trigger trigger; + trigger.type = COMMON_GRAMMAR_TRIGGER_TYPE_TOKEN; + trigger.value = word; + trigger.token = token; + params.sampling.grammar_triggers.push_back(std::move(trigger)); + } else { + SRV_DBG("Grammar trigger word: `%s`\n", word.c_str()); + params.sampling.grammar_triggers.push_back({COMMON_GRAMMAR_TRIGGER_TYPE_WORD, word}); + } + } else { + params.sampling.grammar_triggers.push_back(std::move(ct.value)); + } + } + } + if (params.sampling.grammar_lazy && params.sampling.grammar_triggers.empty()) { + throw std::runtime_error("Error: no triggers set for lazy grammar!"); + } + } + + { + params.sampling.logit_bias.clear(); + params.ignore_eos = json_value(data, "ignore_eos", false); + + const auto & logit_bias = data.find("logit_bias"); + if (logit_bias != data.end() && logit_bias->is_array()) { + const int n_vocab = llama_vocab_n_tokens(vocab); + for (const auto & el : *logit_bias) { + // TODO: we may want to throw errors here, in case "el" is incorrect + if (el.is_array() && el.size() == 2) { + float bias; + if (el[1].is_number()) { + bias = el[1].get<float>(); + } else if (el[1].is_boolean() && !el[1].get<bool>()) { + bias = -INFINITY; + } else { + continue; + } + + if (el[0].is_number_integer()) { + llama_token tok = el[0].get<llama_token>(); + if (tok >= 0 && tok < n_vocab) { + params.sampling.logit_bias.push_back({tok, bias}); + } + } else if (el[0].is_string()) { + auto toks = common_tokenize(vocab, el[0].get<std::string>(), false); + for (auto tok : toks) { + params.sampling.logit_bias.push_back({tok, bias}); + } + } + } + } + } + } + + { + params.antiprompt.clear(); + + const auto & stop = data.find("stop"); + if (stop != data.end() && stop->is_array()) { + for (const auto & word : *stop) { + if (!word.empty()) { + 
params.antiprompt.push_back(word); + } + } + } + } + + { + const auto samplers = data.find("samplers"); + if (samplers != data.end()) { + if (samplers->is_array()) { + params.sampling.samplers = common_sampler_types_from_names(*samplers, false); + } else if (samplers->is_string()){ + params.sampling.samplers = common_sampler_types_from_chars(samplers->get<std::string>()); + } + } else { + params.sampling.samplers = defaults.sampling.samplers; + } + } + + std::string model_name = params_base.model_alias.empty() ? DEFAULT_OAICOMPAT_MODEL : params_base.model_alias; + params.oaicompat_model = json_value(data, "model", model_name); + + return params; + } + + // utility function + static std::unordered_set<int> get_list_id(const std::vector<server_task> & tasks) { + std::unordered_set<int> ids(tasks.size()); + for (size_t i = 0; i < tasks.size(); i++) { + ids.insert(tasks[i].id); + } + return ids; + } +}; + +struct result_timings { + int32_t prompt_n = -1; + double prompt_ms; + double prompt_per_token_ms; + double prompt_per_second; + + int32_t predicted_n = -1; + double predicted_ms; + double predicted_per_token_ms; + double predicted_per_second; + + // Optional speculative metrics - only included when > 0 + int32_t draft_n = 0; + int32_t draft_n_accepted = 0; + + json to_json() const { + json base = { + {"prompt_n", prompt_n}, + {"prompt_ms", prompt_ms}, + {"prompt_per_token_ms", prompt_per_token_ms}, + {"prompt_per_second", prompt_per_second}, + + {"predicted_n", predicted_n}, + {"predicted_ms", predicted_ms}, + {"predicted_per_token_ms", predicted_per_token_ms}, + {"predicted_per_second", predicted_per_second}, + }; + + if (draft_n > 0) { + base["draft_n"] = draft_n; + base["draft_n_accepted"] = draft_n_accepted; + } + + return base; + } +}; + +struct server_task_result { + int id = -1; + int id_slot = -1; + virtual bool is_error() { + // only used by server_task_result_error + return false; + } + virtual bool is_stop() { + // only used by server_task_result_cmpl_* + return false; + } + virtual int get_index() { + return -1; + } + virtual json to_json() = 0; + virtual ~server_task_result() = default; +}; + +// using shared_ptr for polymorphism of server_task_result +using server_task_result_ptr = std::unique_ptr<server_task_result>; + +inline std::string stop_type_to_str(stop_type type) { + switch (type) { + case STOP_TYPE_EOS: return "eos"; + case STOP_TYPE_WORD: return "word"; + case STOP_TYPE_LIMIT: return "limit"; + default: return "none"; + } +} + +struct completion_token_output { + llama_token tok; + float prob; + std::string text_to_send; + struct prob_info { + llama_token tok; + std::string txt; + float prob; + }; + std::vector<prob_info> probs; + + json to_json(bool post_sampling_probs) const { + json probs_for_token = json::array(); + for (const auto & p : probs) { + std::string txt(p.txt); + txt.resize(validate_utf8(txt)); + probs_for_token.push_back(json { + {"id", p.tok}, + {"token", txt}, + {"bytes", str_to_bytes(p.txt)}, + { + post_sampling_probs ? "prob" : "logprob", + post_sampling_probs ? p.prob : logarithm(p.prob) + }, + }); + } + return probs_for_token; + } + + static json probs_vector_to_json(const std::vector<completion_token_output> & probs, bool post_sampling_probs) { + json out = json::array(); + for (const auto & p : probs) { + std::string txt(p.text_to_send); + txt.resize(validate_utf8(txt)); + out.push_back(json { + {"id", p.tok}, + {"token", txt}, + {"bytes", str_to_bytes(p.text_to_send)}, + { + post_sampling_probs ? "prob" : "logprob", + post_sampling_probs ? 
p.prob : logarithm(p.prob) + }, + { + post_sampling_probs ? "top_probs" : "top_logprobs", + p.to_json(post_sampling_probs) + }, + }); + } + return out; + } + + static float logarithm(float x) { + // nlohmann::json converts -inf to null, so we need to prevent that + return x == 0.0f ? std::numeric_limits<float>::lowest() : std::log(x); + } + + static std::vector<unsigned char> str_to_bytes(const std::string & str) { + std::vector<unsigned char> bytes; + for (unsigned char c : str) { + bytes.push_back(c); + } + return bytes; + } +}; + +struct server_task_result_cmpl_final : server_task_result { + int index = 0; + + std::string content; + llama_tokens tokens; + + bool stream; + result_timings timings; + std::string prompt; + + bool truncated; + int32_t n_decoded; + int32_t n_prompt_tokens; + int32_t n_tokens_cached; + bool has_new_line; + std::string stopping_word; + stop_type stop = STOP_TYPE_NONE; + + bool post_sampling_probs; + std::vector<completion_token_output> probs_output; + std::vector<std::string> response_fields; + + slot_params generation_params; + + // OAI-compat fields + bool verbose = false; + oaicompat_type oaicompat = OAICOMPAT_TYPE_NONE; + std::string oaicompat_model; + std::string oaicompat_cmpl_id; + common_chat_format oaicompat_chat_format = COMMON_CHAT_FORMAT_CONTENT_ONLY; + + virtual int get_index() override { + return index; + } + + virtual bool is_stop() override { + return true; // in stream mode, final responses are considered stop + } + + virtual json to_json() override { + switch (oaicompat) { + case OAICOMPAT_TYPE_NONE: + return to_json_non_oaicompat(); + case OAICOMPAT_TYPE_COMPLETION: + return to_json_oaicompat(); + case OAICOMPAT_TYPE_CHAT: + return stream ? to_json_oaicompat_chat_stream() : to_json_oaicompat_chat(); + default: + GGML_ASSERT(false && "Invalid oaicompat_type"); + } + } + + json to_json_non_oaicompat() { + json res = json { + {"index", index}, + {"content", stream ? "" : content}, // in stream mode, content is already in last partial chunk + {"tokens", stream ? llama_tokens {} : tokens}, + {"id_slot", id_slot}, + {"stop", true}, + {"model", oaicompat_model}, + {"tokens_predicted", n_decoded}, + {"tokens_evaluated", n_prompt_tokens}, + {"generation_settings", generation_params.to_json()}, + {"prompt", prompt}, + {"has_new_line", has_new_line}, + {"truncated", truncated}, + {"stop_type", stop_type_to_str(stop)}, + {"stopping_word", stopping_word}, + {"tokens_cached", n_tokens_cached}, + {"timings", timings.to_json()}, + }; + if (!stream && !probs_output.empty()) { + res["completion_probabilities"] = completion_token_output::probs_vector_to_json(probs_output, post_sampling_probs); + } + return response_fields.empty() ? res : json_get_nested_values(response_fields, res); + } + + json to_json_oaicompat() { + std::time_t t = std::time(0); + json logprobs = json(nullptr); // OAI default to null + if (!stream && probs_output.size() > 0) { + logprobs = json{ + {"content", completion_token_output::probs_vector_to_json(probs_output, post_sampling_probs)}, + }; + } + json finish_reason = "length"; + if (stop == STOP_TYPE_WORD || stop == STOP_TYPE_EOS) { + finish_reason = "stop"; + } + json res = json { + {"choices", json::array({ + json{ + {"text", stream ? 
"" : content}, // in stream mode, content is already in last partial chunk + {"index", index}, + {"logprobs", logprobs}, + {"finish_reason", finish_reason}, + } + })}, + {"created", t}, + {"model", oaicompat_model}, + {"system_fingerprint", build_info}, + {"object", "text_completion"}, + {"usage", json { + {"completion_tokens", n_decoded}, + {"prompt_tokens", n_prompt_tokens}, + {"total_tokens", n_decoded + n_prompt_tokens} + }}, + {"id", oaicompat_cmpl_id} + }; + + // extra fields for debugging purposes + if (verbose) { + res["__verbose"] = to_json_non_oaicompat(); + } + if (timings.prompt_n >= 0) { + res.push_back({"timings", timings.to_json()}); + } + + return res; + } + + json to_json_oaicompat_chat() { + std::string finish_reason = "length"; + common_chat_msg msg; + if (stop == STOP_TYPE_WORD || stop == STOP_TYPE_EOS) { + SRV_DBG("Parsing chat message: %s\n", content.c_str()); + msg = common_chat_parse(content, oaicompat_chat_format); + finish_reason = msg.tool_calls.empty() ? "stop" : "tool_calls"; + } else { + msg.content = content; + } + + json message { + {"role", "assistant"}, + }; + if (!msg.reasoning_content.empty()) { + message["reasoning_content"] = msg.reasoning_content; + } + if (msg.content.empty() && !msg.tool_calls.empty()) { + message["content"] = json(); + } else { + message["content"] = msg.content; + } + if (!msg.tool_calls.empty()) { + auto tool_calls = json::array(); + for (const auto & tc : msg.tool_calls) { + tool_calls.push_back({ + {"type", "function"}, + {"function", { + {"name", tc.name}, + {"arguments", tc.arguments}, + }}, + // Some templates generate and require an id (sometimes in a very specific format, e.g. Mistral Nemo). + // We only generate a random id for the ones that don't generate one by themselves + // (they also won't get to see it as their template likely doesn't use it, so it's all for the client) + {"id", tc.id.empty() ? 
gen_tool_call_id() : tc.id}, + }); + } + message["tool_calls"] = tool_calls; + } + + json choice { + {"finish_reason", finish_reason}, + {"index", 0}, + {"message", message}, + }; + + if (!stream && probs_output.size() > 0) { + choice["logprobs"] = json{ + {"content", completion_token_output::probs_vector_to_json(probs_output, post_sampling_probs)}, + }; + } + + std::time_t t = std::time(0); + + json res = json { + {"choices", json::array({choice})}, + {"created", t}, + {"model", oaicompat_model}, + {"system_fingerprint", build_info}, + {"object", "chat.completion"}, + {"usage", json { + {"completion_tokens", n_decoded}, + {"prompt_tokens", n_prompt_tokens}, + {"total_tokens", n_decoded + n_prompt_tokens} + }}, + {"id", oaicompat_cmpl_id} + }; + + // extra fields for debugging purposes + if (verbose) { + res["__verbose"] = to_json_non_oaicompat(); + } + if (timings.prompt_n >= 0) { + res.push_back({"timings", timings.to_json()}); + } + + return res; + } + + json to_json_oaicompat_chat_stream() { + std::time_t t = std::time(0); + std::string finish_reason = "length"; + if (stop == STOP_TYPE_WORD || stop == STOP_TYPE_EOS) { + finish_reason = "stop"; + } + + json choice = json { + {"finish_reason", finish_reason}, + {"index", 0}, + {"delta", json::object()} + }; + + json ret = json { + {"choices", json::array({choice})}, + {"created", t}, + {"id", oaicompat_cmpl_id}, + {"model", oaicompat_model}, + {"system_fingerprint", build_info}, + {"object", "chat.completion.chunk"}, + {"usage", json { + {"completion_tokens", n_decoded}, + {"prompt_tokens", n_prompt_tokens}, + {"total_tokens", n_decoded + n_prompt_tokens}, + }}, + }; + + if (timings.prompt_n >= 0) { + ret.push_back({"timings", timings.to_json()}); + } + + // extra fields for debugging purposes + if (verbose) { + ret["__verbose"] = to_json_non_oaicompat(); + } + + return ret; + } +}; + +struct server_task_result_cmpl_partial : server_task_result { + int index = 0; + + std::string content; + llama_tokens tokens; + + int32_t n_decoded; + int32_t n_prompt_tokens; + + bool post_sampling_probs; + completion_token_output prob_output; + result_timings timings; + + // OAI-compat fields + bool verbose = false; + oaicompat_type oaicompat = OAICOMPAT_TYPE_NONE; + std::string oaicompat_model; + std::string oaicompat_cmpl_id; + + virtual int get_index() override { + return index; + } + + virtual bool is_stop() override { + return false; // in stream mode, partial responses are not considered stop + } + + virtual json to_json() override { + switch (oaicompat) { + case OAICOMPAT_TYPE_NONE: + return to_json_non_oaicompat(); + case OAICOMPAT_TYPE_COMPLETION: + return to_json_oaicompat(); + case OAICOMPAT_TYPE_CHAT: + return to_json_oaicompat_chat(); + default: + GGML_ASSERT(false && "Invalid oaicompat_type"); + } + } + + json to_json_non_oaicompat() { + // non-OAI-compat JSON + json res = json { + {"index", index}, + {"content", content}, + {"tokens", tokens}, + {"stop", false}, + {"id_slot", id_slot}, + {"tokens_predicted", n_decoded}, + {"tokens_evaluated", n_prompt_tokens}, + }; + // populate the timings object when needed (usually for the last response or with timings_per_token enabled) + if (timings.prompt_n > 0) { + res.push_back({"timings", timings.to_json()}); + } + if (!prob_output.probs.empty()) { + res["completion_probabilities"] = completion_token_output::probs_vector_to_json({prob_output}, post_sampling_probs); + } + return res; + } + + json to_json_oaicompat() { + std::time_t t = std::time(0); + json logprobs = json(nullptr); // OAI default 
to null + if (prob_output.probs.size() > 0) { + logprobs = json{ + {"content", completion_token_output::probs_vector_to_json({prob_output}, post_sampling_probs)}, + }; + } + json res = json { + {"choices", json::array({ + json{ + {"text", content}, + {"index", index}, + {"logprobs", logprobs}, + {"finish_reason", nullptr}, + } + })}, + {"created", t}, + {"model", oaicompat_model}, + {"system_fingerprint", build_info}, + {"object", "text_completion"}, + {"id", oaicompat_cmpl_id} + }; + + // extra fields for debugging purposes + if (verbose) { + res["__verbose"] = to_json_non_oaicompat(); + } + if (timings.prompt_n >= 0) { + res.push_back({"timings", timings.to_json()}); + } + + return res; + } + + json to_json_oaicompat_chat() { + bool first = n_decoded == 0; + std::time_t t = std::time(0); + json choices; + + if (first) { + if (content.empty()) { + choices = json::array({json{{"finish_reason", nullptr}, + {"index", 0}, + {"delta", json{{"role", "assistant"}}}}}); + } else { + // We have to send this as two updates to conform to openai behavior + json initial_ret = json{{"choices", json::array({json{ + {"finish_reason", nullptr}, + {"index", 0}, + {"delta", json{ + {"role", "assistant"} + }}}})}, + {"created", t}, + {"id", oaicompat_cmpl_id}, + {"model", oaicompat_model}, + {"object", "chat.completion.chunk"}}; + + json second_ret = json{ + {"choices", json::array({json{{"finish_reason", nullptr}, + {"index", 0}, + {"delta", json { + {"content", content}}} + }})}, + {"created", t}, + {"id", oaicompat_cmpl_id}, + {"model", oaicompat_model}, + {"object", "chat.completion.chunk"}}; + + return std::vector<json>({initial_ret, second_ret}); + } + } else { + choices = json::array({json{ + {"finish_reason", nullptr}, + {"index", 0}, + {"delta", + json { + {"content", content}, + }}, + }}); + } + + GGML_ASSERT(choices.size() >= 1); + + if (prob_output.probs.size() > 0) { + choices[0]["logprobs"] = json{ + {"content", completion_token_output::probs_vector_to_json({prob_output}, post_sampling_probs)}, + }; + } + + json ret = json { + {"choices", choices}, + {"created", t}, + {"id", oaicompat_cmpl_id}, + {"model", oaicompat_model}, + {"system_fingerprint", build_info}, + {"object", "chat.completion.chunk"} + }; + + if (timings.prompt_n >= 0) { + ret.push_back({"timings", timings.to_json()}); + } + + return std::vector<json>({ret}); + } +}; + +struct server_task_result_embd : server_task_result { + int index = 0; + std::vector<std::vector<float>> embedding; + + int32_t n_tokens; + + // OAI-compat fields + oaicompat_type oaicompat = OAICOMPAT_TYPE_NONE; + + virtual int get_index() override { + return index; + } + + virtual json to_json() override { + return oaicompat == OAICOMPAT_TYPE_EMBEDDING + ? 
to_json_oaicompat() + : to_json_non_oaicompat(); + } + + json to_json_non_oaicompat() { + return json { + {"index", index}, + {"embedding", embedding}, + }; + } + + json to_json_oaicompat() { + return json { + {"index", index}, + {"embedding", embedding[0]}, + {"tokens_evaluated", n_tokens}, + }; + } +}; + +struct server_task_result_rerank : server_task_result { + int index = 0; + float score = -1e6; + + int32_t n_tokens; + + virtual int get_index() override { + return index; + } + + virtual json to_json() override { + return json { + {"index", index}, + {"score", score}, + {"tokens_evaluated", n_tokens}, + }; + } +}; + +// this function maybe used outside of server_task_result_error +static json format_error_response(const std::string & message, const enum error_type type) { + std::string type_str; + int code = 500; + switch (type) { + case ERROR_TYPE_INVALID_REQUEST: + type_str = "invalid_request_error"; + code = 400; + break; + case ERROR_TYPE_AUTHENTICATION: + type_str = "authentication_error"; + code = 401; + break; + case ERROR_TYPE_NOT_FOUND: + type_str = "not_found_error"; + code = 404; + break; + case ERROR_TYPE_SERVER: + type_str = "server_error"; + code = 500; + break; + case ERROR_TYPE_PERMISSION: + type_str = "permission_error"; + code = 403; + break; + case ERROR_TYPE_NOT_SUPPORTED: + type_str = "not_supported_error"; + code = 501; + break; + case ERROR_TYPE_UNAVAILABLE: + type_str = "unavailable_error"; + code = 503; + break; + } + return json { + {"code", code}, + {"message", message}, + {"type", type_str}, + }; +} + +struct server_task_result_error : server_task_result { + int index = 0; + error_type err_type = ERROR_TYPE_SERVER; + std::string err_msg; + + virtual bool is_error() override { + return true; + } + + virtual json to_json() override { + return format_error_response(err_msg, err_type); + } +}; + +struct server_task_result_metrics : server_task_result { + int n_idle_slots; + int n_processing_slots; + int n_tasks_deferred; + int64_t t_start; + + int32_t kv_cache_tokens_count; + int32_t kv_cache_used_cells; + + // TODO: somehow reuse server_metrics in the future, instead of duplicating the fields + uint64_t n_prompt_tokens_processed_total = 0; + uint64_t t_prompt_processing_total = 0; + uint64_t n_tokens_predicted_total = 0; + uint64_t t_tokens_generation_total = 0; + + uint64_t n_prompt_tokens_processed = 0; + uint64_t t_prompt_processing = 0; + + uint64_t n_tokens_predicted = 0; + uint64_t t_tokens_generation = 0; + + uint64_t n_decode_total = 0; + uint64_t n_busy_slots_total = 0; + + // while we can also use std::vector<server_slot> this requires copying the slot object which can be quite messy + // therefore, we use json to temporarily store the slot.to_json() result + json slots_data = json::array(); + + virtual json to_json() override { + return json { + { "idle", n_idle_slots }, + { "processing", n_processing_slots }, + { "deferred", n_tasks_deferred }, + { "t_start", t_start }, + + { "n_prompt_tokens_processed_total", n_prompt_tokens_processed_total }, + { "t_tokens_generation_total", t_tokens_generation_total }, + { "n_tokens_predicted_total", n_tokens_predicted_total }, + { "t_prompt_processing_total", t_prompt_processing_total }, + + { "n_prompt_tokens_processed", n_prompt_tokens_processed }, + { "t_prompt_processing", t_prompt_processing }, + { "n_tokens_predicted", n_tokens_predicted }, + { "t_tokens_generation", t_tokens_generation }, + + { "n_decode_total", n_decode_total }, + { "n_busy_slots_total", n_busy_slots_total }, + + { 
"kv_cache_tokens_count", kv_cache_tokens_count }, + { "kv_cache_used_cells", kv_cache_used_cells }, + + { "slots", slots_data }, + }; + } +}; + +struct server_task_result_slot_save_load : server_task_result { + std::string filename; + bool is_save; // true = save, false = load + + size_t n_tokens; + size_t n_bytes; + double t_ms; + + virtual json to_json() override { + if (is_save) { + return json { + { "id_slot", id_slot }, + { "filename", filename }, + { "n_saved", n_tokens }, + { "n_written", n_bytes }, + { "timings", { + { "save_ms", t_ms } + }}, + }; + } else { + return json { + { "id_slot", id_slot }, + { "filename", filename }, + { "n_restored", n_tokens }, + { "n_read", n_bytes }, + { "timings", { + { "restore_ms", t_ms } + }}, + }; + } + } +}; + +struct server_task_result_slot_erase : server_task_result { + size_t n_erased; + + virtual json to_json() override { + return json { + { "id_slot", id_slot }, + { "n_erased", n_erased }, + }; + } +}; + +struct server_task_result_apply_lora : server_task_result { + virtual json to_json() override { + return json {{ "success", true }}; + } +}; + +struct server_slot { + int id; + int id_task = -1; + + // only used for completion/embedding/infill/rerank + server_task_type task_type = SERVER_TASK_TYPE_COMPLETION; + + llama_batch batch_spec = {}; + + llama_context * ctx = nullptr; + llama_context * ctx_dft = nullptr; + + common_speculative * spec = nullptr; + + std::vector<common_adapter_lora_info> lora; + + // the index relative to completion multi-task request + size_t index = 0; + + struct slot_params params; + + slot_state state = SLOT_STATE_IDLE; + + // used to determine the slot that has been used the longest + int64_t t_last_used = -1; + + // generation props + int32_t n_ctx = 0; // context size per slot + int32_t n_past = 0; + int32_t n_decoded = 0; + int32_t n_remaining = -1; + int32_t i_batch = -1; + int32_t n_predict = -1; // TODO: disambiguate from params.n_predict + + // n_prompt_tokens may not be equal to prompt_tokens.size(), because prompt maybe truncated + int32_t n_prompt_tokens = 0; + int32_t n_prompt_tokens_processed = 0; + + // input prompt tokens + llama_tokens prompt_tokens; + + size_t last_nl_pos = 0; + + std::string generated_text; + llama_tokens generated_tokens; + + llama_tokens cache_tokens; + + std::vector<completion_token_output> generated_token_probs; + + bool has_next_token = true; + bool has_new_line = false; + bool truncated = false; + stop_type stop; + + std::string stopping_word; + + // sampling + json json_schema; + + struct common_sampler * smpl = nullptr; + + llama_token sampled; + + common_chat_format chat_format = COMMON_CHAT_FORMAT_CONTENT_ONLY; + + // stats + size_t n_sent_text = 0; // number of sent text character + + int64_t t_start_process_prompt; + int64_t t_start_generation; + + double t_prompt_processing; // ms + double t_token_generation; // ms + + std::function<void(int)> callback_on_release; + + // Speculative decoding stats + int32_t n_draft_total = 0; // Total draft tokens generated + int32_t n_draft_accepted = 0; // Draft tokens actually accepted + + void reset() { + SLT_DBG(*this, "%s", "\n"); + + n_prompt_tokens = 0; + last_nl_pos = 0; + generated_text = ""; + has_new_line = false; + truncated = false; + stop = STOP_TYPE_NONE; + stopping_word = ""; + n_past = 0; + n_sent_text = 0; + task_type = SERVER_TASK_TYPE_COMPLETION; + + generated_tokens.clear(); + generated_token_probs.clear(); + + // clear speculative decoding stats + n_draft_total = 0; + n_draft_accepted = 0; + } + + bool 
is_non_causal() const { + return task_type == SERVER_TASK_TYPE_EMBEDDING || task_type == SERVER_TASK_TYPE_RERANK; + } + + bool can_batch_with(server_slot & other_slot) const { + return is_non_causal() == other_slot.is_non_causal() + && are_lora_equal(lora, other_slot.lora); + } + + bool has_budget(const common_params & global_params) { + if (params.n_predict == -1 && global_params.n_predict == -1) { + return true; // limitless + } + + n_remaining = -1; + + if (params.n_predict != -1) { + n_remaining = params.n_predict - n_decoded; + } else if (global_params.n_predict != -1) { + n_remaining = global_params.n_predict - n_decoded; + } + + return n_remaining > 0; // no budget + } + + bool is_processing() const { + return state != SLOT_STATE_IDLE; + } + + bool can_speculate() const { + return ctx_dft && params.speculative.n_max > 0 && params.cache_prompt; + } + + void add_token(const completion_token_output & token) { + if (!is_processing()) { + SLT_WRN(*this, "%s", "slot is not processing\n"); + return; + } + generated_token_probs.push_back(token); + } + + void release() { + if (is_processing()) { + SLT_INF(*this, "stop processing: n_past = %d, truncated = %d\n", n_past, truncated); + + t_last_used = ggml_time_us(); + t_token_generation = (ggml_time_us() - t_start_generation) / 1e3; + state = SLOT_STATE_IDLE; + callback_on_release(id); + } + } + + result_timings get_timings() const { + result_timings timings; + timings.prompt_n = n_prompt_tokens_processed; + timings.prompt_ms = t_prompt_processing; + timings.prompt_per_token_ms = t_prompt_processing / n_prompt_tokens_processed; + timings.prompt_per_second = 1e3 / t_prompt_processing * n_prompt_tokens_processed; + + timings.predicted_n = n_decoded; + timings.predicted_ms = t_token_generation; + timings.predicted_per_token_ms = t_token_generation / n_decoded; + timings.predicted_per_second = 1e3 / t_token_generation * n_decoded; + + // Add speculative metrics + if (n_draft_total > 0) { + timings.draft_n = n_draft_total; + timings.draft_n_accepted = n_draft_accepted; + } + + return timings; + } + + size_t find_stopping_strings(const std::string & text, const size_t last_token_size, bool is_full_stop) { + size_t stop_pos = std::string::npos; + + for (const std::string & word : params.antiprompt) { + size_t pos; + + if (is_full_stop) { + const size_t tmp = word.size() + last_token_size; + const size_t from_pos = text.size() > tmp ? 
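+                // a full stop word can only overlap the tail of the text that was just
+                // extended by the last token, so search only the final
+                // (word.size() + last_token_size) characters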
text.size() - tmp : 0; + + pos = text.find(word, from_pos); + } else { + // otherwise, partial stop + pos = find_partial_stop_string(word, text); + } + + if (pos != std::string::npos && (stop_pos == std::string::npos || pos < stop_pos)) { + if (is_full_stop) { + stop = STOP_TYPE_WORD; + stopping_word = word; + has_next_token = false; + } + stop_pos = pos; + } + } + + return stop_pos; + } + + void print_timings() const { + const double t_prompt = t_prompt_processing / n_prompt_tokens_processed; + const double n_prompt_second = 1e3 / t_prompt_processing * n_prompt_tokens_processed; + + const double t_gen = t_token_generation / n_decoded; + const double n_gen_second = 1e3 / t_token_generation * n_decoded; + + SLT_INF(*this, + "\n" + "prompt eval time = %10.2f ms / %5d tokens (%8.2f ms per token, %8.2f tokens per second)\n" + " eval time = %10.2f ms / %5d tokens (%8.2f ms per token, %8.2f tokens per second)\n" + " total time = %10.2f ms / %5d tokens\n", + t_prompt_processing, n_prompt_tokens_processed, t_prompt, n_prompt_second, + t_token_generation, n_decoded, t_gen, n_gen_second, + t_prompt_processing + t_token_generation, n_prompt_tokens_processed + n_decoded); + + if (n_draft_total > 0) { + const float draft_ratio = (float) n_draft_accepted / n_draft_total; + SLT_INF(*this, + "\n" + "draft acceptance rate = %0.5f (%5d accepted / %5d generated)\n", + draft_ratio, n_draft_accepted, n_draft_total + ); + } + } + + json to_json() const { + return json { + {"id", id}, + {"id_task", id_task}, + {"n_ctx", n_ctx}, + {"speculative", can_speculate()}, + {"is_processing", is_processing()}, + {"non_causal", is_non_causal()}, + {"params", params.to_json()}, + {"prompt", common_detokenize(ctx, prompt_tokens)}, + {"next_token", + { + {"has_next_token", has_next_token}, + {"has_new_line", has_new_line}, + {"n_remain", n_remaining}, + {"n_decoded", n_decoded}, + {"stopping_word", stopping_word}, + } + }, + }; + } +}; + +struct server_metrics { + int64_t t_start = 0; + + uint64_t n_prompt_tokens_processed_total = 0; + uint64_t t_prompt_processing_total = 0; + uint64_t n_tokens_predicted_total = 0; + uint64_t t_tokens_generation_total = 0; + + uint64_t n_prompt_tokens_processed = 0; + uint64_t t_prompt_processing = 0; + + uint64_t n_tokens_predicted = 0; + uint64_t t_tokens_generation = 0; + + uint64_t n_decode_total = 0; + uint64_t n_busy_slots_total = 0; + + void init() { + t_start = ggml_time_us(); + } + + void on_prompt_eval(const server_slot & slot) { + n_prompt_tokens_processed_total += slot.n_prompt_tokens_processed; + n_prompt_tokens_processed += slot.n_prompt_tokens_processed; + t_prompt_processing += slot.t_prompt_processing; + t_prompt_processing_total += slot.t_prompt_processing; + } + + void on_prediction(const server_slot & slot) { + n_tokens_predicted_total += slot.n_decoded; + n_tokens_predicted += slot.n_decoded; + t_tokens_generation += slot.t_token_generation; + t_tokens_generation_total += slot.t_token_generation; + } + + void on_decoded(const std::vector<server_slot> & slots) { + n_decode_total++; + for (const auto & slot : slots) { + if (slot.is_processing()) { + n_busy_slots_total++; + } + } + } + + void reset_bucket() { + n_prompt_tokens_processed = 0; + t_prompt_processing = 0; + n_tokens_predicted = 0; + t_tokens_generation = 0; + } +}; + +struct server_queue { + int id = 0; + bool running; + + // queues + std::deque<server_task> queue_tasks; + std::deque<server_task> queue_tasks_deferred; + + std::mutex mutex_tasks; + std::condition_variable condition_tasks; + + // callback 
functions + std::function<void(server_task &&)> callback_new_task; + std::function<void(void)> callback_update_slots; + + // Add a new task to the end of the queue + int post(server_task && task, bool front = false) { + std::unique_lock<std::mutex> lock(mutex_tasks); + GGML_ASSERT(task.id != -1); + // if this is cancel task make sure to clean up pending tasks + if (task.type == SERVER_TASK_TYPE_CANCEL) { + cleanup_pending_task(task.id_target); + } + const int task_id = task.id; + QUE_DBG("new task, id = %d, front = %d\n", task_id, front); + if (front) { + queue_tasks.push_front(std::move(task)); + } else { + queue_tasks.push_back(std::move(task)); + } + condition_tasks.notify_one(); + return task_id; + } + + // multi-task version of post() + int post(std::vector<server_task> && tasks, bool front = false) { + std::unique_lock<std::mutex> lock(mutex_tasks); + for (auto & task : tasks) { + if (task.id == -1) { + task.id = id++; + } + // if this is cancel task make sure to clean up pending tasks + if (task.type == SERVER_TASK_TYPE_CANCEL) { + cleanup_pending_task(task.id_target); + } + QUE_DBG("new task, id = %d/%d, front = %d\n", task.id, (int) tasks.size(), front); + if (front) { + queue_tasks.push_front(std::move(task)); + } else { + queue_tasks.push_back(std::move(task)); + } + } + condition_tasks.notify_one(); + return 0; + } + + // Add a new task, but defer until one slot is available + void defer(server_task && task) { + std::unique_lock<std::mutex> lock(mutex_tasks); + QUE_DBG("defer task, id = %d\n", task.id); + queue_tasks_deferred.push_back(std::move(task)); + condition_tasks.notify_one(); + } + + // Get the next id for creating a new task + int get_new_id() { + std::unique_lock<std::mutex> lock(mutex_tasks); + int new_id = id++; + return new_id; + } + + // Register function to process a new task + void on_new_task(std::function<void(server_task &&)> callback) { + callback_new_task = std::move(callback); + } + + // Register the function to be called when all slots data is ready to be processed + void on_update_slots(std::function<void(void)> callback) { + callback_update_slots = std::move(callback); + } + + // Call when the state of one slot is changed, it will move one task from deferred to main queue + void pop_deferred_task() { + std::unique_lock<std::mutex> lock(mutex_tasks); + if (!queue_tasks_deferred.empty()) { + queue_tasks.emplace_back(std::move(queue_tasks_deferred.front())); + queue_tasks_deferred.pop_front(); + } + condition_tasks.notify_one(); + } + + // end the start_loop routine + void terminate() { + std::unique_lock<std::mutex> lock(mutex_tasks); + running = false; + condition_tasks.notify_all(); + } + + /** + * Main loop consists of these steps: + * - Wait until a new task arrives + * - Process the task (i.e. 
maybe copy data into slot) + * - Check if multitask is finished + * - Update all slots + */ + void start_loop() { + running = true; + + while (true) { + QUE_DBG("%s", "processing new tasks\n"); + + while (true) { + std::unique_lock<std::mutex> lock(mutex_tasks); + if (!running) { + QUE_DBG("%s", "terminate\n"); + return; + } + if (queue_tasks.empty()) { + lock.unlock(); + break; + } + server_task task = std::move(queue_tasks.front()); + queue_tasks.pop_front(); + lock.unlock(); + + QUE_DBG("processing task, id = %d\n", task.id); + callback_new_task(std::move(task)); + } + + // all tasks in the current loop is processed, slots data is now ready + QUE_DBG("%s", "update slots\n"); + + callback_update_slots(); + + QUE_DBG("%s", "waiting for new tasks\n"); + { + std::unique_lock<std::mutex> lock(mutex_tasks); + if (!running) { + QUE_DBG("%s", "terminate\n"); + return; + } + if (queue_tasks.empty()) { + condition_tasks.wait(lock, [&]{ + return (!queue_tasks.empty() || !running); + }); + } + } + } + } + +private: + void cleanup_pending_task(int id_target) { + // no need lock because this is called exclusively by post() + auto rm_func = [id_target](const server_task & task) { + return task.id_target == id_target; + }; + queue_tasks.erase( + std::remove_if(queue_tasks.begin(), queue_tasks.end(), rm_func), + queue_tasks.end()); + queue_tasks_deferred.erase( + std::remove_if(queue_tasks_deferred.begin(), queue_tasks_deferred.end(), rm_func), + queue_tasks_deferred.end()); + } +}; + +struct server_response { + bool running = true; + + // for keeping track of all tasks waiting for the result + std::unordered_set<int> waiting_task_ids; + + // the main result queue (using ptr for polymorphism) + std::vector<server_task_result_ptr> queue_results; + + std::mutex mutex_results; + std::condition_variable condition_results; + + // add the id_task to the list of tasks waiting for response + void add_waiting_task_id(int id_task) { + SRV_DBG("add task %d to waiting list. current waiting = %d (before add)\n", id_task, (int) waiting_task_ids.size()); + + std::unique_lock<std::mutex> lock(mutex_results); + waiting_task_ids.insert(id_task); + } + + void add_waiting_tasks(const std::vector<server_task> & tasks) { + std::unique_lock<std::mutex> lock(mutex_results); + + for (const auto & task : tasks) { + SRV_DBG("add task %d to waiting list. current waiting = %d (before add)\n", task.id, (int) waiting_task_ids.size()); + waiting_task_ids.insert(task.id); + } + } + + // when the request is finished, we can remove task associated with it + void remove_waiting_task_id(int id_task) { + SRV_DBG("remove task %d from waiting list. current waiting = %d (before remove)\n", id_task, (int) waiting_task_ids.size()); + + std::unique_lock<std::mutex> lock(mutex_results); + waiting_task_ids.erase(id_task); + // make sure to clean up all pending results + queue_results.erase( + std::remove_if(queue_results.begin(), queue_results.end(), [id_task](const server_task_result_ptr & res) { + return res->id == id_task; + }), + queue_results.end()); + } + + void remove_waiting_task_ids(const std::unordered_set<int> & id_tasks) { + std::unique_lock<std::mutex> lock(mutex_results); + + for (const auto & id_task : id_tasks) { + SRV_DBG("remove task %d from waiting list. 
current waiting = %d (before remove)\n", id_task, (int) waiting_task_ids.size()); + waiting_task_ids.erase(id_task); + } + } + + // This function blocks the thread until there is a response for one of the id_tasks + server_task_result_ptr recv(const std::unordered_set<int> & id_tasks) { + while (true) { + std::unique_lock<std::mutex> lock(mutex_results); + condition_results.wait(lock, [&]{ + if (!running) { + SRV_DBG("%s : queue result stop\n", __func__); + std::terminate(); // we cannot return here since the caller is HTTP code + } + return !queue_results.empty(); + }); + + for (size_t i = 0; i < queue_results.size(); i++) { + if (id_tasks.find(queue_results[i]->id) != id_tasks.end()) { + server_task_result_ptr res = std::move(queue_results[i]); + queue_results.erase(queue_results.begin() + i); + return res; + } + } + } + + // should never reach here + } + + // same as recv(), but have timeout in seconds + // if timeout is reached, nullptr is returned + server_task_result_ptr recv_with_timeout(const std::unordered_set<int> & id_tasks, int timeout) { + while (true) { + std::unique_lock<std::mutex> lock(mutex_results); + + for (int i = 0; i < (int) queue_results.size(); i++) { + if (id_tasks.find(queue_results[i]->id) != id_tasks.end()) { + server_task_result_ptr res = std::move(queue_results[i]); + queue_results.erase(queue_results.begin() + i); + return res; + } + } + + std::cv_status cr_res = condition_results.wait_for(lock, std::chrono::seconds(timeout)); + if (!running) { + SRV_DBG("%s : queue result stop\n", __func__); + std::terminate(); // we cannot return here since the caller is HTTP code + } + if (cr_res == std::cv_status::timeout) { + return nullptr; + } + } + + // should never reach here + } + + // single-task version of recv() + server_task_result_ptr recv(int id_task) { + std::unordered_set<int> id_tasks = {id_task}; + return recv(id_tasks); + } + + // Send a new result to a waiting id_task + void send(server_task_result_ptr && result) { + SRV_DBG("sending result for task id = %d\n", result->id); + + std::unique_lock<std::mutex> lock(mutex_results); + for (const auto & id_task : waiting_task_ids) { + if (result->id == id_task) { + SRV_DBG("task id = %d pushed to result queue\n", result->id); + + queue_results.emplace_back(std::move(result)); + condition_results.notify_all(); + return; + } + } + } + + // terminate the waiting loop + void terminate() { + running = false; + condition_results.notify_all(); + } +}; + +struct server_context { + common_params params_base; + + // note: keep these alive - they determine the lifetime of the model, context, etc. 
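+    // (the common_init_result members below hold the smart-pointer handles that are
+    //  borrowed via .get() in load_model(), so they must outlive `model` and `ctx`)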
+ common_init_result llama_init; + common_init_result llama_init_dft; + + llama_model * model = nullptr; + llama_context * ctx = nullptr; + + const llama_vocab * vocab = nullptr; + + llama_model * model_dft = nullptr; + + llama_context_params cparams_dft; + + llama_batch batch = {}; + + bool clean_kv_cache = true; + bool add_bos_token = true; + bool has_eos_token = false; + + int32_t n_ctx; // total context for all clients / slots + + // slots / clients + std::vector<server_slot> slots; + json default_generation_settings_for_props; + + server_queue queue_tasks; + server_response queue_results; + + server_metrics metrics; + + // Necessary similarity of prompt for slot selection + float slot_prompt_similarity = 0.0f; + + common_chat_templates_ptr chat_templates; + + ~server_context() { + // Clear any sampling context + for (server_slot & slot : slots) { + common_sampler_free(slot.smpl); + slot.smpl = nullptr; + + llama_free(slot.ctx_dft); + slot.ctx_dft = nullptr; + + common_speculative_free(slot.spec); + slot.spec = nullptr; + + llama_batch_free(slot.batch_spec); + } + + llama_batch_free(batch); + } + + bool load_model(const common_params & params) { + SRV_INF("loading model '%s'\n", params.model.path.c_str()); + + params_base = params; + + llama_init = common_init_from_params(params_base); + + model = llama_init.model.get(); + ctx = llama_init.context.get(); + + if (model == nullptr) { + SRV_ERR("failed to load model, '%s'\n", params_base.model.path.c_str()); + return false; + } + + vocab = llama_model_get_vocab(model); + + n_ctx = llama_n_ctx(ctx); + + add_bos_token = llama_vocab_get_add_bos(vocab); + has_eos_token = llama_vocab_eos(vocab) != LLAMA_TOKEN_NULL; + + if (!params_base.speculative.model.path.empty() || !params_base.speculative.model.hf_repo.empty()) { + SRV_INF("loading draft model '%s'\n", params_base.speculative.model.path.c_str()); + + auto params_dft = params_base; + + params_dft.devices = params_base.speculative.devices; + params_dft.model = params_base.speculative.model; + params_dft.n_ctx = params_base.speculative.n_ctx == 0 ? 
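+            // if no draft context size was given, give the draft model the same per-slot
+            // context as the target (n_ctx / n_parallel); otherwise use the requested size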
params_base.n_ctx / params_base.n_parallel : params_base.speculative.n_ctx; + params_dft.n_gpu_layers = params_base.speculative.n_gpu_layers; + params_dft.n_parallel = 1; + + // force F16 KV cache for the draft model for extra performance + params_dft.cache_type_k = GGML_TYPE_F16; + params_dft.cache_type_v = GGML_TYPE_F16; + + llama_init_dft = common_init_from_params(params_dft); + + model_dft = llama_init_dft.model.get(); + + if (model_dft == nullptr) { + SRV_ERR("failed to load draft model, '%s'\n", params_base.speculative.model.path.c_str()); + return false; + } + + if (!common_speculative_are_compatible(ctx, llama_init_dft.context.get())) { + SRV_ERR("the draft model '%s' is not compatible with the target model '%s'\n", params_base.speculative.model.path.c_str(), params_base.model.path.c_str()); + + return false; + } + + const int n_ctx_dft = llama_n_ctx(llama_init_dft.context.get()); + + cparams_dft = common_context_params_to_llama(params_dft); + cparams_dft.n_batch = n_ctx_dft; + + // the context is not needed - we will create one for each slot + llama_init_dft.context.reset(); + } + + chat_templates = common_chat_templates_init(model, params_base.chat_template); + try { + common_chat_format_example(chat_templates.get(), params.use_jinja); + } catch (const std::exception & e) { + SRV_WRN("%s: Chat template parsing error: %s\n", __func__, e.what()); + SRV_WRN("%s: The chat template that comes with this model is not yet supported, falling back to chatml. This may cause the model to output suboptimal responses\n", __func__); + chat_templates = common_chat_templates_init(model, "chatml"); + } + + return true; + } + + void init() { + const int32_t n_ctx_slot = n_ctx / params_base.n_parallel; + + SRV_INF("initializing slots, n_slots = %d\n", params_base.n_parallel); + + for (int i = 0; i < params_base.n_parallel; i++) { + server_slot slot; + + slot.id = i; + slot.ctx = ctx; + slot.n_ctx = n_ctx_slot; + slot.n_predict = params_base.n_predict; + + if (model_dft) { + slot.batch_spec = llama_batch_init(params_base.speculative.n_max + 1, 0, 1); + + slot.ctx_dft = llama_init_from_model(model_dft, cparams_dft); + if (slot.ctx_dft == nullptr) { + SRV_ERR("%s", "failed to create draft context\n"); + return; + } + + slot.spec = common_speculative_init(slot.ctx_dft); + if (slot.spec == nullptr) { + SRV_ERR("%s", "failed to create speculator\n"); + return; + } + } + + SLT_INF(slot, "new slot n_ctx_slot = %d\n", slot.n_ctx); + + slot.params.sampling = params_base.sampling; + + slot.callback_on_release = [this](int) { + queue_tasks.pop_deferred_task(); + }; + + slot.reset(); + + slots.push_back(std::move(slot)); + } + + default_generation_settings_for_props = slots[0].to_json(); + + // the update_slots() logic will always submit a maximum of n_batch or n_parallel tokens + // note that n_batch can be > n_ctx (e.g. 
for non-causal attention models such as BERT where the KV cache is not used) + { + const int32_t n_batch = llama_n_batch(ctx); + + // only a single seq_id per token is needed + batch = llama_batch_init(std::max(n_batch, params_base.n_parallel), 0, 1); + } + + metrics.init(); + } + + server_slot * get_slot_by_id(int id) { + for (server_slot & slot : slots) { + if (slot.id == id) { + return &slot; + } + } + + return nullptr; + } + + server_slot * get_available_slot(const server_task & task) { + server_slot * ret = nullptr; + + // find the slot that has at least n% prompt similarity + if (ret == nullptr && slot_prompt_similarity != 0.0f) { + int lcs_len = 0; + float similarity = 0; + + for (server_slot & slot : slots) { + // skip the slot if it is not available + if (slot.is_processing()) { + continue; + } + + // skip the slot if it does not contains cached tokens + if (slot.cache_tokens.empty()) { + continue; + } + + // length of the Longest Common Subsequence between the current slot's prompt and the input prompt + int cur_lcs_len = common_lcs(slot.cache_tokens, task.prompt_tokens); + + // fraction of the common subsequence length compared to the current slot's prompt length + float cur_similarity = static_cast<float>(cur_lcs_len) / static_cast<int>(slot.cache_tokens.size()); + + // select the current slot if the criteria match + if (cur_lcs_len > lcs_len && cur_similarity > slot_prompt_similarity) { + lcs_len = cur_lcs_len; + similarity = cur_similarity; + ret = &slot; + } + } + + if (ret != nullptr) { + SLT_DBG(*ret, "selected slot by lcs similarity, lcs_len = %d, similarity = %f\n", lcs_len, similarity); + } + } + + // find the slot that has been least recently used + if (ret == nullptr) { + int64_t t_last = ggml_time_us(); + for (server_slot & slot : slots) { + // skip the slot if it is not available + if (slot.is_processing()) { + continue; + } + + // select the current slot if the criteria match + if (slot.t_last_used < t_last) { + t_last = slot.t_last_used; + ret = &slot; + } + } + + if (ret != nullptr) { + SLT_DBG(*ret, "selected slot by lru, t_last = %" PRId64 "\n", t_last); + } + } + + return ret; + } + + bool can_be_detokenized(const struct llama_context * ctx, const std::vector<llama_token> & tokens) { + const llama_model * model = llama_get_model(ctx); + const llama_vocab * vocab = llama_model_get_vocab(model); + const int32_t n_vocab = llama_vocab_n_tokens(vocab); + for (const auto & token : tokens) { + if (token < 0 || token >= n_vocab) { + return false; + } + } + return true; + } + + bool launch_slot_with_task(server_slot & slot, server_task && task) { + slot.reset(); + slot.id_task = task.id; + slot.index = task.index; + slot.task_type = task.type; + slot.params = std::move(task.params); + slot.prompt_tokens = std::move(task.prompt_tokens); + + if (!are_lora_equal(slot.params.lora, slot.lora)) { + // if lora is changed, we cannot reuse cached tokens + slot.cache_tokens.clear(); + slot.lora = slot.params.lora; + } + + bool can_detokenize = can_be_detokenized(ctx, slot.prompt_tokens); + if (!can_detokenize) { + send_error(task, "Prompt contains invalid tokens", ERROR_TYPE_INVALID_REQUEST); + return false; + } + SLT_DBG(slot, "launching slot : %s\n", safe_json_to_str(slot.to_json()).c_str()); + + if (slot.n_predict > 0 && slot.params.n_predict > slot.n_predict) { + // Might be better to reject the request with a 400 ? 
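+            // for now the request is accepted and n_predict is clamped to the server
+            // limit instead of being rejected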
+ SLT_WRN(slot, "n_predict = %d exceeds server configuration, setting to %d\n", slot.params.n_predict, slot.n_predict); + slot.params.n_predict = slot.n_predict; + } + + if (slot.params.ignore_eos && has_eos_token) { + slot.params.sampling.logit_bias.push_back({llama_vocab_eos(vocab), -INFINITY}); + } + + { + if (slot.smpl != nullptr) { + common_sampler_free(slot.smpl); + } + + slot.smpl = common_sampler_init(model, slot.params.sampling); + if (slot.smpl == nullptr) { + // for now, the only error that may happen here is invalid grammar + send_error(task, "Failed to parse grammar", ERROR_TYPE_INVALID_REQUEST); + return false; + } + } + + if (slot.ctx_dft) { + llama_batch_free(slot.batch_spec); + + slot.batch_spec = llama_batch_init(slot.params.speculative.n_max + 1, 0, 1); + } + + slot.state = SLOT_STATE_STARTED; + + SLT_INF(slot, "%s", "processing task\n"); + + return true; + } + + void kv_cache_clear() { + SRV_DBG("%s", "clearing KV cache\n"); + + // clear the entire KV cache + llama_kv_self_clear(ctx); + clean_kv_cache = false; + } + + bool process_token(completion_token_output & result, server_slot & slot) { + // remember which tokens were sampled - used for repetition penalties during sampling + const std::string token_str = result.text_to_send; + slot.sampled = result.tok; + + slot.generated_text += token_str; + if (slot.params.return_tokens) { + slot.generated_tokens.push_back(result.tok); + } + slot.has_next_token = true; + + // check if there is incomplete UTF-8 character at the end + bool incomplete = validate_utf8(slot.generated_text) < slot.generated_text.size(); + + // search stop word and delete it + if (!incomplete) { + size_t pos = std::min(slot.n_sent_text, slot.generated_text.size()); + + const std::string str_test = slot.generated_text.substr(pos); + bool send_text = true; + + size_t stop_pos = slot.find_stopping_strings(str_test, token_str.size(), true); + if (stop_pos != std::string::npos) { + slot.generated_text.erase( + slot.generated_text.begin() + pos + stop_pos, + slot.generated_text.end()); + pos = std::min(slot.n_sent_text, slot.generated_text.size()); + } else if (slot.has_next_token) { + stop_pos = slot.find_stopping_strings(str_test, token_str.size(), false); + send_text = stop_pos == std::string::npos; + } + + // check if there is any token to predict + if (send_text) { + // no send the stop word in the response + result.text_to_send = slot.generated_text.substr(pos, std::string::npos); + slot.n_sent_text += result.text_to_send.size(); + // add the token to slot queue and cache + } else { + result.text_to_send = ""; + } + + slot.add_token(result); + if (slot.params.stream) { + send_partial_response(slot, result); + } + } + + if (incomplete) { + slot.has_next_token = true; + } + + // check the limits + if (slot.n_decoded > 0 && slot.has_next_token && !slot.has_budget(params_base)) { + slot.stop = STOP_TYPE_LIMIT; + slot.has_next_token = false; + + SLT_DBG(slot, "stopped by limit, n_decoded = %d, n_predict = %d\n", slot.n_decoded, slot.params.n_predict); + } + + if (slot.has_new_line) { + // require that each new line has a whitespace prefix (i.e. 
indentation) of at least slot.params.n_indent + if (slot.params.n_indent > 0) { + // check the current indentation + // TODO: improve by not doing it more than once for each new line + if (slot.last_nl_pos > 0) { + size_t pos = slot.last_nl_pos; + + int n_indent = 0; + while (pos < slot.generated_text.size() && (slot.generated_text[pos] == ' ' || slot.generated_text[pos] == '\t')) { + n_indent++; + pos++; + } + + if (pos < slot.generated_text.size() && n_indent < slot.params.n_indent) { + slot.stop = STOP_TYPE_LIMIT; + slot.has_next_token = false; + + // cut the last line + slot.generated_text.erase(pos, std::string::npos); + + SLT_DBG(slot, "stopped by indentation limit, n_decoded = %d, n_indent = %d\n", slot.n_decoded, n_indent); + } + } + + // find the next new line + { + const size_t pos = slot.generated_text.find('\n', slot.last_nl_pos); + + if (pos != std::string::npos) { + slot.last_nl_pos = pos + 1; + } + } + } + } + + // check if there is a new line in the generated text + if (result.text_to_send.find('\n') != std::string::npos) { + slot.has_new_line = true; + + // if we have seen a new line, we stop after a certain time limit, but only upon another new line + if (slot.params.t_max_predict_ms > 0 && (ggml_time_us() - slot.t_start_generation > 1000.0f*slot.params.t_max_predict_ms)) { + slot.stop = STOP_TYPE_LIMIT; + slot.has_next_token = false; + + SLT_DBG(slot, "stopped by time limit, n_decoded = %d, t_max_predict_ms = %d ms\n", slot.n_decoded, (int) slot.params.t_max_predict_ms); + } + } + + // if context shift is disabled, we stop when it reaches the context limit + if (slot.n_past >= slot.n_ctx) { + slot.truncated = true; + slot.stop = STOP_TYPE_LIMIT; + slot.has_next_token = false; + + SLT_DBG(slot, "stopped due to running out of context capacity, n_past = %d, n_prompt_tokens = %d, n_decoded = %d, n_ctx = %d\n", + slot.n_decoded, slot.n_prompt_tokens, slot.n_past, slot.n_ctx); + } + + if (llama_vocab_is_eog(vocab, result.tok)) { + slot.stop = STOP_TYPE_EOS; + slot.has_next_token = false; + + SLT_DBG(slot, "%s", "stopped by EOS\n"); + } + + const auto n_ctx_train = llama_model_n_ctx_train(model); + + if (slot.params.n_predict < 1 && slot.n_predict < 1 && slot.n_prompt_tokens + slot.n_decoded >= n_ctx_train) { + slot.truncated = true; + slot.stop = STOP_TYPE_LIMIT; + slot.has_next_token = false; // stop prediction + + SLT_WRN(slot, + "n_predict (%d) is set for infinite generation. 
" + "Limiting generated tokens to n_ctx_train (%d) to avoid EOS-less generation infinite loop\n", + slot.params.n_predict, n_ctx_train); + } + + SLT_DBG(slot, "n_decoded = %d, n_remaining = %d, next token: %5d '%s'\n", slot.n_decoded, slot.n_remaining, result.tok, token_str.c_str()); + + return slot.has_next_token; // continue + } + + void populate_token_probs(const server_slot & slot, completion_token_output & result, bool post_sampling, bool special, int idx) { + size_t n_probs = slot.params.sampling.n_probs; + size_t n_vocab = llama_vocab_n_tokens(vocab); + if (post_sampling) { + const auto * cur_p = common_sampler_get_candidates(slot.smpl); + const size_t max_probs = cur_p->size; + + // set probability for sampled token + for (size_t i = 0; i < max_probs; i++) { + if (cur_p->data[i].id == result.tok) { + result.prob = cur_p->data[i].p; + break; + } + } + + // set probability for top n_probs tokens + result.probs.reserve(max_probs); + for (size_t i = 0; i < std::min(max_probs, n_probs); i++) { + result.probs.push_back({ + cur_p->data[i].id, + common_token_to_piece(ctx, cur_p->data[i].id, special), + cur_p->data[i].p + }); + } + } else { + // TODO: optimize this with min-p optimization + std::vector<llama_token_data> cur = get_token_probabilities(ctx, idx); + + // set probability for sampled token + for (size_t i = 0; i < n_vocab; i++) { + // set probability for sampled token + if (cur[i].id == result.tok) { + result.prob = cur[i].p; + break; + } + } + + // set probability for top n_probs tokens + result.probs.reserve(n_probs); + for (size_t i = 0; i < std::min(n_vocab, n_probs); i++) { + result.probs.push_back({ + cur[i].id, + common_token_to_piece(ctx, cur[i].id, special), + cur[i].p + }); + } + } + } + + void send_error(const server_task & task, const std::string & error, const enum error_type type = ERROR_TYPE_SERVER) { + send_error(task.id, error, type); + } + + void send_error(const server_slot & slot, const std::string & error, const enum error_type type = ERROR_TYPE_SERVER) { + send_error(slot.id_task, error, type); + } + + void send_error(const int id_task, const std::string & error, const enum error_type type = ERROR_TYPE_SERVER) { + SRV_ERR("task id = %d, error: %s\n", id_task, error.c_str()); + + auto res = std::make_unique<server_task_result_error>(); + res->id = id_task; + res->err_type = type; + res->err_msg = error; + + queue_results.send(std::move(res)); + } + + void send_partial_response(server_slot & slot, const completion_token_output & tkn) { + auto res = std::make_unique<server_task_result_cmpl_partial>(); + + res->id = slot.id_task; + res->index = slot.index; + res->content = tkn.text_to_send; + res->tokens = { tkn.tok }; + + res->n_decoded = slot.n_decoded; + res->n_prompt_tokens = slot.n_prompt_tokens; + res->post_sampling_probs = slot.params.post_sampling_probs; + + res->verbose = slot.params.verbose; + res->oaicompat = slot.params.oaicompat; + res->oaicompat_model = slot.params.oaicompat_model; + res->oaicompat_cmpl_id = slot.params.oaicompat_cmpl_id; + + // populate res.probs_output + if (slot.params.sampling.n_probs > 0) { + res->prob_output = tkn; // copy the token probs + } + + // populate timings if this is final response or timings_per_token is enabled + if (slot.stop != STOP_TYPE_NONE || slot.params.timings_per_token) { + res->timings = slot.get_timings(); + } + + queue_results.send(std::move(res)); + } + + void send_final_response(server_slot & slot) { + auto res = std::make_unique<server_task_result_cmpl_final>(); + res->id = slot.id_task; + 
res->id_slot = slot.id; + + res->index = slot.index; + res->content = std::move(slot.generated_text); + res->tokens = std::move(slot.generated_tokens); + res->timings = slot.get_timings(); + res->prompt = common_detokenize(ctx, slot.prompt_tokens, true); + res->response_fields = std::move(slot.params.response_fields); + + res->truncated = slot.truncated; + res->n_decoded = slot.n_decoded; + res->n_prompt_tokens = slot.n_prompt_tokens; + res->n_tokens_cached = slot.n_past; + res->has_new_line = slot.has_new_line; + res->stopping_word = slot.stopping_word; + res->stop = slot.stop; + res->post_sampling_probs = slot.params.post_sampling_probs; + + res->verbose = slot.params.verbose; + res->stream = slot.params.stream; + res->oaicompat = slot.params.oaicompat; + res->oaicompat_model = slot.params.oaicompat_model; + res->oaicompat_cmpl_id = slot.params.oaicompat_cmpl_id; + res->oaicompat_chat_format = slot.params.oaicompat_chat_format; + // populate res.probs_output + if (slot.params.sampling.n_probs > 0) { + if (!slot.params.stream && slot.stop == STOP_TYPE_WORD) { + const llama_tokens stop_word_toks = common_tokenize(ctx, slot.stopping_word, false); + + size_t safe_offset = std::min(slot.generated_token_probs.size(), stop_word_toks.size()); + res->probs_output = std::vector<completion_token_output>( + slot.generated_token_probs.begin(), + slot.generated_token_probs.end() - safe_offset); + } else { + res->probs_output = std::vector<completion_token_output>( + slot.generated_token_probs.begin(), + slot.generated_token_probs.end()); + } + } + + res->generation_params = slot.params; // copy the parameters + + queue_results.send(std::move(res)); + } + + void send_embedding(const server_slot & slot, const llama_batch & batch) { + auto res = std::make_unique<server_task_result_embd>(); + res->id = slot.id_task; + res->index = slot.index; + res->n_tokens = slot.n_prompt_tokens; + res->oaicompat = slot.params.oaicompat; + + const int n_embd = llama_model_n_embd(model); + + std::vector<float> embd_res(n_embd, 0.0f); + + for (int i = 0; i < batch.n_tokens; ++i) { + if (!batch.logits[i] || batch.seq_id[i][0] != slot.id) { + continue; + } + + const float * embd = llama_get_embeddings_seq(ctx, batch.seq_id[i][0]); + if (embd == NULL) { + embd = llama_get_embeddings_ith(ctx, i); + } + + if (embd == NULL) { + SLT_ERR(slot, "failed to get embeddings, token = %d, seq_id = %d\n", batch.token[i], batch.seq_id[i][0]); + + res->embedding.push_back(std::vector<float>(n_embd, 0.0f)); + continue; + } + + // normalize only when there is pooling + // TODO: configurable + if (llama_pooling_type(slot.ctx) != LLAMA_POOLING_TYPE_NONE) { + common_embd_normalize(embd, embd_res.data(), n_embd, 2); + res->embedding.push_back(embd_res); + } else { + res->embedding.push_back({ embd, embd + n_embd }); + } + } + + SLT_DBG(slot, "%s", "sending embeddings\n"); + + queue_results.send(std::move(res)); + } + + void send_rerank(const server_slot & slot, const llama_batch & batch) { + auto res = std::make_unique<server_task_result_rerank>(); + res->id = slot.id_task; + res->index = slot.index; + res->n_tokens = slot.n_prompt_tokens; + + for (int i = 0; i < batch.n_tokens; ++i) { + if (!batch.logits[i] || batch.seq_id[i][0] != slot.id) { + continue; + } + + const float * embd = llama_get_embeddings_seq(ctx, batch.seq_id[i][0]); + if (embd == NULL) { + embd = llama_get_embeddings_ith(ctx, i); + } + + if (embd == NULL) { + SLT_ERR(slot, "failed to get embeddings, token = %d, seq_id = %d\n", batch.token[i], batch.seq_id[i][0]); + + res->score 
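+                // no embedding could be extracted for this sequence, so fall back to the
+                // very low sentinel score (-1e6) for this result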
= -1e6; + continue; + } + + res->score = embd[0]; + } + + SLT_DBG(slot, "sending rerank result, res.score = %f\n", res->score); + + queue_results.send(std::move(res)); + } + + // + // Functions to create new task(s) and receive result(s) + // + + void cancel_tasks(const std::unordered_set<int> & id_tasks) { + std::vector<server_task> cancel_tasks; + cancel_tasks.reserve(id_tasks.size()); + for (const auto & id_task : id_tasks) { + SRV_WRN("cancel task, id_task = %d\n", id_task); + + server_task task(SERVER_TASK_TYPE_CANCEL); + task.id_target = id_task; + queue_results.remove_waiting_task_id(id_task); + cancel_tasks.push_back(std::move(task)); + } + // push to beginning of the queue, so it has highest priority + queue_tasks.post(std::move(cancel_tasks), true); + } + + // receive the results from task(s) + void receive_multi_results( + const std::unordered_set<int> & id_tasks, + const std::function<void(std::vector<server_task_result_ptr>&)> & result_handler, + const std::function<void(json)> & error_handler, + const std::function<bool()> & is_connection_closed) { + std::vector<server_task_result_ptr> results(id_tasks.size()); + for (int i = 0; i < (int)id_tasks.size(); i++) { + server_task_result_ptr result = queue_results.recv_with_timeout(id_tasks, HTTP_POLLING_SECONDS); + + if (is_connection_closed()) { + cancel_tasks(id_tasks); + return; + } + + if (result == nullptr) { + i--; // retry + continue; + } + + if (result->is_error()) { + error_handler(result->to_json()); + cancel_tasks(id_tasks); + return; + } + + GGML_ASSERT( + dynamic_cast<server_task_result_cmpl_final*>(result.get()) != nullptr + || dynamic_cast<server_task_result_embd*>(result.get()) != nullptr + || dynamic_cast<server_task_result_rerank*>(result.get()) != nullptr + ); + const size_t idx = result->get_index(); + GGML_ASSERT(idx < results.size() && "index out of range"); + results[idx] = std::move(result); + } + result_handler(results); + } + + // receive the results from task(s), in stream mode + void receive_cmpl_results_stream( + const std::unordered_set<int> & id_tasks, + const std::function<bool(server_task_result_ptr&)> & result_handler, + const std::function<void(json)> & error_handler, + const std::function<bool()> & is_connection_closed) { + size_t n_finished = 0; + while (true) { + server_task_result_ptr result = queue_results.recv_with_timeout(id_tasks, HTTP_POLLING_SECONDS); + + if (is_connection_closed()) { + cancel_tasks(id_tasks); + return; + } + + if (result == nullptr) { + continue; // retry + } + + if (result->is_error()) { + error_handler(result->to_json()); + cancel_tasks(id_tasks); + return; + } + + GGML_ASSERT( + dynamic_cast<server_task_result_cmpl_partial*>(result.get()) != nullptr + || dynamic_cast<server_task_result_cmpl_final*>(result.get()) != nullptr + ); + if (!result_handler(result)) { + cancel_tasks(id_tasks); + break; + } + + if (result->is_stop()) { + if (++n_finished == id_tasks.size()) { + break; + } + } + } + } + + // + // Functions to process the task + // + + void process_single_task(server_task && task) { + switch (task.type) { + case SERVER_TASK_TYPE_COMPLETION: + case SERVER_TASK_TYPE_INFILL: + case SERVER_TASK_TYPE_EMBEDDING: + case SERVER_TASK_TYPE_RERANK: + { + const int id_slot = task.id_selected_slot; + + server_slot * slot = id_slot != -1 ? 
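+                    // an explicit slot id was requested -> look it up directly; otherwise
+                    // pick a free slot by prompt similarity (longest common subsequence
+                    // against the cached tokens) or, failing that, the least recently used one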
get_slot_by_id(id_slot) : get_available_slot(task); + + if (slot == nullptr) { + // if no slot is available, we defer this task for processing later + SRV_DBG("no slot is available, defer task, id_task = %d\n", task.id); + queue_tasks.defer(std::move(task)); + break; + } + if (slot->is_processing()) { + // if requested slot is unavailable, we defer this task for processing later + SRV_DBG("requested slot is unavailable, defer task, id_task = %d\n", task.id); + queue_tasks.defer(std::move(task)); + break; + } + + if (!launch_slot_with_task(*slot, std::move(task))) { + SRV_ERR("failed to launch slot with task, id_task = %d\n", task.id); + break; + } + } break; + case SERVER_TASK_TYPE_CANCEL: + { + // release slot linked with the task id + for (auto & slot : slots) { + if (slot.id_task == task.id_target) { + slot.release(); + break; + } + } + } break; + case SERVER_TASK_TYPE_NEXT_RESPONSE: + { + // do nothing + } break; + case SERVER_TASK_TYPE_METRICS: + { + json slots_data = json::array(); + + int n_idle_slots = 0; + int n_processing_slots = 0; + + for (server_slot & slot : slots) { + json slot_data = slot.to_json(); + + if (slot.is_processing()) { + n_processing_slots++; + } else { + n_idle_slots++; + } + + slots_data.push_back(slot_data); + } + SRV_DBG("n_idle_slots = %d, n_processing_slots = %d\n", n_idle_slots, n_processing_slots); + + auto res = std::make_unique<server_task_result_metrics>(); + res->id = task.id; + res->slots_data = std::move(slots_data); + res->n_idle_slots = n_idle_slots; + res->n_processing_slots = n_processing_slots; + res->n_tasks_deferred = queue_tasks.queue_tasks_deferred.size(); + res->t_start = metrics.t_start; + + res->kv_cache_tokens_count = llama_kv_self_n_tokens(ctx); + res->kv_cache_used_cells = llama_kv_self_used_cells(ctx); + + res->n_prompt_tokens_processed_total = metrics.n_prompt_tokens_processed_total; + res->t_prompt_processing_total = metrics.t_prompt_processing_total; + res->n_tokens_predicted_total = metrics.n_tokens_predicted_total; + res->t_tokens_generation_total = metrics.t_tokens_generation_total; + + res->n_prompt_tokens_processed = metrics.n_prompt_tokens_processed; + res->t_prompt_processing = metrics.t_prompt_processing; + res->n_tokens_predicted = metrics.n_tokens_predicted; + res->t_tokens_generation = metrics.t_tokens_generation; + + res->n_decode_total = metrics.n_decode_total; + res->n_busy_slots_total = metrics.n_busy_slots_total; + + if (task.metrics_reset_bucket) { + metrics.reset_bucket(); + } + queue_results.send(std::move(res)); + } break; + case SERVER_TASK_TYPE_SLOT_SAVE: + { + int id_slot = task.slot_action.slot_id; + server_slot * slot = get_slot_by_id(id_slot); + if (slot == nullptr) { + send_error(task, "Invalid slot ID", ERROR_TYPE_INVALID_REQUEST); + break; + } + if (slot->is_processing()) { + // if requested slot is unavailable, we defer this task for processing later + SRV_DBG("requested slot is unavailable, defer task, id_task = %d\n", task.id); + queue_tasks.defer(std::move(task)); + break; + } + + const size_t token_count = slot->cache_tokens.size(); + const int64_t t_start = ggml_time_us(); + + std::string filename = task.slot_action.filename; + std::string filepath = task.slot_action.filepath; + + const size_t nwrite = llama_state_seq_save_file(ctx, filepath.c_str(), slot->id, slot->cache_tokens.data(), token_count); + + const int64_t t_end = ggml_time_us(); + const double t_save_ms = (t_end - t_start) / 1000.0; + + auto res = std::make_unique<server_task_result_slot_save_load>(); + res->id = task.id; + 
res->id_slot = id_slot; + res->filename = filename; + res->is_save = true; + res->n_tokens = token_count; + res->n_bytes = nwrite; + res->t_ms = t_save_ms; + queue_results.send(std::move(res)); + } break; + case SERVER_TASK_TYPE_SLOT_RESTORE: + { + int id_slot = task.slot_action.slot_id; + server_slot * slot = get_slot_by_id(id_slot); + if (slot == nullptr) { + send_error(task, "Invalid slot ID", ERROR_TYPE_INVALID_REQUEST); + break; + } + if (slot->is_processing()) { + // if requested slot is unavailable, we defer this task for processing later + SRV_DBG("requested slot is unavailable, defer task, id_task = %d\n", task.id); + queue_tasks.defer(std::move(task)); + break; + } + + const int64_t t_start = ggml_time_us(); + + std::string filename = task.slot_action.filename; + std::string filepath = task.slot_action.filepath; + + slot->cache_tokens.resize(slot->n_ctx); + size_t token_count = 0; + size_t nread = llama_state_seq_load_file(ctx, filepath.c_str(), slot->id, slot->cache_tokens.data(), slot->cache_tokens.size(), &token_count); + if (nread == 0) { + slot->cache_tokens.resize(0); + send_error(task, "Unable to restore slot, no available space in KV cache or invalid slot save file", ERROR_TYPE_INVALID_REQUEST); + break; + } + slot->cache_tokens.resize(token_count); + + const int64_t t_end = ggml_time_us(); + const double t_restore_ms = (t_end - t_start) / 1000.0; + + auto res = std::make_unique<server_task_result_slot_save_load>(); + res->id = task.id; + res->id_slot = id_slot; + res->filename = filename; + res->is_save = false; + res->n_tokens = token_count; + res->n_bytes = nread; + res->t_ms = t_restore_ms; + queue_results.send(std::move(res)); + } break; + case SERVER_TASK_TYPE_SLOT_ERASE: + { + int id_slot = task.slot_action.slot_id; + server_slot * slot = get_slot_by_id(id_slot); + if (slot == nullptr) { + send_error(task, "Invalid slot ID", ERROR_TYPE_INVALID_REQUEST); + break; + } + if (slot->is_processing()) { + // if requested slot is unavailable, we defer this task for processing later + SRV_DBG("requested slot is unavailable, defer task, id_task = %d\n", task.id); + queue_tasks.defer(std::move(task)); + break; + } + + // Erase token cache + const size_t n_erased = slot->cache_tokens.size(); + llama_kv_self_seq_rm(ctx, slot->id, -1, -1); + slot->cache_tokens.clear(); + + auto res = std::make_unique<server_task_result_slot_erase>(); + res->id = task.id; + res->id_slot = id_slot; + res->n_erased = n_erased; + queue_results.send(std::move(res)); + } break; + case SERVER_TASK_TYPE_SET_LORA: + { + params_base.lora_adapters = std::move(task.set_lora); + auto res = std::make_unique<server_task_result_apply_lora>(); + res->id = task.id; + queue_results.send(std::move(res)); + } break; + } + } + + void update_slots() { + // check if all slots are idle + { + bool all_idle = true; + + for (auto & slot : slots) { + if (slot.is_processing()) { + all_idle = false; + break; + } + } + + if (all_idle) { + SRV_INF("%s", "all slots are idle\n"); + if (clean_kv_cache) { + kv_cache_clear(); + } + + return; + } + } + + { + SRV_DBG("%s", "posting NEXT_RESPONSE\n"); + + server_task task(SERVER_TASK_TYPE_NEXT_RESPONSE); + task.id = queue_tasks.get_new_id(); + queue_tasks.post(std::move(task)); + } + + // apply context-shift if needed + // TODO: simplify and improve + for (server_slot & slot : slots) { + if (slot.is_processing() && slot.n_past + 1 >= slot.n_ctx) { + if (!params_base.ctx_shift) { + // this check is redundant (for good) + // we should never get here, because generation should already 
stopped in process_token() + slot.release(); + send_error(slot, "context shift is disabled", ERROR_TYPE_SERVER); + continue; + } + + // Shift context + const int n_keep = slot.params.n_keep + add_bos_token; + const int n_left = slot.n_past - n_keep; + const int n_discard = slot.params.n_discard ? slot.params.n_discard : (n_left / 2); + + SLT_WRN(slot, "slot context shift, n_keep = %d, n_left = %d, n_discard = %d\n", n_keep, n_left, n_discard); + + llama_kv_self_seq_rm (ctx, slot.id, n_keep , n_keep + n_discard); + llama_kv_self_seq_add(ctx, slot.id, n_keep + n_discard, slot.n_past, -n_discard); + + if (slot.params.cache_prompt) { + for (size_t i = n_keep + n_discard; i < slot.cache_tokens.size(); i++) { + slot.cache_tokens[i - n_discard] = slot.cache_tokens[i]; + } + + slot.cache_tokens.resize(slot.cache_tokens.size() - n_discard); + } + + slot.n_past -= n_discard; + + slot.truncated = true; + } + } + + // start populating the batch for this iteration + common_batch_clear(batch); + + // track if given slot can be batched with slots already in the batch + server_slot * slot_batched = nullptr; + + auto accept_special_token = [&](server_slot & slot, llama_token token) { + return params_base.special || slot.params.sampling.preserved_tokens.find(token) != slot.params.sampling.preserved_tokens.end(); + }; + + // frist, add sampled tokens from any ongoing sequences + for (auto & slot : slots) { + if (slot.state != SLOT_STATE_GENERATING) { + continue; + } + + // check if we can batch this slot with the previous one + if (!slot_batched) { + slot_batched = &slot; + } else if (!slot_batched->can_batch_with(slot)) { + continue; + } + + slot.i_batch = batch.n_tokens; + + common_batch_add(batch, slot.sampled, slot.n_past, { slot.id }, true); + + slot.n_past += 1; + + if (slot.params.cache_prompt) { + slot.cache_tokens.push_back(slot.sampled); + } + + SLT_DBG(slot, "slot decode token, n_ctx = %d, n_past = %d, n_cache_tokens = %d, truncated = %d\n", + slot.n_ctx, slot.n_past, (int) slot.cache_tokens.size(), slot.truncated); + } + + // process in chunks of params.n_batch + int32_t n_batch = llama_n_batch(ctx); + int32_t n_ubatch = llama_n_ubatch(ctx); + + // next, batch any pending prompts without exceeding n_batch + if (params_base.cont_batching || batch.n_tokens == 0) { + for (auto & slot : slots) { + // check if we can batch this slot with the previous one + if (slot.is_processing()) { + if (!slot_batched) { + slot_batched = &slot; + } else if (!slot_batched->can_batch_with(slot)) { + continue; + } + } + + // this slot still has a prompt to be processed + if (slot.state == SLOT_STATE_PROCESSING_PROMPT || slot.state == SLOT_STATE_STARTED) { + auto & prompt_tokens = slot.prompt_tokens; + + // TODO: maybe move branch to outside of this loop in the future + if (slot.state == SLOT_STATE_STARTED) { + slot.t_start_process_prompt = ggml_time_us(); + slot.t_start_generation = 0; + + slot.n_past = 0; + slot.n_prompt_tokens = prompt_tokens.size(); + slot.state = SLOT_STATE_PROCESSING_PROMPT; + + SLT_INF(slot, "new prompt, n_ctx_slot = %d, n_keep = %d, n_prompt_tokens = %d\n", slot.n_ctx, slot.params.n_keep, slot.n_prompt_tokens); + + // print prompt tokens (for debugging) + if (1) { + // first 16 tokens (avoid flooding logs) + for (int i = 0; i < std::min<int>(16, prompt_tokens.size()); i++) { + SLT_DBG(slot, "prompt token %3d: %6d '%s'\n", i, prompt_tokens[i], common_token_to_piece(ctx, prompt_tokens[i]).c_str()); + } + } else { + // all + for (int i = 0; i < (int) prompt_tokens.size(); i++) { + SLT_DBG(slot, 
"prompt token %3d: %6d '%s'\n", i, prompt_tokens[i], common_token_to_piece(ctx, prompt_tokens[i]).c_str()); + } + } + + // empty prompt passed -> release the slot and send empty response + if (prompt_tokens.empty()) { + SLT_WRN(slot, "%s", "empty prompt - releasing slot\n"); + + slot.release(); + slot.print_timings(); + send_final_response(slot); + continue; + } + + if (slot.is_non_causal()) { + if (slot.n_prompt_tokens > n_ubatch) { + slot.release(); + send_error(slot, "input is too large to process. increase the physical batch size", ERROR_TYPE_SERVER); + continue; + } + + if (slot.n_prompt_tokens > slot.n_ctx) { + slot.release(); + send_error(slot, "input is larger than the max context size. skipping", ERROR_TYPE_SERVER); + continue; + } + } else { + if (!params_base.ctx_shift) { + // if context shift is disabled, we make sure prompt size is smaller than KV size + // TODO: there should be a separate parameter that control prompt truncation + // context shift should be applied only during the generation phase + if (slot.n_prompt_tokens >= slot.n_ctx) { + slot.release(); + send_error(slot, "the request exceeds the available context size. try increasing the context size or enable context shift", ERROR_TYPE_INVALID_REQUEST); + continue; + } + } + if (slot.params.n_keep < 0) { + slot.params.n_keep = slot.n_prompt_tokens; + } + slot.params.n_keep = std::min(slot.n_ctx - 4, slot.params.n_keep); + + // if input prompt is too big, truncate it + if (slot.n_prompt_tokens >= slot.n_ctx) { + const int n_left = slot.n_ctx - slot.params.n_keep; + + const int n_block_size = n_left / 2; + const int erased_blocks = (slot.n_prompt_tokens - slot.params.n_keep - n_block_size) / n_block_size; + + llama_tokens new_tokens( + prompt_tokens.begin(), + prompt_tokens.begin() + slot.params.n_keep); + + new_tokens.insert( + new_tokens.end(), + prompt_tokens.begin() + slot.params.n_keep + erased_blocks * n_block_size, + prompt_tokens.end()); + + prompt_tokens = std::move(new_tokens); + + slot.truncated = true; + slot.n_prompt_tokens = prompt_tokens.size(); + + SLT_WRN(slot, "input truncated, n_ctx = %d, n_keep = %d, n_left = %d, n_prompt_tokens = %d\n", slot.n_ctx, slot.params.n_keep, n_left, slot.n_prompt_tokens); + + GGML_ASSERT(slot.n_prompt_tokens < slot.n_ctx); + } + + if (slot.params.cache_prompt) { + // reuse any previously computed tokens that are common with the new prompt + slot.n_past = common_lcp(slot.cache_tokens, prompt_tokens); + + // reuse chunks from the cached prompt by shifting their KV cache in the new position + if (params_base.n_cache_reuse > 0) { + size_t head_c = slot.n_past; // cache + size_t head_p = slot.n_past; // current prompt + + SLT_DBG(slot, "trying to reuse chunks with size > %d, slot.n_past = %d\n", params_base.n_cache_reuse, slot.n_past); + + while (head_c < slot.cache_tokens.size() && + head_p < prompt_tokens.size()) { + + size_t n_match = 0; + while (head_c + n_match < slot.cache_tokens.size() && + head_p + n_match < prompt_tokens.size() && + slot.cache_tokens[head_c + n_match] == prompt_tokens[head_p + n_match]) { + + n_match++; + } + + if (n_match >= (size_t) params_base.n_cache_reuse) { + SLT_INF(slot, "reusing chunk with size %zu, shifting KV cache [%zu, %zu) -> [%zu, %zu)\n", n_match, head_c, head_c + n_match, head_p, head_p + n_match); + //for (size_t i = head_p; i < head_p + n_match; i++) { + // SLT_DBG(slot, "cache token %3zu: %6d '%s'\n", i, prompt_tokens[i], common_token_to_piece(ctx, prompt_tokens[i]).c_str()); + //} + + const int64_t kv_shift = (int64_t) head_p - 
(int64_t) head_c; + + llama_kv_self_seq_rm (ctx, slot.id, head_p, head_c); + llama_kv_self_seq_add(ctx, slot.id, head_c, head_c + n_match, kv_shift); + + for (size_t i = 0; i < n_match; i++) { + slot.cache_tokens[head_p + i] = slot.cache_tokens[head_c + i]; + slot.n_past++; + } + + head_c += n_match; + head_p += n_match; + } else { + head_c += 1; + } + } + + SLT_DBG(slot, "after context reuse, new slot.n_past = %d\n", slot.n_past); + } + } + } + + if (slot.n_past == slot.n_prompt_tokens && slot.n_past > 0) { + // we have to evaluate at least 1 token to generate logits. + SLT_WRN(slot, "need to evaluate at least 1 token to generate logits, n_past = %d, n_prompt_tokens = %d\n", slot.n_past, slot.n_prompt_tokens); + + slot.n_past--; + } + + slot.n_prompt_tokens_processed = 0; + } + + // non-causal tasks require to fit the entire prompt in the physical batch + if (slot.is_non_causal()) { + // cannot fit the prompt in the current batch - will try next iter + if (batch.n_tokens + slot.n_prompt_tokens > n_batch) { + continue; + } + } + + // keep only the common part + if (!llama_kv_self_seq_rm(ctx, slot.id, slot.n_past, -1)) { + // could not partially delete (likely using a non-Transformer model) + llama_kv_self_seq_rm(ctx, slot.id, -1, -1); + + // there is no common part left + slot.n_past = 0; + } + + SLT_INF(slot, "kv cache rm [%d, end)\n", slot.n_past); + + // remove the non-common part from the cache + slot.cache_tokens.resize(slot.n_past); + + // add prompt tokens for processing in the current batch + while (slot.n_past < slot.n_prompt_tokens && batch.n_tokens < n_batch) { + // without pooling, we want to output the embeddings for all the tokens in the batch + const bool need_embd = slot.task_type == SERVER_TASK_TYPE_EMBEDDING && llama_pooling_type(slot.ctx) == LLAMA_POOLING_TYPE_NONE; + + common_batch_add(batch, prompt_tokens[slot.n_past], slot.n_past, { slot.id }, need_embd); + + if (slot.params.cache_prompt) { + slot.cache_tokens.push_back(prompt_tokens[slot.n_past]); + } + + slot.n_prompt_tokens_processed++; + slot.n_past++; + } + + SLT_INF(slot, "prompt processing progress, n_past = %d, n_tokens = %d, progress = %f\n", slot.n_past, batch.n_tokens, (float) slot.n_prompt_tokens_processed / slot.n_prompt_tokens); + + // entire prompt has been processed + if (slot.n_past == slot.n_prompt_tokens) { + slot.state = SLOT_STATE_DONE_PROMPT; + + GGML_ASSERT(batch.n_tokens > 0); + + common_sampler_reset(slot.smpl); + + // Process all prompt tokens through sampler system + for (int i = 0; i < slot.n_prompt_tokens; ++i) { + common_sampler_accept(slot.smpl, prompt_tokens[i], false); + } + + // extract the logits only for the last token + batch.logits[batch.n_tokens - 1] = true; + + slot.n_decoded = 0; + slot.i_batch = batch.n_tokens - 1; + + SLT_INF(slot, "prompt done, n_past = %d, n_tokens = %d\n", slot.n_past, batch.n_tokens); + } + } + + if (batch.n_tokens >= n_batch) { + break; + } + } + } + + if (batch.n_tokens == 0) { + SRV_WRN("%s", "no tokens to decode\n"); + return; + } + + SRV_DBG("decoding batch, n_tokens = %d\n", batch.n_tokens); + + if (slot_batched) { + // make sure we're in the right embedding mode + llama_set_embeddings(ctx, slot_batched->is_non_causal()); + // apply lora, only need to do it once per batch + common_set_adapter_lora(ctx, slot_batched->lora); + } + + // process the created batch of tokens + for (int32_t i = 0; i < batch.n_tokens; i += n_batch) { + const int32_t n_tokens = std::min(n_batch, batch.n_tokens - i); + + llama_batch batch_view = { + n_tokens, + batch.token + 
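+            // batch_view is a window of n_tokens entries starting at offset i into the
+            // full batch (the token/pos/n_seq_id/seq_id/logits pointers are all advanced by i)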
i, + nullptr, + batch.pos + i, + batch.n_seq_id + i, + batch.seq_id + i, + batch.logits + i, + }; + + const int ret = llama_decode(ctx, batch_view); + metrics.on_decoded(slots); + + if (ret != 0) { + if (n_batch == 1 || ret < 0) { + // if you get here, it means the KV cache is full - try increasing it via the context size + SRV_ERR("failed to decode the batch: KV cache is full - try increasing it via the context size, i = %d, n_batch = %d, ret = %d\n", i, n_batch, ret); + for (auto & slot : slots) { + slot.release(); + send_error(slot, "Input prompt is too big compared to KV size. Please try increasing KV size."); + } + break; // break loop of n_batch + } + + // retry with half the batch size to try to find a free slot in the KV cache + n_batch /= 2; + i -= n_batch; + + SRV_WRN("failed to find free space in the KV cache, retrying with smaller batch size - try increasing it via the context size or enable defragmentation, i = %d, n_batch = %d, ret = %d\n", i, n_batch, ret); + + continue; // continue loop of n_batch + } + + for (auto & slot : slots) { + if (slot.i_batch < (int) i || slot.i_batch >= (int) (i + n_tokens)) { + continue; // continue loop of slots + } + + if (slot.state == SLOT_STATE_DONE_PROMPT) { + if (slot.task_type == SERVER_TASK_TYPE_EMBEDDING) { + // prompt evaluated for embedding + send_embedding(slot, batch_view); + slot.release(); + slot.i_batch = -1; + continue; // continue loop of slots + } + + if (slot.task_type == SERVER_TASK_TYPE_RERANK) { + send_rerank(slot, batch_view); + slot.release(); + slot.i_batch = -1; + continue; // continue loop of slots + } + + // prompt evaluated for next-token prediction + slot.state = SLOT_STATE_GENERATING; + } else if (slot.state != SLOT_STATE_GENERATING) { + continue; // continue loop of slots + } + + const int tok_idx = slot.i_batch - i; + + llama_token id = common_sampler_sample(slot.smpl, ctx, tok_idx); + + slot.i_batch = -1; + + common_sampler_accept(slot.smpl, id, true); + + slot.n_decoded += 1; + + const int64_t t_current = ggml_time_us(); + + if (slot.n_decoded == 1) { + slot.t_start_generation = t_current; + slot.t_prompt_processing = (slot.t_start_generation - slot.t_start_process_prompt) / 1e3; + metrics.on_prompt_eval(slot); + } + + slot.t_token_generation = (t_current - slot.t_start_generation) / 1e3; + + completion_token_output result; + result.tok = id; + result.text_to_send = common_token_to_piece(ctx, result.tok, accept_special_token(slot, result.tok)); + result.prob = 1.0f; // TODO: set it here instead of doing inside populate_token_probs + + if (slot.params.sampling.n_probs > 0) { + populate_token_probs(slot, result, slot.params.post_sampling_probs, params_base.special, tok_idx); + } + + if (!process_token(result, slot)) { + // release slot because of stop condition + slot.release(); + slot.print_timings(); + send_final_response(slot); + metrics.on_prediction(slot); + continue; + } + } + + // do speculative decoding + for (auto & slot : slots) { + if (!slot.is_processing() || !slot.can_speculate()) { + continue; + } + + if (slot.state != SLOT_STATE_GENERATING) { + continue; + } + + // determine the max draft that fits the current slot state + int n_draft_max = slot.params.speculative.n_max; + + // note: n_past is not yet increased for the `id` token sampled above + // also, need to leave space for 1 extra token to allow context shifts + n_draft_max = std::min(n_draft_max, slot.n_ctx - slot.n_past - 2); + + if (slot.n_remaining > 0) { + n_draft_max = std::min(n_draft_max, slot.n_remaining - 1); + } + + SLT_DBG(slot, 
"max possible draft: %d\n", n_draft_max); + + if (n_draft_max < slot.params.speculative.n_min) { + SLT_DBG(slot, "the max possible draft is too small: %d < %d - skipping speculative decoding\n", n_draft_max, slot.params.speculative.n_min); + + continue; + } + + llama_token id = slot.sampled; + + struct common_speculative_params params_spec; + params_spec.n_draft = n_draft_max; + params_spec.n_reuse = llama_n_ctx(slot.ctx_dft) - slot.params.speculative.n_max; + params_spec.p_min = slot.params.speculative.p_min; + + llama_tokens draft = common_speculative_gen_draft(slot.spec, params_spec, slot.cache_tokens, id); + + // keep track of total number of tokens generated in the draft + slot.n_draft_total += draft.size(); + + // ignore small drafts + if (slot.params.speculative.n_min > (int) draft.size()) { + SLT_DBG(slot, "ignoring small draft: %d < %d\n", (int) draft.size(), slot.params.speculative.n_min); + + continue; + } + + // construct the speculation batch + common_batch_clear(slot.batch_spec); + common_batch_add (slot.batch_spec, id, slot.n_past, { slot.id }, true); + + for (size_t i = 0; i < draft.size(); ++i) { + common_batch_add(slot.batch_spec, draft[i], slot.n_past + 1 + i, { slot.id }, true); + } + + SLT_DBG(slot, "decoding speculative batch, size = %d\n", slot.batch_spec.n_tokens); + + llama_decode(ctx, slot.batch_spec); + + // the accepted tokens from the speculation + const auto ids = common_sampler_sample_and_accept_n(slot.smpl, ctx, draft); + + slot.n_past += ids.size(); + slot.n_decoded += ids.size(); + + // update how many tokens out of draft was accepted + slot.n_draft_accepted += ids.size() - 1; + + slot.cache_tokens.push_back(id); + slot.cache_tokens.insert(slot.cache_tokens.end(), ids.begin(), ids.end() - 1); + + llama_kv_self_seq_rm(ctx, slot.id, slot.n_past, -1); + + for (size_t i = 0; i < ids.size(); ++i) { + completion_token_output result; + + result.tok = ids[i]; + result.text_to_send = common_token_to_piece(ctx, result.tok, accept_special_token(slot, result.tok)); + result.prob = 1.0f; // set later + + // TODO: set result.probs + + if (!process_token(result, slot)) { + // release slot because of stop condition + slot.release(); + slot.print_timings(); + send_final_response(slot); + metrics.on_prediction(slot); + break; + } + } + + SLT_DBG(slot, "accepted %d/%d draft tokens, new n_past = %d\n", (int) ids.size() - 1, (int) draft.size(), slot.n_past); + } + } + + SRV_DBG("%s", "run slots completed\n"); + } + + json model_meta() const { + return json { + {"vocab_type", llama_vocab_type (vocab)}, + {"n_vocab", llama_vocab_n_tokens (vocab)}, + {"n_ctx_train", llama_model_n_ctx_train(model)}, + {"n_embd", llama_model_n_embd (model)}, + {"n_params", llama_model_n_params (model)}, + {"size", llama_model_size (model)}, + }; + } +}; + +static void log_server_request(const httplib::Request & req, const httplib::Response & res) { + // skip GH copilot requests when using default port + if (req.path == "/v1/health" || req.path == "/v1/completions") { + return; + } + + // reminder: this function is not covered by httplib's exception handler; if someone does more complicated stuff, think about wrapping it in try-catch + + SRV_INF("request: %s %s %s %d\n", req.method.c_str(), req.path.c_str(), req.remote_addr.c_str(), res.status); + + SRV_DBG("request: %s\n", req.body.c_str()); + SRV_DBG("response: %s\n", res.body.c_str()); +} + +std::function<void(int)> shutdown_handler; +std::atomic_flag is_terminating = ATOMIC_FLAG_INIT; + +inline void signal_handler(int signal) { + if 
(is_terminating.test_and_set()) { + // in case it hangs, we can force terminate the server by hitting Ctrl+C twice + // this is for better developer experience, we can remove when the server is stable enough + fprintf(stderr, "Received second interrupt, terminating immediately.\n"); + exit(1); + } + + shutdown_handler(signal); +} + +int main(int argc, char ** argv) { + // own arguments required by this example + common_params params; + + if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_SERVER)) { + return 1; + } + + common_init(); + + // struct that contains llama context and inference + server_context ctx_server; + + llama_backend_init(); + llama_numa_init(params.numa); + + LOG_INF("system info: n_threads = %d, n_threads_batch = %d, total_threads = %d\n", params.cpuparams.n_threads, params.cpuparams_batch.n_threads, std::thread::hardware_concurrency()); + LOG_INF("\n"); + LOG_INF("%s\n", common_params_get_system_info(params).c_str()); + LOG_INF("\n"); + + std::unique_ptr<httplib::Server> svr; +#ifdef CPPHTTPLIB_OPENSSL_SUPPORT + if (params.ssl_file_key != "" && params.ssl_file_cert != "") { + LOG_INF("Running with SSL: key = %s, cert = %s\n", params.ssl_file_key.c_str(), params.ssl_file_cert.c_str()); + svr.reset( + new httplib::SSLServer(params.ssl_file_cert.c_str(), params.ssl_file_key.c_str()) + ); + } else { + LOG_INF("Running without SSL\n"); + svr.reset(new httplib::Server()); + } +#else + if (params.ssl_file_key != "" && params.ssl_file_cert != "") { + LOG_ERR("Server is built without SSL support\n"); + return 1; + } + svr.reset(new httplib::Server()); +#endif + + std::atomic<server_state> state{SERVER_STATE_LOADING_MODEL}; + + svr->set_default_headers({{"Server", "llama.cpp"}}); + svr->set_logger(log_server_request); + + auto res_error = [](httplib::Response & res, const json & error_data) { + json final_response {{"error", error_data}}; + res.set_content(safe_json_to_str(final_response), MIMETYPE_JSON); + res.status = json_value(error_data, "code", 500); + }; + + auto res_ok = [](httplib::Response & res, const json & data) { + res.set_content(safe_json_to_str(data), MIMETYPE_JSON); + res.status = 200; + }; + + svr->set_exception_handler([&res_error](const httplib::Request &, httplib::Response & res, const std::exception_ptr & ep) { + std::string message; + try { + std::rethrow_exception(ep); + } catch (const std::exception & e) { + message = e.what(); + } catch (...) 
{ + message = "Unknown Exception"; + } + + try { + json formatted_error = format_error_response(message, ERROR_TYPE_SERVER); + LOG_WRN("got exception: %s\n", formatted_error.dump().c_str()); + res_error(res, formatted_error); + } catch (const std::exception & e) { + LOG_ERR("got another exception: %s | while hanlding exception: %s\n", e.what(), message.c_str()); + } + }); + + svr->set_error_handler([&res_error](const httplib::Request &, httplib::Response & res) { + if (res.status == 404) { + res_error(res, format_error_response("File Not Found", ERROR_TYPE_NOT_FOUND)); + } + // for other error codes, we skip processing here because it's already done by res_error() + }); + + // set timeouts and change hostname and port + svr->set_read_timeout (params.timeout_read); + svr->set_write_timeout(params.timeout_write); + + std::unordered_map<std::string, std::string> log_data; + + log_data["hostname"] = params.hostname; + log_data["port"] = std::to_string(params.port); + + if (params.api_keys.size() == 1) { + auto key = params.api_keys[0]; + log_data["api_key"] = "api_key: ****" + key.substr(std::max((int)(key.length() - 4), 0)); + } else if (params.api_keys.size() > 1) { + log_data["api_key"] = "api_key: " + std::to_string(params.api_keys.size()) + " keys loaded"; + } + + // Necessary similarity of prompt for slot selection + ctx_server.slot_prompt_similarity = params.slot_prompt_similarity; + + // + // Middlewares + // + + auto middleware_validate_api_key = [¶ms, &res_error](const httplib::Request & req, httplib::Response & res) { + static const std::unordered_set<std::string> public_endpoints = { + "/health", + "/models", + "/v1/models", + }; + + // If API key is not set, skip validation + if (params.api_keys.empty()) { + return true; + } + + // If path is public or is static file, skip validation + if (public_endpoints.find(req.path) != public_endpoints.end() || req.path == "/") { + return true; + } + + // Check for API key in the header + auto auth_header = req.get_header_value("Authorization"); + + std::string prefix = "Bearer "; + if (auth_header.substr(0, prefix.size()) == prefix) { + std::string received_api_key = auth_header.substr(prefix.size()); + if (std::find(params.api_keys.begin(), params.api_keys.end(), received_api_key) != params.api_keys.end()) { + return true; // API key is valid + } + } + + // API key is invalid or not provided + res_error(res, format_error_response("Invalid API Key", ERROR_TYPE_AUTHENTICATION)); + + LOG_WRN("Unauthorized: Invalid API Key\n"); + + return false; + }; + + auto middleware_server_state = [&res_error, &state](const httplib::Request & req, httplib::Response & res) { + server_state current_state = state.load(); + if (current_state == SERVER_STATE_LOADING_MODEL) { + auto tmp = string_split<std::string>(req.path, '.'); + if (req.path == "/" || tmp.back() == "html") { + res.set_content(reinterpret_cast<const char*>(loading_html), loading_html_len, "text/html; charset=utf-8"); + res.status = 503; + } else { + res_error(res, format_error_response("Loading model", ERROR_TYPE_UNAVAILABLE)); + } + return false; + } + return true; + }; + + // register server middlewares + svr->set_pre_routing_handler([&middleware_validate_api_key, &middleware_server_state](const httplib::Request & req, httplib::Response & res) { + res.set_header("Access-Control-Allow-Origin", req.get_header_value("Origin")); + // If this is OPTIONS request, skip validation because browsers don't include Authorization header + if (req.method == "OPTIONS") { + 
res.set_header("Access-Control-Allow-Credentials", "true"); + res.set_header("Access-Control-Allow-Methods", "GET, POST"); + res.set_header("Access-Control-Allow-Headers", "*"); + res.set_content("", "text/html"); // blank response, no data + return httplib::Server::HandlerResponse::Handled; // skip further processing + } + if (!middleware_server_state(req, res)) { + return httplib::Server::HandlerResponse::Handled; + } + if (!middleware_validate_api_key(req, res)) { + return httplib::Server::HandlerResponse::Handled; + } + return httplib::Server::HandlerResponse::Unhandled; + }); + + // + // Route handlers (or controllers) + // + + const auto handle_health = [&](const httplib::Request &, httplib::Response & res) { + // error and loading states are handled by middleware + json health = {{"status", "ok"}}; + res_ok(res, health); + }; + + const auto handle_slots = [&](const httplib::Request & req, httplib::Response & res) { + if (!params.endpoint_slots) { + res_error(res, format_error_response("This server does not support slots endpoint. Start it with `--slots`", ERROR_TYPE_NOT_SUPPORTED)); + return; + } + + // request slots data using task queue + int task_id = ctx_server.queue_tasks.get_new_id(); + { + server_task task(SERVER_TASK_TYPE_METRICS); + task.id = task_id; + ctx_server.queue_results.add_waiting_task_id(task_id); + ctx_server.queue_tasks.post(std::move(task), true); // high-priority task + } + + // get the result + server_task_result_ptr result = ctx_server.queue_results.recv(task_id); + ctx_server.queue_results.remove_waiting_task_id(task_id); + + if (result->is_error()) { + res_error(res, result->to_json()); + return; + } + + // TODO: get rid of this dynamic_cast + auto res_metrics = dynamic_cast<server_task_result_metrics*>(result.get()); + GGML_ASSERT(res_metrics != nullptr); + + // optionally return "fail_on_no_slot" error + if (req.has_param("fail_on_no_slot")) { + if (res_metrics->n_idle_slots == 0) { + res_error(res, format_error_response("no slot available", ERROR_TYPE_UNAVAILABLE)); + return; + } + } + + res_ok(res, res_metrics->slots_data); + }; + + const auto handle_metrics = [&](const httplib::Request &, httplib::Response & res) { + if (!params.endpoint_metrics) { + res_error(res, format_error_response("This server does not support metrics endpoint. 
Start it with `--metrics`", ERROR_TYPE_NOT_SUPPORTED)); + return; + } + + // request slots data using task queue + int task_id = ctx_server.queue_tasks.get_new_id(); + { + server_task task(SERVER_TASK_TYPE_METRICS); + task.id = task_id; + ctx_server.queue_results.add_waiting_task_id(task_id); + ctx_server.queue_tasks.post(std::move(task), true); // high-priority task + } + + // get the result + server_task_result_ptr result = ctx_server.queue_results.recv(task_id); + ctx_server.queue_results.remove_waiting_task_id(task_id); + + if (result->is_error()) { + res_error(res, result->to_json()); + return; + } + + // TODO: get rid of this dynamic_cast + auto res_metrics = dynamic_cast<server_task_result_metrics*>(result.get()); + GGML_ASSERT(res_metrics != nullptr); + + // metrics definition: https://prometheus.io/docs/practices/naming/#metric-names + json all_metrics_def = json { + {"counter", {{ + {"name", "prompt_tokens_total"}, + {"help", "Number of prompt tokens processed."}, + {"value", (uint64_t) res_metrics->n_prompt_tokens_processed_total} + }, { + {"name", "prompt_seconds_total"}, + {"help", "Prompt process time"}, + {"value", (uint64_t) res_metrics->t_prompt_processing_total / 1.e3} + }, { + {"name", "tokens_predicted_total"}, + {"help", "Number of generation tokens processed."}, + {"value", (uint64_t) res_metrics->n_tokens_predicted_total} + }, { + {"name", "tokens_predicted_seconds_total"}, + {"help", "Predict process time"}, + {"value", (uint64_t) res_metrics->t_tokens_generation_total / 1.e3} + }, { + {"name", "n_decode_total"}, + {"help", "Total number of llama_decode() calls"}, + {"value", res_metrics->n_decode_total} + }, { + {"name", "n_busy_slots_per_decode"}, + {"help", "Average number of busy slots per llama_decode() call"}, + {"value", (float) res_metrics->n_busy_slots_total / std::max((float) res_metrics->n_decode_total, 1.f)} + }}}, + {"gauge", {{ + {"name", "prompt_tokens_seconds"}, + {"help", "Average prompt throughput in tokens/s."}, + {"value", res_metrics->n_prompt_tokens_processed ? 1.e3 / res_metrics->t_prompt_processing * res_metrics->n_prompt_tokens_processed : 0.} + },{ + {"name", "predicted_tokens_seconds"}, + {"help", "Average generation throughput in tokens/s."}, + {"value", res_metrics->n_tokens_predicted ? 1.e3 / res_metrics->t_tokens_generation * res_metrics->n_tokens_predicted : 0.} + },{ + {"name", "kv_cache_usage_ratio"}, + {"help", "KV-cache usage. 1 means 100 percent usage."}, + {"value", 1. 
* res_metrics->kv_cache_used_cells / params.n_ctx} + },{ + {"name", "kv_cache_tokens"}, + {"help", "KV-cache tokens."}, + {"value", (uint64_t) res_metrics->kv_cache_tokens_count} + },{ + {"name", "requests_processing"}, + {"help", "Number of requests processing."}, + {"value", (uint64_t) res_metrics->n_processing_slots} + },{ + {"name", "requests_deferred"}, + {"help", "Number of requests deferred."}, + {"value", (uint64_t) res_metrics->n_tasks_deferred} + }}} + }; + + std::stringstream prometheus; + + for (const auto & el : all_metrics_def.items()) { + const auto & type = el.key(); + const auto & metrics_def = el.value(); + + for (const auto & metric_def : metrics_def) { + const std::string name = metric_def.at("name"); + const std::string help = metric_def.at("help"); + + auto value = json_value(metric_def, "value", 0.); + prometheus << "# HELP llamacpp:" << name << " " << help << "\n" + << "# TYPE llamacpp:" << name << " " << type << "\n" + << "llamacpp:" << name << " " << value << "\n"; + } + } + + res.set_header("Process-Start-Time-Unix", std::to_string(res_metrics->t_start)); + + res.set_content(prometheus.str(), "text/plain; version=0.0.4"); + res.status = 200; // HTTP OK + }; + + const auto handle_slots_save = [&ctx_server, &res_error, &res_ok, ¶ms](const httplib::Request & req, httplib::Response & res, int id_slot) { + json request_data = json::parse(req.body); + std::string filename = request_data.at("filename"); + if (!fs_validate_filename(filename)) { + res_error(res, format_error_response("Invalid filename", ERROR_TYPE_INVALID_REQUEST)); + return; + } + std::string filepath = params.slot_save_path + filename; + + int task_id = ctx_server.queue_tasks.get_new_id(); + { + server_task task(SERVER_TASK_TYPE_SLOT_SAVE); + task.id = task_id; + task.slot_action.slot_id = id_slot; + task.slot_action.filename = filename; + task.slot_action.filepath = filepath; + + ctx_server.queue_results.add_waiting_task_id(task_id); + ctx_server.queue_tasks.post(std::move(task)); + } + + server_task_result_ptr result = ctx_server.queue_results.recv(task_id); + ctx_server.queue_results.remove_waiting_task_id(task_id); + + if (result->is_error()) { + res_error(res, result->to_json()); + return; + } + + res_ok(res, result->to_json()); + }; + + const auto handle_slots_restore = [&ctx_server, &res_error, &res_ok, ¶ms](const httplib::Request & req, httplib::Response & res, int id_slot) { + json request_data = json::parse(req.body); + std::string filename = request_data.at("filename"); + if (!fs_validate_filename(filename)) { + res_error(res, format_error_response("Invalid filename", ERROR_TYPE_INVALID_REQUEST)); + return; + } + std::string filepath = params.slot_save_path + filename; + + int task_id = ctx_server.queue_tasks.get_new_id(); + { + server_task task(SERVER_TASK_TYPE_SLOT_RESTORE); + task.id = task_id; + task.slot_action.slot_id = id_slot; + task.slot_action.filename = filename; + task.slot_action.filepath = filepath; + + ctx_server.queue_results.add_waiting_task_id(task_id); + ctx_server.queue_tasks.post(std::move(task)); + } + + server_task_result_ptr result = ctx_server.queue_results.recv(task_id); + ctx_server.queue_results.remove_waiting_task_id(task_id); + + if (result->is_error()) { + res_error(res, result->to_json()); + return; + } + + GGML_ASSERT(dynamic_cast<server_task_result_slot_save_load*>(result.get()) != nullptr); + res_ok(res, result->to_json()); + }; + + const auto handle_slots_erase = [&ctx_server, &res_error, &res_ok](const httplib::Request & /* req */, httplib::Response & res, 
int id_slot) { + int task_id = ctx_server.queue_tasks.get_new_id(); + { + server_task task(SERVER_TASK_TYPE_SLOT_ERASE); + task.id = task_id; + task.slot_action.slot_id = id_slot; + + ctx_server.queue_results.add_waiting_task_id(task_id); + ctx_server.queue_tasks.post(std::move(task)); + } + + server_task_result_ptr result = ctx_server.queue_results.recv(task_id); + ctx_server.queue_results.remove_waiting_task_id(task_id); + + if (result->is_error()) { + res_error(res, result->to_json()); + return; + } + + GGML_ASSERT(dynamic_cast<server_task_result_slot_erase*>(result.get()) != nullptr); + res_ok(res, result->to_json()); + }; + + const auto handle_slots_action = [¶ms, &res_error, &handle_slots_save, &handle_slots_restore, &handle_slots_erase](const httplib::Request & req, httplib::Response & res) { + if (params.slot_save_path.empty()) { + res_error(res, format_error_response("This server does not support slots action. Start it with `--slot-save-path`", ERROR_TYPE_NOT_SUPPORTED)); + return; + } + + std::string id_slot_str = req.path_params.at("id_slot"); + int id_slot; + + try { + id_slot = std::stoi(id_slot_str); + } catch (const std::exception &) { + res_error(res, format_error_response("Invalid slot ID", ERROR_TYPE_INVALID_REQUEST)); + return; + } + + std::string action = req.get_param_value("action"); + + if (action == "save") { + handle_slots_save(req, res, id_slot); + } else if (action == "restore") { + handle_slots_restore(req, res, id_slot); + } else if (action == "erase") { + handle_slots_erase(req, res, id_slot); + } else { + res_error(res, format_error_response("Invalid action", ERROR_TYPE_INVALID_REQUEST)); + } + }; + + const auto handle_props = [&ctx_server, &res_ok](const httplib::Request &, httplib::Response & res) { + // this endpoint is publicly available, please only return what is safe to be exposed + json data = { + { "default_generation_settings", ctx_server.default_generation_settings_for_props }, + { "total_slots", ctx_server.params_base.n_parallel }, + { "model_path", ctx_server.params_base.model.path }, + { "chat_template", common_chat_templates_source(ctx_server.chat_templates.get()) }, + { "bos_token", common_token_to_piece(ctx_server.ctx, llama_vocab_bos(ctx_server.vocab), /* special= */ true)}, + { "eos_token", common_token_to_piece(ctx_server.ctx, llama_vocab_eos(ctx_server.vocab), /* special= */ true)}, + { "build_info", build_info }, + }; + if (ctx_server.params_base.use_jinja) { + if (auto tool_use_src = common_chat_templates_source(ctx_server.chat_templates.get(), "tool_use")) { + data["chat_template_tool_use"] = tool_use_src; + } + } + + res_ok(res, data); + }; + + const auto handle_props_change = [&ctx_server, &res_error, &res_ok](const httplib::Request & req, httplib::Response & res) { + if (!ctx_server.params_base.endpoint_props) { + res_error(res, format_error_response("This server does not support changing global properties. 
Start it with `--props`", ERROR_TYPE_NOT_SUPPORTED)); + return; + } + + json data = json::parse(req.body); + + // update any props here + + res_ok(res, {{ "success", true }}); + }; + + const auto handle_api_show = [&ctx_server, &res_ok](const httplib::Request &, httplib::Response & res) { + json data = { + { + "template", common_chat_templates_source(ctx_server.chat_templates.get()), + }, + { + "model_info", { + { "llama.context_length", ctx_server.slots.back().n_ctx, }, + } + }, + }; + + res_ok(res, data); + }; + + // handle completion-like requests (completion, chat, infill) + // we can optionally provide a custom format for partial results and final results + const auto handle_completions_impl = [&ctx_server, &res_error, &res_ok]( + server_task_type type, + json & data, + std::function<bool()> is_connection_closed, + httplib::Response & res, + oaicompat_type oaicompat) { + GGML_ASSERT(type == SERVER_TASK_TYPE_COMPLETION || type == SERVER_TASK_TYPE_INFILL); + + if (ctx_server.params_base.embedding) { + res_error(res, format_error_response("This server does not support completions. Start it without `--embeddings`", ERROR_TYPE_NOT_SUPPORTED)); + return; + } + + auto completion_id = gen_chatcmplid(); + std::unordered_set<int> task_ids; + try { + std::vector<server_task> tasks; + + const auto & prompt = data.at("prompt"); + // TODO: this log can become very long, put it behind a flag or think about a more compact format + //SRV_DBG("Prompt: %s\n", prompt.is_string() ? prompt.get<std::string>().c_str() : prompt.dump(2).c_str()); + + std::vector<llama_tokens> tokenized_prompts = tokenize_input_prompts(ctx_server.vocab, prompt, true, true); + tasks.reserve(tokenized_prompts.size()); + for (size_t i = 0; i < tokenized_prompts.size(); i++) { + server_task task = server_task(type); + + task.id = ctx_server.queue_tasks.get_new_id(); + task.index = i; + + task.prompt_tokens = std::move(tokenized_prompts[i]); + task.params = server_task::params_from_json_cmpl( + ctx_server.ctx, + ctx_server.params_base, + data); + task.id_selected_slot = json_value(data, "id_slot", -1); + + // OAI-compat + task.params.oaicompat = oaicompat; + task.params.oaicompat_cmpl_id = completion_id; + // oaicompat_model is already populated by params_from_json_cmpl + + tasks.push_back(std::move(task)); + } + + task_ids = server_task::get_list_id(tasks); + ctx_server.queue_results.add_waiting_tasks(tasks); + ctx_server.queue_tasks.post(std::move(tasks)); + } catch (const std::exception & e) { + res_error(res, format_error_response(e.what(), ERROR_TYPE_INVALID_REQUEST)); + return; + } + + bool stream = json_value(data, "stream", false); + + if (!stream) { + ctx_server.receive_multi_results(task_ids, [&](std::vector<server_task_result_ptr> & results) { + if (results.size() == 1) { + // single result + res_ok(res, results[0]->to_json()); + } else { + // multiple results (multitask) + json arr = json::array(); + for (auto & res : results) { + arr.push_back(res->to_json()); + } + res_ok(res, arr); + } + }, [&](const json & error_data) { + res_error(res, error_data); + }, is_connection_closed); + + ctx_server.queue_results.remove_waiting_task_ids(task_ids); + } else { + const auto chunked_content_provider = [task_ids, &ctx_server, oaicompat](size_t, httplib::DataSink & sink) { + ctx_server.receive_cmpl_results_stream(task_ids, [&](server_task_result_ptr & result) -> bool { + json res_json = result->to_json(); + if (res_json.is_array()) { + for (const auto & res : res_json) { + if (!server_sent_event(sink, "data", res)) { + // sending 
failed (HTTP connection closed), cancel the generation + return false; + } + } + return true; + } else { + return server_sent_event(sink, "data", res_json); + } + }, [&](const json & error_data) { + server_sent_event(sink, "error", error_data); + }, [&sink]() { + // note: do not use req.is_connection_closed here because req is already destroyed + return !sink.is_writable(); + }); + if (oaicompat != OAICOMPAT_TYPE_NONE) { + static const std::string ev_done = "data: [DONE]\n\n"; + sink.write(ev_done.data(), ev_done.size()); + } + sink.done(); + return false; + }; + + auto on_complete = [task_ids, &ctx_server] (bool) { + ctx_server.queue_results.remove_waiting_task_ids(task_ids); + }; + + res.set_chunked_content_provider("text/event-stream", chunked_content_provider, on_complete); + } + }; + + const auto handle_completions = [&handle_completions_impl](const httplib::Request & req, httplib::Response & res) { + json data = json::parse(req.body); + return handle_completions_impl( + SERVER_TASK_TYPE_COMPLETION, + data, + req.is_connection_closed, + res, + OAICOMPAT_TYPE_NONE); + }; + + const auto handle_completions_oai = [&handle_completions_impl](const httplib::Request & req, httplib::Response & res) { + json data = oaicompat_completion_params_parse(json::parse(req.body)); + return handle_completions_impl( + SERVER_TASK_TYPE_COMPLETION, + data, + req.is_connection_closed, + res, + OAICOMPAT_TYPE_COMPLETION); + }; + + const auto handle_infill = [&ctx_server, &res_error, &handle_completions_impl](const httplib::Request & req, httplib::Response & res) { + // check model compatibility + std::string err; + if (llama_vocab_fim_pre(ctx_server.vocab) == LLAMA_TOKEN_NULL) { + err += "prefix token is missing. "; + } + if (llama_vocab_fim_suf(ctx_server.vocab) == LLAMA_TOKEN_NULL) { + err += "suffix token is missing. "; + } + if (llama_vocab_fim_mid(ctx_server.vocab) == LLAMA_TOKEN_NULL) { + err += "middle token is missing. 
"; + } + if (!err.empty()) { + res_error(res, format_error_response(string_format("Infill is not supported by this model: %s", err.c_str()), ERROR_TYPE_NOT_SUPPORTED)); + return; + } + + json data = json::parse(req.body); + + // validate input + if (data.contains("prompt") && !data.at("prompt").is_string()) { + // prompt is optional + res_error(res, format_error_response("\"prompt\" must be a string", ERROR_TYPE_INVALID_REQUEST)); + } + + if (!data.contains("input_prefix")) { + res_error(res, format_error_response("\"input_prefix\" is required", ERROR_TYPE_INVALID_REQUEST)); + } + + if (!data.contains("input_suffix")) { + res_error(res, format_error_response("\"input_suffix\" is required", ERROR_TYPE_INVALID_REQUEST)); + } + + if (data.contains("input_extra") && !data.at("input_extra").is_array()) { + // input_extra is optional + res_error(res, format_error_response("\"input_extra\" must be an array of {\"filename\": string, \"text\": string}", ERROR_TYPE_INVALID_REQUEST)); + return; + } + + json input_extra = json_value(data, "input_extra", json::array()); + for (const auto & chunk : input_extra) { + // { "text": string, "filename": string } + if (!chunk.contains("text") || !chunk.at("text").is_string()) { + res_error(res, format_error_response("extra_context chunk must contain a \"text\" field with a string value", ERROR_TYPE_INVALID_REQUEST)); + return; + } + // filename is optional + if (chunk.contains("filename") && !chunk.at("filename").is_string()) { + res_error(res, format_error_response("extra_context chunk's \"filename\" field must be a string", ERROR_TYPE_INVALID_REQUEST)); + return; + } + } + data["input_extra"] = input_extra; // default to empty array if it's not exist + + std::string prompt = json_value(data, "prompt", std::string()); + std::vector<llama_tokens> tokenized_prompts = tokenize_input_prompts(ctx_server.vocab, prompt, false, true); + SRV_DBG("creating infill tasks, n_prompts = %d\n", (int) tokenized_prompts.size()); + data["prompt"] = format_infill( + ctx_server.vocab, + data.at("input_prefix"), + data.at("input_suffix"), + data.at("input_extra"), + ctx_server.params_base.n_batch, + ctx_server.params_base.n_predict, + ctx_server.slots[0].n_ctx, // TODO: there should be a better way + ctx_server.params_base.spm_infill, + tokenized_prompts[0] + ); + + return handle_completions_impl( + SERVER_TASK_TYPE_INFILL, + data, + req.is_connection_closed, + res, + OAICOMPAT_TYPE_NONE); // infill is not OAI compatible + }; + + const auto handle_chat_completions = [&ctx_server, ¶ms, &res_error, &handle_completions_impl](const httplib::Request & req, httplib::Response & res) { + LOG_DBG("request: %s\n", req.body.c_str()); + if (ctx_server.params_base.embedding) { + res_error(res, format_error_response("This server does not support completions. 
Start it without `--embeddings`", ERROR_TYPE_NOT_SUPPORTED)); + return; + } + + auto body = json::parse(req.body); + json data = oaicompat_completion_params_parse(body, params.use_jinja, params.reasoning_format, ctx_server.chat_templates.get()); + + return handle_completions_impl( + SERVER_TASK_TYPE_COMPLETION, + data, + req.is_connection_closed, + res, + OAICOMPAT_TYPE_CHAT); + }; + + // same with handle_chat_completions, but without inference part + const auto handle_apply_template = [&ctx_server, ¶ms, &res_ok](const httplib::Request & req, httplib::Response & res) { + auto body = json::parse(req.body); + json data = oaicompat_completion_params_parse(body, params.use_jinja, params.reasoning_format, ctx_server.chat_templates.get()); + res_ok(res, {{ "prompt", std::move(data.at("prompt")) }}); + }; + + const auto handle_models = [¶ms, &ctx_server, &res_ok](const httplib::Request &, httplib::Response & res) { + json models = { + {"object", "list"}, + {"data", { + { + {"id", params.model_alias.empty() ? params.model.path : params.model_alias}, + {"object", "model"}, + {"created", std::time(0)}, + {"owned_by", "llamacpp"}, + {"meta", ctx_server.model_meta()} + }, + }} + }; + + res_ok(res, models); + }; + + const auto handle_tokenize = [&ctx_server, &res_ok](const httplib::Request & req, httplib::Response & res) { + const json body = json::parse(req.body); + + json tokens_response = json::array(); + if (body.count("content") != 0) { + const bool add_special = json_value(body, "add_special", false); + const bool with_pieces = json_value(body, "with_pieces", false); + + llama_tokens tokens = tokenize_mixed(ctx_server.vocab, body.at("content"), add_special, true); + + if (with_pieces) { + for (const auto& token : tokens) { + std::string piece = common_token_to_piece(ctx_server.ctx, token); + json piece_json; + + // Check if the piece is valid UTF-8 + if (is_valid_utf8(piece)) { + piece_json = piece; + } else { + // If not valid UTF-8, store as array of byte values + piece_json = json::array(); + for (unsigned char c : piece) { + piece_json.push_back(static_cast<int>(c)); + } + } + + tokens_response.push_back({ + {"id", token}, + {"piece", piece_json} + }); + } + } else { + tokens_response = tokens; + } + } + + const json data = format_tokenizer_response(tokens_response); + res_ok(res, data); + }; + + const auto handle_detokenize = [&ctx_server, &res_ok](const httplib::Request & req, httplib::Response & res) { + const json body = json::parse(req.body); + + std::string content; + if (body.count("tokens") != 0) { + const llama_tokens tokens = body.at("tokens"); + content = tokens_to_str(ctx_server.ctx, tokens.cbegin(), tokens.cend()); + } + + const json data = format_detokenized_response(content); + res_ok(res, data); + }; + + const auto handle_embeddings_impl = [&ctx_server, &res_error, &res_ok](const httplib::Request & req, httplib::Response & res, oaicompat_type oaicompat) { + const json body = json::parse(req.body); + + if (oaicompat != OAICOMPAT_TYPE_NONE && llama_pooling_type(ctx_server.ctx) == LLAMA_POOLING_TYPE_NONE) { + res_error(res, format_error_response("Pooling type 'none' is not OAI compatible. 
Please use a different pooling type", ERROR_TYPE_INVALID_REQUEST)); + return; + } + + // for the shape of input/content, see tokenize_input_prompts() + json prompt; + if (body.count("input") != 0) { + prompt = body.at("input"); + } else if (body.contains("content")) { + oaicompat = OAICOMPAT_TYPE_NONE; // "content" field is not OAI compatible + prompt = body.at("content"); + } else { + res_error(res, format_error_response("\"input\" or \"content\" must be provided", ERROR_TYPE_INVALID_REQUEST)); + return; + } + + bool use_base64 = false; + if (body.count("encoding_format") != 0) { + const std::string& format = body.at("encoding_format"); + if (format == "base64") { + use_base64 = true; + } else if (format != "float") { + res_error(res, format_error_response("The format to return the embeddings in. Can be either float or base64", ERROR_TYPE_INVALID_REQUEST)); + return; + } + } + + std::vector<llama_tokens> tokenized_prompts = tokenize_input_prompts(ctx_server.vocab, prompt, true, true); + for (const auto & tokens : tokenized_prompts) { + // this check is necessary for models that do not add BOS token to the input + if (tokens.empty()) { + res_error(res, format_error_response("Input content cannot be empty", ERROR_TYPE_INVALID_REQUEST)); + return; + } + } + + // create and queue the task + json responses = json::array(); + bool error = false; + std::unordered_set<int> task_ids; + { + std::vector<server_task> tasks; + for (size_t i = 0; i < tokenized_prompts.size(); i++) { + server_task task = server_task(SERVER_TASK_TYPE_EMBEDDING); + + task.id = ctx_server.queue_tasks.get_new_id(); + task.index = i; + task.prompt_tokens = std::move(tokenized_prompts[i]); + + // OAI-compat + task.params.oaicompat = oaicompat; + + tasks.push_back(std::move(task)); + } + + task_ids = server_task::get_list_id(tasks); + ctx_server.queue_results.add_waiting_tasks(tasks); + ctx_server.queue_tasks.post(std::move(tasks)); + } + + // get the result + ctx_server.receive_multi_results(task_ids, [&](std::vector<server_task_result_ptr> & results) { + for (auto & res : results) { + GGML_ASSERT(dynamic_cast<server_task_result_embd*>(res.get()) != nullptr); + responses.push_back(res->to_json()); + } + }, [&](const json & error_data) { + res_error(res, error_data); + error = true; + }, req.is_connection_closed); + + ctx_server.queue_results.remove_waiting_task_ids(task_ids); + + if (error) { + return; + } + + // write JSON response + json root = oaicompat == OAICOMPAT_TYPE_EMBEDDING + ? format_embeddings_response_oaicompat(body, responses, use_base64) + : json(responses); + res_ok(res, root); + }; + + const auto handle_embeddings = [&handle_embeddings_impl](const httplib::Request & req, httplib::Response & res) { + handle_embeddings_impl(req, res, OAICOMPAT_TYPE_NONE); + }; + + const auto handle_embeddings_oai = [&handle_embeddings_impl](const httplib::Request & req, httplib::Response & res) { + handle_embeddings_impl(req, res, OAICOMPAT_TYPE_EMBEDDING); + }; + + const auto handle_rerank = [&ctx_server, &res_error, &res_ok](const httplib::Request & req, httplib::Response & res) { + if (!ctx_server.params_base.reranking || ctx_server.params_base.embedding) { + res_error(res, format_error_response("This server does not support reranking. 
Start it with `--reranking` and without `--embedding`", ERROR_TYPE_NOT_SUPPORTED)); + return; + } + + const json body = json::parse(req.body); + + // TODO: implement + //int top_n = 1; + //if (body.count("top_n") != 1) { + // top_n = body.at("top_n"); + //} else { + // res_error(res, format_error_response("\"top_n\" must be provided", ERROR_TYPE_INVALID_REQUEST)); + // return; + //} + + // if true, use TEI API format, otherwise use Jina API format + // Jina: https://jina.ai/reranker/ + // TEI: https://huggingface.github.io/text-embeddings-inference/#/Text%20Embeddings%20Inference/rerank + bool is_tei_format = body.contains("texts"); + + json query; + if (body.count("query") == 1) { + query = body.at("query"); + if (!query.is_string()) { + res_error(res, format_error_response("\"query\" must be a string", ERROR_TYPE_INVALID_REQUEST)); + return; + } + } else { + res_error(res, format_error_response("\"query\" must be provided", ERROR_TYPE_INVALID_REQUEST)); + return; + } + + std::vector<std::string> documents = json_value(body, "documents", + json_value(body, "texts", std::vector<std::string>())); + if (documents.empty()) { + res_error(res, format_error_response("\"documents\" must be a non-empty string array", ERROR_TYPE_INVALID_REQUEST)); + return; + } + + llama_tokens tokenized_query = tokenize_input_prompts(ctx_server.vocab, query, /* add_special */ false, true)[0]; + + // create and queue the task + json responses = json::array(); + bool error = false; + std::unordered_set<int> task_ids; + { + std::vector<server_task> tasks; + std::vector<llama_tokens> tokenized_docs = tokenize_input_prompts(ctx_server.vocab, documents, /* add_special */ false, true); + tasks.reserve(tokenized_docs.size()); + for (size_t i = 0; i < tokenized_docs.size(); i++) { + server_task task = server_task(SERVER_TASK_TYPE_RERANK); + task.id = ctx_server.queue_tasks.get_new_id(); + task.index = i; + task.prompt_tokens = format_rerank(ctx_server.vocab, tokenized_query, tokenized_docs[i]); + tasks.push_back(std::move(task)); + } + + task_ids = server_task::get_list_id(tasks); + ctx_server.queue_results.add_waiting_tasks(tasks); + ctx_server.queue_tasks.post(std::move(tasks)); + } + + ctx_server.receive_multi_results(task_ids, [&](std::vector<server_task_result_ptr> & results) { + for (auto & res : results) { + GGML_ASSERT(dynamic_cast<server_task_result_rerank*>(res.get()) != nullptr); + responses.push_back(res->to_json()); + } + }, [&](const json & error_data) { + res_error(res, error_data); + error = true; + }, req.is_connection_closed); + + if (error) { + return; + } + + // write JSON response + json root = format_response_rerank( + body, + responses, + is_tei_format, + documents); + + res_ok(res, root); + }; + + const auto handle_lora_adapters_list = [&](const httplib::Request &, httplib::Response & res) { + json result = json::array(); + const auto & loras = ctx_server.params_base.lora_adapters; + for (size_t i = 0; i < loras.size(); ++i) { + auto & lora = loras[i]; + result.push_back({ + {"id", i}, + {"path", lora.path}, + {"scale", lora.scale}, + }); + } + res_ok(res, result); + res.status = 200; // HTTP OK + }; + + const auto handle_lora_adapters_apply = [&](const httplib::Request & req, httplib::Response & res) { + const json body = json::parse(req.body); + if (!body.is_array()) { + res_error(res, format_error_response("Request body must be an array", ERROR_TYPE_INVALID_REQUEST)); + return; + } + + int task_id = ctx_server.queue_tasks.get_new_id(); + { + server_task task(SERVER_TASK_TYPE_SET_LORA); + task.id 
= task_id; + task.set_lora = parse_lora_request(ctx_server.params_base.lora_adapters, body); + ctx_server.queue_results.add_waiting_task_id(task_id); + ctx_server.queue_tasks.post(std::move(task)); + } + + // get the result + server_task_result_ptr result = ctx_server.queue_results.recv(task_id); + ctx_server.queue_results.remove_waiting_task_id(task_id); + + if (result->is_error()) { + res_error(res, result->to_json()); + return; + } + + GGML_ASSERT(dynamic_cast<server_task_result_apply_lora*>(result.get()) != nullptr); + res_ok(res, result->to_json()); + }; + + // + // Router + // + + if (!params.webui) { + LOG_INF("Web UI is disabled\n"); + } else { + // register static assets routes + if (!params.public_path.empty()) { + // Set the base directory for serving static files + bool is_found = svr->set_mount_point("/", params.public_path); + if (!is_found) { + LOG_ERR("%s: static assets path not found: %s\n", __func__, params.public_path.c_str()); + return 1; + } + } else { + // using embedded static index.html + svr->Get("/", [](const httplib::Request & req, httplib::Response & res) { + if (req.get_header_value("Accept-Encoding").find("gzip") == std::string::npos) { + res.set_content("Error: gzip is not supported by this browser", "text/plain"); + } else { + res.set_header("Content-Encoding", "gzip"); + // COEP and COOP headers, required by pyodide (python interpreter) + res.set_header("Cross-Origin-Embedder-Policy", "require-corp"); + res.set_header("Cross-Origin-Opener-Policy", "same-origin"); + res.set_content(reinterpret_cast<const char*>(index_html_gz), index_html_gz_len, "text/html; charset=utf-8"); + } + return false; + }); + } + } + + // register API routes + svr->Get ("/health", handle_health); // public endpoint (no API key check) + svr->Get ("/metrics", handle_metrics); + svr->Get ("/props", handle_props); + svr->Post("/props", handle_props_change); + svr->Post("/api/show", handle_api_show); + svr->Get ("/models", handle_models); // public endpoint (no API key check) + svr->Get ("/v1/models", handle_models); // public endpoint (no API key check) + svr->Post("/completion", handle_completions); // legacy + svr->Post("/completions", handle_completions); + svr->Post("/v1/completions", handle_completions_oai); + svr->Post("/chat/completions", handle_chat_completions); + svr->Post("/v1/chat/completions", handle_chat_completions); + svr->Post("/infill", handle_infill); + svr->Post("/embedding", handle_embeddings); // legacy + svr->Post("/embeddings", handle_embeddings); + svr->Post("/v1/embeddings", handle_embeddings_oai); + svr->Post("/rerank", handle_rerank); + svr->Post("/reranking", handle_rerank); + svr->Post("/v1/rerank", handle_rerank); + svr->Post("/v1/reranking", handle_rerank); + svr->Post("/tokenize", handle_tokenize); + svr->Post("/detokenize", handle_detokenize); + svr->Post("/apply-template", handle_apply_template); + // LoRA adapters hotswap + svr->Get ("/lora-adapters", handle_lora_adapters_list); + svr->Post("/lora-adapters", handle_lora_adapters_apply); + // Save & load slots + svr->Get ("/slots", handle_slots); + svr->Post("/slots/:id_slot", handle_slots_action); + + // + // Start the server + // + if (params.n_threads_http < 1) { + // +2 threads for monitoring endpoints + params.n_threads_http = std::max(params.n_parallel + 2, (int32_t) std::thread::hardware_concurrency() - 1); + } + log_data["n_threads_http"] = std::to_string(params.n_threads_http); + svr->new_task_queue = [¶ms] { return new httplib::ThreadPool(params.n_threads_http); }; + + // clean up function, 
to be called before exit + auto clean_up = [&svr, &ctx_server]() { + SRV_INF("%s: cleaning up before exit...\n", __func__); + svr->stop(); + ctx_server.queue_results.terminate(); + llama_backend_free(); + }; + + bool was_bound = false; + if (string_ends_with(std::string(params.hostname), ".sock")) { + LOG_INF("%s: setting address family to AF_UNIX\n", __func__); + svr->set_address_family(AF_UNIX); + // bind_to_port requires a second arg, any value other than 0 should + // simply get ignored + was_bound = svr->bind_to_port(params.hostname, 8080); + } else { + LOG_INF("%s: binding port with default address family\n", __func__); + // bind HTTP listen port + if (params.port == 0) { + int bound_port = svr->bind_to_any_port(params.hostname); + if ((was_bound = (bound_port >= 0))) { + params.port = bound_port; + } + } else { + was_bound = svr->bind_to_port(params.hostname, params.port); + } + } + + if (!was_bound) { + LOG_ERR("%s: couldn't bind HTTP server socket, hostname: %s, port: %d\n", __func__, params.hostname.c_str(), params.port); + clean_up(); + return 1; + } + + // run the HTTP server in a thread + std::thread t([&]() { svr->listen_after_bind(); }); + svr->wait_until_ready(); + + LOG_INF("%s: HTTP server is listening, hostname: %s, port: %d, http threads: %d\n", __func__, params.hostname.c_str(), params.port, params.n_threads_http); + + // load the model + LOG_INF("%s: loading model\n", __func__); + + if (!ctx_server.load_model(params)) { + clean_up(); + t.join(); + LOG_ERR("%s: exiting due to model loading error\n", __func__); + return 1; + } + + ctx_server.init(); + state.store(SERVER_STATE_READY); + + LOG_INF("%s: model loaded\n", __func__); + + // print sample chat example to make it clear which template is used + LOG_INF("%s: chat template, chat_template: %s, example_format: '%s'\n", __func__, + common_chat_templates_source(ctx_server.chat_templates.get()), + common_chat_format_example(ctx_server.chat_templates.get(), ctx_server.params_base.use_jinja).c_str()); + + ctx_server.queue_tasks.on_new_task([&ctx_server](server_task && task) { + ctx_server.process_single_task(std::move(task)); + }); + + ctx_server.queue_tasks.on_update_slots([&ctx_server]() { + ctx_server.update_slots(); + }); + + shutdown_handler = [&](int) { + // this will unblock start_loop() + ctx_server.queue_tasks.terminate(); + }; + +#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__)) + struct sigaction sigint_action; + sigint_action.sa_handler = signal_handler; + sigemptyset (&sigint_action.sa_mask); + sigint_action.sa_flags = 0; + sigaction(SIGINT, &sigint_action, NULL); + sigaction(SIGTERM, &sigint_action, NULL); +#elif defined (_WIN32) + auto console_ctrl_handler = +[](DWORD ctrl_type) -> BOOL { + return (ctrl_type == CTRL_C_EVENT) ? (signal_handler(SIGINT), true) : false; + }; + SetConsoleCtrlHandler(reinterpret_cast<PHANDLER_ROUTINE>(console_ctrl_handler), true); +#endif + + LOG_INF("%s: server is listening on http://%s:%d - starting the main loop\n", __func__, params.hostname.c_str(), params.port); + + // this call blocks the main thread until queue_tasks.terminate() is called + ctx_server.queue_tasks.start_loop(); + + clean_up(); + t.join(); + + return 0; +} From 5b92c34a586e67c536d38f0c596b39c906629162 Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sat, 19 Apr 2025 12:08:37 -0700 Subject: [PATCH 45/73] Update server.cpp Added my fixes back in after merge from upstream. 
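The re-applied changes are the blocks marked llama-server-one START/END below: the
COSMOCC include of <cosmo.h>, the defer() -> defer_task() rename that keeps the Cosmo
STL happy, the extended model_meta() that also reports the general.* GGUF metadata
keys, the llamafile-inspired args-file handling at the top of main(), and the /chat
and /chat/ routes that serve the embedded web UI.

As an illustration only (not part of this diff), a llama-server-one-args file placed
next to the executable could look like the sketch below, assuming the same
one-argument-per-line layout that llamafile uses for its .args files and that
cosmo_args() reads; the model filename is just a placeholder:

    -m
    Mistral-7B-Instruct-v0.3.Q4_K_M.gguf
    --ctx-size
    8192

Arguments given on the command line override these, and both override any
/zip/default-args packaged inside the executable.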
Signed-off-by: Brad Hutchings <brad@componentx.com> --- examples/server/server.cpp | 132 +++++++++++++++++++++++++++++++++---- 1 file changed, 120 insertions(+), 12 deletions(-) diff --git a/examples/server/server.cpp b/examples/server/server.cpp index c580ec123299c..ac0df62172915 100644 --- a/examples/server/server.cpp +++ b/examples/server/server.cpp @@ -31,6 +31,12 @@ #include <unordered_map> #include <unordered_set> +// llama-server-one START +#ifdef COSMOCC +#include <cosmo.h> +#endif +// llama-server-one END + using json = nlohmann::ordered_json; constexpr int HTTP_POLLING_SECONDS = 1; @@ -1596,13 +1602,15 @@ struct server_queue { return 0; } + // llama-server-one START - defer() --> defer_task() to make Cosmo STL happy. // Add a new task, but defer until one slot is available - void defer(server_task && task) { + void defer_task(server_task task) { std::unique_lock<std::mutex> lock(mutex_tasks); QUE_DBG("defer task, id = %d\n", task.id); queue_tasks_deferred.push_back(std::move(task)); condition_tasks.notify_one(); } + // llama-server-one END // Get the next id for creating a new task int get_new_id() { @@ -2652,13 +2660,17 @@ struct server_context { if (slot == nullptr) { // if no slot is available, we defer this task for processing later SRV_DBG("no slot is available, defer task, id_task = %d\n", task.id); - queue_tasks.defer(std::move(task)); + // llama-server-one START + queue_tasks.defer_task(task); + // llama-server-one END break; } if (slot->is_processing()) { // if requested slot is unavailable, we defer this task for processing later SRV_DBG("requested slot is unavailable, defer task, id_task = %d\n", task.id); - queue_tasks.defer(std::move(task)); + // llama-server-one START + queue_tasks.defer_task(task); + // llama-server-one END break; } @@ -2741,7 +2753,9 @@ struct server_context { if (slot->is_processing()) { // if requested slot is unavailable, we defer this task for processing later SRV_DBG("requested slot is unavailable, defer task, id_task = %d\n", task.id); - queue_tasks.defer(std::move(task)); + // llama-server-one START + queue_tasks.defer_task(task); + // llama-server-one END break; } @@ -2777,7 +2791,9 @@ struct server_context { if (slot->is_processing()) { // if requested slot is unavailable, we defer this task for processing later SRV_DBG("requested slot is unavailable, defer task, id_task = %d\n", task.id); - queue_tasks.defer(std::move(task)); + // llama-server-one START + queue_tasks.defer_task(task); + // llama-server-one END break; } @@ -2820,7 +2836,9 @@ struct server_context { if (slot->is_processing()) { // if requested slot is unavailable, we defer this task for processing later SRV_DBG("requested slot is unavailable, defer task, id_task = %d\n", task.id); - queue_tasks.defer(std::move(task)); + // llama-server-one START + queue_tasks.defer_task(task); + // llama-server-one END break; } @@ -3402,15 +3420,51 @@ struct server_context { } json model_meta() const { + char general_architecture[64]; + char general_type[64]; + char general_name[64]; + char general_version[64]; + char general_finetune[64]; + char general_basename[64]; + char general_size_label[64]; + char general_license[64]; + + general_architecture[0] = 0; + general_type[0] = 0; + general_name[0] = 0; + general_version[0] = 0; + general_finetune[0] = 0; + general_basename[0] = 0; + general_size_label[0] = 0; + general_license[0] = 0; + + llama_model_meta_val_str(model, "general.architecture", general_architecture, 64); + llama_model_meta_val_str(model, "general.type", general_type, 64); 
+ llama_model_meta_val_str(model, "general.name", general_name, 64); + llama_model_meta_val_str(model, "general.version", general_version, 64); + llama_model_meta_val_str(model, "general.finetune", general_finetune, 64); + llama_model_meta_val_str(model, "general.basename", general_basename, 64); + llama_model_meta_val_str(model, "general.size_label", general_size_label, 64); + llama_model_meta_val_str(model, "general.license", general_license, 64); + return json { - {"vocab_type", llama_vocab_type (vocab)}, - {"n_vocab", llama_vocab_n_tokens (vocab)}, - {"n_ctx_train", llama_model_n_ctx_train(model)}, - {"n_embd", llama_model_n_embd (model)}, - {"n_params", llama_model_n_params (model)}, - {"size", llama_model_size (model)}, + {"vocab_type", llama_vocab_type (vocab)}, + {"n_vocab", llama_vocab_n_tokens (vocab)}, + {"n_ctx_train", llama_n_ctx_train (model)}, + {"n_embd", llama_n_embd (model)}, + {"n_params", llama_model_n_params (model)}, + {"size", llama_model_size (model)}, + {"general.architecture", general_architecture }, + {"general.type", general_type }, + {"general.name", general_name }, + {"general.version", general_version }, + {"general.finetune", general_finetune }, + {"general.basename", general_basename }, + {"general.size_label", general_size_label }, + {"general.license", general_license }, }; } + // llama-server-one END }; static void log_server_request(const httplib::Request & req, const httplib::Response & res) { @@ -3442,6 +3496,40 @@ inline void signal_handler(int signal) { } int main(int argc, char ** argv) { + // llama-server-one START + // This implements an args file feature inspired by llamafile's. + #ifdef COSMOCC + // Keep the build from showing up as ape in the process list. + pthread_setname_np(pthread_self(), "llama-server-one"); + + // Args files if present. The names are different to remove confusion during packaging. + const std::string& argsFilename = "llama-server-one-args"; + const std::string& zipArgsFilename = "/zip/default-args"; + struct stat buffer; + + // At this point, argc, argv represent: + // command (User supplied args) + + if (stat (argsFilename.c_str(), &buffer) == 0) { + argc = cosmo_args(argsFilename.c_str(), &argv); + } + + // At this point, argc, argv represent: + // command (argsFilename args) (User supplied args) + + if (stat (zipArgsFilename.c_str(), &buffer) == 0) { + argc = cosmo_args(zipArgsFilename.c_str(), &argv); + } + + // At this point, argc, argv represent: + // command (zipArgsFilename args) (argsFilename args) (User supplied args) + + // Yep, this is counterintuitive, but how the cosmo_args command works. + // argsFilename args override zipArgsFilename file args. + // User supplied args override argsFilename and zipArgsFilename args. 
+ #endif + // llama-server-one END + // own arguments required by this example common_params params; @@ -4500,6 +4588,26 @@ int main(int argc, char ** argv) { } } + // llama-server-one START + svr->Get("/chat", [](const httplib::Request & req, httplib::Response & res) { + if (req.get_header_value("Accept-Encoding").find("gzip") == std::string::npos) { + res.set_content("Error: gzip is not supported by this browser", "text/plain"); + } else { + res.set_header("Content-Encoding", "gzip"); + // COEP and COOP headers, required by pyodide (python interpreter) + res.set_header("Cross-Origin-Embedder-Policy", "require-corp"); + res.set_header("Cross-Origin-Opener-Policy", "same-origin"); + res.set_content(reinterpret_cast<const char*>(index_html_gz), index_html_gz_len, "text/html; charset=utf-8"); + } + return false; + }); + + svr->Get("/chat/", [](const httplib::Request &, httplib::Response & res) { + res.set_redirect("/chat"); + return false; + }); + // llama-server-one END + // register API routes svr->Get ("/health", handle_health); // public endpoint (no API key check) svr->Get ("/metrics", handle_metrics); From 9cfcb3636b1e0f5c229481c13ce3cefb4462fe8f Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sat, 19 Apr 2025 12:09:22 -0700 Subject: [PATCH 46/73] Update server-ls1.cpp Fresh copy for next merge reconcile. Signed-off-by: Brad Hutchings <brad@componentx.com> --- examples/server/server-ls1.cpp | 234 ++++++++++++++++++--------------- 1 file changed, 126 insertions(+), 108 deletions(-) diff --git a/examples/server/server-ls1.cpp b/examples/server/server-ls1.cpp index daeff8927a644..ac0df62172915 100644 --- a/examples/server/server-ls1.cpp +++ b/examples/server/server-ls1.cpp @@ -1558,29 +1558,30 @@ struct server_queue { std::condition_variable condition_tasks; // callback functions - std::function<void(server_task)> callback_new_task; - std::function<void(void)> callback_update_slots; + std::function<void(server_task &&)> callback_new_task; + std::function<void(void)> callback_update_slots; // Add a new task to the end of the queue - int post(server_task task, bool front = false) { + int post(server_task && task, bool front = false) { std::unique_lock<std::mutex> lock(mutex_tasks); GGML_ASSERT(task.id != -1); // if this is cancel task make sure to clean up pending tasks if (task.type == SERVER_TASK_TYPE_CANCEL) { cleanup_pending_task(task.id_target); } - QUE_DBG("new task, id = %d, front = %d\n", task.id, front); + const int task_id = task.id; + QUE_DBG("new task, id = %d, front = %d\n", task_id, front); if (front) { queue_tasks.push_front(std::move(task)); } else { queue_tasks.push_back(std::move(task)); } condition_tasks.notify_one(); - return task.id; + return task_id; } // multi-task version of post() - int post(std::vector<server_task> & tasks, bool front = false) { + int post(std::vector<server_task> && tasks, bool front = false) { std::unique_lock<std::mutex> lock(mutex_tasks); for (auto & task : tasks) { if (task.id == -1) { @@ -1619,7 +1620,7 @@ struct server_queue { } // Register function to process a new task - void on_new_task(std::function<void(server_task)> callback) { + void on_new_task(std::function<void(server_task &&)> callback) { callback_new_task = std::move(callback); } @@ -1668,7 +1669,7 @@ struct server_queue { lock.unlock(); break; } - server_task task = queue_tasks.front(); + server_task task = std::move(queue_tasks.front()); queue_tasks.pop_front(); lock.unlock(); @@ -2012,7 +2013,7 @@ struct server_context { slot.reset(); - 
slots.push_back(slot); + slots.push_back(std::move(slot)); } default_generation_settings_for_props = slots[0].to_json(); @@ -2113,7 +2114,7 @@ struct server_context { return true; } - bool launch_slot_with_task(server_slot & slot, const server_task & task) { + bool launch_slot_with_task(server_slot & slot, server_task && task) { slot.reset(); slot.id_task = task.id; slot.index = task.index; @@ -2121,10 +2122,10 @@ struct server_context { slot.params = std::move(task.params); slot.prompt_tokens = std::move(task.prompt_tokens); - if (!are_lora_equal(task.params.lora, slot.lora)) { + if (!are_lora_equal(slot.params.lora, slot.lora)) { // if lora is changed, we cannot reuse cached tokens slot.cache_tokens.clear(); - slot.lora = task.params.lora; + slot.lora = slot.params.lora; } bool can_detokenize = can_be_detokenized(ctx, slot.prompt_tokens); @@ -2555,10 +2556,10 @@ struct server_context { server_task task(SERVER_TASK_TYPE_CANCEL); task.id_target = id_task; queue_results.remove_waiting_task_id(id_task); - cancel_tasks.push_back(task); + cancel_tasks.push_back(std::move(task)); } // push to beginning of the queue, so it has highest priority - queue_tasks.post(cancel_tasks, true); + queue_tasks.post(std::move(cancel_tasks), true); } // receive the results from task(s) @@ -2645,7 +2646,7 @@ struct server_context { // Functions to process the task // - void process_single_task(server_task task) { + void process_single_task(server_task && task) { switch (task.type) { case SERVER_TASK_TYPE_COMPLETION: case SERVER_TASK_TYPE_INFILL: @@ -2673,7 +2674,7 @@ struct server_context { break; } - if (!launch_slot_with_task(*slot, task)) { + if (!launch_slot_with_task(*slot, std::move(task))) { SRV_ERR("failed to launch slot with task, id_task = %d\n", task.id); break; } @@ -2889,7 +2890,7 @@ struct server_context { server_task task(SERVER_TASK_TYPE_NEXT_RESPONSE); task.id = queue_tasks.get_new_id(); - queue_tasks.post(task); + queue_tasks.post(std::move(task)); } // apply context-shift if needed @@ -3418,7 +3419,6 @@ struct server_context { SRV_DBG("%s", "run slots completed\n"); } - // llama-server-one START json model_meta() const { char general_architecture[64]; char general_type[64]; @@ -3722,14 +3722,17 @@ int main(int argc, char ** argv) { } // request slots data using task queue - server_task task(SERVER_TASK_TYPE_METRICS); - task.id = ctx_server.queue_tasks.get_new_id(); - ctx_server.queue_results.add_waiting_task_id(task.id); - ctx_server.queue_tasks.post(task, true); // high-priority task + int task_id = ctx_server.queue_tasks.get_new_id(); + { + server_task task(SERVER_TASK_TYPE_METRICS); + task.id = task_id; + ctx_server.queue_results.add_waiting_task_id(task_id); + ctx_server.queue_tasks.post(std::move(task), true); // high-priority task + } // get the result - server_task_result_ptr result = ctx_server.queue_results.recv(task.id); - ctx_server.queue_results.remove_waiting_task_id(task.id); + server_task_result_ptr result = ctx_server.queue_results.recv(task_id); + ctx_server.queue_results.remove_waiting_task_id(task_id); if (result->is_error()) { res_error(res, result->to_json()); @@ -3758,16 +3761,17 @@ int main(int argc, char ** argv) { } // request slots data using task queue - server_task task(SERVER_TASK_TYPE_METRICS); - task.id = ctx_server.queue_tasks.get_new_id(); - task.metrics_reset_bucket = true; - - ctx_server.queue_results.add_waiting_task_id(task.id); - ctx_server.queue_tasks.post(task, true); // high-priority task + int task_id = ctx_server.queue_tasks.get_new_id(); + { + 
server_task task(SERVER_TASK_TYPE_METRICS); + task.id = task_id; + ctx_server.queue_results.add_waiting_task_id(task_id); + ctx_server.queue_tasks.post(std::move(task), true); // high-priority task + } // get the result - server_task_result_ptr result = ctx_server.queue_results.recv(task.id); - ctx_server.queue_results.remove_waiting_task_id(task.id); + server_task_result_ptr result = ctx_server.queue_results.recv(task_id); + ctx_server.queue_results.remove_waiting_task_id(task_id); if (result->is_error()) { res_error(res, result->to_json()); @@ -3864,17 +3868,20 @@ int main(int argc, char ** argv) { } std::string filepath = params.slot_save_path + filename; - server_task task(SERVER_TASK_TYPE_SLOT_SAVE); - task.id = ctx_server.queue_tasks.get_new_id(); - task.slot_action.slot_id = id_slot; - task.slot_action.filename = filename; - task.slot_action.filepath = filepath; + int task_id = ctx_server.queue_tasks.get_new_id(); + { + server_task task(SERVER_TASK_TYPE_SLOT_SAVE); + task.id = task_id; + task.slot_action.slot_id = id_slot; + task.slot_action.filename = filename; + task.slot_action.filepath = filepath; - ctx_server.queue_results.add_waiting_task_id(task.id); - ctx_server.queue_tasks.post(task); + ctx_server.queue_results.add_waiting_task_id(task_id); + ctx_server.queue_tasks.post(std::move(task)); + } - server_task_result_ptr result = ctx_server.queue_results.recv(task.id); - ctx_server.queue_results.remove_waiting_task_id(task.id); + server_task_result_ptr result = ctx_server.queue_results.recv(task_id); + ctx_server.queue_results.remove_waiting_task_id(task_id); if (result->is_error()) { res_error(res, result->to_json()); @@ -3893,17 +3900,20 @@ int main(int argc, char ** argv) { } std::string filepath = params.slot_save_path + filename; - server_task task(SERVER_TASK_TYPE_SLOT_RESTORE); - task.id = ctx_server.queue_tasks.get_new_id(); - task.slot_action.slot_id = id_slot; - task.slot_action.filename = filename; - task.slot_action.filepath = filepath; + int task_id = ctx_server.queue_tasks.get_new_id(); + { + server_task task(SERVER_TASK_TYPE_SLOT_RESTORE); + task.id = task_id; + task.slot_action.slot_id = id_slot; + task.slot_action.filename = filename; + task.slot_action.filepath = filepath; - ctx_server.queue_results.add_waiting_task_id(task.id); - ctx_server.queue_tasks.post(task); + ctx_server.queue_results.add_waiting_task_id(task_id); + ctx_server.queue_tasks.post(std::move(task)); + } - server_task_result_ptr result = ctx_server.queue_results.recv(task.id); - ctx_server.queue_results.remove_waiting_task_id(task.id); + server_task_result_ptr result = ctx_server.queue_results.recv(task_id); + ctx_server.queue_results.remove_waiting_task_id(task_id); if (result->is_error()) { res_error(res, result->to_json()); @@ -3915,15 +3925,18 @@ int main(int argc, char ** argv) { }; const auto handle_slots_erase = [&ctx_server, &res_error, &res_ok](const httplib::Request & /* req */, httplib::Response & res, int id_slot) { - server_task task(SERVER_TASK_TYPE_SLOT_ERASE); - task.id = ctx_server.queue_tasks.get_new_id(); - task.slot_action.slot_id = id_slot; + int task_id = ctx_server.queue_tasks.get_new_id(); + { + server_task task(SERVER_TASK_TYPE_SLOT_ERASE); + task.id = task_id; + task.slot_action.slot_id = id_slot; - ctx_server.queue_results.add_waiting_task_id(task.id); - ctx_server.queue_tasks.post(task); + ctx_server.queue_results.add_waiting_task_id(task_id); + ctx_server.queue_tasks.post(std::move(task)); + } - server_task_result_ptr result = 
ctx_server.queue_results.recv(task.id); - ctx_server.queue_results.remove_waiting_task_id(task.id); + server_task_result_ptr result = ctx_server.queue_results.recv(task_id); + ctx_server.queue_results.remove_waiting_task_id(task_id); if (result->is_error()) { res_error(res, result->to_json()); @@ -4027,9 +4040,10 @@ int main(int argc, char ** argv) { } auto completion_id = gen_chatcmplid(); - std::vector<server_task> tasks; - + std::unordered_set<int> task_ids; try { + std::vector<server_task> tasks; + const auto & prompt = data.at("prompt"); // TODO: this log can become very long, put it behind a flag or think about a more compact format //SRV_DBG("Prompt: %s\n", prompt.is_string() ? prompt.get<std::string>().c_str() : prompt.dump(2).c_str()); @@ -4044,9 +4058,9 @@ int main(int argc, char ** argv) { task.prompt_tokens = std::move(tokenized_prompts[i]); task.params = server_task::params_from_json_cmpl( - ctx_server.ctx, - ctx_server.params_base, - data); + ctx_server.ctx, + ctx_server.params_base, + data); task.id_selected_slot = json_value(data, "id_slot", -1); // OAI-compat @@ -4054,18 +4068,18 @@ int main(int argc, char ** argv) { task.params.oaicompat_cmpl_id = completion_id; // oaicompat_model is already populated by params_from_json_cmpl - tasks.push_back(task); + tasks.push_back(std::move(task)); } + + task_ids = server_task::get_list_id(tasks); + ctx_server.queue_results.add_waiting_tasks(tasks); + ctx_server.queue_tasks.post(std::move(tasks)); } catch (const std::exception & e) { res_error(res, format_error_response(e.what(), ERROR_TYPE_INVALID_REQUEST)); return; } - ctx_server.queue_results.add_waiting_tasks(tasks); - ctx_server.queue_tasks.post(tasks); - bool stream = json_value(data, "stream", false); - const auto task_ids = server_task::get_list_id(tasks); if (!stream) { ctx_server.receive_multi_results(task_ids, [&](std::vector<server_task_result_ptr> & results) { @@ -4357,6 +4371,7 @@ int main(int argc, char ** argv) { // create and queue the task json responses = json::array(); bool error = false; + std::unordered_set<int> task_ids; { std::vector<server_task> tasks; for (size_t i = 0; i < tokenized_prompts.size(); i++) { @@ -4369,27 +4384,26 @@ int main(int argc, char ** argv) { // OAI-compat task.params.oaicompat = oaicompat; - tasks.push_back(task); + tasks.push_back(std::move(task)); } + task_ids = server_task::get_list_id(tasks); ctx_server.queue_results.add_waiting_tasks(tasks); - ctx_server.queue_tasks.post(tasks); - - // get the result - std::unordered_set<int> task_ids = server_task::get_list_id(tasks); + ctx_server.queue_tasks.post(std::move(tasks)); + } - ctx_server.receive_multi_results(task_ids, [&](std::vector<server_task_result_ptr> & results) { - for (auto & res : results) { - GGML_ASSERT(dynamic_cast<server_task_result_embd*>(res.get()) != nullptr); - responses.push_back(res->to_json()); - } - }, [&](const json & error_data) { - res_error(res, error_data); - error = true; - }, req.is_connection_closed); + // get the result + ctx_server.receive_multi_results(task_ids, [&](std::vector<server_task_result_ptr> & results) { + for (auto & res : results) { + GGML_ASSERT(dynamic_cast<server_task_result_embd*>(res.get()) != nullptr); + responses.push_back(res->to_json()); + } + }, [&](const json & error_data) { + res_error(res, error_data); + error = true; + }, req.is_connection_closed); - ctx_server.queue_results.remove_waiting_task_ids(task_ids); - } + ctx_server.queue_results.remove_waiting_task_ids(task_ids); if (error) { return; @@ -4456,6 +4470,7 @@ int 
main(int argc, char ** argv) { // create and queue the task json responses = json::array(); bool error = false; + std::unordered_set<int> task_ids; { std::vector<server_task> tasks; std::vector<llama_tokens> tokenized_docs = tokenize_input_prompts(ctx_server.vocab, documents, /* add_special */ false, true); @@ -4465,26 +4480,24 @@ int main(int argc, char ** argv) { task.id = ctx_server.queue_tasks.get_new_id(); task.index = i; task.prompt_tokens = format_rerank(ctx_server.vocab, tokenized_query, tokenized_docs[i]); - tasks.push_back(task); + tasks.push_back(std::move(task)); } + task_ids = server_task::get_list_id(tasks); ctx_server.queue_results.add_waiting_tasks(tasks); - ctx_server.queue_tasks.post(tasks); - - // get the result - std::unordered_set<int> task_ids = server_task::get_list_id(tasks); - - ctx_server.receive_multi_results(task_ids, [&](std::vector<server_task_result_ptr> & results) { - for (auto & res : results) { - GGML_ASSERT(dynamic_cast<server_task_result_rerank*>(res.get()) != nullptr); - responses.push_back(res->to_json()); - } - }, [&](const json & error_data) { - res_error(res, error_data); - error = true; - }, req.is_connection_closed); + ctx_server.queue_tasks.post(std::move(tasks)); } + ctx_server.receive_multi_results(task_ids, [&](std::vector<server_task_result_ptr> & results) { + for (auto & res : results) { + GGML_ASSERT(dynamic_cast<server_task_result_rerank*>(res.get()) != nullptr); + responses.push_back(res->to_json()); + } + }, [&](const json & error_data) { + res_error(res, error_data); + error = true; + }, req.is_connection_closed); + if (error) { return; } @@ -4520,14 +4533,19 @@ int main(int argc, char ** argv) { res_error(res, format_error_response("Request body must be an array", ERROR_TYPE_INVALID_REQUEST)); return; } - server_task task(SERVER_TASK_TYPE_SET_LORA); - task.id = ctx_server.queue_tasks.get_new_id(); - task.set_lora = parse_lora_request(ctx_server.params_base.lora_adapters, body); - ctx_server.queue_results.add_waiting_task_id(task.id); - ctx_server.queue_tasks.post(task); - server_task_result_ptr result = ctx_server.queue_results.recv(task.id); - ctx_server.queue_results.remove_waiting_task_id(task.id); + int task_id = ctx_server.queue_tasks.get_new_id(); + { + server_task task(SERVER_TASK_TYPE_SET_LORA); + task.id = task_id; + task.set_lora = parse_lora_request(ctx_server.params_base.lora_adapters, body); + ctx_server.queue_results.add_waiting_task_id(task_id); + ctx_server.queue_tasks.post(std::move(task)); + } + + // get the result + server_task_result_ptr result = ctx_server.queue_results.recv(task_id); + ctx_server.queue_results.remove_waiting_task_id(task_id); if (result->is_error()) { res_error(res, result->to_json()); @@ -4570,7 +4588,7 @@ int main(int argc, char ** argv) { } } - // llama-server-one START + // llama-server-one START svr->Get("/chat", [](const httplib::Request & req, httplib::Response & res) { if (req.get_header_value("Accept-Encoding").find("gzip") == std::string::npos) { res.set_content("Error: gzip is not supported by this browser", "text/plain"); @@ -4691,8 +4709,8 @@ int main(int argc, char ** argv) { common_chat_templates_source(ctx_server.chat_templates.get()), common_chat_format_example(ctx_server.chat_templates.get(), ctx_server.params_base.use_jinja).c_str()); - ctx_server.queue_tasks.on_new_task([&ctx_server](const server_task & task) { - ctx_server.process_single_task(task); + ctx_server.queue_tasks.on_new_task([&ctx_server](server_task && task) { + ctx_server.process_single_task(std::move(task)); }); 
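+    // Illustrative sketch (not part of the change above): the pattern this refactor
+    // applies throughout is to move server_task objects instead of copying them,
+    // capturing any field that is still needed *before* the move, e.g.:
+    //
+    //     const int task_id = task.id;                    // read before moving
+    //     ctx_server.queue_tasks.post(std::move(task));   // task is moved-from after this
+    //     auto result = ctx_server.queue_results.recv(task_id);  // use the saved id
+    //
+    // assuming nothing else reads from `task` once it has been posted.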
ctx_server.queue_tasks.on_update_slots([&ctx_server]() { From b963568c5c0e5c760c98d27ed5de2a0f831b8aad Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sat, 19 Apr 2025 13:14:17 -0700 Subject: [PATCH 47/73] Update Configuring-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Configuring-ls1-Brads-Env.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/Configuring-ls1-Brads-Env.md b/docs/Configuring-ls1-Brads-Env.md index d7882023198dc..ce5df78af1335 100644 --- a/docs/Configuring-ls1-Brads-Env.md +++ b/docs/Configuring-ls1-Brads-Env.md @@ -184,5 +184,6 @@ Congratulations! You are ready to copy `llams-server-one` executable to the shar ``` sudo cp llama-server-one /mnt/hyperv/Mmojo-Raspberry-Pi/Mmojo-LLMs +sudo cp llama-server-one /mnt/hyperv/Mmojo-Raspberry-Pi/Mmojo-LLMs/llama-server-one.exe printf "\n**********\n*\n* FINISHED: Copy llama-server-one for Deployment.\n*\n**********\n\n" ``` From 77bb344e1b607476b976715f9a31c8ff2e66039f Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sat, 19 Apr 2025 20:26:23 -0700 Subject: [PATCH 48/73] Update Configuring-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Configuring-ls1-Brads-Env.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/docs/Configuring-ls1-Brads-Env.md b/docs/Configuring-ls1-Brads-Env.md index ce5df78af1335..a501b02eb5824 100644 --- a/docs/Configuring-ls1-Brads-Env.md +++ b/docs/Configuring-ls1-Brads-Env.md @@ -84,6 +84,15 @@ Verify that the archive has your website: unzip -l $LLAMA_SERVER_ONE_ZIP printf "\n**********\n*\n* FINISHED: Verify website Directory in Archive.\n*\n**********\n\n" ``` + +#### Add Certs to Archive +mkdir certs +cp /mnt/hyperv/Mmojo-certs/dot-local.crt certs +cp /mnt/hyperv/Mmojo-certs/dot-local.key certs +cp /mnt/hyperv/Mmojo-certs/selfsignCA.crt certs +zip -0 -r $LLAMA_SERVER_ONE_ZIP certs/* +printf "\n**********\n*\n* FINISHED: Add Certs to Archive.\n*\n**********\n\n" + --- ### Create default-args File @@ -106,6 +115,10 @@ model.gguf 8 --path /zip/website +--ssl-key-file +/zip/certs/dot-local.key +--ssl-cert-file +/zip/certs/dot-local.crt ... EOF printf "\n**********\n*\n* FINISHED: Create Default args File.\n*\n**********\n\n" From 0210119ec83df0246ace8aa74e65740786e77549 Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sat, 19 Apr 2025 20:38:45 -0700 Subject: [PATCH 49/73] Update Configuring-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Configuring-ls1-Brads-Env.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/docs/Configuring-ls1-Brads-Env.md b/docs/Configuring-ls1-Brads-Env.md index a501b02eb5824..f465ae25ac83c 100644 --- a/docs/Configuring-ls1-Brads-Env.md +++ b/docs/Configuring-ls1-Brads-Env.md @@ -85,13 +85,17 @@ unzip -l $LLAMA_SERVER_ONE_ZIP printf "\n**********\n*\n* FINISHED: Verify website Directory in Archive.\n*\n**********\n\n" ``` -#### Add Certs to Archive +--- +### Add Certs to Archive +Add self-signed certs to the archive. CA crt might need to go in the website folder? 
+``` mkdir certs cp /mnt/hyperv/Mmojo-certs/dot-local.crt certs cp /mnt/hyperv/Mmojo-certs/dot-local.key certs cp /mnt/hyperv/Mmojo-certs/selfsignCA.crt certs zip -0 -r $LLAMA_SERVER_ONE_ZIP certs/* printf "\n**********\n*\n* FINISHED: Add Certs to Archive.\n*\n**********\n\n" +``` --- ### Create default-args File From e9043db1611152039afcd560b3bccd3d121ca1b8 Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sat, 19 Apr 2025 20:41:58 -0700 Subject: [PATCH 50/73] Update Buidling-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Buidling-ls1-Brads-Env.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/Buidling-ls1-Brads-Env.md b/docs/Buidling-ls1-Brads-Env.md index aa0a8c59230d5..bd26bfbaaac4d 100644 --- a/docs/Buidling-ls1-Brads-Env.md +++ b/docs/Buidling-ls1-Brads-Env.md @@ -10,6 +10,7 @@ This file contains instructions for building `llama.cpp` with `cosmocc` to yield Let's define some environment variables: ``` BUILDING_DIR="1-BUILDING-llama.cpp" +LLAMA_SERVER_SSL=1 printf "\n**********\n*\n* FINISHED: Environment Variables.\n*\n**********\n\n" ``` From 36c3fb197e78ef030202ac9aa5b2598ff82049c4 Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sat, 19 Apr 2025 21:08:27 -0700 Subject: [PATCH 51/73] Update Buidling-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Buidling-ls1-Brads-Env.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Buidling-ls1-Brads-Env.md b/docs/Buidling-ls1-Brads-Env.md index bd26bfbaaac4d..8d509309c1958 100644 --- a/docs/Buidling-ls1-Brads-Env.md +++ b/docs/Buidling-ls1-Brads-Env.md @@ -10,7 +10,6 @@ This file contains instructions for building `llama.cpp` with `cosmocc` to yield Let's define some environment variables: ``` BUILDING_DIR="1-BUILDING-llama.cpp" -LLAMA_SERVER_SSL=1 printf "\n**********\n*\n* FINISHED: Environment Variables.\n*\n**********\n\n" ``` @@ -61,6 +60,7 @@ We use the old `Makefile` rather than CMake. We've updated the `Makefile` in thi ``` cd ~/$BUILDING_DIR export LLAMA_MAKEFILE=1 +export LLAMA_SERVER_SSL=1 make clean make printf "\n**********\n*\n* FINISHED: Make llama.cpp.\n*\n**********\n\n" From a1f1d3b827d7ed7f1a4ef10b3eb3ff7e8f29e20a Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sat, 19 Apr 2025 22:19:08 -0700 Subject: [PATCH 52/73] Update Buidling-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Buidling-ls1-Brads-Env.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/Buidling-ls1-Brads-Env.md b/docs/Buidling-ls1-Brads-Env.md index 8d509309c1958..f0e642a4f92ea 100644 --- a/docs/Buidling-ls1-Brads-Env.md +++ b/docs/Buidling-ls1-Brads-Env.md @@ -60,7 +60,8 @@ We use the old `Makefile` rather than CMake. We've updated the `Makefile` in thi ``` cd ~/$BUILDING_DIR export LLAMA_MAKEFILE=1 -export LLAMA_SERVER_SSL=1 +# LLAMA_SERVER_SSL doesn't work with cosmocc yet. 
+# export LLAMA_SERVER_SSL=ON make clean make printf "\n**********\n*\n* FINISHED: Make llama.cpp.\n*\n**********\n\n" From 09f5c03c24f08c9c389d40136d6641803fd93262 Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sat, 19 Apr 2025 22:21:08 -0700 Subject: [PATCH 53/73] Update Configuring-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Configuring-ls1-Brads-Env.md | 37 +++++++++++++++++++------------ 1 file changed, 23 insertions(+), 14 deletions(-) diff --git a/docs/Configuring-ls1-Brads-Env.md b/docs/Configuring-ls1-Brads-Env.md index f465ae25ac83c..350ede4ed2521 100644 --- a/docs/Configuring-ls1-Brads-Env.md +++ b/docs/Configuring-ls1-Brads-Env.md @@ -63,6 +63,27 @@ unzip -l $LLAMA_SERVER_ONE_ZIP printf "\n**********\n*\n* FINISHED: Verify Contents of Zip Archive.\n*\n**********\n\n" ``` +--- +### Add Certs to Archive + +**SSL DOESN'T WORK WITH COSMOCC YET.** Add self-signed certs to the archive. CA crt might need to go in the website folder? +``` +mkdir certs +cp /mnt/hyperv/Mmojo-certs/dot-local.crt certs +cp /mnt/hyperv/Mmojo-certs/dot-local.key certs +cp /mnt/hyperv/Mmojo-certs/selfsignCA.crt certs +zip -0 -r $LLAMA_SERVER_ONE_ZIP certs/* +printf "\n**********\n*\n* FINISHED: Add Certs to Archive.\n*\n**********\n\n" +``` + +#### Verify certs Directory in Archive + +Verify that the archive has your certs: +``` +unzip -l $LLAMA_SERVER_ONE_ZIP +printf "\n**********\n*\n* FINISHED: Verify certs Directory in Archive.\n*\n**********\n\n" +``` + --- ### Create website Directory in Archive @@ -85,18 +106,6 @@ unzip -l $LLAMA_SERVER_ONE_ZIP printf "\n**********\n*\n* FINISHED: Verify website Directory in Archive.\n*\n**********\n\n" ``` ---- -### Add Certs to Archive -Add self-signed certs to the archive. CA crt might need to go in the website folder? -``` -mkdir certs -cp /mnt/hyperv/Mmojo-certs/dot-local.crt certs -cp /mnt/hyperv/Mmojo-certs/dot-local.key certs -cp /mnt/hyperv/Mmojo-certs/selfsignCA.crt certs -zip -0 -r $LLAMA_SERVER_ONE_ZIP certs/* -printf "\n**********\n*\n* FINISHED: Add Certs to Archive.\n*\n**********\n\n" -``` - --- ### Create default-args File @@ -119,9 +128,9 @@ model.gguf 8 --path /zip/website ---ssl-key-file +--ssl-key-file-xx /zip/certs/dot-local.key ---ssl-cert-file +--ssl-cert-file-xx /zip/certs/dot-local.crt ... EOF From 4fbe49d39943f70cba1fe31fc16cddf1db55b23d Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sat, 19 Apr 2025 22:33:36 -0700 Subject: [PATCH 54/73] Update README.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 7e33723360242..fb9bff71da939 100644 --- a/README.md +++ b/README.md @@ -74,4 +74,5 @@ In no particular order of importance, these are the things that bother me: - The args thing is cute, but it might be easier as a yaml file. Key value pairs. Flags can be keys with null values. - The `--ctx-size` parameter doesn't seem quite right given that new models have the training (or max) context size in their metadata. That size should be used subject to a maximum in a passed parameter. E.g. So a 128K model can run comfortably on a smaller device. - Write docs for a Deploying step. It should address the args file, removing the extra executable depending on platform, models, host, port. context size. -- Make a `.gitattributes` file so we can set the default file to be displayed and keep the README.md from llama.cpp. 
This will help in syncing changes continually from upstream. Reference: https://git-scm.com/docs/gitattributes +- ~~Make a `.gitattributes` file so we can set the default file to be displayed and keep the README.md from llama.cpp. This will help in syncing changes continually from upstream. Reference: https://git-scm.com/docs/gitattributes~~ -- This doesn't actually work. +- Cosmo needs libssl and libcrypto. Building these from scratch gets an error about Cosco not liking assembly files. Sort this out. From c44f98f10264c50251b3f7a65e24c05760c121a9 Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sat, 19 Apr 2025 22:41:13 -0700 Subject: [PATCH 55/73] Update Configuring-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Configuring-ls1-Brads-Env.md | 23 +++++++++++++++++++++-- 1 file changed, 21 insertions(+), 2 deletions(-) diff --git a/docs/Configuring-ls1-Brads-Env.md b/docs/Configuring-ls1-Brads-Env.md index 350ede4ed2521..bc96b315b073e 100644 --- a/docs/Configuring-ls1-Brads-Env.md +++ b/docs/Configuring-ls1-Brads-Env.md @@ -128,9 +128,28 @@ model.gguf 8 --path /zip/website ---ssl-key-file-xx +... +EOF +printf "\n**********\n*\n* FINISHED: Create Default args File.\n*\n**********\n\n" +``` + +``` +cat << EOF > $DEFAULT_ARGS +-m +model.gguf +--host +127.0.0.1 +--port +8080 +--ctx-size +8192 +--threads-http +8 +--path +/zip/website +--ssl-key-file /zip/certs/dot-local.key ---ssl-cert-file-xx +--ssl-cert-file /zip/certs/dot-local.crt ... EOF From 998d354717da881d20dccb743d680ba06f603e15 Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sun, 20 Apr 2025 07:52:18 -0700 Subject: [PATCH 56/73] Update Buidling-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Buidling-ls1-Brads-Env.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/Buidling-ls1-Brads-Env.md b/docs/Buidling-ls1-Brads-Env.md index f0e642a4f92ea..257ed4b02930c 100644 --- a/docs/Buidling-ls1-Brads-Env.md +++ b/docs/Buidling-ls1-Brads-Env.md @@ -103,6 +103,7 @@ export CC="cosmocc -I$(pwd)/cosmocc/include -L$(pwd)/cosmocc/lib" export CXX="cosmocc -I$(pwd)/cosmocc/include \ -I$(pwd)/cosmocc/include/third_party/libcxx \ -L$(pwd)/cosmocc/lib" +export AR="cosmoar" export UNAME_S="cosmocc" export UNAME_P="cosmocc" export UNAME_M="cosmocc" From c79ab107f08f54a4da81a660bf2be8addf207565 Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sun, 20 Apr 2025 10:01:58 -0700 Subject: [PATCH 57/73] Update Configuring-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Configuring-ls1-Brads-Env.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/Configuring-ls1-Brads-Env.md b/docs/Configuring-ls1-Brads-Env.md index bc96b315b073e..f39af42664020 100644 --- a/docs/Configuring-ls1-Brads-Env.md +++ b/docs/Configuring-ls1-Brads-Env.md @@ -91,6 +91,7 @@ printf "\n**********\n*\n* FINISHED: Verify certs Directory in Archive.\n*\n**** ``` mkdir website cp -r /mnt/hyperv/web-apps/completion-tool/* website +cp /mnt/hyperv/Mmojo-certs/selfsignCA.crt website/CA.crt rm website/*.txt rm website/images/*.svg rm website/images/*.psd From 9e0a1c3281a42209bf84552485a05f367bea789d Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sun, 20 Apr 2025 12:41:46 -0700 Subject: [PATCH 58/73] Update Configuring-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Configuring-ls1-Brads-Env.md | 36 ++++++------------------------- 1 file changed, 7 insertions(+), 29 
deletions(-) diff --git a/docs/Configuring-ls1-Brads-Env.md b/docs/Configuring-ls1-Brads-Env.md index f39af42664020..5d93271a15a58 100644 --- a/docs/Configuring-ls1-Brads-Env.md +++ b/docs/Configuring-ls1-Brads-Env.md @@ -54,8 +54,7 @@ zip -d $LLAMA_SERVER_ONE_ZIP "/usr/*" printf "\n**********\n*\n* FINISHED: Delete Extraneous Timezone Files.\n*\n**********\n\n" ``` ---- -### Verify Contents of Zip Archive +#### Verify Contents of Zip Archive Verify that these files are no longer in the archive: ``` @@ -69,9 +68,9 @@ printf "\n**********\n*\n* FINISHED: Verify Contents of Zip Archive.\n*\n******* **SSL DOESN'T WORK WITH COSMOCC YET.** Add self-signed certs to the archive. CA crt might need to go in the website folder? ``` mkdir certs -cp /mnt/hyperv/Mmojo-certs/dot-local.crt certs -cp /mnt/hyperv/Mmojo-certs/dot-local.key certs -cp /mnt/hyperv/Mmojo-certs/selfsignCA.crt certs +cp /mnt/hyperv/Mmojo-Raspberry-Pi/Mmojo-certs/mmojo.local.crt certs +cp /mnt/hyperv/Mmojo-Raspberry-Pi/Mmojo-certs/mmojo.local.key certs +cp /mnt/hyperv/Mmojo-Raspberry-Pi/Mmojo-certs/selfsignCA.crt certs zip -0 -r $LLAMA_SERVER_ONE_ZIP certs/* printf "\n**********\n*\n* FINISHED: Add Certs to Archive.\n*\n**********\n\n" ``` @@ -91,7 +90,7 @@ printf "\n**********\n*\n* FINISHED: Verify certs Directory in Archive.\n*\n**** ``` mkdir website cp -r /mnt/hyperv/web-apps/completion-tool/* website -cp /mnt/hyperv/Mmojo-certs/selfsignCA.crt website/CA.crt +cp /mnt/hyperv/Mmojo-Raspberry-Pi/Mmojo-certs/selfsignCA.crt website/CA.crt rm website/*.txt rm website/images/*.svg rm website/images/*.psd @@ -115,25 +114,6 @@ A `default-args` file in the archive can specify sane default parameters. The fo We don't yet support including the model inside the zip archive (yet). That has a 4GB size limitation on Windows anyway, as `.exe` files cannot exceed 4GB. So let's use an adjacent file called `model.gguf`. We will serve on localhost, port 8080 by default for safety. The `--ctx-size` parameter is the size of the context window. This is kinda screwy to have as a set size rather than a maximum because the `.gguf` files now have the training context size in metadata. We set it to 8192 to be sensible. The `--threads-http` parameter ensures that the browser can ask for all the image files in our default UI at once. -``` -cat << EOF > $DEFAULT_ARGS --m -model.gguf ---host -127.0.0.1 ---port -8080 ---ctx-size -8192 ---threads-http -8 ---path -/zip/website -... -EOF -printf "\n**********\n*\n* FINISHED: Create Default args File.\n*\n**********\n\n" -``` - ``` cat << EOF > $DEFAULT_ARGS -m @@ -166,8 +146,7 @@ zip -0 -r $LLAMA_SERVER_ONE_ZIP $DEFAULT_ARGS printf "\n**********\n*\n* FINISHED: Add default-args File to Archive.\n*\n**********\n\n" ``` ---- -### Verify default-args File in Archive +#### Verify default-args File in Archive Verify that the archive contains the `default-args` file: ``` @@ -209,8 +188,7 @@ After starting up and loading the model, it should display: Hit `ctrl-C` on your keyboard to stop it. 
---- -### Test Run on Public Interfaces +#### Test Run on Public Interfaces If you'd like it to listen on all available interfaces, so you can connect from a browser on another computer: ``` From 0723a9e418083eacca9f6f6ea6ecaf4d2635e5bb Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sun, 20 Apr 2025 12:43:12 -0700 Subject: [PATCH 59/73] Update Configuring-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Configuring-ls1-Brads-Env.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/Configuring-ls1-Brads-Env.md b/docs/Configuring-ls1-Brads-Env.md index 5d93271a15a58..9043197c9ef49 100644 --- a/docs/Configuring-ls1-Brads-Env.md +++ b/docs/Configuring-ls1-Brads-Env.md @@ -129,9 +129,9 @@ model.gguf --path /zip/website --ssl-key-file -/zip/certs/dot-local.key +/zip/certs/mmojo.local.key --ssl-cert-file -/zip/certs/dot-local.crt +/zip/certs/mmojo.local.crt ... EOF printf "\n**********\n*\n* FINISHED: Create Default args File.\n*\n**********\n\n" From 8d50e9c6eb7b468b1d6deb9e6f344bb8371c1015 Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sun, 20 Apr 2025 12:43:46 -0700 Subject: [PATCH 60/73] Update Configuring-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Configuring-ls1-Brads-Env.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/Configuring-ls1-Brads-Env.md b/docs/Configuring-ls1-Brads-Env.md index 9043197c9ef49..418f933039081 100644 --- a/docs/Configuring-ls1-Brads-Env.md +++ b/docs/Configuring-ls1-Brads-Env.md @@ -137,8 +137,7 @@ EOF printf "\n**********\n*\n* FINISHED: Create Default args File.\n*\n**********\n\n" ``` ---- -### Add default-args File to Archive +#### Add default-args File to Archive Add the `default-args` file to the archive: ``` From 4367c5b65c27591ec3492b29617fb0276cb6d4bc Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sun, 20 Apr 2025 13:52:49 -0700 Subject: [PATCH 61/73] Update Buidling-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Buidling-ls1-Brads-Env.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/docs/Buidling-ls1-Brads-Env.md b/docs/Buidling-ls1-Brads-Env.md index 257ed4b02930c..3cec3fa0ed1fb 100644 --- a/docs/Buidling-ls1-Brads-Env.md +++ b/docs/Buidling-ls1-Brads-Env.md @@ -7,9 +7,15 @@ This file contains instructions for building `llama.cpp` with `cosmocc` to yield ### Environment Variables -Let's define some environment variables: +Let's define some environment variables, resetting those that affect the Makefile: ``` BUILDING_DIR="1-BUILDING-llama.cpp" +export CC="" +export CXX="" +export AR="" +export UNAME_S="" +export UNAME_P="" +export UNAME_M="" printf "\n**********\n*\n* FINISHED: Environment Variables.\n*\n**********\n\n" ``` From 604e07e2cb03e45a640e248097053cbc4665e3fb Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sun, 20 Apr 2025 13:54:35 -0700 Subject: [PATCH 62/73] Update Buidling-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Buidling-ls1-Brads-Env.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/Buidling-ls1-Brads-Env.md b/docs/Buidling-ls1-Brads-Env.md index 3cec3fa0ed1fb..d774c0fe117ec 100644 --- a/docs/Buidling-ls1-Brads-Env.md +++ b/docs/Buidling-ls1-Brads-Env.md @@ -66,8 +66,7 @@ We use the old `Makefile` rather than CMake. 
We've updated the `Makefile` in thi ``` cd ~/$BUILDING_DIR export LLAMA_MAKEFILE=1 -# LLAMA_SERVER_SSL doesn't work with cosmocc yet. -# export LLAMA_SERVER_SSL=ON +export LLAMA_SERVER_SSL=ON make clean make printf "\n**********\n*\n* FINISHED: Make llama.cpp.\n*\n**********\n\n" From 8c3ffaea885dfab498dbcd84d8cdfd176d96a1d2 Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sun, 20 Apr 2025 13:55:40 -0700 Subject: [PATCH 63/73] Update Buidling-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Buidling-ls1-Brads-Env.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Buidling-ls1-Brads-Env.md b/docs/Buidling-ls1-Brads-Env.md index d774c0fe117ec..a9937910a9842 100644 --- a/docs/Buidling-ls1-Brads-Env.md +++ b/docs/Buidling-ls1-Brads-Env.md @@ -107,7 +107,7 @@ export PATH="$(pwd)/cosmocc/bin:$PATH" export CC="cosmocc -I$(pwd)/cosmocc/include -L$(pwd)/cosmocc/lib" export CXX="cosmocc -I$(pwd)/cosmocc/include \ -I$(pwd)/cosmocc/include/third_party/libcxx \ - -L$(pwd)/cosmocc/lib" + -L$(pwd)/cosmocc/lib -L$(pwd)/openssl" export AR="cosmoar" export UNAME_S="cosmocc" export UNAME_P="cosmocc" From 444c4ad980178e3abe22c95556fa3eb5b87b039a Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sun, 20 Apr 2025 13:58:26 -0700 Subject: [PATCH 64/73] Update Buidling-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Buidling-ls1-Brads-Env.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/Buidling-ls1-Brads-Env.md b/docs/Buidling-ls1-Brads-Env.md index a9937910a9842..12f80d46a0b0e 100644 --- a/docs/Buidling-ls1-Brads-Env.md +++ b/docs/Buidling-ls1-Brads-Env.md @@ -10,12 +10,12 @@ This file contains instructions for building `llama.cpp` with `cosmocc` to yield Let's define some environment variables, resetting those that affect the Makefile: ``` BUILDING_DIR="1-BUILDING-llama.cpp" -export CC="" -export CXX="" -export AR="" -export UNAME_S="" -export UNAME_P="" -export UNAME_M="" +unset CC +unset CXX +unset AR +unset UNAME_S +unset UNAME_P +unset UNAME_M printf "\n**********\n*\n* FINISHED: Environment Variables.\n*\n**********\n\n" ``` From ddeaedec95a0b42216007385ae4d195bab46b23c Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sun, 20 Apr 2025 14:11:06 -0700 Subject: [PATCH 65/73] Update Buidling-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Buidling-ls1-Brads-Env.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/docs/Buidling-ls1-Brads-Env.md b/docs/Buidling-ls1-Brads-Env.md index 12f80d46a0b0e..d4b5df11a7cee 100644 --- a/docs/Buidling-ls1-Brads-Env.md +++ b/docs/Buidling-ls1-Brads-Env.md @@ -115,6 +115,19 @@ export UNAME_M="cosmocc" printf "\n**********\n*\n* FINISHED: Prepare to make llama.cpp with Cosmo.\n*\n**********\n\n" ``` +--- +### Make openssl with Cosmo +We need cross-architectire `libssl` and `libcrypto` static libraries to support SSL in `llama-server-one`. +``` +cp -r /usr/include/openssl/ ./cosmocc/include/ +cp -r /usr/include/x86_64-linux-gnu/openssl/* ./cosmocc/include/openssl +git clone https://github.com/openssl/openssl.git +cd openssl +./Configure no-asm no-dso no-afalgeng no-shared no-pinshared no-apps +make +cd .. 
+``` + --- ### Make llama.cpp with Cosmo ``` From 93c9dc4790ee2942a56d1edb0ddb5b7885406b6a Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sun, 20 Apr 2025 14:53:19 -0700 Subject: [PATCH 66/73] Update Buidling-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Buidling-ls1-Brads-Env.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/Buidling-ls1-Brads-Env.md b/docs/Buidling-ls1-Brads-Env.md index d4b5df11a7cee..7207c860e37b8 100644 --- a/docs/Buidling-ls1-Brads-Env.md +++ b/docs/Buidling-ls1-Brads-Env.md @@ -53,6 +53,7 @@ printf "\n**********\n*\n* FINISHED: Checkout work-in-progress.\n*\n**********\n APP_NAME='Mmojo Chat' sed -i -e "s/<title>.*<\/title>/<title>$APP_NAME<\/title>/g" examples/server/webui/index.html sed -i -e "s/>llama.cpp<\/div>/>$APP_NAME<\/div>/g" examples/server/webui/src/components/Header.tsx +sed -i -e "s/<\/head>/ <link rel=\"manifest\" href=\"chat-manifest.json\" \/>\n <link rel=\"icon\" href=\"images\/chat-logo-128.png\" \/>\n <\/head>/g" examples/server/webui/index.html cd examples/server/webui npm i npm run build From 783d0ec8dc07a62be343ce9d5feb953eb87ac11c Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sun, 20 Apr 2025 16:19:20 -0700 Subject: [PATCH 67/73] Update Buidling-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Buidling-ls1-Brads-Env.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/Buidling-ls1-Brads-Env.md b/docs/Buidling-ls1-Brads-Env.md index 7207c860e37b8..d4b5df11a7cee 100644 --- a/docs/Buidling-ls1-Brads-Env.md +++ b/docs/Buidling-ls1-Brads-Env.md @@ -53,7 +53,6 @@ printf "\n**********\n*\n* FINISHED: Checkout work-in-progress.\n*\n**********\n APP_NAME='Mmojo Chat' sed -i -e "s/<title>.*<\/title>/<title>$APP_NAME<\/title>/g" examples/server/webui/index.html sed -i -e "s/>llama.cpp<\/div>/>$APP_NAME<\/div>/g" examples/server/webui/src/components/Header.tsx -sed -i -e "s/<\/head>/ <link rel=\"manifest\" href=\"chat-manifest.json\" \/>\n <link rel=\"icon\" href=\"images\/chat-logo-128.png\" \/>\n <\/head>/g" examples/server/webui/index.html cd examples/server/webui npm i npm run build From b1f1b14bfc3c32ae091be9faf52daa0300834d23 Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sun, 20 Apr 2025 16:42:39 -0700 Subject: [PATCH 68/73] Update Buidling-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Buidling-ls1-Brads-Env.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/Buidling-ls1-Brads-Env.md b/docs/Buidling-ls1-Brads-Env.md index d4b5df11a7cee..c6f6f2a57b89d 100644 --- a/docs/Buidling-ls1-Brads-Env.md +++ b/docs/Buidling-ls1-Brads-Env.md @@ -126,6 +126,8 @@ cd openssl ./Configure no-asm no-dso no-afalgeng no-shared no-pinshared no-apps make cd .. 
+printf "\n**********\n*\n* FINISHED: Make openssl with Cosmo.\n*\n**********\n\n" + ``` --- From 51e9d7c3479de39d31b15b6ddda3320fa223f1f8 Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sun, 20 Apr 2025 19:48:05 -0700 Subject: [PATCH 69/73] Update Configuring-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Configuring-ls1-Brads-Env.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/Configuring-ls1-Brads-Env.md b/docs/Configuring-ls1-Brads-Env.md index 418f933039081..d8b220e77c995 100644 --- a/docs/Configuring-ls1-Brads-Env.md +++ b/docs/Configuring-ls1-Brads-Env.md @@ -92,8 +92,8 @@ mkdir website cp -r /mnt/hyperv/web-apps/completion-tool/* website cp /mnt/hyperv/Mmojo-Raspberry-Pi/Mmojo-certs/selfsignCA.crt website/CA.crt rm website/*.txt -rm website/images/*.svg -rm website/images/*.psd +rm website/completion/images/*.svg +rm website/completion/images/*.psd zip -0 -r $LLAMA_SERVER_ONE_ZIP website/* printf "\n**********\n*\n* FINISHED: Create website Directory in Archive.\n*\n**********\n\n" ``` From f97b8ccbffcf681c451876ce0e7ece785612ea9d Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sun, 20 Apr 2025 19:58:36 -0700 Subject: [PATCH 70/73] Update Configuring-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Configuring-ls1-Brads-Env.md | 11 ++--------- 1 file changed, 2 insertions(+), 9 deletions(-) diff --git a/docs/Configuring-ls1-Brads-Env.md b/docs/Configuring-ls1-Brads-Env.md index d8b220e77c995..27ae47b14d42b 100644 --- a/docs/Configuring-ls1-Brads-Env.md +++ b/docs/Configuring-ls1-Brads-Env.md @@ -107,7 +107,7 @@ printf "\n**********\n*\n* FINISHED: Verify website Directory in Archive.\n*\n** ``` --- -### Create default-args File +### Create default-args File in Archive A `default-args` file in the archive can specify sane default parameters. The format of the file is parameter name on a line, parameter value on a line, rinse, repeat. End the file with a `...` line to include user specified parameters. @@ -134,15 +134,8 @@ model.gguf /zip/certs/mmojo.local.crt ... EOF -printf "\n**********\n*\n* FINISHED: Create Default args File.\n*\n**********\n\n" -``` - -#### Add default-args File to Archive - -Add the `default-args` file to the archive: -``` zip -0 -r $LLAMA_SERVER_ONE_ZIP $DEFAULT_ARGS -printf "\n**********\n*\n* FINISHED: Add default-args File to Archive.\n*\n**********\n\n" +printf "\n**********\n*\n* FINISHED: Create Default args File in Archive.\n*\n**********\n\n" ``` #### Verify default-args File in Archive From 321574b6108abf7260eae98fada99513d4ac4b53 Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Sun, 20 Apr 2025 22:34:43 -0700 Subject: [PATCH 71/73] Update Configuring-ls1-Brads-Env.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- docs/Configuring-ls1-Brads-Env.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Configuring-ls1-Brads-Env.md b/docs/Configuring-ls1-Brads-Env.md index 27ae47b14d42b..854fd7c04c184 100644 --- a/docs/Configuring-ls1-Brads-Env.md +++ b/docs/Configuring-ls1-Brads-Env.md @@ -65,7 +65,7 @@ printf "\n**********\n*\n* FINISHED: Verify Contents of Zip Archive.\n*\n******* --- ### Add Certs to Archive -**SSL DOESN'T WORK WITH COSMOCC YET.** Add self-signed certs to the archive. CA crt might need to go in the website folder? +Add self-signed certs to the archive. CA cert is added to the website folder. 
``` mkdir certs cp /mnt/hyperv/Mmojo-Raspberry-Pi/Mmojo-certs/mmojo.local.crt certs From c253e9f17b7baea539b7a1dc701c386649d94f7d Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Mon, 21 Apr 2025 07:51:48 -0700 Subject: [PATCH 72/73] Update and rename README.md to README-LS1.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- README.md => README-LS1.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename README.md => README-LS1.md (100%) diff --git a/README.md b/README-LS1.md similarity index 100% rename from README.md rename to README-LS1.md From dfe634dbe4ddfbf96682257230d5829f00a68c0a Mon Sep 17 00:00:00 2001 From: Brad Hutchings <brad@componentx.com> Date: Mon, 21 Apr 2025 07:52:19 -0700 Subject: [PATCH 73/73] Rename README-llama-cpp.md to README.md Signed-off-by: Brad Hutchings <brad@componentx.com> --- README-llama-cpp.md => README.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename README-llama-cpp.md => README.md (100%) diff --git a/README-llama-cpp.md b/README.md similarity index 100% rename from README-llama-cpp.md rename to README.md