This is a small proof-of-concept demo showing how agent workflows with tool-calling SLMs (small language models) can be implemented in CPU-only deployments.
A growing number of small (and not so large) language models exist, some specifically geared towards tool calling:
- Gemma 3 270M, 1B and 4B
- Phi-3 Mini
- Qwen3 0.6B and 4B
- Hermes 3 (Llama 3.2) 3B
In this proof-of-concept we deploy a model using llama.cpp
and run it with the llama-server utility. You can deploy this demo comfortably on a CPU-only VM,
such as an Oracle Cloud Infrastructure (OCI) VM.Standard.E6.Flex shape with 8 OCPUs and 32 GB of memory.
The demo shows the use case of an AI assistant supporting operations at a (fictitious) small enterprise logistics company. The chatbot interface has access to a number of tools/workflows for checking the delivery truck fleet, listing pending deliveries, scheduling new deliveries, and assigning trucks to deliveries, as well as routing and weather APIs. The (human) operator interacts with the different tools via the chatbot, using it as a querying interface or to start workflows. Multi-task workflows, such as assigning a truck to a delivery, fetching routing time estimates and a weather forecast, and updating the different databases, are all handled transparently by the chatbot through the application server.
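To make the idea of such a composite workflow concrete, here is a heavily simplified, hypothetical sketch; the function names, records and return values are illustrative stand-ins and not taken from the repo. It only shows how one chatbot action can chain several tool steps (routing lookup, weather lookup, database updates) behind the scenes:

```python
# Hypothetical sketch of a multi-step workflow; all names and data are
# illustrative stand-ins, not the demo's actual tools or databases.

# Toy in-memory "databases" standing in for the demo's real state.
DELIVERIES = {"D-17": {"destination": "Graz", "assigned_truck": None}}
TRUCKS = {"T-42": {"status": "available"}}

def estimate_route_minutes(truck_id: str, destination: str) -> int:
    """Stand-in for a routing API call returning a travel-time estimate."""
    return 95

def fetch_weather(destination: str) -> str:
    """Stand-in for a weather API call."""
    return "light rain"

def assign_truck_to_delivery(delivery_id: str, truck_id: str) -> dict:
    """Chain the individual tools into one workflow the chatbot can trigger."""
    delivery = DELIVERIES[delivery_id]
    eta = estimate_route_minutes(truck_id, delivery["destination"])
    weather = fetch_weather(delivery["destination"])
    delivery["assigned_truck"] = truck_id      # update the delivery record
    TRUCKS[truck_id]["status"] = "assigned"    # update the fleet record
    return {"delivery": delivery_id, "truck": truck_id,
            "eta_minutes": eta, "weather": weather}

print(assign_truck_to_delivery("D-17", "T-42"))
```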
The application server is implemented using Streamlit. The tool backend uses the LangChain tools API to expose Python functions as tools and present them in the appropriate format for an OpenAI-compatible server. It also implements the business logic and tool workflows, and communicates with the language model over the OpenAI REST API.
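As a rough illustration of this plumbing (a sketch, not the repo's actual code), the example below defines a made-up `get_truck_status` tool with LangChain's `@tool` decorator, converts it to the OpenAI tool schema, and sends it to the llama-server endpoint through the standard OpenAI Python client; the tool name, its body and the example query are assumptions for illustration:

```python
# Minimal sketch: expose a Python function as a LangChain tool and pass it
# to the OpenAI-compatible llama-server endpoint.
from langchain_core.tools import tool
from langchain_core.utils.function_calling import convert_to_openai_tool
from openai import OpenAI

@tool
def get_truck_status(truck_id: str) -> str:
    """Return the current status of a delivery truck."""
    # Hypothetical lookup; the real demo reads from its own fleet data.
    return f"Truck {truck_id}: available"

# Convert the LangChain tool into the JSON schema the OpenAI API expects.
openai_tools = [convert_to_openai_tool(get_truck_status)]

# llama-server speaks the OpenAI API; no API key is configured by default.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")

response = client.chat.completions.create(
    # llama-server serves a single model, so the name is mostly informational.
    model="Hermes-3-Llama-3.2-3B",
    messages=[{"role": "user", "content": "Is truck T-42 available?"}],
    tools=openai_tools,
)

# If the model decides to call the tool, the call shows up here.
print(response.choices[0].message.tool_calls)
```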
- Get some system dependencies; on Ubuntu this should suffice:

```bash
sudo apt install git tmux htop \
    build-essential gcc-12 g++-12 cmake libcurl4-openssl-dev
mkdir $HOME/.local
```

- Get and build llama.cpp:

```bash
git clone https://github.com/ggml-org/llama.cpp.git && cd llama.cpp
cmake -B build
cmake --build build --config Release
cmake --install build --config Release --prefix $HOME/.local
echo "export PATH=$HOME/.local/bin:$PATH" >> ~/.bashrc
echo "export LD_LIBRARY_PATH=$HOME/.local/lib:$LD_LIBRARY_PATH" >> ~/.bashrc
source ~/.bashrc
```

- Run the llama.cpp server (inside a tmux or screen session) like this:

```bash
llama-server -hf NousResearch/Hermes-3-Llama-3.2-3B-GGUF -c 0 -fa on --jinja --chat-template-file models/templates/NousResearch-Hermes-3-Llama-3.1-8B-tool_use.jinja
```

- Use `uv` for easy deployment:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

- Get the code:

```bash
git clone https://github.com/ohm314/slm_tool_calling_demo.git
cd slm_tool_calling_demo
```

- Run the Streamlit app:

```bash
uv run streamlit run src/logistics_poc/app.py
```

The app assumes `llama-server` is running on `127.0.0.1:8080`.
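As a quick sanity check (a sketch, not part of the repo), you can verify from Python that the OpenAI-compatible endpoint is reachable before launching the app:

```python
# List the models served by llama-server at the endpoint the app expects.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")
print([m.id for m in client.models.list().data])
```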
Copyright (c) 2025 Omar Awile. Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/
