A robust, CLI-based LLM (Large Language Model) chat application built with Spring Boot 3 and Java 17, utilizing LlamaCpp-Java bindings for high-performance inference.
This project demonstrates how to integrate local LLM inference within a Spring Boot application, supporting GGUF model formats.
- Interactive CLI Chat: Real-time chat interface via the command line.
- Local Inference: Runs GGUF models locally (no API keys required).
- Customizable Prompts: Support for external prompt templates.
- Configurable Generation: Fine-tune temperature, top-p, context size, and CPU threads.
- Performance Statistics: Detailed metrics for every response (tokens/sec, time to first token, total tokens).
- Modular Architecture: Decoupled I/O and business logic for better testability.
- Comprehensive Tests: Includes unit tests for services and components.
- Docker Support: Ready-to-use Dockerfile for containerized deployment.
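The per-response statistics mentioned above boil down to simple timing arithmetic. As an illustration (the class and method names here are hypothetical, not the project's actual code), time to first token and throughput could be computed like this:

```java
// Illustrative sketch of the per-response metrics listed above:
// time to first token (TTFT) and tokens per second.
// All names here are hypothetical, not the project's actual API.
public class GenerationStatsSketch {

    /** Tokens per second over the whole generation. */
    static double tokensPerSecond(int totalTokens, long elapsedNanos) {
        return totalTokens / (elapsedNanos / 1_000_000_000.0);
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long firstTokenAt = start + 150_000_000L; // pretend: first token after 150 ms
        long end = start + 2_000_000_000L;        // pretend: 2 s total generation time
        int totalTokens = 64;

        double ttftMs = (firstTokenAt - start) / 1_000_000.0;
        System.out.printf("time to first token: %.0f ms%n", ttftMs);
        System.out.printf("tokens/sec: %.1f%n",
                tokensPerSecond(totalTokens, end - start));
    }
}
```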
- Java: JDK 17 or higher (the application targets Java 17).
- Maven: 3.8+ (Wrapper included).
- RAM: Sufficient RAM to load your chosen GGUF model (e.g., ~1GB for TinyLlama 1.1B Q4).
Clone the repository and build the application using Maven:
git clone <repository-url>
cd llm-chatbot-springboot
```bash
./mvnw clean package
```

The executable JAR will be located in the `target` directory (e.g., `target/LLMCpp-Chat-SpringBoot.jar`).
Download a GGUF model file (e.g., from Hugging Face).
- Recommended for testing: TinyLlama-1.1B-Chat-v1.0-GGUF
Run the JAR, pointing it to your model file:
```bash
java -jar target/LLMCpp-Chat-SpringBoot.jar --llamacpp.model=/path/to/your/model.gguf
```

Or use the default configuration, which looks for `tinyllama-1.1b-chat-v1.0.Q6_K.gguf` in the working directory:
```bash
java -jar target/LLMCpp-Chat-SpringBoot.jar
```

You can run the unit tests using the Maven wrapper:
```bash
./mvnw test
```

You can configure the application via `application.properties`, system properties, or command-line arguments.
| Property | Description | Default Value |
|---|---|---|
| `llamacpp.model` | Absolute or relative path to the GGUF model file. | `tinyllama-1.1b-chat-v1.0.Q6_K.gguf` |
| `llamacpp.prompt.path` | Path to a text file containing the system prompt template. | `llamacpp_prompt.txt` |
| `llamacpp.temperature` | Controls randomness (0.0 to 1.0). Higher is more creative. | `0.2` |
| `llamacpp.topp` | Nucleus sampling (top-p) probability threshold. | `10` |
| `llamacpp.thread.cpu` | Number of CPU threads to use for inference. | `1` |
| `llamacpp.number.context` | Context window size (0 uses the model default). | `0` |
| `llamacpp.frequency-penalty` | Penalty for token repetition. | `0.2` |
| `llamacpp.miro-stat` | MiroStat sampling version (`V0`, `V1`, `V2`). | `V2` |
| `llamacpp.stop-strings` | List of strings that stop generation. | `` `, < `` |
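For reference, the same settings can also be grouped in `application.properties`. The values below are illustrative examples, not recommendations:

```properties
# Illustrative application.properties (example values only)
llamacpp.model=models/tinyllama-1.1b-chat-v1.0.Q6_K.gguf
llamacpp.prompt.path=llamacpp_prompt.txt
llamacpp.temperature=0.2
llamacpp.thread.cpu=4
llamacpp.number.context=2048
```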
By default, the application uses a built-in prompt template suitable for chat-tuned models. To customize it, create a file (e.g., my_prompt.txt) and pass it:
```bash
java -jar target/LLMCpp-Chat-SpringBoot.jar --llamacpp.prompt.path=my_prompt.txt
```

Template Variables:
- `{question}`: Will be replaced by the user's input.
Example Prompt File:
```
<|system|>
You are a helpful coding assistant.
<|user|>
{question}
<|assistant|>
```
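A template like the one above only requires simple placeholder substitution. As a sketch (the class and method names are illustrative, not the project's actual `PromptComponent` API), filling in `{question}` could look like this:

```java
// Illustrative sketch: fills the {question} placeholder in a chat
// prompt template. Names here are hypothetical, not the project's API.
public class PromptTemplateSketch {

    /** Replaces every {question} token with the user's input. */
    public static String format(String template, String question) {
        return template.replace("{question}", question);
    }

    public static void main(String[] args) {
        String template = "<|system|>\nYou are a helpful coding assistant.\n"
                + "<|user|>\n{question}\n<|assistant|>\n";
        System.out.print(format(template, "How do I reverse a list in Java?"));
    }
}
```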
Build the Docker image:
```bash
docker build -t chat-cli .
```

Run the container, mounting the model file:
```bash
docker run -it -v /local/path/to/model.gguf:/app/model.gguf chat-cli --llamacpp.model=/app/model.gguf
```

The application follows a clean Spring Boot architecture with decoupled concerns:
- `ChatRunner`: Implements `CommandLineRunner` to start the chat service without blocking application context initialization.
- `ChatServicesImpl`: Manages the high-level chat loop, using an `IOService` for interaction.
- `ChatbotServicesImpl`: Handles the business logic for generating responses using the LLM.
- `IOService` / `ConsoleIOService`: Abstracts I/O operations (CLI), enabling easy unit testing and potential future UI swaps.
- `LlamaCppProperties`: Centralized, type-safe configuration bean for all `llamacpp.*` properties.
- `LlamaModelComponent`: Manages the lifecycle of the native `LlamaModel` instance.
- `PromptComponent`: Loads and formats the prompt template.
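The `IOService` abstraction is what makes the chat loop testable: the loop can be exercised against a scripted test double instead of a real console. A minimal sketch of the idea (the method names and the `ScriptedIOService` class are assumptions, not the project's actual code):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the I/O abstraction described above.
// The project's real IOService interface may declare different methods.
interface IOService {
    String readLine();          // read one line of user input
    void write(String text);    // emit a response
}

/** Test double that replays scripted input and records all output. */
class ScriptedIOService implements IOService {
    private final Deque<String> input = new ArrayDeque<>();
    final StringBuilder output = new StringBuilder();

    ScriptedIOService(String... lines) {
        for (String line : lines) input.add(line);
    }

    @Override public String readLine() { return input.poll(); }
    @Override public void write(String text) { output.append(text); }
}

public class IOServiceSketch {
    public static void main(String[] args) {
        // A unit test can drive the chat loop with canned input
        // and assert on the recorded output.
        ScriptedIOService io = new ScriptedIOService("hello");
        io.write("echo: " + io.readLine());
        System.out.println(io.output);
    }
}
```

Swapping `ConsoleIOService` for a double like this keeps the chat-loop tests free of any real console interaction.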
See docs/ARCHITECTURE.md for more details.
Please raise issues in the repository for bugs or feature requests.
