Commit 79d49df
Add post on Jlama integration
1 parent 0e594ee commit 79d49df

1 file changed: 68 additions, 0 deletions
---
layout: post
title: "Creating a pure Java LLM-infused application with Quarkus, Langchain4j and Jlama"
date: 2024-11-29
tags: AI LLM local-inference Jlama
synopsis: "Creating a pure Java LLM-infused application with Quarkus, Langchain4j and Jlama"
author: mariofusco
---

Currently, the vast majority of LLM-based applications rely on external services provided by specialized companies. These services typically offer access to huge, general-purpose models, implying energy consumption, and therefore costs, that are proportional to the size of those models.

Even worse, this usage pattern also comes with both privacy and security concerns, since it is virtually impossible to be sure how those service providers will eventually reuse their customers' prompts, which in some cases could also contain sensitive information.

For these reasons, many companies are deciding to train or fine-tune smaller models that do not claim to be usable in every context, but that are tailored to their specific business needs, and to run these models on premises or on private clouds.

The features provided by these specialized models need to be integrated into the existing software infrastructure, which in the enterprise world is very often written in Java. This could be accomplished with a traditional client-server architecture, for instance serving the model through an external server like https://ollama.com/[Ollama] and querying it through REST calls. While this should not present any particular problem for Java developers, they could work more efficiently if they could consume the model directly from Java, without the need to install any additional tools. Finally, the possibility of embedding the LLM interaction directly in the same Java process running the application makes it easier to move from local development to deployment, relieving IT of the burden of managing an external server and thus bypassing the need for a more mature platform engineering strategy. This is where Jlama comes into play.

== How and why to execute LLM inference in pure Java with Jlama

https://github.com/tjake/Jlama[Jlama] is a library that allows executing LLM inference in pure Java. It supports many LLM model families like Llama, Mistral, Qwen2 and Granite. It also implements out of the box many useful LLM-related features like tool calling, embeddings, mixture of experts and even distributed inference.

Jlama is well integrated with Quarkus through the https://quarkus.io/extensions/io.quarkiverse.langchain4j/quarkus-langchain4j-jlama/[dedicated Langchain4j-based extension]. Note that for performance reasons Jlama uses the https://openjdk.org/jeps/469[Vector API], which is still in preview in Java 23 and will very likely be released as a supported feature in Java 25.
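
As a reference, adding the extension to a Maven-based Quarkus project looks roughly like the snippet below. The version is only a placeholder (check the extension page for the latest release), and since the Vector API is not yet a final feature, the application also has to be run with flags such as `--add-modules jdk.incubator.vector` (and, depending on the JDK version, `--enable-preview`).

[source, xml]
----
<!-- Quarkus Langchain4j Jlama extension; the version below is a placeholder -->
<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-jlama</artifactId>
    <version>${quarkus-langchain4j.version}</version>
</dependency>
----
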
In essence, Jlama makes it possible to serve an LLM in Java, possibly directly embedded in the same JVM running your Java application. But why could this be useful? Actually, this is desirable in many use cases and presents a number of relevant advantages, such as the following:

. *Fast development/prototyping*: Not having to install, configure and interact with an external server can make the development of an LLM-based Java application much easier.
. *Easy model testing*: Running the LLM inference embedded in the JVM also makes it easier to test different models and their integration during the development phase.
. *Security/Portability/Performance*: Performing the model inference in the same JVM instance that runs the application using it eliminates the need to interact with the LLM through REST calls, which not only could be impossible in specific secure contexts, but also comes with the performance cost of an avoidable remote call.
. *Legacy support*: The previous point will be especially beneficial for legacy users still running monolithic applications, who in this way will also be able to include LLM-based capabilities in those applications without changing their architecture or platform.
. *Monitoring and Observability*: Running the LLM inference in pure Java will also simplify monitoring and observability, making it possible to gather statistics on the reliability and speed of the LLM responses.
. *Developer Experience*: Debuggability will be simplified in the same way, allowing the Java developer to also navigate and debug the Jlama code if necessary.
. *Distribution*: Having the possibility of running LLM inference embedded in the same Java process also makes it possible to include the model itself in the same fat jar of the application using it (even though this is probably advisable only in very specific circumstances).
. *Edge friendliness*: The possibility of implementing and deploying a self-contained LLM-capable Java application also makes it a better fit than a client-server architecture for edge environments.
. *Embedding of auxiliary LLMs*: Many applications, especially the ones relying on agentic AI patterns, use many different LLMs at once. For instance, a smaller LLM could be used to validate and approve the responses of the main, bigger one. In this case a hybrid approach could be convenient, embedding the smaller auxiliary LLMs while still serving the main one through a dedicated server.
. *Similar lifecycle between model and app*: There can be use cases where the model and the application using it have the same lifecycle, so that the development of a new feature in the application also requires a change in the model. In these situations, having the model embedded in the application will help simplify the development cycle.

== The site summarizer: a pure Java LLM-based application

To demonstrate how Quarkus, Langchain4j and Jlama make it straightforward to create a pure Java LLM-infused application, where the LLM inference is directly embedded in the same JVM running the application, I created a https://github.com/mariofusco/site-summarizer[simple project] that uses an LLM to automatically generate the summary of a Wikipedia page or, more in general, of a blog post taken from any website.

Out of the box this project uses a https://huggingface.co/tjake/Llama-3.2-1B-Instruct-JQ4[small Llama-3.2 model with 4-bit quantization]. When the application is compiled for the first time, the model is automatically downloaded locally by Jlama from the Hugging Face repository. However, it is possible to replace this model and experiment with any other one by simply editing the https://github.com/mariofusco/site-summarizer/blob/main/src/main/resources/application.properties#L4[quarkus.langchain4j.jlama.chat-model.model-name property] in the application.properties file.
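
For reference, the relevant configuration looks roughly like the following; the value shown here simply reflects the Hugging Face coordinates of the default model mentioned above and can be swapped for any other model supported by Jlama.

[source, properties]
----
# Model used by Jlama for chat interactions (Hugging Face coordinates)
quarkus.langchain4j.jlama.chat-model.model-name=tjake/Llama-3.2-1B-Instruct-JQ4
----
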
The project's readme explains pretty well how this works: after the text of the web page to be summarized is programmatically extracted from the HTML, it is sent to Jlama to be processed through a usual Langchain4j AiService.

[source, java]
----
import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;
import io.smallrye.mutiny.Multi;
import jakarta.inject.Singleton;

@RegisterAiService
@Singleton
public interface SummarizerAiService {

    @SystemMessage("""
            You are an assistant that receives the content of a web page and sums up
            the text on that page. Add key takeaways to the end of the sum-up.
            """)
    @UserMessage("Here's the text: '{text}'")
    Multi<String> summarize(String text);
}
----
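
Since the AiService is a plain CDI bean, it can be injected and invoked like any other bean. The sketch below is only a hypothetical way of exposing it through a Quarkus REST endpoint (the actual project may wire things differently, and the HTML text-extraction step is elided here); because `summarize` returns a `Multi<String>`, the summary can be streamed back token by token as Jlama produces it.

[source, java]
----
import io.smallrye.mutiny.Multi;
import jakarta.inject.Inject;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.QueryParam;

@Path("/summarize")
public class SummarizerResource {

    @Inject
    SummarizerAiService summarizer; // the AiService defined above

    @GET
    public Multi<String> summarize(@QueryParam("text") String text) {
        // Streams the summary produced in-process by Jlama back to the client
        return summarizer.summarize(text);
    }
}
----
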
The only thing that needs to be remarked here is that, differently from all other LLM inference engine integrations, this one doesn't require any remote call to an external service, but performs the LLM inference directly inside the same JVM running the application.

The combination of two trends, the increasing spread of small and tailored models and the adoption of these models in the enterprise software development world, will very likely promote the use of similar solutions in the near future.
