---
title: Deploy Vision Chatbot LLM backend server
weight: 4

layout: learningpathall
---

## Backend Script for Vision Chatbot LLM Server

Once the virtual environment is activated, create a `backend.py` script with the following content. The script downloads the Llama 3.2 Vision model from Hugging Face, applies 4-bit quantization to the model, and then serves it with PyTorch on Arm:
```python
from flask import Flask, request, Response, stream_with_context
from transformers import MllamaForConditionalGeneration, AutoProcessor, TextIteratorStreamer
from threading import Thread
from PIL import Image
import torch
import json
import time
import io
import base64

app = Flask(__name__)

# Load model and processor
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float32)

# Apply torchao quantization
from torchao.dtypes import PlainLayout
from torchao.experimental.packed_linear_int8_dynamic_activation_intx_weight_layout import (
    PackedLinearInt8DynamicActivationIntxWeightLayout,
)
from torchao.experimental.quant_api import int8_dynamic_activation_intx_weight
from torchao.quantization.granularity import PerGroup
from torchao.quantization.quant_api import quantize_
from torchao.quantization.quant_primitives import MappingType

quantize_(
    model,
    int8_dynamic_activation_intx_weight(
        weight_dtype=torch.int4,
        granularity=PerGroup(32),
        has_weight_zeros=True,
        weight_mapping_type=MappingType.SYMMETRIC_NO_CLIPPING_ERR,
        layout=PackedLinearInt8DynamicActivationIntxWeightLayout(target="aten"),
    ),
)

processor = AutoProcessor.from_pretrained(model_id)
model.eval()

@app.route("/v1/chat/completions", methods=["POST"])
def chat_completions():
    image = None
    prompt = ""

    if "image" in request.files:
        file = request.files["image"]
        image = Image.open(file.stream).convert("RGB")
        prompt = request.form.get("prompt", "")
    elif request.is_json:
        data = request.get_json()
        if "image" in data:
            image_bytes = base64.b64decode(data["image"])
            image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
        if "prompt" in data:
            prompt = data["prompt"]
        elif "messages" in data:
            for msg in data["messages"]:
                if msg.get("role") == "user":
                    prompt = msg.get("content", "")
                    break

    if image is None or not prompt:
        return {"error": "Both image and prompt are required."}, 400

    # Format the prompt
    formatted_prompt = (
        f"<|begin_of_text|><|image|>\n"
        f"<|user|>\n{prompt.strip()}<|end_of_text|>\n"
        "<|assistant|>\n"
    )

    inputs = processor(image, formatted_prompt, return_tensors="pt").to(model.device)
    tokenizer = processor.tokenizer if hasattr(processor, "tokenizer") else processor

    # Initialize the TextIteratorStreamer
    text_streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Define generation arguments
    gen_kwargs = {
        "max_new_tokens": 512,
        "do_sample": False,
        "temperature": 1.0,
        "streamer": text_streamer,
        "eos_token_id": tokenizer.eos_token_id,
    }

    # Run generation in a separate thread
    generation_thread = Thread(target=model.generate, kwargs={**inputs, **gen_kwargs})
    generation_thread.start()

    def stream_response():
        assistant_role_chunk = {
            "id": f"chatcmpl-{int(time.time()*1000)}",
            "object": "chat.completion.chunk",
            "created": int(time.time()),
            "model": model_id,
            "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": None}]
        }
        yield f"data: {json.dumps(assistant_role_chunk)}\n\n"

        for token in text_streamer:
            if token.strip():
                content_chunk = {
                    "id": assistant_role_chunk["id"],
                    "object": "chat.completion.chunk",
                    "created": int(time.time()),
                    "model": model_id,
                    "choices": [{"index": 0, "delta": {"content": token}, "finish_reason": None}]
                }
                yield f"data: {json.dumps(content_chunk)}\n\n"

        finish_chunk = {
            "id": assistant_role_chunk["id"],
            "object": "chat.completion.chunk",
            "created": int(time.time()),
            "model": model_id,
            "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]
        }
        yield f"data: {json.dumps(finish_chunk)}\n\n"
        yield "data: [DONE]\n\n"

    return Response(stream_with_context(stream_response()), mimetype='text/event-stream')

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000, threaded=True)
```
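The `/v1/chat/completions` route accepts either a multipart form upload (an `image` file plus a `prompt` field) or a JSON body with a base64-encoded image, and it streams the reply back as Server-Sent Events. You will start the server in the next section; once it is running, the JSON variant can be exercised with a minimal sketch like the one below, which assumes the `requests` package is installed and that an image named `test_image.jpg` exists in the current directory:

```python
import base64
import requests  # assumed to be installed; not part of the server's own dependencies

# Encode a local image as base64 so it can travel inside the JSON body.
# "test_image.jpg" is a placeholder; use any image you have on hand.
with open("test_image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "image": image_b64,
    "messages": [{"role": "user", "content": "What is shown in this image?"}],
}

# The server replies with Server-Sent Events, so stream the response line by line.
response = requests.post(
    "http://localhost:5000/v1/chat/completions",
    json=payload,
    stream=True,
)
for line in response.iter_lines(decode_unicode=True):
    if line:
        print(line)
```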

## Run the Backend Server

You are now ready to run the backend server for the Vision Chatbot. In a terminal, start the backend server with the following command:

```bash
LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4 TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 OMP_NUM_THREADS=16 python3 backend.py
```
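In this command, `LD_PRELOAD` preloads the tcmalloc allocator, `TORCHINDUCTOR_CPP_WRAPPER=1` and `TORCHINDUCTOR_FREEZING=1` enable the TorchInductor C++ wrapper and weight-freezing optimizations, and `OMP_NUM_THREADS=16` sets the number of OpenMP threads; adjust the thread count to match the number of cores available on your Arm machine.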

You should see output similar to the image below when the backend server starts successfully:
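Once the server is running, you can send a quick test request from a second terminal. The sketch below uses the multipart form variant of the endpoint and reassembles the streamed chunks into plain text; it assumes the `requests` package is available and that a local image named `test_image.jpg` exists:

```python
import json
import requests  # assumed to be installed in the client environment

# Upload an image and a prompt as a multipart form request.
# "test_image.jpg" is a placeholder file name.
with open("test_image.jpg", "rb") as f:
    response = requests.post(
        "http://localhost:5000/v1/chat/completions",
        files={"image": f},
        data={"prompt": "Describe this image."},
        stream=True,
    )

# The server emits OpenAI-style "data: {...}" chunks; print the text as it arrives.
for line in response.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data: "):
        continue
    data = line[len("data: "):]
    if data == "[DONE]":
        break
    delta = json.loads(data)["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)
print()
```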