
Commit 835f1e2

llama llm vision chatbot LP
1 parent f23d616 commit 835f1e2

9 files changed: +428 −0 lines changed
Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
---
title: Deploy an LLM-based Vision Chatbot with PyTorch and Hugging Face Transformers on Google Axion processors

minutes_to_complete: 45

who_is_this_for: This Learning Path is for software developers, ML engineers, and anyone interested in deploying a production-ready vision chatbot for their application with optimized performance on the Arm architecture.

learning_objectives:
- Download PyTorch and Torch AO.
- Install the required dependencies.
- Build a frontend with Streamlit to input an image and a prompt.
- Build a backend to download the Llama 3.2 Vision model, quantize it, and run it using PyTorch and Transformers.
- Monitor and analyze inference on Arm CPUs.

prerequisites:
- A Google Cloud Axion (or other Arm) compute instance with at least 32 cores.
- Basic understanding of Python and ML concepts.
- Familiarity with REST APIs and web services.
- Basic knowledge of Streamlit.
- Understanding of LLM fundamentals.

author: Nobel Chowdary Mandepudi

### Tags
skilllevels: Advanced
armips:
- Neoverse
subjects: ML
operatingsystems:
- Linux
tools_software_languages:
- Python
- PyTorch
- Streamlit
- Google Axion
- Demo

further_reading:
- resource:
    title: Getting started with Llama
    link: https://llama.meta.com/get-started
    type: documentation
- resource:
    title: Hugging Face Documentation
    link: https://huggingface.co/docs
    type: documentation
- resource:
    title: Democratizing Generative AI with CPU-based inference
    link: https://blogs.oracle.com/ai-and-datascience/post/democratizing-generative-ai-with-cpu-based-inference
    type: blog


### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1                       # _index.md always has weight of 1 to order correctly
layout: "learningpathall"       # All files under learning paths have this same wrapper
learning_path_main_page: "yes"  # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21                  # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps"         # Always the same, html page title.
layout: "learningpathall"   # All files under learning paths have this same wrapper for Hugo processing.
---
Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
---
title: Deploy Vision Chatbot LLM backend server
weight: 4

layout: learningpathall
---

## Backend Script for Vision Chatbot LLM Server
Once the virtual environment is activated, create a `backend.py` script with the following content. This script downloads the Llama 3.2 Vision model from Hugging Face, performs 4-bit quantization on the model, and then serves it with PyTorch on Arm:

```python
from flask import Flask, request, Response, stream_with_context
from transformers import MllamaForConditionalGeneration, AutoProcessor, TextIteratorStreamer
from threading import Thread
from PIL import Image
import torch
import json
import time
import io
import base64

app = Flask(__name__)

# Load model and processor
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float32)
processor = AutoProcessor.from_pretrained(model_id)
model.eval()

@app.route("/v1/chat/completions", methods=["POST"])
def chat_completions():
    image = None
    prompt = ""

    if "image" in request.files:
        file = request.files["image"]
        image = Image.open(file.stream).convert("RGB")
        prompt = request.form.get("prompt", "")
    elif request.is_json:
        data = request.get_json()
        if "image" in data:
            image_bytes = base64.b64decode(data["image"])
            image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
        if "prompt" in data:
            prompt = data["prompt"]
        elif "messages" in data:
            for msg in data["messages"]:
                if msg.get("role") == "user":
                    prompt = msg.get("content", "")
                    break

    if image is None or not prompt:
        return {"error": "Both image and prompt are required."}, 400

    # Format the prompt
    formatted_prompt = (
        f"<|begin_of_text|><|image|>\n"
        f"<|user|>\n{prompt.strip()}<|end_of_text|>\n"
        "<|assistant|>\n"
    )

    inputs = processor(image, formatted_prompt, return_tensors="pt").to(model.device)
    tokenizer = processor.tokenizer if hasattr(processor, "tokenizer") else processor

    # Initialize the TextIteratorStreamer
    text_streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Define generation arguments
    gen_kwargs = {
        "max_new_tokens": 512,
        "do_sample": False,
        "temperature": 1.0,
        "streamer": text_streamer,
        "eos_token_id": tokenizer.eos_token_id,
    }

    # Run generation in a separate thread
    generation_thread = Thread(target=model.generate, kwargs={**inputs, **gen_kwargs})
    generation_thread.start()

    def stream_response():
        assistant_role_chunk = {
            "id": f"chatcmpl-{int(time.time()*1000)}",
            "object": "chat.completion.chunk",
            "created": int(time.time()),
            "model": model_id,
            "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": None}]
        }
        yield f"data: {json.dumps(assistant_role_chunk)}\n\n"

        for token in text_streamer:
            if token.strip():
                content_chunk = {
                    "id": assistant_role_chunk["id"],
                    "object": "chat.completion.chunk",
                    "created": int(time.time()),
                    "model": model_id,
                    "choices": [{"index": 0, "delta": {"content": token}, "finish_reason": None}]
                }
                yield f"data: {json.dumps(content_chunk)}\n\n"

        finish_chunk = {
            "id": assistant_role_chunk["id"],
            "object": "chat.completion.chunk",
            "created": int(time.time()),
            "model": model_id,
            "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]
        }
        yield f"data: {json.dumps(finish_chunk)}\n\n"
        yield "data: [DONE]\n\n"

    return Response(stream_with_context(stream_response()), mimetype='text/event-stream')

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000, threaded=True)
```
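
As written, the `from_pretrained` call above loads the weights in float32. One way to add the 4-bit weight-only quantization step with Torch AO described earlier is to pass a quantization config when loading the model. The block below is a minimal sketch only, assuming you have `torchao` installed and a `transformers` version that provides `TorchAoConfig`; int4 CPU kernel support depends on your torchao version, and `"int8_weight_only"` is a drop-in alternative if int4 is unavailable on your instance:

```python
# Hypothetical sketch (assumption): quantize weights with Torch AO at load time.
# Requires torchao and a transformers version that provides TorchAoConfig.
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor, TorchAoConfig

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# 4-bit weight-only quantization; group_size trades accuracy against memory savings
quant_config = TorchAoConfig("int4_weight_only", group_size=128)

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,       # activations stay in bfloat16, weights are quantized
    quantization_config=quant_config,
)
processor = AutoProcessor.from_pretrained(model_id)
model.eval()
```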

## Run the Backend Server

You are now ready to run the backend server for the Vision Chatbot.
Use the following command in a terminal to start the backend server:

```bash
LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4 TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 OMP_NUM_THREADS=16 python3 backend.py
```

The environment variables preload the tcmalloc allocator, enable TorchInductor's C++ wrapper and freezing optimizations, and set the OpenMP thread count to 16; adjust `OMP_NUM_THREADS` to match the cores available on your instance.

You should see output similar to the image below when the backend server starts successfully:
![backend](backend_output.png)
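
With the backend running, you can optionally sanity-check the streaming endpoint from another terminal before building the frontend. The snippet below is an illustrative test client, not part of the Learning Path files; it assumes a local image named `test.jpg` and the backend listening on port 5000:

```python
# quick_test.py - hypothetical sanity check for the backend's streaming endpoint
import base64
import json
import requests

# Assumes a local image called test.jpg; replace with any image on disk
with open("test.jpg", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "messages": [{"role": "user", "content": "Describe this image."}],
    "image": b64_image,
    "stream": True,
}

with requests.post("http://localhost:5000/v1/chat/completions", json=payload, stream=True) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data.strip() == "[DONE]":
            break
        delta = json.loads(data)["choices"][0]["delta"]
        # Print streamed content tokens as they arrive
        print(delta.get("content", ""), end="", flush=True)
print()
```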
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
---
title: Inference with Vision Chatbot
weight: 6

layout: learningpathall
---

## Access the Web Application

Open the web application in your browser using the external URL:

```bash
http://[your instance ip]:8501
```

{{% notice Note %}}

To access the links you might need to allow inbound TCP traffic in your instance's security rules. Always review these permissions with caution, as they might introduce security vulnerabilities.

For an Axion instance, you can do this from the gcloud CLI:

```bash
gcloud compute firewall-rules create allow-my-ip \
    --direction=INGRESS \
    --network=default \
    --action=ALLOW \
    --rules=tcp:8501 \
    --source-ranges=[your IP]/32 \
    --target-tags=allow-my-ip
```

For this to work, you must ensure that the allow-my-ip tag is present on your Axion instance.

{{% /notice %}}

## Interact with the LLM

Upload an image and enter a prompt in the UI to generate a response.

The LLM generates a response to your prompt, using the uploaded image as context, as shown in the image below:
![browser_output](browser_output.png)

## Further Interaction and Custom Applications

You can continue to query different images with prompts and observe the responses of the vision model on Arm Neoverse-based CPUs.

This setup demonstrates how you can create various applications and configure your own vision-based LLMs. This Learning Path serves as a guide and example showcasing LLM inference of vision models on Arm CPUs, highlighting the optimized inference available on CPUs.
Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
---
title: Deploy Vision Chatbot LLM frontend server
weight: 5

layout: learningpathall
---

## Frontend Script for Vision Chatbot LLM Server

After activating the virtual environment in a new terminal, you can use the following `frontend.py` script to input an image and a text prompt and interact with the backend. This script uses the Streamlit framework to create a web interface for the vision chatbot LLM server.

Create a `frontend.py` script with the following content:

```python
import streamlit as st
import requests, time, base64, json

st.title("LLM Vision Chatbot on Arm")
st.write("Upload an image and input the prompt. The model will generate response based on the image as context.")

# File uploader for image and text input for prompt
uploaded_image = st.file_uploader("**Upload an image**", type=["png", "jpg", "jpeg"])
user_prompt = st.text_area("**Enter your prompt or question about the image**", "")

# Placeholder for the generated answer and metrics
output_area = st.empty()
metrics_area = st.empty()

if st.button("Generate Response"):
    if uploaded_image is None or user_prompt.strip() == "":
        st.warning("Please provide both the image and prompt before submitting.")
    else:
        # Prepare the request (OpenAI-compatible format with image in base64)
        image_bytes = uploaded_image.read()
        b64_image = base64.b64encode(image_bytes).decode('utf-8')
        # Construct request payload similar to OpenAI ChatCompletion
        payload = {
            "messages": [
                {"role": "user", "content": user_prompt}
            ],
            "image": b64_image,  # custom field for image
            "stream": True,      # token streaming
        }

        # Initialize streaming request to backend
        backend_url = "http://localhost:5000/v1/chat/completions"
        generated_text = ""
        # Make POST request with streaming response
        try:
            with requests.post(backend_url, json=payload, stream=True) as resp:
                # Iterate over the streamed lines from the response
                for line in resp.iter_lines(decode_unicode=True):
                    if line is None or line.strip() == "":
                        continue  # skip empty keep-alive lines
                    # OpenAI SSE format lines begin with "data: "
                    if line.startswith("data: "):
                        data = line[len("data: "):]
                        if data.strip() == "[DONE]":
                            break  # stream finished
                        # Parse the JSON chunk
                        chunk = json.loads(data)
                        # The first chunk contains the role, subsequent contain content
                        delta = chunk["choices"][0]["delta"]
                        if "role" in delta:
                            # Initial role announcement (assistant) – skip it
                            continue
                        if "content" in delta:
                            token = delta["content"]
                            # Append token to the output text
                            generated_text += token
                            # Update the output area with the new partial text
                            output_area.markdown(f"**Assistant:** {generated_text}")

        except requests.exceptions.RequestException as e:
            st.error(f"Error connecting to backend: {e}")
```
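
The script declares a `metrics_area` placeholder but never updates it. If you also want the UI to surface basic inference metrics, in line with the objective of monitoring inference on Arm CPUs, one possible approach is to wrap the streaming loop in a helper like the sketch below. The `render_stream` function and the metrics it reports are illustrative additions, not part of the original script:

```python
# Hypothetical helper: stream tokens into the UI and report simple timing metrics.
import time, json
import streamlit as st

def render_stream(resp, output_area, metrics_area):
    """Render streamed tokens into output_area and basic timing metrics into metrics_area."""
    generated_text = ""
    start = time.time()
    first_token = None
    tokens = 0
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data.strip() == "[DONE]":
            break
        delta = json.loads(data)["choices"][0]["delta"]
        if "content" in delta:
            if first_token is None:
                first_token = time.time()   # time to first token
            tokens += 1
            generated_text += delta["content"]
            output_area.markdown(f"**Assistant:** {generated_text}")
    if tokens and first_token is not None:
        elapsed = time.time() - start
        tps = tokens / max(time.time() - first_token, 1e-6)
        metrics_area.markdown(
            f"*Time to first token: {first_token - start:.2f}s · "
            f"{tokens} tokens in {elapsed:.2f}s (≈ {tps:.1f} tokens/s)*"
        )
    return generated_text
```

You could then call `render_stream(resp, output_area, metrics_area)` inside the `with requests.post(...)` block in place of the inline loop.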

## Run the Frontend Server

You are now ready to run the frontend server for the Vision Chatbot.
Use the following command in a new terminal to start the Streamlit frontend server:

```bash
python3 -m streamlit run frontend.py
```

You should see output similar to the image below when the frontend server starts successfully:
![frontend](frontend_output.png)
