
Commit 33dd5d0

Merge pull request #1755 from nobelchowdary/llama_vision
llama llm vision chatbot LP
2 parents d75e81f + 9b08c79 commit 33dd5d0

File tree

9 files changed: +450 -0 lines changed

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
---
title: Deploy an LLM-based Vision Chatbot with PyTorch and Hugging Face Transformers on Google Axion processors

minutes_to_complete: 45

who_is_this_for: This Learning Path is for software developers, ML engineers, and anyone interested in deploying a production-ready vision chatbot for their application with optimized performance on the Arm architecture.

learning_objectives:
    - Download PyTorch and Torch AO.
    - Install the required dependencies.
    - Build a frontend with Streamlit to accept an image and a prompt.
    - Build a backend that downloads the Llama 3.2 Vision model, quantizes it, and runs it using PyTorch and Transformers.
    - Monitor and analyze inference on Arm CPUs.

prerequisites:
    - A Google Cloud Axion (or other Arm) compute instance with at least 32 cores.
    - Basic understanding of Python and ML concepts.
    - Familiarity with REST APIs and web services.
    - Basic knowledge of Streamlit.
    - Understanding of LLM fundamentals.

author: Nobel Chowdary Mandepudi

### Tags
skilllevels: Advanced
armips:
    - Neoverse
subjects: ML
operatingsystems:
    - Linux
tools_software_languages:
    - Python
    - PyTorch
    - Streamlit
    - Google Axion
    - Demo

further_reading:
    - resource:
        title: Getting started with Llama
        link: https://llama.meta.com/get-started
        type: documentation
    - resource:
        title: Hugging Face Documentation
        link: https://huggingface.co/docs
        type: documentation
    - resource:
        title: Democratizing Generative AI with CPU-based inference
        link: https://blogs.oracle.com/ai-and-datascience/post/democratizing-generative-ai-with-cpu-based-inference
        type: blog


### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
---
title: Deploy Vision Chatbot LLM backend server
weight: 4

layout: learningpathall
---

## Backend Script for the Vision Chatbot LLM Server

Once the virtual environment is activated, create a `backend.py` script with the following content. This script downloads the Llama 3.2 Vision model from Hugging Face, applies 4-bit quantization to the model, and then serves it with PyTorch on Arm:

```python
from flask import Flask, request, Response, stream_with_context
from transformers import MllamaForConditionalGeneration, AutoProcessor, TextIteratorStreamer
from threading import Thread
from PIL import Image
import torch
import json
import time
import io
import base64

app = Flask(__name__)

# Load the model and processor
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float32)

# Apply torchao quantization: 4-bit weights (per group of 32) with dynamic int8 activations
from torchao.dtypes import PlainLayout
from torchao.experimental.packed_linear_int8_dynamic_activation_intx_weight_layout import (
    PackedLinearInt8DynamicActivationIntxWeightLayout,
)
from torchao.experimental.quant_api import int8_dynamic_activation_intx_weight
from torchao.quantization.granularity import PerGroup
from torchao.quantization.quant_api import quantize_
from torchao.quantization.quant_primitives import MappingType

quantize_(
    model,
    int8_dynamic_activation_intx_weight(
        weight_dtype=torch.int4,
        granularity=PerGroup(32),
        has_weight_zeros=True,
        weight_mapping_type=MappingType.SYMMETRIC_NO_CLIPPING_ERR,
        layout=PackedLinearInt8DynamicActivationIntxWeightLayout(target="aten"),
    ),
)

processor = AutoProcessor.from_pretrained(model_id)
model.eval()

@app.route("/v1/chat/completions", methods=["POST"])
def chat_completions():
    image = None
    prompt = ""

    # Accept either a multipart form upload or a JSON body with a base64-encoded image
    if "image" in request.files:
        file = request.files["image"]
        image = Image.open(file.stream).convert("RGB")
        prompt = request.form.get("prompt", "")
    elif request.is_json:
        data = request.get_json()
        if "image" in data:
            image_bytes = base64.b64decode(data["image"])
            image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
        if "prompt" in data:
            prompt = data["prompt"]
        elif "messages" in data:
            for msg in data["messages"]:
                if msg.get("role") == "user":
                    prompt = msg.get("content", "")
                    break

    if image is None or not prompt:
        return {"error": "Both image and prompt are required."}, 400

    # Format the prompt
    formatted_prompt = (
        f"<|begin_of_text|><|image|>\n"
        f"<|user|>\n{prompt.strip()}<|end_of_text|>\n"
        "<|assistant|>\n"
    )

    inputs = processor(image, formatted_prompt, return_tensors="pt").to(model.device)
    tokenizer = processor.tokenizer if hasattr(processor, "tokenizer") else processor

    # Initialize the TextIteratorStreamer
    text_streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Define generation arguments
    gen_kwargs = {
        "max_new_tokens": 512,
        "do_sample": False,
        "temperature": 1.0,
        "streamer": text_streamer,
        "eos_token_id": tokenizer.eos_token_id,
    }

    # Run generation in a separate thread so tokens can be streamed as they are produced
    generation_thread = Thread(target=model.generate, kwargs={**inputs, **gen_kwargs})
    generation_thread.start()

    def stream_response():
        # First chunk announces the assistant role (OpenAI-compatible SSE format)
        assistant_role_chunk = {
            "id": f"chatcmpl-{int(time.time()*1000)}",
            "object": "chat.completion.chunk",
            "created": int(time.time()),
            "model": model_id,
            "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": None}]
        }
        yield f"data: {json.dumps(assistant_role_chunk)}\n\n"

        # Stream each generated token as a content chunk
        for token in text_streamer:
            if token.strip():
                content_chunk = {
                    "id": assistant_role_chunk["id"],
                    "object": "chat.completion.chunk",
                    "created": int(time.time()),
                    "model": model_id,
                    "choices": [{"index": 0, "delta": {"content": token}, "finish_reason": None}]
                }
                yield f"data: {json.dumps(content_chunk)}\n\n"

        # Final chunk signals the end of the stream
        finish_chunk = {
            "id": assistant_role_chunk["id"],
            "object": "chat.completion.chunk",
            "created": int(time.time()),
            "model": model_id,
            "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]
        }
        yield f"data: {json.dumps(finish_chunk)}\n\n"
        yield "data: [DONE]\n\n"

    return Response(stream_with_context(stream_response()), mimetype='text/event-stream')

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000, threaded=True)
```
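
The Llama 3.2 Vision model is a gated repository on Hugging Face, so the download in the script above may fail unless your environment is authenticated with an access token that has been granted access to the model. If you have not already logged in during setup, one way to authenticate from Python is the sketch below; the token value is a placeholder, not a real credential:

```python
from huggingface_hub import login

# Placeholder token: create your own at https://huggingface.co/settings/tokens
# after requesting access to meta-llama/Llama-3.2-11B-Vision-Instruct.
login(token="hf_xxxxxxxxxxxxxxxxxxxx")
```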

## Run the Backend Server

You are now ready to run the backend server for the Vision Chatbot.
Use the following command in a terminal to start it:

```bash
LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4 TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 OMP_NUM_THREADS=16 python3 backend.py
```

You should see output similar to the image below when the backend server starts successfully:

![backend](backend_output.png)
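
Before wiring up the frontend, you can optionally verify the endpoint from another terminal. The sketch below is a minimal smoke test, assuming the server is running locally on port 5000 and that `test.jpg` is a placeholder for an image of your own. It exercises the multipart form path that the backend accepts and prints the OpenAI-style SSE chunks as they arrive:

```python
import requests

# Hypothetical test inputs: replace test.jpg and the prompt with your own.
with open("test.jpg", "rb") as f:
    resp = requests.post(
        "http://localhost:5000/v1/chat/completions",
        files={"image": f},
        data={"prompt": "Describe this image."},
        stream=True,
    )

# The backend streams OpenAI-style SSE chunks; print each line as it arrives.
for line in resp.iter_lines(decode_unicode=True):
    if line:
        print(line)
```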
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
---
title: Inference with Vision Chatbot
weight: 6

layout: learningpathall
---

## Access the Web Application

Open the web application in your browser using the external URL:

```bash
http://[your instance ip]:8501
```

{{% notice Note %}}

To access this URL you might need to allow inbound TCP traffic in your instance's security rules. Always review these permissions with caution, as they can introduce security vulnerabilities.

For an Axion instance, you can do this from the gcloud CLI:

```bash
gcloud compute firewall-rules create allow-my-ip \
    --direction=INGRESS \
    --network=default \
    --action=ALLOW \
    --rules=tcp:8501 \
    --source-ranges=[your IP]/32 \
    --target-tags=allow-my-ip
```

For this to work, you must ensure that the `allow-my-ip` network tag is present on your Axion instance.

{{% /notice %}}
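
If the page does not load, you can quickly check from your local machine whether the port is reachable before debugging further. This is a minimal sketch; the IP address shown is a placeholder for your instance's external IP:

```python
import socket

# Placeholder address: substitute your instance's external IP.
host, port = "203.0.113.10", 8501

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(5)
    reachable = s.connect_ex((host, port)) == 0  # 0 means the TCP connection succeeded

print("Port 8501 is reachable" if reachable else "Port 8501 is not reachable; check your firewall rule")
```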
## Interact with the LLM

You can upload an image and enter a prompt in the UI to generate a response.

You should see the LLM generate a response based on the prompt, using the image as context, as shown in the image below:

![browser_output](browser_output.png)

## Further Interaction and Custom Applications

You can continue to query different images with prompts and observe the responses of the vision model on Arm Neoverse-based CPUs.

This setup demonstrates how you can create various applications and configure your vision-based LLMs. This Learning Path serves as a guide and example showcasing LLM inference with vision models on Arm CPUs, highlighting the optimized inference available on CPUs. To embed the model in your own application, you can also call the backend API directly, as shown in the sketch below.
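
The sketch below is one way to call the backend programmatically from a custom application, assuming the backend from the earlier step is reachable at `localhost:5000` and that `photo.jpg` is a placeholder for your own image. It sends the JSON payload with a base64-encoded image, the same format the Streamlit frontend uses, and prints tokens as they stream back:

```python
import base64
import json
import requests

# Placeholder inputs: replace with your own image path and prompt.
with open("photo.jpg", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "messages": [{"role": "user", "content": "What is in this image?"}],
    "image": b64_image,
    "stream": True,
}

with requests.post("http://localhost:5000/v1/chat/completions", json=payload, stream=True) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data.strip() == "[DONE]":
            break
        delta = json.loads(data)["choices"][0]["delta"]
        # Content chunks carry tokens; the first chunk only announces the assistant role.
        if "content" in delta:
            print(delta["content"], end="", flush=True)
print()
```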
Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
---
title: Deploy Vision Chatbot LLM frontend server
weight: 5

layout: learningpathall
---

## Frontend Script for the Vision Chatbot LLM Server

After activating the virtual environment in a new terminal, you can use the following `frontend.py` script to upload an image, enter a text prompt, and interact with the backend. This script uses the Streamlit framework to create a web interface for the vision chatbot LLM server.

Create a `frontend.py` script with the following content:

```python
import streamlit as st
import requests, time, base64, json

st.title("LLM Vision Chatbot on Arm")
st.write("Upload an image and input the prompt. The model will generate a response based on the image as context.")

# File uploader for the image and text input for the prompt
uploaded_image = st.file_uploader("**Upload an image**", type=["png", "jpg", "jpeg"])
user_prompt = st.text_area("**Enter your prompt or question about the image**", "")

# Placeholder for the generated answer and metrics
output_area = st.empty()
metrics_area = st.empty()

if st.button("Generate Response"):
    if uploaded_image is None or user_prompt.strip() == "":
        st.warning("Please provide both the image and prompt before submitting.")
    else:
        # Prepare the request (OpenAI-compatible format with the image in base64)
        image_bytes = uploaded_image.read()
        b64_image = base64.b64encode(image_bytes).decode('utf-8')
        # Construct a request payload similar to an OpenAI ChatCompletion
        payload = {
            "messages": [
                {"role": "user", "content": user_prompt}
            ],
            "image": b64_image,  # custom field for the image
            "stream": True,      # token streaming
        }

        # Initialize the streaming request to the backend
        backend_url = "http://localhost:5000/v1/chat/completions"
        generated_text = ""
        # Make a POST request with a streaming response
        try:
            with requests.post(backend_url, json=payload, stream=True) as resp:
                # Iterate over the streamed lines from the response
                for line in resp.iter_lines(decode_unicode=True):
                    if line is None or line.strip() == "":
                        continue  # skip empty keep-alive lines
                    # OpenAI SSE format lines begin with "data: "
                    if line.startswith("data: "):
                        data = line[len("data: "):]
                        if data.strip() == "[DONE]":
                            break  # stream finished
                        # Parse the JSON chunk
                        chunk = json.loads(data)
                        # The first chunk contains the role, subsequent chunks contain content
                        delta = chunk["choices"][0]["delta"]
                        if "role" in delta:
                            # Initial role announcement (assistant) - skip it
                            continue
                        if "content" in delta:
                            token = delta["content"]
                            # Append the token to the output text
                            generated_text += token
                            # Update the output area with the new partial text
                            output_area.markdown(f"**Assistant:** {generated_text}")

        except requests.exceptions.RequestException as e:
            st.error(f"Error connecting to backend: {e}")
```

## Run the Frontend Server

You are now ready to run the frontend server for the Vision Chatbot.
Use the following command in a new terminal to start the Streamlit frontend server:

```bash
python3 -m streamlit run frontend.py
```

You should see output similar to the image below when the frontend server starts successfully:

![frontend](frontend_output.png)
