Commit f98098c

ONNX Learning Path
1 parent 59f4931 commit f98098c

8 files changed, +332 −0 lines changed

Lines changed: 54 additions & 0 deletions

---
title: Running Phi-3.5 Vision Model with ONNX Runtime on Cobalt 100

minutes_to_complete: 30

who_is_this_for:
- Software developers, ML engineers, and cloud practitioners looking to deploy Microsoft Phi models on Arm-based servers using ONNX Runtime.

learning_objectives:
- Install ONNX Runtime and download the quantized Phi-3.5 vision model.
- Run the Phi-3.5 model with ONNX Runtime on Azure.
- Analyze performance on Neoverse N2-based Cobalt 100 servers.

prerequisites:
- Access to an Azure Cobalt 100 (or other Arm-based) compute instance with at least 16 cores, 8 GB of RAM, and 32 GB of disk space.
- Basic understanding of Python and machine learning concepts.
- Familiarity with ONNX Runtime and Azure cloud services.
- Knowledge of LLM (Large Language Model) fundamentals.

author: Nobel Chowdary Mandepudi

### Tags
skilllevels: Advanced
armips:
- Neoverse
subjects: Machine Learning
operatingsystems:
- Linux
tools_software_languages:
- Python
- ONNX Runtime
- Microsoft Azure

further_reading:
- resource:
    title: Getting Started with Llama
    link: https://llama.meta.com/get-started
    type: documentation
- resource:
    title: Hugging Face Documentation
    link: https://huggingface.co/docs
    type: documentation
- resource:
    title: Democratizing Generative AI with CPU-Based Inference
    link: https://blogs.oracle.com/ai-and-datascience/post/democratizing-generative-ai-with-cpu-based-inference
    type: blog

### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has a weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths use this wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
Lines changed: 8 additions & 0 deletions

---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
Lines changed: 29 additions & 0 deletions

---
title: Phi 3.5 Chatbot Performance Analysis
weight: 4

layout: learningpathall
---

## Input a Prompt

To begin, skip the image prompt and enter a text prompt, as shown in the example below:
![output](output.png)

Next, download a sample image from the internet by running the following `wget` command in the main directory:
```bash
wget https://cdn.pixabay.com/photo/2020/06/30/22/34/dog-5357794__340.jpg
```

After downloading the image, provide the image prompt with the image file name, then enter the text prompt, as demonstrated in the example below:
![image_output](image_output.png)

## Observe Performance Metrics

As shown in the example above, the LLM chatbot performs inference at a speed of **44 tokens/second**, with a time to first token of approximately **1 second**. This highlights the efficiency and responsiveness of the chatbot in processing queries and generating outputs.

## Further Interaction and Custom Applications

You can continue interacting with the chatbot by asking follow-up prompts and observing the performance metrics displayed in the terminal.

This setup demonstrates how to build and configure applications that use the Phi 3.5 model for text generation from both text and image inputs. It also showcases the optimized performance of running Phi models on Arm CPUs, highlighting the performance gains achieved through this workflow.
Lines changed: 153 additions & 0 deletions

---
title: Run the Chatbot Server
weight: 3

layout: learningpathall
---

## Script for the ONNX Runtime-based LLM Server
Now create a `phi3v.py` script with the following content. This script runs the Phi 3.5 vision model with ONNX Runtime.

```python
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License
import argparse
import os
import glob
import time
from pathlib import Path
import onnxruntime_genai as og

def _find_dir_contains_sub_dir(current_dir: Path, target_dir_name):
    curr_path = Path(current_dir).absolute()
    target_dir = glob.glob(target_dir_name, root_dir=curr_path)
    if target_dir:
        return Path(curr_path / target_dir[0]).absolute()
    else:
        if curr_path.parent == curr_path:
            # Root dir
            return None
        return _find_dir_contains_sub_dir(curr_path / '..', target_dir_name)

def _complete(text, state):
    return (glob.glob(text + "*") + [None])[state]

def run(args: argparse.Namespace):
    print("Loading model...")
    config = og.Config(args.model_path)
    config.clear_providers()
    if args.execution_provider != "cpu":
        print(f"Setting model to {args.execution_provider}...")
        config.append_provider(args.execution_provider)
    model = og.Model(config)
    print("Model loaded")
    processor = model.create_multimodal_processor()
    tokenizer_stream = processor.create_stream()
    interactive = not args.non_interactive
    while True:
        if interactive:
            try:
                import readline
                readline.set_completer_delims(" \t\n;")
                readline.parse_and_bind("tab: complete")
                readline.set_completer(_complete)
            except ImportError:
                # Not available on some platforms. Ignore it.
                pass
            image_paths = [
                image_path.strip()
                for image_path in input(
                    "Image Path (comma separated; leave empty if no image): "
                ).split(",")
            ]
        else:
            if args.image_paths:
                image_paths = args.image_paths
            else:
                image_paths = [str(_find_dir_contains_sub_dir(Path(__file__).parent, "test") / "test_models" / "images" / "australia.jpg")]
        image_paths = [image_path for image_path in image_paths if image_path]
        images = None
        prompt = "<|user|>\n"
        if len(image_paths) == 0:
            print("No image provided")
        else:
            for i, image_path in enumerate(image_paths):
                if not os.path.exists(image_path):
                    raise FileNotFoundError(f"Image file not found: {image_path}")
                print(f"Using image: {image_path}")
                prompt += f"<|image_{i+1}|>\n"
            images = og.Images.open(*image_paths)
        if interactive:
            text = input("Prompt: ")
        else:
            if args.prompt:
                text = args.prompt
            else:
                text = "What is shown in this image?"
        prompt += f"{text}<|end|>\n<|assistant|>\n"
        print("Processing images and prompt...")
        inputs = processor(prompt, images=images)
        print("Generating response...")
        start_time = time.time()
        params = og.GeneratorParams(model)
        params.set_inputs(inputs)
        params.set_search_options(max_length=7680)
        generator = og.Generator(model, params)
        first_token_duration = None
        token_count = 0
        while not generator.is_done():
            generator.generate_next_token()
            new_token = generator.get_next_tokens()[0]
            decoded_token = tokenizer_stream.decode(new_token)
            token_count += 1
            if token_count == 1:
                ft_end = time.time()
                first_token_duration = ft_end - start_time
            print(decoded_token, end="", flush=True)
        end_time = time.time()
        total_run_time = end_time - start_time
        tokens_per_sec = token_count / (end_time - ft_end)
        print()
        print(f"Total Time : {total_run_time:.4f} sec")
        print(f"Time to First Token : {first_token_duration:.4f} sec")
        print(f"Tokens per second : {tokens_per_sec:.2f} tokens/sec")
        for _ in range(3):
            print()
        # Delete the generator to free the captured graph before creating another one
        del generator
        if not interactive:
            break

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-m", "--model_path", type=str, required=True, help="Path to the folder containing the model"
    )
    parser.add_argument(
        "-e", "--execution_provider", type=str, required=True, choices=["cpu", "cuda", "dml"], help="Execution provider to run model"
    )
    parser.add_argument(
        "--image_paths", nargs='*', type=str, required=False, help="Path to the images, mainly for CI usage"
    )
    parser.add_argument(
        '-pr', '--prompt', required=False, help='Input prompts to generate tokens from, mainly for CI usage'
    )
    parser.add_argument(
        '--non-interactive', action=argparse.BooleanOptionalAction, required=False, help='Non-interactive mode, mainly for CI usage'
    )
    args = parser.parse_args()
    run(args)
```

## Run the Server

You are now ready to run the server and start the chatbot.
Use the following command in a terminal to start the server:

```bash
python3 phi3v.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 -e cpu
```
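
The script also supports a non-interactive, single-shot mode through the arguments defined in `phi3v.py` above, which is convenient for scripted runs. The invocation below is an illustrative sketch; the image path is a placeholder and should point to an image file on your instance:

```bash
# Non-interactive run: the flags come from the argparse setup in phi3v.py.
# Replace /path/to/image.jpg with an image available on your instance.
python3 phi3v.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 -e cpu \
    --non-interactive \
    --image_paths /path/to/image.jpg \
    -pr "Describe what is shown in this image."
```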

You should see output similar to the image below when the server starts successfully:
![server](server.png)
Three binary image files are also added in this commit (113 KB, 122 KB, and 45.1 KB).
Lines changed: 88 additions & 0 deletions

---
# User change
title: "Build ONNX Runtime and setup Phi-3.5 vision model"

weight: 2

# Do not modify these elements
layout: "learningpathall"
---

## Before You Begin

This Learning Path demonstrates how to run quantized Phi models on Cobalt 100 servers using ONNX Runtime. Specifically, it focuses on deploying the Phi 3.5 vision model on Arm-based servers running Ubuntu 24.04 LTS. The instructions have been tested on an Azure Dpls_v6 instance with 16 cores.
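
As a quick optional check before you start, you can confirm that your instance is Arm-based:

```bash
# Should print "aarch64" on an Arm-based (Cobalt 100 / Neoverse) instance
uname -m

# Optionally inspect the CPU details reported by the kernel
lscpu | grep -i -E 'architecture|model name'
```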

## Overview

In this Learning Path, you will learn how to build and configure ONNX Runtime to enable efficient LLM inference on Arm CPUs.

The tutorial covers the following steps:
- Building ONNX Runtime, quantizing and converting the Phi 3.5 vision model to the ONNX format.
- Running the model using a Python script with ONNX Runtime to perform LLM inference on the CPU.
- Analyzing the performance.

By the end of this Learning Path, you will have a complete workflow for deploying and running quantized vision models on Arm-based servers.

## Install dependencies

Install the following packages on your Arm-based server instance:

```bash
sudo apt update
sudo apt install python3-pip python3-venv cmake -y
```

## Create a requirements file

Create a `requirements.txt` file using a text editor:

```bash
vim requirements.txt
```

Add the following dependencies to your `requirements.txt` file:

```text
requests
torch
transformers
accelerate
huggingface-hub
pyreadline3
```

## Install Python Dependencies

Create a virtual environment:
```bash
python3 -m venv onnx-env
```

Activate the virtual environment:
```bash
source onnx-env/bin/activate
```

Install the required libraries using pip:
```bash
pip install -r requirements.txt
```
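
Optionally, you can verify that the core packages installed correctly inside the virtual environment:

```bash
# Confirm the key Python packages import cleanly and print their versions
python3 -c "import torch, transformers; print(torch.__version__, transformers.__version__)"
```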

## Clone and build ONNX Runtime

Clone and build the `onnxruntime-genai` repository, which includes the KleidiAI-optimized ONNX Runtime, using the following commands:

```bash
git clone https://github.com/microsoft/onnxruntime-genai.git
cd onnxruntime-genai/
python3 build.py --config Release
cd build/Linux/Release/wheel/
pip install onnxruntime_genai-0.8.0.dev0-cp312-cp312-linux_aarch64.whl
```
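
The wheel filename above corresponds to a specific `onnxruntime-genai` version built against Python 3.12 (`cp312`). If your build produces a different version, you can instead install whichever wheel the build generated in this directory:

```bash
# Install whichever onnxruntime_genai wheel the build produced
pip install onnxruntime_genai-*-linux_aarch64.whl
```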

## Download and Quantize the Model

Navigate to the home directory, then download the quantized model using `huggingface-cli`:
```bash
cd ~
huggingface-cli download microsoft/Phi-3.5-vision-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .
```

You now have the Phi 3.5 vision model in quantized INT4 ONNX format. The next step is to run the model using ONNX Runtime.
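
As an optional check, list the downloaded model directory to confirm the files are in place; the exact file names can vary with the model revision:

```bash
# List the quantized model files downloaded from Hugging Face
ls -lh cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/
```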
