54 changes: 54 additions & 0 deletions content/learning-paths/servers-and-cloud-computing/onnx/_index.md
@@ -0,0 +1,54 @@
---
title: Running Phi-3.5 Vision Model with ONNX Runtime on Cobalt 100

minutes_to_complete: 30

who_is_this_for:
- Software developers, ML engineers, and cloud practitioners looking to deploy Microsoft Phi models on Arm-based servers using ONNX Runtime.

learning_objectives:
- Install ONNX Runtime and download the quantized Phi-3.5 vision model.
- Run the Phi-3.5 vision model with ONNX Runtime on Azure.
- Analyze performance on Neoverse N2-based Cobalt 100 servers.

prerequisites:
- Access to an Azure Cobalt 100 (or other Arm-based) compute instance with at least 16 cores, 8GB of RAM, and 32GB of disk space.
- Basic understanding of Python and machine learning concepts.
- Familiarity with ONNX Runtime and Azure cloud services.
- Knowledge of LLM (Large Language Model) fundamentals.


author: Nobel Chowdary Mandepudi

### Tags
skilllevels: Advanced
armips:
- Neoverse
subjects: Machine Learning
operatingsystems:
- Linux
tools_software_languages:
- Python
- ONNX Runtime
- Microsoft Azure

further_reading:
- resource:
title: Getting Started with Llama
link: https://llama.meta.com/get-started
type: documentation
- resource:
title: Hugging Face Documentation
link: https://huggingface.co/docs
type: documentation
- resource:
title: Democratizing Generative AI with CPU-Based Inference
link: https://blogs.oracle.com/ai-and-datascience/post/democratizing-generative-ai-with-cpu-based-inference
type: blog

### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has a weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths use this wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
@@ -0,0 +1,29 @@
---
title: Phi 3.5 Chatbot Performance Analysis
weight: 4

layout: learningpathall
---

## Input a Prompt

To begin, leave the image prompt empty (press Enter) and type a text prompt, as shown in the example below:
![output](output.png)

Next, download a sample image from the internet by running the following `wget` command in your working directory:
```bash
wget https://cdn.pixabay.com/photo/2020/06/30/22/34/dog-5357794__340.jpg
```

After downloading the image, provide the image filename at the image prompt, then enter a text prompt, as shown in the example below:
![image_output](image_output.png)

## Observe Performance Metrics

As shown in the example above, the chatbot generates output at approximately **44 tokens/second**, with a time to first token of approximately **1 second**. This demonstrates how responsive the chatbot is when processing queries and generating output on Arm CPUs.
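These numbers are printed by the `phi3v.py` script you created earlier. The sketch below mirrors how the script derives them; the timestamps are made up purely for illustration:

```python
def report_metrics(start_time, first_token_time, end_time, token_count):
    """Mirror of the metric calculation in phi3v.py."""
    time_to_first_token = first_token_time - start_time
    # phi3v.py divides the total token count by the time spent after the first token
    tokens_per_sec = token_count / (end_time - first_token_time)
    print(f"Time to First Token : {time_to_first_token:.4f} sec")
    print(f"Tokens per second   : {tokens_per_sec:.2f} tokens/sec")

# Illustrative values only: 1 s to the first token, then 100 tokens over the next ~2.3 s
report_metrics(start_time=0.0, first_token_time=1.0, end_time=3.3, token_count=100)
```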

## Further Interaction and Custom Applications

You can continue interacting with the chatbot by asking follow-up prompts and observing the performance metrics displayed in the terminal.

This setup demonstrates how to build applications on the Phi-3.5 vision model that generate text from both text and image inputs, and it highlights the performance you can achieve when running Phi models on Arm CPUs with this workflow.
153 changes: 153 additions & 0 deletions content/learning-paths/servers-and-cloud-computing/onnx/chatbot.md
@@ -0,0 +1,153 @@
---
title: Run the Chatbot Server
weight: 3

layout: learningpathall
---

## Script for the ONNX Runtime-based LLM Server
Now create a `phi3v.py` script with the following content. This script runs the Phi-3.5 vision model with ONNX Runtime.

```python
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License
import argparse
import os
import glob
import time
from pathlib import Path
import onnxruntime_genai as og

def _find_dir_contains_sub_dir(current_dir: Path, target_dir_name):
curr_path = Path(current_dir).absolute()
target_dir = glob.glob(target_dir_name, root_dir=curr_path)
if target_dir:
return Path(curr_path / target_dir[0]).absolute()
else:
if curr_path.parent == curr_path:
# Root dir
return None
return _find_dir_contains_sub_dir(curr_path / '..', target_dir_name)

def _complete(text, state):
return (glob.glob(text + "*") + [None])[state]

def run(args: argparse.Namespace):
print("Loading model...")
config = og.Config(args.model_path)
config.clear_providers()
if args.execution_provider != "cpu":
print(f"Setting model to {args.execution_provider}...")
config.append_provider(args.execution_provider)
model = og.Model(config)
print("Model loaded")
processor = model.create_multimodal_processor()
tokenizer_stream = processor.create_stream()
interactive = not args.non_interactive
while True:
if interactive:
try:
import readline
readline.set_completer_delims(" \t\n;")
readline.parse_and_bind("tab: complete")
readline.set_completer(_complete)
except ImportError:
# Not available on some platforms. Ignore it.
pass
image_paths = [
image_path.strip()
for image_path in input(
"Image Path (comma separated; leave empty if no image): "
).split(",")
]
else:
if args.image_paths:
image_paths = args.image_paths
else:
image_paths = [str(_find_dir_contains_sub_dir(Path(__file__).parent, "test") / "test_models" / "images" / "australia.jpg")]
image_paths = [image_path for image_path in image_paths if image_path]
images = None
prompt = "<|user|>\n"
if len(image_paths) == 0:
print("No image provided")
else:
for i, image_path in enumerate(image_paths):
if not os.path.exists(image_path):
raise FileNotFoundError(f"Image file not found: {image_path}")
print(f"Using image: {image_path}")
prompt += f"<|image_{i+1}|>\n"
images = og.Images.open(*image_paths)
if interactive:
text = input("Prompt: ")
else:
if args.prompt:
text = args.prompt
else:
text = "What is shown in this image?"
prompt += f"{text}<|end|>\n<|assistant|>\n"
print("Processing images and prompt...")
inputs = processor(prompt, images=images)
print("Generating response...")
start_time = time.time()
params = og.GeneratorParams(model)
params.set_inputs(inputs)
params.set_search_options(max_length=7680)
generator = og.Generator(model, params)
first_token_duration = None
token_count = 0
while not generator.is_done():
generator.generate_next_token()
new_token = generator.get_next_tokens()[0]
decoded_token = tokenizer_stream.decode(new_token)
token_count += 1
if token_count == 1:
ft_end = time.time()
first_token_duration = ft_end - start_time
print(decoded_token, end="", flush=True)
end_time = time.time()
total_run_time = end_time - start_time
tokens_per_sec = token_count / (end_time - ft_end)
print()
print(f"Total Time : {total_run_time:.4f} sec")
print(f"Time to First Token : {first_token_duration:.4f} sec")
print(f"Tokens per second : {tokens_per_sec:.2f} tokens/sec")
for _ in range(3):
print()
# Delete the generator to free the captured graph before creating another one
del generator
if not interactive:
break

if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"-m", "--model_path", type=str, required=True, help="Path to the folder containing the model"
)
parser.add_argument(
"-e", "--execution_provider", type=str, required=True, choices=["cpu", "cuda", "dml"], help="Execution provider to run model"
)
parser.add_argument(
"--image_paths", nargs='*', type=str, required=False, help="Path to the images, mainly for CI usage"
)
parser.add_argument(
'-pr', '--prompt', required=False, help='Input prompts to generate tokens from, mainly for CI usage'
)
parser.add_argument(
'--non-interactive', action=argparse.BooleanOptionalAction, required=False, help='Non-interactive mode, mainly for CI usage'
)
args = parser.parse_args()
run(args)
```
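
It is worth noting how the script builds the Phi-3.5 chat prompt: each image gets a numbered `<|image_N|>` placeholder ahead of the user text, and the `<|assistant|>` marker tells the model where to begin generating. The minimal sketch below reproduces that logic for a single image and a single question (the values are placeholders):

```python
# Reproduces the prompt assembly from phi3v.py for one image and one question
image_count = 1                               # number of images passed to og.Images.open()
text = "What is shown in this image?"         # user question (placeholder)

prompt = "<|user|>\n"
for i in range(image_count):
    prompt += f"<|image_{i+1}|>\n"            # <|image_1|>, <|image_2|>, ...
prompt += f"{text}<|end|>\n<|assistant|>\n"
print(prompt)
```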

## Run the Server

You are now ready to run the chatbot server.
Use the following command in a terminal to start the server:

```bash
python3 phi3v.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 -e cpu
```

You should see output similar to the image below when the server starts successfully:
![server](server.png)
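
The argument parser at the bottom of `phi3v.py` also supports a non-interactive mode, which is useful for quick smoke tests or CI. A sketch, assuming the same model directory as above and a local image file (`image.jpg` is a placeholder for any image on disk):

```bash
# Single, non-interactive generation; image.jpg is a placeholder for a local image file
python3 phi3v.py \
  -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 \
  -e cpu \
  --non-interactive \
  --image_paths image.jpg \
  -pr "What is shown in this image?"
```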
88 changes: 88 additions & 0 deletions content/learning-paths/servers-and-cloud-computing/onnx/setup.md
@@ -0,0 +1,88 @@
---
# User change
title: "Build ONNX Runtime and set up the Phi-3.5 vision model"

weight: 2

# Do not modify these elements
layout: "learningpathall"
---

## Before You Begin

This Learning Path demonstrates how to run quantized Phi models on Cobalt 100 servers using ONNX Runtime. Specifically, it focuses on deploying the Phi-3.5 vision model on Arm-based servers running Ubuntu 24.04 LTS. The instructions have been tested on an Azure Dpls_v6 instance with 32 cores.

## Overview

In this Learning Path, you will learn how to build and configure ONNX Runtime to enable efficient LLM inference on Arm CPUs.

The tutorial covers the following steps:
- Building ONNX Runtime and downloading the quantized Phi-3.5 vision model in ONNX format.
- Running the model with a Python script that uses ONNX Runtime to perform LLM inference on the CPU.
- Analyzing the performance.

By the end of this Learning Path, you will have a complete workflow for deploying and running quantized vision models on Arm-based servers.

## Install dependencies

Install the following packages on your Arm-based server instance:

```bash
sudo apt update
sudo apt install python3-pip python3-venv cmake -y
```
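
Ubuntu 24.04 LTS ships with Python 3.12, which matches the `cp312` tag in the wheel filename used later in this Learning Path. You can confirm the tool versions before continuing:

```bash
# Confirm the Python and CMake versions available for the build
python3 --version
cmake --version
```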

## Create a requirements file

Use a text editor such as `vim` to create a `requirements.txt` file:

```bash
vim requirements.txt
```

Add the following dependencies to your `requirements.txt` file:

```text
requests
torch
transformers
accelerate
huggingface-hub
pyreadline3
```

## Install Python Dependencies

Create a virtual environment:
```bash
python3 -m venv onnx-env
```

Activate the virtual environment:
```bash
source onnx-env/bin/activate
```

Install the required libraries using pip:
```bash
pip install -r requirements.txt
```
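
Optionally, run a quick sanity check that the key packages were installed into the virtual environment (this is not required for the rest of the Learning Path):

```bash
# Verify that the core Python dependencies are importable
python3 -c "import torch, transformers; print(torch.__version__, transformers.__version__)"
```
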
## Clone and build ONNX Runtime

Clone and build the `onnxruntime-genai` repository, which includes KleidiAI-optimized ONNX Runtime kernels for Arm CPUs, using the following commands:

```bash
git clone https://github.com/microsoft/onnxruntime-genai.git
cd onnxruntime-genai/
python3 build.py --config Release
cd build/Linux/Release/wheel/
pip install onnxruntime_genai-0.8.0.dev0-cp312-cp312-linux_aarch64.whl
```
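
The wheel filename depends on the `onnxruntime-genai` version that was built and on your Python version, so it may not exactly match the `0.8.0.dev0`/`cp312` name shown above. If the `pip install` command cannot find the file, list the wheel directory and install whichever wheel the build produced:

```bash
# From the wheel directory created by the build step
ls
pip install onnxruntime_genai-*.whl
```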

## Download the Quantized Model

Navigate to your home directory and download the pre-quantized model using `huggingface-cli`:
```bash
cd ~
huggingface-cli download microsoft/Phi-3.5-vision-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .
```
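
You can confirm the download by listing the model directory; it should contain the ONNX model files along with the tokenizer and generation configuration files (exact filenames depend on the Hugging Face repository contents):

```bash
# Inspect the downloaded, pre-quantized model files
ls cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/
```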

The Phi-3.5 vision model, already quantized to INT4 in ONNX format, is now available locally. The next step is to run the model using ONNX Runtime.