54 changes: 54 additions & 0 deletions content/learning-paths/servers-and-cloud-computing/onnx/_index.md
@@ -0,0 +1,54 @@
---
title: Running Phi-3.5 Vision Model with ONNX Runtime on Cobalt 100

minutes_to_complete: 30

who_is_this_for:
- Software developers, ML engineers, and cloud practitioners looking to deploy Microsoft Phi models on Arm-based servers using ONNX Runtime.

learning_objectives:
- Install ONNX Runtime and download the quantized Phi-3.5 vision model.
- Run the Phi-3.5 vision model with ONNX Runtime on Azure.
- Analyze performance on Neoverse N2-based Cobalt 100 servers.

prerequisites:
- Access to an Azure Cobalt 100 (or other Arm-based) compute instance with at least 16 cores, 8GB of RAM, and 32GB of disk space.
- Basic understanding of Python and machine learning concepts.
- Familiarity with ONNX Runtime and Azure cloud services.
- Knowledge of LLM (Large Language Model) fundamentals.


author: Nobel Chowdary Mandepudi

### Tags
skilllevels: Advanced
armips:
- Neoverse
subjects: Machine Learning
operatingsystems:
- Linux
tools_software_languages:
- Python
- ONNX Runtime
- Microsoft Azure

further_reading:
- resource:
title: Getting Started with Llama
link: https://llama.meta.com/get-started
type: documentation
- resource:
title: Hugging Face Documentation
link: https://huggingface.co/docs
type: documentation
- resource:
title: Democratizing Generative AI with CPU-Based Inference
link: https://blogs.oracle.com/ai-and-datascience/post/democratizing-generative-ai-with-cpu-based-inference
type: blog

### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has a weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths use this wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
@@ -0,0 +1,29 @@
---
title: Phi 3.5 Chatbot Performance Analysis
weight: 4

layout: learningpathall
---

## Input a Prompt

To begin, leave the image prompt empty (press Enter) and type a text prompt, as shown in the example below:
![output](output.png)

Next, download a sample image from the internet by running the following `wget` command in your working directory:
```bash
wget https://cdn.pixabay.com/photo/2020/06/30/22/34/dog-5357794__340.jpg
```

After downloading the image, provide the image filename at the image prompt, then enter a text prompt, as shown in the example below:
![image_output](image_output.png)

## Observe Performance Metrics

As shown in the example above, the chatbot generates output at approximately **44 tokens/second**, with a time to first token of approximately **1 second**. This demonstrates how responsive the chatbot is when processing queries and generating output on Arm CPUs.
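These numbers are printed by the `phi3v.py` script you created earlier. The sketch below mirrors how the script derives them; the timestamps are made up purely for illustration:

```python
def report_metrics(start_time, first_token_time, end_time, token_count):
    """Mirror of the metric calculation in phi3v.py."""
    time_to_first_token = first_token_time - start_time
    # phi3v.py divides the total token count by the time spent after the first token
    tokens_per_sec = token_count / (end_time - first_token_time)
    print(f"Time to First Token : {time_to_first_token:.4f} sec")
    print(f"Tokens per second   : {tokens_per_sec:.2f} tokens/sec")

# Illustrative values only: 1 s to the first token, then 100 tokens over the next ~2.3 s
report_metrics(start_time=0.0, first_token_time=1.0, end_time=3.3, token_count=100)
```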

## Further Interaction and Custom Applications

You can continue interacting with the chatbot by asking follow-up prompts and observing the performance metrics displayed in the terminal.

This setup demonstrates how to build applications on the Phi-3.5 vision model that generate text from both text and image inputs, and it highlights the performance you can achieve when running Phi models on Arm CPUs with this workflow.
153 changes: 153 additions & 0 deletions content/learning-paths/servers-and-cloud-computing/onnx/chatbot.md
@@ -0,0 +1,153 @@
---
title: Run the Chatbot Server
weight: 3

layout: learningpathall
---

## Script for the ONNX Runtime-based LLM Server
Now create a `phi3v.py` script with the following content. This script runs the Phi-3.5 vision model with ONNX Runtime.

```python
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License
import argparse
import os
import glob
import time
from pathlib import Path
import onnxruntime_genai as og

def _find_dir_contains_sub_dir(current_dir: Path, target_dir_name):
curr_path = Path(current_dir).absolute()
target_dir = glob.glob(target_dir_name, root_dir=curr_path)
if target_dir:
return Path(curr_path / target_dir[0]).absolute()
else:
if curr_path.parent == curr_path:
# Root dir
return None
return _find_dir_contains_sub_dir(curr_path / '..', target_dir_name)

def _complete(text, state):
return (glob.glob(text + "*") + [None])[state]

def run(args: argparse.Namespace):
print("Loading model...")
config = og.Config(args.model_path)
config.clear_providers()
if args.execution_provider != "cpu":
print(f"Setting model to {args.execution_provider}...")
config.append_provider(args.execution_provider)
model = og.Model(config)
print("Model loaded")
processor = model.create_multimodal_processor()
tokenizer_stream = processor.create_stream()
interactive = not args.non_interactive
while True:
if interactive:
try:
import readline
readline.set_completer_delims(" \t\n;")
readline.parse_and_bind("tab: complete")
readline.set_completer(_complete)
except ImportError:
# Not available on some platforms. Ignore it.
pass
image_paths = [
image_path.strip()
for image_path in input(
"Image Path (comma separated; leave empty if no image): "
).split(",")
]
else:
if args.image_paths:
image_paths = args.image_paths
else:
image_paths = [str(_find_dir_contains_sub_dir(Path(__file__).parent, "test") / "test_models" / "images" / "australia.jpg")]
image_paths = [image_path for image_path in image_paths if image_path]
images = None
prompt = "<|user|>\n"
if len(image_paths) == 0:
print("No image provided")
else:
for i, image_path in enumerate(image_paths):
if not os.path.exists(image_path):
raise FileNotFoundError(f"Image file not found: {image_path}")
print(f"Using image: {image_path}")
prompt += f"<|image_{i+1}|>\n"
images = og.Images.open(*image_paths)
if interactive:
text = input("Prompt: ")
else:
if args.prompt:
text = args.prompt
else:
text = "What is shown in this image?"
prompt += f"{text}<|end|>\n<|assistant|>\n"
print("Processing images and prompt...")
inputs = processor(prompt, images=images)
print("Generating response...")
start_time = time.time()
params = og.GeneratorParams(model)
params.set_inputs(inputs)
params.set_search_options(max_length=7680)
generator = og.Generator(model, params)
first_token_duration = None
token_count = 0
while not generator.is_done():
generator.generate_next_token()
new_token = generator.get_next_tokens()[0]
decoded_token = tokenizer_stream.decode(new_token)
token_count += 1
if token_count == 1:
ft_end = time.time()
first_token_duration = ft_end - start_time
print(decoded_token, end="", flush=True)
end_time = time.time()
total_run_time = end_time - start_time
tokens_per_sec = token_count / (end_time - ft_end)
print()
print(f"Total Time : {total_run_time:.4f} sec")
print(f"Time to First Token : {first_token_duration:.4f} sec")
print(f"Tokens per second : {tokens_per_sec:.2f} tokens/sec")
for _ in range(3):
print()
# Delete the generator to free the captured graph before creating another one
del generator
if not interactive:
break

if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"-m", "--model_path", type=str, required=True, help="Path to the folder containing the model"
)
parser.add_argument(
"-e", "--execution_provider", type=str, required=True, choices=["cpu", "cuda", "dml"], help="Execution provider to run model"
)
parser.add_argument(
"--image_paths", nargs='*', type=str, required=False, help="Path to the images, mainly for CI usage"
)
parser.add_argument(
'-pr', '--prompt', required=False, help='Input prompts to generate tokens from, mainly for CI usage'
)
parser.add_argument(
'--non-interactive', action=argparse.BooleanOptionalAction, required=False, help='Non-interactive mode, mainly for CI usage'
)
args = parser.parse_args()
run(args)
```
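
It is worth noting how the script builds the Phi-3.5 chat prompt: each image gets a numbered `<|image_N|>` placeholder ahead of the user text, and the `<|assistant|>` marker tells the model where to begin generating. The minimal sketch below reproduces that logic for a single image and a single question (the values are placeholders):

```python
# Reproduces the prompt assembly from phi3v.py for one image and one question
image_count = 1                               # number of images passed to og.Images.open()
text = "What is shown in this image?"         # user question (placeholder)

prompt = "<|user|>\n"
for i in range(image_count):
    prompt += f"<|image_{i+1}|>\n"            # <|image_1|>, <|image_2|>, ...
prompt += f"{text}<|end|>\n<|assistant|>\n"
print(prompt)
```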

## Run the Server

You are now ready to run the chatbot server.
Use the following command in a terminal to start the server:

```bash
python3 phi3v.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 -e cpu
```

You should see output similar to the image below when the server starts successfully:
![server](server.png)
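
The argument parser at the bottom of `phi3v.py` also supports a non-interactive mode, which is useful for quick smoke tests or CI. A sketch, assuming the same model directory as above and a local image file (`image.jpg` is a placeholder for any image on disk):

```bash
# Single, non-interactive generation; image.jpg is a placeholder for a local image file
python3 phi3v.py \
  -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 \
  -e cpu \
  --non-interactive \
  --image_paths image.jpg \
  -pr "What is shown in this image?"
```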
88 changes: 88 additions & 0 deletions content/learning-paths/servers-and-cloud-computing/onnx/setup.md
@@ -0,0 +1,88 @@
---
# User change
title: "Build ONNX Runtime and set up the Phi-3.5 vision model"

weight: 2

# Do not modify these elements
layout: "learningpathall"
---

## Before You Begin

This Learning Path demonstrates how to run quantized Phi models on Cobalt 100 servers using ONNX Runtime. Specifically, it focuses on deploying the Phi-3.5 vision model on Arm-based servers running Ubuntu 24.04 LTS. The instructions have been tested on an Azure Dpls_v6 instance with 32 cores.

## Overview

In this Learning Path, you will learn how to build and configure ONNX Runtime to enable efficient LLM inference on Arm CPUs.

The tutorial covers the following steps:
- Building ONNX Runtime and downloading the quantized Phi-3.5 vision model in ONNX format.
- Running the model with a Python script that uses ONNX Runtime to perform LLM inference on the CPU.
- Analyzing the performance.

By the end of this Learning Path, you will have a complete workflow for deploying and running quantized vision models on Arm-based servers.

## Install dependencies

Install the following packages on your Arm-based server instance:

```bash
sudo apt update
sudo apt install python3-pip python3-venv cmake -y
```
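
Ubuntu 24.04 LTS ships with Python 3.12, which matches the `cp312` tag in the wheel filename used later in this Learning Path. You can confirm the tool versions before continuing:

```bash
# Confirm the Python and CMake versions available for the build
python3 --version
cmake --version
```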

## Create a requirements file

Use a text editor such as `vim` to create a `requirements.txt` file:

```bash
vim requirements.txt
```

Add the following dependencies to your `requirements.txt` file:

```text
requests
torch
transformers
accelerate
huggingface-hub
pyreadline3
```

## Install Python Dependencies

Create a virtual environment:
```bash
python3 -m venv onnx-env
```

Activate the virtual environment:
```bash
source onnx-env/bin/activate
```

Install the required libraries using pip:
```bash
pip install -r requirements.txt
```
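
Optionally, run a quick sanity check that the key packages were installed into the virtual environment (this is not required for the rest of the Learning Path):

```bash
# Verify that the core Python dependencies are importable
python3 -c "import torch, transformers; print(torch.__version__, transformers.__version__)"
```
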
## Clone and build ONNX Runtime

Clone and build the `onnxruntime-genai` repository, which includes KleidiAI-optimized ONNX Runtime kernels for Arm CPUs, using the following commands:

```bash
git clone https://github.com/microsoft/onnxruntime-genai.git
cd onnxruntime-genai/
python3 build.py --config Release
cd build/Linux/Release/wheel/
pip install onnxruntime_genai-0.8.0.dev0-cp312-cp312-linux_aarch64.whl
```
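
The wheel filename depends on the `onnxruntime-genai` version that was built and on your Python version, so it may not exactly match the `0.8.0.dev0`/`cp312` name shown above. If the `pip install` command cannot find the file, list the wheel directory and install whichever wheel the build produced:

```bash
# From the wheel directory created by the build step
ls
pip install onnxruntime_genai-*.whl
```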

## Download the Quantized Model

Navigate to your home directory and download the pre-quantized model using `huggingface-cli`:
```bash
cd ~
huggingface-cli download microsoft/Phi-3.5-vision-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .
```
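
You can confirm the download by listing the model directory; it should contain the ONNX model files along with the tokenizer and generation configuration files (exact filenames depend on the Hugging Face repository contents):

```bash
# Inspect the downloaded, pre-quantized model files
ls cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/
```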

The Phi-3.5 vision model, already quantized to INT4 in ONNX format, is now available locally. The next step is to run the model using ONNX Runtime.