
Commit f6eb493

Merge pull request #2322 from amalaugustinejose/vision-llm-inference-on-android-with-kleidiai-and-mnn
Update to ALP: Vision LLM inference on Android with KleidiAI and MNN
2 parents 2254ab3 + 29720ca commit f6eb493

File tree

3 files changed: +39 −12 lines


content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/1-devenv-and-model.md

Lines changed: 37 additions & 10 deletions
@@ -46,6 +46,13 @@ pip 24.0 from /usr/lib/python3/dist-packages/pip (python 3.12)
 If Python 3.x is not the default version, try running `python3 --version` and `pip3 --version`.
 {{% /notice %}}
 
+It is recommended to use a Python virtual environment:
+
+```bash
+python3.12 -m venv vision_llm
+source vision_llm/bin/activate
+```
+
 ## Set up Phone Connection
 
 You need to set up an authorized connection with your phone. The Android SDK Platform Tools package, included with Android Studio, provides Android Debug Bridge (ADB) for transferring files.
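The venv step added above can be sanity-checked before installing packages. This is an illustrative snippet, not part of the commit; it relies only on the documented behavior that `sys.prefix` differs from `sys.base_prefix` inside a virtual environment:

```python
import sys

def in_venv() -> bool:
    """Return True when the interpreter is running inside a virtual environment.

    In a venv, sys.prefix points at the environment directory while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != sys.base_prefix

print("virtual environment active:", in_venv())
```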
@@ -72,7 +79,7 @@ The pre-quantized model is available in Hugging Face, you can download with the
 ```bash
 git lfs install
 git clone https://huggingface.co/taobao-mnn/Qwen2.5-VL-3B-Instruct-MNN
-git checkout 9057334b3f85a7f106826c2fa8e57c1aee727b53
+git checkout a4622194b3c518139e2cb8099e147e3d71975f7a
 ```
 
 ## (Optional) Download and Convert the Model
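A common failure mode with `git lfs` clones is ending up with pointer stubs instead of the real weight files; a stub begins with the literal line `version https://git-lfs.github.com/spec/v1`. A small sketch (the directory name is taken from the clone step; the helper names are illustrative) that flags any files still left as stubs:

```python
from pathlib import Path

# First bytes of every git-lfs pointer file, per the LFS pointer spec.
LFS_MAGIC = b"version https://git-lfs.github.com/spec/v1"

def is_lfs_stub(path: Path) -> bool:
    """True if the file is an un-downloaded git-lfs pointer, not real data."""
    try:
        with path.open("rb") as f:
            return f.read(len(LFS_MAGIC)) == LFS_MAGIC
    except OSError:
        return False

def find_stubs(repo_dir: str) -> list[Path]:
    """List files under repo_dir that are still LFS pointer stubs."""
    return [p for p in Path(repo_dir).rglob("*") if p.is_file() and is_lfs_stub(p)]

# Example (directory created by the git clone above):
# print(find_stubs("Qwen2.5-VL-3B-Instruct-MNN"))
```

If this prints any paths, re-run `git lfs pull` inside the repository before pushing the model to the device.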
@@ -81,28 +88,48 @@ If you need to quantize the model with customized parameter, the following comma
 ```bash
 cd $HOME
 pip install -U huggingface_hub
-huggingface-cli download Qwen/Qwen2-VL-2B-Instruct --local-dir ./Qwen2-VL-2B-Instruct/
-git clone https://github.com/wangzhaode/llm-export
-cd llm-export && pip install .
+hf download Qwen/Qwen2.5-VL-3B-Instruct --local-dir ./Qwen2.5-VL-3B-Instruct/
+pip install llmexport
 ```
-Use the `llm-export` repository to quantize the model with these options:
+Use `llmexport` to quantize the model with these options:
 
 ```bash
-llmexport --path ../Qwen2-VL-2B-Instruct/ --export mnn --quant_bit 4 \
-  --quant_block 0 --dst_path Qwen2-VL-2B-Instruct-convert-4bit-per_channel --sym
+llmexport --path ../Qwen2.5-VL-3B-Instruct/ --export mnn --quant_bit 4 \
+  --quant_block 64 --dst_path Qwen2.5-VL-3B-Instruct-convert-4bit-64qblock
 ```
 
+{{% notice Note %}}
+If you run into issues where llmexport is not able to access utils, try the following:
+```bash
+# From your project dir (inside the venv)
+cat > llmexport_fixed.py <<'PY'
+import sys, importlib
+# make "utils" resolve to "llmexport.utils"
+sys.modules.setdefault("utils", importlib.import_module("llmexport.utils"))
+
+from llmexport.__main__ import main
+if __name__ == "__main__":
+    main()
+PY
+
+# Use this instead of the entrypoint:
+python llmexport_fixed.py \
+  --path Qwen2.5-VL-3B-Instruct \
+  --export mnn --quant_bit 4 --quant_block 64 \
+  --dst_path Qwen2.5-VL-3B-Instruct-convert-4bit-64qblock
+```
+{{% /notice %}}
+
 The table below gives you an explanation of the different arguments:
 
 | Parameter | Description | Explanation |
 |------------------|-------------|--------------|
 | `--quant_bit` | MNN quant bit, 4 or 8, default is 4. | `4` represents q4 quantization. |
-| `--quant_block` | MNN quant block, default is 0. | `0` represents per-channel quantization; `128` represents 128 per-block quantization. |
-| `--sym` | Symmetric quantization (without zeropoint); default is False. | The quantization parameter that enables symmetrical quantization. |
+| `--quant_block` | MNN quant block, default is 0. | `0` represents per-channel quantization; `64` represents 64 per-block quantization. |
 
 To learn more about the parameters, see the [transformers README.md](https://github.com/alibaba/MNN/tree/master/transformers).
 
-Verify that the model was built correctly by checking that the `Qwen2-VL-2B-Instruct-convert-4bit-per_channel` directory is at least 1 GB in size.
+Verify that the model was built correctly by checking that the `Qwen2.5-VL-3B-Instruct-convert-4bit-64qblock` directory is at least 2 GB in size.
 
 ## Push the model to Android device
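The `--quant_bit 4 --quant_block 64` combination in the hunk above means each run of 64 weights shares a single scale factor. A minimal pure-Python sketch of symmetric 4-bit quantization of one such block, purely illustrative and not MNN's actual implementation:

```python
def quantize_block(weights, bits=4):
    """Symmetrically quantize one block of floats to signed integers.

    With 4 bits, values map to the signed range [-7, 7]; one code point
    is left unused so the grid stays symmetric around zero (no zeropoint).
    """
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [round(w / scale) for w in weights]          # integer codes
    return q, scale

def dequantize_block(q, scale):
    """Recover approximate float weights from codes and the shared scale."""
    return [v * scale for v in q]

block = [0.8, -0.4, 0.05, -0.7]                      # a toy "block" of weights
q, scale = quantize_block(block)
restored = dequantize_block(q, scale)
```

With `--quant_block 0` there is instead one scale per output channel; smaller blocks such as 64 track local weight ranges more closely, at the cost of storing more scales.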

content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/2-benchmark.md

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ Run the following commands to clone the MNN repository and checkout the source t
 cd $HOME
 git clone https://github.com/alibaba/MNN.git
 cd MNN
-git checkout 282cebeb785118865b9c903decc4b5cd98d5025e
+git checkout a739ea5870a4a45680f0e36ba9662ca39f2f4eec
 ```
 
 Create a build directory and run the build script.

content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/background.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ MNN is a high-performance, lightweight deep learning framework designed for both
 
 **MNN-LLM** is a large language model (LLM) runtime solution built on the MNN engine. It enables local deployment of LLMs across diverse platforms, including mobile devices, PCs, and IoT systems, and supports leading models such as Qianwen, Baichuan, Zhipu, and Llama for efficient, accessible AI-powered experiences.
 
-KleidiAI, a collection of optimized AI micro-kernels, is integrated into the MNN framework to enhance the inference performance of LLMs. In this Learning Path, the Android app demonstrates Vision Transformer inference using the MNN framework. You will use KleidiAI to speed up inference for the [Qwen Vision 2B](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) model.
+KleidiAI, a collection of optimized AI micro-kernels, is integrated into the MNN framework to enhance the inference performance of LLMs. In this Learning Path, the Android app demonstrates Vision Transformer inference using the MNN framework. You will use KleidiAI to speed up inference for the [Qwen2.5 Vision 3B](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) model.
 
 ## Vision Transformer (ViT)
 The Vision Transformer (ViT) is a deep learning model designed for image recognition tasks. Unlike traditional convolutional neural networks (CNNs) that use convolutional layers, ViT leverages the transformer architecture originally developed for natural language processing (NLP).
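ViT's patch-based tokenization can be made concrete with a little arithmetic: the image is cut into non-overlapping square patches, and each patch becomes one input token. The numbers below are the common ViT defaults (224×224 input, 16×16 patches), not values specific to Qwen2.5-VL:

```python
def num_patches(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens a ViT produces for one image."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)

# 224x224 image, 16x16 patches -> 14 * 14 = 196 tokens
print(num_patches(224, 224, 16))  # → 196
```

The token count grows quadratically with resolution, which is why vision-LLM prompt processing is far more compute-hungry than text-only prefill and benefits directly from KleidiAI's optimized matrix kernels.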
