
Commit 7549347

Author: Amal Augustine Jose (committed)
Update to ALP: Vision LLM inference on Android with KleidiAI and MNN
1. Fixed the inconsistency of using both Qwen2.5-VL-3B and Qwen2-VL-2B, which could confuse the user.
2. Fixed the broken optional section on how to quantize a model locally. Original issues:
   a. It used Qwen2-VL-2B while the pre-quantized model was Qwen2.5-VL-3B.
   b. Even with the wrong model, the steps resulted in a model that produced gibberish.
   c. The following changes were made to get local quantization working:
      i. Switched from symmetric quantization to asymmetric quantization.
      ii. Switched from per-channel quantization to block quantization (block size = 64).
   d. Updated the deprecated `huggingface-cli download` to `hf download`.
3. Updated to use `llmexport` from PyPI rather than the Git repo (which has since been updated).
4. Added a recommendation to use a Python virtual environment.
5. Added a wrapper script as a workaround for an issue with `llmexport`.
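The symmetric-to-asymmetric switch above can be illustrated with a toy example: on an all-positive weight slice, a symmetric grid (zero-point fixed at 0) wastes its entire negative half-range, while an asymmetric grid fits the data exactly. This is an illustration only, not MNN's implementation.

```python
# Toy comparison of symmetric vs asymmetric 4-bit quantization on an
# all-positive (skewed) weight slice -- not MNN's actual quantizer.

def max_quant_error(vals, symmetric, bits=4):
    """Quantize, dequantize, and return the worst-case absolute error."""
    if symmetric:
        # Symmetric: zero-point fixed at 0, signed grid [-(2^(b-1)-1), 2^(b-1)-1].
        levels = 2 ** (bits - 1) - 1          # 7 for 4-bit
        scale = max(abs(v) for v in vals) / levels
        deq = [round(v / scale) * scale for v in vals]
    else:
        # Asymmetric: zero-point at min(vals), unsigned grid [0, 2^b - 1].
        levels = 2 ** bits - 1                # 15 for 4-bit
        zero = min(vals)
        scale = (max(vals) - zero) / levels
        deq = [round((v - zero) / scale) * scale + zero for v in vals]
    return max(abs(a - b) for a, b in zip(vals, deq))

# All-positive weights: symmetric mode wastes the negative half of its range.
weights = [0.10 + 0.05 * i for i in range(16)]          # 0.10 .. 0.85

sym_err = max_quant_error(weights, symmetric=True)
asym_err = max_quant_error(weights, symmetric=False)
print(f"symmetric max error:  {sym_err:.4f}")
print(f"asymmetric max error: {asym_err:.4f}")
```

On this deliberately skewed input the asymmetric grid reconstructs the weights almost exactly, while the symmetric grid leaves a visible error on every value.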
1 parent 2254ab3 commit 7549347

File tree

3 files changed: +38, -12 lines


content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/1-devenv-and-model.md

Lines changed: 36 additions & 10 deletions
@@ -46,6 +46,12 @@ pip 24.0 from /usr/lib/python3/dist-packages/pip (python 3.12)
 If Python 3.x is not the default version, try running `python3 --version` and `pip3 --version`.
 {{% /notice %}}
 
+It's recommended to make these changes in a Python virtual environment:
+
+```bash
+python3.12 -m venv vision_llm
+source vision_llm/bin/activate
+```
 
 ## Set up Phone Connection
 
 You need to set up an authorized connection with your phone. The Android SDK Platform Tools package, included with Android Studio, provides Android Debug Bridge (ADB) for transferring files.
@@ -72,7 +78,7 @@ The pre-quantized model is available in Hugging Face, you can download with the
 ```bash
 git lfs install
 git clone https://huggingface.co/taobao-mnn/Qwen2.5-VL-3B-Instruct-MNN
-git checkout 9057334b3f85a7f106826c2fa8e57c1aee727b53
+git checkout a4622194b3c518139e2cb8099e147e3d71975f7a
 ```
 
 ## (Optional) Download and Convert the Model
@@ -81,28 +87,48 @@ If you need to quantize the model with customized parameter, the following comma
 ```bash
 cd $HOME
 pip install -U huggingface_hub
-huggingface-cli download Qwen/Qwen2-VL-2B-Instruct --local-dir ./Qwen2-VL-2B-Instruct/
-git clone https://github.com/wangzhaode/llm-export
-cd llm-export && pip install .
+hf download Qwen/Qwen2.5-VL-3B-Instruct --local-dir ./Qwen2.5-VL-3B-Instruct/
+pip install llmexport
+```
+
+Use `llmexport` to quantize the model with these options:
+
+```bash
+llmexport --path ../Qwen2.5-VL-3B-Instruct/ --export mnn --quant_bit 4 \
+  --quant_block 64 --dst_path Qwen2.5-VL-3B-Instruct-convert-4bit-64qblock
 ```
-Use the `llm-export` repository to quantize the model with these options:
 
+{{% notice Note %}}
+If you run into issues where `llmexport` is not able to access `utils`, try the following:
 ```bash
-llmexport --path ../Qwen2-VL-2B-Instruct/ --export mnn --quant_bit 4 \
-  --quant_block 0 --dst_path Qwen2-VL-2B-Instruct-convert-4bit-per_channel --sym
+# From your project directory (inside the venv)
+cat > llmexport_fixed.py <<'PY'
+import sys, importlib
+# Make "utils" resolve to "llmexport.utils"
+sys.modules.setdefault("utils", importlib.import_module("llmexport.utils"))

+from llmexport.__main__ import main
+if __name__ == "__main__":
+    main()
+PY
+
+# Use this instead of the entry point:
+python llmexport_fixed.py \
+  --path Qwen2.5-VL-3B-Instruct \
+  --export mnn --quant_bit 4 --quant_block 64 \
+  --dst_path Qwen2.5-VL-3B-Instruct-convert-4bit-64qblock
 ```
+{{% /notice %}}
 
 The table below gives you an explanation of the different arguments:
 
 | Parameter | Description | Explanation |
 |------------------|-------------|--------------|
 | `--quant_bit` | MNN quant bit, 4 or 8, default is 4. | `4` represents q4 quantization. |
-| `--quant_block` | MNN quant block, default is 0. | `0` represents per-channel quantization; `128` represents 128 per-block quantization. |
-| `--sym` | Symmetric quantization (without zeropoint); default is False. | The quantization parameter that enables symmetrical quantization. |
+| `--quant_block` | MNN quant block, default is 0. | `0` represents per-channel quantization; `64` represents 64 per-block quantization. |
 
 To learn more about the parameters, see the [transformers README.md](https://github.com/alibaba/MNN/tree/master/transformers).
 
-Verify that the model was built correctly by checking that the `Qwen2-VL-2B-Instruct-convert-4bit-per_channel` directory is at least 1 GB in size.
+Verify that the model was built correctly by checking that the `Qwen2.5-VL-3B-Instruct-convert-4bit-64qblock` directory is at least 2 GB in size.
 
 ## Push the model to Android device

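The asymmetric block quantization scheme the diff adopts (`--quant_bit 4 --quant_block 64`) can be sketched as a toy Python illustration: each block of 64 weights gets its own scale and zero-point, mapping the block's range onto 16 codes. This is an illustration only, not MNN's actual quantization kernel.

```python
# Toy sketch of asymmetric 4-bit block quantization with block size 64.
# Illustration only -- not MNN's actual quantization kernel.
import random

BITS, BLOCK = 4, 64
LEVELS = 2 ** BITS - 1   # 15: highest of the 16 codes per block

def quantize_block(block):
    """Asymmetric: each block gets its own scale and zero-point."""
    zero = min(block)
    scale = (max(block) - zero) / LEVELS or 1.0
    return [round((x - zero) / scale) for x in block], scale, zero

def dequantize_block(codes, scale, zero):
    return [c * scale + zero for c in codes]

random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(256)]  # one fake channel

# Quantize the channel in independent blocks of 64 values.
recon = []
for i in range(0, len(weights), BLOCK):
    codes, scale, zero = quantize_block(weights[i:i + BLOCK])
    assert all(0 <= c <= LEVELS for c in codes)   # every code fits in 4 bits
    recon.extend(dequantize_block(codes, scale, zero))

max_err = max(abs(a - b) for a, b in zip(weights, recon))
print(f"max reconstruction error: {max_err:.4f}")
```

Because each block's grid is fitted to that block's own min/max, outliers in one block no longer inflate the quantization error of the whole channel, which is the advantage block quantization has over per-channel quantization.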
content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/2-benchmark.md

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ Run the following commands to clone the MNN repository and checkout the source t
 cd $HOME
 git clone https://github.com/alibaba/MNN.git
 cd MNN
-git checkout 282cebeb785118865b9c903decc4b5cd98d5025e
+git checkout a739ea5870a4a45680f0e36ba9662ca39f2f4eec
 ```
 
 Create a build directory and run the build script.
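The clone-and-checkout step above pins the build to an exact MNN revision. The general pattern is sketched below on a throwaway local repository so it runs offline; the repo name and commit message are placeholders.

```shell
set -e
# Create a throwaway repo standing in for a cloned project.
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo
cd demo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# Record the exact revision, then check it out (detached HEAD),
# just as the Learning Path checks out a pinned MNN commit.
sha=$(git rev-parse HEAD)
git checkout -q "$sha"
echo "pinned at $(git rev-parse --short HEAD)"
```

Pinning to a full commit SHA, rather than a branch, keeps the build reproducible even as the upstream repository moves.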

content/learning-paths/mobile-graphics-and-gaming/vision-llm-inference-on-android-with-kleidiai-and-mnn/background.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ MNN is a high-performance, lightweight deep learning framework designed for both
 
 **MNN-LLM** is a large language model (LLM) runtime solution built on the MNN engine. It enables local deployment of LLMs across diverse platforms, including mobile devices, PCs, and IoT systems, and supports leading models such as Qianwen, Baichuan, Zhipu, and Llama for efficient, accessible AI-powered experiences.
 
-KleidiAI, a collection of optimized AI micro-kernels, is integrated into the MNN framework to enhance the inference performance of LLMs. In this Learning Path, the Android app demonstrates Vision Transformer inference using the MNN framework. You will use KleidiAI to speed up inference for the [Qwen Vision 2B](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) model.
+KleidiAI, a collection of optimized AI micro-kernels, is integrated into the MNN framework to enhance the inference performance of LLMs. In this Learning Path, the Android app demonstrates Vision Transformer inference using the MNN framework. You will use KleidiAI to speed up inference for the [Qwen2.5 Vision 3B](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) model.
 
 ## Vision Transformer (ViT)
 The Vision Transformer (ViT) is a deep learning model designed for image recognition tasks. Unlike traditional convolutional neural networks (CNNs) that use convolutional layers, ViT leverages the transformer architecture originally developed for natural language processing (NLP).
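ViT's tokenization can be sketched with back-of-the-envelope arithmetic: the image is cut into fixed-size patches, and each flattened patch becomes one input token for the transformer. The numbers below are the common ViT-Base defaults, used here only for illustration.

```python
# Back-of-the-envelope view of ViT's patch tokenization.
# ViT-Base defaults assumed: 224x224 RGB input, 16x16 patches.
image_size = 224        # input resolution (224x224)
patch_size = 16         # each patch is 16x16 pixels
channels = 3            # RGB

patches_per_side = image_size // patch_size        # 14 patches per row/column
num_tokens = patches_per_side ** 2                 # total patch tokens
token_dim = patch_size * patch_size * channels     # values per flattened patch

print(f"{num_tokens} tokens of dimension {token_dim}")
```

Each flattened patch is then linearly projected to the model's embedding size, after which the sequence is processed exactly like a sequence of word embeddings in NLP.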
