
Commit e1b10fe (parent 1dea376)

Update Vision LLM LP

- Update performance numbers
- Remove "Known issues" section since it's fixed upstream

5 files changed: +25 additions, -38 deletions

content/learning-paths/mobile-graphics-and-gaming/Vision-LLM-inference-on-Android-with-KleidiAI-and-MNN/1-devenv-and-model.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -78,7 +78,7 @@ The table below gives you an explanation of the different arguments:
 
 To learn more about the parameters, refer to the [transformers README.md](https://github.com/alibaba/MNN/tree/master/transformers).
 
-Verify the model is built correct by checking the size of the resulting model. The `Qwen2-VL-2B-Instruct-convert-4bit-per_channel` directory should be atleast 1 GB in size.
+Verify the model is built correct by checking the size of the resulting model. The `Qwen2-VL-2B-Instruct-convert-4bit-per_channel` directory should be at least 1 GB in size.
 
 
 Push the model onto the device:
````

content/learning-paths/mobile-graphics-and-gaming/Vision-LLM-inference-on-Android-with-KleidiAI-and-MNN/2-generate-apk.md

Lines changed: 1 addition & 28 deletions

````diff
@@ -16,7 +16,7 @@ A fork of the upstream MNN repository is set up to enable building the app as an
 cd $HOME
 git clone https://github.com/HenryDen/MNN.git
 cd MNN
-git checkout origin/MNN_commit
+git checkout origin/llm_android_demo
 ```
 
 ## Build the app using Android Studio
@@ -30,33 +30,6 @@ This will trigger a build of the project, and you should see a similar output on
 ```output
 BUILD SUCCESSFUL in 1m 42s
 ```
-#### Known build issues
-
-Depending on your Android Studio environment, you may encounter dependency incompatibility with the MNN project. If the build is not successful, you can walk through the following steps to address two known build issues.
-
-1. Add Gradle namespace
-
-For some Gradle versions, you are required to add a `namespace` to your `build.gradle` file.
-
-
-![Gradle Build menu](gradle_build.png)
-
-From the Android menu, open the highlighted file in the above image and add the following to the `android` field.
-
-```output
-namespace "com.mnn.llm"
-```
-
-2. Align dependencies version
-
-You may see an error in dependencies not having aligned version. Open `app/build.gradle` update the `androidTestImplementation` version:
-
-```output
-dependencies {
-    androidTestImplementation 'androidx.test.espresso:espresso-core:3.5.1'
-    androidTestImplementation 'androidx.test.espresso:espresso-idling-resource:3.5.1'
-}
-```
 
 ### Generate and run the APK
 
````

content/learning-paths/mobile-graphics-and-gaming/Vision-LLM-inference-on-Android-with-KleidiAI-and-MNN/3-benchmark.md

Lines changed: 22 additions & 8 deletions

````diff
@@ -77,12 +77,12 @@ The image features a tiger standing in a grassy field, with its front paws raise
 #################################
 prompt tokens num = 243
 decode tokens num = 70
-vision time = 5.96 s
+vision time = 5.76 s
 audio time = 0.00 s
-prefill time = 1.80 s
-decode time = 2.09 s
-prefill speed = 135.29 tok/s
-decode speed = 33.53 tok/s
+prefill time = 1.26 s
+decode time = 2.02 s
+prefill speed = 192.28 tok/s
+decode speed = 34.73 tok/s
 ##################################
 ```
 
@@ -113,13 +113,27 @@ export LD_LIBRARY_PATH=$PWD
 ./llm_demo models/Qwen-VL-2B-convert-4bit-per_channel/config.json prompt
 ```
 
+The same output should be displayed, with the benchmark printed at the end:
+```output
+#################################
+prompt tokens num = 243
+decode tokens num = 70
+vision time = 2.91 s
+audio time = 0.00 s
+prefill time = 0.91 s
+decode time = 1.56 s
+prefill speed = 266.13 tok/s
+decode speed = 44.96 tok/s
+##################################
+```
+
 This time, you should see an improvement in the benchmark. Below is an example table showing the uplift on three relevant metrics after enabling the KleidiAI kernels.
 
 | Benchmark | Without KleidiAI | With KleidiAI |
 |---------------------|------------------|---------------|
-| Vision Process Time | 5.45s | 5.43 s |
-| Prefill Speed | 132.35 tok/s | 148.30 tok/s |
-| Decode Speed | 21.61 tok/s | 33.26 tok/s |
+| Vision Process Time | 5.76 s | 2.91 s |
+| Prefill Speed | 192.28 tok/s | 266.13 tok/s |
+| Decode Speed | 34.73 tok/s | 44.96 tok/s |
 
 The prefill speed describes how fast the model processes the input prompt. The decode speed corresponds to the rate at which the model generates new tokens after the input is processed
 
````
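As a sanity check on the updated numbers: prefill and decode speed are just token counts divided by the corresponding times, and the KleidiAI uplift is the ratio of the two measurements. A small sketch using the figures from the benchmark output in this diff (the log rounds times to two decimals, so recomputed speeds differ slightly from the reported ones):

```python
# Benchmark figures from the commit's updated logs (tokens and seconds).
prompt_tokens = 243
decode_tokens = 70

# Without KleidiAI ("baseline") vs. with KleidiAI enabled.
prefill_time = {"baseline": 1.26, "kleidiai": 0.91}
decode_time = {"baseline": 2.02, "kleidiai": 1.56}

for variant in ("baseline", "kleidiai"):
    prefill_speed = prompt_tokens / prefill_time[variant]
    decode_speed = decode_tokens / decode_time[variant]
    print(f"{variant}: prefill ~{prefill_speed:.1f} tok/s, decode ~{decode_speed:.1f} tok/s")

# Uplift per the comparison table (vision time: lower is better, so old/new).
vision_speedup = 5.76 / 2.91
prefill_speedup = 266.13 / 192.28
decode_speedup = 44.96 / 34.73
print(f"vision ~{vision_speedup:.2f}x, prefill ~{prefill_speedup:.2f}x, decode ~{decode_speedup:.2f}x")
```

This also makes the size of the win visible at a glance: roughly a 2x reduction in vision processing time and about 1.3-1.4x on prefill and decode throughput.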

content/learning-paths/mobile-graphics-and-gaming/Vision-LLM-inference-on-Android-with-KleidiAI-and-MNN/background.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -12,7 +12,7 @@ MNN is a high-performance, lightweight deep learning framework designed for both
 
 **MNN-LLM** is a large language model (LLM) runtime solution built on the MNN engine, designed to enable local deployment of LLMs across diverse platforms, including mobile devices, PCs, and IoT systems. It supports leading models such as Qianwen, Baichuan, Zhipu, and Llama, ensuring efficient and accessible AI-powered experiences.
 
-KleidiAI, a collection of optimized AI micro-kernels, is integrated into the MNN framework, enhancing the inference performance of large language models (LLMs) within MNN. The Android app in this learning path demonstrates Vision Transformer inference using the MNN framework. You will use KleidiAI to speed up inference for the [Qwen Vision 2B]([https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct)) model.
+KleidiAI, a collection of optimized AI micro-kernels, is integrated into the MNN framework, enhancing the inference performance of large language models (LLMs) within MNN. The Android app in this learning path demonstrates Vision Transformer inference using the MNN framework. You will use KleidiAI to speed up inference for the [Qwen Vision 2B](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) model.
 
 ## Vision Transformer(ViT)
 The ViT is a deep learning model designed for image recognition tasks. Unlike traditional convolutional neural networks (CNNs), which process images using convolutional layers, ViT leverages the transformer architecture originally developed for natural language processing (NLP).
````
