Demonstration of running a native Large Language Model (LLM) on Android devices. Currently supported models include:
- Qwen3: 0.6B, 1.7B, 4B...
- Qwen3-VL: 2B, 4B...
- Qwen2.5-Instruct: 0.5B, 1.5B, 3B...
- Qwen2.5-VL: 3B
- DeepSeek-R1-Distill-Qwen: 1.5B
- MiniCPM-DPO/SFT: 1B, 2.7B
- Gemma-3-it: 1B, 4B...
- Phi-4-mini-Instruct: 3.8B
- Llama-3.2-Instruct: 1B
- InternVL-Mono: 2B
- InternLM-3: 8B
- Seed-X: PRO-7B, Instruct-7B
- HunYuan: MT-1.5-1.8B/7B
- 2026/01/04: Update HunYuan-MT-1.5.
- 2025/11/11: Update Qwen3-VL.
- 2025/09/07: Update HunYuan-MT.
- 2025/08/02: Update Seed-X.
- 2025/04/29: Update Qwen3.
- 2025/04/05: Update Qwen2.5 and InternVL-Mono with q4f32 + dynamic_axes.
- 2025/02/22: Support loading in low-memory mode for Qwen, QwenVL, and MiniCPM_2B_single; set `low_memory_mode = true` in `MainActivity.java`.
- 2025/02/07: DeepSeek-R1-Distill-Qwen: 1.5B (please use the Qwen v2.5 `Qwen_Export.py`).
- Download Models:
  - Quick Try: Qwen3-1.7B-Android
- Setup Instructions:
  - Place the downloaded model files into the `assets` folder.
  - Decompress the `*.so` files stored in the `libs/arm64-v8a` folder.
- Model Notes:
  - Demo models are converted from HuggingFace or ModelScope and optimized for extreme execution speed.
  - Inputs and outputs may differ slightly from the original models (see the inspection sketch after this list).
  - For Qwen2VL / Qwen2.5VL, adjust the key variables to match the model parameters:
    - `GLRender.java`: lines 37, 38, 39
    - `project.h`: lines 14, 15, 16, 35, 36, 41, 59, 60
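Because the converted demo models may not expose the same inputs and outputs as the original HuggingFace checkpoints, it can help to inspect the exported graph before wiring it into the app. Below is a minimal sketch using onnxruntime's Python API; `Model_A.onnx` is a placeholder for whichever model file you downloaded (ORT-format `*.ort` files load the same way).

```python
# Minimal sketch: list the inputs/outputs of a converted model with onnxruntime.
# "Model_A.onnx" is a placeholder; point it at the file you placed in assets.
import onnxruntime as ort

session = ort.InferenceSession("Model_A.onnx", providers=["CPUExecutionProvider"])

print("Inputs:")
for inp in session.get_inputs():
    print(f"  {inp.name}  shape={inp.shape}  dtype={inp.type}")

print("Outputs:")
for out in session.get_outputs():
    print(f"  {out.name}  shape={out.shape}  dtype={out.type}")
```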
- ONNX Export Considerations:
  - It is recommended to use dynamic axes and q4f32 quantization (see the export sketch after this list).
  - The `tokenizer.cpp` and `tokenizer.hpp` files are sourced from the mnn-llm repository.
  - Navigate to the `Export_ONNX` folder.
  - Follow the comments in the Python scripts to set the folder paths.
  - Execute the `***_Export.py` script to export the model.
  - Quantize or optimize the ONNX model manually (see the quantization sketch below).
  - Use `onnxruntime.tools.convert_onnx_models_to_ort` to convert models to `*.ort` format. Note that this process automatically adds `Cast` operators that change FP16 multiplication to FP32.
  - The quantization methods are detailed in the `Do_Quantize` folder.
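The repo's `***_Export.py` scripts handle the model-specific rewiring (KV cache, rotary embeddings, and so on); the sketch below only illustrates the dynamic-axes part of the recommendation with a plain `torch.onnx.export` call. The model ID, wrapper class, and file names are illustrative placeholders, not the repo's actual export code.

```python
# Minimal sketch of exporting a decoder with a dynamic sequence axis.
# This is NOT the repo's ***_Export.py; it only shows how dynamic_axes is declared.
import torch
from transformers import AutoModelForCausalLM


class LogitsOnly(torch.nn.Module):
    """Wrap the HF model so the ONNX graph has a single plain tensor output."""

    def __init__(self, lm):
        super().__init__()
        self.lm = lm

    def forward(self, input_ids):
        return self.lm(input_ids=input_ids, use_cache=False).logits


model_id = "Qwen/Qwen2.5-0.5B-Instruct"              # placeholder checkpoint
wrapper = LogitsOnly(AutoModelForCausalLM.from_pretrained(model_id)).eval()

dummy_ids = torch.ones((1, 16), dtype=torch.int64)   # [batch, seq_len]

torch.onnx.export(
    wrapper,
    (dummy_ids,),
    "llm_decoder.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={                                    # seq_len can vary at runtime
        "input_ids": {1: "seq_len"},
        "logits": {1: "seq_len"},
    },
    opset_version=17,
)
```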
- Explore more projects: DakeQQ Projects
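The actual quantization recipes live in the `Do_Quantize` folder; the following is only a rough sketch of a 4-bit-weight / float32-activation (q4f32-style) pass, assuming onnxruntime's `MatMul4BitsQuantizer` (available in recent onnxruntime releases), followed by the ORT-format conversion mentioned in the list above. Block size and file names are placeholder choices.

```python
# Rough q4f32-style quantization sketch (weights to 4-bit, activations stay FP32).
# This is not the Do_Quantize recipe; parameters and paths are placeholders.
import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

model = onnx.load("llm_decoder.onnx")

quantizer = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True)
quantizer.process()
quantizer.model.save_model_to_file(
    "llm_decoder_q4f32.onnx",
    use_external_data_format=True,   # needed once weights exceed the 2 GB protobuf limit
)

# Optional last step: convert to *.ort for on-device loading. As noted above,
# this pass inserts Cast operators that turn FP16 multiplication into FP32.
#   python -m onnxruntime.tools.convert_onnx_models_to_ort <model_folder>
```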
| OS | Device | Backend | Model | Inference (1024 Context) |
|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | Qwen-2-1.5B-Instruct q8f32 | 20 token/s |
| Android 15 | Vivo x200 Pro | MediaTek_9400-CPU | Qwen-3-1.7B-Instruct q4f32 dynamic | 37 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Qwen-3-1.7B-Instruct q4f32 dynamic | 18.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Qwen-2.5-1.5B-Instruct q4f32 dynamic | 20.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Qwen-2-1.5B-Instruct q8f32 | 13 token/s |
| Harmony 3 | Honor 20S | Kirin_810-CPU | Qwen-2-1.5B-Instruct q8f32 | 7 token/s |
| OS | Device | Backend | Model | Inference (1024 Context) |
|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | QwenVL-2-2B q8f32 | 15 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | QwenVL-2-2B q8f32 | 9 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | QwenVL-2.5-3B q4f32 dynamic | 9 token/s |
| OS | Device | Backend | Model | Inference (1024 Context) |
|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | Distill-Qwen-1.5B q4f32 dynamic | 34.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Distill-Qwen-1.5B q4f32 dynamic | 20.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Distill-Qwen-1.5B q8f32 | 13 token/s |
| HyperOS 2 | Xiaomi-14T-Pro | MediaTek_9300+-CPU | Distill-Qwen-1.5B q8f32 | 22 token/s |
| OS | Device | Backend | Model | Inference (1024 Context) |
|---|---|---|---|---|
| Android 15 | Nubia Z50 | 8_Gen2-CPU | MiniCPM4-0.5B q4f32 | 78 token/s |
| Android 13 | Nubia Z50 | 8_Gen2-CPU | MiniCPM-2.7B q8f32 | 9.5 token/s |
| Android 13 | Nubia Z50 | 8_Gen2-CPU | MiniCPM-1.3B q8f32 | 16.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | MiniCPM-2.7B q8f32 | 6 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | MiniCPM-1.3B q8f32 | 11 token/s |
| OS | Device | Backend | Model | Inference (1024 Context) |
|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | Gemma-1.1-it-2B q8f32 | 16 token/s |
| OS | Device | Backend | Model | Inference (1024 Context) |
|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | Phi-2-2B-Orange-V2 q8f32 | 9.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Phi-2-2B-Orange-V2 q8f32 | 5.8 token/s |
| OS | Device | Backend | Model | Inference (1024 Context) |
|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | Llama-3.2-1B-Instruct q8f32 | 25 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Llama-3.2-1B-Instruct q8f32 | 16 token/s |
| OS | Device | Backend | Model | Inference (1024 Context) |
|---|---|---|---|---|
| Harmony 4 | P40 | Kirin_990_5G-CPU | Mono-2B-S1-3 q4f32 dynamic | 10.5 token/s |