Demonstration of running a native Large Language Model (LLM) on Android devices. Currently supported models include:
- Qwen3: 0.6B, 1.7B, 4B...
- Qwen3-VL: 2B, 4B...
- Qwen2.5-Instruct: 0.5B, 1.5B, 3B...
- Qwen2.5-VL: 3B
- DeepSeek-R1-Distill-Qwen: 1.5B
- MiniCPM-DPO/SFT: 1B, 2.7B
- Gemma-3-it: 1B, 4B...
- Phi-4-mini-Instruct: 3.8B
- Llama-3.2-Instruct: 1B
- InternVL-Mono: 2B
- InternLM-3: 8B
- Seed-X: PRO-7B, Instruct-7B
- HunYuan: MT-1.5-1.8B/7B
- 2026/01/04: Update HunYuan-MT-1.5.
- 2025/11/11: Update Qwen3-VL.
- 2025/09/07: Update HunYuan-MT.
- 2025/08/02: Update Seed-X.
- 2025/04/29: Update Qwen3.
- 2025/04/05: Update Qwen2.5 and InternVL-Mono with q4f32 + dynamic_axes.
- 2025/02/22: Support loading in low-memory mode for Qwen, QwenVL, and MiniCPM_2B_single; set `low_memory_mode = true` in `MainActivity.java`.
- 2025/02/07: DeepSeek-R1-Distill-Qwen: 1.5B (please use the Qwen v2.5 `Qwen_Export.py`).
- Download Models:
  - Quick Try: Qwen3-1.7B-Android
- Setup Instructions:
  - Place the downloaded model files into the `assets` folder.
  - Decompress the `*.so` files stored in the `libs/arm64-v8a` folder.
- Model Notes:
  - Demo models are converted from HuggingFace or ModelScope and optimized for extreme execution speed.
  - Inputs and outputs may differ slightly from the original models (see the inspection sketch after this list).
  - For Qwen2VL / Qwen2.5VL, adjust the key variables to match the model parameters:
    - `GLRender.java`: lines 37, 38, 39
    - `project.h`: lines 14, 15, 16, 35, 36, 41, 59, 60
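Because the converted demo models may not expose the same inputs and outputs as the original HuggingFace checkpoints, it can help to inspect the exported graph before wiring it into the app. Below is a minimal sketch using onnxruntime's Python API; `Model_A.onnx` is a placeholder for whichever model file you downloaded (ORT-format `*.ort` files load the same way).

```python
# Minimal sketch: list the inputs/outputs of a converted model with onnxruntime.
# "Model_A.onnx" is a placeholder; point it at the file you placed in assets.
import onnxruntime as ort

session = ort.InferenceSession("Model_A.onnx", providers=["CPUExecutionProvider"])

print("Inputs:")
for inp in session.get_inputs():
    print(f"  {inp.name}  shape={inp.shape}  dtype={inp.type}")

print("Outputs:")
for out in session.get_outputs():
    print(f"  {out.name}  shape={out.shape}  dtype={out.type}")
```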
- ONNX Export Considerations:
  - It is recommended to use dynamic axes and q4f32 quantization (see the export sketch after this list).
  - The `tokenizer.cpp` and `tokenizer.hpp` files are sourced from the mnn-llm repository.
  - Navigate to the `Export_ONNX` folder.
  - Follow the comments in the Python scripts to set the folder paths.
  - Execute the `***_Export.py` script to export the model.
  - Quantize or optimize the ONNX model manually (see the quantization sketch below).
  - Use `onnxruntime.tools.convert_onnx_models_to_ort` to convert models to `*.ort` format. Note that this process automatically adds `Cast` operators that change FP16 multiplication to FP32.
  - The quantization methods are detailed in the `Do_Quantize` folder.
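The repo's `***_Export.py` scripts handle the model-specific rewiring (KV cache, rotary embeddings, and so on); the sketch below only illustrates the dynamic-axes part of the recommendation with a plain `torch.onnx.export` call. The model ID, wrapper class, and file names are illustrative placeholders, not the repo's actual export code.

```python
# Minimal sketch of exporting a decoder with a dynamic sequence axis.
# This is NOT the repo's ***_Export.py; it only shows how dynamic_axes is declared.
import torch
from transformers import AutoModelForCausalLM


class LogitsOnly(torch.nn.Module):
    """Wrap the HF model so the ONNX graph has a single plain tensor output."""

    def __init__(self, lm):
        super().__init__()
        self.lm = lm

    def forward(self, input_ids):
        return self.lm(input_ids=input_ids, use_cache=False).logits


model_id = "Qwen/Qwen2.5-0.5B-Instruct"              # placeholder checkpoint
wrapper = LogitsOnly(AutoModelForCausalLM.from_pretrained(model_id)).eval()

dummy_ids = torch.ones((1, 16), dtype=torch.int64)   # [batch, seq_len]

torch.onnx.export(
    wrapper,
    (dummy_ids,),
    "llm_decoder.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={                                    # seq_len can vary at runtime
        "input_ids": {1: "seq_len"},
        "logits": {1: "seq_len"},
    },
    opset_version=17,
)
```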
- Explore more projects: DakeQQ Projects
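The actual quantization recipes live in the `Do_Quantize` folder; the following is only a rough sketch of a 4-bit-weight / float32-activation (q4f32-style) pass, assuming onnxruntime's `MatMul4BitsQuantizer` (available in recent onnxruntime releases), followed by the ORT-format conversion mentioned in the list above. Block size and file names are placeholder choices.

```python
# Rough q4f32-style quantization sketch (weights to 4-bit, activations stay FP32).
# This is not the Do_Quantize recipe; parameters and paths are placeholders.
import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

model = onnx.load("llm_decoder.onnx")

quantizer = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True)
quantizer.process()
quantizer.model.save_model_to_file(
    "llm_decoder_q4f32.onnx",
    use_external_data_format=True,   # needed once weights exceed the 2 GB protobuf limit
)

# Optional last step: convert to *.ort for on-device loading. As noted above,
# this pass inserts Cast operators that turn FP16 multiplication into FP32.
#   python -m onnxruntime.tools.convert_onnx_models_to_ort <model_folder>
```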
| OS | Device | Backend | Model | Inference (1024 Context) |
|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | Qwen-2-1.5B-Instruct q8f32 | 20 token/s |
| Android 15 | Vivo x200 Pro | MediaTek_9400-CPU | Qwen-3-1.7B-Instruct q4f32 dynamic | 37 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Qwen-3-1.7B-Instruct q4f32 dynamic | 18.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Qwen-2.5-1.5B-Instruct q4f32 dynamic | 20.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Qwen-2-1.5B-Instruct q8f32 | 13 token/s |
| Harmony 3 | Honor 20S | Kirin_810-CPU | Qwen-2-1.5B-Instruct q8f32 | 7 token/s |
| OS | Device | Backend | Model | Inference (1024 Context) |
|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | QwenVL-2-2B q8f32 | 15 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | QwenVL-2-2B q8f32 | 9 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | QwenVL-2.5-3B q4f32 dynamic | 9 token/s |
| OS | Device | Backend | Model | Inference (1024 Context) |
|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | Distill-Qwen-1.5B q4f32 dynamic | 34.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Distill-Qwen-1.5B q4f32 dynamic | 20.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Distill-Qwen-1.5B q8f32 | 13 token/s |
| HyperOS 2 | Xiaomi-14T-Pro | MediaTek_9300+-CPU | Distill-Qwen-1.5B q8f32 | 22 token/s |
| OS | Device | Backend | Model | Inference (1024 Context) |
|---|---|---|---|---|
| Android 15 | Nubia Z50 | 8_Gen2-CPU | MiniCPM4-0.5B q4f32 | 78 token/s |
| Android 13 | Nubia Z50 | 8_Gen2-CPU | MiniCPM-2.7B q8f32 | 9.5 token/s |
| Android 13 | Nubia Z50 | 8_Gen2-CPU | MiniCPM-1.3B q8f32 | 16.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | MiniCPM-2.7B q8f32 | 6 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | MiniCPM-1.3B q8f32 | 11 token/s |
| OS | Device | Backend | Model | Inference (1024 Context) |
|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | Gemma-1.1-it-2B q8f32 | 16 token/s |
| OS | Device | Backend | Model | Inference (1024 Context) |
|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | Phi-2-2B-Orange-V2 q8f32 | 9.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Phi-2-2B-Orange-V2 q8f32 | 5.8 token/s |
| OS | Device | Backend | Model | Inference (1024 Context) |
|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | Llama-3.2-1B-Instruct q8f32 | 25 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Llama-3.2-1B-Instruct q8f32 | 16 token/s |
| OS | Device | Backend | Model | Inference (1024 Context) |
|---|---|---|---|---|
| Harmony 4 | P40 | Kirin_990_5G-CPU | Mono-2B-S1-3 q4f32 dynamic | 10.5 token/s |