05 Mar 14:33

github-actions

6b1db4c

3.4.1 Latest


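MNN 3.4.1 Release Notes

Release date: March 2026

📌 Overview

MNN 3.4.1 focuses on three core themes: Qwen3.5 model support with the Linear Attention operator, LLM resource management optimization, and security fixes:

Qwen3.5 Support and Linear Attention: A newly implemented Linear Attention operator covers the CPU, Metal, OpenCL, and Vulkan backends (performance not yet optimized) and supports the hybrid attention architecture of the Qwen3.5 series; llmexport adds export support for the corresponding models.
LLM Resource Management Optimization: Each LLM instance has its own built-in Executor and uses ExecutorScope in all public methods, so that compute resources are correctly scoped and released promptly; this resolves resource leaks when MNN is called from Python.
Security and Stability: Fixes memory safety vulnerabilities in several Shape operators and execution operators; fixes the HQQ quantization OOM and the large-vocabulary Embedding overflow; fixes several LLM and GPU backend defects.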

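🚀 Highlights

Qwen3.5 Model Support: New support for the Qwen3.5 and Qwen3.5-MoE model series, covering both export and inference
Linear Attention Operator: Implemented on all four backends, CPU/Metal/OpenCL/Vulkan (performance not yet optimized), with support for Gated Delta Rule recurrent state updates
Vulkan CoopMat Conv1x1: The Conv1x1 operator on the Vulkan backend supports cooperative matrix acceleration, further improving matrix computation performance
Built-in LLM Executor: Each LLM instance carries its own independent Executor, making resource release more reliable under the Python bindings
Memory Safety Fixes: Fixed out-of-bounds access, zero-stride, duplicate-index, and other vulnerabilities in 7 Shape/execution operators
HQQ Large-Model Quantization Fix: Fixed the HQQ quantization OOM on large models such as Qwen3.5-27B
Sana Diffusion Enhancements: iOS and Android both support new features including Sana style transfer, Omni audio output, and video input
Metal Backend Enhancements: MetalConvolutionDepthwise supports Clone; fixed INT8/INT4 Conv2D computation errors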

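✨ New Features

LLM/VLM

Qwen3.5 Model Support: Supports export and inference for Qwen3.5 and Qwen3.5-MoE models, including the hybrid linear attention architecture
Linear Attention Operator: A new implementation comprising Conv1D + SiLU activation, QKV split, GQA (grouped query attention), L2 normalization, and Gated Delta Rule recurrent state updates
- CPU: Complete implementation, including management of the convolution-state and recurrent-state buffers
- Metal: Three compute pipelines (conv_silu, conv_state_update, gated_delta_rule)
- OpenCL: Buffer-mode implementation with dedicated OpenCL kernels
- Vulkan: Implemented as three GLSL compute shaders
- All of the above are functional implementations; performance still needs optimization
Built-in LLM Executor: Each Llm instance creates an independent Executor on construction, all public methods use ExecutorScope automatically, and the destructor ensures resources are fully released
Added a Tokenizer Demo (tokenizer_demo.cpp)

GPU Backends

Vulkan CoopMat Conv1x1: Added the VulkanConv1x1Coop implementation, including C4/COOP data layout conversion and INT4/INT8 weight conversion shaders
Metal Clone Support: MetalConvolutionDepthwise supports the onClone() operation

Apps and Tools

Sana Diffusion (iOS): Added style transfer, Omni audio output, and video input support; added a batch test framework
Sana Diffusion (Android): Added a native Sana JNI layer and a Kotlin session wrapper; added a Diffusion settings screen
Android Debug Tools: Added several Stetho dumper plugins for Benchmark/Download/Market/Sana/OpenAPI
Android Smoke Tests: Added a complete smoke test framework covering environment checks, installation, UI capture, regression tests, and report generation
Sana Standalone App: Added standalone script tools under apps/sana/ that support Android and host-side benchmarking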

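🔒 Security Fixes

Fixed memory safety vulnerabilities in several operators:

CPURandomUniform: Added a size <= 0 boundary check and low < high validation; added type-specific handling
ShapeSliceTf: Fixed out-of-bounds access caused by negative or out-of-range begin values
ShapeSpaceToBatchND: Added an out-of-bounds check on blockSize + spatialStart and blockData <= 0 validation
ShapeSpaceToDepth: Replaced MNN_ASSERT with proper error returns; added checks that H/W are divisible by blockSize
ShapeSqueeze: Added an upper bound on the number of dimensions and axis range validation
ShapeStridedSlice: Added a zero-stride check and replaced MNN_ASSERT with graceful failure
ShapeTranspose: Added detection of duplicate permutation indices to prevent invalid memory access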

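🐛 Bug Fixes

HQQ Quantization OOM: Fixed single-GPU memory exhaustion when quantizing large models such as Qwen3.5-27B by adopting a chunked quantization strategy
Embedding Integer Overflow: For large vocabularies (~240K tokens), the DiskEmbedding offset type changed from int to size_t
LLM JSON Merge: Fixed the recursive JSON merge problem in merge_and_clear for the Jinja config
Multimodal OOB Crash: Fixed an out-of-bounds crash for multimodal models in llm_bench
Reranker Demo Crash: Fixed a reranker_demo crash when no model is loaded
Metal INT8/INT4 Conv2D: Fixed errors in the weighti8i4conv2d operator test
Benchmark Crash: Fixed the Android Benchmark crash
Sana Resize Threshold: Corrected the Sana resize threshold setting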

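📚 Other Improvements

CI: Added an automated LLM PR Review workflow; upgraded the pymnn release workflow to macOS-14
Android Tests: Added a large number of unit tests (ChatPresenter, ChatRouter, ChatInput, ModelListManager, ModelMarket, etc.)
iOS App: Localization updated throughout; added a backend configuration UI; added local model index management
Android App: Model market greatly expanded; added a main settings page; enhanced Debug Activity

🙏 Acknowledgements

We sincerely thank all contributors for their valuable contributions to this release:

@jxt1234 - Metal Clone support and backend fixes
@yanxing - LLM Benchmark fixes and CI improvements
@若遗 - Sana Android integration and Benchmark fixes

📦 Breaking Changes

LLM instances now have a built-in independent Executor; llm_demo.cpp no longer needs to create an external ExecutorScope (it has been removed automatically)

Full changelog: 3.4.0...3.4.1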

MNN 3.4.1 Release Notes

Release Date: March 2026

📌 Overview

MNN 3.4.1 focuses on three core themes: Qwen3.5 Model Support with Linear Attention, LLM Resource Management Optimization, and Security Fixes:

Qwen3.5 Support & Linear Attention: Implements a new Linear Attention operator across the CPU, Metal, OpenCL, and Vulkan backends (performance not yet optimized) to support the Qwen3.5 series hybrid attention architecture; llmexport adds the corresponding model export capabilities.
LLM Resource Management: Each LLM instance now creates its own Executor and uses ExecutorScope in all public methods, ensuring proper resource scoping and timely release, especially in Python binding scenarios.
Security & Stability: Fixes memory safety vulnerabilities in multiple shape and execution operators; fixes HQQ quantization OOM and large-vocabulary embedding overflow; addresses multiple LLM and GPU backend issues.

🚀 Highlights

Qwen3.5 Model Support: Full support for Qwen3.5 and Qwen3.5-MoE series models, including export and inference
Linear Attention Operator: Implemented across all four backends (CPU/Metal/OpenCL/Vulkan) with Gated Delta Rule recurrent state updates
Vulkan CoopMat Conv1x1: Cooperative matrix acceleration for Conv1x1 on Vulkan backend
Built-in LLM Executor: Each LLM instance manages its own Executor, enabling reliable resource cleanup in Python bindings
Memory Safety Fixes: Fixed 7 shape/execution operators with out-of-bounds access, zero-stride, and duplicate index vulnerabilities
HQQ Large Model Fix: Fixed OOM when quantizing large models like Qwen3.5-27B with HQQ
Sana Diffusion Enhancement: iOS/Android support for Sana style transfer, Omni audio output, video input, and more
Metal Backend Enhancement: MetalConvolutionDepthwise Clone support; INT8/INT4 Conv2D fix

✨ New Features

LLM/VLM

Qwen3.5 Model Support: Export and inference support for Qwen3.5 and Qwen3.5-MoE models with hybrid linear attention architecture
Linear Attention Operator: Full implementation including Conv1D + SiLU activation, QKV split, GQA (Grouped Query Attention), L2 normalization, and Gated Delta Rule recurrent state update
- CPU: Complete implementation with convolution state and recurrent state buffer management
- Metal: Three compute pipelines (conv_silu, conv_state_update, gated_delta_rule)
- OpenCL: Buffer mode implementation with dedicated OpenCL kernels
- Vulkan: Three GLSL compute shaders
- All of the above are functional implementations; performance still needs optimization
Built-in LLM Executor: Each Llm instance creates its own Executor at construction; all public methods use ExecutorScope automatically; the destructor ensures complete resource cleanup
Added Tokenizer Demo (tokenizer_demo.cpp)
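
The Gated Delta Rule recurrent state update named in the Linear Attention item can be illustrated with a small single-head sketch. It assumes the commonly published form of the rule, S_t = alpha * S_{t-1} * (I - beta * k k^T) + beta * v k^T with output o_t = S_t q, applied to L2-normalized q and k. The function names and the list-of-lists layout are hypothetical; the notes do not describe MNN's kernels, which may differ in layout and detail.

```python
import math


def l2_normalize(x, eps=1e-6):
    """Scale a vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in x)) + eps
    return [v / norm for v in x]


def gated_delta_step(state, q, k, v, alpha, beta):
    """One recurrent step for a single head (illustrative only).

    state: d_v x d_k matrix (list of rows), the running key -> value memory
    alpha: decay gate in (0, 1]; beta: write strength in (0, 1]
    Returns (new_state, output).
    """
    k = l2_normalize(k)
    q = l2_normalize(q)
    new_state = []
    for row, v_i in zip(state, v):
        # What the decayed memory currently predicts for key k.
        pred = alpha * sum(s * kj for s, kj in zip(row, k))
        # Delta rule: move that prediction towards v_i by a fraction beta.
        err = v_i - pred
        new_state.append([alpha * s + beta * err * kj for s, kj in zip(row, k)])
    out = [sum(s * qj for s, qj in zip(row, q)) for row in new_state]
    return new_state, out
```

With beta = 1, reading the state back with the key that was just written returns the written value, which is the property that distinguishes the delta rule from a purely additive linear-attention update.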

GPU Backends

Vulkan CoopMat Conv1x1: New VulkanConv1x1Coop with C4/COOP data layout conversion and INT4/INT8 weight conversion shaders
Metal Clone Support: MetalConvolutionDepthwise supports onClone() operation

Apps & Tools

Sana Diffusion (iOS): Style transfer, Omni audio output, video input support; batch test framework
Sana Diffusion (Android): Native JNI layer and Kotlin session wrapper; Diffusion settings UI
Android Debug Tools: Multiple Stetho dumper plugins for Benchmark/Download/Market/Sana/OpenAPI
Android Smoke Tests: Complete smoke testing framework with environment checks, installation, UI capture, regression tests, and report generation
Sana Standalone App: New apps/sana/ standalone scripts for Android and host benchmarking

🔒 Security Fixes

Fixed memory safety vulnerabilities in multiple operators:

CPURandomUniform: Added size <= 0 bounds check and low < high validation; type-specific handling
ShapeSliceTf: Fixed out-of-range begin values causing out-of-bounds access
ShapeSpaceToBatchND: Added blockSize + spatialStart overflow check and blockData <= 0 validation
ShapeSpaceToDepth: Replaced MNN_ASSERT with proper error returns; added H/W divisibility checks
ShapeSqueeze: Added dimension count upper bound and axis range validation
ShapeStridedSlice: Added zero-stride checks, replacing MNN_ASSERT with graceful failure
ShapeTranspose: Added duplicate permutation index detection to prevent invalid memory access
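
Two of the checks listed above, duplicate permutation indices and zero strides, are easy to show in isolation. The sketch below is not MNN code: both function names are hypothetical, and returning False or None stands in for the "error return instead of MNN_ASSERT" behaviour the notes describe.

```python
def validate_permutation(perm, rank):
    """Return True only if perm is a permutation of 0..rank-1."""
    if len(perm) != rank:
        return False
    seen = [False] * rank
    for axis in perm:
        if axis < 0 or axis >= rank or seen[axis]:
            return False  # out-of-range or duplicate index
        seen[axis] = True
    return True


def strided_slice_length(begin, end, stride):
    """Number of elements a slice selects, or None for an invalid stride."""
    if stride == 0:
        return None  # fail gracefully instead of asserting
    return len(range(begin, end, stride))
```

Rejecting the input during shape computation means a later kernel never sees an index that would read or write outside the tensor.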

🐛 Bug Fixes

HQQ Quantization OOM: Fixed single-GPU OOM when quantizing large models (e.g., Qwen3.5-27B) using chunk-based quantization
Embedding Integer Overflow: Changed DiskEmbedding offset from int to size_t for large vocabularies (~240K tokens)
LLM JSON Merge: Fixed recursive JSON merge issue in merge_and_clear for jinja config
Multimodal OOB Crash: Fixed out-of-bounds crash for multimodal models in llm_bench
Reranker Demo Crash: Fixed crash when model is not loaded in reranker_demo
Metal INT8/INT4 Conv2D: Fixed weighti8i4conv2d op test errors
Benchmark Crash: Fixed Android benchmark crash
Sana Resize Threshold: Fixed Sana resize threshold value
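
The Embedding Integer Overflow item can be made concrete with a little arithmetic. The sketch below assumes a hypothetical row width of 5120 elements at 2 bytes each (neither figure is from the notes) and models a 32-bit signed int by two's-complement truncation, which is what typically happens in practice even though signed overflow is formally undefined in C++.

```python
def wrap_int32(value):
    """Reinterpret an integer as a two's-complement 32-bit signed value."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def embedding_offset(token_id, hidden_size, bytes_per_element):
    """Exact byte offset of a token's row in a flat embedding file."""
    return token_id * hidden_size * bytes_per_element


# Hypothetical sizes: ~240K vocabulary, 5120-wide rows, 2 bytes per element.
exact = embedding_offset(239_999, 5120, 2)
as_int32 = wrap_int32(exact)
print(exact, as_int32)
```

The exact offset exceeds 2^31 - 1, so a 32-bit offset turns negative for the last rows of the table; a 64-bit unsigned type such as size_t holds it without loss.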

📚 Other Improvements

CI: Added LLM PR Review automation; upgraded pymnn release workflow to macOS-14
Android Tests: Added extensive unit tests (ChatPresenter, ChatRouter, ChatInput, ModelListManager, ModelMarket, etc.)
iOS App: Comprehensive localization updates; backend configuration UI; local model index management
Android App: Model market expansion; new main settings page; enhanced Debug Activity

🙏 Acknowledgements

We sincerely thank all contributors for their valuable contributions to this release:

@jxt1234 - Metal Clone support and backend fixes
@yanxing - LLM benchmark fixes and CI improvements
@若遗 - Sana Android integration and benchmark fixes

📦 Breaking Changes

LLM instances now have a built-in Executor; external ExecutorScope creation in llm_demo.cpp is no longer needed (automatically removed)
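
The scoping pattern behind this change can be sketched in Python. This is an illustration of the idea only: MNN's Executor and ExecutorScope are C++ classes, and every name below (executor_scope, current_executor, ToyLlm) is hypothetical.

```python
import threading
from contextlib import contextmanager

_current = threading.local()


def current_executor():
    """Executor active on this thread, or None."""
    stack = getattr(_current, "stack", [])
    return stack[-1] if stack else None


@contextmanager
def executor_scope(executor):
    """Make `executor` current for the duration of the with-block."""
    stack = getattr(_current, "stack", None)
    if stack is None:
        stack = _current.stack = []
    stack.append(executor)
    try:
        yield executor
    finally:
        stack.pop()  # always restore the previous executor


class ToyLlm:
    """Hypothetical stand-in for an LLM object that owns its executor."""

    def __init__(self, name):
        self._executor = f"executor-of-{name}"

    def response(self, prompt):
        # Every public method enters the instance's own scope.
        with executor_scope(self._executor):
            return current_executor(), prompt.upper()
```

Because each public method enters and leaves the scope itself, callers no longer have to set one up, and whatever executor was current before the call is restored afterwards.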

Full Changelog: ht...

Contributors

yanxing and jxt1234


07 Feb 06:22

github-actions

3.4.0

a874b30

3.4.0

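MNN 3.4.0 Release Notes

Release date: February 2026

📌 Overview

MNN 3.4.0 focuses on three core themes: deeper GPU/QNN backend capabilities, optimization of Attention computation and long-context memory usage, and GPU runtime stability in production:

Deeper GPU/QNN Capabilities: The Vulkan backend adds LLM inference support and introduces CoopMat matrix acceleration instructions; the Metal backend supports TensorAPI and Flash Attention; the QNN backend extends support to the Qwen3 series and VL models, and adds direct Python export and OmniQuant quantization.
Attention and Long-Context Memory Optimization: The CPU and Metal backends fully support Flash Attention; the CPU supports KV Cache quantization; Prefix KV Cache support is added; a new unified attention_mode configuration option significantly reduces memory usage in long-context scenarios.
GPU Production Runtime Stability: Added an iOS background detection mechanism. When the app moves to the background the system rejects GPU computation, and MNN now returns the correct error code; also fixed stability issues in several GPU backends.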

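🚀 Highlights

Vulkan LLM Support: The Vulkan backend adds LLM inference support, covering more Android devices
Vulkan CoopMat Acceleration: Vulkan supports cooperative matrix (CoopMat) instructions, greatly accelerating matrix multiplication
Metal TensorAPI Support: The Metal backend supports TensorAPI, giving a large performance gain on M5 chips
Metal Flash Attention: The Metal backend implements Flash Attention, significantly reducing memory usage
CPU Flash Attention: The CPU backend supports Flash Attention, with a new unified attention_mode configuration option
CPU KV Cache Quantization: CPU Attention supports KV Cache quantization, reducing memory usage
Prefix KV Cache: Supports Prefix KV Cache, improving long-text inference efficiency
Speculative Decoding Enhancements: Metal supports Eagle3 and Lookahead speculative decoding; Eagle3 weights are provided for the Qwen3 series
QNN Enhancements: The QNN backend supports the Qwen3 series and VL models, direct Python export, and OmniQuant quantization
Mixed-Precision Quantization: llmexport supports mixed-precision quantization through a config file
llmexport Refactoring: Reworked the model export logic with a more complete model abstraction
Loop Operator GPU Optimization: OpenCL and Metal optimize the Loop operator, supporting a pure GPU implementation with no CPU fallback
RISC-V Vector (RVV) Optimization: Core operators are optimized throughout with vector intrinsics
Vulkan Buffer FP16: Vulkan buffer mode fully supports the FP16 compute path
KleidiAI Integration: Added a build option; the KleidiAI fp32 depthwise convolution kernels are enabled by default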

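✨ New Features

LLM/VLM

Flash Attention: Supported on the CPU and Metal backends; Metal offers three selectable modes
KV Cache Quantization: CPU Attention supports 8-bit asymmetric quantization of Query/Key/Value
Prefix KV Cache: Supports Prefix KV Cache, improving long-text inference efficiency
Speculative Decoding: The Metal backend supports Eagle3 and Lookahead speculative decoding; Eagle3 weights are provided for the Qwen3 series
llmexport Refactoring: Reworked the model export logic; supports mixed-precision quantization configs and weight-free export
New attention_mode Option: Configures Attention behavior in one place and deprecates the quant_qkv option
- CPU: 0, 1, 2, 8, 9, 10 (default 8), controls Flash Attention and QKV quantization
- GPU (Metal): 0, 8, 16 (default 8), controls how Flash Attention is implemented
Added support for the Fun-Audio-Chat-8B model; LoRA supports cloning LayerNorm; llm_bench supports JSON output

GPU and QNN Backends

Vulkan: Added LLM inference support; CoopMat matrix acceleration; buffer mode fully supports FP16
Metal: TensorAPI support (performance gain on M5 chips); iOS background detection mechanism; Loop operator optimization
OpenCL: Pure GPU implementation of the Loop operator; mmap weight storage
QNN: Supports the Qwen3 series and VL models; direct Python export; OmniQuant quantization

Tools and Apps

MNNConvert: Added a dump pass and ConvPad fusion optimization
mnncli refactoring: Added tests, a Linux build, and QNN library download
Added support for Supertonic TTS, Sana diffusion, and the Sherpa-MNN TTS demo
MNN Chat (Android): Model market improvements, multi-image input, debug tools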

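⚡ Performance Optimizations

CPU: LayerNorm/MatMul/binary broadcast optimizations; lower thread pool overhead; Attention Softmax cache optimization
RISC-V Vector: Comprehensive optimization of core functions including pack/unpack, transpose, math, convolution, Softmax, and CV
KleidiAI: Added a build option; the fp32 depthwise convolution kernels are enabled by default

🐛 Bug Fixes

Core: Fixed a crash when axis() returns nullptr; added _SSE_MNNSoftmax to fix a crash when AVX2 is disabled
GPU: Fixed the error when an LLM runs in the background on iOS; fixed several computation and stability issues in OpenCL/Vulkan/Metal
LLM: Fixed FastVLM/Eagle export issues; fixed CPU Attention quantization overflow
Tools: Fixed issues in MNN2QNNModel, diffusion export, mnncli, and others

📚 Other Improvements

Documentation: Added documentation for NPU export and Attention quantization parameters
CI: Added a script that syncs GitHub and internal code automatically; SmolVLM tests; LLM nightly tests
Build: KleidiAI enabled by default; cross-platform support for the RVV macros; Linux mnncli build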

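🙏 Acknowledgements

We sincerely thank all external contributors for their valuable contributions to this release:

@ihb2032 - Comprehensive RISC-V Vector (RVV) optimization
@jxt1234 - Performance optimizations and bug fixes
@HenryDen - KleidiAI integration
@Juude - TTS/App optimization and mnncli refactoring
@vra - Supertonic TTS support
@zlaazlaa - Diffusion export fixes
@bolun365 - MNN library build script
@JunKnows - OpenCL/Vulkan fixes
@rainyl - CMake improvements
@LudovicoYIN - Quantization tool fixes
@codefuturedalao - Code optimization
@EricMoin - Variable initialization fix
@Edward-Elric233 - MinGW build support
@AliasJeff - Documentation and CLI improvements
@jules-ai - Documentation corrections
@vvverily - App copy corrections

📦 Breaking Changes

The quant_qkv option is deprecated; use attention_mode instead

Full changelog: 3.3.0...3.4.0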

MNN 3.4.0 Release Notes

Release Date: February 2026

📌 Overview

MNN 3.4.0 focuses on three core themes: Deepening GPU/QNN Backend Capabilities, Optimizing Attention Computation & Long-Context Memory Usage, and GPU Runtime Stability:

GPU/QNN Capability Enhancement: Vulkan backend adds LLM inference support with CoopMat matrix acceleration; Metal backend supports TensorAPI and Flash Attention; QNN backend extends to support Qwen3 series and VL models with direct Python export and OmniQuant quantization.
Attention & Long-Context Memory Optimization: CPU and Metal backends fully support Flash Attention; CPU supports KV Cache quantization; Prefix KV Cache support added; new unified attention_mode configuration option significantly reduces memory usage for long-context scenarios.
GPU Runtime Stability: Added an iOS background detection mechanism: when the app moves to the background the system rejects GPU computation, and MNN now returns the correct error code; also fixed multiple GPU backend stability issues.
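
The memory saving attributed to Flash Attention in the overview comes from processing keys and values in tiles with a running softmax, so the full score vector is never held at once. The sketch below shows that online-softmax idea for a single query in plain Python. It is a generic illustration, not MNN's CPU or Metal kernel; the function names and block size are hypothetical, and the usual 1/sqrt(d) scaling is assumed to be folded into q.

```python
import math


def attention_naive(q, keys, values):
    """Reference: softmax(q.k) weighted sum, materializing all scores."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    dim = len(values[0])
    return [sum(wi * v[d] for wi, v in zip(w, values)) / total for d in range(dim)]


def attention_tiled(q, keys, values, block=2):
    """Same result, visiting keys/values one block at a time."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Only one block of scores plus a fixed-size accumulator is live at any time, so working memory no longer grows with the context length.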

🚀 Highlights

Vulkan LLM Support: Vulkan backend now supports LLM inference, extending coverage to more Android devices
Vulkan CoopMat Acceleration: Vulkan supports Cooperative Matrix (CoopMat) instructions for significantly faster matrix multiplication
Metal TensorAPI Support: Metal backend supports TensorAPI, significant performance boost on M5 chips
Metal Flash Attention: Metal backend implements Flash Attention with significantly reduced memory usage
CPU Flash Attention: CPU backend supports Flash Attention with new unified attention_mode option
CPU KV Cache Quantization: CPU Attention supports KV Cache quantization to reduce memory usage
Prefix KV Cache: Support for Prefix KV Cache, improving long-text inference efficiency
Speculative Decoding Enhancement: Metal supports Eagle3 and Lookahead speculative decoding; Qwen3 Eagle3 weights provided
QNN Enhancement: QNN backend supports Qwen3 series and VL models; direct Python export; OmniQuant quantization support
Mixed Precision Quantization: llmexport supports mixed precision quantization via config file
llmexport Refactoring: Refactored model export logic with improved model abstraction
Loop Op GPU Optimization: OpenCL and Metal optimize Loop operator, pure GPU implementation without CPU fallback
RISC-V Vector (RVV) Optimization: Comprehensive intrinsic optimization across core operations
Vulkan Buffer FP16: Full FP16 compute path for Vulkan buffer mode
KleidiAI Integration: Added compile option to enable KleidiAI fp32 depth-wise kernels by default

✨ New Features

LLM/VLM

Flash Attention: CPU and Metal backends support Flash Attention, Metal with three selectable modes
KV Cache Quantization: CPU Attention supports 8-bit asymmetric quantization for Query/Key/Value
Prefix KV Cache: Support for Prefix KV Cache, improving long-text inference efficiency
Speculative Decoding: Metal supports Eagle3 and Lookahead; Qwen3 Eagle3 weights provided
llmexport Refactoring: Refactored model export logic, supports mixed precision quantization and weight-free export
New attention_mode Option: Unified configuration for Attention behavior, deprecates quant_qkv
- CPU: 0, 1, 2, 8, 9, 10 (default 8), controls Flash Attention and QKV quantization
- GPU (Metal): 0, 8, 16 (default 8), controls Flash Attention implementation
Added Fun-Audio-Chat-8B support; LoRA supports cloning LayerNorm; llm_bench supports JSON output
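
The attention_mode values listed above can be checked before a configuration is written out. The sketch below encodes only what the notes state (the allowed values per backend and the default of 8). The meaning of each value is not spelled out in the notes and is not guessed here, and the "backend_type" key and the make_config helper are assumptions made for illustration.

```python
import json

# Values listed in the release notes for each backend.
ALLOWED_ATTENTION_MODES = {
    "cpu": {0, 1, 2, 8, 9, 10},
    "metal": {0, 8, 16},
}
DEFAULT_ATTENTION_MODE = 8  # stated default for both CPU and Metal


def make_config(backend, attention_mode=DEFAULT_ATTENTION_MODE):
    """Build a config dict, rejecting modes the notes do not list."""
    allowed = ALLOWED_ATTENTION_MODES[backend]
    if attention_mode not in allowed:
        raise ValueError(
            f"attention_mode {attention_mode} is not listed for {backend}"
        )
    # "backend_type" is an assumed key name used only for this sketch.
    return {"backend_type": backend, "attention_mode": attention_mode}


print(json.dumps(make_config("cpu", 10)))
```

The generated dict deliberately carries no quant_qkv key, since the notes mark that option as deprecated in favour of attention_mode.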

GPU/QNN Backends

Vulkan: Added LLM inference support; CoopMat matrix acceleration; full FP16 buffer mode
Metal: TensorAPI support (M5 chip performance boost); iOS background detection; Loop op optimization
OpenCL: Pure GPU Loop operator implementation; mmap weight storage
QNN: Supports Qwen3 series and VL models; direct Python export; OmniQuant quantization

Tools & Apps

MNNConvert: Added dump pass and ConvPad fusion optimization
mnncli refactoring: Added tests, Linux build, QNN library download
Added Supertonic TTS, Sana diffusion, Sherpa-MNN TTS Demo support
MNN Chat (Android): Model market optimization, multi-image input, Debug tools

⚡ Performance Optimizations

CPU: LayerNorm/MatMul/binary broadcast optimization; ThreadPool overhead reduction; Attention Softmax cache optimization
RISC-V Vector: Comprehensive optimization for pack/unpack, transpose, math, conv, Softmax, CV functions
KleidiAI: Added compile option to enable fp32 depth-wise kernels by default

🐛 Bug Fixes

Core: Fixed axis() nullptr crash; added _SSE_MNNSoftmax to fix a crash when AVX2 is disabled
GPU: Fixed iOS LLM background execution error; fixed multiple OpenCL/Vulkan/Metal computation and stability issues
LLM: Fixed FastVLM/Eagle export issues; fixed CPU Attention quant overflow
Tools: Fixed MNN2QNNModel, diffusion export, mnncli issues

📚 Other Improvements

Documentation: Added NPU export and Attention quantization parameter docs
CI: Added GitHub/internal code auto-sync script; SmolVLM test; LLM nightly test
Build: KleidiAI enabled by default; RVV macro cross-platform support; Linux mnncli build

🙏 Acknowledgements

We sincerely thank all external contributors for their valuable contributions to this release:

@ihb2032 - Comprehensive RISC-V Vector (RVV) optimization
@jxt1234 - Performance optimizations and bug fixes
@HenryDen - KleidiAI integration
@Juude - TTS/App optimization and mnncli refactoring
@vra - Supertonic TTS support
@zlaazlaa - Diffusion export fixes
@bolun365 - MNN library build script
@JunKnows - OpenCL/Vulkan fixes
@rainyl - CMake ...

Contributors

Juude, jxt1234, and 14 other contributors


31 Oct 05:54

jxt1234

3.3.0

5047919

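3.3.0 NPU Support / SME2 Instruction Acceleration / EAGLE Speculative Decoding Acceleration

MNN 3.3 Release Note

1. Enhanced Large Language Model (LLM) Capabilities

New model support:
- Supports mainstream open-source models including Qwen2.5-Omni, Qwen3-VL, GPT-OSS, MiniCPM-4, Fast VLM, and GTE Reranker.
- Supports Attention Sink and Sliding Window Attention
LLM inference optimization:
- Added EAGLE-3 speculative decoding; Llama3-8B decode performance on a Mac CPU improves by up to 2.24x.
- Improved the Python interface, which now supports multimodal inference, step-by-step inference, retrieval of context information, and more.
Quantization and accuracy:
- Integrated the HQQ quantization algorithm; its accuracy is close to AWQ's while quantization takes far less time (perplexity on Qwen 2.5-0.5B under each scheme: original 17.83; AWQ 17.08; HQQ 16.85)
- Supports SmoothQuant and per-tensor input quantization.
- Supports 4/8-bit DiskEmbedding quantization, with faster cache reads.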

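2. Hardware Acceleration and NPU Support

CPU acceleration:
- Implemented support for the SME (Scalable Matrix Extension) instruction set, substantially improving LLM and CV model performance on Armv9 devices.
  - Qwen2.5-1.5B prefill performance on SME2 improves by 2-3x compared with Arm86.
  - ResNet50 FP16 single-threaded inference is accelerated 3x.
CUDA backend LLM support:
- Upgraded Cutlass to 4.0.0 (in LLM mode); it is now downloaded at build time.
- Added a CUDA Attention operator so that LLM models can run.
- Supports low-memory computation with int4 / int8 weights to reduce LLM GPU memory usage.
GPU backend fixes:
- Fixed errors when running Qwen Omni on OpenCL, and corrected wrong Attention results from OpenCL on some MTK chips.
- Added MD5 verification of OpenCL kernels to avoid cache pollution.
NPU LLM support:
- Supports running LLM and Vision models on Qualcomm QNN (NPU).
- Added LLM support on the MediaTek (MTK) NPU.

NPU reference performance data:

Xiaomi 14 - Qwen3-4B-int4

Dimensity 9300 compute box - Qwen3-4B-int4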

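3. Framework Features and Stability Improvements

Core framework improvements:
- Added the set_order interface, which allows the VARP layout to be changed dynamically.
- Fixed several crashes, including:
  - StridedSlice crashing with a zero shape when axes are shrunk to a scalar;
  - Module crashing when loaded with too few inputs;
  - Resize Error on the Arm82 backend (caused by a CPUResizeCache management problem).
Python compatibility:
- Fixed PyMNN failing to build on Python 3.13+.
Model conversion optimization:
- Corrected the RemoveUnuseFul / RemoveCopy passes, which could drop inputs or outputs.
- Supports GRU / LSTM quantization by decomposing the operators into control flow + Convolution.

4. Open-Source Community and Compatibility

Fixed a number of issues reported by the community (Issue #3623, #3632, #3690, #3701, #3774, #3780, #3850, and others).
Improved cross-platform compatibility, including Windows ARM, macOS, Android, iOS, and HarmonyOS.

MNN 3.3 continues to focus on efficient on-device large-model inference and unified deployment across hardware platforms, while actively giving back to the open-source community.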


16 Oct 10:53

github-actions

3.2.5

9a0546b

3.2.5 Pre-release


MNN 3.2.5 Release Note

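Core Feature Updates

1. Added Support for the HQQ Quantization Algorithm

Integrated the HQQ quantization algorithm into the MNNConvert tool; it can be enabled with the --hqq parameter
HQQ quantization supports asymmetric quantization, which can significantly improve the accuracy of quantized models
It can be combined with block-wise quantization to further improve model accuracy

2. Support for the EAGLE-3 Speculative Decoding Algorithm

Added an implementation of the EAGLE-3 speculative decoding algorithm to improve large language model inference efficiency
Implemented the EagleGeneration class, which supports speculative decoding based on a draft model
Provided an Eagle model export tool that exports the three components eagle, eagle_fc, and eagle_d2t

3. Enhanced Support for Qwen Series Models

Fixed and optimized inference issues with the Qwen3-Embedding model
Added support for the Qwen3-VL multimodal large model
Improved the llmexport tool's export support for Qwen series models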

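Detailed Changes

Model Inference Optimization

Refactored the LLM model loading logic, adding more thorough error handling in the Llm::load() method
Optimized the KV Cache manager implementation, improving memory management efficiency during inference
Improved the attention implementation in the Metal backend
Optimized convolution execution efficiency in the OpenCL backend

Quantization Tool Improvements

Integrated the HQQ quantizer into WeightQuantAndCoding.cpp for more precise weight quantization
Optimized the quantization parameter configuration logic so that asymmetric quantization is set automatically when HQQ is enabled
Fixed several bugs in the quantization process, improving quantization stability

Model Export Enhancements

Improved error handling and log output in the llmexport tool
Optimized the model export flow, improving export stability
Revised the compression tool documentation and added HQQ quantization usage instructions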

Core Feature Updates

1. Added Support for HQQ Quantization Algorithm

Integrated HQQ quantization algorithm into MNNConvert tool, which can be enabled via the --hqq parameter
HQQ quantization supports asymmetric quantization, significantly improving the accuracy of quantized models
Supports combination with block-wise quantization to further optimize model accuracy
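
To make "asymmetric" and "block-wise" concrete, the sketch below quantizes a flat list of weights in fixed-size blocks, giving each block its own scale and offset derived from its minimum and maximum. It is plain min/max quantization, not HQQ itself (HQQ additionally optimizes the quantization parameters), and the function names, the 4-bit width, and the block size of 64 are illustrative assumptions.

```python
def quantize_block(weights, bits=4):
    """Asymmetric min/max quantization of one block of floats."""
    qmax = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_block(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]


def quantize_blockwise(weights, block_size=64, bits=4):
    """Quantize a flat weight list block by block."""
    return [
        quantize_block(weights[i:i + block_size], bits)
        for i in range(0, len(weights), block_size)
    ]
```

Because every block carries its own scale and offset, one outlier only widens the range of its own block instead of degrading the resolution of the whole tensor.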

2. Added Support for EAGLE-3 Speculative Decoding Algorithm

Implemented EAGLE-3 speculative decoding algorithm to improve large language model inference efficiency
Implemented EagleGeneration class to support draft model-based speculative decoding
Provided Eagle model export tools supporting export of three components: eagle, eagle_fc, and eagle_d2t

3. Enhanced Support for Qwen Series Models

Fixed and optimized inference issues with Qwen3-Embedding model
Added support for Qwen3-VL multimodal large model
Improved llmexport tool's export support for Qwen series models

Detailed Changes

Model Inference Optimization

Refactored LLM model loading logic with enhanced error handling in the Llm::load() method
Optimized KV Cache manager implementation to improve memory management efficiency during inference
Improved attention mechanism implementation in Metal backend
Optimized convolution execution efficiency in OpenCL backend

Quantization Tool Improvements

Integrated HQQ quantizer in WeightQuantAndCoding.cpp for more precise weight quantization
Optimized quantization parameter configuration logic to automatically set asymmetric quantization when HQQ is enabled
Fixed bugs in the quantization process, improving quantization stability

Model Export Enhancements

Improved error handling and log output in llmexport tool
Optimized model export process to improve export stability
Revised compression tool documentation with added HQQ quantization usage instructions


30 Sep 02:10

github-actions

3.2.4

d2dd1e9

3.2.4 Pre-release


Merge pull request #3906 from alibaba/feature/sync

MNN:Sync: Sync Internal 3.2.4


05 Aug 01:43

github-actions

3.2.2

a739ea5

3.2.2 Pre-release


Merge pull request #3747 from alibaba/feature/sync

MNN:Sync: Sync Internal 3.2.2


06 Jun 02:38

github-actions

3.2.0

ebdada8

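3.2.0 MoE Architecture and Omni Support / Speculative Decoding / New Template Engine / KleidiAI Updates / Startup Performance Optimization

3.2.0 Release Notes
Thanks to the efforts of the community and the development team, MNN 3.2 is now officially released! This version brings improvements in performance, features, and bug fixes, with a particular focus on large language model (LLM) support and inference acceleration. The main updates are as follows:
🚀 New Feature Support

LLM
○ Supports running Qwen Omni models.
○ Supports Qwen3 MoE models.
○ Supports export and GGUF conversion for Gemma3 (4B/1B) and InternVL2.5-1B / deepseek vl models.
○ Supports FastVLM and SmolVLM models.
○ The LLM Export tool adds support for other low-bit quantization widths (2, 3, 5, 6, 7 bit).
○ Added a jinja template engine to simplify management of LLM chat templates.
○ The default quantization block size in the LLM Export tool is now 64 (in testing the size matches Q4_1 while accuracy is higher)
○ Supports converting the GGUF model format to MNN.
KleidiAI integration updates
○ Added a kernel implementation for Q4 asymmetric quantization
○ Added SME kernel implementations for FP16 / FP32
LLM inference acceleration
○ Supports speculative decoding based on user input (lookahead algorithm); in typical scenarios such as code completion it can improve decoding efficiency by 2-3x.
Model conversion enhancements
○ Supports converting and reading ONNX metadata, and keeps the input order consistent.
Hardware acceleration support
○ Added a Qualcomm QNN backend; it currently supports CV models only.
Library size reduction
○ Added the MNN_REDUCE_SIZE macro; when enabled it turns off some optimization branches to reduce library size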
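⚙️ Performance and Accuracy Optimization
Startup optimization
○ The CPU backend supports deleting the original weight file after mmap has generated the cache file, to save storage space.
○ In CPU Int8/Int4 modes, weight reordering and loading time are optimized; LLM loading time drops to 1/3 of the original, which is less than the time needed to prefill 200 tokens.
○ The OpenCL backend further reduces loading time through zero-copy memory, to 50% of the original, likewise less than the time needed to prefill 200 tokens.
○ The Metal backend reduces Int8/Int4 weight reordering and loading time through zero-copy memory and rewritten kernels; LLM model loading time drops to 1/7 of the original
CPU operator optimization
○ Adjusted the weight memory layout for weight block-wise quantization and rewrote the related kernels to reduce GEMV cache misses in multi-threaded runs; as a result, LLM multi-threaded decoding on PC CPUs is more than twice as fast.
○ Added input block-wise quantization support, and the existing weight block-wise quantization now supports non-1x1 convolutions, which considerably improves the accuracy of the affected CV models after quantization. In Diffusion scenarios, for example, CPU image generation quality improves.
GPU operator optimization
○ The Metal backend uses simdgroup instructions to accelerate CV/Diffusion models, roughly doubling efficiency.
○ The Metal backend adds a ScatterND operator implementation (currently only the non-overlapping case is supported)
○ The OpenCL backend optimizes the extended GEMV case (in the [e,l,h] matrix multiplication parameters, l and h are large while e > 1 and e < 16), roughly doubling performance to meet the needs of speculative decoding.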

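🐞 Bug Fixes
Including but not limited to:

Fixed the model conversion tool crashing in some cases on Windows.
Fixed errors when running BatchMatMul on the CUDA backend.
Fixed errors caused by multimodal models using the CPU when an LLM runs on OpenCL.
Fixed M-RoPE support for Qwen2/Qwen2.5.
Fixed errors when Qwen2-7B-Audio processes long audio.
Fixed the incorrect assertion blockSize > 1 in the Space2Depth operator.
Fixed LLM runtime errors on the Armv7a architecture.
Fixed conversion failures for some quantized ONNX models.
Corrected implementation errors in some HIAI operators.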

3.2.0 Release Note
Thanks to the community and development team's efforts, MNN 3.2 has been officially released! This version introduces significant improvements in performance optimization, feature expansion, and bug fixes, with particular emphasis on enhanced support for Large Language Models (LLM) and inference acceleration. Below are the key updates:
🚀 New Feature Support

LLM-related Enhancements
a. Added support for running Qwen Omni models
b. Added support for Qwen3 MoE models
c. Added support for Gemma3 (4B/1B) and InternVL2.5-1B/deepseek vl models
d. Added support for FastVLM and SmolVLM models
e. LLM Export tool now supports additional low-bit quantization (2, 3, 5, 6, 7 bits)
f. Implemented jinja2 templating engine for LLM conversation template management
g. Default quantization block size in LLM Export tool changed to 64 (tested to maintain same size as Q4_1 with higher accuracy)
h. Added support for converting GGUF model format to MNN
KleidiAI Integration Updates
a. Added asymmetric Q4 quantization kernel implementation
b. Added FP16/FP32 SME kernel implementation
LLM Inference Acceleration
a. Added speculative decoding (lookahead algorithm) based on user input, improving decoding efficiency by 2-3x in typical scenarios like code completion
Model Conversion Enhancements
a. Added ONNX metadata conversion and reading support while preserving input order consistency
Hardware Acceleration Support
a. Added Qualcomm QNN backend (currently supports CV models only)
Library Size Reduction Feature
a. Added the MNN_REDUCE_SIZE macro. When enabled, it can disable certain optimization branches to reduce the library size.
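
The speculative decoding item above ("lookahead algorithm based on user input") follows a draft-then-verify pattern that can be sketched in a few lines. The version below drafts tokens by looking for an earlier occurrence of the most recent n-gram in the context and then keeps only the part of the draft that the target model agrees with. It is a simplified illustration under stated assumptions: the function names are hypothetical, verification is done token by token for clarity (real implementations check a whole draft in one batched forward pass), and MNN's actual lookahead implementation may differ.

```python
def draft_from_context(tokens, ngram=2, max_draft=4):
    """Propose tokens by finding the last n-gram earlier in the context."""
    if len(tokens) < ngram:
        return []
    tail = tokens[-ngram:]
    # Search backwards for an earlier occurrence of the tail.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == tail:
            return tokens[start + ngram:start + ngram + max_draft]
    return []


def speculative_step(tokens, next_token, ngram=2, max_draft=4):
    """Extend `tokens` by verifying a draft against the target model.

    next_token(seq) is the target model's greedy choice after seq.
    """
    draft = draft_from_context(tokens, ngram, max_draft)
    accepted = []
    for guess in draft:
        truth = next_token(tokens + accepted)
        accepted.append(truth)
        if truth != guess:
            return tokens + accepted  # stop at the first mismatch
    # The whole draft matched: the model yields one more token.
    accepted.append(next_token(tokens + accepted))
    return tokens + accepted
```

The output is always what greedy decoding would have produced; the gain comes from accepting several tokens per verification step when the text repeats earlier context, as is common in code completion.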

⚙️ Performance & Accuracy Optimization

Startup Optimization
a. CPU backend supports mmap-based cache file generation with automatic deletion of original weight files to save storage space
b. Optimized weight rearrangement and loading time for CPU Int8/Int4 modes, reducing LLM loading time to 1/3 of original (faster than filling 200 tokens)
c. OpenCL backend further optimizes loading time via zero-copy memory, reducing to 50% of original (faster than filling 200 tokens)
d. Metal backend reduces Int8/Int4 weight rearrangement/loading time through zero-copy memory and kernel rewrites, achieving 1/7 of original LLM loading time
CPU Operator Optimization
a. Redesigned the weight memory layout for weight block-wise quantization and rewrote the related kernels to reduce GEMV cache misses in multi-threaded runs; as a result, LLM multi-threaded decoding on PC CPUs is more than twice as fast
b. Added input block-wise quantization support; existing weight block-wise quantization now supports non-1x1 convolutions, significantly improving quantization accuracy for CV models (e.g., enhanced image generation quality in Diffusion scenarios)
GPU Operator Optimization
a. Metal backend accelerates CV/Diffusion models using simdgroup instructions, roughly doubling efficiency
b. Added ScatterND operator implementation in Metal backend (supports non-overlapping cases only)
c. OpenCL backend optimized the extended GEMV case ([e,l,h] matrix multiplication parameters with large l/h values but small e [1<e<16]), roughly doubling performance to support speculative decoding acceleration
🐞 Bug Fixes
Including but not limited to:
Fixed model conversion tool crashes on Windows under certain scenarios
Fixed BatchMatMul execution errors in CUDA backend
Fixed CPU-related errors in multimodal models during LLM execution on OpenCL backend
Fixed M-RoPE support issues for Qwen2/Qwen2.5
Fixed long audio processing errors in Qwen2-7B-Audio
Fixed incorrect assertion in Space2Depth operator requiring blockSize > 1
Fixed LLM execution errors on Armv7a architecture
Fixed ONNX quantized model conversion failures
Corrected HIAI operator implementation errors


24 Feb 09:54

github-actions

3.1.0

27aaf79

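3.1.0 New LLM Mobile Apps

LLM Updates

App Releases

Android
- Added an Android large-model app (LLM / Diffusion); see apps/Android/MnnLlmChat/README.md
iOS
- Added an iOS large-model app; see apps/iOS/MNNLLMChat/README.md

New Features

Model support
- Supports export and execution of multimodal large models such as Qwen2-VL / Qwen2-Audio / Qwen2.5-VL.
- Supports DeepSeek-R1-1.5B-Qwen and DeepSeek-R1-7B-Qwen
LLM single-step execution
- Added support for single-step LLM execution, which makes it easier to debug and optimize model inference.
Extended LLM sampling algorithms
- Extended the supported sampling algorithms, making model inference more flexible and diverse.
LLM export can output probabilities of history tokens
- When exporting an LLM model, the probabilities of history tokens can now be output, which helps later analysis and optimization.
LLM-CPU supports an mmap file cache
- Added mmap file cache support, so that loading a model a second time avoids memory reordering and is more efficient.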

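Performance Optimizations

Further CPU multithreading gains
- Optimized LLM multithreaded performance, further improving overall inference speed.
CPU prefill optimization
- Optimized prefill on the CPU so that prefill performance with a 1024-token input matches that with a 512-token input.
GPU: shorter OpenCL AutoTuning time
- Reduced the time OpenCL AutoTuning takes, avoiding an overly long first inference.
GPU: Metal / OpenCL support fp16 scale/bias and symmetric quantization optimization
- The Metal / OpenCL backends support fp16 scale/bias and symmetric quantization optimization, improving inference efficiency.
LLM acceleration: Metal / OpenCL backends support fp16 scale/bias
- Metal / OpenCL support fp16 for scale/bias (enabled when precision = low is set), further improving GPU inference performance.

Performance of 3.1.0 compared with 3.0.0 on a Xiaomi 14, Qwen 2.5-7B-Int4-128, 512-token input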

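Bug Fixes

Inaccuracy in LLM LoRA separate-loading mode
- Fixed the inaccuracy in LoRA separate-loading mode
Open-source feedback #3091: LLM Export 8-bit quantization error
- Fixed the LLM Export 8-bit quantization error, ensuring the quantization feature is stable.

Refactoring and Improvements

LLM export can output probabilities of history tokens
- Refactored the LLM model export feature to support outputting the probabilities of history tokens, which helps model analysis and optimization.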

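Other Engine Updates

New Features

Pymnn can be installed without numpy
- Pymnn can now be installed in an environment without numpy, removing the previous hard dependency on numpy.
KV Cache rollback for digital humans
- Added KV Cache rollback support, improving inference efficiency for digital-human models.
OpenCL supports chunked computation for large models
- Added support for chunked computation of large models, avoiding rendering stutter.

Performance Optimizations

Vulkan Image branch performance optimization
- On 8Gen-series chips, Vulkan backend inference performance matches the OpenCL backend; on mid-range and low-end devices, overall performance improves by about 30%.
Convolution3D converted to Convolution2D
- Converting Convolution3D to Convolution2D reduces the time and memory that Convolution3D uses.
OpenCL requests less memory in unaligned cases
- Optimized the amount of memory OpenCL requests in unaligned cases, reducing memory usage.

Bug Fixes

Small layernorm discrepancy between Debug and Release modes on Arm v8
- Fixed the small discrepancy in layernorm results between Debug and Release modes on the Arm v8 architecture.
NNAPI: Binary did not support relu fuse
- Corrected the problem that Binary in NNAPI did not support relu fuse.
MNN:GPU out-of-bounds memory access when models run in sequence
- Fixed an out-of-bounds memory access that could occur when models run in sequence; the cause was that the backend did not call onExecuteBegin and onExecuteEnd.
ConvTranspose3D used too much memory
- Fixed the excessive memory usage of ConvTranspose3D by converting ConvTranspose3D to ConvTranspose2D.
Onnx Cast operator reported "not support type" when outputting fp16
- Fixed the "not support type" error raised when the Onnx Cast operator outputs fp16, improving model conversion compatibility.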

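Fixes from Open-Source Feedback

Open-source feedback #3061: ConvTranspose uses a lot of memory on the CPU
- Fixed the excessive memory usage of ConvTranspose when running on the CPU.
Open-source feedback #3073: Offline quantization tool error
- Fixed the offline quantization tool error.
Open-source feedback #3076: MNN CV processing error
- Fixed the MNN CV processing error and confirmed it through follow-up communication.
Open-source feedback #3156: Metal still needs auto-tuning after the input shape changes
- Fixed Metal still requiring auto-tuning after the input shape changes.

Refactoring and Improvements

OpenCL memory sharing switched to HardwareBuffer
- Switched the OpenCL memory sharing scheme to HardwareBuffer, improving memory management efficiency.
MNN:Geometry Conv3D im2col produced too many regions
- Optimized the Conv3D im2col implementation to reduce the number of regions produced.

This text was generated by DeepSeek and then edited.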


24 Jan 03:41

jxt1234

3.0.4

b23b55b

3.0.4: Merge pull request #3170 from alibaba/feature/sync Pre-release


MNN:Sync: Sync Internal 3.0.4


20 Nov 01:43

jxt1234

3.0.0

707b8a4

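3.0 Release [Multimodal large-model support, improved dynamic quantization, Web SIMD support, and other bug fixes]

Large-Model Inference

LLM support
- Added support for the multimodal model Qwen2VL;
- Streamlined the LLM export flow; exporting directly to MNN is now more than ten times faster than exporting to onnx first and then converting to MNN
- Reworked LoRA support, which now allows dynamic LoRA loading and multiple LoRAs side by side
- Added GPTQ and AWQ quantization support;
- Supports loading the tie_embedding parameter;
- Multi-turn conversations reuse the kv cache, cutting response latency in multi-turn conversations by 10x;
- Supports a python interface, and ceval and ppl tests;
- Supports mixed-precision quantization and quantization with a specified block size;
- Supports the OpenCL and Metal backends, with OpenCL performance optimizations;
- Integrated the Arm KleidiAI library; currently only LLM inference with symmetric quantization is supported
- Deprecated segmented model export and inference;
Stable Diffusion support: implemented support for the standard sd1.5 model, plus an Attention operator plugin.
ModelZoo: ModelScope and Huggingface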

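New Features and Improvements

Architecture: multiple models run in sequence can now reuse the dynamic memory pool.
Dynamic quantization: reworked the dynamic quantization code to support convolutions and models of any type.
OpenCL: added support for five-dimensional gridsampler and improved the performance of large 1x1 convolutions and batchMatMul.
Loading time: optimized CPU and GPU loading time for quantized models.
Softmax/LayerNorm/GRU operators: optimized the memory usage and performance of Softmax/LayerNorm/GRU when running in fp16/x86 modes.
Geometry computation: strengthened the region merging logic, with support for cropping and many-to-many handling.
Multithreading: added core-pinning support and division of multi-core tasks according to CPU compute capability; on CPUs with big/medium/little core architectures, 2-thread performance improves by up to 30%.
Memory management: introduced an Mmap mode that can move data to a disk file when memory runs short.
Build: added build support for Windows-ARM devices; WebAssembly can be built with SIMD enabled, based on SSE4.1.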

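Bug Fixes

Fixed errors when some quantized models run on ARM.
Resolved an issue on some devices, such as the Dimensity 9000, where the cachefile size changed when the cachefile was regenerated after being read.
Corrected the computation error after converting an ONNX ConvTranspose that has the group attribute.
Corrected Transpose + Slice not splitting regions properly when the operation could go beyond the original size.
Fixed the LaMa Inpainting Model outputting 0 when run on the GPU (Metal).
Corrected the reshape error after converting Softmax from older onnx versions.

Related Issue Fixes

Including but not limited to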

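2.0-3.0 Update Notes

Compared with version 2.0, the updates in version 3.0 mainly include the following

Large-Model Inference Support

Added the mnn-llm module for large language model inference; see https://mnn-docs.readthedocs.io/en/latest/transformers/llm.html
Added the mnn-diffusion module for text-to-image model inference; see https://mnn-docs.readthedocs.io/en/latest/transformers/diffusion.html

More Quantized Operators and Dynamic Quantization Support

Added quantized implementations of ConvTranspose / Unary / Binary and other operators
Supports dynamic quantization: for models with only the weights quantized, core operators can still use quantized computation to reduce memory usage. The CPU backend quantizes the floating-point input and then computes with quantized instructions, while the GPU backend dequantizes the weights back to floating point during inference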
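
The two dynamic-quantization paths described above can be contrasted on a single dot product. In the sketch below the weights are stored as int8 codes with one scale; the "CPU-style" path quantizes the float input on the fly and accumulates in integers, while the "GPU-style" path dequantizes the weights and multiplies in floats. Per-tensor symmetric quantization and all function names are assumptions made for illustration, not MNN's actual scheme.

```python
def quantize_symmetric(values):
    """Per-tensor symmetric int8 quantization: returns (codes, scale)."""
    peak = max(abs(v) for v in values)
    scale = peak / 127.0 if peak > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dynamic_quant_dot(x, w_codes, w_scale):
    """CPU-style path: quantize the float input, multiply in integers."""
    x_codes, x_scale = quantize_symmetric(x)
    acc = sum(a * b for a, b in zip(x_codes, w_codes))  # integer accumulate
    return acc * x_scale * w_scale


def dequant_dot(x, w_codes, w_scale):
    """GPU-style path: dequantize the weights, multiply in floats."""
    return sum(a * (b * w_scale) for a, b in zip(x, w_codes))
```

Both paths keep only the compact integer weights in memory; they differ in whether the arithmetic itself runs on integers or on floats.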

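Memory Optimization and Shape-Change Cache Mechanism

Added an MMAP mechanism that moves the memory MNN uses at runtime to disk, lowering the risk of crashes when memory runs short
Added deferred memory allocation, currently supported on the CPU backend only; in internal model tests it lowers memory usage by 19.13% on average; see the 2.7.0 Release Note
Added a pre-inference cache mechanism; some models with variable input shapes (such as translation models) gain about 10% in performance; see the 2.9.0 Release Note for usage

Optimizations Based on New Hardware Features

Adapted the new Arm v8.6 instructions smmla and bfmmla, which double performance compared with sdot and fmla (fp16); see the 2.2.0 Release Note. In addition, because BF16 has low computational precision, BF16 is now enabled only for matrix multiplication.
Reimplemented the relevant OpenCL backend operators using the Intel Subgroup feature, improving performance by 70%-100% on Intel GPUs that support it; see the 2.5.0 Release Note
Adapted the Recordable Queue feature of Adreno GPUs, reducing OpenCL backend run time on Qualcomm chips for models that are structurally complex but computationally light, by up to 40%; see the 2.6.0 Release Note
Added the NNAPI backend, and extended operator coverage of the existing CoreML and HIAI backends

Overall Operator Optimization and Additions, Improved FP16 Performance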


Releases: alibaba/MNN
