
3.2.5

Pre-release


github-actions released this 16 Oct 10:53
9a0546b

MNN 3.2.5 Release Note


Core Feature Updates

1. Added Support for HQQ Quantization Algorithm

  • Integrated the HQQ quantization algorithm into the MNNConvert tool; it can be enabled via the --hqq parameter
  • HQQ quantization supports asymmetric quantization, significantly improving the accuracy of quantized models
  • Can be combined with block-wise quantization to further improve model accuracy
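As noted above, HQQ operates on asymmetric, block-wise weight quantization. The snippet below is a minimal sketch of that base scheme (a per-block min/max scale and zero-point); it omits HQQ's half-quadratic zero-point refinement and is not MNN's implementation. All names here are illustrative.

```python
import numpy as np

def quantize_blockwise(w, bits=4, block=32):
    """Asymmetric min/max quantization over blocks of `block` weights."""
    qmax = (1 << bits) - 1
    w = w.reshape(-1, block)
    wmin = w.min(axis=1, keepdims=True)
    wmax = w.max(axis=1, keepdims=True)
    scale = (wmax - wmin) / qmax
    scale[scale == 0] = 1.0          # guard against constant blocks
    zero = -wmin / scale             # asymmetric zero-point
    q = np.clip(np.round(w / scale + zero), 0, qmax)
    return q, scale, zero

def dequantize_blockwise(q, scale, zero):
    """Recover approximate weights from quantized blocks."""
    return (q - zero) * scale
```

With round-to-nearest, the reconstruction error of each weight is bounded by half its block's scale; HQQ then shrinks this further by optimizing the zero-point per block.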

2. Added Support for EAGLE-3 Speculative Decoding Algorithm

  • Implemented the EAGLE-3 speculative decoding algorithm to improve large language model inference efficiency
  • Implemented the EagleGeneration class, supporting draft-model-based speculative decoding
  • Provided an Eagle model export tool that exports three components: eagle, eagle_fc, and eagle_d2t
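For orientation, the sketch below shows the generic greedy draft-and-verify loop that speculative decoding is built on: a cheap draft model proposes several tokens, and the target model keeps the longest agreeing prefix. This is a toy stand-in, not MNN's EagleGeneration API; EAGLE-3 additionally drafts from the target model's hidden features, which this sketch omits.

```python
def speculative_decode(target_next, draft_next, prompt, gamma=4, max_new=8):
    """Greedy speculative decoding.

    `target_next` / `draft_next` are hypothetical callables mapping a token
    sequence to the next token. The draft proposes `gamma` tokens per round;
    the target verifies them and emits one extra token itself.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1) Draft proposes gamma tokens autoregressively.
        draft_seq = list(seq)
        for _ in range(gamma):
            draft_seq.append(draft_next(draft_seq))
        proposals = draft_seq[len(seq):]
        # 2) Target verifies proposals in order; keep the agreeing prefix.
        accepted = 0
        for i, tok in enumerate(proposals):
            if target_next(seq + proposals[:i]) == tok:
                accepted += 1
            else:
                break
        seq += proposals[:accepted]
        # 3) The target always contributes one token of its own.
        seq.append(target_next(seq))
    return seq[len(prompt):][:max_new]
```

When the draft agrees with the target, several tokens are accepted per target step, which is where the inference speedup comes from; the output is identical to plain greedy decoding of the target either way.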

3. Enhanced Support for Qwen Series Models

  • Fixed inference issues with the Qwen3-Embedding model and optimized its inference
  • Added support for the Qwen3-VL multimodal large model
  • Improved the llmexport tool's export support for Qwen-series models

Detailed Changes

Model Inference Optimization

  • Refactored LLM model loading logic with enhanced error handling in the Llm::load() method
  • Optimized KV Cache manager implementation to improve memory management efficiency during inference
  • Improved attention mechanism implementation in Metal backend
  • Optimized convolution execution efficiency in OpenCL backend
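To illustrate the kind of bookkeeping a KV cache manager performs during inference, here is a minimal sliding-window sketch: append the new key/value pair per decoded token and evict the oldest entries once capacity is exceeded. This is a hypothetical illustration, not MNN's KV cache implementation.

```python
class KVCache:
    """Toy fixed-capacity KV cache with oldest-first eviction."""

    def __init__(self, max_len):
        self.max_len = max_len
        self.keys, self.values = [], []

    def append(self, k, v):
        # Store the new entry, then evict the oldest if over capacity.
        self.keys.append(k)
        self.values.append(v)
        if len(self.keys) > self.max_len:
            self.keys.pop(0)
            self.values.pop(0)

    def __len__(self):
        return len(self.keys)
```

A production cache would hold tensors in preallocated buffers and reuse memory in place rather than growing Python lists, which is the kind of efficiency the release note refers to.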

Quantization Tool Improvements

  • Integrated HQQ quantizer in WeightQuantAndCoding.cpp for more precise weight quantization
  • Optimized quantization parameter configuration logic to automatically set asymmetric quantization when HQQ is enabled
  • Fixed bugs in the quantization process, improving quantization stability
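The parameter-resolution rule described above can be sketched as follows: enabling HQQ forces asymmetric mode, since HQQ solves for a zero-point. The function name and config keys are illustrative, not MNNConvert's actual configuration API.

```python
def resolve_weight_quant_config(bits=4, block_size=None, hqq=False, asymmetric=False):
    """Hypothetical sketch: HQQ implies asymmetric quantization."""
    if hqq:
        asymmetric = True  # HQQ optimizes a zero-point, so asymmetric mode is required
    return {"bits": bits, "block_size": block_size, "hqq": hqq, "asymmetric": asymmetric}
```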

Model Export Enhancements

  • Improved error handling and log output in the llmexport tool
  • Streamlined the model export process to improve export stability
  • Revised the compression tool documentation, adding HQQ quantization usage instructions