
Releases: PaddlePaddle/Paddle

PaddlePaddle 1.3.0

21 Feb 05:48
4b3f9e5


Release Notes

Highlights

  • Executor and ParallelExecutor interfaces are unified: users only need to convert a single-card model into a multi-card model through CompiledProgram and use Executor for training or inference.
  • The AnalysisConfig inference interface is officially released. It supports optimizations such as computational-graph analysis and operator fusion, and supports acceleration by third-party libraries such as the Intel MKLDNN and Nvidia TensorRT sub-graph engines.
  • The model library adds the PaddlePaddle video model library, providing 5 classic video classification models plus generic skeleton code suitable for video classification tasks; users can configure, train, and evaluate models efficiently in one click.
  • Added support for the NLP semantic representation model BERT, with multi-machine multi-card training and mixed-precision training. Training speed is 50%+ faster than mainstream implementations, and a complete deployment example is provided.
  • Released a large-scale sparse parameter server benchmark. Asynchronous multi-machine CPU training ships a built-in reader that significantly improves IO throughput for click-through-rate estimation tasks; multi-machine multi-card training performance is improved in many respects.
  • Added support for Intel Deep Learning Boost (the VNNI instruction set). On the new generation of Intel Xeon Scalable Processors, INT8 inference performance of some models using this feature can reach 2x that of FP32.

Basic Framework

  • Installation
    • Added Chinese-language auxiliary installation scripts for Linux and MacOS, providing an interactive installation mode to help users quickly complete PaddlePaddle installation in complex environments.
    • Improved Windows support: added GPU support for cuda8/cudnn7, plus AVX instruction set, MKLDNN, and mnist dataset support. Fixed the problem of Windows failing to load models trained with the same paddle version on Linux/Mac.
  • Added basic functionality for dynamic graphs
    • Dynamic-graph tracer, autograd, and python Layer/PyLayer; dynamic graphs support the MLP, GAN, ptbRNN, and Resnet models, as well as Optimizer and GPU training.
  • Executor and ParallelExecutor interface optimization
    • The Executor and ParallelExecutor interfaces are unified: users only need to convert a single-card model into a multi-card model through CompiledProgram and use Executor for training or inference.
    • ParallelExecutor optimization
      Refactored MultiDevSSAGraphBuilder to make it easier to extend.
      Removed the device lock in ParallelExecutor, improving its multi-card scheduling performance.
  • Optimization of the intermediate representation IR and Passes
    • Improved the Python interfaces of the C++ IR graph and C++ IR passes.
    • Added an IRGraph class in framework.py in preparation for writing IR Passes at the Python level.
    • Added a Pass supporting lock-free network updates.
    • Added QuantizationTransformPass, the graph-modification stage performed before Quantization Aware Training.
  • Memory and GPU-memory optimization
    • Added support for linking Jemalloc as a dynamic library at compile time, improving memory-management performance and reducing the framework's memory-management overhead.
    • Added GPU-memory optimization strategies such as memory optimize, the inplace pass, and memory pool early deletion.
  • Operator-level optimization
    • Each op performs only one scope query before execution, reducing read-write lock operations (previously 1~5 scope queries were needed).
    • Added a Temporary Allocator to reduce synchronization inside ops.
    • Added the py_func operator, which accepts python ops; users can quickly implement custom operations with the help of the py_func Operator.
  • Refactored DDim, Variable Type, and more to reduce framework scheduling overhead.
  • Intel FP32 compute optimizations
    • Optimized the density_prior_box operator: 3x single-op speedup with four threads.
    • Optimized the Stack operator: 16x single-op speedup.
    • Developed MKLDNN-based kernels for Transpose, Concat, and Conv3d.
    • Fixed an accuracy bug in the lrn operator's MKLDNN kernel and sped up the single op by 1.3x.
    • Fixed MKLDNN initialization occupying 5GB of memory; initialization now takes 500MB.
    • Reduced unnecessary reorders when passing from MKLDNN op kernels to non-MKLDNN op kernels.
  • Improved the CPU JitKernel
    • JitKernel for sequence pooling: 2x pure-op speedup.
    • JitKernel for softmax: 2x pure-op speedup, also improving Bert model CPU inference by 26%.
    • Common basic routines: element-wise vector square (kVSquare), matrix multiplication (kMatMul), vector max (kHMax), and vector sum (kHSum).
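The basic JitKernel routines above have simple reference semantics. A plain-Python sketch of what each computes (illustrative only; the real kernels are JIT-compiled vector code):

```python
# Reference semantics of the basic CPU JitKernel routines (illustrative
# plain-Python versions; the real kernels are JIT-compiled vector code).

def k_vsquare(x):
    """kVSquare: square every element of a vector."""
    return [v * v for v in x]

def k_hsum(x):
    """kHSum: sum of all elements of a vector (horizontal sum)."""
    return sum(x)

def k_hmax(x):
    """kHMax: maximum element of a vector (horizontal max)."""
    return max(x)

def k_matmul(a, b):
    """kMatMul: naive matrix multiplication of a (m x k) by b (k x n)."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

print(k_vsquare([1.0, 2.0, 3.0]))      # [1.0, 4.0, 9.0]
print(k_hsum([1.0, 2.0, 3.0]))         # 6.0
print(k_hmax([1.0, 5.0, 3.0]))         # 5.0
print(k_matmul([[1, 2]], [[3], [4]]))  # [[11]]
```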

Inference Engine

Server Inference

  • The AnalysisConfig inference interface is officially released. It supports optimizations such as computational-graph analysis and operator fusion, and supports acceleration by third-party libraries such as the Intel MKLDNN and Nvidia TensorRT sub-graph engines.
  • Pre-released the INT8 offline quantization solution for inference on Intel CPUs
    • Developed four MKL-DNN-based INT8 kernels: Conv2D, Pool2D, Quantize, and Dequantize.
    • Pre-released the 3 core Python APIs of Calibration (paddle.fluid.contrib.Calibrator).
    • Developed the Calibration tool, keeping the accuracy gap between FP32 and INT8 within 1% for ResNet-50 and MobileNet-V1 on the ImageNet validation dataset.
    • Supported Intel Xeon CascadeLake Server (VNNI instructions) and Intel Xeon SkyLake Server, with a performance gain of about 1.33x.
  • CPU inference speedup
    • Fused sequence_pooling and concat ops: up to N (<200) sequence_pooling ops can be concatenated into one new op, improving seqpool-model CPU inference by 56% overall.
    • Fused consecutive repeated fc ops into one large op, improving seqpool-model CPU inference speed by 15%.
    • Fused the op combination computing ((X * Y).^2 - (X.^2 * Y.^2)) .* scalar, improving seqpool-model CPU inference speed by 8.2%.
    • Optimized the CPU kernel of compare_op for input tensors with a single element.
  • Added Paddle-TRT support for Calibration INT8, improving GPU inference speed
    • On the VGG and Resnet50 models, inference speed reaches 2x that of Paddle-TRT float32.
    • On the VGG and Resnet50 models tested on the imagenet dataset, the accuracy drop is within 0.3%.
  • Operator fusion
    • Added two fuses related to fc and conv, applied to the conv_op CUDNN kernel.
    • Added a Conv+Affine Channel fusion pass, improving Faster RCNN runtime performance by 26.8%.
    • Added a Transpose+Flatten+Concat fusion pass, improving MobilenetSSD model performance by 15%.
    • Implemented a CUDA kernel for the beam_search operator and fused the related top-k, elementwise_add, reshape, and log computations into it.
  • Feature completeness and usability improvements
    • Added a Python interface for the C++ IR graph.
    • Added a Python interface for the inference library.
    • Server-side inference supports loading models from memory.
  • Others
    • Removed the legacy V2 code. Starting from version 1.3, the old V1 & V2 functionality is no longer supported.
    • Fixed a bug where Paddle-TRT failed to run the elementwise-mul model.
    • Fixed a bug where Paddle-TRT trt_engine produced abnormal outputs when a stream received multiple consecutive inputs.
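The fused op combination listed under the CPU inference speedups, ((X * Y).^2 - (X.^2 * Y.^2)) .* scalar, can be sketched in plain Python, reading * as matrix multiplication and .^2 as the element-wise square (an assumption about the notation; this is an illustration of the computation being fused, not Paddle's fused kernel):

```python
# Sketch of the fused computation ((X * Y).^2 - (X.^2 * Y.^2)) .* scalar,
# reading `*` as matrix multiplication and `.^2` as the element-wise square.
# Illustrative plain Python; the actual fused op is a single CPU kernel.

def matmul(a, b):
    """Naive matrix multiplication of a (m x k) by b (k x n)."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def esquare(a):
    """Element-wise square of a matrix."""
    return [[v * v for v in row] for row in a]

def squared_mat_sub(x, y, scalar):
    xy2 = esquare(matmul(x, y))            # (X * Y).^2
    x2y2 = matmul(esquare(x), esquare(y))  # X.^2 * Y.^2
    return [[scalar * (p - q) for p, q in zip(r1, r2)]
            for r1, r2 in zip(xy2, x2y2)]

x = [[1.0, 2.0]]
y = [[3.0], [4.0]]
print(squared_mat_sub(x, y, 0.5))  # [[24.0]]: 0.5 * (11^2 - (9 + 64))
```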

Mobile Inference

  • Efficiency optimization: inference speedup for common models
    • int8 inference supports automatic kernel fusion of dequantize with other ops (batch normalization/relu/elementwise add).
    • Optimized the transpose2 operator for the shuffle-channel operation.
    • Optimized the gru operator with neon instructions, with additional optimization for batch size 1.
    • Optimized and implemented pooling, supporting arbitrary padding.
    • Optimized and implemented batch normalization, softmax, and elementwise add.
  • Added support for inference on models with multiple inputs and multiple outputs.
  • Added implementations of the prelu6, cast, and top_k operators.
  • Fixed incorrect results caused by overflow in int8 offline quantization.
  • Fixed a bug where the winograd implementation could output 0 when the input feature map's height and width were unequal.
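The int8 quantization behind these mobile optimizations typically uses a symmetric max-abs scale. A minimal stdlib sketch of that general scheme (not Paddle-Mobile's actual code), with clamping of the kind that guards against the overflow bug class fixed above:

```python
# Sketch of symmetric max-abs int8 offline quantization, the general scheme
# behind int8 mobile inference. Clamping to [-127, 127] guards against
# overflow. Illustrative stdlib Python; not Paddle-Mobile's implementation.

def quantize_int8(values):
    """Quantize floats to int8 with a per-tensor max-abs scale."""
    maxabs = max(abs(v) for v in values)
    if maxabs == 0.0:
        return [0] * len(values), 1.0
    scale = maxabs / 127.0  # dequantization multiplier
    q = [max(-127, min(127, round(v * 127.0 / maxabs))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float values."""
    return [v * scale for v in q]

q, scale = quantize_int8([-1.0, 0.5, 1.0])
print(q)                          # [-127, 64, 127]
print(dequantize_int8(q, scale))  # approximately [-1.0, 0.504, 1.0]
```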

Model Development

  • PaddleCV intelligent vision
    • Released the PaddlePaddle video model library, including five video classification models: Attention Cluster, NeXtVLAD, LSTM, stNet, and TSN. It provides generic skeleton code suitable for video classification tasks, with modules for data reading and preprocessing, training and inference, network models, and metric computation. Users can add their own network models and directly reuse the other modules' code to deploy models quickly.
    • Added support for the object detection Mask R-CNN model, matching mainstream implementations in accuracy.
    • Semantic segmentation DeepLabV3+ model: depthwise_conv op fusion and GPU-memory optimization reduce memory usage by 40% compared with the previous version.
  • PaddleNLP intelligent text processing
    • Added support for the NLP semantic representation model BERT, with multi-machine multi-card training and mixed-precision training. Training speed is 50%+ faster than mainstream implementations, and a complete deployment example is provided.
    • Optimized decoding in the machine translation Transformer model: the decoder now caches the computation over the encoder output, doubling inference speed.
  • PaddleRec intelligent recommendation
    • Sequence Semantic Retrieval: added single-machine multi-thread and single-machine multi-card running examples, added an inference function, optimized data preprocessing, and improved the deployment example.
    • GRU4Rec: added negative sampling; results with bpr loss and cross entropy loss match the original work.

Distributed Training

  • Released a large-scale sparse parameter server benchmark
    • On a real-world click-through-rate estimation task with tens of billions of features and an average of 1k features per sample, with batch=512, 100 workers achieve a speedup ratio of 90.5 and a throughput of 1.36M/s.
  • Asynchronous multi-machine CPU training
    • Released a built-in reader for click-through-rate estimation tasks, improving total IO throughput by 1300% on the Criteo dataset.
  • Horizontal scalability improvements for multi-machine multi-card GPU training
    • Added parallel modes PG (ParallelGraph) and MP (Multi-Process) that isolate computation between GPU cards, improving performance without affecting model accuracy.
    • On the ResNet50 model with 8 V100 cards on a single machine, PG and MP modes improve training performance by more than 30%; with 4 machines and 32 cards, PG mode is 46% faster and MP mode 60% faster.
    • On the BERT model with 8 V100 cards, PG and MP modes improve training performance by 26%.
    • The Multi-Process mode is less sensitive to reader speed than the Parallel-Graph mode.
  • Vertical scalability improvements for multi-machine multi-card GPU training
    • New features: fp16 and mixed-precision training.
    • Fp16 single-machine single-card speedup: about 87% for ResNet50 and about 70% for BERT.
    • With PG mode and mixed precision both enabled, BERT throughput per unit time improves by 120% on a single machine with 8 cards.
    • With mixed-precision training and MP mode both enabled, ResNet50 throughput per unit time improves by 100% on a single machine with 8 V100 cards and on 4 machines with 32 cards.
  • Convergence speed optimization for typical models
    • New features: dynamic Batch Size and dynamic Image Resize methods.
    • Resnet50 on the Imagenet dataset: the number of epochs to convergence drops to about 1/3 of the standard training method.

VisualDL

  • VisualDL graph supports visualization of models saved by Paddle fluid.


PaddlePaddle 1.2.1

16 Jan 11:09
afc885d


Release Notes

Framework

  1. Added the huber_loss loss function.
  2. Improved training performance for models where GPU computation is a small share of execution time.
  3. Changed the Tensor type implementation from typeindex to enum, improving execution efficiency.
  4. Fixed issue where a failed `import cv2` under Python 3 broke numpy computation.
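huber_loss is quadratic for small residuals and linear past a threshold delta, which limits the influence of outliers. A minimal plain-Python sketch of the standard formula (not the Paddle operator itself):

```python
# Minimal sketch of the standard huber loss formula: quadratic for small
# residuals, linear past the threshold delta. Illustrative reference, not
# the Paddle huber_loss operator.

def huber_loss(y_true, y_pred, delta=1.0):
    r = abs(y_true - y_pred)
    if r <= delta:
        return 0.5 * r * r            # quadratic region
    return delta * (r - 0.5 * delta)  # linear region

print(huber_loss(3.0, 2.5))  # 0.125  (quadratic branch)
print(huber_loss(3.0, 0.0))  # 2.5    (linear branch: 1.0 * (3.0 - 0.5))
```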

Inference

  1. Paddle-TRT supports multi-thread inference.
  2. TRT libraries are automatically added to third_party when generating the inference library.
  3. Fixed Paddle-TRT pool2d converter bug in ceil mode.


PaddlePaddle 1.2.0

06 Dec 11:51
81b66db


Release Notes

Framework

  • A new pip installation package is available, which can run on Windows in CPU environments.
  • Added support for python3.6 and python3.7.
  • Reconstructed the memory allocator module Allocator, improving the memory allocation strategy on CPU and increasing video memory utilization (disabled by default; use FLAGS_allocator_strategy to enable it).
  • Restricted the usage of SelectedRows and fixed bugs in sparse regularization and sparse optimizers.
  • Tensor supports DLPack, to facilitate integrating other frameworks or being integrated into them.
  • OP
    • Fixed a shape-inference bug in the expand op.
    • Added the Selu activation function.
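Selu is the self-normalizing activation selu(x) = scale * (x if x > 0 else alpha * (exp(x) - 1)). A plain-Python reference using the standard self-normalizing constants (illustrative; not the Paddle kernel):

```python
# Reference implementation of the Selu activation with the standard
# self-normalizing constants. Illustrative; not the Paddle kernel.
import math

ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(x):
    """scale * x for positive x, scale * alpha * (exp(x) - 1) otherwise."""
    return SCALE * x if x > 0 else SCALE * ALPHA * (math.exp(x) - 1.0)

print(selu(1.0))  # ~1.0507 (positive branch is just a scaling)
print(selu(0.0))  # 0.0
```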

Inference Engine

  • Server Prediction
    • GPU supports graph fusion, and can rewrite the graph jointly with the TensorRT engine. On common vision models like Resnet50 and Googlenet with bs=1, performance is 50~100% higher.
    • GPU supports DDPG Deep Explore prediction.
    • Paddle-TRT supports more models, including Resnet, SE-Resnet, DPN,GoogleNet.
    • CPU, GPU, TensorRT and other accelerators are merged into AnalysisPredictor, collectively controlled by AnalysisConfig.
    • Added interfaces to call the multi-thread math library.
    • Support for TensorRT plugins, including the split, prelu, avg_pool, and elementwise_mul operators.
    • This version includes the JIT CPU Kernel, which performs basic vector operations and partial implementations of common algorithms including ReLU, LSTM and GRU, with automatic runtime switching between the AVX and AVX2 instruction sets.
    • Optimized CRF decoding and the LayerNorm implementation on the AVX and AVX2 instruction sets.
    • Issue fixed: AnalysisPredictor did not delete transfer data on GPU or in the CPU-to-GPU transition.
    • Issue fixed: memory occupied by a container held in a Variable grew continuously.
    • Issue fixed: fc_op could not process 3-D Tensors.
    • Issue fixed: Analysis predictor failed when running passes on GPU.
    • Issue fixed: GoogleNet problems on TensorRT.
    • Promotion of inference performance
      • Max Sequence pool optimization, single-op performance 10% higher.
      • Softmax operator optimization, single-op performance 14% higher.
      • Layer Norm operator optimization, including AVX2 instruction set support, single-op performance 5 times higher.
      • Stack operator optimization, single-op performance 3.6 times higher.
      • Added depthwise_conv_mkldnn_pass to accelerate MobileNet inference.
      • Reduced graph analysis time in analysis mode by a factor of 70.
      • DAM open-source model, 118.8% improvement over the previous version.
  • Mobile Inference
    • This version implements the winograd algorithm, boosting GoogleNet v1 performance by a dramatic 35%.
    • GoogleNet 8bit optimization, 14% faster than float.
    • Support for MobileNet v1 8bit, 20% faster than float.
    • Support for MobileNet v2 8bit, 19% faster than float.
    • FPGA V1 has developed the Deconv operator.
    • Android GPU supports mainstream network models such as MobileNet, MobileNetSSD, GoogleNet, SqueezeNet, YOLO, and ResNet.

Model

  • CV image classification tasks publish pre-trained models: MobileNet V1, ResNet101, ResNet152, VGG11.
  • CV Metric Learning models are extended with the arcmargin loss function, and the training method is adjusted: element-wise training produces the pre-trained model, followed by pair-wise fine-tuning to further improve precision.
  • NLP language model tasks are newly equipped with an LSTM implementation based on cudnn, which is 3~5 times faster than the PaddingRNN-based implementation across diverse argument settings.
  • A distributed word2vec model is included, with a new tree-based softmax operator and negative sampling, aligned with classic word2vec algorithms.
  • Distributed configurations of the GRU4Rec and Tag-Space algorithms are added.
  • The Multi-view Simnet model is refined, with an additional inference configuration.
  • Reinforcement learning algorithm DQN is supported.
  • Models currently compatible with python3.5 and above: semantic matching DAM, reading comprehension BiDAF, machine translation Transformer, language model, reinforcement learning DQN, DoubleDQN and DuelingDQN models, video classification TSN, Metric Learning, scene text recognition CRNN-CTC and OCR Attention, generative adversarial networks ConditionalGAN, DCGAN and CycleGAN, semantic segmentation ICNET and DeepLab v3+, object detection Faster-RCNN, MobileNet-SSD and PyramidBox, image classification SE-ResNeXt and ResNet, and personalized recommendation TagSpace, GRU4Rec, SequenceSemanticRetrieval, DeepCTR and Multiview-Simnet.
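The negative sampling used by the distributed word2vec model conventionally draws noise words with probability proportional to frequency^0.75. A stdlib-only sketch under that assumption (the function name and interface here are hypothetical illustrations, not the Paddle implementation):

```python
# Sketch of word2vec-style negative sampling: draw noise words with
# probability proportional to frequency ** 0.75 (the classic heuristic).
# Stdlib-only illustration; not the distributed Paddle implementation.
import random

def sample_negatives(freqs, k, exclude, rng):
    """Draw k negative words, skipping the positive target `exclude`."""
    words = [w for w in freqs if w != exclude]
    weights = [freqs[w] ** 0.75 for w in words]  # down-weight frequent words
    return rng.choices(words, weights=weights, k=k)

freqs = {"the": 1000, "cat": 50, "sat": 30, "mat": 20}
negs = sample_negatives(freqs, k=3, exclude="cat", rng=random.Random(0))
print(negs)  # three noise words drawn from {"the", "sat", "mat"}
```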

Distributed training

  • Multi-machine asynchronous CPU training
    • Asynchronous concurrent workers: AsyncExecutor is added. With an execution granularity of a single training file, it supports lock-free asynchronous worker-side computation in distributed training, as well as single-machine training. Taking a CTR task as an example, overall single-machine throughput is 14 times higher when single-machine threads are fully utilized.
    • IO optimization: DataFeed now supports AsyncExecutor and customizable generic classification task formats; CTRReader is added for CTR tasks to linearly scale data reading speed. In PaddleRec/ctr tasks, overall throughput doubles.
    • Better data communication: for sparsely accessed dense parameters such as Embedding, a sparse data communication mechanism is adopted. In semantic matching tasks, for instance, the amount of fetched parameters can be compressed to 1% and below; on real search-scenario data, overall throughput improved by 15 times.
  • Multi-machine synchronous GPU training
    • Issue fixed: the P2P training mode could hang in Transformer and Bert models.

Documentation

  • API
    • Added 13 API guides.
    • Added 300 entries of Chinese API Reference.
    • Improved 77 entries of English API Reference, including code examples and argument explanations.
  • Installation documentation
    • Added installation guides for python3.6 and python3.7.
    • Added an installation guide for pip install on Windows.
  • Book documentation
    • Code examples in the Book documentation now use the Low level API.

  • Usage documentation
    • Added "Notes on Operators"; updated "Saving and Loading Model Variables", "Introduction to the C++ Inference API", "Inference with the TensorRT Library", "How to Contribute Code", and other usage documents.

PaddlePaddle 1.1.0

31 Oct 02:48
66024e9


Release Notes

Major New Features and Improvements

Framework

  • Memory optimization strategy "eager deletion" now supports sub-block in control flow operators (e.g. if-else, while). Significantly reduce memory consumption of models with control flow operators.

  • Optimize split operator, significantly improve performance.

  • Extend multiclass_nms operator, supports polygon bounding box.

  • Added generate_proposals operator CUDA implementation, significantly improve performance.

  • Support fusing affine_channel operator and batch_norm operator, significantly improve performance.

  • Optimize depthwise_conv operator, significantly improve performance.

  • Optimize reduce_mean operator, significantly improve performance.

  • Optimize sum operator: when inputs are Tensors, one zero-memory pass is saved, significantly improving performance.

  • Optimize top_k operator, significantly improve performance.

  • Added new sequence_slice operator. For a sequence, slice sub-sequence based on specified start and length.

  • Added new sequence_unpad operator. Support padding Tensor to LoDTensor conversion.

  • Added new sequence_reverse, roi_align, and affine_channel operators.
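The sequence operators above have simple semantics. A plain-Python sketch (illustrative only; Paddle operates on padded Tensors and LoDTensors rather than Python lists):

```python
# Sketch of sequence operator semantics: sequence_unpad turns a padded
# batch plus per-row lengths back into variable-length sequences, while
# sequence_slice and sequence_reverse act on a single sequence.
# Illustrative plain Python, not the Paddle operators.

def sequence_unpad(padded, lengths):
    """Drop padding: keep only the first `length` entries of each row."""
    return [row[:n] for row, n in zip(padded, lengths)]

def sequence_slice(seq, offset, length):
    """Slice a sub-sequence starting at `offset` with `length` entries."""
    return seq[offset:offset + length]

def sequence_reverse(seq):
    """Reverse a single sequence."""
    return seq[::-1]

padded = [[1, 2, 0, 0], [3, 4, 5, 0]]
print(sequence_unpad(padded, [2, 3]))      # [[1, 2], [3, 4, 5]]
print(sequence_slice([1, 2, 3, 4], 1, 2))  # [2, 3]
print(sequence_reverse([1, 2, 3]))         # [3, 2, 1]
```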

Server Inference

  • Added avx, noavx auto switch feature, allow major models to automatically switch among avx, avx2, avx512.

  • Improve inference usability: Only need to include 1 header and 1 library.

  • Significantly improve ICNet inference performance.

Mobile Inference

  • Added Mali GPU and Andreno GPU support for mobilenet v1 model.

  • Added ZU5, ZU9 FPGA support for resnet34 and resnet50 models.


PaddlePaddle 1.0.2

24 Oct 06:31
4a93486


Fix SelectedRows type inference.

PaddlePaddle 1.0.1

10 Oct 11:46
cddff20


Fix Windows library dynamic loading problem

Fix Mac compile on MacOS 10.14

Fix truncated_normal

Fix manylinux docker build

Correctly set SelectedRows output shape

Correctly integrate tensorRT in inference library.

PaddlePaddle 1.0.0

09 Oct 01:08
627bea4


Release Log

Major New Features and Improvements:

  • Support MacOS training, inference, Windows inference (Alpha).

  • Speed up While operator

  • Enhance support for sparse tensor

  • TensorRT integration enhance

  • More fused operators for CPU inference: GRU, LSTM, etc.

  • Some improvements for sequence operators (sequence_pool, sequence_concat, sequence_mask, sequence_enumerate, sequence_slice, etc)

  • Other operator improvements: stack_op, BatchAUC, prelude, crf, pad2d

  • decayed_adagrad support for distributed training

  • Python multi-process reader

  • API doc improvements. Avoid kwargs.
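decayed_adagrad follows the commonly documented update rule: the squared-gradient accumulator decays instead of growing monotonically, then scales the step. A stdlib sketch of one update step (parameter names here are assumptions for illustration, not Paddle's exact operator attributes):

```python
# Sketch of the decayed_adagrad update rule as commonly defined:
#   moment <- decay * moment + (1 - decay) * grad^2
#   param  <- param - lr * grad / (sqrt(moment) + eps)
# Illustrative stdlib Python; the names are assumptions, not Paddle's
# exact operator attributes.
import math

def decayed_adagrad_step(param, grad, moment, lr=0.1, decay=0.95, eps=1e-6):
    moment = decay * moment + (1.0 - decay) * grad * grad
    param = param - lr * grad / (math.sqrt(moment) + eps)
    return param, moment

p, m = 1.0, 0.0
p, m = decayed_adagrad_step(p, grad=0.5, moment=m)
print(p, m)  # param decreases toward the gradient, moment accumulates
```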

Others:

  • Tighten public APIs. Hide public APIs that are currently not widely used and unlikely to be used in the near future.

  • Clean up some deprecated features.

Known Issues

  • Memory optimization still has space for improvements in next release.

  • Using memory optimization with distributed training should strictly follow some counter-intuitive instructions.

  • Sparse Tensor (SelectedRows) is not handled correctly in some operators; a fix is planned for the next release.


PaddlePaddle 1.0.0-rc0

25 Sep 10:45
644bad1


Pre-release

Release Log

Major New Features and Improvements:

  • Support MacOS training, inference, Windows inference (Alpha).

  • Speed up While operator

  • Enhance support for sparse tensor

  • TensorRT integration enhance

  • More fused operators for CPU inference: GRU, LSTM, etc.

  • Some improvements for sequence operators (sequence_pool, sequence_concat, sequence_mask, sequence_enumerate, sequence_slice, etc)

  • Other operator improvements: stack_op, BatchAUC, prelude, crf, pad2d

  • decayed_adagrad support for distributed training

  • Python multi-process reader

  • API doc improvements. Avoid kwargs.

Others:

  • Tighten public APIs. Hide public APIs that are currently not widely used and unlikely to be used in the near future.

  • Clean up some deprecated features.

Known Issues

  • Memory optimization still has space for improvements in next release.

  • Using memory optimization with distributed training should strictly follow some counter-intuitive instructions.


PaddlePaddle 0.15.0

05 Sep 03:46
1ca241c


Release Log

Major New Features and Improvements:

  • PyReader. Support python-level customized data loading and preprocessing for the buffered reader.

  • Unified Intermediate Representation (IR) and transforms for single-machine, distributed training and inference.

  • Python3 early support. (Alpha testing)

  • Inference library symbol hiding. Better isolation with other libraries linked together.

  • Distributed lookup table training with parallel executor, allowing distributed training to scale to large sparse datasets. (Alpha testing)

  • Major stability improvements and test coverage improvements of distributed training.

  • Polish high frequency enforce error message. Enhance user usability.

  • Profiler improvements for dist_train and fixes.

  • Operator improvements: mkldnn softmax_grad, squeeze_op, hsigmoid_op, Sampling id, beam_search, flatten_op, rank_loss_op, prior_box_op, bilinear initializer, squeeze/unsqueeze, maxout

  • Major expansion of TensorRT inference support.

  • Continuous Integration and Evaluation service scale and stability improvements

  • Hide many public APIs that shouldn't be exposed.

Performance:

  • layer_norm speedup: forward: 0.52ms -> 0.16ms (average) backward: 1.08ms -> 0.41ms (average)

  • conv_grad mkldnn speedup, fc, gru cpu improvements.

  • reduce_sum cpu kernel speedup: 4 times

  • softmax_with_cross_entropy op forward speedup: 52.4ms -> 15.6ms

  • OCR CPU model speed improvements. Enhanced im2col raised conv execution efficiency; some OCR model performance on 2620v3 improved 34.6%.

  • conv2d_transposed_op now supports setting Group; the depthwise conv2d_transposed speedup improved the face detection model by 16.5%.
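The layer_norm speedup above targets the standard forward pass: normalize each row to zero mean and unit variance, then apply scale (gamma) and shift (beta). A plain-Python reference of that computation (illustrative; not the optimized kernel):

```python
# Plain-Python reference of the layer_norm forward pass: normalize a row
# to zero mean / unit variance, then apply scale (gamma) and shift (beta).
# Illustrative; not the optimized Paddle kernel being benchmarked above.
import math

def layer_norm(row, gamma, beta, eps=1e-5):
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    inv_std = 1.0 / math.sqrt(var + eps)  # eps avoids division by zero
    return [g * (v - mean) * inv_std + b
            for v, g, b in zip(row, gamma, beta)]

out = layer_norm([1.0, 2.0, 3.0], gamma=[1.0, 1.0, 1.0], beta=[0.0, 0.0, 0.0])
print(out)  # roughly [-1.2247, 0.0, 1.2247]
```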

Others:

  • Added external dependencies: xbyak, cub, libxsmm

  • Merge libpaddle_inference_api[.a/.so] into libpaddle_fluid[.a/.so]. Inference only need to link libpaddle_fluid[.a/.so].

  • Fixes of float16 support

  • Significantly reduce fluid.tgz package size. GPU version reduced from 730M to 190M. CPU version reduced from 335M to 77M.

Known Issues

  • Using memory_optimize with distributed training might trigger subtle bugs. We are aiming to fix it in the next release.


PaddlePaddle 0.15.0-rc0

03 Sep 03:23
64d48f4


Pre-release

Release Log

Major New Features and Improvements:

  • PyReader. Support python-level customized data loading and preprocessing for the buffered reader.

  • Unified Intermediate Representation (IR) and transforms for single-machine, distributed training and inference.

  • Python3 early support. (Alpha testing)

  • Inference library symbol hiding. Better isolation with other libraries linked together.

  • Distributed lookup table training with parallel executor, allowing distributed training to scale to large sparse datasets. (Alpha testing)

  • Major stability improvements and test coverage improvements of distributed training.

  • Polish high frequency enforce error message. Enhance user usability.

  • Profiler improvements for dist_train and fixes.

  • Operator improvements: mkldnn softmax_grad, squeeze_op, hsigmoid_op, Sampling id, beam_search, flatten_op, rank_loss_op, prior_box_op, bilinear initializer, squeeze/unsqueeze, maxout

  • Major expansion of TensorRT inference support.

  • Continuous Integration and Evaluation service scale and stability improvements

  • Hide many public APIs that shouldn't be exposed.

Performance:

  • layer_norm speedup: forward: 0.52ms -> 0.16ms (average) backward: 1.08ms -> 0.41ms (average)

  • conv_grad mkldnn speedup, fc, gru cpu improvements.

  • reduce_sum cpu kernel speedup: 4 times

  • softmax_with_cross_entropy op forward speedup: 52.4ms -> 15.6ms

  • OCR CPU model speed improvements. Enhanced im2col raised conv execution efficiency; some OCR model performance on 2620v3 improved 34.6%.

  • conv2d_transposed_op now supports setting Group; the depthwise conv2d_transposed speedup improved the face detection model by 16.5%.

Others:

  • Added external dependencies: xbyak, cub, libxsmm

  • Merge libpaddle_inference_api[.a/.so] into libpaddle_fluid[.a/.so]. Inference only need to link libpaddle_fluid[.a/.so].

  • Fixes of float16 support

  • Significantly reduce fluid.tgz package size. GPU version reduced from 730M to 190M. CPU version reduced from 335M to 77M.
