Core Inference Capabilities and Model Support Enhancements
- Support full logprob and prompt_logprob output for text prompts #4769
- Support ZMQ-based logprobs / prompt_logprobs in offline inference, and introduce the max_logprobs parameter #4897
- Support ZMQ-based logprobs / prompt_logprobs in online inference, with optimized communication #5089
- Add a switch controlling token_id decoding for logprobs / prompt_logprobs #5463
- Add the llguidance backend for constrained decoding #5124
- CUDAGraph support for speculative-decoding Draft Model acceleration (off by default)
- [Speculative Decoding] Decouple the draft_tokens post-processing pipeline #5205
- Support the Pooling model Runner
- Support Reward models
- Generic embedding interface for Pooling models #4344
- Custom reward interface for Pooling models #4518
- Add a reasoning_parser for the open-source model Ernie-4.5-VL-28B-A3B-Thinking, compatible with both - and _ naming conventions #4571 #4668
- Support toggling thinking mode via chat_template_kwargs.options.thinking_mode
- Support multimodal models taking prompt_token_ids requests, with multimodal data passed via messages, enabling tokens-in / tokens-out
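For orientation, the sketch below shows how the logprobs features and the thinking-mode switch above might surface in a request body on an OpenAI-compatible endpoint. The overall payload shape is an assumption based on common OpenAI-style servers, not FastDeploy's confirmed schema; only the names logprobs, max_logprobs, and chat_template_kwargs.options.thinking_mode come from the notes above, and the thinking_mode value shown is illustrative.

```python
import json

# Hypothetical chat-completions request body for an OpenAI-compatible server.
# Field placement is an assumption; consult the FastDeploy docs for the
# authoritative schema (e.g. accepted thinking_mode values, max_logprobs).
payload = {
    "model": "ERNIE-4.5-VL-28B-A3B-Thinking",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    # Ask for per-token log probabilities (top 5 alternatives per position).
    "logprobs": True,
    "top_logprobs": 5,
    # Toggle the model's thinking mode via chat template options.
    "chat_template_kwargs": {"options": {"thinking_mode": "close"}},  # illustrative value
}

body = json.dumps(payload)
print(body)
```

Note that max_logprobs (per the notes above) caps how many logprobs the server will return; a request asking for more than the server-side limit would be rejected or clamped.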
Parallelism, Scheduling, and MoE Evolution
- Eliminate communication overhead of EP empty-input runs for GLM / Qwen models #5254
- Support chunked MoE execution #4575
- Support EPLB (Expert Load Balancing) #4782
- Support EPLB rearrangement and redundant-expert strategies #5142 #5143 #5178 #5239 #5918
- Support routing replay
- PD disaggregation supports EP-parallel deployment of the DeepSeek V3 model #5251
- PD disaggregation supports EP-parallel deployment of the Qwen3-MoE model #4691
- PD disaggregation supports different TP sizes for Prefill and Decode #5296
- Add a Python-based Router supporting both centralized and disaggregated deployment scheduling #4709
- Support multi-step MTP + CUDAGraph + PD disaggregation
- Support lossless MTP verification
- Support chunked MTP #5343
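To make the EPLB idea above concrete: expert load balancing reassigns experts to ranks based on observed routing load so that no rank becomes a hotspot (redundant-expert strategies additionally replicate the hottest experts). The following is a minimal greedy sketch of that placement step; it is not FastDeploy's actual algorithm, and the function name and inputs are hypothetical.

```python
def greedy_eplb(expert_load, num_ranks):
    """Assign experts to ranks so per-rank token load is roughly balanced.

    expert_load: token count routed to each expert (index = expert id).
    Returns (placement, rank_load) where placement[e] is the rank of expert e.
    """
    # Place the heaviest experts first, each onto the currently lightest rank.
    order = sorted(range(len(expert_load)), key=lambda e: -expert_load[e])
    rank_load = [0] * num_ranks
    placement = [0] * len(expert_load)
    for e in order:
        r = min(range(num_ranks), key=lambda i: rank_load[i])
        placement[e] = r
        rank_load[r] += expert_load[e]
    return placement, rank_load

placement, rank_load = greedy_eplb([90, 10, 40, 60, 5, 35], num_ranks=2)
print(placement, rank_load)
```

A real EPLB pass runs periodically on measured routing statistics and must also move the expert weights, which is why the rearrangement PRs above are non-trivial.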
Multimodal, Caching, and Quantization Enhancements
- Support mixed Prefill scheduling of multimodal single-batch and text-only multi-batch requests #4611
- Support multimodal Prefix Cache #4803
- Dynamic quantization supports Prefix Cache #5125
- Fix and support enabling multimodal Prefix Cache and CUDAGraph together #4679
- Support W4AFP8 dynamic quantization #5282
- Support separate loading of static C8 scales #4624
- Improve Machete support for different quantization group sizes #4911
- Support the Flash Mask Attention backend #5104 #5134 #5387
- Optimize v1 Loader loading performance #4532
- Support precompiled packages #4729
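As background for the Prefix Cache items above: prefix caching keys KV-cache blocks by a chain hash of their token contents, so a new request can reuse every leading block that matches a previous request. The sketch below is an illustrative toy, assuming a 4-token block size and SHA-256 chain hashing; it is not FastDeploy's implementation.

```python
import hashlib

BLOCK = 4  # tokens per cache block (illustrative; real engines use larger blocks)

def block_hashes(token_ids):
    """Chain-hash full blocks so each hash commits to the entire prefix."""
    hashes, h = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK  # ignore the partial tail block
    for i in range(0, full, BLOCK):
        chunk = ",".join(map(str, token_ids[i:i + BLOCK])).encode()
        h = hashlib.sha256(h + chunk).digest()  # commits to all earlier blocks
        hashes.append(h)
    return hashes

def cached_prefix_blocks(cache, token_ids):
    """Count leading blocks already in the cache, then register the rest."""
    hashes = block_hashes(token_ids)
    hits = 0
    for h in hashes:
        if h in cache:
            hits += 1
        else:
            break
    cache.update(hashes)
    return hits

cache = set()
a = cached_prefix_blocks(cache, [1, 2, 3, 4, 5, 6, 7, 8, 9])    # cold request
b = cached_prefix_blocks(cache, [1, 2, 3, 4, 5, 6, 7, 8, 100])  # shares 2 full blocks
print(a, b)
```

The multimodal variant (#4803) extends the same idea by folding image/video feature identity into the hash, so visually identical prefixes can also hit the cache.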
Expanded Multi-Hardware Platform Support
P800
Intel HPU
Iluvatar GPU
MetaX
- Support ERNIE-4.5-VL-28B #4820
- Add Cutlass MoE #4602 #4685 #5128
- Support the default_v1 loader #4956 #5001
- Optimize Flash MLA performance #4915
- Add default_v1 loader and quant_config for Triton MoE #5030
- Support ENABLE_V1_KVCACHE_SCHEDULER #5163
Performance, Observability, and Stability Fixes
Performance and Communication Optimizations
- AppendAttn operator supports CUDA-PDL #5072
- Eliminate DeepGemm H2D transfers #5262
- Optimize centralized EP communication logic #5145
- Remove the DtoH synchronization overhead of Append Attention under CUDA Graph
- Support two-stage low-latency communication #4162
- Support hybrid TP + EP parallelism #4615 #5315 #5353
- Compile RDMA by default and reduce multimodal CUDAGraph overhead
Observability and Security
Stability and Bug Fixes
- Fix logprob / prompt_logprob computation, serialization, and communication issues #4681 #4884 #5237 #5335
- Fix stability issues across EP, PD disaggregation, MTP, Prefix Cache, quantization, multimodal, and other inference scenarios
- Fix operator and parameter-validation issues on multiple hardware platforms (XPU / MetaX / Iluvatar / P800)
What's Changed
- [BugFix] fix total_block_num init error in worker_process by @RichardWooSJTU in #4553
- [BugFix] Fix graph opt test case by @gongshaotian in #4634
- [Feature] add mm token usage by @ApplEOFDiscord in #4570
- [XPU] Update the return value of TextImageGatherScatter by @ddchenhao66 in #4636
- [Docs] Add PaddleOCR-VL-0.9B best practices by @ming1753 in #4658
- [XPU] fix pos_emb_type bug by @cqulilujia in #4638
- [Docs] add Qwen25vl yaml by @xjkmfa in #4662
- [Feature] add a new reasoning parser by @kxz2002 in #4571
- [XPU] [CI] Increase pytest timeout for XPU ep test by @plusNew001 in #4665
- add noaux_tc to unitest fused_moe by @zhoutianzi666 in #4656
- [EP] fix several bugs in data parallel by @ltd0924 in #4657
- [OP] Add InferShape&InferDtype for per_token_quant_padding by @DrRyanHuang in #4667
- 【Hackathon 9th No.86】autogen MoeFastHardamardImplWrapper template_instantiation by @ccsuzzh in #4592
- [UT] Add ut for speculative sampler by @Deleter-D in #4650
- [Doc] update docs by @ApplEOFDiscord in #4675
- [Graph Optimization] Add the CUDAGraph usage switch for Draft Model by @gongshaotian in #4601
- [CI] Add test for paddleocr_vl by @Limerances in #4627
- [unitest]add real gate_correction_bias weight to mock real data dispatch by @zhoutianzi666 in #4676
- [noauxtc_kernel] remove useless code by @zhoutianzi666 in #4643
- [BugFix] fix offline llm chat "enable_thinking" is always "False" by @kxz2002 in #4686
- [BugFix] fix total_block_num init error in worker_process and test_async_llm not throw error by @xyxinyang in #4687
- [BugFix] fix --logprobs-mode raw_logits by @ckl117 in #4681
- [XPU] xpu currently disable prefix cache for VL model by @ddchenhao66 in #4695
- [XPU] [CI] Add Vl case by @plusNew001 in #4649
- [BugFix] Fix finish reason in _create_chat_completion_choice by @kxz2002 in #4582
- [Feature] Unify the registration name recognition for tool_parser and reasoning_parser to “-” by @kxz2002 in #4668
- [BugFix] fix unittest of get_save_output_v1 by @Wanglongzhi2001 in #4701
- [XPU] [CI] Lock xvllm version by @plusNew001 in #4715
- [Graph Optimization] SOT+CUDAGraph support ERNIE4.5T VL 28B / 424B by @DrRyanHuang in #4645
- [Feature] support mtp distribution equivalence verification by @Deleter-D in #4699
- [KVCache] Support kv cache scale load by @Sunny-bot1 in #4624
- add flops and bandwidth to test_ffn.py by @zhoutianzi666 in #4704
- Benchmark tool supports specifying response_format for constrained-decoding scenarios by @ophilia-lee in #4718
- [CI] add missing unit tests for tokenizer_cli by @xiaolei373 in #4620
- [Scheduler] update v1 prefill batch by @kevincheng2 in #4611
- [BugFix] Fix profile run in pd-disaggregated deployment by @liyonghua0910 in #4584
- [BugFix] fix mm prefix_cache cuda error bug by @kevincheng2 in #4679
- [Feature] Check bos url by @kevincheng2 in #4711
- [BugFix] fix wint2 config by @chang-wenbin in #4721
- [FDConfig] [PD Disaggregation] [Graph Optimization] Close Cudagraph for P node when PD Disaggregation by @littledgg in #4632
- [XPU] xpu support neox style ROPE by @ddchenhao66 in #4719
- [BugFix] Skip building native architecture when specifying arch list by @ming1753 in #4727
- fix noaux by @zhoutianzi666 in #4731
- [BugFix] fix thinking bug by @yuanlehome in #4710
- [CI] Fix rollout_model test logic by @EmmonsCurse in #4730
- [Feature] support pooling model runner by @lizexu123 in #4590
- format code by @zhoutianzi666 in #4720
- [CI] fix some ci yaml by @EmmonsCurse in #4747
- [Docs]Update XPU document version to 2.3.0 by @yyssys in #4741
- [Speculative Decoding][MTP]Support mtp in splitwise and scheduler_v1 mode by @freeliuzc in #4743
- [Speculative Decoding][MTP]Support attn mask offset by @freeliuzc in #4641
- [Docs]Add parameter to the start service command by @yyssys in #4753
- [Docs]Add parameter by @yyssys in #4755
- [Docs] fix PaddleOCR-VL docs bug by @ming1753 in #4702
- [Feature] Support eplb for fd by @rainyfly in #4599
- [XPU] add v1 support for bf16 by @iosmers in #4744
- 【DataProcessor】add options thinking_mode by @luukunn in #4735
- [Optimize] Support and robust for tpN for PD by @rainyfly in #4595
- [Docs] fix error by @yyssys in #4768
- [CI]test common model by @bukejiyu in #4697
- [Metax] adapt cutlass moe for ernie-vl by @neilzhuu in #4685
- fix dynamic Cfp8 for RL load by @rsmallblue in #4144
- [Docs] PaddleOCR-VL add RTX3060 server param by @ming1753 in #4765
- [BugFix] fix deepseek cuda error by @kevincheng2 in #4739
- [XPU][CI] fix ci base value bug by @plusNew001 in #4783
- [OP]Fix attn_params by @freeliuzc in #4787
- [CI]delete test_common_model by @bukejiyu in #4794
- [XPU] fix thinking bug where output only contains reasoning_content by @ddchenhao66 in #4761
- [XPU] add deployment doc for PaddleOCR-VL in XPU by @cqulilujia in #4784
- [BugFix] Fix ernie4_5_vl_processor.py and qwen_vl_processor.py can not disable thinking by @kxz2002 in #4762
- supports internode_ll_two_stage by @carryyu in #4162
- supports pd partn by @carryyu in #4615
- [Docs] Add new support models by @ming1753 in #4801
- [CI] Refactor CE wheel upload for multiple target paths by @EmmonsCurse in #4790
- [Docx] update mkdocs.yml by @yangjianfengo1 in #4804
- [BugFix] Fix step_shm_value in PD disaggregated deployment by @liyonghua0910 in #4780
- Update Unit Test for PaddleOCR-VL by @Limerances in #4802
- [Metax] adapt cutlass moe and fix mla attention for DeepSeek by @xiaozude in #4602
- [Feature][Executor] GPU Model Runner Supports prompt_logprobs and max_logprobs by @ckl117 in #4769
- [get_padding_offset.] clean get_padding_offset.cu by @zhoutianzi666 in #4777
- support ep+tp at op layer by @zhupengyang in #4688
- [BugFix] fix reasoning parser register name by @kxz2002 in #4795
- remove input_ids from ForwardMeta by @zhoutianzi666 in #4793
- [Feature] Add timestamp for profiler by @rainyfly in #4726
- [XPU]Support V1 loader in weight_only Model by @iosmers in #4808
- [Bug Fix] process transparent image by @ApplEOFDiscord in #4807
- add paddleocr_vl benchmark by @zhang-prog in #4833
- [Doc] Update docs for v2.3.0rc0 by @Jiang-Jia-Jun in #4828
- [BugFix] fix messages being inplace modified in offline chat api by @liyonghua0910 in #4831
- 【New Feature】W4afp8 supports per group quantization by @yangjianfengo1 in #4272
- [CI] fix docker_build error and add tag-base by @EmmonsCurse in #4810
- [PD Disaggregation] Support Qwen3-MoE use PD + EP inference. by @K11OntheBoat in #4691
- remove seq_lens_this_time by @zhoutianzi666 in #4821
- [BugFix] Fix ernie_vl_reasoning_parsers.py 'end_token' to 'think_end_token' by @kxz2002 in #4805
- Fix: ci port conflict by @sunlei1024 in #4840
- [CI] Add unittest for activation, native_paddle_backend, w4a8, w4afp8, platforms/utils by @Echo-Nie in #4812
- [XPU][CI]Change ci vl model to 28 b by @plusNew001 in #4764
- [Fix] fix ernie4_5_vl model torch format loading by @aquagull in #4447
- [Feature] [PD] add simple router and refine splitwise deployment by @juncaipeng in #4709
- [Docs] fix: correct typo in nvidia_gpu.md by @playaswd in #4848
- [BugFix] Fix list to List by @Echo-Nie in #4818
- [BugFix] Del get_act_fn, _load_st_projector by @Echo-Nie in #4824
- [Benchmark] Enhance benchmark output logging by @ZhangYulongg in #4682
- [XPU] ep+tp all2all by @zhupengyang in #4836
- [CI] Add Check PR Template by @EmmonsCurse in #4481
- Revert "【New Feature】W4afp8 supports per group quantization" by @EmmonsCurse in #4854
- [CI] Update deploy.py by @ZhangYulongg in #4850
- [CI] Optimize port cleanup logic by @EmmonsCurse in #4860
- [Bug Fix] fix ernie4_5_vl_moe by @LokeZhou in #4843
- Revert "[Bug Fix] fix ernie4_5_vl_moe" by @Jiang-Jia-Jun in #4863
- [Feature] support mm disable_chunked by @kevincheng2 in #4803
- [CI] Update ERNIE-4.5-VL baseline to adapt to MoE changes by @EmmonsCurse in #4867
- [CI] Refactor check-bypass logic in run_tests_with_coverage by @EmmonsCurse in #4655
- [Others] Delete PaddleOCR Useless Function by @Limerances in #4815
- [Feature] Optim PaddleOCR-VL by @ming1753 in #4873
- [XPU] fix ep_tp all2all ci by @zhupengyang in #4876
- [XPU] modify 424B model deployment parameter by @ddchenhao66 in #4888
- [XPU][CI] Ci bug fix by @plusNew001 in #4889
- [BugFix] fix token_processor zmq by @ckl117 in #4827
- [CI] fix docker_build error of ciuse by @EmmonsCurse in #4886
- [Metax] support ERNIE-4.5-VL-28B by @neilzhuu in #4820
- [BugFix] max_logprobs=-1 maps to ori_vocab_size by @ckl117 in #4884
- [Feature] Enable FastDeploy to support adding the “--api-key” authentication parameter. by @kxz2002 in #4806
- [Docs]Supplement the English and Chinese user documentation for Tool calling by @AuferGachet in #4895
- [XPU][CI]Update test assertion and base response value by @plusNew001 in #4907
- [BugFix] When the value of "temperature" is 0, adjust it to 1e-06 by @luukunn in #4900
- [Docs] add api-key usage instructions by @LiqinruiG in #4902
- [CI] Add four unittest by @Echo-Nie in #4906
- [Bug Fix] fix bug for PD EP by @rainyfly in #4823
- [DeepEP] support async prefill by @zhoutianzi666 in #4899
- [XPU]Update documentation by @qw86972190 in #4917
- [Docs] Improve reasoning_out docs by @LiqinruiG in #4901
- [BugFix] Fix inference_start_time by @kxz2002 in #4922
- [BugFix] Add support for weight shape constraints and group size selection in Machete by @Sunny-bot1 in #4911
- [XPU] [CI]Change CI to multi-concurrency by @plusNew001 in #4866
- [Docs] add doc for glm by @ckl117 in #4933
- [Opti] Unlimit zmq message lens limit by @rainyfly in #4465
- [TSP] Support qwen3 moe tsp + cudagraph by @yuanlehome in #4871
- Update docs for v2.3.0 by @yangjianfengo1 in #4938
- [Docs] add ERNIE-4.5-VL-28B-A3B-Thinking instruction by @LiqinruiG in #4937
- [BugFix][Models] Add tie_word_embeddings for lmhead by @DrRyanHuang in #4916
- [Iluvatar] add vl into ci and support v1 loader by @wuyujiji in #4774
- [Docs] add ERNIE-4.5-VL-28B-A3B-Thinking instruction by @LiqinruiG in #4944
- [XPU][Doc]Update XPU release2.3 note by @iosmers in #4939
- [XPU] fix xpu deployment md by @cqulilujia in #4941
- [CI][XPU]Update run_ci_xpu.sh to lock paddlepaddle-xpu version by @plusNew001 in #4949
- [Perf] Support tensor transmission between work and engine with zero-copy to improve efficiency by @sunlei1024 in #4839
- [PD Disaggregation]Replace paddle.max by numpy to avoid useless error log by @K11OntheBoat in #4893
- [CI] Update test_api_key.py by @kxz2002 in #4948
- [Others] Add Tests for GPU Model Runner and Logprobs Output by @ckl117 in #4913
- [Iluvatar][Doc] Add ERNIE-4.5-VL-28B-A3B-Thinking doc by @wuyujiji in #4955
- [ATTENTION] by @zhoutianzi666 in #4945
- [CI][XPU]Update health check endpoint to use port variable by @plusNew001 in #4965
- [CI] fix apt_sources error of focal in docker_build by @EmmonsCurse in #4961
- [Loader] Refactor PT model loading by @bukejiyu in #4532
- [CI][XPU] Change Paddle Version to Nightly by @plusNew001 in #4973
- [CI] Add five unittest by @Echo-Nie in #4958
- [Docs] Add License in Unittest by @Echo-Nie in #4957
- [CI] remove useless tests in docker_build by @EmmonsCurse in #4974
- [CI] Update PORT range to avoid conflict with system ports by @EmmonsCurse in #4953
- [Benchmark] Add GEMM & MoE kernel bench by @Sunny-bot1 in #4809
- [Iluvatar][CI] fix safetensors_rust.SafetensorError: framework paddle… by @wuyujiji in #4972
- [KVCache] support unified cache backend by @ltd0924 in #4903
- [loader]Update requirements and xpu ci by @bukejiyu in #4969
- [CI][XPU] Fix EP Case Bug by @plusNew001 in #4976
- [BugFix] Avoid loading training file by @BossPi in #4966
- [Metax] support default_v1 loader & thinking model by @StareAtYou in #4956
- [Metax] optimize flash mla by @xiaozude in #4915
- [Docs] remove load default_v1 since already been as default by @zoooo0820 in #4980
- [XPU] fix text_image_gather_scatter op by @cqulilujia in #4882
- [Logprobs]Support prompt_logprobs and max_logprobs by @qwes5s5 in #4897
- [CI] fix test_model_cache by @bukejiyu in #4982
- [BugFix] fix VL fp8 bug when moe token_num is 0 by @ming1753 in #4928
- [BugFix] Fix mtp tsp by @yuanlehome in #4990
- [CI] set DG_NVCC_OVERRIDE_CPP_STANDARD in test_quantized_linear by @EmmonsCurse in #4995
- [FDConfig] add block number verified by @ltd0924 in #4983
- [Optimization] Skip memcpy(DtoH) capture in get_block_shape_and_split_kv_block by @Sunny-bot1 in #4988
- [BugFix] fix num_requests_running after clear_data by @liyonghua0910 in #4927
- [worker_process.py]modify some var name by @zhoutianzi666 in #4749
- [Loader]Fix and complete the MTP loader by @bukejiyu in #4985
- [XPU] [CI] Change CI ep test from offline to online by @zccjjj in #4885
- [BugFix][Metax] Fix metax compile issue in get_block_shape_and_split_kv_block by @Sunny-bot1 in #5000
- [Feature] Enhance build script, add pre_wheel logic by @Echo-Nie in #4729
- 【New Feature】W4afp8 supports per group quantization by @yangjianfengo1 in #4987
- optimize dy_cfp8's performance by @carryyu in #4126
- [BugFix] adjust max_tokens and min_tokens when continue to generate tokens by @kxz2002 in #5010
- [PD Disaggregation] remove splitwise deployment on single node and refine the code by @juncaipeng in #4891
- [CI]【Hackathon 9th Sprint No.56】NO.56 add unit tests for fastdeploy/multimodal/utils.py by @essos-bot in #4954
- [Docs] Fix broken commitID by @Echo-Nie in #5008
- [CI] Temporarily lock paddlepaddle-gpu as of 20251112 by @EmmonsCurse in #5017
- [ATTENTION] unitest by @zhoutianzi666 in #4962
- [Executor]move batch_id_per_token by @zhoutianzi666 in #4853
- [BugFix] Revert skip capture by @Sunny-bot1 in #5023
- [Others] check args max_logprobs by @ckl117 in #5018
- [CI]【Hackathon 9th Sprint No.32】NO.32 add unit tests for fastdeploy/input/ernie4_5_vl_processor/process_video.py by @WintersMontagne10335 in #5011
- [Optimization] xgrammar async compile, multi thread, speed up by @ST-XX in #4835
- [CI][XPU] Optimize CI logs and variable names by @plusNew001 in #5025
- [Intel HPU] enable level 1 prefix caching and fix some bugs by @fmiao2372 in #4971
- [Iluvatar][CI] Fix moe_expert_dispatch cannot support dequant_scale by @wuyujiji in #5012
- 【Fix】fix deepep dispatch by @yangjianfengo1 in #5036
- [Metax] support default_v1 loader and quant_config is None for triton… by @xiaozude in #5030
- [APIServer] metrics use port the same as api_port by @xyxinyang in #5016
- [Log] Add trace log and add loggingInstrumentor tool by @qwes5s5 in #4692
- [CI]【Hackathon 9th Sprint No.13】NO.13 add unit tests for fastdeploy/model_executor/ops/triton_ops/triton_utils.py by @WintersMontagne10335 in #5035
- 【Hackathon 9th No.109】[CppExtension] Support build Custom OP in setuptools 80+ -part by @megemini in #4977
- [CI]【Hackathon 9th Sprint No.28】NO.28 add unit tests for fastdeploy/model_executor/ops/triton_ops/triton_utils_v2.py by @WintersMontagne10335 in #5073
- [BugFix] rollback max_tokens and min_tokens when continue to infer by @LiqinruiG in #5052
- [Intel HPU] fix bugs caused by other commits by @fmiao2372 in #5074
- [XPU][CI] fix ci case bug by @plusNew001 in #5084
- Revert "[BugFix] Revert skip capture" by @Sunny-bot1 in #5080
- [Fix] Fix block allocation issue when MTP and logprobs are enabled by @sunlei1024 in #5077
- revert group size 3 by @zhoutianzi666 in #5079
- [INTEL_HPU] enabled fastdeploy PR testing by @FocusLuo in #4596
- [Feature][OP] Append Attn Support CUDA-PDL by @ckl117 in #5072
- 【Hackathon 9th No.76】supplementary unit test for XGrammarChecker by @Echo-Nie in #4075
- [CI] Enable check_pr_template in CI rerun by @EmmonsCurse in #5093
- [Metax] support default_v1 loader based #4988 by @StareAtYou in #5001
- [Iluvatar][CI] disable compiling cudaLaunch API by @wuyujiji in #5100
- Revert "[CI] Temporarily lock paddlepaddle-gpu as of 20251112" by @EmmonsCurse in #5098
- [OP] format flash_mask_attn by @lizhenyun01 in #5104
- [unitest]clean code by @zhoutianzi666 in #5094
- [Docs]fix_cli_docs by @xiaolei373 in #5109
- [BugFix] unify max_tokens by @kxz2002 in #4968
- [HPU][CI]Update Docker image in CI workflow by @plusNew001 in #5108
- [PD Disaggregation]Fix dummy run when use PD Disaggregation with EP inference. by @K11OntheBoat in #5112
- [Feature] ThreadPoolExecutor async fill_token_bitmask by @ST-XX in #5083
- [XPU][Docs]Update document by @qw86972190 in #5091
- [CI]【Hackathon 9th Sprint No.31】NO.31 add unit tests for fastdeploy/input/ernie4_5_processor.py by @WintersMontagne10335 in #5097
- [RL]Resolve shape mismatch problems in RL-related modules by @bukejiyu in #5032
- [CI]Exclude abstract methods and irrelevant backend files by @EmmonsCurse in #5031
- [CI] add metrics case by @ZhangYulongg in #5115
- 【Hackathon 9th No.109】[CppExtension] [XPU] Support build Custom OP in setuptools 80+ -part by @megemini in #5106
- [Docs] add ebvlthinking yaml by @tianlef in #5120
- [Metax][BugFix] Fix METAX_GPU OPs Compile Error by @ckl117 in #5114
- [Feature] Add an unquantized option for MoE and Dense quant type by @Sunny-bot1 in #4813
- [BugFix] rollback max_tokens and min_tokens when continue to infer by @LiqinruiG in #5082
- [CI] Add workflow to auto-remove skip-ci labels after new commits by @EmmonsCurse in #5129
- [BugFix] Support skipping activation scale loading for w4afp8 by @Sunny-bot1 in #5117
- [Feature] support async download features by @kevincheng2 in #5003
- [CI] Temporarily lock paddlepaddle-gpu as of 20251118 by @EmmonsCurse in #5136
- [HPU][CI]Hpu ci update by @plusNew001 in #5116
- [Speculative Decoding][MTP]Support stop_seqs and pd-split mode by @freeliuzc in #5029
- [Metax] optimize cutlass moe and flash attention backend by @neilzhuu in #5128
- [Scheduler] Support chunk prefill for video input by @yangjianfengo1 in #5107
- [Others]get_block_shape_and_split_kv_block clean code by @zhoutianzi666 in #5123
- [Optimization] default compile rdma, reduce cudagraph buffer size in mm, fix some config bug by @yuanlehome in #5121
- [Others] clean code by @zhoutianzi666 in #5133
- [CI][XPU] Add XPU chunked_prefill && prefix_caching case by @plusNew001 in #5139
- [Graph Optimization][SOT] Eliminate BreakGraph by move import stmt to top by @DrRyanHuang in #5146
- [BugFix] Fix zero workspace returned by CUB size query under CUDA Graph in MoE dispatch by @littledgg in #5087
- [BugFix] [PD Disaggregation] Fix schedule error in splitwise deployment by @juncaipeng in #5149
- [BugFix] [PD Disaggregation] fix v1 scheduler prefill node profile run & ipc transfer protocol by @liyonghua0910 in #5132
- [Feature] support bos download retry by @kevincheng2 in #5137
- [CI] Unified diff coverage upload logic by @EmmonsCurse in #5127
- [CI]【Hackathon 9th Sprint No.51】NO.51 add unit tests for fastdeploy/scheduler/dp_scheduler.py by @essos-bot in #5046
- [PD Disaggregation][XPU] Add XPU support for PD disaggregation by @ddchenhao66 in #5113
- [Feature] Support noaux for eplb by @xiaoxiaohehe001 in #5143
- [RL]Fix missing is_distributed attribute by @bukejiyu in #5150
- [ENV] support AK SK ENDPOINT while getting the multi_modal's feature by @lizhenyun01 in #5159
- [Speculative Decoding][MTP] Support static CacheKV C8 quantization and optimize memory usage by @freeliuzc in #5155
- [PD Disaggregation] [Refine] Refine splitwise deployment by @juncaipeng in #5151
- [Fix] Fix noaux ep test by @xiaoxiaohehe001 in #5161
- [Polish] Simplify repr method in Request class by @Jiang-Jia-Jun in #5153
- [BugFix] fix num of rdma_comm_ports check by @yuanlehome in #5168
- [Optimization] Improve perf for fd response token with internal adapter by @rainyfly in #4992
- [BugFix] fix reschedule with mtp + logprob by @Deleter-D in #5165
- [Feature] dyc8 support prefixcache by @kevincheng2 in #5125
- [Feature] remove to_numpy by @kevincheng2 in #5162
- 【Hackathon 9th No.109】[CppExtension] Add the fastdeploy_ops directory to package_data to support modern packaging - part by @megemini in #5156
- [CI] fix coverage_report in daily test by @EmmonsCurse in #5175
- [Others] unitest tests/layers/test_attention_layer.py by @zhoutianzi666 in #5174
- [CI] Ignore new custom ops stub file in coveragerc by @SigureMo in #5177
- [CI] add output for last_token in test_streaming_with_stop_str by @EmmonsCurse in #5170
- [XPU]Update documentation by @qw86972190 in #5180
- [Fix] Fix eplb bug and support fp8 load weight by @xiaoxiaohehe001 in #5178
- [CI]【Hackathon 9th Sprint No.18】NO.18 add unit tests for functional module -part by @xunyoyo in #5064
- [BugFix] fix release block ids by @juncaipeng in #5184
- [XPU][CI] change VL model to 28B-VL-thinking by @plusNew001 in #5169
- [Feature] Supports separate loading of offline quantization for moe. by @xiaoxiaohehe001 in #5142
- [Metax] support ENABLE_V1_KVCACHE_SCHEDULER by @xiaozude in #5163
- [Feature] support eplb in api_server by @kevincheng2 in #4782
- [BugFix] dummy import some ops by @yuanlehome in #5192
- [CI] Update redis download source for docker_build failure fix by @EmmonsCurse in #5198
- [Bug fix] Send first token in D instance by @rainyfly in #5199
- [BugFix] [OP] Fix the error in MoeExpertFFN operator when valid_token_num=0 by @zccjjj in #5196
- [CI] Add Unittest by @Echo-Nie in #5187
- [CI]【Hackathon 9th Sprint No.17】NO.17 add unit tests for functional module by @xunyoyo in #5054
- [CI]【Hackathon 9th Sprint No.24】NO.24 add unit tests for functional module by @xunyoyo in #5055
- [Speculative Decoding][MTP]Update extract_mtp_weight script and optimize config by @freeliuzc in #5183
- [XPU] [CI] Xpu ci lock PaddlePaddle Version by @plusNew001 in #5218
- [BugFix] fix work metrics not returned by metrics api by @liyonghua0910 in #4912
- [BugFix] fix mm_positions type error by @kevincheng2 in #5182
- [Benchmark]add qwen3-235b pd+ep yaml by @xiegegege in #5225
- [CI] Add Cherry-Pick PR check logic by @EmmonsCurse in #5191
- [FDConfig] disable use_sequence_parallel_moe default by @yuanlehome in #5222
- [Feature] The 45VL supports prompt_token_ids + messages input. by @kxz2002 in #5148
- [Feature] enable guided decoding ENABLE_V1_KVCACHE_SCHEDULER = 1 by @ST-XX in #5140
- [Docs] add docs of base64 or local file mm inputs by @ApplEOFDiscord in #5193
- [Metrics] Update time_to_first_token to include tokenization & queue time, and remove redundant metrics by @liyonghua0910 in #4993
- [Docs] add request params by @LiqinruiG in #5207
- [Speculative Decoding]Fix attention mask offset by @freeliuzc in #5208
- 【BugFix】Fix logprob.slice_row inplace Error by @ckl117 in #5237
- [BugFix] fix prompt_token_ids is None in request dict in llm.generate by @kxz2002 in #5241
- [Fix] fix eplb noaux by @xiaoxiaohehe001 in #5239
- [BugFix]Fix attention mask bug in D-Node of PD-split mode by @freeliuzc in #5245
- [BugFix] BF16 MoE Cutlass Backend Support EP by @ckl117 in #5242
- [BugFix] fix vl performance bug by @kevincheng2 in #5181
- [Optimization] Refine row parallel bias and nranks and moe all_reduce by @yuanlehome in #5247
- [CI]【Hackathon 9th Sprint No.33】NO.33 add unit tests for functional module -part by @xunyoyo in #5056
- [Speculative Decoding] split draft_tokens into standalone post-processing path by @sunlei1024 in #5205
- [BugFix] fix mtp logprob bugs in chunk prefill by @Deleter-D in #5244
- [CI]【Hackathon 9th Sprint No.50】NO.50 add unit tests for fastdeploy/entrypoints/engine_client.py -part by @essos-bot in #5045
- [BugFix] fix cuda-python requirement by @yuanlehome in #5261
- [CI]【Hackathon 9th Sprint No.41】NO.41 add unit tests for functional module -part by @xunyoyo in #5062
- [PD Disaggregation] Add unittest for splitwise deployment with using rdma by @juncaipeng in #5189
- [BugFix][Metrics] Fix Prometheus Multiprocess Metrics Issues and Add ZMQ Communication Metrics by @fl0w2o48 in #5185
- [XPU] support kernel for mtp(base) by @cmcamdy in #4748
- [Docs] add qwen25-vl docs by @CSWYF3634076 in #5243
- [CI] disable test_engine_client.py unit test by @EmmonsCurse in #5272
- [CI] fix run batch unit test by @xiaolei373 in #4628
- [BugFix]fix v1 loader lm head fp32 by @ckl117 in #5270
- [CI] Fix test streaming with stop str by @EmmonsCurse in #5275
- [XPU][CI] Set pip index URL to Tsinghua mirror by @plusNew001 in #5277
- [Feature] support flash_mask_attention backend by @lizhenyun01 in #5134
- [CI][XPU] add pd disaggregation by @ddchenhao66 in #5179
- Revert "[CI]【Hackathon 9th Sprint No.33】NO.33 add unit tests for functional module" -part by @juncaipeng in #5286
- [BugFix] fix tsp o_proj bias add by @yuanlehome in #5284
- [BugFix] race condition [is_fetching] causing multiple fetch requests by @ST-XX in #5238
- [BugFix]Set default OMP_NUM_THREADS=3 and fix extra GPU memory usage in DeepSeek by @bukejiyu in #5219
- [Others] add PADDLE_ENFORCE by @zhoutianzi666 in #5288
- [OP]Remove extra H2D in DeepGemm. by @K11OntheBoat in #5262
- [Feature] add bos config check by @kevincheng2 in #5273
- [Others] clean code by @zhoutianzi666 in #5235
- [FDConfig] remove engine client args, use fd_config instead by @liyonghua0910 in #5217
- [Benchmark] Support random input by @ZhangYulongg in #5298
- [Intel HPU] change MoE weights and scales from list to tensor and add… by @fmiao2372 in #5289
- [APIServer] add_prompt_ids_test by @DDDivano in #5283
- [BugFix] fix aksk check bug by @kevincheng2 in #5295
- [BugFix] fix mm to_dict bug by @kevincheng2 in #5300
- [xpu] support mtp for xpu(mix) by @cmcamdy in #5274
- [Features] add audio request & fix embedding bug by @ming1753 in #5201
- [Deterministic] Move paddle version batch invariant pkg to Fastdeploy by @littledgg in #4763
- [Feature] support chunked moe by @Wanglongzhi2001 in #4575
- [XPU][CI]Change W4A8 Case Base Value by @plusNew001 in #5309
- [CI] Update build_docker to paddle_manylinux by @EmmonsCurse in #5226
- [CI] Remove need approve by yuanlehome by @yuanlehome in #5310
- [PD Disaggregation] support different tp_size for prefill and decode by @juncaipeng in #5296
- [XPU] fix gather_next_token by @cmcamdy in #5311
- [XPU][CI] Change XPU CI Base Value by @plusNew001 in #5318
- [Optimization] EP empty_input_forward Remove Communication by @ckl117 in #5254
- [CI]add clear to run-batch ci by @xiaolei373 in #5307
- [CI] disable test_chunked_moe.py in unit_test by @EmmonsCurse in #5322
- Revert "[CI]【Hackathon 9th Sprint No.41】NO.41 add unit tests for functional module -part" by @YuanRisheng in #5291
- Revert "[CI]【Hackathon 9th Sprint No.18】NO.18 add unit tests for functional module -part" by @YuanRisheng in #5290
- [LogProbs]Enable prompt logprobs output and modify data transmission method for the online interface. by @qwes5s5 in #5089
- [PD Disaggregation] Support PD deployment of DeepSeekv3. by @K11OntheBoat in #5251
- [Feature] support reward model by @lizexu123 in #5301
- [XPU]add enable_logprob by @qw86972190 in #5279
- [CI] Fix return_code check in test_chunked_moe.py by @EmmonsCurse in #5326
- [CI] Update test_docker to paddle_dev by @EmmonsCurse in #5278
- [XPU] [CI] Xpu Ci Refactor by @plusNew001 in #5252
- [UNITEST] add test by @zhoutianzi666 in #5305
- [Intel HPU] add example benchmark scripts for hpu by @fmiao2372 in #5304
- [Quantization] Support w4afp8 MoE dynamic quantization by @Sunny-bot1 in #5282
- [CI] Disable queue state assertion temporarily by @EmmonsCurse in #5329
- [CI] Add env ci by @Wanglongzhi2001 in #5331
- [CI] Allow occasional distributed worker exit_code by @EmmonsCurse in #5341
- [Optimization] supports mtp split_kv_attn, unified to append scenarios by @carryyu in #5343
- [CI] Add RD in env CI. by @Wanglongzhi2001 in #5345
- [Optimization]1.fix tp+ep moe_forward; 2.set max_prefill_batch=env.MAX_PREFILL_NUM by @carryyu in #5315
- [BugFix] Fix EP issue in the CUTLASS MoE backend by @Sunny-bot1 in #5337
- [CE]add wint4 ep by @tianlef in #5355
- [Optimization]1.fix tp+ep moe_forward; 2.set max_prefill_batch=env.MAX_PREFILL_NUM by @carryyu in #5353
- [bugfix]remove metrics middleware by @xiaolei373 in #5332
- [XPU] xpu support mm prefix cache by @ddchenhao66 in #5356
- [Feature] Guided Decoding add LLguidance backend by @ST-XX in #5124
- [Feature] support audio tts by @ming1753 in #5333
- [FIX BUG] fix bug in TP in permute_x_fp8_kernel by @zhoutianzi666 in #5350
- [BugFix] dynamic cache kv block_wise_fp8 not need create layer.cache_k_scale by @yuanlehome in #5362
- [Optimization] Requirements remove version for setuptools, uvicorn, triton and safetensors, del fastsafetensors by @Echo-Nie in #5330
- [BugFix] Fix issues related to data retrieval logic, parameter validation, and result serialization in both online and offline interfaces. by @qwes5s5 in #5335
- [Bug fix] fix pooling models by @lizexu123 in #5358
- [Intel HPU] fix memory fragmentation issue and fix moe all_reduce issue by @fmiao2372 in #5357
- [BugFix] Reduce timeout in unittest by @juncaipeng in #5366
- [Models] Add forward_meta to moe models' forward function by @Wanglongzhi2001 in #5138
- [PD Disaggregation] support DP via v1 router and decouple DP and EP by @liyonghua0910 in #5197
- [Docs] update FAQ with logprobs MQ limits and deprecation by @sunlei1024 in #5368
- [BugFix] Exit if neither modern nor legacy wheel dir is found by @SigureMo in #5367
- [FUCK] remove fastsafetensors by @yuanlehome in #5371
- [RL] [BugFix] update check_model_weights_status loop by @liyonghua0910 in #5249
- [Feature] Support cache kv cache for output tokens by @rainyfly in #4535
- [BugFix] fix get_request from scheduler by @juncaipeng in #5369
- [CI] disable test_schedule_output.py in unit_test by @EmmonsCurse in #5377
- [Loader]Adapting DeepSeek weights for PyTorch loading. by @bukejiyu in #5373
- [XPU] [Optimization] [EP] EP communication optimization. by @zccjjj in #5145
- [BugFix] Compatible with asynchronous functions by @ming1753 in #5378
- [XPU] support XDNN downloading function by @cqulilujia in #5365
- [Intel HPU] fix bug about RP 5138 by @fmiao2372 in #5380
- [XPU] [CI] Change Paddle Version to Nightly by @plusNew001 in #5346
- [XPU] bug fix block attn in mix mtp by @cmcamdy in #5384
- [BugFix] Fix flash_attn_backend by @lizhenyun01 in #5387
- [BugFix] Fix the issue of redundant logging for certain events in the trace_logger by @qwes5s5 in #5386
- [Feature] support Two batch overlap, mainly used in Prefill by @zhoutianzi666 in #5078
- [XPU] redirect xvllm/xtdk/xhpc downloading log by @cqulilujia in #5388
- [XPU] support moe_expert_ffn TGEMM selection by @cqulilujia in #5375
- [Optimization] Qwen2.5-VL support multi-batch prefill by @aquagull in #5269
- [BugFix] fix scheduler hang when input length is very close to max_model_len by @liyonghua0910 in #5393
- [XPU] support ep4tp1+v1 loader by @zccjjj in #5398
- [BugFix] fix async download bug by @kevincheng2 in #5349
- [BugFix] fix mtp prefix_cache dy-c8 bug by @kevincheng2 in #5390
- [BugFix] Fix plugin loading logic and logging messages by @wangyuwen1999 in #4909
- [BugFix] fix top_p_candidates by @Deleter-D in #5400
- [Reverted][RL] Support Rollout Routing Replay by @gongshaotian in #5321
- [Bug fix] Fix the multi-input accuracy issue in the pooling model. by @lizexu123 in #5374
- [Others]remove _execute_empty_input by @zhoutianzi666 in #5396
- Revert "[RL] Support Rollout Routing Replay" by @Jiang-Jia-Jun in #5402
- [Cherry-Pick][Loader] fix DeepSeek torch loading (#5410), fix bf16 DeepSeek (#5379), and adapt DeepSeek weights for PyTorch loading by @bukejiyu in #5411
- [Cherry-Pick][New][RL] Support Rollout Routing Replay (#5405) by @gongshaotian in #5408
- [Cherry-Pick][Loader][BugFix] Fix some parameters place on CPU in PaddleOCR-VL (#5413) by @SigureMo in #5414
- [BugFix][Cherry-Pick] fix can not enter into cuda graph by @zhoutianzi666 in #5423
- [Cherry-Pick] [BugFix] [RL] remove shutdown_process_group/restart_process_group for RL (#5433) by @liyonghua0910 in #5434
- [Cherry-Pick][BugFix] 0 not into cuda graph to save memory (#5426) by @zhoutianzi666 in #5432
- [NewFeature] support dynamic load for normal by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/5437
- [Cherry-Pick][Optimization] compute real max_logprobs in batch (#5430) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5448
- [Cherry-Pick] allow 0-dim tensor into ar by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5452
- [BugFix] fix limit_thinking bug by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5469
- [Cherry-Pick][CI] Fix attention bug in spec decoding(#5460) by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5481
- [Cherry-Pick][CI] ep+prefix cache+chunk prefill(#5489) by @zccjjj in https://github.com/PaddlePaddle/FastDeploy/pull/5490
- [Cherry-Pick] [BugFix] fix instability after clearing weight (#5493) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5487
- [Cherry-Pick][RL] Fix RL weight loading issue in moe layer #5503 by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5505
- [Cherry-Pick][BugFix] fix hang when n>1 and --enable-logprob (#5492)(#5499) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5498
- [Cherry-Pick] [BugFix] [RL] skip model executing after clearing/updating is done (#5527) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5523
- [Cherry-Pick][Feature][Optimization] Qwen Dynamic C8(#5486) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5536
- [Bug Fix][Cherry-pick] Fix bug for caching output when preempted(#5502) by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5510
- [Cherry-Pick][BugFix] fix dynamic c8 in v1 loader(#5562) by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5519
- [NewFeature] support load fp8 weight by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/5566
- [Cherry-Pick][CI] Adapt unit_test due to incompatibility change(#5578) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5583
- [Cherry-Pick][RL] R3 Support RDMA Store(#5467) by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/5468
- [Cherry-Pick][CI] Support different inferseed in speculate decoding(#5568) by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5597
- [Cherry-Pick][Feature] Add a switch for logprobs/prompt_logprobs token decoding.(#5463) by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/5572
- [Cherry-Pick][CI] Fix write qknorm cache bug in speculative decoding(#5491) by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5617
- [Cherry-Pick] Support for request-level speculative decoding metrics monitoring.(#5518) by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5614
- [Cherry-Pick][Others] Maintain the mtp branch temporarily. (#5446) by @carryyu in https://github.com/PaddlePaddle/FastDeploy/pull/5621
- [Model] tp+ep support v1_loader by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/5600
- [Cherry-Pick][BugFix] fix speculate_limit_thinking_content_length #5590 by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5615
- [Cherry-Pick][RL] Support loading weights via the load_weights function for RL #5549 by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5602
- [Cherry-Pick][BugFix] fix rl model_weights_signal to support tp>1 #5639 by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5637
- [Cherry-Pick][RL] Fix RL load_weights #5642 by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5643
- [Cherry-Pick][BugFix] cp fix_cpu_cache_bugs(#5544) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5577
- [Cherry-Pick][BugFix] fix rl model_weights_signal to support tp>1 by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5650
- [Cherry-Pick][XPU] logprob bug #5626 by @qw86972190 in https://github.com/PaddlePaddle/FastDeploy/pull/5636
- [Cherry-Pick][BugFix] Cp fix eb5 prefix cache(#5638) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5644
- [Cherry-Pick][Others] Prevent core dumps during Paddle version check #5657 by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5659
- [Cherry-Pick][BugFix] Fix custom_all_reduce overflow (#5662) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5667
- [Cherry-Pick] [RL] provide options for whether shutdown comm group after weights cleared (#5663) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5664
- [Cherry-Pick][BugFix] fix rl signal #5681 by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5678
- [Cherry-Pick][XPU] Set top_p=0.0 by default on XPU to optimize performance(#5686) by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5688
- [Cherry-Pick][CI] Support multi-step mtp with cudagraph (#5624) by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5670
- [Cherry-Pick] [BugFix] fix double shutdown of comm group when rank0 clears weights slower than other ranks (#5715) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5710
- [Cherry-Pick][CI] Revert adapt vl_model baseline changes due to Paddle update(#5732) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5733
- [Cherry-Pick][Feature] Entropy calculation support #5692 by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5731
- [Cherry-Pick][BugFix] Fix Chunked Prefill when max_tokens=1(#5736) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5747
- [Cherry-Pick][CI] Refactor RL tests to reuse upload_clear(#5741) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5755
- [BugFix][Cherry-pick] Set enable_cache_output as false by default(#5751) by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5752
- [Cherry-Pick][Others] upgrade paddleformer to 0.4.0 #5599 by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5716
- [Cherry-Pick][Loader] Fix bug in MTP weight loading #5744 by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5745
- [Cherry-Pick] support FA3 in mixed mode and support Qwen3 rope by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5655
- [BugFix][Cherry-Pick] cp fix logprob bug(#5604) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5770
- [FDConfig][Cherry-Pick] Cp disable mm chunked(#5774) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5775
- [BugFix][Cherry-pick] Fix preemption out of real_bsz(#5805) by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5806
- [Cherry-Pick] Fix process_response_dict to support async in serving_completion (#5758) by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/5802
- [Cherry-Pick] Support flexible model by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/5749
- [Cherry-Pick][BugFix] Fix _disable_sequence_parallel_moe_if_needed #5740 by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5811
- [Cherry-Pick][Feature] support glm fa3 (#5586) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5810
- [Cherry-Pick] [BugFix] fix shm opened but not closed in set_data_ipc (#5826) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5827
- [Cherry-Pick][RL] add lm_head_fp32 in RolloutModelConfig(#5825) by @tianhaodongbd in https://github.com/PaddlePaddle/FastDeploy/pull/5824
- [Cherry-Pick][BugFix] Fix entropy bugs (#5818) by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5819
- [BugFix][Cherry-Pick] eb5 mm skip prefix cache(#5838) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5839
- [Cherry-Pick][Speculative Decoding] Optimize draft logprob (#5842) by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5843
- [Cherry-Pick] [BugFix] fix cache manager not launched in case of mtp or blockwise fp8 (#5840) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5841
- [Cherry-Pick][BugFix] cp skip_mm_revert(#5848) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5849
- [Cherry-Pick][Optimization] Reduce gather_logprob memory usage by 10GB (#5817)(#5846) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5834
- [Cherry-Pick][XPU] MAX_BSZ aligns gpu settings and disable prefix cache in OCR VL (#5831) by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5845
- [XPU][CI] Release CI update by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5687
- [Cherry-Pick][CI] Fix archive URL injection and add retry(#5725,#5828) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5832
- [Cherry-Pick][APIServer][Feature] Add configurable worker health check timeout via FD_WORKER_ALIVE_TIMEOUT(#5865) by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/5867
- [Cherry-Pick][RL] Change 'model' to the instance variable 'tmp_model'(#5872) by @tianhaodongbd in https://github.com/PaddlePaddle/FastDeploy/pull/5873
- [Cherry-Pick][BugFix] support fa3 qwen-vl rope (#5869) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5877
- [BugFix] Fix speculate metrics bug by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5875
- [Cherry-Pick][CI] Fix attn_mask_offset for multi-step MTP in mixed and PD-split modes(#5738) by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5793
- [Cherry-Pick][OPs] ep_moe_expert_dispatch.cu dispatch num_experts_per_rank 5 by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5889
- [Cherry-Pick] [KVCache] launch cache transfer processes only if hierarchical cache or kv cache storage is enabled (#5871) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5859
- [Cherry-Pick] [BugFix] fix mtp cache attaching for pd disaggregation (#5884) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5885
- [BugFix] fix model weight signal tensor num by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/5899
- [Cherry-Pick][XPU] Support ZMQ logprobs(#5628) by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/5852
- [Feature] Add a global toggle for automatic injection of trace_id and span_id in logs by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/5765
- [BugFix][Cherry-Pick] Cp fix eb5 prefix cache(#5879) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5881
- [Cherry-Pick][CI] Support multi-step mtp with cudagraph(#5886) by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5898
- [Cherry-Pick][XPU][CI] Add logprobs Case by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5907
- [Cherry-Pick] [BugFix] fix mtp split kv attention by @lizhenyun01 in https://github.com/PaddlePaddle/FastDeploy/pull/5921
- [Optim][Cherry-pick] Reduce preemption occurrence when blocks not enough(#5696) by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5808
- [Cherry-Pick][Bugfix] Fix mtp logprob hang problem when stop_seq is included (#5927) by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5928
- [CI] Lock paddlepaddle-gpu==3.3.0 in release/2.4 by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5944
- [BugFix] fix xpu import set_data_ipc by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5945
- [Cherry-Pick][Bugfix] Fix entropy calculation bugs (#5941) by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5942
- [Cherry-Pick][BugFix] Fix misleading logging in worker_process for request counting (#5939) by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/5953
- [BugFix][Cherry-Pick] cp fix dyc8 cache bug(#5958) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5959
- [Cherry-Pick] support lastnorm gather/split in release/2.4 by @xiaoluomi in https://github.com/PaddlePaddle/FastDeploy/pull/5925
- [Cherry-Pick][Speculative Decoding] Return accepted tokens per head in response (#5947) by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5952
- [CI] Align PaddlePaddle version to latest due to tag change by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5971
- [Cherry-Pick] fix mtp forward_meta in release/2.4 by @xiaoluomi in https://github.com/PaddlePaddle/FastDeploy/pull/5977
New Contributors
- @playaswd made their first contribution in #4848
- @WintersMontagne10335 made their first contribution in #5011
- @fl0w2o48 made their first contribution in #5185
- @wangyuwen1999 made their first contribution in #4909
Full Changelog: v2.3.3...v2.4.0