Release v2.4.0 · PaddlePaddle/FastDeploy

核心推理能力与模型支持增强

支持文本 prompt_logprob 及全量 logprob 能力 #4769
支持离线推理中基于 ZMQ 的 logprobs / prompt_logprobs，并引入 max_logprobs 参数 #4897
支持在线推理中基于 ZMQ 的 logprobs / prompt_logprobs，并优化通信方式 #5089
新增 logprobs / prompt_logprobs 的 token_id 解码控制开关 #5463
受限解码新增 llguidance 后端 #5124
CUDAGraph 支持投机解码 Draft Model 加速(默认关闭)
[Speculative Decoding] 解耦 draft_tokens 后处理流程 #5205
支持 Pooling 模型 Runner
支持 Reward 模型
Pooling 模型通用 embedding 接口 #4344
Pooling 模型定制 reward 接口 #4518
新增开源模型 Ernie-4.5-VL-28B-A3B-Thinking 的 reasoning_parser，兼容 - / _ 命名规则 #4571 #4668
支持通过 chat_template_kwargs.options.thinking_mode 控制思考开关
支持多模模型传入 prompt_token_ids 请求，并通过 messages 输入多模数据，实现 tokens-in / tokens-out 能力

并行架构、调度与 MoE 能力演进

GLM / Qwen 模型消除 EP 空跑时的通信开销 #5254
支持 MoE 分 chunk 执行 #4575
支持 EPLB（Expert Load Balancing）#4782
支持 EPLB 重排与冗余专家策略 #5142 #5143 #5178 #5239 #5918
支持路由重放机制
PD 分离支持 Deepseek V3 模型 EP 并行部署 #5251
PD 分离支持 Qwen3-MoE 模型 EP 并行部署 #4691
PD 分离支持 Prefill 与 Decode 使用不同 TP Size #5296
新增 Python 版本 Router，支持集中式与分离式部署调度 #4709
支持多步 MTP + CUDAGraph + PD 分离
支持 MTP 无损验证
支持 MTP 分 chunk #5343

多模态、缓存与量化能力增强

支持多模单 batch、纯文本多 batch 混合 Prefill 调度 #4611
支持多模 Prefix Cache #4803
动态量化支持 Prefix Cache #5125
修复并支持多模 Prefix Cache 与 CUDAGraph 同时开启 #4679
支持 W4AFP8 动态量化 #5282
支持静态 C8 scale 单独加载 #4624
完善 Machete 对不同量化 group size 的支持 #4911
支持 Flash Mask Attention Backend 接入 #5104 #5134 #5387
v1 Loader 加载性能优化 #4532
支持预编译包功能 #4729

多硬件平台支持扩展

P800

支持多模 Prefix Cache #5356
支持 PD 分离 #5179
支持思考模型思考强度限制 #4761
支持 TP + EP 并行 #4688 #4836

Intel HPU

新增 Prefix Caching 支持 #4971
新增 Chunked Prefill 支持 #5289

Iluvatar GPU

支持 ERNIE-4.5-21B-A3B 与 ERNIE-4.5-VL-28B-A3B-Thinking #4774 #4995
修复多项 CI 问题 #4972 #5012 #5100

MetaX

支持 ERNIE-4.5-VL-28B #4820
新增 Cutlass MoE #4602 #4685 #5128
支持 default_v1 loader #4956 #5001
优化 Flash MLA 性能 #4915
新增 Triton MoE 的 default_v1 loader 与 quant_config #5030
支持 ENABLE_V1_KVCACHE_SCHEDULER #5163

性能优化、可观测性与稳定性修复

性能与通信优化

AppendAttn 算子支持 CUDA-PDL #5072
DeepGemm H2D 消除 #5262
优化集中式 EP 通信逻辑 #5145
移除 CUDA Graph 下 Append Attention 的 DtoH 同步开销
支持两阶段低时延通信 #4162
支持 TP + EP 混合并行 #4615 #5315 #5353
默认编译 RDMA，降低多模 CUDAGraph 开销

可观测性与安全

支持基于请求级别的细粒度链路追踪 #5458
添加 trace_id / span_id 自动注入与开关 #4692 #5765
新增 --api-key 权限校验参数 #4806

稳定性与 Bug 修复

修复 logprob / prompt_logprob 计算、序列化及通信相关问题 #4681 #4884 #5237 #5335
修复 EP、PD 分离、MTP、Prefix Cache、量化、多模态等多类推理场景下的稳定性问题
修复多硬件（XPU / MetaX / Luvatar / P800）算子与参数校验问题

What's Changed

[BugFix] fix total_block_num init error in worker_process by @RichardWooSJTU in #4553
[BugFix] Fix graph opt test case by @gongshaotian in #4634
[Feature] add mm token usage by @ApplEOFDiscord in #4570
[XPU] Update the return value of TextImageGatherScatter by @ddchenhao66 in #4636
[Docs] Add PaddleOCR-VL-0.9B best practices by @ming1753 in #4658
[XPU] fix pos_emb_type bug by @cqulilujia in #4638
[Docs] add Qwen25vl yaml by @xjkmfa in #4662
[Feature] add a new reasoning parser by @kxz2002 in #4571
[XPU] [CI] Increase pytest timeout for XPU ep test by @plusNew001 in #4665
add noaux_tc to unitest fused_moe by @zhoutianzi666 in #4656
[EP] fix several bugs in data parallel by @ltd0924 in #4657
[OP] Add InferShape&InferDtype for per_token_quant_padding by @DrRyanHuang in #4667
【Hackathon 9th No.86】autogen MoeFastHardamardImplWrapper template_instantiation by @ccsuzzh in #4592
[UT] Add ut for speculative sampler by @Deleter-D in #4650
[Doc] update docs by @ApplEOFDiscord in #4675
[Graph Optimization] Add the CUDAGraph usage switch for Draft Model by @gongshaotian in #4601
[CI] Add test for paddleocr_vl by @Limerances in #4627
[unitest]add real gate_correction_bias weight to mock real data dispatch by @zhoutianzi666 in #4676
[noauxtc_kernel] remove useless code by @zhoutianzi666 in #4643
[BugFix] fix offline llm chat "enable_thinking" is always "False" by @kxz2002 in #4686
[BugFix] fix total_block_num init error in worker_process and test_async_llm not throw error by @xyxinyang in #4687
[BugFix] fix --logprobs-mode raw_logits by @ckl117 in #4681
[XPU] xpu currently disable prefix cache for VL model by @ddchenhao66 in #4695
[XPU] [CI] Add Vl case by @plusNew001 in #4649
[BugFix] Fix finish reason in _create_chat_completion_choice by @kxz2002 in #4582
[Feature] Unify the registration name recognition for tool_parser and reasoning_parser to “-” by @kxz2002 in #4668
[BugFix] fix unittest of get_save_output_v1 by @Wanglongzhi2001 in #4701
[XPU] [CI] Lock xvllm version by @plusNew001 in #4715
[Graph Optimization] SOT+CUDAGraph support ERNIE4.5T VL 28B / 424B by @DrRyanHuang in #4645
[Feature] support mtp distribution equivalence verification by @Deleter-D in #4699
[KVCache] Support kv cache scale load by @Sunny-bot1 in #4624
add flops and bandwidth to test_ffn.py by @zhoutianzi666 in #4704
benchmark工具支持受限解码场景指定response_format by @ophilia-lee in #4718
[CI] add missing unit tests for tokenizer_cli by @xiaolei373 in #4620
[Scheduler] update v1 prefill batch by @kevincheng2 in #4611
[BugFix] Fix profile run in pd-disaggregated deployment by @liyonghua0910 in #4584
[BugFix] fix mm prefix_cache cuda error bug by @kevincheng2 in #4679
[Feature] Check bos url by @kevincheng2 in #4711
[BugFix] fix wint2 config by @chang-wenbin in #4721
[FDConfig] [PD Disaggregation] [Graph Optimization] Close Cudagraph for P node when PD Disaggregation by @littledgg in #4632
[XPU] xpu support neox style ROPE by @ddchenhao66 in #4719
[BugFix] Skip building native architecture when specifying arch list by @ming1753 in #4727
fix noaux by @zhoutianzi666 in #4731
[BugFix] fix thinking bug by @yuanlehome in #4710
[CI] Fix rollout_model test logic by @EmmonsCurse in #4730
[Feature] support pooling model runner by @lizexu123 in #4590
format code by @zhoutianzi666 in #4720
[CI] fix some ci yaml by @EmmonsCurse in #4747
[Docs]Update XPU document version to 2.3.0 by @yyssys in #4741
[Speculative Decoding][MTP]Support mtp in splitewise and scheduler_v1 mode by @freeliuzc in #4743
[Speculative Decoding][MTP]Support attn mask offset by @freeliuzc in #4641
[Docs]Add parameter to the start service command by @yyssys in #4753
[Docs]Add parameter by @yyssys in #4755
[Docs] fix PaddleOCR-VL docs bug by @ming1753 in #4702
[Feature] Support eplb for fd by @rainyfly in #4599
[XPU] add v1 support for bf16 by @iosmers in #4744
【DataProcessor】add options thinking_mode by @luukunn in #4735
[Optimize] Support and robust for tpN for PD by @rainyfly in #4595
[Docs] fix error by @yyssys in #4768
[CI]test common model by @bukejiyu in #4697
[Metax] adapt cutlass moe for ernie-vl by @neilzhuu in #4685
fix dynamic Cfp8 for RL load by @rsmallblue in #4144
[Docs] PaddleOCR-VL add RTX3060 server param by @ming1753 in #4765
[BugFix] fix deepseek cuda error by @kevincheng2 in #4739
[XPU][CI] fix ci base value bug by @plusNew001 in #4783
[OP]Fix attn_params by @freeliuzc in #4787
[CI]delete test_common_model by @bukejiyu in #4794
[XPU] fix thinking bug where output only contains reasoning_content by @ddchenhao66 in #4761
[XPU] add deployment doc for PaddleOCR-VL in XPU by @cqulilujia in #4784
[BugFix] Fix ernie4_5_vl_processor.py and qwen_vl_processor.py can not disable thinking by @kxz2002 in #4762
supports internode_ll_two_stage by @carryyu in #4162
supports pd partn by @carryyu in #4615
[Docs] Add new support models by @ming1753 in #4801
[CI] Refactor CE wheel upload for multiple target paths by @EmmonsCurse in #4790
[Docx] update mkdocs.yml by @yangjianfengo1 in #4804
[BugFix] Fix step_shm_value in PD disaggregated deployment by @liyonghua0910 in #4780
Update Unit Test for PaddleOCR-VL by @Limerances in #4802
[Metax] adapt cutlass moe and fix mla attention for DeepSeek by @xiaozude in #4602
[Feature][Executor] GPU Model Runner Supports prompt_logprobs and max_logprobs by @ckl117 in #4769
[get_padding_offset.] clean get_padding_offset.cu by @zhoutianzi666 in #4777
support ep+tp at op layer by @zhupengyang in #4688
[BugFix] fix reasoning parser register name by @kxz2002 in #4795
remove input_ids from ForwardMeta by @zhoutianzi666 in #4793
[Feature] Add timestamp for profiler by @rainyfly in #4726
[XPU]Support V1 loader in weight_only Model by @iosmers in #4808
[Bug Fix] process transparent image by @ApplEOFDiscord in #4807
add paddleocr_vl benchmark by @zhang-prog in #4833
[Doc] Update docs for v2.3.0rc0 by @Jiang-Jia-Jun in #4828
[BugFix] fix messages being inplace modified in offline chat api by @liyonghua0910 in #4831
【New Feature】W4afp8 supports per group quantization by @yangjianfengo1 in #4272
[CI] fix docker_build error and add tag-base by @EmmonsCurse in #4810
[PD Disaggregation] Support Qwen3-MoE use PD + EP inference. by @K11OntheBoat in #4691
remove seq_lens_this_time by @zhoutianzi666 in #4821
[BugFix] Fix ernie_vl_reasoning_parsers.py 'end_token' to 'think_end_token' by @kxz2002 in #4805
Fix: ci port conflict by @sunlei1024 in #4840
[CI] Add unittest for activation, native_paddle_backend, w4a8, w4afp8, platforms/utils by @Echo-Nie in #4812
[XPU][CI]Change ci vl model to 28 b by @plusNew001 in #4764
[Fix] fix ernie4_5_vl model torch format loadding by @aquagull in #4447
[Feature] [PD] add simple router and refine splitwise deployment by @juncaipeng in #4709
[Docs] fix: correct typo in nvidia_gpu.md by @playaswd in #4848
[BugFix] Fix list to List by @Echo-Nie in #4818
[BugFix] Del get_act_fn, _load_st_projector by @Echo-Nie in #4824
[Benchmark] Enhance benchmark output logging by @ZhangYulongg in #4682
[XPU] ep+tp all2all by @zhupengyang in #4836
[CI] Add Check PR Template by @EmmonsCurse in #4481
Revert "【New Feature】W4afp8 supports per group quantization" by @EmmonsCurse in #4854
[CI] Update deploy.py by @ZhangYulongg in #4850
[CI] Optimize port cleanup logic by @EmmonsCurse in #4860
[Bug Fix] fix ernie4_5_vl_moe by @LokeZhou in #4843
Revert "[Bug Fix] fix ernie4_5_vl_moe" by @Jiang-Jia-Jun in #4863
[Feature] support mm disable_chunked by @kevincheng2 in #4803
[CI] Update ERNIE-4.5-VL baseline to adapt to MoE changes by @EmmonsCurse in #4867
[CI] Refactor check-bypass logic in run_tests_with_coverage by @EmmonsCurse in #4655
[Others] Delete PaddleOCR Useless Function by @Limerances in #4815
[Feature] Optim PaddleOCR-VL by @ming1753 in #4873
[XPU] fix ep_tp all2all ci by @zhupengyang in #4876
[XPU] modify 424B model deployment parameter by @ddchenhao66 in #4888
[XPU][CI] Ci bug fix by @plusNew001 in #4889
[BugFix] fix token_processor zmq by @ckl117 in #4827
[CI] fix docker_build error of ciuse by @EmmonsCurse in #4886
[Metax] support ERNIE-4.5-VL-28B by @neilzhuu in #4820
[BugFix] max_lgprobes=-1 maps to ori_vocab_size by @ckl117 in #4884
[Feature] Enable FastDeploy to support adding the “--api-key” authentication parameter. by @kxz2002 in #4806
[Docs]Supplement the English and Chinese user documentation for Tool calling by @AuferGachet in #4895
[XPU][CI]Update test assertion and base response value by @plusNew001 in #4907
[BugFix] When the value of "temperature" is 0, adjust it to 1e-06 by @luukunn in #4900
[Docs] add api-key usage instructions by @LiqinruiG in #4902
[CI] Add four unittest by @Echo-Nie in #4906
[Bug Fix] fix bug for PD EP by @rainyfly in #4823
[DeepEP] support async prefill by @zhoutianzi666 in #4899
[XPU]Update documentation by @qw86972190 in #4917
[Docs] Improve reasoning_out docs by @LiqinruiG in #4901
[BugFix] Fix inference_start_time by @kxz2002 in #4922
[BugFix] Add support for weight shape constraints and group size selection in Machete by @Sunny-bot1 in #4911
[XPU] [CI]Change CI to multi-concurrency by @plusNew001 in #4866
[Docs] add doc for glm by @ckl117 in #4933
[Opti] Unlimit zmq message lens limit by @rainyfly in #4465
[TSP] Support qwen3 moe tsp + cudagraph by @yuanlehome in #4871
Update docs for v2.3.0 by @yangjianfengo1 in #4938
[Docs] add ERNIE-4.5-VL-28B-A3B-Thinking instruction by @LiqinruiG in #4937
[BugFix][Models] Add tie_word_embeddings for lmhead by @DrRyanHuang in #4916
[Iluvatar] add vl into ci and support v1 loader by @wuyujiji in #4774
[Docs] add ERNIE-4.5-VL-28B-A3B-Thinking instruction by @LiqinruiG in #4944
[XPU][Doc]Update XPU release2.3 note by @iosmers in #4939
[XPU] fix xpu deployment md by @cqulilujia in #4941
[CI][XPU]Update run_ci_xpu.sh to lock paddlepaddle-xpu version by @plusNew001 in #4949
[Perf] Support tensor transmission between work and engine with zero-copy to improve efficiency by @sunlei1024 in #4839
[PD Disaggregation]Replace paddle.max by numpy to avoid useless error log by @K11OntheBoat in #4893
[CI] Update test_api_key.py by @kxz2002 in #4948
[Others] Add Tests for GPU Model Runner and Logprobs Output by @ckl117 in #4913
[Iluvatar][Doc] Add ERNIE-4.5-VL-28B-A3B-Thinking doc by @wuyujiji in #4955
[ATTENTION] by @zhoutianzi666 in #4945
[CI][XPU]Update health check endpoint to use port variable by @plusNew001 in #4965
[CI] fix apt_sources error of focal in docker_build by @EmmonsCurse in #4961
[Loader] Refactor PT model loading by @bukejiyu in #4532
[CI][XPU] Change Paddle Version to Nightly by @plusNew001 in #4973
[CI] Add five unittest by @Echo-Nie in #4958
[Docs] Add License in Unittest by @Echo-Nie in #4957
[CI] remove useless tests in docker_build by @EmmonsCurse in #4974
[CI] Update PORT range to avoid conflict with system ports by @EmmonsCurse in #4953
[Benchmark] Add GEMM & MoE kernel bench by @Sunny-bot1 in #4809
[Iluvatar][CI] fix safetensors_rust.SafetensorError: framework paddle… by @wuyujiji in #4972
[KVCache] support unified cache backend by @ltd0924 in #4903
[loader]Update requirements and xpu ci by @bukejiyu in #4969
[CI][XPU] Fix EP Case Bug by @plusNew001 in #4976
[BugFix] Avoid loading training file by @BossPi in #4966
[Metax] support default_v1 loader & thinking model by @StareAtYou in #4956
[Metax] optimize flash mla by @xiaozude in #4915
[Docs] remove load default_v1 since already been as default by @zoooo0820 in #4980
[XPU] fix text_image_gather_scatter op by @cqulilujia in #4882
[Logprobs]Support prompt_logprobs and max_logprobs by @qwes5s5 in #4897
[CI] fix test_model_cache by @bukejiyu in #4982
[BugFix] fix VL fp8 bug when moe token_num is 0 by @ming1753 in #4928
[BugFix] Fix mtp tsp by @yuanlehome in #4990
[CI] set DG_NVCC_OVERRIDE_CPP_STANDARD in test_quantized_linear by @EmmonsCurse in #4995
[FDConfig] add block number verfied by @ltd0924 in #4983
[Optimization] Skip memcpy(DtoH) capture in get_block_shape_and_split_kv_block by @Sunny-bot1 in #4988
[BugFix] fix num_requests_running after clear_data by @liyonghua0910 in #4927
[worker_process.py]modify some var name by @zhoutianzi666 in #4749
[Loader]Fix and complete the MTP loader by @bukejiyu in #4985
[XPU] [CI] Change CI ep test from offline to online by @zccjjj in #4885
[BugFix][Metax] Fix metax compile issue in get_block_shape_and_split_kv_block by @Sunny-bot1 in #5000
[Feature] Enhance build script, add pre_wheel logic by @Echo-Nie in #4729
【New Feature】W4afp8 supports per group quantization by @yangjianfengo1 in #4987
optimize dy_cfp8's performance by @carryyu in #4126
[BugFix] adjust max_tokens and min_tokens when continue to generate tokens by @kxz2002 in #5010
[PD Disaggregation] remove splitwise deployment on single node and refine the code by @juncaipeng in #4891
[CI]【Hackathon 9th Sprint No.56】NO.56 功能模块 fastdeploy/multimodal/utils.py 单测补充 by @essos-bot in #4954
[Docs] Fix broken commitID by @Echo-Nie in #5008
[CI] Temporarily lock paddlepaddle-gpu as of 20251112 by @EmmonsCurse in #5017
[ATTENTION] unitest by @zhoutianzi666 in #4962
[Executor]move batch_id_per_token by @zhoutianzi666 in #4853
[BugFix] Revert skip capture by @Sunny-bot1 in #5023
[Others] check args max_logprobs by @ckl117 in #5018
[CI]【Hackathon 9th Sprint No.32】NO.32 功能模块 fastdeploy/input/ernie4_5_vl_processor/process_video.py 单测补充 by @WintersMontagne10335 in #5011
[Optimization] xgrammar async compile, multi thread, speed up by @ST-XX in #4835
[CI][XPU] Optimize CI logs and variable names by @plusNew001 in #5025
[Intel HPU] enable level 1 prefix caching and fix some bugs by @fmiao2372 in #4971
[Iluvatar][CI] Fix moe_expert_dispatch cannot support dequant_scale by @wuyujiji in #5012
【Fix】fix deepep dispatch by @yangjianfengo1 in #5036
[Metax] support default_v1 loader and quant_config is None for triton… by @xiaozude in #5030
[APIServer] metrics use port the same as api_port by @xyxinyang in #5016
[Log] Add trace log and add loggingInstrumentor tool by @qwes5s5 in #4692
[CI]【Hackathon 9th Sprint No.13】NO.13 功能模块 fastdeploy/model_executor/ops/triton_ops/triton_utils.py 单测补充 by @WintersMontagne10335 in #5035
【Hackathon 9th No.109】[CppExtension] Support build Custom OP in setuptools 80+ -part by @megemini in #4977
[CI]【Hackathon 9th Sprint No.28】NO.28 功能模块 fastdeploy/model_executor/ops/triton_ops/triton_utils_v2.py 单测补充 by @WintersMontagne10335 in #5073
[BugFix] rollback max_tokens and min_tokens when continue to infer by @LiqinruiG in #5052
[Intel HPU] fix bugs caused by other commits by @fmiao2372 in #5074
[XPU][CI] fix ci case bug by @plusNew001 in #5084
Revert "[BugFix] Revert skip capture" by @Sunny-bot1 in #5080
[Fix] Fix block allocation issue when MTP and logprobs are enabled by @sunlei1024 in #5077
revert group size 3 by @zhoutianzi666 in #5079
[INTEL_HPU] enabled fastdeploy PR testing by @FocusLuo in #4596
[Feature][OP] Append Attn Support CUDA-PDL by @ckl117 in #5072
【Hackathon 9th No.76】supplementary unit test for XGrammarChecker by @Echo-Nie in #4075
[CI] Enable check_pr_template in CI rerun by @EmmonsCurse in #5093
[Metax] support default_v1 loader based #4988 by @StareAtYou in #5001
[Iluvatar][CI] disable compiling cudaLaunch API by @wuyujiji in #5100
Revert "[CI] Temporarily lock paddlepaddle-gpu as of 20251112" by @EmmonsCurse in #5098
[OP] format flash_mask_attn by @lizhenyun01 in #5104
[unitest]clean code by @zhoutianzi666 in #5094
[Docs]fix_cli_docs by @xiaolei373 in #5109
[BugFix] unify max_tokens by @kxz2002 in #4968
[HPU][CI]Update Docker image in CI workflow by @plusNew001 in #5108
[PD Disaggregation]Fix dummy run when use PD Disaggregation with EP inference. by @K11OntheBoat in #5112
[Feature] ThreadPoolExecutor async fill_token_bitmask by @ST-XX in #5083
[XPU][Docs]Update document by @qw86972190 in #5091
[CI]【Hackathon 9th Sprint No.31】NO.31 功能模块 fastdeploy/input/ernie4_5_processor.py 单测补充 by @WintersMontagne10335 in #5097
[RL]Resolve shape mismatch problems in RL-related modules by @bukejiyu in #5032
[CI]Exclude abstract methods and irrelevant backend files by @EmmonsCurse in #5031
[CI] add metrics case by @ZhangYulongg in #5115
【Hackathon 9th No.109】[CppExtension] [XPU] Support build Custom OP in setuptools 80+ -part by @megemini in #5106
[Docs] add ebvlthinking yaml by @tianlef in #5120
[Metax][BugFix] Fix METAX_GPU OPs Compile Error by @ckl117 in #5114
[Feature] Add an unquantized option for MoE and Dense quant type by @Sunny-bot1 in #4813
[BugFix] rollback max_tokens and min_tokens when continue to infer by @LiqinruiG in #5082
[CI] Add workflow to auto-remove skip-ci labels after new commits by @EmmonsCurse in #5129
[BugFix] Support skipping activation scale loading for w4afp8 by @Sunny-bot1 in #5117
[Feature] support async download features by @kevincheng2 in #5003
[CI] Temporarily lock paddlepaddle-gpu as of 20251118 by @EmmonsCurse in #5136
[HPU][CI]Hpu ci update by @plusNew001 in #5116
[Speculative Decoding][MTP]Support stop_seqs and pd-split mode by @freeliuzc in #5029
[Metax] optimize cutlass moe and flash attention backend by @neilzhuu in #5128
[Scheduler] Support chunk prefill for video input by @yangjianfengo1 in #5107
[Others]get_block_shape_and_split_kv_block clean code by @zhoutianzi666 in #5123
[Optimization] default compile rdma, reduce cudagraph buffer size in mm, fix some config bug by @yuanlehome in #5121
[Others] clean code by @zhoutianzi666 in #5133
[CI][XPU] Add XPU chunked_prefill && prefix_caching case by @plusNew001 in #5139
[Graph Optimization][SOT] Eliminate BreakGraph by move import stmt to top by @DrRyanHuang in #5146
[BugFix] Fix zero workspace returned by CUB size query under CUDA Graph in MoE dispatch by @littledgg in #5087
[BugFix] [PD Disaggregation] Fix schedule error in splitwise deployment by @juncaipeng in #5149
[BugFix] [PD Disaggregation] fix v1 scheduler prefill node profile run & ipc transfer protocol by @liyonghua0910 in #5132
[Feature] support bos download retry by @kevincheng2 in #5137
[CI] Unified diff coverage upload logic by @EmmonsCurse in #5127
[CI]【Hackathon 9th Sprint No.51】NO.51 功能模块 fastdeploy/scheduler/dp_scheduler.py 单测补充 by @essos-bot in #5046
[PD Disaggregation][XPU] Add XPU support for PD disaggregation by @ddchenhao66 in #5113
[Feature] Support noaux for eplb by @xiaoxiaohehe001 in #5143
[RL]Fix missing is_distributed attribute by @bukejiyu in #5150
[ENV] support AK SK ENCPOINT while get the multi_modal's feature by @lizhenyun01 in #5159
[Speculative Decoding][MTP] Support static CacheKV C8 quantization and optimize memory usage by @freeliuzc in #5155
[PD Disaggregation] [Refine] Refine splitwise deployment by @juncaipeng in #5151
[Fix] Fix noaux ep test by @xiaoxiaohehe001 in #5161
[Polish] Simplify repr method in Request class by @Jiang-Jia-Jun in #5153
[BugFix] fix num of rdma_comm_ports check by @yuanlehome in #5168
[Optimization] Improve perf for fd response token with internal adapter by @rainyfly in #4992
[BugFix] fix reschedule with mtp + logprob by @Deleter-D in #5165
[Feature] dyc8 support prefixcache by @kevincheng2 in #5125
[Feature] remove to_numpy by @kevincheng2 in #5162
【Hackathon 9th No.109】[CppExtension] 添加 fastdeploy_ops 目录到 package_data 以支持现代打包方式 - part by @megemini in #5156
[CI] fix coverage_report in daily test by @EmmonsCurse in #5175
[Others] unitest tests/layers/test_attention_layer.py by @zhoutianzi666 in #5174
[CI] Ignore new custom ops stub file in coveragerc by @SigureMo in #5177
[CI] add output for last_token in test_streaming_with_stop_str by @EmmonsCurse in #5170
[XPU]Update documentation by @qw86972190 in #5180
[Fix] Fix eplb bug and support fp8 load weight by @xiaoxiaohehe001 in #5178
[CI] 【Hackathon 9th Sprint No.18】NO.18 功能模块单测补充 -part by @xunyoyo in #5064
[BugFix] fix release block ids by @juncaipeng in #5184
[XPU][CI] change VL model to 28B-VL-thinking by @plusNew001 in #5169
[Feature] Supports separate loading of offline quantization for moe. by @xiaoxiaohehe001 in #5142
[Metax] support ENABLE_V1_KVCACHE_SCHEDULER by @xiaozude in #5163
[Feature] support eplb in api_server by @kevincheng2 in #4782
[BugFix] dummy import some ops by @yuanlehome in #5192
[CI] Update redis download source for docker_build failure fix by @EmmonsCurse in #5198
[Bug fix] Send first token in D instance by @rainyfly in #5199
[BugFix] [OP] Fix the error in MoeExpertFFN operator when valid_token_num=0 by @zccjjj in #5196
[CI] Add Unittest by @Echo-Nie in #5187
[CI] 【Hackathon 9th Sprint No.17】NO.17 功能模块单测补充 by @xunyoyo in #5054
[CI] 【Hackathon 9th Sprint No.24】NO.24 功能模块单测补充 by @xunyoyo in #5055
[Speculative Decoding][MTP]Update extract_mtp_weight script and optimize config by @freeliuzc in #5183
[XPU] [CI] Xpu ci lock PaddlePaddle Version by @plusNew001 in #5218
[BugFix] fix work metrics not returned by metrics api by @liyonghua0910 in #4912
[BugFix] fix mm_positions type error by @kevincheng2 in #5182
[Benchmark]add qwen3-235b pd+ep yaml by @xiegegege in #5225
[CI] Add Cherry-Pick PR check logic by @EmmonsCurse in #5191
[FDConfig] disable use_sequence_parallel_moe default by @yuanlehome in #5222
[Feature] The 45VL supports prompt_token_ids + messages input. by @kxz2002 in #5148
[Feature] enable guided decoding ENABLE_V1_KVCACHE_SCHEDULER = 1 by @ST-XX in #5140
[Docs] add docs of base64 or local file mm inputs by @ApplEOFDiscord in #5193
[Metrics] Update time_to_first_token to include tokenization & queue time, and remove redundant metrics by @liyonghua0910 in #4993
[Docs] add request params by @LiqinruiG in #5207
[Speculative Decoding]Fix attention mask offset by @freeliuzc in #5208
【BugFix】Fix logprob.slice_row inplace Error by @ckl117 in #5237
[BugFix] fix prompt_token_ids is None in request dict in llm.generate by @kxz2002 in #5241
[Fix] fix eplb noaux by @xiaoxiaohehe001 in #5239
[BugFix]Fix attention mask bug in D-Node of PD-split mode by @freeliuzc in #5245
[BugFix] BF16 MoE Cutlass Backend Support EP by @ckl117 in #5242
[BugFix] fix vl performance bug by @kevincheng2 in #5181
[Optimization] Refine row parallel bias and nranks and moe all_reduce by @yuanlehome in #5247
[CI] 【Hackathon 9th Sprint No.33】NO.33 功能模块单测补充 -part by @xunyoyo in #5056
[Speculative Decoding] split draft_tokens into standalone post-processing path by @sunlei1024 in #5205
[BugFix] fix mtp logprob bugs in chunk prefill by @Deleter-D in #5244
[CI]【Hackathon 9th Sprint No.50】NO.50 功能模块 fastdeploy/entrypoints/engine_client.py 单测补充 -part by @essos-bot in #5045
[BugFix] fix cuda-python requirement by @yuanlehome in #5261
[CI] 【Hackathon 9th Sprint No.41】NO.41 功能模块单测补充 -part by @xunyoyo in #5062
[PD Disaggregation] Add unittest for splitwise deployment with using rdma by @juncaipeng in #5189
[BugFix][Metrics] Fix Prometheus Multiprocess Metrics Issues and Add ZMQ Communication Metrics by @fl0w2o48 in #5185
[XPU] support kernel for mtp(base) by @cmcamdy in #4748
[Docs] add qwen25-vl docs by @CSWYF3634076 in #5243
[CI] disable test_engine_client.py unit test by @EmmonsCurse in #5272
[CI] fix run batch unit test by @xiaolei373 in #4628
[BugFix]fix v1 loader lm head fp32 by @ckl117 in #5270
[CI] Fix test streaming with stop str by @EmmonsCurse in #5275
[XPU][CI] Set pip index URL to Tsinghua mirror by @plusNew001 in #5277
[Feature] support flash_mask_attention backend by @lizhenyun01 in #5134
[CI][XPU] add pd disaggregation by @ddchenhao66 in #5179
Revert "[CI] 【Hackathon 9th Sprint No.33】NO.33 功能模块单测补充" -part by @juncaipeng in #5286
[BugFix] fix tsp o_proj bias add by @yuanlehome in #5284
[BugFix] race condition [is_fetching] causing multiple fetch requests by @ST-XX in #5238
[BugFix]Set default OMP_NUM_THREADS=3 and fix extra GPU memory usage in DeepSeek by @bukejiyu in #5219
[Others] add PADDLE_ENFORCE by @zhoutianzi666 in #5288
[OP]Remove extra H2D in DeepGemm. by @K11OntheBoat in #5262
[Feature] add bos config check by @kevincheng2 in #5273
[Others] clean code by @zhoutianzi666 in #5235
[FDConfig] remove engine client args, use fd_config instead by @liyonghua0910 in #5217
[Benchmark] Support random input by @ZhangYulongg in #5298
[Intel HPU] change MoE weights and scales from list to tensor and add… by @fmiao2372 in #5289
[APIServer] add_prompt_ids_test by @DDDivano in #5283
[BugFix] fix aksk check bug by @kevincheng2 in #5295
[BugFix] fix mm to_dict bug by @kevincheng2 in #5300
[xpu] support mtp for xpu(mix) by @cmcamdy in #5274
[Features] add audio request & fix embedding bug by @ming1753 in #5201
[Deterministic] Move paddle version batch invariant pkg to Fastdeploy by @littledgg in #4763
[Feature] support chunked moe by @Wanglongzhi2001 in #4575
[XPU][CI]Change W4A8 Case Base Value by @plusNew001 in #5309
[CI] Update build_docker to paddle_manylinux by @EmmonsCurse in #5226
[CI] Remove need approve by yuanlehome by @yuanlehome in #5310
[PD Disaggregation] support different tp_size for prefill and decode by @juncaipeng in #5296
[XPU] fix gather_next_token by @cmcamdy in #5311
[XPU][CI] Change XPU CI Base Value by @plusNew001 in #5318
[Optimization] EP empty_input_forward Remove Communication by @ckl117 in #5254
[CI]add clear to run-batch ci by @xiaolei373 in #5307
[CI] disable test_chunked_moe.py in unit_test by @EmmonsCurse in #5322
Revert "[CI] 【Hackathon 9th Sprint No.41】NO.41 功能模块单测补充 -part" by @YuanRisheng in #5291
Revert "[CI] 【Hackathon 9th Sprint No.18】NO.18 功能模块单测补充 -part" by @YuanRisheng in #5290
[LogProbs]Enable prompt logprobs output and modify data transmission method for the online interface. by @qwes5s5 in #5089
[PD Disaggregation] Support PD deployment of DeepSeekv3. by @K11OntheBoat in #5251
[Feature] support reward model by @lizexu123 in #5301
[XPU]add enable_logprob by @qw86972190 in #5279
[CI] Fix return_code check in test_chunked_moe.py by @EmmonsCurse in #5326
[CI] Update test_docker to paddle_dev by @EmmonsCurse in #5278
[XPU] [CI] Xpu Ci Refactor by @plusNew001 in #5252
[UNITEST] add test by @zhoutianzi666 in #5305
[Intel HPU] add example benchmark scripts for hpu by @fmiao2372 in #5304
[Quantization] Support w4afp8 MoE dynamic quantization by @Sunny-bot1 in #5282
[CI] Disable queue state assertion temporarily by @EmmonsCurse in #5329
[CI] Add env ci by @Wanglongzhi2001 in #5331
[CI] Allow occasional distributed worker exit_code by @EmmonsCurse in #5341
[Optimization] supports mtp split_kv_attn, unified to append scenarios by @carryyu in #5343
[CI] Add RD in env CI. by @Wanglongzhi2001 in #5345
[Optimization]1.fix tp+ep moe_forward; 2.set max_prefill_batch=env.MAX_PREFILL_NUM by @carryyu in #5315
[BugFix] Fix EP issue in the CUTLASS MoE backend by @Sunny-bot1 in #5337
[CE]add wint4 ep by @tianlef in #5355
[Optimization]1.fix tp+ep moe_forward; 2.set max_prefill_batch=env.MAX_PREFILL_NUM by @carryyu in #5353
[bugfix]remove metrics middleware by @xiaolei373 in #5332
[XPU] xpu support mm prefix cache by @ddchenhao66 in #5356
[Feature] Guided Decoding add LLguidance backend by @ST-XX in #5124
[Feature] support audio tts by @ming1753 in #5333
[FIX BUG] fix bug in TP in permute_x_fp8_kernel by @zhoutianzi666 in #5350
[BugFix] dynamic cache kv block_wise_fp8 not need create layer.cache_k_scale by @yuanlehome in #5362
[Optimization] Requirements remove version for setuptools, uvicorn, triton and safetensors, del fastsafetensors by @Echo-Nie in #5330
[BugFix] Fix issues related to data retrieval logic, parameter validation, and result serialization in both online and offline interfaces. by @qwes5s5 in #5335
[Bug fix] fix pooling models by @lizexu123 in #5358
[Intel HPU] fix memory fragmentation issue and fix moe all_reduce issue by @fmiao2372 in #5357
[BugFix] Reduce timeout in unittest by @juncaipeng in #5366
[Models] Add forward_meta to moe models' forward function by @Wanglongzhi2001 in #5138
[PD Disaggregation] support DP via v1 router and decouple DP and EP by @liyonghua0910 in #5197
[Docs] update FAQ with logprobs MQ limits and deprecation by @sunlei1024 in #5368
[BugFix] Exit if neither modern nor legacy wheel dir not found by @SigureMo in #5367
[FUCK] remove fastsafetensors by @yuanlehome in #5371
[RL] [BugFix] update check_model_weights_status loop by @liyonghua0910 in #5249
[Fearture] Support cache kv cache for output tokens by @rainyfly in #4535
[BugFix] fix get_request from scheduler by @juncaipeng in #5369
[CI] disable test_schedule_output.py in unit_test by @EmmonsCurse in #5377
[Loader]Adapting DeepSeek weights for PyTorch loading. by @bukejiyu in #5373
[XPU] [Optimization] [EP] EP communication optimization. by @zccjjj in #5145
[BugFix] Compatible with asynchronous functions by @ming1753 in #5378
[XPU] support XDNN downloading function by @cqulilujia in #5365
[Intel HPU] fix bug about RP 5138 by @fmiao2372 in #5380
[XPU] [CI] Change Paddle Version to Nightly by @plusNew001 in #5346
[XPU] bug fix block attn in mix mtp by @cmcamdy in #5384
[BugFix] Fix flash_attn_backend by @lizhenyun01 in #5387
[BugFix] Fix the issue of redundant logging for certain events in the trace_logger by @qwes5s5 in #5386
[Feature] support Two batch overlap, mainly used in Prefill by @zhoutianzi666 in #5078
[XPU] redirect xvllm/xtdk/xhpc downloading log by @cqulilujia in #5388
[XPU] support moe_expert_ffn TGEMM selection by @cqulilujia in #5375
[Optimization] Qwen2.5-VL support multi-batch prefill by @aquagull in #5269
[BugFix] fix scheduler hang when input length is very close to max_model_len by @liyonghua0910 in #5393
[XPU] support ep4tp1+v1 loader by @zccjjj in #5398
[BugFix] fix async download bug by @kevincheng2 in #5349
[BugFix] fix mtp prefix_cache dy-c8 bug by @kevincheng2 in #5390
[BugFix]Fix plugin loading logic and logging messages by @wangyuwen1999 in #4909
[BugFix] fix top_p_candidates by @Deleter-D in #5400
[Reverted][RL] Support Rollout Routing Replay by @gongshaotian in #5321
[Bug fix] Fix the multi-input accuracy issue in the pooling model. by @lizexu123 in #5374
[Others]remove _execute_empty_input by @zhoutianzi666 in #5396
Revert "[RL] Support Rollout Routing Replay" by @Jiang-Jia-Jun in #5402
[Cherry-Pick][Loader]fix deepseek torch loading #5410 [loader]fix bf16 deepseek #5379 [Loader]Adapting DeepSeek weights for PyTorch loading by @bukejiyu in #5411
[Cherry-Pick][New][RL] Support Rollout Routing Replay (#5405) by @gongshaotian in #5408
[Cherry-Pick][Loader][BugFix] Fix some parameters place on CPU in PaddleOCR-VL (#5413) by @SigureMo in #5414
[BugFix][Cherry-Pick] fix can not enter into cuda graph by @zhoutianzi666 in #5423
[Cherry-Pick] [BugFix] [RL] remove shutdown_process_group/restart_process_group for RL (#5433) by @liyonghua0910 in #5434
[Cherry-Pick][BugFix] 0 not into cuda graph to save memory (#5426) by @zhoutianzi666 in #5432
[NewFeature]support dynamic load for normal by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/5437
[Cherry-Pick][Optimization] compulte real max_logprobs in batch (#5430) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5448
[Cherry-Pick] allow 0-dim tensor into ar by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5452
[BugFix] fix limit_thinking bug by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5469
[Cherry-Pick][CI] Fix attention bug in spec decoding(#5460) by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5481
[Cherry-Pick][CI] ep+prefix cache+chunk prefill(#5489) by @zccjjj in https://github.com/PaddlePaddle/FastDeploy/pull/5490
[Cherry-Pick] [BugFix] fix instability after clearing weight (#5493) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5487
[Cherry-Pick][RL]Fix RL weight loading issue in moe layer #5503 by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5505
[[Cherry-Pick][BugFix] fix hung when n>1 and --enable-logprob (#5492)(#5499) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5498
[Cherry-Pick] [BugFix] [RL] skip model executing after clearing/updating is done (#5527) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5523
[Cherry-Pick][Feature][Optimization] Qwen Dynamic C8(#5486) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5536
[Bug Fix][Cherry-pick] Fix bug for caching output when preempted(#5502) by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5510
[Cherry-Pick][BugFix] fix dynamic c8 in v1 loader(#5562) by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5519
【NewFeature】support load fp8 weight by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/5566
[Cherry-Pick][CI] Adape unit_test due to incompatibility change(#5578) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5583
[Cherry-Pick][RL] R3 Support RDMA Store(#5467) by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/5468
[Cherry-Pick][CI]Support different inferseed in speculate decoding(#5568) by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5597
[Cherry-Pick][Feature]Add a switch for logprobs/prompt_logprobs token decoding.(#5463) by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/5572
[Cherry-Pick][CI]Fix write qknorm cache bug in speculative decoding(#5491) by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5617
[Cherry-Pick] Support for request-level speculative decoding metrics monitoring.(#5518) by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5614
[Cherry-Pick][Others] Maintain the mtp branch temporarily. (#5446) by @carryyu in https://github.com/PaddlePaddle/FastDeploy/pull/5621
[Model] tp+ep support v1_loader by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/5600
[Cherry-Pick][BugFix] fix speculate_limit_thinking_content_length #5590 by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5615
[Cherry-Pick][RL]Support loading weights via the load_weights function for RL #5549 by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5602
[Cherry-Pick][BugFix] fix rl model_weights_signal to support tp>1 #5639 by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5637
[Cherry-Pick][RL]Fix RL load_weights #5642 by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5643
[Cherry-Pick][BugFix] cp fix_cpu_cache_bugs(#5544) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5577
[Cherry-Pick][BugFix] fix rl model_weights_signal to support tp>1 by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5650
[Cherry-Pick][XPU] logprob bug #5626 by @qw86972190 in https://github.com/PaddlePaddle/FastDeploy/pull/5636
[Cherry-Pick][BugFix] Cp fix eb5 prefix cache(#5638) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5644
[Cherry-Pick][Others]Prevent core dumps during Paddle version check #5657 by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5659
[Cherry-Pick][BugFix] Fix custom_all_reduce overflow (#5662) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5667
[Cherry-Pick] [RL] provide options for whether shutdown comm group after weights cleared (#5663) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5664
[Cherry-Pick][BugFix] fix rl signal #5681 by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5678
[Cherry-Pick][XPU]Set top_p=0.0 by default on XPU to optimize performance(#5686) by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5688
[Cherry-Pick][CI] Support multi-step mtp with cudagraph (#5624) by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5670
[Cherry-Pick] [BugFix] fix double shutdown of comm group when rank0 clears weights slower than other ranks (#5715) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5710
[Cherry-Pick][CI] Revert adapt vl_model baseline changes due to Paddle update(#5732) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5733
[Cherry-Pick][Feature] Entropy calculation support #5692 by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5731
[Cherry-Pick][BugFix] Fix Chunked Prefill when max_tokens=1(#5736) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5747
[Cherry-Pick][CI] Refactor RL tests to reuse upload_clear(#5741) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5755
[BugFix][Cherry-pick] Set enable_cache_output as false by default(#5751) by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5752
[Cherry-Pick][Others]upgrade paddleformer to 0.4.0 #5599 by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5716
[Cherry-Pick][Loader]Fix bug in MTP weight loading #5744 by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5745
[cherry-pick] support FA3 in mixed mode and support Qwen3 rope by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5655
[BugFix][Cherry-Pick] cp fix logprob bug(#5604) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5770
[FDConfig][Cherry-Pick] Cp disable mm chunked(#5774) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5775
[BugFix][Cherry-pick] Fix preemption out of real_bsz(#5805) by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5806
[Cherry-Pick] Fix process_response_dict to support async in serving_completion (#5758) by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/5802
[Cherry-Pick] Support flexible model by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/5749
[Cherry-Pick][BugFix] Fix _disable_sequence_parallel_moe_if_needed#5740 by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5811
[Cherry-Pick][Feature] support glm fa3 (#5586) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5810
[Cherry-Pick] [BugFix] fix shm opened but not closed in set_data_ipc (#5826) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5827
[Cherry-Pick][RL] add lm_head_fp32 in RolloutModelConfig(#5825) by @tianhaodongbd in https://github.com/PaddlePaddle/FastDeploy/pull/5824
[Cherry-Pick][BugFix] Fix entropy bugs (#5818) by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5819
[BugFix][Cherry-Pick] eb5 mm skip prefix cache(#5838) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5839
[Cherry-Pick][Speculative Decoding] Optimize draft logprob (#5842) by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5843
[Cherry-Pick] [BugFix] fix cache manager not launched in case of mtp or blockwise fp8 (#5840) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5841
[Cherry-Pick][BugFix] cp skip_mm_revert(#5848) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5849
[Cherry-Pick][Optimization] Optimization for gather_logprob by 10GB (#5817)(#5846) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5834
[Cherry-Pick][XPU]MAX_BSZ aligns gpu settings and disable prefix cache in OCR VL (#5831) by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5845
[XPU][CI]Release ci update by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5687
[Cherry-Pick][CI] Fix archive URL injection and add retry(#5725,#5828) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5832
[Cherry-Pick][APIServer][Feature] Add configurable worker health check timeout via FD_WORKER_ALIVE_TIMEOUT(#5865) by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/5867
[Cherry-Pick][RL] Change 'model' to the instance variable 'tmp_model'(#5872) by @tianhaodongbd in https://github.com/PaddlePaddle/FastDeploy/pull/5873
[Cherry-Pick][BugFix]support fa3 qwen-vl rope (#5869) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5877
[BugFix] Fix speculate metrics bug by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5875
[Cherry-Pick][CI] Fix attn_mask_offset for multi-step MTP in mixed and PD-split modes(#5738) by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5793
[Cherry-Pick][OPs] ep_moe_expert_dispatch.cu dispatch num_experts_per_rank 5 by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5889
[Cherry-Pick] [KVCache] launch cache transfer processes only if hierarchical cache or kv cache storage is enabled (#5871) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5859
[Cherry-Pick] [BugFix] fix mtp cache attaching for pd disaggregation (#5884) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5885
[Bugfix]fix model weight signal tensor num by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/5899
[Cherry-Pick] [XPU]Cherry-pick Support ZMQ logprobs(#5628) by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/5852
[Feature] Add a global toggle for automatic injection of trace_id and span_id in logs by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/5765
[BugFix][Cherry-Pick] Cp fix eb5 prefix cache(#5879) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5881
[Cherry-Pick][CI]Support multi-step mtp with cudagraph(#5886) by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5898
[Cherry Pick][XPU][CI] Add logprobs Case by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5907
[Cherry-Pick] [BugFix] fix mtp split kv attetion by @lizhenyun01 in https://github.com/PaddlePaddle/FastDeploy/pull/5921
[Optim][Cherry-pick] Reduce preemption occurrence when blocks not enough(#5696) by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5808
[Cherry-Pick][Bugfix] Fix mtp logprob hang problem when include stop_seq (#5927) by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5928
[CI] Lock paddlepaddle-gpu==3.3.0 in release/2.4 by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5944
[BugFix] fix xpu import set_data_ipc by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5945
[Cherry-Pick][Bugfix] Fix entropy calculation bugs (#5941) by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5942
[Cherry-Pick][BugFix] Fix misleading logging in worker_process for request counting (#5939) by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/5953
[BugFix][Cherry-Pick] cp fix dyc8 cache bug(#5958) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5959
support_lastnorm_gather_split_r2.4 by @xiaoluomi in https://github.com/PaddlePaddle/FastDeploy/pull/5925
[Cherry-Pick][Speculative Decoding] Return accepted tokens per head in response (#5947) by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5952
[CI] Align PaddlePaddle version to latest due to tag change by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5971
2.4_fix_mtp_forward_meta by @xiaoluomi in https://github.com/PaddlePaddle/FastDeploy/pull/5977

New Contributors

@playaswd made their first contribution in #4848
@WintersMontagne10335 made their first contribution in #5011
@fl0w2o48 made their first contribution in #5185
@wangyuwen1999 made their first contribution in #4909

Full Changelog: v2.3.3...v2.4.0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v2.4.0

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

核心推理能力与模型支持增强

并行架构、调度与 MoE 能力演进

多模态、缓存与量化能力增强

多硬件平台支持扩展

P800

Intel HPU

Iluvatar GPU

MetaX

性能优化、可观测性与稳定性修复

性能与通信优化

可观测性与安全

稳定性与 Bug 修复

What's Changed

New Contributors

Contributors

Uh oh!