Releases: vllm-project/vllm-ascend
v0.10.1rc1
This is the 1st release candidate of v0.10.1 for vLLM Ascend. Please follow the official doc to get started.
Highlights
- LoRA performance is significantly improved by adding custom kernels, contributed by China Merchants Bank. #2325
- Support Mooncake TransferEngine for KV cache registration and a pull_blocks style disaggregated prefill implementation. #1568
- Custom ops can now be captured into aclgraph. #2113
Core
- Add MLP tensor parallel to improve performance, but note that this will increase memory usage. #2120
- openEuler is upgraded to 24.03. #2631
- Add custom lmhead tensor parallel to reduce memory consumption and improve TPOT performance. #2309
- Qwen3 MoE and Qwen2.5 support the torchair graph mode now. #2403
- Support sliding window attention with AscendScheduler, fixing the Gemma3 accuracy issue. #2528
Other
- Bug fixes:
- Update the graph capture size calculation, which alleviates the problem of insufficient NPU streams in some scenarios. #2511
- Fix bugs and refactor cached mask generation logic. #2442
- Fix the NZ format not working in quantization scenarios. #2549
- Fix accuracy issue on Qwen series caused by enabling enable_shared_expert_dp by default. #2457
- Fix accuracy issue on models whose rope dim is not equal to head dim, e.g., GLM4.5. #2601
- Performance improved through a number of PRs:
- A batch of refactoring PRs to enhance the code architecture:
- Parameters changes (a usage sketch follows this list):
- Add lmhead_tensor_parallel_size in additional_config; set it to enable lmhead tensor parallel. #2309
- Some unused environment variables HCCN_PATH, PROMPT_DEVICE_ID, DECODE_DEVICE_ID, LLMDATADIST_COMM_PORT and LLMDATADIST_SYNC_CACHE_WAIT_TIME are removed. #2448
- The environment variable VLLM_LLMDD_RPC_PORT is renamed to VLLM_ASCEND_LLMDD_RPC_PORT now. #2450
- Add the VLLM_ASCEND_ENABLE_MLP_OPTIMIZE environment variable to enable the MLP optimization when tensor parallel is enabled; this feature gives better performance in eager mode. #2120
- Remove the MOE_ALL2ALL_BUFFER and VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ environment variables. #2612
- Add enable_prefetch in additional_config to control whether weight prefetch is enabled. #2465
- Add mode in additional_config.torchair_graph_config; it needs to be set when using the reduce-overhead mode for torchair. #2461
- enable_shared_expert_dp in additional_config is disabled by default now; it is recommended to enable it when inferencing with DeepSeek. #2457
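Below is a minimal, unofficial sketch of how some of these options might be passed when constructing the engine offline. The additional_config keyword, the model name, and the concrete values are illustrative assumptions, not settings taken from this release; check the vLLM Ascend docs for the options supported by your version.

```python
# Hedged sketch: pass vLLM Ascend options through additional_config.
# Assumptions: additional_config is accepted by your vLLM build; the model
# name and values are placeholders only.
import os

# Renamed in #2450; only relevant for disaggregated prefill deployments.
os.environ["VLLM_ASCEND_LLMDD_RPC_PORT"] = "5557"

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-8B",                 # placeholder model
    tensor_parallel_size=2,
    additional_config={
        "lmhead_tensor_parallel_size": 2,  # enable lmhead tensor parallel (#2309)
        "enable_prefetch": True,           # enable weight prefetch (#2465)
    },
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```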
Known Issues
- Sliding window attention does not support chunked prefill currently, so AscendScheduler must be enabled to run with it. #2729
- There is a bug when creating mc2_mask with MultiStream enabled; it will be fixed in the next release. #2681
New Contributors
- @lidenghui1110 made their first contribution in #1917
- @haojiangzheng made their first contribution in #1772
- @QwertyJack made their first contribution in #2298
- @LCAIZJ made their first contribution in #1568
- @liuchenbing made their first contribution in #2325
- @gameofdimension made their first contribution in #2407
- @NicholasTao made their first contribution in #2403
- @ZhaoJiangJiang made their first contribution in #2453
- @s-jiayang made their first contribution in #2373
- @NSDie made their first contribution in #2528
- @panchao-hub made their first contribution in #2639
- @zzy-ContiLearn made their first contribution in #2541
- @baxingpiaochong made their first contribution in #2664
Full Changelog: v0.10.0rc1...v0.10.1rc1
v0.9.1
We are excited to announce the newest official release of vLLM Ascend. This release includes many feature supports, performance improvements and bug fixes. We recommend users upgrade from 0.7.3 to this version. Please always set VLLM_USE_V1=1 to use the V1 engine.
In this release, we added many enhancements for the large-scale expert parallel case. It's recommended to follow the official guide.
Please note that this release note lists all the important changes since the last official release (v0.7.3).
Highlights
- DeepSeek V3/R1 is supported with high quality and performance. MTP can work with DeepSeek as well. Please refer to the multi-node tutorials and Large Scale Expert Parallelism.
- Qwen series models work with graph mode now. It works by default with V1 Engine. Please refer to Qwen tutorials.
- Disaggregated Prefilling support for V1 Engine. Please refer to Large Scale Expert Parallelism tutorials.
- Automatic prefix caching and chunked prefill features are supported.
- The speculative decoding feature works with the Ngram and MTP methods.
- MoE and dense w4a8 quantization are supported now. Please refer to the quantization guide.
- Sleep Mode feature is supported for V1 engine. Please refer to Sleep mode tutorials.
- Dynamic and Static EPLB support is added. This feature is still experimental.
Note
The following notes are especially for reference when upgrading from the last final release (v0.7.3):
- V0 Engine is not supported from this release. Please always set VLLM_USE_V1=1 to use the V1 engine with vLLM Ascend (see the sketch after this list).
- MindIE Turbo is not needed with this release, and old versions of MindIE Turbo are not compatible, so please do not install it. All of its functions and enhancements are already included in vLLM Ascend. We'll consider adding it back in the future if needed.
- Torch-npu is upgraded to 2.5.1.post1. CANN is upgraded to 8.2.RC1. Don't forget to upgrade them.
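A minimal sketch of the V1 switch mentioned above; the variable must be set before vllm is imported, and the model name is a placeholder.

```python
# Select the V1 engine before importing vllm (required from this release).
import os

os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model
print(llm.generate(["Hello from vLLM Ascend"])[0].outputs[0].text)
```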
Core
- The Ascend scheduler is added for the V1 engine. This scheduler is better suited to Ascend hardware.
- Structured output feature works now on V1 Engine.
- A batch of custom ops are added to improve the performance.
Changes
- EPLB support for Qwen3-moe model. #2000
- Fix the bug that MTP doesn't work well with Prefill Decode Disaggregation. #2610 #2554 #2531
- Fix a few bugs to make sure Prefill Decode Disaggregation works well. #2538 #2509 #2502
- Fix file not found error with shutil.rmtree in torchair mode. #2506
Known Issues
- When running MoE models, aclgraph mode only works with tensor parallel; DP/EP doesn't work in this release.
- Pipeline parallelism is not supported in this release for V1 engine.
- If you use w4a8 quantization with eager mode, please set VLLM_ASCEND_MLA_PARALLEL=1 to avoid OOM errors.
- Accuracy tests with some tools may not be correct. This doesn't affect real user cases. We'll fix it in the next post release. #2654
- We notice that there are still some problems when running vLLM Ascend with Prefill Decode Disaggregation. For example, memory may leak and the service may hang. This is caused by known issues in vLLM and vLLM Ascend. We'll fix them in the next post release. #2650 #2604 vLLM#22736 vLLM#23554 vLLM#23981
v0.9.1rc3
This is the 3rd release candidate of v0.9.1 for vLLM Ascend. Please follow the official doc to get started.
Core
- MTP supports V1 scheduler #2371
- Add LMhead TP communication groups #1956
- Fix the bug that qwen3 moe doesn't work with aclgraph #2478
- Fix grammar_bitmask IndexError caused by the outdated apply_grammar_bitmask method #2314
- Remove chunked_prefill_for_mla #2177
- Fix bugs and refactor cached mask generation logic #2326
- Fix configuration check logic about ascend scheduler #2327
- Cancel the verification between deepseek-mtp and non-ascend scheduler in disaggregated-prefill deployment #2368
- Fix issue that failed with ray distributed backend #2306
- Fix incorrect req block length in ascend scheduler #2394
- Fix header include issue in rope #2398
- Fix mtp config bug #2412
- Fix error info and adapt attn_metadata refactor #2402
- Fix torchair runtime error caused by configuration mismatches and missing .kv_cache_bytes file #2312
- Move with_prefill allreduce from CPU to NPU #2230
Docs
- Add document for deepseek large EP #2339
Known Issues
- Full graph mode support is not yet available for some cases with full_cuda_graph enabled. #2182
Full Changelog: v0.9.1rc2...v0.9.1rc3
v0.10.0rc1
This is the 1st release candidate of v0.10.0 for vLLM Ascend. Please follow the official doc to get started. V0 is completely removed from this version.
Highlights
- Disaggregated prefill works with the V1 engine now. You can give it a try with the DeepSeek model #950, following this tutorial.
- The W4A8 quantization method is supported for dense and MoE models now. #2060 #2172
Core
- Ascend PyTorch adapter (torch_npu) has been upgraded to 2.7.1.dev20250724 #1562, and CANN has been upgraded to 8.2.RC1 #1653. Don't forget to update them in your environment or use the latest images.
- vLLM Ascend works on Atlas 800I A3 now, and the image for A3 will be released from this version on. #1582
- Kimi-K2 with w8a8 quantization, Qwen3-Coder and GLM-4.5 are supported in vLLM Ascend; please follow this tutorial to give them a try. #2162
- Pipeline Parallelism is supported in V1 now. #1800
- Prefix cache feature now works with the Ascend Scheduler. #1446
- Torchair graph mode works with tp > 4 now. #1508
- MTP supports torchair graph mode now. #2145
Other
- Bug fixes:
- Performance improved through a number of PRs:
- Caching sin/cos instead of calculating them in every layer. #1890
- Improve shared expert multi-stream parallelism #1891
- Implement the fusion of allreduce and matmul in the prefill phase when tp is enabled. Enable this feature by setting VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE to 1. #1926
- Optimize quantized MoE performance by reducing All2All communication. #2195
- Use AddRmsNormQuant ops in the custom model to optimize Qwen3's performance #1806
- Use multicast to avoid padding decode request to prefill size #1555
- The performance of LoRA has been improved. #1884
- A batch of refactoring PRs to enhance the code architecture:
- Parameters changes (a usage sketch follows this list):
- expert_tensor_parallel_size in additional_config is removed now, and EP and TP are aligned with vLLM now. #1681
- Add VLLM_ASCEND_MLA_PA in environment variables; use it to enable the MLA paged attention operator for DeepSeek MLA decode.
- Add VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE in environment variables to enable the MatmulAllReduce fusion kernel when tensor parallel is enabled. This feature is supported on A2, and eager mode will get better performance.
- Add VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ in environment variables to control whether MoE all2all seq is enabled; this provides a basic framework on the basis of alltoall for easy expansion.
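A minimal sketch of how these switches might be set; they are plain environment variables read when the engine starts, and the values and model name below are illustrative assumptions.

```python
# Set the opt-in flags before the engine is created; values are illustrative.
import os

os.environ["VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE"] = "1"  # fuse matmul + allreduce in prefill (#1926)
os.environ["VLLM_ASCEND_MLA_PA"] = "1"                   # MLA paged attention for DeepSeek MLA decode

from vllm import LLM

llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite", tensor_parallel_size=2)  # placeholder configuration
```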
- UT coverage reached 76.34% after a batch of PRs following this RFC: #1298
- Sequence Parallelism works for Qwen3 MoE. #2209
- The Chinese online documentation is available now. #1870
Known Issues
- Aclgraph does not work with DP + EP currently; the main gap is that the number of NPU streams Aclgraph needs to capture the graph is not enough. #2229
- There is an accuracy issue on W8A8 dynamic quantized DeepSeek with multistream enabled. This will be fixed in the next release. #2232
- In Qwen3 MoE, SP cannot be incorporated into the Aclgraph. #2246
- MTP does not support the V1 scheduler currently; this will be fixed in Q3. #2254
- When running MTP with DP > 1, we need to disable the metrics logger due to an issue in vLLM. #2254
- The GLM-4.5 model has an accuracy problem in long-output-length scenarios.
New Contributors
- @pkking made their first contribution in #1792
- @lianyiibo made their first contribution in #1811
- @nuclearwu made their first contribution in #1867
- @aidoczh made their first contribution in #1870
- @shiyuan680 made their first contribution in #1930
- @ZrBac made their first contribution in #1964
- @Ronald1995 made their first contribution in #1988
- @taoxudonghaha made their first contribution in #1884
- @hongfugui made their first contribution in #1583
- @YuanCheng-coder made their first contribution in #2067
- @Liccol made their first contribution in #2127
- @1024daniel made their first contribution in #2037
- @yangqinghao-cmss made their first contribution in #2121
Full Changelog: v0.9.2rc1...v0.10.0rc1
v0.9.1rc2
This is the 2nd release candidate of v0.9.1 for vLLM Ascend. Please follow the official doc to get started.
Highlights
- MoE and dense w4a8 quantization are supported now: #1320 #1910 #1275 #1480
- Dynamic EPLB support in #1943
- Disaggregated Prefilling support and improvements for the V1 Engine: continued development and stabilization of the disaggregated prefill feature, including performance enhancements and bug fixes for single-machine setups: #1953 #1612 #1361 #1746 #1552 #1801 #2083 #1989
Models improvement:
- DeepSeek DBO support and improvements: #1285 #1291 #1328 #1420 #1445 #1589 #1759 #1827 #2093
- DeepSeek MTP improvement and bugfix: #1214 #943 #1584 #1473 #1294 #1632 #1694 #1840 #2076 #1990 #2019
- Qwen3 MoE support improvement and bugfix around graph mode and DP: #1940 #2006 #1832
- Qwen3 performance improvement around rmsnorm/rope/mlp ops: #1545 #1719 #1726 #1782 #1745
- DeepSeek MLA chunked prefill/graph mode/multistream improvement and bugfix: #1240 #933 #1135 #1311 #1750 #1872 #2170 #1551
- Qwen2.5 VL improvement via mrope/padding mechanism improvement: #1261 #1705 #1929 #2007
- Ray: fix the device error when using Ray, add initialize_cache, and improve warning info: #1234 #1501
Graph mode improvement:
- Fix DeepSeek with mc2 in #1269
- Fix accuracy problem for deepseek V3/R1 models with torchair graph in long sequence predictions in #1332
- Fix torchair_graph_batch_sizes bug in #1570
- Enable the limit of tp <= 4 for torchair graph mode in #1404
- Fix rope accuracy bug #1887
- Support multistream of shared experts in FusedMoE #997
- Enable kvcache_nz for the decode process in torchair graph mode #1098
- Fix chunked-prefill with torchair case to resolve UnboundLocalError: local variable 'decode_hs_or_q_c' issue in #1378
- Improve shared experts multi-stream performance for w8a8 dynamic in #1561
- Fix MoE error when multistream is set in #1882
- Round up graph batch size to tp size in EP case #1610
- Fix torchair bug when DP is enabled in #1727
- Add extra checking to torchair_graph_config. in #1675
- Fix rope bug in torchair+chunk-prefill scenario in #1693
- torchair_graph bugfix when chunked_prefill is true in #1748
- Improve prefill optimization to support torchair graph mode in #2090
- Fix rank set in DP scenario #1247
- Reset all unused positions to prevent out-of-bounds to resolve GatherV3 bug in #1397
- Remove duplicate multimodal codes in ModelRunner in #1393
- Fix block table shape to resolve accuracy issue in #1297
- Implement primal full graph with limited scenario in #1503
- Restore paged attention kernel in Full Graph for performance in #1677
- Fix DeepSeek OOM issue in extreme --gpu-memory-utilization scenario in #1829
- Turn off aclgraph when enabling TorchAir in #2154
Ops improvement:
- Add custom AscendC kernel for VocabParallelEmbedding #796
- Fix rope sin/cos cache bug in #1267
- Refactoring AscendFusedMoE (#1229) in #1264
- Use fused ops npu_top_k_top_p in sampler #1920
Core:
- Upgrade CANN to 8.2.rc1 in #2036
- Upgrade torch-npu to 2.5.1.post1 in #2135
- Upgrade python to 3.11 in #2136
- Disable quantization in mindie_turbo in #1749
- Fix v0 spec decode in #1323
- Enable ACL_OP_INIT_MODE=1 directly only when using V0 spec decode in #1271
- Refactoring forward_context and model_runner_v1 in #1422
- Fix sampling params in #1423
- Add a switch for enabling NZ layout in weights and enable NZ for GMM in #1409
- Resolved bug in ascend_forward_context in #1449 #1554 #1598
- Address PrefillCacheHit state to fix prefix cache accuracy bug in #1492
- Fix load weight error and add new e2e case in #1651
- Optimize the number of rope-related index selections in DeepSeek in #1614
- Add mc2 mask in #1642
- Fix static EPLB log2phy condition and improve unit test in #1667 #1896 #2003
- Add chunk mc2 for prefill in #1703
- Fix mc2 op GroupCoordinator bug in #1711
- Fix the failure to recognize the actual type of quantization i...
v0.9.2rc1
This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the official doc to get started. From this release, the V1 engine is enabled by default; there is no need to set VLLM_USE_V1=1 any more. This release is also the last version to support the V0 engine; the V0 code will be cleaned up in the future.
Highlights
- Pooling models work with the V1 engine now. You can give them a try with the Qwen3 embedding model #1359 (a usage sketch follows this list).
- The performance on Atlas 300I series has been improved. #1591
- Aclgraph mode works with MoE models now. Currently, only Qwen3 MoE is well tested. #1381
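A minimal, unofficial sketch of running a pooling (embedding) model on the V1 engine; the task argument, the embed() call, and the model name are assumptions that may differ across vLLM versions.

```python
# Hedged sketch: run an embedding model and read back one embedding vector.
from vllm import LLM

llm = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")  # placeholder model
outputs = llm.embed(["vLLM Ascend now runs pooling models on the V1 engine."])
print(len(outputs[0].outputs.embedding))  # embedding dimension
```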
Core
- Ascend PyTorch adapter (torch_npu) has been upgraded to 2.5.1.post1.dev20250619. Don't forget to update it in your environment. #1347
- The GatherV3 error has been fixed with aclgraph mode. #1416
- W8A8 quantization works on Atlas 300I series now. #1560
- Fix the accuracy problem when deploying models with parallel parameters. #1678
- The pre-built wheel package now requires a lower version of glibc. Users can install it via pip install vllm-ascend directly. #1582
Other
- The official doc has been updated for a better reading experience. For example, more deployment tutorials have been added and the user/developer docs have been updated. More guides are coming soon.
- Fix accuracy problem for deepseek V3/R1 models with torchair graph in long sequence predictions. #1331
- A new env variable VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP has been added. It enables the fused allgather-experts kernel for DeepSeek V3/R1 models. The default value is 0. #1335
- A new env variable VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION has been added to improve the performance of top-k/top-p sampling. The default value is 0; we'll consider enabling it by default in the future (see the sketch after this list). #1732
- A batch of bugs have been fixed for the Data Parallelism case #1273 #1322 #1275 #1478
- The DeepSeek performance has been improved. #1194 #1395 #1380
- Ascend scheduler works with prefix cache now. #1446
- DeepSeek works with prefix cache now. #1498
- Support prompt logprobs to recover ceval accuracy in V1 #1483
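A minimal sketch of enabling the two opt-in variables above; both default to 0, must be exported before the engine starts, and the values shown are illustrative.

```python
# Opt-in performance flags; enable only if they help your workload.
import os

os.environ["VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP"] = "1"     # fused allgather-experts kernel (#1335)
os.environ["VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION"] = "1"  # faster top-k/top-p sampling (#1732)
```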
Known Issues
New Contributors
- @xleoken made their first contribution in #1357
- @lyj-jjj made their first contribution in #1335
- @sharonyunyun made their first contribution in #1194
- @Pr0Wh1teGivee made their first contribution in #1308
- @leo-pony made their first contribution in #1374
- @zeshengzong made their first contribution in #1452
- @GDzhu01 made their first contribution in #1477
- @Agonixiaoxiao made their first contribution in #1531
- @zhanghw0354 made their first contribution in #1476
- @farawayboat made their first contribution in #1591
- @ZhengWG made their first contribution in #1196
- @wm901115nwpu made their first contribution in #1654
Full Changelog: v0.9.1rc1...v0.9.2rc1
v0.9.1rc1
This is the 1st release candidate of v0.9.1 for vLLM Ascend. Please follow the official doc to get started.
Experimental
- Atlas 300I series is experimentally supported in this release (functional tests passed with Qwen2.5-7b-instruct/Qwen2.5-0.5b/Qwen3-0.6B/Qwen3-4B/Qwen3-8B). #1333
- Support EAGLE-3 for speculative decoding. #1032
After careful consideration, the above features will NOT be included in the v0.9.1-dev branch (v0.9.1 final release), taking into account the v0.9.1 release quality and the rapid iteration of these features. We will improve this from v0.9.2rc1 onward.
Core
- Ascend PyTorch adapter (torch_npu) has been upgraded to 2.5.1.post1.dev20250528. Don't forget to update it in your environment. #1235
- Support Atlas 300I series container image. You can get it from quay.io.
- Fix token-wise padding mechanism to make multi-card graph mode work. #1300
- Upgrade vLLM to 0.9.1. #1165
Other Improvements
- Initial support for Chunked Prefill for MLA. #1172
- An example of best practices to run DeepSeek with ETP has been added. #1101
- Performance improvements for DeepSeek using the TorchAir graph. #1098, #1131
- Supports the speculative decoding feature with AscendScheduler. #943
- Improve VocabParallelEmbedding custom op performance. It will be enabled in the next release. #796
- Fixed a device discovery and setup bug when running vLLM Ascend on Ray #884
- DeepSeek with MC2 (Merged Compute and Communication) now works properly. #1268
- Fixed log2phy NoneType bug with static EPLB feature. #1186
- Improved performance for DeepSeek with DBO enabled. #997, #1135
- Refactoring AscendFusedMoE #1229
- Add initial user stories page (include LLaMA-Factory/TRL/verl/MindIE Turbo/GPUStack) #1224
- Add unit test framework #1201
Known Issues
- In some cases, the vLLM process may crash with a GatherV3 error when aclgraph is enabled. We are working on this issue and will fix it in the next release. #1038
- The prefix cache feature does not work with the Ascend Scheduler when chunked prefill is not enabled. This will be fixed in the next release. #1350
New Contributors
- @farawayboat made their first contribution in #1333
- @yzim made their first contribution in #1159
- @chenwaner made their first contribution in #1098
- @wangyanhui-cmss made their first contribution in #1184
- @songshanhu07 made their first contribution in #1186
- @yuancaoyaoHW made their first contribution in #1032
Full Changelog: v0.9.0rc2...v0.9.1rc1
v0.9.0rc2
This is the 2nd official release candidate of v0.9.0 for vllm-ascend. Please follow the official doc to start the journey. From this release, the V1 Engine is recommended. The code of the V0 Engine is frozen and will not be maintained any more. Please set the environment variable VLLM_USE_V1=1 to enable the V1 Engine.
Highlights
- DeepSeek works with graph mode now. Follow the official doc to take a try. #789
- Qwen series models work with graph mode now. It works by default with the V1 Engine. Please note that in this release, only Qwen series models are well tested with graph mode. We'll make it stable and more general in the next release. If you hit any issues, please feel free to open an issue on GitHub and fall back to eager mode temporarily by setting enforce_eager=True when initializing the model (see the sketch below).
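A minimal sketch of the eager-mode fallback mentioned above; the model name is a placeholder.

```python
# Temporarily fall back to eager mode if graph mode misbehaves.
from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enforce_eager=True)  # placeholder model
print(llm.generate(["Hello"])[0].outputs[0].text)
```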
Core
- The performance of multi-step scheduler has been improved. Thanks for the contribution from China Merchants Bank. #814
- LoRA, Multi-LoRA and dynamic serving are supported for the V1 Engine now. Thanks for the contribution from China Merchants Bank. #893
- Prefix cache and chunked prefill features work now #782 #844
- Spec decode and MTP features work with V1 Engine now. #874 #890
- DP feature works with DeepSeek now. #1012
- Input embedding feature works with V0 Engine now. #916
- Sleep mode feature works with V1 Engine now. #1084
Model
- Qwen2.5 VL works with V1 Engine now. #736
- LLama4 works now. #740
- A new DeepSeek execution mode called dual-batch overlap (DBO) is added. Please set VLLM_ASCEND_ENABLE_DBO=1 to use it. #941
Other
- Online serving with Ascend quantization works now. #877
- A batch of bugs for graph mode and MoE models have been fixed. #773 #771 #774 #816 #817 #819 #912 #897 #961 #958 #913 #905
- A batch of performance improvement PRs have been merged. #784 #803 #966 #839 #970 #947 #987 #1085
- From this release, a binary wheel package will be released as well. #775
- The contributor doc site is added.
Known Issue
- In some cases, the vLLM process may crash with aclgraph enabled. We're working on this issue and it'll be fixed in the next release. #1038
- Multi-node data parallel doesn't work with this release. This is a known issue in vLLM and has been fixed on the main branch. #18981
New Contributors
- @chris668899 made their first contribution in #771
- @NeverRaR made their first contribution in #789
- @cxcxflying made their first contribution in #740
- @22dimensions made their first contribution in #835
- @wonderful199082 made their first contribution in #814
- @yangpuPKU made their first contribution in #937
- @ttanzhiqiang made their first contribution in #909
- @ponix-j made their first contribution in #874
- @XWFAlone made their first contribution in #890
- @NINGBENZHE made their first contribution in #896
- @momo609 made their first contribution in #970
- @David9857 made their first contribution in #947
- @depeng1994 made their first contribution in #1013
- @hahazhky made their first contribution in #987
- @weijinqian0 made their first contribution in #1067
- @sdmyzlp made their first contribution in #1091
- @zxdukki made their first contribution in #941
- @ChenTaoyu-SJTU made their first contribution in #736
- @Yuxiao-Xu made their first contribution in #1116
v0.9.0rc1
Just a pre-release for 0.9.0. There are still some known bugs in this release.
v0.7.3.post1
This is the first post release of 0.7.3. Please follow the official doc to start the journey. It includes the following changes:
Highlights
- Qwen3 and Qwen3 MoE are supported now. The performance and accuracy of Qwen3 are well tested. You can try it now. MindIE Turbo is recommended to improve the performance of Qwen3. #903 #915
- Added a new performance guide. The guide aims to help users improve vllm-ascend performance at the system level. It includes OS configuration, library optimization, deployment guide and so on. #878 Doc Link
Bug Fix
- Qwen2.5-VL works for RLHF scenarios now. #928
- Users can launch the model from online weights now, e.g., from Hugging Face or ModelScope directly. #858 #918
- The meaningless log info UserWorkspaceSize0 has been cleaned up. #911
- The log level for Failed to import vllm_ascend_C has been changed to warning instead of error. #956
- DeepSeek MLA now works with chunked prefill in the V1 Engine. Please note that the V1 engine in 0.7.3 is just experimental and only for test usage. #849 #936