
Commit ae24353

fix parser

Merge commit (2 parents: bd192b2 + 6265f43)

175 files changed: 11461 additions, 1082 deletions


.github/workflows/_base_test.yml

Lines changed: 2 additions & 1 deletion
```diff
@@ -143,7 +143,8 @@ jobs:
           -v "${CACHE_DIR}/ConfigDir:/root/.config" \
           -e TZ="Asia/Shanghai" \
           --gpus '"device='"${DEVICES}"'"' ${docker_image} /bin/bash -xc '
-          python -m pip install --pre paddlepaddle-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/cu126/
+          # python -m pip install --pre paddlepaddle-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/cu126/
+          python -m pip install paddlepaddle-gpu==3.3.0.dev20250917 -i https://www.paddlepaddle.org.cn/packages/nightly/cu126/

           pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
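Note: the test workflow now pins an exact nightly wheel instead of taking whatever `--pre` build is current. A minimal sketch for confirming that the pinned build is the one actually imported (the version string comes from the pin above; both calls are standard Paddle APIs):

```python
# Sketch: verify the pinned PaddlePaddle nightly is what actually got installed.
import paddle

print(paddle.__version__)  # expected to match the pinned 3.3.0.dev20250917
paddle.utils.run_check()   # sanity-check that the CUDA build can run a kernel
```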

.github/workflows/_build_linux.yml

Lines changed: 7 additions & 1 deletion
```diff
@@ -106,7 +106,12 @@ jobs:
         CARD_ID=$(echo "${runner_name}" | awk -F'-' '{print $NF}')
         gpu_id=$(echo "$CARD_ID" | fold -w1 | paste -sd,)

-        CACHE_DIR="${CACHE_DIR:-$(dirname "$(dirname "${{ github.workspace }}")")}"
+        IFS='/' read -ra parts <<< "${GITHUB_WORKSPACE}"
+        len=${#parts[@]}
+        CCACHE_DEFAULT_DIR="/$(IFS=/; echo "${parts[*]:1:$((len-5))}")"
+        echo "$CCACHE_DEFAULT_DIR"
+
+        CACHE_DIR="${CACHE_DIR:-$CCACHE_DEFAULT_DIR}"
         echo "CACHE_DIR is set to ${CACHE_DIR}"
         if [ ! -f "${CACHE_DIR}/gitconfig" ]; then
           touch "${CACHE_DIR}/gitconfig"
@@ -127,6 +132,7 @@ jobs:
           -e "PADDLEVERSION=${PADDLEVERSION}" \
           -e "PADDLE_WHL_URL=${PADDLE_WHL_URL}" \
           -e "BRANCH_REF=${BRANCH_REF}" \
+          -e "CCACHE_MAXSIZE=50G" \
           --gpus "\"device=${gpu_id}\"" ${docker_image} /bin/bash -c '
             if [[ -n "${FD_VERSION}" ]]; then
               export FASTDEPLOY_VERSION=${FD_VERSION}
```
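Note: the new default derives the cache root by dropping the last four path components of `GITHUB_WORKSPACE`, where the old expression only climbed two levels via nested `dirname` calls. A minimal Python sketch of the same arithmetic, using a hypothetical runner path:

```python
# Sketch of the bash above: CCACHE_DEFAULT_DIR is GITHUB_WORKSPACE with its
# last four path components stripped. The workspace path here is hypothetical.
from pathlib import PurePosixPath

workspace = PurePosixPath("/home/runner/actions-runner/_work/FastDeploy/FastDeploy")
print(workspace.parents[3])  # -> /home/runner, the shared cache root
```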

README.md

Lines changed: 2 additions & 1 deletion
```diff
@@ -43,7 +43,7 @@ English | [简体中文](README_CN.md)
 - 🤝 **OpenAI API Server and vLLM Compatible**: One-command deployment with [vLLM](https://github.com/vllm-project/vllm/) interface compatibility.
 - 🧮 **Comprehensive Quantization Format Support**: W8A16, W8A8, W4A16, W4A8, W2A16, FP8, and more.
 -**Advanced Acceleration Techniques**: Speculative decoding, Multi-Token Prediction (MTP) and Chunked Prefill.
-- 🖥️ **Multi-Hardware Support**: NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Ascend NPU, Iluvatar GPU, Enflame GCU, MetaX GPU etc.
+- 🖥️ **Multi-Hardware Support**: NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Ascend NPU, Iluvatar GPU, Enflame GCU, MetaX GPU, Intel Gaudi etc.

 ## Requirements

@@ -60,6 +60,7 @@ FastDeploy supports inference deployment on **NVIDIA GPUs**, **Kunlunxin XPUs**,
 - [Enflame GCU](./docs/get_started/installation/Enflame_gcu.md)
 - [Hygon DCU](./docs/get_started/installation/hygon_dcu.md)
 - [MetaX GPU](./docs/get_started/installation/metax_gpu.md)
+- [Intel Gaudi](./docs/get_started/installation/intel_gaudi.md)

 **Note:** We are actively working on expanding hardware support. Additional hardware platforms including Ascend NPU are currently under development and testing. Stay tuned for updates!
```

README_CN.md

Lines changed: 2 additions & 1 deletion
```diff
@@ -41,7 +41,7 @@
 - 🤝 **OpenAI API服务与vLLM兼容**：单命令部署，兼容[vLLM](https://github.com/vllm-project/vllm/)接口
 - 🧮 **全量化格式支持**：W8A16、W8A8、W4A16、W4A8、W2A16、FP8等
 -**高级加速技术**：推测解码、多令牌预测(MTP)及分块预填充
-- 🖥️ **多硬件支持**：NVIDIA GPU、昆仑芯XPU、海光DCU、昇腾NPU、天数智芯GPU、燧原GCU、沐曦GPU等
+- 🖥️ **多硬件支持**：NVIDIA GPU、昆仑芯XPU、海光DCU、昇腾NPU、天数智芯GPU、燧原GCU、沐曦GPU、英特尔Gaudi等

 ## 要求

@@ -58,6 +58,7 @@ FastDeploy 支持在**英伟达(NVIDIA)GPU**、**昆仑芯(Kunlunxin)XPU
 - [燧原 S60](./docs/zh/get_started/installation/Enflame_gcu.md)
 - [海光 DCU](./docs/zh/get_started/installation/hygon_dcu.md)
 - [沐曦 GPU](./docs/zh/get_started/installation/metax_gpu.md)
+- [英特尔 Gaudi](./docs/zh/get_started/installation/intel_gaudi.md)

 **注意：** 我们正在积极拓展硬件支持范围。目前，包括昇腾(Ascend)NPU 等其他硬件平台正在开发测试中。敬请关注更新！
```

Lines changed: 5 additions & 0 deletions
```diff
@@ -0,0 +1,5 @@
+max_model_len: 32768
+max_num_seqs: 128
+tensor_parallel_size: 4
+use_cudagraph: True
+load_choices: "default_v1"
```
Lines changed: 6 additions & 0 deletions
```diff
@@ -0,0 +1,6 @@
+max_model_len: 32768
+max_num_seqs: 128
+tensor_parallel_size: 4
+use_cudagraph: True
+load_choices: "default_v1"
+quantization: wfp8afp8
```
Lines changed: 8 additions & 0 deletions
```diff
@@ -0,0 +1,8 @@
+top_p: 0.95
+temperature: 0.6
+metadata:
+  min_tokens: 1
+max_tokens: 12288
+repetition_penalty: 1.0
+frequency_penalty: 0
+presence_penalty: 0
```
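Note: the three new YAML files read as deployment configs, two with engine settings and one with sampling defaults; their paths are not shown in this commit view. A minimal sketch for sanity-checking such a config before use (the file name is a hypothetical stand-in):

```python
# Sketch: parse one of the new engine configs and check the key types.
# "benchmark_config.yaml" is a hypothetical name; the real path is not
# visible in this commit view.
import yaml

with open("benchmark_config.yaml") as f:
    cfg = yaml.safe_load(f)

assert isinstance(cfg["max_model_len"], int)
assert isinstance(cfg["tensor_parallel_size"], int)
assert cfg["use_cudagraph"] is True  # YAML "True" loads as a boolean
print(cfg)
```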

build.sh

Lines changed: 9 additions & 1 deletion
```diff
@@ -128,6 +128,12 @@ function copy_ops(){
         echo -e "MACA ops have been copy to fastdeploy"
         return
     fi
+    is_intel_hpu=`$python -c "import paddle; print(paddle.is_compiled_with_custom_device('intel_hpu'))"`
+    if [ "$is_intel_hpu" = "True" ]; then
+        DEVICE_TYPE="intel-hpu"
+        echo -e "intel_hpu ops have been copy to fastdeploy"
+        return
+    fi

     DEVICE_TYPE="cpu"
     cd ../../../../
@@ -159,7 +165,9 @@ function build_and_install_ops() {
     else
         FD_BUILDING_ARCS=${FD_BUILDING_ARCS} ${python} setup_ops.py install --install-lib ${OPS_TMP_DIR}
     fi
-    find ${OPS_TMP_DIR} -type f -name "*.o" -exec rm -f {} \;
+    if [ -d "${OPS_TMP_DIR}" ]; then
+        find ${OPS_TMP_DIR} -type f -name "*.o" -exec rm -f {} \;
+    fi
 else
     echo "Error: Invalid parameter '$FD_CPU_USE_BF16'. Please use true or false."
     exit 1
```
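Note: `copy_ops()` now probes the Paddle build for the `intel_hpu` custom device the same way the earlier backend branches do. The probe in isolation (`paddle.is_compiled_with_custom_device` is the Paddle API the script shells out to):

```python
# Sketch of the probe build.sh runs to pick the ops target.
import paddle

if paddle.is_compiled_with_custom_device("intel_hpu"):
    device_type = "intel-hpu"  # matches DEVICE_TYPE in build.sh
else:
    device_type = "cpu"        # build.sh falls through to the CPU path
print(device_type)
```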

custom_ops/gpu_ops/append_attn/encoder_write_cache_with_rope_impl.cuh

Lines changed: 2 additions & 1 deletion
```diff
@@ -1004,7 +1004,8 @@ __global__ void cache_kernel(
   const uint32_t qkv_bias = bias % hidden_size;
   const uint32_t hi = qkv_bias / head_size;
   const uint32_t h_bias = qkv_bias % head_size;
-  const uint32_t ori_bi = batch_id_per_token[token_idx];
+  const int32_t ori_bi = batch_id_per_token[token_idx];
+  if (ori_bi == -1) continue;  // skip batch_id_per_token[token_idx]=-1
   if (seq_lens[ori_bi] == 0) continue;
   const uint32_t ori_seq_id = (token_idx - cu_seqlens_q[ori_bi]) + seq_lens_decoder[ori_bi];
```

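Note: the type change is the substance of this fix. The old code had no guard, so a -1 padding marker stored in a `uint32_t` index wrapped to `UINT32_MAX` and indexed `seq_lens` far out of range; reading it as `int32_t` keeps the sentinel negative and lets the new check skip padded tokens. A quick illustration of the wrap:

```python
# Sketch: why the sentinel needs a signed type. ctypes mimics the C casts.
import ctypes

print(ctypes.c_uint32(-1).value)  # 4294967295, a wildly out-of-range index
print(ctypes.c_int32(-1).value)   # -1, so the `ori_bi == -1` guard works
```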
File renamed without changes.
