[XPU][CI] update xpu ci #7514
Thanks for your contribution!
Pull request overview
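This PR mainly enhances the XPU CI pipeline: it adds an XPU unit test job, collects and aggregates coverage data from multiple sources, and uploads the coverage results (BOS + Codecov). It also adjusts the timeout setting of the Metax CI.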
Changes:
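- Adds an XPU unit test workflow (`_xpu_unit_test.yml`) and wires it into `ci_xpu.yml`
- Adds coverage collection and upload to the XPU 4-card/8-card case tests and the unit tests, and adds a coverage aggregation / incremental check workflow (`_xpu_coverage_report.yml`)
- Adds an XPU coverage configuration file (`.coveragerc_xpu`) and XPU-related pytest configuration files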
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
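| tests/xpu_ci/unit_test/pytest.ini | Adds the pytest configuration entry for the XPU model functional unit tests |
| scripts/.coveragerc_xpu | Adds the XPU coverage collection/combine configuration |
| custom_ops/xpu_ops/test/pytest.ini | Adds a pytest ignore list for the XPU custom operator unit tests |
| .github/workflows/ci_xpu.yml | Wires in two new jobs: XPU unit test and coverage report |
| .github/workflows/ci_metax.yml | Adjusts the Metax Jenkins step timeout |
| .github/workflows/_xpu_unit_test.yml | Adds a reusable workflow that runs the XPU unit tests and uploads coverage |
| .github/workflows/_xpu_coverage_report.yml | Adds a workflow that aggregates coverage, runs the diff-cover incremental check, and uploads to Codecov |
| .github/workflows/_xpu_8cards_case_test.yml | Adds coverage collection and upload outputs to the 8-card case test |
| .github/workflows/_xpu_4cards_case_test.yml | Adds coverage collection and upload outputs to the 4-card case test |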
| xpu_unit_test: | ||
| name: xpu_unit_test | ||
| needs: [clone, xpu_build_test] | ||
| uses: ./.github/workflows/_xpu_unit_test.yml | ||
| with: | ||
| FASTDEPLOY_ARCHIVE_URL: ${{ needs.clone.outputs.repo_archive_url }} | ||
| DOCKER_IMAGE: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:ci | ||
| FASTDEPLOY_WHEEL_URL: ${{ needs.xpu_build_test.outputs.wheel_path }} | ||
| secrets: |
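The PR description is still the template content (Motivation/Modifications/Usage/Accuracy Tests are not filled in), which makes it harder for CI maintainers to understand the purpose of the change and how it was verified. Please add: why the XPU unit test and coverage jobs are being added, the coverage threshold policy, and the reason for skipping test cases along with a plan to re-enable them.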
| @@ -0,0 +1,3 @@ | |||
| [pytest] | |||
| # 跳过以下模型功能单测(当前存在已知问题) | |||
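The comment here says "跳过以下模型功能单测" ("skip the following model functional unit tests"), but addopts is currently empty and no --ignore or marker is configured, so pytest will not skip any test case at run time. Either list the specific test cases/directories to skip, or remove the comment to avoid misleading readers.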
| # 跳过以下模型功能单测(当前存在已知问题) |
| -e "PADDLEVERSION=${PADDLEVERSION}" \ | ||
| -e "PADDLE_WHL_URL=${PADDLE_WHL_URL}" \ |
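workflow_call defines the PADDLEVERSION/PADDLE_WHL_URL inputs, but what gets passed into the container here is the host environment variables ${PADDLEVERSION}/${PADDLE_WHL_URL}. Unless they are explicitly exported, they are always empty, so the inputs passed by the caller have no effect. Consider taking the values from inputs (for example, by injecting them through the step env) or removing these two inputs.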
| echo "Download failed, skipping upload." | ||
| exit 0 | ||
| fi | ||
| sed -i 's||<source>/workspace/FastDeploy/fastdeploy</source>|>|<source>fastdeploy</source>|' xpu_coverage_all.xml |
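The sed -i substitution expression is malformed (the pattern is currently empty and there are extra delimiters). Under GitHub Actions' default bash -e, this step fails immediately, which fails the coverage upload and the whole job. Following _unit_test_coverage.yml in this repository, use a correct sed expression that replaces <source>/workspace/FastDeploy/fastdeploy</source> with <source>fastdeploy</source>.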
| sed -i 's||<source>/workspace/FastDeploy/fastdeploy</source>|>|<source>fastdeploy</source>|' xpu_coverage_all.xml | |
| sed -i 's|<source>/workspace/FastDeploy/fastdeploy</source>|<source>fastdeploy</source>|' xpu_coverage_all.xml |
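For reference, the intended substitution is a literal replacement of a single `<source>` element. A minimal Python sketch of the same rewrite is shown below; the function name `normalize_source_path` is hypothetical and only illustrates the expected result, it is not part of the workflow.

```python
def normalize_source_path(xml_text: str) -> str:
    """Replace the container-absolute <source> entry with a repo-relative one."""
    old = "<source>/workspace/FastDeploy/fastdeploy</source>"
    new = "<source>fastdeploy</source>"
    # Plain string replacement: no regex, so no delimiter or escaping pitfalls.
    return xml_text.replace(old, new)


sample = "<sources><source>/workspace/FastDeploy/fastdeploy</source></sources>"
result = normalize_source_path(sample)
print(result)
```

Because the replacement is literal, text that does not contain the absolute path is returned unchanged.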
| -e "PADDLEVERSION=${PADDLEVERSION}" \ | ||
| -e "PADDLE_WHL_URL=${PADDLE_WHL_URL}" \ |
Here -e "PADDLEVERSION=${PADDLEVERSION}" / PADDLE_WHL_URL are passed to Docker, but the current step's env: does not map the workflow_call inputs.PADDLEVERSION / inputs.PADDLE_WHL_URL to environment variables of the same name. Both are therefore usually empty, which makes the configurable inputs ineffective. Assign them explicitly from ${{ inputs.* }} in the step env, or reference ${{ inputs.PADDLEVERSION }} / ${{ inputs.PADDLE_WHL_URL }} directly in docker run -e.
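The failure mode described here is that a variable missing from the step environment expands to an empty string, so the container receives `-e NAME=`. The sketch below reproduces that behaviour and reports which names are empty; `docker_env_flags` is a hypothetical helper written for illustration, not something the workflow uses.

```python
import os


def docker_env_flags(names, env=None):
    """Build `docker run -e` flags and list the names whose value is empty."""
    env = os.environ if env is None else env
    flags, missing = [], []
    for name in names:
        value = env.get(name, "")  # an unset variable behaves like an empty string
        if not value:
            missing.append(name)
        flags += ["-e", f"{name}={value}"]
    return flags, missing


# Simulated step environment in which only one of the two inputs was mapped.
step_env = {"PADDLEVERSION": "3.0"}
flags, missing = docker_env_flags(["PADDLEVERSION", "PADDLE_WHL_URL"], step_env)
print(flags, missing)
```

A check of this kind before `docker run` would surface an unmapped input instead of passing an empty value through silently.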
| echo "============================开始运行XPU自定义算子单测============================" | ||
| set +e | ||
| COVERAGE_FILE=/workspace/FastDeploy/coveragedata/.coverage.unit_ops \ | ||
| python -m coverage run --rcfile=${COVERAGE_RCFILE} -m pytest -v -s --tb=short custom_ops/xpu_ops/test/test_adjust_batch_and_gather_next_token.py |
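pytest is currently invoked with a single test file path, so it will not automatically pick up the addopts/--ignore configured in custom_ops/xpu_ops/test/pytest.ini (unless -c is passed explicitly or pytest is run from that directory). If the skip list for known failures is meant to take effect, adjust the command to load that configuration file explicitly, or run the tests at directory level and control them uniformly through the configuration.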
| python -m coverage run --rcfile=${COVERAGE_RCFILE} -m pytest -v -s --tb=short custom_ops/xpu_ops/test/test_adjust_batch_and_gather_next_token.py | |
| python -m coverage run --rcfile=${COVERAGE_RCFILE} -m pytest -c custom_ops/xpu_ops/test/pytest.ini -v -s --tb=short custom_ops/xpu_ops/test/test_adjust_batch_and_gather_next_token.py |
| # 跳过以下模型功能单测(当前存在已知问题) | ||
| addopts = |
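The comment in this pytest.ini says "跳过以下模型功能单测" ("skip the following model functional unit tests"), but addopts is empty, so no test case is actually skipped, which can mislead future maintainers. Either add concrete --ignore/markers configuration, or delete both the comment and the empty addopts. In addition, by repository convention, comments should be written in English.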
| # 跳过以下模型功能单测(当前存在已知问题) | |
| addopts = |
| @@ -0,0 +1,25 @@ | |||
| [pytest] | |||
| # 跳过以下自定义算子单测(当前存在已知问题) | |||
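This comment is in Chinese. By repository convention, comments in code and configuration files should be in English, so that cross-team collaborators can read them without extra effort. Change the comment to English and, where needed, add the reason for skipping or a linked issue.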
| # 跳过以下自定义算子单测(当前存在已知问题) | |
| # Skip the following custom operator unit tests due to known issues. |
| xpu_unit_test: | ||
| name: xpu_unit_test | ||
| needs: [clone, xpu_build_test] |
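The PR description is still essentially the template and does not state the motivation, the specific changes, or the expected impact of this CI/XPU adjustment (for example, why the unit_test and coverage jobs are added, and where the coverage threshold comes from). Please fill in Motivation/Modifications/Usage (or Command) in the PR description to make later review and traceability easier.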
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## develop #7514 +/- ##
=========================================
Coverage ? 7.20%
=========================================
Files ? 458
Lines ? 63466
Branches ? 9719
=========================================
Hits ? 4572
Misses ? 58805
Partials ? 89
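The figures in this report are internally consistent, which a few lines of arithmetic can confirm. The sketch assumes the usual ratio of hits over hits + misses + partials; the variable names are illustrative.

```python
# Numbers copied from the Codecov report above.
hits, misses, partials = 4572, 58805, 89

lines = hits + misses + partials   # total tracked lines
coverage_pct = 100 * hits / lines  # share of lines that were hit

print(lines, f"{coverage_pct:.2f}%")  # prints: 63466 7.20%
```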
| xpu_unit_test: | ||
| name: xpu_unit_test | ||
| needs: [clone, xpu_build_test] | ||
| uses: ./.github/workflows/_xpu_unit_test.yml | ||
| with: |
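The PR description is still template placeholder text; key information such as Motivation/Modifications/Usage/Accuracy Tests is empty, which hampers review of the CI change and later troubleshooting. Please add: why the XPU CI is being adjusted, which jobs are added or changed (unit test/coverage), and how to reproduce and debug failures locally.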
| [pytest] | ||
| # 跳过以下自定义算子单测(当前存在已知问题) | ||
| addopts = | ||
| --ignore=custom_ops/xpu_ops/test/test_adjust_batch_and_gather_next_token.py | ||
| --ignore=custom_ops/xpu_ops/test/test_adjust_batch_and_recover_batch_sequence.py |
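When this pytest.ini sits under custom_ops/xpu_ops/test/, it only takes effect if pytest uses that directory as rootdir or is explicitly passed -c custom_ops/xpu_ops/test/pytest.ini. _xpu_unit_test.yml currently runs pytest from the repository root, so the configuration is very likely not loaded, which makes the --ignore list here a no-op. Add -c in the workflow to point at this file, or cd into the directory before running pytest.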
| echo "============================开始运行XPU模型功能单测============================" | ||
| set +e | ||
| COVERAGE_FILE=/workspace/FastDeploy/coveragedata/.coverage.unit_model \ | ||
| python -m coverage run --rcfile=${COVERAGE_RCFILE} -m pytest -v -s --tb=short tests/xpu_ci/unit_test/ |
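When pytest is run directly from the repository root, tests/xpu_ci/unit_test/pytest.ini is generally not discovered/loaded automatically, so the addopts configured here (and the intent to skip test cases) may not take effect. Either add -c tests/xpu_ci/unit_test/pytest.ini to the command explicitly (or cd into that directory first), or move the required pytest configuration to a root-level location that pytest can discover.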
| python -m coverage run --rcfile=${COVERAGE_RCFILE} -m pytest -v -s --tb=short tests/xpu_ci/unit_test/ | |
| python -m coverage run --rcfile=${COVERAGE_RCFILE} -m pytest -c tests/xpu_ci/unit_test/pytest.ini -v -s --tb=short tests/xpu_ci/unit_test/ |
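Whether a nested `pytest.ini` is loaded can be checked rather than assumed. The sketch below approximates the lookup described in pytest's rootdir documentation when no `-c` is given: take the common ancestor of the path arguments, then search that directory and its parents. It is a simplification that considers only `pytest.ini` (pytest also inspects `pyproject.toml`, `tox.ini`, and `setup.cfg`), and `find_pytest_ini` is a hypothetical name.

```python
import os
import tempfile
from pathlib import Path


def find_pytest_ini(args, cwd="."):
    """Return the first pytest.ini at or above the common ancestor of args."""
    base = Path(cwd).resolve()
    dirs = []
    for arg in args:
        path = (base / arg).resolve()
        if path.exists():
            # A file argument contributes its parent directory.
            dirs.append(str(path if path.is_dir() else path.parent))
    start = Path(os.path.commonpath(dirs)) if dirs else base
    for directory in [start, *start.parents]:
        candidate = directory / "pytest.ini"
        if candidate.is_file():
            return candidate
    return None


# Throwaway directory tree that mirrors the layout discussed in this thread.
with tempfile.TemporaryDirectory() as repo:
    test_dir = Path(repo) / "tests" / "xpu_ci" / "unit_test"
    test_dir.mkdir(parents=True)
    (test_dir / "pytest.ini").write_text("[pytest]\naddopts =\n")
    found = find_pytest_ini(["tests/xpu_ci/unit_test/"], cwd=repo)
    expected = (test_dir / "pytest.ini").resolve()

print(found == expected)
```

In a real run, the `configfile:` entry in pytest's session header shows which file was actually selected, so the CI log is the authoritative place to confirm this.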
| secrets: | ||
| github-token: ${{ secrets.github-token }} | ||
| with: | ||
| workflow-name: xpu_coverage |
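The workflow-name of check-bypass determines the string that skip-ci labels/directives are matched against. It is xpu_coverage here, but the job is named xpu_coverage_report in ci_xpu.yml. If the team usually writes skip-ci by job name (for example, skip-ci: xpu_coverage_report), it will not take effect. Unify the naming (keep workflow-name consistent with the calling job name), or state the skip-ci name to use explicitly in the file.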
| workflow-name: xpu_coverage | |
| workflow-name: xpu_coverage_report |
| -e "FASTDEPLOY_ARCHIVE_URL=${fd_archive_url}" \ | ||
| -e "FASTDEPLOY_WHEEL_URL=${fd_wheel_url}" \ | ||
| -e "PADDLEVERSION=${PADDLEVERSION}" \ | ||
| -e "PADDLE_WHL_URL=${PADDLE_WHL_URL}" \ | ||
| -e "http_proxy=$(git config --global --get http.proxy)" \ |
| COV_FILES=$(find coveragedata/ -name ".coverage*" -type f 2>/dev/null | head -1) | ||
| if [[ -z "${COV_FILES}" ]]; then | ||
| echo "没有找到任何覆盖率数据,跳过合并和检查" | ||
| chmod a+r /workspace/FastDeploy/xpu_coverage.env | ||
| exit 0 | ||
| fi |
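The guard above exits early when no `.coverage*` data file exists under `coveragedata/`. A minimal Python sketch of the same check follows, assuming the same naming pattern; `find_coverage_files` is a hypothetical helper, and the temporary directory only stands in for the real data directory.

```python
import tempfile
from pathlib import Path


def find_coverage_files(root):
    """Return every regular file under root whose name starts with .coverage."""
    return sorted(p for p in Path(root).rglob(".coverage*") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "coveragedata"
    data_dir.mkdir()
    empty_result = find_coverage_files(data_dir)  # nothing collected yet
    (data_dir / ".coverage.unit_ops").write_text("")
    found_names = [p.name for p in find_coverage_files(data_dir)]

print(empty_result, found_names)
```

An empty result corresponds to the branch that skips merging and checking.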
| cd /paddle/sjx_cuda12.6_py310/fd_test/FastDeploy | ||
| export MODEL_PATH=/your/model/path | ||
| export XPU_ID=0 | ||
| export PYTHONPATH=$(pwd):$(pwd)/tests/xpu_ci:$PYTHONPATH |
| --ignore=test_get_token_penalty_multi_scores.py | ||
| --ignore=test_moe_topk_select.py | ||
| --ignore=test_read_data_ipc.py | ||
| --ignore=test_set_get_data_ipc.py | ||
| --ignore=test_token_repetition_penalty.py |
| -e "FASTDEPLOY_ARCHIVE_URL=${fd_archive_url}" \ | ||
| -e "FASTDEPLOY_WHEEL_URL=${fd_wheel_url}" \ | ||
| -e "PADDLEVERSION=${PADDLEVERSION}" \ | ||
| -e "PADDLE_WHL_URL=${PADDLE_WHL_URL}" \ | ||
| -e "http_proxy=$(git config --global --get http.proxy)" \ |
| import os | ||
|
|
||
| collect_ignore_glob = [ | ||
| "test_moe_topk_select.py", | ||
| "test_token_repetition_penalty.py", | ||
| "test_moe_redundant_topk_select.py", | ||
| "test_get_token_penalty_multi_scores.py", | ||
| "test_speculate_get_token_penalty_multi_scores.py", | ||
| "test_speculate_limit_thinking_content_length.py", | ||
| "test_speculate_get_padding_offset.py", | ||
| "test_speculate_schedule_cache.py", | ||
| "test_speculate_verify.py", | ||
| "test_adjust_batch_and_gather_next_token.py", | ||
| "test_unified_update_model_status.py", | ||
| "test_draft_model_update.py", | ||
| "test_set_data_ipc.py", | ||
| "test_read_data_ipc.py", | ||
| "test_set_get_data_ipc.py", | ||
| "test_draft_model_preprocess.py", | ||
| ] | ||
|
|
||
| _this_dir = os.path.dirname(os.path.abspath(__file__)) | ||
| collect_ignore = [os.path.join(_this_dir, f) for f in collect_ignore_glob] |
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-12 11:03:17
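📋 Review Summary
PR overview: Adds a code coverage collection and reporting pipeline to the XPU CI, including a unit test workflow, coverage aggregation, and Codecov upload, and shortens the MetaX CI timeout.
Scope of changes: .github/workflows/ (XPU/MetaX CI), scripts/.coveragerc_xpu, custom_ops/xpu_ops/test/, tests/xpu_ci/
Impact tags: [CI] [XPU]
📝 PR Convention Check
The Usage or Command and Accuracy Tests sections currently contain only HTML comment placeholders, which does not meet the template requirements (they should contain N/A or actual content). The title [XPU][CI] update xpu ci uses two valid tags, and its format is acceptable.
Suggested PR description (can be copied as-is):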
## Motivation
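Add code coverage collection and reporting to the XPU CI to improve the observability of XPU code quality. Also add an XPU unit test workflow, merge the coverage data from the unit test, 4-card, and 8-card stages, upload it to Codecov, and set an 80% incremental coverage threshold.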
## Modifications
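- Add `_xpu_unit_test.yml`: workflow for XPU custom operator unit tests + model functional unit tests
- Add `_xpu_coverage_report.yml`: merge coverage data from each stage, generate the report, upload to Codecov
- Modify `_xpu_4cards_case_test.yml` / `_xpu_8cards_case_test.yml`: add coverage collection and BOS upload
- Modify `ci_xpu.yml`: chain the unit_test and coverage_report jobs
- Add `scripts/.coveragerc_xpu`: XPU-specific coverage configuration
- Add `custom_ops/xpu_ops/test/pytest.ini`: skip XPU operator unit tests with known issues
- Modify `ci_metax.yml`: shorten the MetaX CI timeout from 120 minutes to 60 minutes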
## Usage or Command
N/A
## Accuracy Tests
N/A (this PR only touches CI configuration and test infrastructure and does not affect model output)
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
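- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | scripts/.coveragerc_xpu:5 | patch = subprocess is not a valid coverage.py configuration key and will be silently ignored |
| ❓ Question | .github/workflows/_xpu_unit_test.yml:8 | required: true is semantically contradictory with default: |
Overall Assessment
The overall structure is clear and the coverage collection pipeline is reasonably designed (three parallel sources → aggregation → diff-cover gate → Codecov). There is one configuration key error (patch = subprocess) whose intent should be confirmed before it is fixed or removed. The remaining items are minor consistency issues and do not block merging.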
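The gate in that pipeline compares incremental coverage against the 80% threshold named in the PR description. As an illustration of the idea only (this is not diff-cover's implementation, and both function names are hypothetical), the check can be sketched as:

```python
def diff_coverage(changed_lines, covered_lines):
    """Percentage of changed lines that also appear in the covered set."""
    changed = set(changed_lines)
    if not changed:
        return 100.0  # nothing changed, so there is nothing left uncovered
    return 100.0 * len(changed & set(covered_lines)) / len(changed)


def passes_gate(changed_lines, covered_lines, threshold=80.0):
    """True when incremental coverage meets or exceeds the threshold."""
    return diff_coverage(changed_lines, covered_lines) >= threshold


changed = [10, 11, 12, 13, 14]
covered = [10, 11, 12, 13, 40]  # line 14 was changed but never executed
pct = diff_coverage(changed, covered)
print(pct, passes_gate(changed, covered))
```

With four of the five changed lines covered, the example sits exactly on the threshold and passes; one more uncovered line would fail the gate.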
| branch = True | ||
| source = fastdeploy | ||
| concurrency = multiprocessing | ||
| patch = subprocess |
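🟡 Suggestion: patch = subprocess is not a valid configuration key in the [run] section of coverage.py and is silently ignored at run time.
If the goal is to track coverage in subprocesses, the correct approach is to let subprocesses start coverage automatically through the COVERAGE_PROCESS_START environment variable rather than through this option:
`export COVERAGE_PROCESS_START=${COVERAGE_RCFILE}`
If subprocess coverage is not needed, delete this line to avoid misleading readers.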
| inputs: | ||
| DOCKER_IMAGE: | ||
| description: "Build Images" | ||
| required: true |
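❓ Question: Having required: true and default: together is semantically contradictory: required: true means the caller must pass the value explicitly, while default: implies it can be omitted.
Consider changing to required: false (a default exists, so the caller may omit it), or dropping default: and keeping required: true (forcing the caller to pass it).
ci_xpu.yml currently passes this parameter explicitly, so nothing fails in practice, but the contradiction at the definition level can mislead maintainers.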
Motivation
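Add code coverage collection and reporting to the XPU CI to improve the observability of XPU code quality. Also add an XPU unit test workflow, merge the coverage data from the unit test, 4-card, and 8-card stages, upload it to Codecov, and set an 80% incremental coverage threshold.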
Modifications
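- Add _xpu_unit_test.yml: workflow for XPU custom operator unit tests + model functional unit tests
- Add _xpu_coverage_report.yml: merge coverage data from each stage, generate the report, upload to Codecov
- Modify _xpu_4cards_case_test.yml / _xpu_8cards_case_test.yml: add coverage collection and BOS upload
- Modify ci_xpu.yml: chain the unit_test and coverage_report jobs
- Add scripts/.coveragerc_xpu: XPU-specific coverage configuration
- Add custom_ops/xpu_ops/test/pytest.ini: skip XPU operator unit tests with known issues
- Modify ci_metax.yml: shorten the MetaX CI timeout from 120 minutes to 60 minutes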
Usage or Command
Accuracy Tests
Checklist
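- Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- Format your code, run `pre-commit` before commit.
- Add unit tests. Please write the reason in this PR if no unit tests.
- Provide accuracy results.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.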