[FDConfig] Enable FD_ENABLE_E2W_TENSOR_CONVERT and FD_ENGINE_TASK_QUEUE_WITH_SHM by default #7746

Open
sunlei1024 wants to merge 11 commits into PaddlePaddle:develop from sunlei1024:feat/default-enable-shm

@sunlei1024
Collaborator

Motivation

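Enable FD_ENABLE_E2W_TENSOR_CONVERT and FD_ENGINE_TASK_QUEUE_WITH_SHM by default
to make Engine-to-Worker tensor transfer more efficient and to improve the communication performance of the engine task queue, which is backed by shared memory (SHM).

In large-model inference scenarios, this optimization reduces serialization/deserialization overhead, improving throughput and latency.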

Modifications

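  • fastdeploy/envs.py
    • Default value of FD_ENABLE_E2W_TENSOR_CONVERT changed from 0 to 1
    • Default value of FD_ENGINE_TASK_QUEUE_WITH_SHM changed from 0 to 1

Behavior changes

  • When the environment variables are not set explicitly, the optimizations above are now enabled by default
  • To keep the old behavior, set them manually:
    • FD_ENABLE_E2W_TENSOR_CONVERT=0
    • FD_ENGINE_TASK_QUEUE_WITH_SHM=0
  • In container environments, make sure /dev/shm has enough space (≥ 1GB recommended, depending on model size)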
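
The default-and-override behavior described above can be sketched as follows. This is a minimal illustration, not the actual contents of fastdeploy/envs.py: the helper names `read_int_flag` and `resolve_flags` and the `NEW_DEFAULTS` table are hypothetical, and only the two variable names and their new default of 1 come from this PR.

```python
import os

# New defaults introduced by this PR (1 = enabled).
NEW_DEFAULTS = {
    "FD_ENABLE_E2W_TENSOR_CONVERT": 1,
    "FD_ENGINE_TASK_QUEUE_WITH_SHM": 1,
}


def read_int_flag(name, default, environ=None):
    """Return the integer value of a flag, or `default` when it is unset.

    Hypothetical helper; `environ` can be any mapping so the logic is
    testable without touching the real process environment.
    """
    env = os.environ if environ is None else environ
    return int(env.get(name, str(default)))


def resolve_flags(environ=None):
    """Resolve both flags against the new defaults."""
    return {
        name: read_int_flag(name, default, environ)
        for name, default in NEW_DEFAULTS.items()
    }
```

With an empty environment both flags resolve to 1. Setting `FD_ENGINE_TASK_QUEUE_WITH_SHM=0` restores the old behavior for that flag only, leaving the other at its new default.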

Usage or Command

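No extra configuration is needed by default; the change takes effect automatically after upgrading.

To turn the features off, use the environment variables: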

export FD_ENABLE_E2W_TENSOR_CONVERT=0
export FD_ENGINE_TASK_QUEUE_WITH_SHM=0

Docker example (configuring shared memory):

docker run --shm-size=1g ...
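
Before starting the service inside a container, it may help to confirm that /dev/shm is large enough. The sketch below is an assumption-laden illustration rather than part of this PR: the function names are hypothetical, the 1GB recommendation is interpreted as 1 GiB of free space, and `shutil.disk_usage` is used simply because it reports free bytes for any mounted path.

```python
import shutil

# The ">= 1GB" recommendation, interpreted here as 1 GiB (assumption).
ONE_GIB = 1024 ** 3


def shm_free_bytes(path="/dev/shm"):
    """Return the number of free bytes on the filesystem holding `path`."""
    return shutil.disk_usage(path).free


def shm_is_sufficient(path="/dev/shm", min_free=ONE_GIB):
    """True when `path` has at least `min_free` bytes available."""
    return shm_free_bytes(path) >= min_free
```

A startup script could call `shm_is_sufficient()` and print a warning suggesting a larger `--shm-size` when it returns False.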

Accuracy Tests

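This change only adjusts the default values of environment variables and does not touch any model computation logic.

Verification results:

  • ✅ Functional verification: service startup and the inference flow work normally
  • ✅ Performance verification: E2W tensor convert and the SHM queue work normally
  • ✅ Consistency verification: with the switches turned off (set to 0), results match the previous version (no accuracy difference)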

Checklist

  • Add at least one tag in PR title (e.g., [FDConfig])
  • Code formatted and pre-commit passed
  • Unit tests added
    • Reason: this change only alters default configuration and adds no new logic paths
  • Accuracy results provided
  • Backward compatibility considered (can be rolled back via environment variables)
  • Not a Cherry-Pick PR / OR follows Cherry-Pick rules if applicable

@paddle-bot

paddle-bot Bot commented May 7, 2026

Thanks for your contribution!

@sunlei1024 sunlei1024 changed the title [test] Stop server with /dev/shm cleanup [FDConfig] Enable FD_ENABLE_E2W_TENSOR_CONVERT and FD_ENGINE_TASK_QUEUE_WITH_SHM by default May 7, 2026
PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot

PaddlePaddle-bot commented May 7, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-11 14:12:57

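This CI report is generated from the following code (refreshed every 30 minutes):


1 Task overview

1 required task has failed (Approval) and 6 more required tasks are still running. Resolve the Approval issue first, then wait for the remaining tasks to finish.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 36 (0) | 36 | 26 | 2 | 7 | 1 | 0 |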

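2 Task status summary

2.1 Required tasks: 3/10 passed

Required tasks block merging; failures must be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| ❌ | Approval | 8s | PR issue: modifying protected files requires approval from specific RDs (2 checks unmet) | Ask the designated RDs to approve the envs.py change and the logging behavior change | Job | - |
| ⏳ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | - | Running | - | Job | - |
| ⏳ | Run Base Tests / base_tests | - | Running | - | Job | - |
| ⏳ | Run Four Cards Tests / run_4_cards_tests | - | Running | - | Job | - |
| ⏳ | Extracted partial CE model tasks to run in CI. / run_ce_cases | - | Running | - | Job | - |
| ⏳ | xpu_4cards_case_test / run_xpu_4cards_cases | - | Running | - | Job | - |
| ⏳ | xpu_8cards_case_test / run_xpu_8cards_cases | - | Running | - | Job | - |
| ✅ | The other 3 required tasks passed (Pre Commit, run_tests_logprob, stable_tests) | - | - | - | - | - |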

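2.2 Optional tasks: 23/26 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | Trigger Jenkins for PR | 2m43s | Job | - |
| ⏳ | Run iluvatar Tests / run_iluvatar_cases | - | Job | - |
| ⏸️ | CI_HPU | - | - | - |
| ✅ | The other 23 optional tasks passed | - | - | - |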

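3 Failure details (required only)

Approval: code convention (approval check) (confidence: high)

Approval

  • Status: ❌ Failed
  • Error type: code convention (approval check failed)
  • Confidence: high
  • Root cause summary: the PR modifies protected files that require approval from specific RDs; 2 checks did not pass
  • Analyzer: generic analysis (fallback)

Root cause details:
The PR modifies two kinds of protected content: (1) fastdeploy/envs.py, which requires review approval from any one of @jiangjiajun / @liuyuanle / @chenjian26 / @wanglongzhi; (2) newly added .error() logging calls (a change in logging behavior), which require review approval from any one of @zhouchong / @zhangyongyue. Neither approval has been given, so the script exited with exit code 6.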

Key log lines:

Detected log modification in diff:
+    self.llm_logger.error("Failed to connect to engine worker queue, retry after 5 seconds")
+    llm_logger.error("Failed to connect to engine worker queue")
0. You must have one FastDeploy RD (jiangjiajun/liuyuanle/chenjian26/wanglongzhi) approval for modifying [fastdeploy/envs.py].
1. You must have one FastDeploy RD (zhouchong/zhangyongyue) approval for modifying logging behavior.
There are 2 approved errors.
##[error]Process completed with exit code 6.

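Suggested fix:

  1. Ask one of @Jiang-Jia-Jun, @yuanlehome, @rainyfly, @Wanglongzhi2001 to approve the change to fastdeploy/envs.py
  2. Ask one of @xyxinyang, @zyyzghb to approve the logging behavior change (newly added .error() calls)

Suggested fix summary: ask the designated RDs to approve the envs.py change and the logging behavior change

Related changes: fastdeploy/envs.py (new envs) + logging call changes (.error())
Link: view log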

@codecov-commenter

codecov-commenter commented May 7, 2026

Codecov Report

❌ Patch coverage is 23.33333% with 23 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@983b1a3). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| fastdeploy/utils.py | 33.33% | 9 Missing and 3 partials ⚠️ |
| ...stdeploy/inter_communicator/engine_worker_queue.py | 11.11% | 8 Missing ⚠️ |
| fastdeploy/engine/common_engine.py | 0.00% | 3 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7746   +/-   ##
==========================================
  Coverage           ?   72.36%           
==========================================
  Files              ?      396           
  Lines              ?    55733           
  Branches           ?     8714           
==========================================
  Hits               ?    40333           
  Misses             ?    12647           
  Partials           ?     2753           

| Flag | Coverage Δ |
| --- | --- |
| GPU | 72.36% <23.33%> (?) |


@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-11 14:05:01

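📋 Review summary

PR overview: enables FD_ENABLE_E2W_TENSOR_CONVERT and FD_ENGINE_TASK_QUEUE_WITH_SHM by default, adds disconnect detection for the SHM queue and a port availability check alongside, and refactors the test utility functions.
Scope of change: fastdeploy/envs.py, fastdeploy/engine/common_engine.py, fastdeploy/inter_communicator/engine_worker_queue.py, fastdeploy/utils.py, tests/
Impact tags: [FDConfig] [Engine] [CI]

📝 PR convention check

The title contains the official tag [FDConfig] ✓, and all 5 sections of the description template (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist) are complete ✓. The PR follows the conventions; no changes suggested.

Issues

| Level | File | Summary |
| --- | --- | --- |
| 🟡 Suggestion | fastdeploy/inter_communicator/engine_worker_queue.py:857 | except Exception: return False silently swallows exceptions with no logging |
| 🟡 Suggestion | tests/ci_use/EB_Lite_with_adapter/test_eblite_serving.py:93 | rm -rf /dev/shm/* is too aggressive and may delete shared memory belonging to concurrent CI jobs |
| ❓ Question | fastdeploy/inter_communicator/engine_worker_queue.py:851 | Reliability of using BaseManager.connect() to probe connection state is doubtful |

Overall assessment

The overall direction is right: enabling the SHM queue by default can effectively improve performance in normal environments, and the accompanying disconnect detection and port check logic round out the error recovery path. The main points to watch are the safety of the test cleanup logic and the robustness of the is_broken() implementation.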

    self.manager.connect()
    return False
except (ConnectionRefusedError, ConnectionResetError, BrokenPipeError, EOFError, OSError):
    llm_logger.error("Failed to connect to engine worker queue")
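🟡 Suggestion: except Exception: return False silently swallows every unexpected exception with no logging at all (matches rule §C: except Exception: without logger.error → error silenced).

If self.manager.connect() raises for a non-network reason, this branch hides the real error and makes is_broken() wrongly report the queue as healthy.

Suggested change: except Exception as e: llm_logger.warning("is_broken: unexpected error: %s", e); return False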
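
The suggestion above can be sketched as a standalone function. This is an illustration under stated assumptions, not the code in engine_worker_queue.py: `is_broken` is written here as a free function taking any object with a `connect()` method, the `logger` name is hypothetical, and returning True from the network-error branch is assumed because the quoted snippet does not show that line.

```python
import logging

logger = logging.getLogger("is_broken_sketch")  # hypothetical logger name

# Exception types listed in the snippet under review.
NETWORK_ERRORS = (
    ConnectionRefusedError,
    ConnectionResetError,
    BrokenPipeError,
    EOFError,
    OSError,
)


def is_broken(manager):
    """Probe `manager.connect()` and report whether the queue looks broken."""
    try:
        manager.connect()
        return False
    except NETWORK_ERRORS:
        logger.error("Failed to connect to engine worker queue")
        return True  # assumed: the quoted snippet does not show this return
    except Exception as e:
        # Log instead of swallowing silently, as the review suggests.
        logger.warning("is_broken: unexpected error: %s", e)
        return False
```

The return value of the last branch is unchanged from the reviewed code; the only difference is that an unexpected failure now leaves a trace in the log.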

"""
# Clean up temporary files in /dev/shm
try:
    subprocess.run("rm -rf /dev/shm/*", shell=True)
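🟡 Suggestion: rm -rf /dev/shm/* deletes every file under /dev/shm. With concurrent CI jobs it may remove shared memory segments that other test jobs or system services are still using, causing unrelated tests to fail.

Clean up only this project's socket files, consistent with the clean_ports implementation:
delete the matches of glob.glob("/dev/shm/fd_task_queue_*.sock") one by one instead of wiping the whole directory.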
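
A minimal sketch of that targeted cleanup is shown below. The function name `clean_task_queue_sockets` and its parameters are hypothetical; only the glob pattern fd_task_queue_*.sock comes from the review comment, and whether it matches every file the project creates in /dev/shm is not confirmed here.

```python
import glob
import os


def clean_task_queue_sockets(shm_dir="/dev/shm", pattern="fd_task_queue_*.sock"):
    """Delete only files matching `pattern` in `shm_dir`; return removed paths."""
    removed = []
    for path in glob.glob(os.path.join(shm_dir, pattern)):
        try:
            os.remove(path)
            removed.append(path)
        except FileNotFoundError:
            # Another process already removed it; nothing left to do.
            pass
    return removed
```

Because the directory is a parameter, the helper can be exercised against a temporary directory in tests, and files owned by other jobs are never touched.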

"""
if self.manager is not None and self.is_server:
    self.manager.shutdown()

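❓ Question: the semantics of BaseManager.connect() are "initialize/re-establish a connection", not "probe whether the connection is alive". If the manager's port is still listening but its internal state is corrupted, connect() may complete the handshake while later business requests still fail, so is_broken() would wrongly report the queue as healthy.

Please also confirm: when FD_ENGINE_TASK_QUEUE_WITH_SHM=0 (TCP mode), does self.manager.connect() still reflect the queue state correctly? It would help if the author described whether the behavior is consistent across the two modes.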
