
@dillon-cullinan dillon-cullinan commented Nov 5, 2025

Overview:

This PR attempts to fix occasional pod timeouts in the deploy tests. Image pulls sometimes take a long time: if all tests are triggered at the same time and are scheduled on the same node, that single node's network can become congested while pulling many large Docker images.

Summary:

  • Increase the pod readiness timeout by 5 minutes to accommodate long image pulls
  • Limit the parallelism of the matrix to 1 job to avoid overloading a single node with image pulls
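
As a rough illustration of the two changes listed above, a workflow job could look like the sketch below. `strategy.max-parallel` is the standard GitHub Actions key for capping concurrent matrix jobs, and `--timeout` is the standard `kubectl wait` flag for bounding the readiness wait. The job name, runner, matrix values, and label selector are placeholders chosen for illustration and are not copied from the actual workflow; the 1000s and 1300s values are the ones quoted in the review on this PR.

```yaml
# Hypothetical excerpt; names and matrix entries are illustrative only.
jobs:
  deploy-test-example:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      # Run one matrix job at a time so concurrent image pulls
      # are less likely to land on the same node.
      max-parallel: 1
      matrix:
        profile: [profile-a, profile-b]
    steps:
      - name: Wait for pods to become ready
        run: |
          # 1300s = the previous 1000s timeout plus a 5-minute (300s) buffer.
          kubectl wait --for=condition=ready pod \
            -l app=example \
            --timeout=1300s
```

Serializing the matrix trades longer total wall-clock time for less contention on image pulls, which is why it is paired with a longer readiness timeout rather than used alone.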

Summary by CodeRabbit

  • Chores
    • Optimized CI/CD pipeline job concurrency constraints across multiple deployment workflows to improve resource efficiency and stability.
    • Extended deployment readiness timeout to ensure reliable pod initialization during validation processes.

@dillon-cullinan dillon-cullinan requested a review from a team as a code owner November 5, 2025 21:00
@github-actions github-actions bot added the ci Issues/PRs that reference CI build/test label Nov 5, 2025

coderabbitai bot commented Nov 5, 2025

Walkthrough

Modified GitHub Actions workflow configuration to add concurrency constraints across multiple job matrices and extended pod readiness timeout. The changes introduce max-parallel limits for operator, vLLM, sglang, and trtllm deployment test jobs, plus increased kubectl wait timeout duration.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Workflow Concurrency & Timeout Optimization: .github/workflows/container-validation-backends.yml | Added max-parallel constraints to operator (2), deploy-test-vllm (2), deploy-test-sglang (1), and deploy-test-trtllm (2) job matrices; extended pod readiness wait timeout from 1000s to 1300s for vLLM deployment tests |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Configuration-level changes only; no logic modifications
  • Straightforward parameter additions (max-parallel settings) with clear intent
  • Single timeout value change with documented duration extension
  • Changes are consistent and follow established patterns

Poem

🐰 A rabbit hops through workflows swift,
With limits set, concurrency's gift!
Pods wait longer, jobs run with care,
Parallelism tuned with flair! ✨

Pre-merge checks

✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the two main changes: increasing timeout and adding max matrix concurrency, matching the changeset modifications. |
| Description check | ✅ Passed | The pull request description addresses all required template sections with clear context about the issue and proposed changes. |


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4765d88 and 4c0051f.

📒 Files selected for processing (1)
  • .github/workflows/container-validation-backends.yml (4 hunks)
🧰 Additional context used
🪛 actionlint (1.7.8)
.github/workflows/container-validation-backends.yml

422-422: expecting a single ${{...}} expression or array value for matrix variations, but found plain text node

(syntax-check)


572-572: expecting a single ${{...}} expression or array value for matrix variations, but found plain text node

(syntax-check)


593-593: expecting a single ${{...}} expression or array value for matrix variations, but found plain text node

(syntax-check)
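
This actionlint message is typically reported when a key under `matrix:` holds a plain scalar rather than a list or a single `${{ ... }}` expression. One possible cause, stated here as an assumption because the flagged lines are not shown in this conversation, is `max-parallel` being nested under `matrix` instead of directly under `strategy`. A minimal sketch with placeholder job and matrix names:

```yaml
# Hypothetical jobs for illustration; not taken from the workflow under review.
jobs:
  flagged-example:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        profile: [profile-a, profile-b]
        max-parallel: 1   # scalar inside matrix: reported as a plain text node
    steps:
      - run: echo "${{ matrix.profile }}"

  accepted-example:
    runs-on: ubuntu-latest
    strategy:
      max-parallel: 1     # sibling of matrix: parsed as the concurrency limit
      matrix:
        profile: [profile-a, profile-b]
    steps:
      - run: echo "${{ matrix.profile }}"
```

If the limit does sit inside `matrix`, it would be treated as a matrix dimension rather than a concurrency cap, so checking the indentation at the reported lines is a reasonable first step.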

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
.github/workflows/container-validation-backends.yml (2)

476-476: Timeout increase appropriately addresses the pod readiness delay risk.

The increase from 1000s to 1300s (5-minute buffer) aligns with the PR objective to accommodate long image pulls when multiple pods start simultaneously. This change reduces false timeout failures during heavy concurrent network load on cluster nodes.


40-51: Verify if operator job should also receive max-parallel constraint.

The AI summary mentions "operator: max-parallel: 2," but this constraint is not visible in the provided operator job definition (lines 40–51). Confirm whether the operator job's matrix should also be constrained. If added, a max-parallel: 2 would have no effect since the matrix only has 2 platform entries (amd64, arm64), though it would make intent explicit for consistency.

Signed-off-by: Dillon Cullinan <[email protected]>

@ranrubin ranrubin left a comment


lgtm

Labels

ci Issues/PRs that reference CI build/test size/XS

3 participants