JAX-vLLM Offloading k8s AWS EKS #1798

Merged
yhtang merged 31 commits into main from sbosisio/transfer-multinode-eks on Dec 9, 2025

Conversation

Steboss (Contributor) commented Nov 24, 2025

No description provided.

yhtang (Contributor) left a comment

Thanks for making this! Made some comments. Let me know what you think.

yhtang (Contributor) left a comment

Now that we have a working example, could we integrate this into jio.yaml?

Also, could we add a performance-monitoring step to this job so that if throughput drops below a certain baseline, the job reports a failure?
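
One way such a gate could look is sketched below. It is only an illustration under assumptions: the function names `check_throughput` and `gate_exit_code`, the 10% tolerance, and the idea of comparing a single measured value against a single baseline are all hypothetical and not taken from this PR.

```python
def check_throughput(measured: float, baseline: float, tolerance: float = 0.1) -> bool:
    """Return True when measured throughput is no more than `tolerance` below baseline.

    All names and the default tolerance are illustrative assumptions.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    if not 0 <= tolerance < 1:
        raise ValueError("tolerance must be in [0, 1)")
    # The job passes as long as throughput stays above the relaxed baseline.
    return measured >= baseline * (1.0 - tolerance)


def gate_exit_code(measured: float, baseline: float, tolerance: float = 0.1) -> int:
    """Map the check onto a process exit code: 0 for pass, 1 for fail."""
    return 0 if check_throughput(measured, baseline, tolerance) else 1
```

A CI step could then end with `sys.exit(gate_exit_code(measured, baseline))`, so that a non-zero exit code marks the job as failed when throughput regresses.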

Steboss (Contributor, Author) commented Dec 2, 2025

hey @yhtang

> Also, could we add a performance-monitoring step to this job so that if throughput drops below a certain baseline, the job reports a failure?

We can definitely put together the fully working example. One caveat: we're still investigating why NCCL doesn't pick up EFA on EKS. If that's OK with you, we can start with this approach. Performance will be low for now; I'll give you some numbers by the end of the week at the latest.

Steboss (Contributor, Author) commented Dec 4, 2025

@yhtang I've shared the EKS performance numbers with you.

Steboss requested a review from yhtang on December 4, 2025 16:06
Steboss requested a review from yhtang on December 5, 2025 14:11

yhtang (Contributor) left a comment

This looks great. Thank you for the hard work!

yhtang merged commit c58d0a8 into main on Dec 9, 2025
81 of 100 checks passed
yhtang deleted the sbosisio/transfer-multinode-eks branch on December 9, 2025 22:42
