Adding EC2 tests on vLLM DLC #4986

Draft: wants to merge 220 commits into base: master

Commits (220)
44d464f
testing vllm
Jyothirmaikottu Jul 1, 2025
cb8fdb8
add ec2
Jyothirmaikottu Jul 7, 2025
3d8a430
Merge remote-tracking branch 'upstream/master' into vllm-ec2
Jyothirmaikottu Jul 7, 2025
1d09e4c
testing vllm route
Jyothirmaikottu Jul 9, 2025
2675fc8
fixed error in trigger_test
Jyothirmaikottu Jul 9, 2025
df0751e
added new dir for vllm test and infra
Jyothirmaikottu Jul 9, 2025
3eb563c
commented out test runner
Jyothirmaikottu Jul 9, 2025
1fb0aa1
trigger ec2
Jyothirmaikottu Jul 9, 2025
61d9100
create ec2
Jyothirmaikottu Jul 9, 2025
564b299
change region
Jyothirmaikottu Jul 10, 2025
745dadc
adding fsx
Jyothirmaikottu Jul 10, 2025
43e8b1c
adding fsx
Jyothirmaikottu Jul 10, 2025
28b6175
create func for subnet id
Jyothirmaikottu Jul 11, 2025
7e2b382
print statements
Jyothirmaikottu Jul 11, 2025
5540b04
print statements
Jyothirmaikottu Jul 13, 2025
eccd3c2
add more delete functionalities
Jyothirmaikottu Jul 13, 2025
b1a460d
fix ingress rules
Jyothirmaikottu Jul 13, 2025
49ec039
make dg is list
Jyothirmaikottu Jul 13, 2025
164e128
modify egress and ingress
Jyothirmaikottu Jul 13, 2025
4e82437
add ingress and egress rules
Jyothirmaikottu Jul 13, 2025
e43313b
add setup_script
Jyothirmaikottu Jul 13, 2025
b52607c
add setup_script
Jyothirmaikottu Jul 13, 2025
45cbdc5
fixed path
Jyothirmaikottu Jul 13, 2025
26a8611
fix sg re ordering
Jyothirmaikottu Jul 13, 2025
d4c2c9b
fix sg and fsx
Jyothirmaikottu Jul 14, 2025
f0adbc0
fix sg and fsx
Jyothirmaikottu Jul 14, 2025
0708a4d
fix sg and fsx
Jyothirmaikottu Jul 14, 2025
c581420
fixed hf token error
Jyothirmaikottu Jul 14, 2025
1ac8f01
fix error with sg and fsx mount
Jyothirmaikottu Jul 14, 2025
b2ff468
commented out cleanup code
Jyothirmaikottu Jul 14, 2025
7895633
refactor setup() and fsx_utils sg
Jyothirmaikottu Jul 14, 2025
34de29e
fix sg
Jyothirmaikottu Jul 14, 2025
75bb994
modify sg creation
Jyothirmaikottu Jul 14, 2025
4fb6a69
add self
Jyothirmaikottu Jul 14, 2025
8fe55ce
setup instances failure
Jyothirmaikottu Jul 14, 2025
d83e5a8
fix deletion of sg
Jyothirmaikottu Jul 14, 2025
bc0a3dd
remove version
Jyothirmaikottu Jul 14, 2025
2c9ac89
remove command
Jyothirmaikottu Jul 14, 2025
02c776c
adding test-runner path and actual single node test
Jyothirmaikottu Jul 14, 2025
0c0347b
added secret key for hf
Jyothirmaikottu Jul 14, 2025
b95b637
fixed import
Jyothirmaikottu Jul 14, 2025
d66c172
rename fn'
Jyothirmaikottu Jul 14, 2025
b52810d
testspec use trigger_test:
Jyothirmaikottu Jul 14, 2025
81117a1
fix import
Jyothirmaikottu Jul 15, 2025
edf60a3
fix errors
Jyothirmaikottu Jul 15, 2025
d5f1a2e
add cleanup logic
Jyothirmaikottu Jul 15, 2025
04256b6
increase time out
Jyothirmaikottu Jul 15, 2025
293ed99
change region
Jyothirmaikottu Jul 15, 2025
89cf458
changed it back to us-west-2
Jyothirmaikottu Jul 15, 2025
61f5250
modified test for single node
Jyothirmaikottu Jul 15, 2025
cb0af6d
modified test for single node
Jyothirmaikottu Jul 15, 2025
ac3e801
changes to test to use script
Jyothirmaikottu Jul 16, 2025
05eac52
Merge remote-tracking branch 'upstream/master' into vllm-ec2
Jyothirmaikottu Jul 16, 2025
11f3145
retrigger tst
Jyothirmaikottu Jul 16, 2025
7eb1204
remove nvjpeg
Jyothirmaikottu Jul 16, 2025
de2b4ad
remove nvjpeg
Jyothirmaikottu Jul 16, 2025
cdab16d
add logs
Jyothirmaikottu Jul 16, 2025
2b6f79b
fix script to pass arguments
Jyothirmaikottu Jul 17, 2025
68a1691
fix script to pass arguments
Jyothirmaikottu Jul 17, 2025
c41ed8a
fix string
Jyothirmaikottu Jul 17, 2025
e6a1e63
fix string
Jyothirmaikottu Jul 17, 2025
aef0687
test ec2
Jyothirmaikottu Jul 17, 2025
f742524
remove unused code
Jyothirmaikottu Jul 21, 2025
13c28dd
add multinode
Jyothirmaikottu Jul 21, 2025
57ec879
fixed connection
Jyothirmaikottu Jul 21, 2025
8427db0
fix connection
Jyothirmaikottu Jul 21, 2025
cc01855
fix path
Jyothirmaikottu Jul 21, 2025
438b725
fix comand
Jyothirmaikottu Jul 21, 2025
1a38f4d
retrigger ec2
Jyothirmaikottu Jul 22, 2025
a8c2937
Merge branch 'master' into vllm-ec2
Jyothirmaikottu Jul 22, 2025
58ef98f
Merge branch 'master' into vllm-ec2
Jyothirmaikottu Jul 23, 2025
3d775b8
increase wait time and add fsx version
Jyothirmaikottu Jul 23, 2025
ed94ecc
fix fsx command
Jyothirmaikottu Jul 23, 2025
8c2ca97
fix names
Jyothirmaikottu Jul 23, 2025
471b9ff
fix dir
Jyothirmaikottu Jul 23, 2025
17b1ca5
fix vllm dir and add log
Jyothirmaikottu Jul 23, 2025
bda6b37
fix git clone
Jyothirmaikottu Jul 23, 2025
f1170f9
fix git url
Jyothirmaikottu Jul 23, 2025
a4eef72
fix path
Jyothirmaikottu Jul 23, 2025
97077fa
increase max attempts
Jyothirmaikottu Jul 24, 2025
06cf30b
fixed paths
Jyothirmaikottu Jul 24, 2025
d6e33a5
added more fixes
Jyothirmaikottu Jul 24, 2025
f06c4fe
sleep
Jyothirmaikottu Jul 24, 2025
1fc7fdb
setup instance one at a time
Jyothirmaikottu Jul 24, 2025
d06cd9a
create diff fsx and sg for another instance
Jyothirmaikottu Jul 24, 2025
48abe34
add conda installer
Jyothirmaikottu Jul 24, 2025
c08c822
create conda env
Jyothirmaikottu Jul 25, 2025
636cb42
conda accept tps
Jyothirmaikottu Jul 25, 2025
92a8076
fix sg and multinode
Jyothirmaikottu Jul 25, 2025
ee9f82c
add packages
Jyothirmaikottu Jul 27, 2025
3ebed39
add venv vllm_env
Jyothirmaikottu Jul 27, 2025
0452b3c
add venv vllm_env
Jyothirmaikottu Jul 27, 2025
1c08d41
fixed vllm venv
Jyothirmaikottu Jul 27, 2025
fd8e127
fixed transformrs isntallation
Jyothirmaikottu Jul 28, 2025
dd125b1
fix cleanup logic
Jyothirmaikottu Jul 28, 2025
b160c6c
add timer
Jyothirmaikottu Jul 28, 2025
d125d23
increase timer
Jyothirmaikottu Jul 28, 2025
db862de
run single node
Jyothirmaikottu Jul 28, 2025
2cd2b0a
multinode test
Jyothirmaikottu Jul 28, 2025
5aafae8
add packages
Jyothirmaikottu Jul 29, 2025
a0f0ac4
test ec2
Jyothirmaikottu Jul 29, 2025
1e34f95
activate venv
Jyothirmaikottu Jul 29, 2025
c6fba55
add vllm serve
Jyothirmaikottu Jul 29, 2025
7b6afc7
increase cleanup timer
Jyothirmaikottu Jul 29, 2025
f358ff2
test multinode
Jyothirmaikottu Jul 30, 2025
e750c2c
test mutlinode
Jyothirmaikottu Jul 30, 2025
1f6ed9e
test multinode
Jyothirmaikottu Jul 30, 2025
cdd5e26
test multinode
Jyothirmaikottu Jul 30, 2025
a7ff3f0
test multinode
Jyothirmaikottu Jul 31, 2025
0c1a8e6
retest
Jyothirmaikottu Jul 31, 2025
90c1589
retest
Jyothirmaikottu Jul 31, 2025
66eed3f
retest
Jyothirmaikottu Jul 31, 2025
3e26dc6
retest
Jyothirmaikottu Jul 31, 2025
8c8ee6f
retest
Jyothirmaikottu Jul 31, 2025
87156bf
retest single node
Jyothirmaikottu Jul 31, 2025
0941b76
test
Jyothirmaikottu Jul 31, 2025
cf69a6f
test single node
Jyothirmaikottu Jul 31, 2025
f114b92
test efa and nccl
Jyothirmaikottu Jul 31, 2025
db4fb9c
test efa and nccl multinode
Jyothirmaikottu Jul 31, 2025
f1e0c80
test efa and nccl
Jyothirmaikottu Aug 1, 2025
2df0158
add sleep timer
Jyothirmaikottu Aug 1, 2025
ff15bab
test efa
Jyothirmaikottu Aug 1, 2025
bace925
test efa
Jyothirmaikottu Aug 3, 2025
fc2055a
test efa
Jyothirmaikottu Aug 3, 2025
623dfa9
test efa
Jyothirmaikottu Aug 3, 2025
c11772d
added print statements
Jyothirmaikottu Aug 3, 2025
4cf106f
debug efa
Jyothirmaikottu Aug 4, 2025
0aed7c7
debug efa
Jyothirmaikottu Aug 4, 2025
14603b8
test vllm openai server
Jyothirmaikottu Aug 4, 2025
82eba71
test vllm openai server
Jyothirmaikottu Aug 4, 2025
9e3b145
test vllm openai server
Jyothirmaikottu Aug 4, 2025
c224bc2
add cleanup for address allocation exception
Jyothirmaikottu Aug 4, 2025
1b6bdd2
add ingress rule
Jyothirmaikottu Aug 4, 2025
d6b7079
retest sg
Jyothirmaikottu Aug 4, 2025
8a52004
add sleep timer for debugging
Jyothirmaikottu Aug 4, 2025
184d44a
test efa
Jyothirmaikottu Aug 4, 2025
e232da4
test efa
Jyothirmaikottu Aug 4, 2025
784405a
test efa
Jyothirmaikottu Aug 4, 2025
2ac742d
revert efa
Jyothirmaikottu Aug 4, 2025
e7d328c
test single node
Jyothirmaikottu Aug 5, 2025
9e0adee
test single node
Jyothirmaikottu Aug 5, 2025
4669f43
test single node
Jyothirmaikottu Aug 5, 2025
08a1f35
make ipv6 true
Jyothirmaikottu Aug 5, 2025
c8fddb5
run efa ipv6
Jyothirmaikottu Aug 5, 2025
a23727b
run efa ipv6
Jyothirmaikottu Aug 5, 2025
3016d40
modify instance setup
Jyothirmaikottu Aug 6, 2025
31dc5e2
test multinode and efa
Jyothirmaikottu Aug 6, 2025
7613013
test multinode and efa
Jyothirmaikottu Aug 6, 2025
8d466bb
modify instance setup
Jyothirmaikottu Aug 6, 2025
8d0a316
modify instance setup
Jyothirmaikottu Aug 6, 2025
1f27b28
test efa
Jyothirmaikottu Aug 7, 2025
cb572cc
revamp ec2
Jyothirmaikottu Aug 7, 2025
95678f6
revamp ec2
Jyothirmaikottu Aug 7, 2025
c1141cd
add test tunner
Jyothirmaikottu Aug 7, 2025
c6d65bf
Merge branch 'master' into vllm-ec2
Jyothirmaikottu Aug 7, 2025
9fd5b2c
add ec2 in test runner
Jyothirmaikottu Aug 7, 2025
9bfa04b
add ec2 elasticip cleanup
Jyothirmaikottu Aug 7, 2025
234ab10
test efa
Jyothirmaikottu Aug 7, 2025
12c67ff
fix error in efa_ec2 print statement
Jyothirmaikottu Aug 7, 2025
431f838
fix error in efa_ec2 print statement
Jyothirmaikottu Aug 7, 2025
4b75800
fix path of setup_fsx
Jyothirmaikottu Aug 7, 2025
9211443
retest ec2
Jyothirmaikottu Aug 7, 2025
9fc0174
retest ec2
Jyothirmaikottu Aug 7, 2025
c25e1b7
add condition to skip chdir
Jyothirmaikottu Aug 8, 2025
5457d1d
change dir
Jyothirmaikottu Aug 8, 2025
66d8b85
test ec2
Jyothirmaikottu Aug 8, 2025
101ebb9
test ec2
Jyothirmaikottu Aug 8, 2025
fabc670
test efa'
Jyothirmaikottu Aug 8, 2025
ca22d6a
test multinode
Jyothirmaikottu Aug 8, 2025
764c925
test multinode
Jyothirmaikottu Aug 8, 2025
1b6c52b
test multinode
Jyothirmaikottu Aug 8, 2025
ad3f90d
test multinode
Jyothirmaikottu Aug 8, 2025
751b52e
test multinode
Jyothirmaikottu Aug 8, 2025
96eff07
test multinode
Jyothirmaikottu Aug 8, 2025
3cb7aa3
test multinode
Jyothirmaikottu Aug 8, 2025
39354d7
test multinode
Jyothirmaikottu Aug 8, 2025
a1e128b
test multinode tmux
Jyothirmaikottu Aug 8, 2025
7e8f799
test multinode tmux
Jyothirmaikottu Aug 8, 2025
296fb48
remove timer
Jyothirmaikottu Aug 8, 2025
085008d
remove tmux from worker
Jyothirmaikottu Aug 10, 2025
280565b
test multinode
Jyothirmaikottu Aug 10, 2025
974108c
test multinode
Jyothirmaikottu Aug 10, 2025
5b1be97
test efa and multinode
Jyothirmaikottu Aug 10, 2025
11999c5
test multinode
Jyothirmaikottu Aug 10, 2025
99cfee1
test efa
Jyothirmaikottu Aug 11, 2025
04a3b46
test efa
Jyothirmaikottu Aug 11, 2025
9f3ca48
fix key pair logic
Jyothirmaikottu Aug 11, 2025
072c790
test efa
Jyothirmaikottu Aug 11, 2025
8591112
test efa
Jyothirmaikottu Aug 11, 2025
09a0cd9
test efa and test multinode
Jyothirmaikottu Aug 11, 2025
4afa666
test efa and test multinode
Jyothirmaikottu Aug 11, 2025
8775639
test efa and test multinode
Jyothirmaikottu Aug 11, 2025
ffafcfb
test efa and test multinode
Jyothirmaikottu Aug 11, 2025
86b90c2
test multinode
Jyothirmaikottu Aug 11, 2025
d9b30f6
test multinode
Jyothirmaikottu Aug 11, 2025
8c8d42a
test multinode
Jyothirmaikottu Aug 11, 2025
8dbeb90
increased max attempts
Jyothirmaikottu Aug 11, 2025
56c580d
run efa and multinode
Jyothirmaikottu Aug 11, 2025
4553623
test efa and multinode
Jyothirmaikottu Aug 11, 2025
4e08e03
test efa and multinode
Jyothirmaikottu Aug 11, 2025
5e99d0b
test efa and multinode
Jyothirmaikottu Aug 11, 2025
ee7d7db
add single script
Jyothirmaikottu Aug 11, 2025
70afd36
test multinode
Jyothirmaikottu Aug 11, 2025
4b1f6d5
test multinode
Jyothirmaikottu Aug 12, 2025
c63530a
test multinode
Jyothirmaikottu Aug 12, 2025
e17c656
test multinode
Jyothirmaikottu Aug 12, 2025
cea0e46
add delay
Jyothirmaikottu Aug 12, 2025
f10b6df
add delay and model ready waiter
Jyothirmaikottu Aug 12, 2025
bf6293c
add async
Jyothirmaikottu Aug 12, 2025
a9fb658
test enforce eager
Jyothirmaikottu Aug 12, 2025
5eab0e3
add timer to test:
Jyothirmaikottu Aug 12, 2025
e3ba035
test multinode
Jyothirmaikottu Aug 13, 2025
73a16f7
test multinode
Jyothirmaikottu Aug 13, 2025
77d4bfc
Test all methods
Jyothirmaikottu Aug 13, 2025
2e3b9a4
test methods
Jyothirmaikottu Aug 13, 2025
ba2b565
test methods
Jyothirmaikottu Aug 13, 2025
328aef8
test methods
Jyothirmaikottu Aug 13, 2025
cb7bb0e
test methods
Jyothirmaikottu Aug 13, 2025
277caca
test single node and multinode
Jyothirmaikottu Aug 13, 2025
76e07e4
test single node and multinode
Jyothirmaikottu Aug 13, 2025
8 changes: 4 additions & 4 deletions dlc_developer_config.toml
@@ -37,16 +37,16 @@ deep_canary_mode = false
 [build]
 # Add in frameworks you would like to build. By default, builds are disabled unless you specify building an image.
 # available frameworks - ["base", "vllm", "autogluon", "huggingface_tensorflow", "huggingface_pytorch", "huggingface_tensorflow_trcomp", "huggingface_pytorch_trcomp", "pytorch_trcomp", "tensorflow", "pytorch", "stabilityai_pytorch"]
-build_frameworks = []
+build_frameworks = ["vllm"]


 # By default we build both training and inference containers. Set true/false values to determine which to build.
-build_training = true
-build_inference = true
+build_training = false
+build_inference = false

 # Set do_build to "false" to skip builds and test the latest image built by this PR
 # Note: at least one build is required to set do_build to "false"
-do_build = true
+do_build = false

 [notify]
 ### Notify on test failures
4 changes: 2 additions & 2 deletions test/dlc_tests/container_tests/bin/efa/testEFA
@@ -36,7 +36,7 @@ validate_all_reduce_performance_logs(){
     # EFA 1.37.0 using "Using network Libfabric" instead of "Using network AWS Libfabric"
     grep -E "Using network (AWS )?Libfabric" ${TRAINING_LOG} || { echo "efa is not working, please check if it is installed correctly"; exit 1; }
     if [[ ${INSTANCE_TYPE} == p4d* || ${INSTANCE_TYPE} == p5* ]]; then
-        grep "Setting NCCL_TOPO_FILE environment variable to" ${TRAINING_LOG}
+        grep "NCCL_TOPO_FILE set by environment to" ${TRAINING_LOG}
         # EFA 1.37.0 change from NET/AWS Libfabric/0/GDRDMA to NET/Libfabric/0/GDRDMA
         grep -E "NET/(AWS )?Libfabric/0/GDRDMA" ${TRAINING_LOG}
     fi
@@ -89,7 +89,7 @@ check_efa_nccl_all_reduce(){

     RETURN_VAL=${PIPESTATUS[0]}
     # In case, if you would like see logs, uncomment below line
-    # RESULT=$(cat ${TRAINING_LOG})
+    RESULT=$(cat ${TRAINING_LOG})

     if [ ${RETURN_VAL} -eq 0 ]; then
         echo "***************************** check_efa_nccl_all_reduce passed *****************************"
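Review note: the first hunk swaps one NCCL log message for another, whereas the neighbouring Libfabric checks accept both the old and new wording through an optional group. The sketch below shows the same tolerant matching applied to the topology-file message. It is written in Python for illustration only; `TOPO_FILE_PATTERN` and `log_mentions_topo_file` are hypothetical names, and the pattern covers only the two message strings visible in this diff.

```python
import re

# Hypothetical pattern: accepts either wording that appears in the diff above.
TOPO_FILE_PATTERN = re.compile(
    r"Setting NCCL_TOPO_FILE environment variable to"
    r"|NCCL_TOPO_FILE set by environment to"
)


def log_mentions_topo_file(log_text: str) -> bool:
    """Return True if any log line carries the old or the new message."""
    return any(TOPO_FILE_PATTERN.search(line) for line in log_text.splitlines())
```

Whether older images that emit the previous wording still run through this script is not stated in the PR, so the extra alternative may be unnecessary.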
14 changes: 10 additions & 4 deletions test/dlc_tests/ec2/test_efa.py
@@ -294,10 +294,16 @@ def _setup_container(connection, docker_image, container_name):
     # using SSH on a pre-defined port (as decided by sshd_config on server-side).
     # Allow instance to share all memory with container using memlock=-1:-1.
     # Share all EFA devices with container using --device <device_location> for all EFA devices.
-    connection.run(
-        f"docker run --runtime=nvidia --gpus all -id --name {container_name} --network host --ulimit memlock=-1:-1 "
-        f"{docker_all_devices_arg} -v $HOME/container_tests:/test -v /dev/shm:/dev/shm {docker_image} bash"
-    )
+    if "vllm" in docker_image:
+        connection.run(
+            f"docker run --entrypoint=/bin/bash -e CUDA_HOME=/usr/local/cuda --runtime=nvidia --gpus all -id --name {container_name} --network host --ulimit memlock=-1:-1 "
+            f"{docker_all_devices_arg} -v $HOME/container_tests:/test -v /dev/shm:/dev/shm {docker_image}"
+        )
+    else:
+        connection.run(
+            f"docker run --runtime=nvidia --gpus all -id --name {container_name} --network host --ulimit memlock=-1:-1 "
+            f"{docker_all_devices_arg} -v $HOME/container_tests:/test -v /dev/shm:/dev/shm {docker_image} bash"
+        )


 def _setup_master_efa_ssh_config(connection):
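Review note: the two branches above differ only in the `--entrypoint`/`CUDA_HOME` flags and the trailing `bash` argument. A minimal sketch of one way to build the command in a single place follows. `build_docker_run_command` is a hypothetical helper, not part of the PR, and it assumes `docker_all_devices_arg` is an already formatted string, as in the diff.

```python
def build_docker_run_command(docker_image, container_name, docker_all_devices_arg):
    """Assemble the docker run command string for EFA tests (hypothetical helper)."""
    is_vllm = "vllm" in docker_image
    parts = ["docker run"]
    if is_vllm:
        # Per the diff, vLLM images get bash as the entrypoint plus CUDA_HOME.
        parts.append("--entrypoint=/bin/bash -e CUDA_HOME=/usr/local/cuda")
    parts.append(
        f"--runtime=nvidia --gpus all -id --name {container_name} "
        "--network host --ulimit memlock=-1:-1"
    )
    parts.append(docker_all_devices_arg)
    parts.append("-v $HOME/container_tests:/test -v /dev/shm:/dev/shm")
    parts.append(docker_image)
    if not is_vllm:
        # Other images receive bash as the container command instead.
        parts.append("bash")
    # Skip empty pieces so an empty device argument leaves no double space.
    return " ".join(part for part in parts if part)
```

With such a helper, `_setup_container` could reduce to one `connection.run(build_docker_run_command(...))` call; this is a suggestion only and has not been run against the test fleet.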
21 changes: 21 additions & 0 deletions test/test_utils/ec2.py
@@ -1817,6 +1817,27 @@ def get_default_subnet_for_az(ec2_client, availability_zone):
     return az_subnet_id


+def get_subnet_id_by_vpc(ec2_client, vpc_id):
+
+    response = ec2_client.describe_subnets(
+        Filters=[
+            {
+                "Name": "vpc-id",
+                "Values": [
+                    vpc_id,
+                ],
+            },
+        ],
+    )
+
+    subnet_ids = []
+    for subnet in response["Subnets"]:
+        if subnet["SubnetId"] is not None:
+            subnet_ids.append(subnet["SubnetId"])
+
+    return subnet_ids
+
+
 def get_vpc_id_by_name(ec2_client, vpc_name):
     """
     Get VPC ID by VPC name tag
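Review note: `get_subnet_id_by_vpc` reads a single `describe_subnets` response, so a VPC whose subnets span more than one result page could come back truncated. The sketch below uses the boto3 client's `describe_subnets` paginator instead. The function name `get_subnet_ids_by_vpc_paginated` is hypothetical, and `ec2_client` is assumed to be a boto3 EC2 client, as in the PR.

```python
def get_subnet_ids_by_vpc_paginated(ec2_client, vpc_id):
    """Return all subnet IDs in a VPC, following pagination (hypothetical variant)."""
    paginator = ec2_client.get_paginator("describe_subnets")
    pages = paginator.paginate(Filters=[{"Name": "vpc-id", "Values": [vpc_id]}])
    return [
        subnet["SubnetId"]
        for page in pages
        for subnet in page.get("Subnets", [])
        if subnet.get("SubnetId")
    ]
```

For the small VPCs used in these tests, the single-call version in the PR is likely sufficient; the paginator only matters once the result set is split across pages.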
8 changes: 4 additions & 4 deletions test/testrunner.py
@@ -409,7 +409,7 @@ def main():
         pull_dlc_images(all_image_list)
         if specific_test_type == "bai":
             build_bai_docker_container()
-        if specific_test_type == "eks" and not is_all_images_list_eia:
+        if specific_test_type in ["eks", "ec2"] and not is_all_images_list_eia:
             frameworks_in_images = [
                 framework
                 for framework in ("mxnet", "pytorch", "tensorflow", "vllm")
@@ -424,13 +424,13 @@ def main():

             if framework == "vllm":
                 try:
-                    LOGGER.info(f"Running vLLM EKS tests with image: {all_image_list[0]}")
+                    LOGGER.info(f"Running vLLM EKS EC2 tests with image: {all_image_list[0]}")
                     test()
-                    LOGGER.info("vLLM EKS tests completed successfully")
+                    LOGGER.info("vLLM EKS EC2 tests completed successfully")
                     # Exit function after vLLM tests
                     return
                 except Exception as e:
-                    LOGGER.error(f"vLLM EKS tests failed: {str(e)}")
+                    LOGGER.error(f"vLLM EKS EC2 tests failed: {str(e)}")
                     raise

             eks_cluster_name = f"dlc-{framework}-{build_context}"
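Review note: the updated log messages read "vLLM EKS EC2 tests" regardless of which test type is running. A small sketch of deriving the label from `specific_test_type` follows. `vllm_test_label` is a hypothetical helper, and it assumes `specific_test_type` is a lowercase string such as "eks" or "ec2", as the condition in the first hunk suggests.

```python
import logging

LOGGER = logging.getLogger(__name__)


def vllm_test_label(specific_test_type: str) -> str:
    """Build a log label naming the active test type (hypothetical helper)."""
    return f"vLLM {specific_test_type.upper()} tests"


def log_vllm_start(specific_test_type: str, image: str) -> str:
    """Log and return the start message for a vLLM test run."""
    message = f"Running {vllm_test_label(specific_test_type)} with image: {image}"
    LOGGER.info(message)
    return message
```

This keeps EKS and EC2 runs distinguishable in the logs without branching on the test type at each call site.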