Commit c4a69e2

Merge pull request #84 from converged-computing/redo-osu
osu: fix runs for gpu 128 GKE and CE
2 parents 3feacac + 2b51308 commit c4a69e2

116 files changed: +11669 -6 lines changed

experiments/google/README.md

Lines changed: 5 additions & 1 deletion
@@ -1,6 +1,10 @@
 # Google Cloud Experiments
 
-This directory will hold experiments for Google Cloud.
+This directory will hold experiments for Google Cloud.
 
 - [gke/cpu](gke/cpu): CPU experiments for Google Kubernetes Engine
 - [gke/gpu](gke/gpu): GPU experiments for Google Kubernetes Engine
+
+## OSU Benchmarks
+
+Note that in our first experiments we ran OSU across sizes, but we made the mistake of using `flux submit` for the GPU runs. Because `flux submit` does not block, the OSU latency and bandwidth benchmarks ran at the same time and interfered with each other. To fix this, we re-ran the GPU experiments at size 16 nodes for both Compute Engine and GKE. This size is what should be used for data analysis.
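
The difference between a non-blocking and a blocking launch can be sketched without a Flux cluster. The following is a simulation only: plain shell jobs stand in for Flux jobs, a background job models `flux submit`, and a foreground job models `flux run`. The function names (`fake_benchmark`, `run_nonblocking`, `run_blocking`, `classify`) and the sleep durations are assumptions made for illustration and are not part of the repository.

```bash
#!/bin/bash
# Simulation only: plain shell jobs stand in for Flux jobs.

# Hypothetical stand-in for a benchmark: log start, work briefly, log end.
fake_benchmark() {
    local name=$1 log=$2
    echo "${name}-start" >> "$log"
    sleep 0.3
    echo "${name}-end" >> "$log"
}

# Non-blocking launch (submit-like): both benchmarks are started
# before either one has finished.
run_nonblocking() {
    local log=$1
    : > "$log"
    fake_benchmark latency "$log" &
    sleep 0.1   # keep the launch order deterministic for the demo
    fake_benchmark bw "$log" &
    wait
}

# Blocking launch (run-like): the second benchmark starts only
# after the first one has returned.
run_blocking() {
    local log=$1
    : > "$log"
    fake_benchmark latency "$log"
    fake_benchmark bw "$log"
}

# Print "overlap" if bw started before latency ended, otherwise "serial".
classify() {
    awk '
        $0 == "bw-start"    { bw = NR }
        $0 == "latency-end" { lat = NR }
        END { if (bw < lat) print "overlap"; else print "serial" }
    ' "$1"
}

log=$(mktemp)
run_nonblocking "$log"; echo "submit-like launch: $(classify "$log")"
run_blocking "$log";    echo "run-like launch:    $(classify "$log")"
rm -f "$log"
```

In this sketch the submit-like launch should be reported as overlapping and the run-like launch as serial, which mirrors why the two benchmarks could disturb each other in the first study.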

experiments/google/compute-engine/gpu/size16/README.md

Lines changed: 8 additions & 2 deletions
@@ -237,6 +237,7 @@ oras push ghcr.io/converged-computing/metrics-operator-experiments/performance:c
 #### OSU
 
 Write this script to the filesystem `flux-run-combinations.sh`
+Note that the initial study in August 2024 used `flux submit`, which does not block, so its results are erroneous because the two benchmarks ran at the same time. We redid this size in March 2025 with `flux run`, which blocks until each job completes.
 
 ```bash
 #!/bin/bash
@@ -258,13 +259,13 @@ for i in $hosts; do
 dequeue_from_list $list
 for j in $list; do
 echo "${i} ${j}"
-flux submit -N 2 -n 2 \
+flux run -N 2 -n 2 \
 --setattr=user.study_id=$app-2-iter-$iter \
 --requires="hosts:${i},${j}" \
 -o cpu-affinity=per-task \
 -g 1 -o gpu-affinity=per-task \
 singularity exec --nv /opt/containers/metric-osu-gpu_google-gpu.sif /opt/osu-benchmark/build.openmpi/mpi/pt2pt/osu_latency -d cuda H H
-flux submit -N 2 -n 2 \
+flux run -N 2 -n 2 \
 --setattr=user.study_id=$app-2-iter-$iter \
 --requires="hosts:${i},${j}" \
 -o cpu-affinity=per-task \
@@ -298,7 +299,12 @@ done
 
 # When they are done:
 ./save.sh $output
+
+# August 2024
 oras push ghcr.io/converged-computing/metrics-operator-experiments/performance:compute-engine-gpu-16-$app $output
+
+# March 2025 with fixed osu latency and osu bw
+oras push ghcr.io/converged-computing/metrics-operator-experiments/performance:compute-engine-gpu-16-$app-fixed $output
 ```
 
 #### Quicksilver
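
The hunks above show only part of the pairing loop, and the `dequeue_from_list` helper is not visible in this diff. As a rough sketch of the idea (each pair of hosts visited once, one blocking job at a time), the following uses index-based iteration instead. The function name `host_pairs`, the example host names, and the dry-run `echo` are assumptions for illustration; only the `-N 2 -n 2` and `--requires=hosts:` options are taken from the script.

```bash
#!/bin/bash
# Print every unordered pair of the given hosts, one pair per line.
# Hypothetical helper; the real script uses its own dequeue_from_list.
host_pairs() {
    local hosts=("$@")
    local n=${#hosts[@]} i j
    for ((i = 0; i < n; i++)); do
        for ((j = i + 1; j < n; j++)); do
            echo "${hosts[i]} ${hosts[j]}"
        done
    done
}

# Dry run: show the blocking command that would be issued for each pair.
# Nothing is submitted here; <benchmark> is a placeholder.
host_pairs flux-001 flux-002 flux-003 | while read -r a b; do
    echo "would run: flux run -N 2 -n 2 --requires=hosts:${a},${b} <benchmark>"
done
```

With n hosts this enumeration yields n(n-1)/2 pairs, so 16 hosts would give 120 pairs if every pair were run.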
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+Warning: OMB could not identify the local rank of the process.
+This can lead to multiple processes using the same GPU.
+Please use the get_local_rank script in the OMB repo for this.
+Warning: OMB could not identify the local rank of the process.
+This can lead to multiple processes using the same GPU.
+Please use the get_local_rank script in the OMB repo for this.
+# OSU MPI-CUDA Latency Test v5.8
+# Send Buffer on HOST (H) and Receive Buffer on HOST (H)
+# Size Latency (us)
+0 24.71
+1 24.37
+2 24.61
+4 24.51
+8 24.43
+16 24.39
+32 24.23
+64 24.40
+128 24.19
+256 24.53
+512 24.81
+1024 25.39
+2048 29.46
+4096 31.57
+8192 36.56
+16384 43.71
+32768 66.71
+65536 174.45
+131072 186.18
+262144 235.77
+524288 349.58
+1048576 564.08
+2097152 970.02
+4194304 1868.56
+START OF JOBSPEC
+{"resources": [{"type": "node", "count": 2, "with": [{"type": "slot", "count": 1, "with": [{"type": "core", "count": 1}, {"type": "gpu", "count": 1}], "label": "task"}]}], "tasks": [{"command": ["singularity", "exec", "--nv", "/opt/containers/metric-osu-gpu_google-gpu.sif", "/opt/osu-benchmark/build.openmpi/mpi/pt2pt/osu_latency", "-d", "cuda", "H", "H"], "slot": "task", "count": {"per_slot": 1}}], "attributes": {"system": {"duration": 0, "cwd": "/opt/containers", "shell": {"options": {"rlimit": {"cpu": -1, "fsize": -1, "data": -1, "stack": 8388608, "core": 0, "nofile": 1048576, "as": -1, "rss": -1, "nproc": -1}, "cpu-affinity": "per-task", "gpu-affinity": "per-task"}}, "constraints": {"hostlist": ["flux-009,flux-006"]}}, "user": {"study_id": "osu-2-iter-0"}}, "version": 1}
+START OF EVENTLOG
+{"timestamp":1742925379.8451526,"name":"init"}
+{"timestamp":1742925379.8460701,"name":"starting"}
+{"timestamp":1742925379.8658693,"name":"shell.init","context":{"service":"501043911-shell-fUdjYatP","leader-rank":5,"size":2}}
+{"timestamp":1742925379.9225433,"name":"shell.start","context":{"taskmap":{"version":1,"map":[[0,2,1,1]]}}}
+{"timestamp":1742925405.0883598,"name":"shell.task-exit","context":{"localid":0,"rank":0,"state":"Exited","pid":2472,"wait_status":0,"signaled":0,"exitcode":0}}
+{"timestamp":1742925405.1001337,"name":"complete","context":{"status":0}}
+{"timestamp":1742925405.1001668,"name":"done"}
+
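
For analysis, a single value can be pulled out of a saved log like the one above with `awk`. This is a minimal sketch that assumes the raw log layout shown here (two numeric columns per data row, followed by a `START OF JOBSPEC` marker) and assumes the log is read without diff `+` prefixes. The function name `osu_value` is hypothetical; the sample rows are copied from the latency log above.

```bash
#!/bin/bash
# Print the value column for one message size from OSU pt2pt output on stdin.
# Only rows made of exactly two numeric fields count as data rows, so the
# warnings, the "#" header lines and the jobspec/eventlog JSON are skipped.
osu_value() {
    local size=$1
    awk -v size="$size" '
        /^START OF JOBSPEC/ { exit }
        NF == 2 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9.]+$/ && $1 == size { print $2 }
    '
}

# A few rows taken from the latency log above.
sample='# OSU MPI-CUDA Latency Test v5.8
# Size Latency (us)
0 24.71
1024 25.39
4194304 1868.56
START OF JOBSPEC
{"version": 1}'

echo "latency at 1024 bytes (us): $(printf '%s\n' "$sample" | osu_value 1024)"
```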
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+Warning: OMB could not identify the local rank of the process.
+This can lead to multiple processes using the same GPU.
+Please use the get_local_rank script in the OMB repo for this.
+Warning: OMB could not identify the local rank of the process.
+This can lead to multiple processes using the same GPU.
+Please use the get_local_rank script in the OMB repo for this.
+# OSU MPI-CUDA Bandwidth Test v5.8
+# Send Buffer on HOST (H) and Receive Buffer on HOST (H)
+# Size Bandwidth (MB/s)
+1 0.18
+2 0.39
+4 0.68
+8 1.48
+16 3.26
+32 6.08
+64 20.84
+128 40.89
+256 78.46
+512 152.48
+1024 295.84
+2048 539.71
+4096 878.73
+8192 1315.63
+16384 1656.37
+32768 1967.43
+65536 2194.64
+131072 2150.22
+262144 2056.25
+524288 1924.40
+1048576 1853.76
+2097152 1761.99
+4194304 1715.88
+START OF JOBSPEC
+{"resources": [{"type": "node", "count": 2, "with": [{"type": "slot", "count": 1, "with": [{"type": "core", "count": 1}, {"type": "gpu", "count": 1}], "label": "task"}]}], "tasks": [{"command": ["singularity", "exec", "--nv", "/opt/containers/metric-osu-gpu_google-gpu.sif", "/opt/osu-benchmark/build.openmpi/mpi/pt2pt/osu_bw", "-d", "cuda", "H", "H"], "slot": "task", "count": {"per_slot": 1}}], "attributes": {"system": {"duration": 0, "cwd": "/opt/containers", "shell": {"options": {"rlimit": {"cpu": -1, "fsize": -1, "data": -1, "stack": 8388608, "core": 0, "nofile": 1048576, "as": -1, "rss": -1, "nproc": -1}, "cpu-affinity": "per-task", "gpu-affinity": "per-task"}}, "constraints": {"hostlist": ["flux-009,flux-006"]}}, "user": {"study_id": "osu-2-iter-0"}}, "version": 1}
+START OF EVENTLOG
+{"timestamp":1742925405.3776855,"name":"init"}
+{"timestamp":1742925405.3786778,"name":"starting"}
+{"timestamp":1742925405.3981018,"name":"shell.init","context":{"service":"501043911-shell-fftPNRGs","leader-rank":5,"size":2}}
+{"timestamp":1742925405.4020319,"name":"shell.start","context":{"taskmap":{"version":1,"map":[[0,2,1,1]]}}}
+{"timestamp":1742925414.4804258,"name":"shell.task-exit","context":{"localid":0,"rank":0,"state":"Exited","pid":2525,"wait_status":0,"signaled":0,"exitcode":0}}
+{"timestamp":1742925414.4866488,"name":"complete","context":{"status":0}}
+{"timestamp":1742925414.4866798,"name":"done"}
+
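
The eventlogs give a way to check that the re-run was serial: the bandwidth job for `osu-2-iter-0` should not start before the latency job for the same pair is done. The sketch below compares the two timestamps. The function names `event_time` and `ran_serially` are hypothetical, and the `sed` pattern assumes the compact key order seen in these logs (`"timestamp"` first, then `"name"`); a JSON-aware tool would be more robust.

```bash
#!/bin/bash
# Print the timestamp of the first event called NAME from an eventlog
# (one JSON object per line) read on stdin.
event_time() {
    local name=$1
    sed -n 's/^{"timestamp":\([0-9.]*\),"name":"'"$name"'".*/\1/p' | head -n 1
}

# Succeed when the second job's "init" is at or after the first job's "done".
ran_serially() {
    local first_done=$1 second_init=$2
    awk -v a="$first_done" -v b="$second_init" 'BEGIN { exit !(b >= a) }'
}

# Lines copied from the osu-2-iter-0 latency and bandwidth eventlogs.
latency_log='{"timestamp":1742925405.1001337,"name":"complete","context":{"status":0}}
{"timestamp":1742925405.1001668,"name":"done"}'
bw_log='{"timestamp":1742925405.3776855,"name":"init"}
{"timestamp":1742925405.3786778,"name":"starting"}'

latency_done=$(printf '%s\n' "$latency_log" | event_time done)
bw_init=$(printf '%s\n' "$bw_log" | event_time init)

if ran_serially "$latency_done" "$bw_init"; then
    echo "bandwidth job started after the latency job finished"
else
    echo "jobs overlapped"
fi
```

For this pair the bandwidth job's `init` (1742925405.3776855) is later than the latency job's `done` (1742925405.1001668), which is consistent with blocking execution.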
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+Warning: OMB could not identify the local rank of the process.
+This can lead to multiple processes using the same GPU.
+Please use the get_local_rank script in the OMB repo for this.
+Warning: OMB could not identify the local rank of the process.
+This can lead to multiple processes using the same GPU.
+Please use the get_local_rank script in the OMB repo for this.
+# OSU MPI-CUDA Latency Test v5.8
+# Send Buffer on HOST (H) and Receive Buffer on HOST (H)
+# Size Latency (us)
+0 24.53
+1 24.95
+2 25.18
+4 25.36
+8 24.93
+16 24.43
+32 24.97
+64 24.56
+128 25.23
+256 26.02
+512 25.40
+1024 26.45
+2048 29.49
+4096 31.35
+8192 36.66
+16384 45.80
+32768 67.06
+65536 181.80
+131072 189.26
+262144 238.14
+524288 356.01
+1048576 562.24
+2097152 965.45
+4194304 1708.20
+START OF JOBSPEC
+{"resources": [{"type": "node", "count": 2, "with": [{"type": "slot", "count": 1, "with": [{"type": "core", "count": 1}, {"type": "gpu", "count": 1}], "label": "task"}]}], "tasks": [{"command": ["singularity", "exec", "--nv", "/opt/containers/metric-osu-gpu_google-gpu.sif", "/opt/osu-benchmark/build.openmpi/mpi/pt2pt/osu_latency", "-d", "cuda", "H", "H"], "slot": "task", "count": {"per_slot": 1}}], "attributes": {"system": {"duration": 0, "cwd": "/opt/containers", "shell": {"options": {"rlimit": {"cpu": -1, "fsize": -1, "data": -1, "stack": 8388608, "core": 0, "nofile": 1048576, "as": -1, "rss": -1, "nproc": -1}, "cpu-affinity": "per-task", "gpu-affinity": "per-task"}}, "constraints": {"hostlist": ["flux-009,flux-002"]}}, "user": {"study_id": "osu-2-iter-1"}}, "version": 1}
+START OF EVENTLOG
+{"timestamp":1742925414.761935,"name":"init"}
+{"timestamp":1742925414.7628424,"name":"starting"}
+{"timestamp":1742925414.7843826,"name":"shell.init","context":{"service":"501043911-shell-fk2FaDkj","leader-rank":1,"size":2}}
+{"timestamp":1742925414.8324976,"name":"shell.start","context":{"taskmap":{"version":1,"map":[[0,2,1,1]]}}}
+{"timestamp":1742925439.7674615,"name":"shell.task-exit","context":{"localid":0,"rank":1,"state":"Exited","pid":2329,"wait_status":0,"signaled":0,"exitcode":0}}
+{"timestamp":1742925439.7992711,"name":"complete","context":{"status":0}}
+{"timestamp":1742925439.7993052,"name":"done"}
+
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+Warning: OMB could not identify the local rank of the process.
+This can lead to multiple processes using the same GPU.
+Please use the get_local_rank script in the OMB repo for this.
+Warning: OMB could not identify the local rank of the process.
+This can lead to multiple processes using the same GPU.
+Please use the get_local_rank script in the OMB repo for this.
+# OSU MPI-CUDA Bandwidth Test v5.8
+# Send Buffer on HOST (H) and Receive Buffer on HOST (H)
+# Size Bandwidth (MB/s)
+1 0.20
+2 0.50
+4 0.92
+8 1.49
+16 3.58
+32 7.05
+64 20.39
+128 38.56
+256 74.51
+512 142.02
+1024 271.58
+2048 507.87
+4096 829.86
+8192 1384.97
+16384 1849.12
+32768 2149.54
+65536 2188.41
+131072 2194.09
+262144 1989.21
+524288 1882.12
+1048576 1843.67
+2097152 1829.40
+4194304 1928.82
+START OF JOBSPEC
+{"resources": [{"type": "node", "count": 2, "with": [{"type": "slot", "count": 1, "with": [{"type": "core", "count": 1}, {"type": "gpu", "count": 1}], "label": "task"}]}], "tasks": [{"command": ["singularity", "exec", "--nv", "/opt/containers/metric-osu-gpu_google-gpu.sif", "/opt/osu-benchmark/build.openmpi/mpi/pt2pt/osu_bw", "-d", "cuda", "H", "H"], "slot": "task", "count": {"per_slot": 1}}], "attributes": {"system": {"duration": 0, "cwd": "/opt/containers", "shell": {"options": {"rlimit": {"cpu": -1, "fsize": -1, "data": -1, "stack": 8388608, "core": 0, "nofile": 1048576, "as": -1, "rss": -1, "nproc": -1}, "cpu-affinity": "per-task", "gpu-affinity": "per-task"}}, "constraints": {"hostlist": ["flux-009,flux-002"]}}, "user": {"study_id": "osu-2-iter-1"}}, "version": 1}
+START OF EVENTLOG
+{"timestamp":1742925440.0800998,"name":"init"}
+{"timestamp":1742925440.0810995,"name":"starting"}
+{"timestamp":1742925440.1029232,"name":"shell.init","context":{"service":"501043911-shell-fwBQeiZ5","leader-rank":1,"size":2}}
+{"timestamp":1742925440.1071765,"name":"shell.start","context":{"taskmap":{"version":1,"map":[[0,2,1,1]]}}}
+{"timestamp":1742925448.7218251,"name":"shell.task-exit","context":{"localid":0,"rank":0,"state":"Exited","pid":2244,"wait_status":0,"signaled":0,"exitcode":0}}
+{"timestamp":1742925448.7313514,"name":"complete","context":{"status":0}}
+{"timestamp":1742925448.731385,"name":"done"}
+
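
Bandwidth in these logs rises with message size and then falls off slightly, so the peak is a convenient summary value. The following sketch finds the row with the largest value column. The function name `osu_peak` is hypothetical, the parsing assumes the raw log layout (no diff `+` prefixes), and the sample rows are copied from the `osu-2-iter-1` bandwidth log above.

```bash
#!/bin/bash
# Print "SIZE VALUE" for the data row with the largest value column,
# reading OSU pt2pt output on stdin. Parsing stops at the jobspec marker.
osu_peak() {
    awk '
        /^START OF JOBSPEC/ { exit }
        NF == 2 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9.]+$/ {
            if (!seen || $2 + 0 > best) {
                best = $2 + 0; size = $1; val = $2; seen = 1
            }
        }
        END { if (seen) print size, val }
    '
}

# Rows around the peak, taken from the bandwidth log above.
bw_sample='# Size Bandwidth (MB/s)
32768 2149.54
65536 2188.41
131072 2194.09
262144 1989.21
START OF JOBSPEC'

echo "peak (size MB/s): $(printf '%s\n' "$bw_sample" | osu_peak)"
```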
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+Warning: OMB could not identify the local rank of the process.
+This can lead to multiple processes using the same GPU.
+Please use the get_local_rank script in the OMB repo for this.
+Warning: OMB could not identify the local rank of the process.
+This can lead to multiple processes using the same GPU.
+Please use the get_local_rank script in the OMB repo for this.
+# OSU MPI-CUDA Latency Test v5.8
+# Send Buffer on HOST (H) and Receive Buffer on HOST (H)
+# Size Latency (us)
+0 30.16
+1 30.22
+2 30.47
+4 30.38
+8 30.69
+16 30.46
+32 30.60
+64 30.74
+128 30.68
+256 31.60
+512 31.79
+1024 32.73
+2048 36.24
+4096 38.27
+8192 43.85
+16384 54.69
+32768 80.31
+65536 211.02
+131072 210.16
+262144 271.94
+524288 367.93
+1048576 591.96
+2097152 975.44
+4194304 1859.91
+START OF JOBSPEC
+{"resources": [{"type": "node", "count": 2, "with": [{"type": "slot", "count": 1, "with": [{"type": "core", "count": 1}, {"type": "gpu", "count": 1}], "label": "task"}]}], "tasks": [{"command": ["singularity", "exec", "--nv", "/opt/containers/metric-osu-gpu_google-gpu.sif", "/opt/osu-benchmark/build.openmpi/mpi/pt2pt/osu_latency", "-d", "cuda", "H", "H"], "slot": "task", "count": {"per_slot": 1}}], "attributes": {"system": {"duration": 0, "cwd": "/opt/containers", "shell": {"options": {"rlimit": {"cpu": -1, "fsize": -1, "data": -1, "stack": 8388608, "core": 0, "nofile": 1048576, "as": -1, "rss": -1, "nproc": -1}, "cpu-affinity": "per-task", "gpu-affinity": "per-task"}}, "constraints": {"hostlist": ["flux-006,flux-007"]}}, "user": {"study_id": "osu-2-iter-10"}}, "version": 1}
+START OF EVENTLOG
+{"timestamp":1742925709.3119636,"name":"init"}
+{"timestamp":1742925709.3130028,"name":"starting"}
+{"timestamp":1742925709.3332787,"name":"shell.init","context":{"service":"501043911-shell-f3yqHEv7h","leader-rank":5,"size":2}}
+{"timestamp":1742925709.3373182,"name":"shell.start","context":{"taskmap":{"version":1,"map":[[0,2,1,1]]}}}
+{"timestamp":1742925730.459857,"name":"shell.task-exit","context":{"localid":0,"rank":0,"state":"Exited","pid":2816,"wait_status":0,"signaled":0,"exitcode":0}}
+{"timestamp":1742925730.4663754,"name":"complete","context":{"status":0}}
+{"timestamp":1742925730.4664104,"name":"done"}
+
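
The wall time of a job can be estimated from its eventlog as the gap between the `init` and `done` events. This is a sketch under the same assumptions as before: one JSON object per line, with the exact `"name":"init"` and `"name":"done"` spellings seen in these logs. The function name `job_seconds` is hypothetical; the two sample lines are copied from the `osu-2-iter-10` latency eventlog above.

```bash
#!/bin/bash
# Print the seconds between the "init" and "done" events of an eventlog
# read on stdin, rounded to two decimals. Prints nothing if either is missing.
job_seconds() {
    awk '
        match($0, /"timestamp":[0-9.]+/) {
            # Skip the 12 characters of the key: "timestamp":
            ts = substr($0, RSTART + 12, RLENGTH - 12)
            if ($0 ~ /"name":"init"/) t0 = ts
            if ($0 ~ /"name":"done"/) t1 = ts
        }
        END { if (t0 != "" && t1 != "") printf "%.2f\n", t1 - t0 }
    '
}

# First and last events of the osu-2-iter-10 latency job.
events='{"timestamp":1742925709.3119636,"name":"init"}
{"timestamp":1742925730.4664104,"name":"done"}'

echo "latency job wall time (s): $(printf '%s\n' "$events" | job_seconds)"
```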
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+Warning: OMB could not identify the local rank of the process.
+This can lead to multiple processes using the same GPU.
+Please use the get_local_rank script in the OMB repo for this.
+Warning: OMB could not identify the local rank of the process.
+This can lead to multiple processes using the same GPU.
+Please use the get_local_rank script in the OMB repo for this.
+# OSU MPI-CUDA Bandwidth Test v5.8
+# Send Buffer on HOST (H) and Receive Buffer on HOST (H)
+# Size Bandwidth (MB/s)
+1 0.17
+2 0.34
+4 0.79
+8 1.05
+16 3.45
+32 8.12
+64 21.08
+128 43.15
+256 77.26
+512 151.05
+1024 288.65
+2048 524.34
+4096 858.74
+8192 1397.65
+16384 1808.67
+32768 2147.22
+65536 2343.34
+131072 2199.78
+262144 2087.49
+524288 2018.49
+1048576 1963.70
+2097152 1974.59
+4194304 1962.15
+START OF JOBSPEC
+{"resources": [{"type": "node", "count": 2, "with": [{"type": "slot", "count": 1, "with": [{"type": "core", "count": 1}, {"type": "gpu", "count": 1}], "label": "task"}]}], "tasks": [{"command": ["singularity", "exec", "--nv", "/opt/containers/metric-osu-gpu_google-gpu.sif", "/opt/osu-benchmark/build.openmpi/mpi/pt2pt/osu_bw", "-d", "cuda", "H", "H"], "slot": "task", "count": {"per_slot": 1}}], "attributes": {"system": {"duration": 0, "cwd": "/opt/containers", "shell": {"options": {"rlimit": {"cpu": -1, "fsize": -1, "data": -1, "stack": 8388608, "core": 0, "nofile": 1048576, "as": -1, "rss": -1, "nproc": -1}, "cpu-affinity": "per-task", "gpu-affinity": "per-task"}}, "constraints": {"hostlist": ["flux-006,flux-007"]}}, "user": {"study_id": "osu-2-iter-10"}}, "version": 1}
+START OF EVENTLOG
+{"timestamp":1742925730.7448459,"name":"init"}
+{"timestamp":1742925730.7459881,"name":"starting"}
+{"timestamp":1742925730.7653837,"name":"shell.init","context":{"service":"501043911-shell-f49H8dX6P","leader-rank":5,"size":2}}
+{"timestamp":1742925730.7693317,"name":"shell.start","context":{"taskmap":{"version":1,"map":[[0,2,1,1]]}}}
+{"timestamp":1742925739.0274324,"name":"shell.task-exit","context":{"localid":0,"rank":0,"state":"Exited","pid":2856,"wait_status":0,"signaled":0,"exitcode":0}}
+{"timestamp":1742925739.032778,"name":"complete","context":{"status":0}}
+{"timestamp":1742925739.0328097,"name":"done"}
+
