Commit bda94e6

Merge pull request #85 from converged-computing/azure-osu-reruns
osu re-runs - not a success
2 parents c4a69e2 + 84fad35 commit bda94e6

4 files changed: +130 -7 lines changed
experiments/azure/aks/cpu/size256/README.md

Lines changed: 46 additions & 7 deletions
@@ -29,6 +29,25 @@ az aks create \
 az aks get-credentials --resource-group aks-gpu-testing-west --name aks-gpu-testing-cluster
 ```

+And for the redo run in March 2025 for OSU:
+
+```bash
+az aks create \
+--resource-group flux-usernetes \
+--name performance-study-256 \
+--ppg /subscriptions/3e173a37-8f81-492f-a234-ca727b72e6f8/resourceGroups/flux-usernetes/providers/Microsoft.Compute/proximityPlacementGroups/aks-placement-test \
+--network-plugin azure \
+--node-count 99 \
+--node-vm-size standard_hb120-96rs_v3 \
+--location southcentralus \
+--enable-node-public-ip \
+--vm-set-type VirtualMachineScaleSets \
+--load-balancer-sku standard \
+--generate-ssh-keys
+```
+
+You'll then need to manually scale the VM scale set up to 256 nodes.
+
 ### 1. Setup

 Note that I needed to create this entirely in the UI, and you can't do it automatically. We are required to have at least one node in the agent pool. For testing I used one static, and for production I allowed autoscaling 1-3, not knowing what might be needed. Once your deployment is ready and you can use the Connect -> cloud shell to connect, register the feature for AKSInfinibandSupport:
@@ -54,6 +73,12 @@ Run this final step:
 az provider register --namespace Microsoft.ContainerService
 ```

+This was an extra command needed the second time to get the credentials:
+
+```bash
+az aks get-credentials --name performance-study-256 --resource-group flux-usernetes
+```
+
 Note that if you shell in now and install `ibverbs-utils` and do `ibv_devices` it will be empty.
 If you are doing this in the cloud shell, you'll next want to copy the entirety of the `~/.kube/config` to your local machine to access the cluster. Let's try to install infiniband next, and we will use a container that is also built with ubuntu 22.04 drivers.

@@ -103,7 +128,7 @@ Now we are ready for different MiniCluster setups. For each of the below, to she
 ```bash
 kubectl exec -it flux-sample-0-xxx bash
 ```
-Next, choose a cluster size in one of the experiment folders.
+
 Monitoring:

 ```bash
@@ -114,12 +139,6 @@ kubectl create namespace monitoring
 kubectl apply -f deploy
 ```

-Install the Flux Operator:
-
-```bash
-kubectl apply -f ./flux-operator.yaml
-```
-
 Now we are ready for different MiniCluster setups. For each of the below, to shell in to the lead broker (index 0) you do:

 ```bash
@@ -184,6 +203,8 @@ kubectl delete -f crd/single-node.yaml

 #### OSU

+Note that this second time, the container pull took over 10 minutes.
+
 ```bash
 kubectl logs -n monitoring event-exporter-6bf9c87d4d-v4rtr -f |& tee ./events-osu-$(date +%s).json
 kubectl apply -f ./crd/osu.yaml
@@ -236,13 +257,27 @@ export app=osu
 output=./results/$app
 mkdir -p $output

+chmod +x flux-run-combinations.sh
 ./flux-run-combinations.sh 256 $app

 for i in $(seq 1 5); do
 echo "Running iteration $i"
 time flux run --setattr=user.study_id=$app-256-iter-$i -N256 -n 24576 -o cpu-affinity=per-task /opt/osu-benchmark/build.openmpi/mpi/collective/osu_allreduce
 done

+# Just successful ones
+for jobid in $(flux jobs --filter=completed --json | jq -r .jobs[].id)
+do
+# Get the job study id
+study_id=$(flux job info $jobid jobspec | jq -r ".attributes.user.study_id")
+echo "Parsing jobid ${jobid} and study id ${study_id}"
+flux job attach $jobid &> $output/${study_id}-${jobid}.out
+echo "START OF JOBSPEC" >> $output/${study_id}-${jobid}.out
+flux job info $jobid jobspec >> $output/${study_id}-${jobid}.out
+echo "START OF EVENTLOG" >> $output/${study_id}-${jobid}.out
+flux job info $jobid guest.exec.eventlog >> $output/${study_id}-${jobid}.out
+done
+
 # When they are done:
 for jobid in $(flux jobs -a --json | jq -r .jobs[].id)
 do
@@ -257,6 +292,7 @@ for jobid in $(flux jobs -a --json | jq -r .jobs[].id)
 done

 oras push ghcr.io/converged-computing/metrics-operator-experiments/performance:aks-infiniband-cpu-256-$app $output
+oras push ghcr.io/converged-computing/metrics-operator-experiments/performance:aks-infiniband-cpu-256-$app-rerun $output
 ```
 ```bash
 kubectl delete -f ./crd/osu.yaml
@@ -578,6 +614,9 @@ oras push ghcr.io/converged-computing/metrics-operator-experiments/performance:a
 ```bash
 kubectl delete -f ./crd/quicksilver.yaml
 ```
+
+The second attempt was not successful. Cluster creation was different (I had to manually request credentials), and all of the OSU tests segfaulted except for one. I ran the script four times to see if more runs would succeed, and then gave up because it's an expensive cluster.
+
 ### Clean Up

 When you are done, delete the cluster from the web interface.
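
The README note above says all but one OSU test segfaulted. One way to sort saved eventlogs into successes and failures is to check the `complete` event, which carries `"status": 0` in both successful eventlogs added by this commit. The sketch below assumes a failed job either lacks a `complete` event or reports a nonzero status there. The helper name `job_succeeded` and the failing sample line are hypothetical and not taken from the repository.

```python
import json


def job_succeeded(event_lines):
    """Return True if the eventlog has a 'complete' event with status 0.

    Hypothetical helper: a missing 'complete' event counts as a failure.
    """
    for line in event_lines:
        event = json.loads(line)
        if event.get("name") == "complete":
            return event.get("context", {}).get("status") == 0
    return False


# Copied from the successful latency eventlog in this commit.
ok_log = [
    '{"timestamp":1743294666.4000165,"name":"complete","context":{"status":0}}',
]

# Made-up example of a failing job; the status value is illustrative only.
bad_log = [
    '{"timestamp":1.0,"name":"complete","context":{"status":139}}',
]

print(job_succeeded(ok_log))
print(job_succeeded(bad_log))
```
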
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+# OSU MPI Latency Test v5.8
+# Size Latency (us)
+0 1.59
+1 1.58
+2 1.58
+4 1.58
+8 1.59
+16 1.59
+32 1.74
+64 1.79
+128 1.85
+256 2.35
+512 2.47
+1024 2.58
+2048 2.75
+4096 3.48
+8192 4.09
+16384 5.10
+32768 6.59
+65536 9.01
+131072 13.22
+262144 17.26
+524288 27.81
+1048576 49.79
+2097152 92.36
+4194304 177.80
+START OF JOBSPEC
+{"resources": [{"type": "node", "count": 2, "with": [{"type": "slot", "count": 1, "with": [{"type": "core", "count": 1}], "label": "task"}]}], "tasks": [{"command": ["/opt/osu-benchmark/build.openmpi/mpi/pt2pt/osu_latency"], "slot": "task", "count": {"per_slot": 1}}], "attributes": {"system": {"duration": 0, "cwd": "/opt/azhpc-images/ubuntu/ubuntu-22.x/ubuntu-22.04-hpc", "shell": {"options": {"rlimit": {"cpu": -1, "fsize": -1, "data": -1, "stack": 8388608, "core": -1, "nofile": 1048576, "as": -1, "rss": -1, "nproc": -1}, "cpu-affinity": "per-task"}}, "constraints": {"hostlist": ["flux-sample-0,flux-sample-29"]}}, "user": {"study_id": "osu-2-iter-1"}}, "version": 1}
+START OF EVENTLOG
+{"timestamp":1743294664.2062294,"name":"init"}
+{"timestamp":1743294664.2068343,"name":"starting"}
+{"timestamp":1743294664.2358925,"name":"shell.init","context":{"service":"0-shell-fJz6hgzj","leader-rank":0,"size":2}}
+{"timestamp":1743294664.2392285,"name":"shell.start","context":{"taskmap":{"version":1,"map":[[0,2,1,1]]}}}
+{"timestamp":1743294666.3936222,"name":"shell.task-exit","context":{"localid":0,"rank":0,"state":"Exited","pid":366,"wait_status":0,"signaled":0,"exitcode":0}}
+{"timestamp":1743294666.4000165,"name":"complete","context":{"status":0}}
+{"timestamp":1743294666.4000454,"name":"done"}
+
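
The result file above follows the layout written by the collection loop in the README: raw benchmark output, then a `START OF JOBSPEC` marker followed by one JSON line, then a `START OF EVENTLOG` marker followed by one JSON event per line. Below is a minimal Python sketch of splitting such a file, assuming exactly that layout. The names `split_result` and `parse_osu_rows` and the abbreviated sample are illustrative and not part of the repository.

```python
import json

JOBSPEC_MARKER = "START OF JOBSPEC"
EVENTLOG_MARKER = "START OF EVENTLOG"


def split_result(text):
    """Split one result file into (output lines, jobspec dict, event dicts).

    Hypothetical helper: assumes each marker appears at most once, in order.
    """
    output, jobspec_lines, event_lines = [], [], []
    current = output
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == JOBSPEC_MARKER:
            current = jobspec_lines
        elif stripped == EVENTLOG_MARKER:
            current = event_lines
        elif stripped:
            current.append(stripped)
    jobspec = json.loads("".join(jobspec_lines)) if jobspec_lines else None
    events = [json.loads(line) for line in event_lines]
    return output, jobspec, events


def parse_osu_rows(output):
    """Turn 'size value' rows into (int, float) pairs.

    Comment lines and anything that is not two numeric fields (for example
    error text from a crashed run) are skipped rather than raising.
    """
    rows = []
    for line in output:
        parts = line.split()
        if line.startswith("#") or len(parts) != 2:
            continue
        try:
            rows.append((int(parts[0]), float(parts[1])))
        except ValueError:
            continue
    return rows


# Abbreviated sample in the same layout as the files in this commit.
sample = """# OSU MPI Latency Test v5.8
# Size Latency (us)
0 1.59
1 1.58
START OF JOBSPEC
{"attributes": {"user": {"study_id": "osu-2-iter-1"}}, "version": 1}
START OF EVENTLOG
{"timestamp":1743294664.2062294,"name":"init"}
{"timestamp":1743294666.4000454,"name":"done"}
"""

output, jobspec, events = split_result(sample)
rows = parse_osu_rows(output)
print(rows)
print(jobspec["attributes"]["user"]["study_id"])
print([event["name"] for event in events])
```

Skipping malformed rows is a deliberate choice here, since the README reports that most reruns segfaulted and their output would not contain clean `size value` rows.
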
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+# OSU MPI Bandwidth Test v5.8
+# Size Bandwidth (MB/s)
+1 3.85
+2 7.87
+4 15.51
+8 31.56
+16 62.61
+32 121.06
+64 243.40
+128 488.32
+256 925.33
+512 1724.95
+1024 3260.50
+2048 5980.09
+4096 9182.91
+8192 13711.41
+16384 18938.37
+32768 21164.41
+65536 22852.02
+131072 23628.27
+262144 24008.90
+524288 24114.70
+1048576 24228.32
+2097152 24397.00
+4194304 24486.00
+START OF JOBSPEC
+{"resources": [{"type": "node", "count": 2, "with": [{"type": "slot", "count": 1, "with": [{"type": "core", "count": 1}], "label": "task"}]}], "tasks": [{"command": ["/opt/osu-benchmark/build.openmpi/mpi/pt2pt/osu_bw"], "slot": "task", "count": {"per_slot": 1}}], "attributes": {"system": {"duration": 0, "cwd": "/opt/azhpc-images/ubuntu/ubuntu-22.x/ubuntu-22.04-hpc", "shell": {"options": {"rlimit": {"cpu": -1, "fsize": -1, "data": -1, "stack": 8388608, "core": -1, "nofile": 1048576, "as": -1, "rss": -1, "nproc": -1}, "cpu-affinity": "per-task"}}, "constraints": {"hostlist": ["flux-sample-0,flux-sample-29"]}}, "user": {"study_id": "osu-2-iter-1"}}, "version": 1}
+START OF EVENTLOG
+{"timestamp":1743294666.520262,"name":"init"}
+{"timestamp":1743294666.5208848,"name":"starting"}
+{"timestamp":1743294666.5504811,"name":"shell.init","context":{"service":"0-shell-fL1JH1fM","leader-rank":0,"size":2}}
+{"timestamp":1743294666.5542791,"name":"shell.start","context":{"taskmap":{"version":1,"map":[[0,2,1,1]]}}}
+{"timestamp":1743294667.665215,"name":"shell.task-exit","context":{"localid":0,"rank":0,"state":"Exited","pid":372,"wait_status":0,"signaled":0,"exitcode":0}}
+{"timestamp":1743294667.6716905,"name":"complete","context":{"status":0}}
+{"timestamp":1743294667.6717153,"name":"done"}
+
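
The eventlog timestamps above appear to be Unix epoch seconds, so the gap between the `init` and `done` events gives a rough wall-clock duration for the job. Below is a minimal sketch under that assumption. The helper name `job_duration` is hypothetical, and the sample lines are copied from the bandwidth eventlog in this commit.

```python
import json


def job_duration(event_lines, start="init", end="done"):
    """Return seconds between the first `start` and first `end` events.

    Hypothetical helper: returns None if either event is missing.
    """
    first_seen = {}
    for line in event_lines:
        event = json.loads(line)
        # Keep only the first timestamp recorded for each event name.
        first_seen.setdefault(event["name"], event["timestamp"])
    if start not in first_seen or end not in first_seen:
        return None
    return first_seen[end] - first_seen[start]


eventlog = [
    '{"timestamp":1743294666.520262,"name":"init"}',
    '{"timestamp":1743294666.5542791,"name":"shell.start","context":{"taskmap":{"version":1,"map":[[0,2,1,1]]}}}',
    '{"timestamp":1743294667.6717153,"name":"done"}',
]

print(f"{job_duration(eventlog):.2f} s")  # prints "1.15 s"
```
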

paper/osu-benchmarks/1-run-analysis.py

Lines changed: 11 additions & 0 deletions
@@ -237,6 +237,17 @@ def parse_data(indir, outdir, files):
 if exp.size == 2:
     continue

+# We reran google cloud GPU size 16 with fixed results for latency and bw.
+if (
+    exp.cloud == "google"
+    and exp.env_type == "gpu"
+    and exp.size == 16
+    and "osu-allreduce" not in filename
+    and "osu-fixed" not in filename
+):
+    print(f"Skipping old result {filename}")
+    continue
+
 # Cyclecloud GPU had multiple types, osu-dd and osu-hh
 # For Google Cloud GPU on GKE the point to point benchmarks
 # were run without GPU - never worked
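
The condition added in this hunk can be read as a standalone predicate: for Google Cloud GPU runs of size 16, only files whose names contain `osu-allreduce` or `osu-fixed` are kept, and every other file is treated as an old result. Below is a minimal sketch of that logic. The `Experiment` dataclass is a hypothetical stand-in for the script's `exp` object (only the three attributes used in the diff are modeled), and the function name and filenames are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    """Hypothetical stand-in for the `exp` object in 1-run-analysis.py."""

    cloud: str
    env_type: str
    size: int


def is_superseded(exp, filename):
    """Mirror the diff's condition: True means the file is an old result."""
    return (
        exp.cloud == "google"
        and exp.env_type == "gpu"
        and exp.size == 16
        and "osu-allreduce" not in filename
        and "osu-fixed" not in filename
    )


gke_gpu_16 = Experiment(cloud="google", env_type="gpu", size=16)

# Illustrative filenames only; the real naming scheme is not shown in the diff.
print(is_superseded(gke_gpu_16, "osu-latency-123.out"))        # True
print(is_superseded(gke_gpu_16, "osu-fixed-latency-123.out"))  # False
print(is_superseded(gke_gpu_16, "osu-allreduce-123.out"))      # False
print(is_superseded(Experiment("google", "gpu", 32), "osu-latency-123.out"))  # False
```
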
