I brought up the cluster, verified that InfiniBand installed
successfully, and then tested OSU. There were consistent segfaults
across the board, and only one set of nodes worked.
Signed-off-by: vsoch <[email protected]>
You'll need to manually scale up to 256 for the VMSet.
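If you prefer the CLI to clicking through the portal, a scale command along these lines should work (a sketch: the cluster and resource group names are the ones used later in this guide, and the node pool name `nodepool1` is an assumption):

```bash
az aks nodepool scale \
    --resource-group flux-usernetes \
    --cluster-name performance-study-256 \
    --name nodepool1 \
    --node-count 256
```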
### 1. Setup
Note that I needed to create this entirely in the UI; it can't be done automatically. We are required to have at least one node in the agent pool. For testing I used one static node, and for production I allowed autoscaling from 1 to 3, not knowing what might be needed. Once your deployment is ready and you can use Connect -> cloud shell to connect, register the AKSInfinibandSupport feature:
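The registration follows the standard `az feature register` pattern (a sketch; the second command is just a way to confirm the feature shows as `Registered` before moving on):

```bash
az feature register --namespace Microsoft.ContainerService --name AKSInfinibandSupport
az feature show --namespace Microsoft.ContainerService --name AKSInfinibandSupport --query properties.state
```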
@@ -54,6 +73,12 @@ Run this final step:
```bash
az provider register --namespace Microsoft.ContainerService
```

This extra command was needed the second time to get credentials for the cluster:

```bash
az aks get-credentials --name performance-study-256 --resource-group flux-usernetes
```

Note that if you shell in now, install `ibverbs-utils`, and run `ibv_devices`, the output will be empty.
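For reference, that check looks like the following (a minimal sketch, assuming an Ubuntu-based container where you can use apt):

```bash
# Install the InfiniBand userspace utilities and list devices
apt-get update && apt-get install -y ibverbs-utils
ibv_devices    # empty output at this stage, before the driver install
```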
If you are doing this in the cloud shell, you'll next want to copy the entirety of `~/.kube/config` to your local machine to access the cluster. Let's install InfiniBand next, using a container that is also built with Ubuntu 22.04 drivers.
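One way to do the copy (a sketch; the paths are the kubectl defaults):

```bash
# In the cloud shell, print the config so you can copy it
cat ~/.kube/config

# On your local machine, paste it into ~/.kube/config and verify access
kubectl get nodes
```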
@@ -103,7 +128,7 @@ Now we are ready for different MiniCluster setups. For each of the below, to shell in:
```bash
kubectl exec -it flux-sample-0-xxx bash
```
Next, choose a cluster size in one of the experiment folders.
Note that this second time, the container pull took over 10 minutes.
```bash
kubectl logs -n monitoring event-exporter-6bf9c87d4d-v4rtr -f |& tee ./events-osu-$(date +%s).json
kubectl apply -f ./crd/osu.yaml
```
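In another terminal, you can watch for the MiniCluster pods to come up before shelling in (a minimal sketch):

```bash
# Wait until all flux-sample pods are Running
kubectl get pods --watch
```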
@@ -236,13 +257,27 @@ export app=osu
output=./results/$app
mkdir -p $output

chmod +x flux-run-combinations.sh
./flux-run-combinations.sh 256 $app

for i in $(seq 1 5); do
    echo "Running iteration $i"
    time flux run --setattr=user.study_id=$app-256-iter-$i -N256 -n 24576 -o cpu-affinity=per-task /opt/osu-benchmark/build.openmpi/mpi/collective/osu_allreduce
done

# Just successful ones
for jobid in $(flux jobs --filter=completed --json | jq -r .jobs[].id)
do
    # Get the job study id
    study_id=$(flux job info $jobid jobspec | jq -r ".attributes.user.study_id")
    echo "Parsing jobid ${jobid} and study id ${study_id}"
The second attempt was not successful. The creation process was different (I had to manually request credentials), and all of the OSU tests segfaulted except for one. I ran the script four times to see if more successful results would appear, and then gave up, as it's an expensive cluster.
### Clean Up
When you are done, delete the cluster from the web interface.
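If you'd rather use the CLI than the web interface, the equivalent should be (a sketch, using the names from earlier):

```bash
az aks delete --name performance-study-256 --resource-group flux-usernetes
```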