Commit 3ec2c31

bug: vmset is not reliable with hostnames (#3)
We cannot be guaranteed that the lead broker is flux000000, so we need to provide automation to update/fix the issue. I am also adding a script to install the OSU benchmarks.

Signed-off-by: vsoch <[email protected]>
1 parent 83d2f1f commit 3ec2c31

5 files changed: +152 -3 lines changed

tutorial/azure/README.md

Lines changed: 68 additions & 2 deletions
@@ -71,15 +71,81 @@ follower_address=$(az vmss list-instance-public-ips -g terraform-testing -n flux
 ssh -i ./id_azure azureuser@${follower_address}
 ```
 
+Note that if the lead broker doesn't come up as `flux000000` (Azure does not assign hostnames predictably), we will need to update the broker configuration. First, find the lead broker's hostname:
+
+```bash
+lead_broker=$(az vmss list-instances -g terraform-testing -n flux | jq -r '.[0].osProfile.computerName')
+echo "The lead broker is ${lead_broker}"
+```
+
+Here is how you can fix all your brokers:
+
+```bash
+for address in $(az vmss list-instance-public-ips -g terraform-testing -n flux | jq -r '.[].ipAddress')
+do
+  echo "Updating $address"
+  scp -i ./id_azure update_brokers.sh "azureuser@${address}:/tmp/update_brokers.sh"
+  ssh -i ./id_azure "azureuser@${address}" "/bin/bash /tmp/update_brokers.sh flux $lead_broker"
+done
+```
+
+I've also provided a script to install the OSU benchmarks using the same strategy:
+
+```bash
+for address in $(az vmss list-instance-public-ips -g terraform-testing -n flux | jq -r '.[].ipAddress')
+do
+  echo "Installing OSU benchmarks on $address"
+  scp -i ./id_azure install_osu.sh "azureuser@${address}:/tmp/install_osu.sh"
+  ssh -i ./id_azure "azureuser@${address}" "/bin/bash /tmp/install_osu.sh"
+done
+```
+
+This installs to `/usr/local/libexec/osu-micro-benchmarks/mpi`.
+
 ### 3. Checks
 
 Check the cluster status, the overlay status, and try running a job:
 
 ```bash
-$ flux resource list
+flux resource list
 ```
 ```bash
-$ flux run -N 2 hostname
+flux run -N 2 hostname
+```
+
+Try running a benchmark!
+
+```bash
+flux run -N2 /usr/local/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce
+flux run -N2 -n2 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency
+```
+```console
+# OSU MPI Latency Test v5.8
+# Size          Latency (us)
+0                       1.57
+1                       1.56
+2                       1.56
+4                       1.56
+8                       1.57
+16                      1.57
+32                      1.70
+64                      1.76
+128                     1.80
+256                     2.31
+512                     2.36
+1024                    2.52
+2048                    2.70
+4096                    3.46
+8192                    3.96
+16384                   5.24
+32768                   6.85
+65536                   9.18
+131072                 14.20
+262144                 17.30
+524288                 27.94
+1048576                50.00
+2097152                92.04
+4194304               177.34
 ```
 
 ### 4. Cleanup
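
The two loops added to the README differ only in the script they copy and the arguments they pass, so they could share one helper. A minimal sketch, assuming a hypothetical function `run_on_all` and a hypothetical `DRY_RUN` variable (neither is part of this commit), shown with placeholder addresses from the documentation IP range rather than real nodes:

```bash
#!/bin/bash

# Hypothetical helper (not in the commit): copy a script to each address
# and run it there. Set DRY_RUN=1 to print the commands instead of running them.
# Usage: run_on_all <script> <script-args> <address>...
run_on_all() {
    local script=$1 script_args=$2
    shift 2
    local remote address
    remote="/tmp/$(basename "$script")"
    for address in "$@"; do
        echo "Updating $address"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "scp -i ./id_azure $script azureuser@${address}:${remote}"
            echo "ssh -i ./id_azure azureuser@${address} /bin/bash ${remote} ${script_args}"
        else
            scp -i ./id_azure "$script" "azureuser@${address}:${remote}" || return 1
            ssh -i ./id_azure "azureuser@${address}" "/bin/bash ${remote} ${script_args}" || return 1
        fi
    done
}

# Dry run against placeholder addresses (192.0.2.0/24 is reserved for documentation).
DRY_RUN=1 run_on_all update_brokers.sh "flux flux000003" 192.0.2.10 192.0.2.11
```

In real use the address list would come from the `az vmss list-instance-public-ips ... | jq -r '.[].ipAddress'` pipeline shown in the README, and the second argument would be empty for `install_osu.sh`.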

tutorial/azure/install_osu.sh

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+#!/bin/bash
+
+cd /tmp
+OSU_VERSION=5.8
+wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-$OSU_VERSION.tgz
+tar zxvf ./osu-micro-benchmarks-$OSU_VERSION.tgz
+cd osu-micro-benchmarks-$OSU_VERSION/
+./configure CC=mpicc CXX=mpicxx
+make -j 4 && sudo make install
+
+# installs to /usr/local/libexec/osu-micro-benchmarks/mpi

tutorial/azure/main.tf

Lines changed: 4 additions & 1 deletion
@@ -71,7 +71,6 @@ locals {
   tags = {
     flux_core = "0-68-0"
   }
-  application_port = 8081
 }
 
 resource "random_pet" "id" {}
@@ -167,6 +166,10 @@ resource "azurerm_linux_virtual_machine_scale_set" "vmss" {
     caching = "ReadWrite"
   }
 
+  identity {
+    type = "SystemAssigned"
+  }
+
   admin_ssh_key {
     username = local.admin_user
     public_key = azapi_resource_action.ssh_public_key_gen.output.publicKey

tutorial/azure/start-script.sh

Lines changed: 5 additions & 0 deletions
@@ -1,8 +1,13 @@
 #!/bin/bash
 
+# In case the user wants to play with this.
+sudo pip install azure-cli
+
 # Assume a huge number. This will error with Azure because they
 # eventually dive into alpha numeric, but this works for a small demo
 NODELIST=${template_name}000[000-999]
+
+# The lead broker can be anything, azure is not predictable
 lead_broker=${template_name}000000
 
 flux R encode --hosts=$NODELIST --local > R

tutorial/azure/update_brokers.sh

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
+#!/bin/bash
+
+template_name=${1}
+lead_broker=${2}
+template_username=${3:-azureuser}
+template_ethernet_device=${4:-eth0}
+start_number=${lead_broker#"${template_name}"}
+start_number=${start_number#000}
+
+# Assume a huge number. This will error with Azure because they
+# eventually dive into alpha numeric, but this works for a small demo
+NODELIST=${template_name}000[$start_number-999]
+
+# Write updated resource file
+flux R encode --hosts=$NODELIST --local > R
+sudo mv R /etc/flux/system/R
+sudo chown ${template_username} /etc/flux/system/R
+
+# Write updated broker.toml
+cat <<EOF | tee /tmp/broker.toml
+# Flux needs to know the path to the IMP executable
+[exec]
+imp = "/usr/libexec/flux/flux-imp"
+
+# Allow users other than the instance owner (guests) to connect to Flux
+# Optionally, root may be given "owner privileges" for convenience
+[access]
+allow-guest-user = true
+allow-root-owner = true
+
+# Point to resource definition generated with flux-R(1).
+# Uncomment to exclude nodes (e.g. mgmt, login), from eligibility to run jobs.
+[resource]
+path = "/etc/flux/system/R"
+
+# Point to shared network certificate generated flux-keygen(1).
+# Define the network endpoints for Flux's tree based overlay network
+# and inform Flux of the hostnames that will start flux-broker(1).
+[bootstrap]
+curve_cert = "/etc/flux/system/curve.cert"
+
+default_port = 8050
+default_bind = "tcp://${template_ethernet_device}:%p"
+default_connect = "tcp://%h:%p"
+
+# Rank 0 is the TBON parent of all brokers unless explicitly set with
+# parent directives.
+# The actual ip addresses (for both) need to be added to /etc/hosts
+# of each VM for now.
+hosts = [
+    { host = "$NODELIST" },
+]
+# Speed up detection of crashed network peers (system default is around 20m)
+[tbon]
+tcp_user_timeout = "2m"
+EOF
+
+sudo mv /tmp/broker.toml /etc/flux/system/conf.d/broker.toml
+
+# See the README.md for commands how to set this manually without systemd
+sudo systemctl daemon-reload
+sudo systemctl restart flux.service
+sleep 2
+sudo systemctl status flux.service
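
`update_brokers.sh` builds its node list on the assumption that the lead broker's hostname is the template name followed by `000` and three digits. A minimal sketch of checking that assumption before building the list; `start_number_from_host` and `nodelist_for` are hypothetical helpers for illustration, not part of this commit:

```bash
#!/bin/bash

# Hypothetical helper (not in the commit): extract the three-digit start
# number from a hostname of the assumed form <template_name>000NNN.
start_number_from_host() {
    local template_name=$1 host=$2
    local suffix=${host#"$template_name"}   # strip the leading template name
    if [ "$suffix" = "$host" ]; then
        echo "host '$host' does not start with '$template_name'" >&2
        return 1
    fi
    case $suffix in
        000[0-9][0-9][0-9]) echo "${suffix#000}" ;;
        *) echo "unexpected hostname suffix '$suffix'" >&2; return 1 ;;
    esac
}

# Build the hostlist string in the same shape the script passes to flux R encode.
nodelist_for() {
    local template_name=$1 host=$2 start
    start=$(start_number_from_host "$template_name" "$host") || return 1
    echo "${template_name}000[${start}-999]"
}

nodelist_for flux flux000003   # prints flux000[003-999]
```

Rejecting anything outside the `000NNN` pattern makes the alphanumeric hostnames mentioned in the script's comments fail loudly instead of producing a malformed node list.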
