Commit 83d2f1f

Test azure (#2)
* test: packer build

  I am not able to interact with Azure from any machine, so I need to use their cloud shell. :/
* azure: add build instructions
* tf: testing terraform setup for vmset
* test upgrading to linux vm scale set
* final tweaks to azure (still needs final test)

Signed-off-by: vsoch <[email protected]>
1 parent 0bac4b3 commit 83d2f1f

10 files changed: +888 -0 lines changed
README.md

Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@

 - [flux-in-slurm](tutorial/flux-in-slurm): Bring up a Flux instance (in user-space) in a Slurm Allocation - both in Kubernetes ([video](https://youtu.be/8ZkSLV0m7To?si=WqWKCe2jvRuTXvlJ))
 - [Flux on AWS](tutorial/aws): Deploy an entire Flux Framework cluster to "bare metal" instances on AWS with (essentially) two `make` commands - one to build with packer, and one to deploy with Terraform ([video](https://youtu.be/LJh-ab6fAqE?si=dIzScA530N7lXs_7))
+- [Flux on Azure](tutorial/azure): Deploy Flux Framework on Azure with Infiniband
 - [HPCIC Tutorial 2024](https://youtu.be/Dt4CSZWSEJE?si=b2O7lQrJixcKh-EJ)

 ## What is this?

tutorial/.gitignore

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
.terraform.lock.hcl
.env
.terraform

tutorial/azure/Makefile

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
.PHONY: all
all: init fmt validate apply

.PHONY: init
init:
	terraform init

.PHONY: fmt
fmt:
	terraform fmt

.PHONY: validate
validate:
	terraform validate

.PHONY: apply
apply:
	terraform apply

.PHONY: apply-approved
apply-approved:
	terraform apply --auto-approve

.PHONY: destroy
destroy:
	terraform destroy

tutorial/azure/README.md

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
# Flux on Azure

## Usage

### 1. Build Images

Note that you should [build](build) the images first. Follow the instructions in the README there.

### 2. Deploy Terraform

Check the [start-script.sh](start-script.sh) and the variables at the top of [main.tf](main.tf). You'll need to export the full image identifier to the environment:

```bash
export TF_VAR_vm_image_storage_reference=/subscriptions/xxxxxxx/resourceGroups/xxxxx/providers/Microsoft.Compute/images/flux-framework
```
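
The identifier follows a fixed pattern, so it can be assembled from its parts instead of being typed by hand. This is only a minimal sketch: `image_id` is a hypothetical helper that is not part of this repository, and the subscription and resource group values are the same placeholders used above.

```bash
# Hypothetical helper (not in the repository): assemble the image identifier
# from a subscription ID, a resource group, and an image name.
image_id() {
    local subscription="$1" group="$2" name="$3"
    printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Compute/images/%s\n' \
        "$subscription" "$group" "$name"
}

# Placeholder values; substitute your own subscription ID and resource group.
export TF_VAR_vm_image_storage_reference="$(image_id xxxxxxx xxxxx flux-framework)"
```

Terraform reads an environment variable named `TF_VAR_<name>` as the value of the input variable `<name>`, which is why the export alone is enough.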

Note that I needed to clone this repository and run the commands from the Cloud Shell in the Azure portal.

```bash
git clone https://github.com/converged-computing/flux-tutorials
cd flux-tutorials/tutorial/azure
```

and then:

```bash
make
```

The shell can be buggy. If it seems to hang, terraform is most likely waiting for you to enter "yes." You can type it (despite not seeing it) and press enter and it works every time... 50% of the time. :) I added a command to the Makefile to get around this:

```bash
make apply-approved
```

You can also run each command separately:

```bash
# Terraform init
make init

# Terraform validate
make validate

# Create
make apply

# Destroy
make destroy
```

When it's done, save the public and private key to local files:

```bash
terraform output -json public_key | jq -r > id_azure.pub
terraform output -json private_key | jq -r > id_azure
chmod 600 id_azure*
```
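
The `chmod` matters because ssh refuses to use a private key that other users can read. As a rough sketch, a small check like the one below could confirm the mode before connecting. `key_is_private` is a hypothetical helper, it assumes GNU `stat` (as found on Linux), and it is stricter than ssh itself, which would also accept a mode such as 400.

```bash
# Hypothetical helper: succeed only if the file exists and its mode is exactly 600.
# Uses GNU stat (-c); on macOS/BSD the equivalent would be `stat -f %Lp`.
key_is_private() {
    local key="$1"
    [ -f "$key" ] && [ "$(stat -c '%a' "$key")" = "600" ]
}
```

For example, `key_is_private id_azure || chmod 600 id_azure` only touches the file when the mode is not already 600.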

Then get the instance IP addresses from the command line (or portal), and ssh in!

```bash
ip_address=$(az vmss list-instance-public-ips -g terraform-testing -n flux | jq -r '.[0].ipAddress')
ssh -i ./id_azure azureuser@${ip_address}
```

To get a different instance, just use the index (e.g., index 1 is the second instance):

```bash
follower_address=$(az vmss list-instance-public-ips -g terraform-testing -n flux | jq -r '.[1].ipAddress')
ssh -i ./id_azure azureuser@${follower_address}
```
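
With more than a couple of instances, looking up one index at a time gets tedious. The sketch below prints the ssh command for every instance at once. `ssh_commands` is a hypothetical helper, and it assumes it is fed one address per line.

```bash
# Hypothetical helper: read one IP address per line on stdin and print the
# matching ssh command for each, labelled with the instance index.
ssh_commands() {
    local key="$1" user="$2" index=0 address
    while IFS= read -r address; do
        # Skip empty lines and jq's "null" (an instance without a public IP),
        # but still count them so the index matches the position in the list.
        if [ -n "$address" ] && [ "$address" != "null" ]; then
            printf '# instance %d\nssh -i %s %s@%s\n' "$index" "$key" "$user" "$address"
        fi
        index=$((index + 1))
    done
}
```

It could be fed from the same listing used above, for example `az vmss list-instance-public-ips -g terraform-testing -n flux | jq -r '.[].ipAddress' | ssh_commands ./id_azure azureuser`.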

### 3. Checks

Check the cluster status, the overlay status, and try running a job:

```bash
$ flux resource list
```
```bash
$ flux run -N 2 hostname
```

### 4. Cleanup

This should work (but see [debugging](#debugging)).

```bash
make destroy
```

But if not, you can delete the resource group either from the console or from the command line:

```bash
az group delete --name terraform-testing
```
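
The two cleanup paths can be chained so that the second only runs when the first fails. This is a sketch under the assumption that both commands are passed as strings; `with_fallback` is a hypothetical helper, not something in the repository.

```bash
# Hypothetical helper: run the first command string; if it fails, report the
# failure on stderr and run the second command string instead.
with_fallback() {
    local primary="$1" fallback="$2"
    if eval "$primary"; then
        return 0
    fi
    echo "'$primary' failed, falling back to '$fallback'" >&2
    eval "$fallback"
}
```

For example: `with_fallback "make destroy" "az group delete --name terraform-testing"`. The `az group delete` command still asks for confirmation before removing anything.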

Note that this current build does not have flux-pmix, which might lead to issues with MPI. The VM base is compiled with a libpmix.so that has a different ABI than what flux is expecting. I will be looking into it.

### Debugging

Depending on your environment, terraform (e.g., `make` or `make destroy`) doesn't always work. I get this error from the Azure Cloud Shell:

```console
terraform destroy
random_pet.id: Refreshing state... [id=usable-grouper]
random_string.fqdn: Refreshing state... [id=lhppiw]

│ Error: building account: could not acquire access token to parse claims: running Azure CLI: exit status 1: ERROR: Failed to connect to MSI. Please make sure MSI is configured correctly.
│ Get Token request returned: <Response [400]>

│ with provider["registry.terraform.io/hashicorp/azurerm"],
│ on main.tf line 28, in provider "azurerm":
│ 28: provider "azurerm" {


make: *** [Makefile:22: destroy] Error 1
```

If I open a new cloud shell, it seems to magically go away. But you can also interact with the `az` tool (that does seem to work) or issue commands by clicking directly in the portal.

tutorial/azure/build/Makefile

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
.PHONY: all
all: init fmt validate build

.PHONY: init
init:
	packer init .

.PHONY: fmt
fmt:
	packer fmt .

.PHONY: validate
validate:
	packer validate .

.PHONY: build
build:
	packer build flux-build.pkr.hcl

tutorial/azure/build/README.md

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
# Build Packer Images

Note that I needed to do this build from a cloud shell, so clone the repository and change into the build directory:

```bash
git clone https://github.com/converged-computing/flux-tutorials
cd flux-tutorials/tutorial/azure/build
```

And install packer:

```bash
wget https://releases.hashicorp.com/packer/1.11.2/packer_1.11.2_linux_amd64.zip
unzip packer_1.11.2_linux_amd64.zip
mkdir -p ./bin
mv ./packer ./bin/
export PATH=$(pwd)/bin:$PATH
```
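
The version appears twice in the download URL, which makes it easy to update one and forget the other. A possible sketch, assuming other releases follow the same naming pattern as the 1.11.2 URL above (check the releases page before relying on it); `packer_url` is a hypothetical helper.

```bash
# Hypothetical helper: build the release URL for a given packer version,
# following the pattern of the 1.11.2 URL. OS and architecture default to
# the linux/amd64 build used in this tutorial.
packer_url() {
    local version="$1" os="${2:-linux}" arch="${3:-amd64}"
    printf 'https://releases.hashicorp.com/packer/%s/packer_%s_%s_%s.zip\n' \
        "$version" "$version" "$os" "$arch"
}
```

For example, `wget "$(packer_url 1.11.2)"` fetches the same archive as the command above.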

Get your account information for Azure as follows:

```bash
az account show
```

And export variables in the following format. Note that the resource group needs to actually exist - I created mine in the console UI.

```bash
export AZURE_SUBSCRIPTION_ID=xxxxxxxxx
export AZURE_TENANT_ID=xxxxxxxxxxx
export AZURE_RESOURCE_GROUP_NAME=packer-testing
```
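
A forgotten export tends to surface only partway through the build, so it can help to check the three variables up front. This is a minimal sketch; `require_env` is a hypothetical helper and only confirms that each variable is exported and non-empty, not that its value is valid.

```bash
# Hypothetical helper: report every named variable that is missing from the
# environment (or empty), and return non-zero if any is missing.
require_env() {
    local name missing=0
    for name in "$@"; do
        # printenv only sees exported variables, which is what packer will see too.
        if [ -z "$(printenv "$name")" ]; then
            echo "missing environment variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example: `require_env AZURE_SUBSCRIPTION_ID AZURE_TENANT_ID AZURE_RESOURCE_GROUP_NAME && make`.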

Then build!

```bash
make
```
