Commit 52932a1

azure: first attempt to build base container (#5)

This is entirely automated. We use the assets from their azurevm hpc repository, and add flux. Importantly, the environment for hpcx needs to be sourced, and we will likely need to do that in different environments (kubernetes/usernetes/singularity).

Signed-off-by: vsoch <[email protected]>

1 parent ab74fad, commit 52932a1
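The commit message notes that the hpcx environment must be sourced before use. As a minimal sketch of what that looks like in a shell: the init script path below is the one this commit appends to `.bashrc` in the base Dockerfile, and the existence check is an added precaution (the path differs per HPC-X version and may not exist where this runs):

```shell
# Sketch: source the HPC-X environment before running MPI workloads.
# The init script path matches the one written to .bashrc in this commit;
# it varies by HPC-X version, so we guard on its existence.
HPCX_INIT=/opt/hpcx-v2.15-gcc-MLNX_OFED_LINUX-5-ubuntu22.04-cuda12-gdrcopy2-nccl2.17-x86_64/hpcx-mt-init.sh
if [ -f "$HPCX_INIT" ]; then
    . "$HPCX_INIT"
    hpcx_load   # shell function defined by the init script
else
    echo "hpcx init script not found: $HPCX_INIT"
fi
```

In the tutorial this sourcing happens automatically via `.bashrc` for root and azureuser, but non-login contexts (e.g. a Kubernetes pod command) would need to source it explicitly as above.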

File tree: 92 files changed (+4918, -21 lines)


tutorial/azure/README.md

Lines changed: 18 additions & 13 deletions

````diff
@@ -107,29 +107,26 @@ done
 pssh -h hosts.txt -x "-i ./id_azure" "/bin/bash /tmp/update_brokers.sh flux $lead_broker"
 ```
 
-Note that I've also provided a script to install the OSU benchmarks with the same strategy above:
+Note that I've also provided scripts to install the OSU benchmarks and lammps with the same strategy above:
 
 ```bash
-for address in $(az vmss list-instance-public-ips -g terraform-testing -n flux | jq -r .[].ipAddress)
-do
-  echo "Updating $address"
-  scp -i ./id_azure install_osu.sh azureuser@${address}:/tmp/install_osu.sh
-done
-pssh -h hosts.txt -x "-i ./id_azure" "/bin/bash /tmp/install_osu.sh flux $lead_broker"
+# Choose the script you want to install
+script=install_osu.sh
+script=install_lammps.sh
 ```
 
-This installs to `/usr/local/libexec/osu-benchmarks/mpi`. And lammps:
+And then install!
 
-```bash
+```console
 for address in $(az vmss list-instance-public-ips -g terraform-testing -n flux | jq -r .[].ipAddress)
 do
   echo "Updating $address"
-  scp -i ./id_azure install_lammps.sh azureuser@${address}:/tmp/install_lammps.sh
+  scp -i ./id_azure ./install/${script} azureuser@${address}:/tmp/${script}
 done
-pssh -h hosts.txt -x "-i ./id_azure" "/bin/bash /tmp/install_lammps.sh flux $lead_broker"
+pssh -h hosts.txt -x "-i ./id_azure" "/bin/bash /tmp/${script}"
 ```
-That installs to `/usr/bin/lmp`
 
+This installs to `/usr/local/libexec/osu-micro-benchmarks/mpi`. And lammps installs to `/usr/bin/lmp`.
 
 ### 3. Checks
 
@@ -142,7 +139,7 @@ flux resource list
 flux run -N 2 hostname
 ```
 
-### 4. Benchmarks
+### 4. Applications and Benchmarks
 
 Try running a benchmark!
 
@@ -278,6 +275,10 @@ Total wall time: 0:00:37
 
 </details>
 
+#### Usernetes
+
+See [flux-usernetes](https://github.com/converged-computing/flux-usernetes/tree/main/azure) for build and deploy instructions for user space kubernetes.
+
 ### 4. Cleanup
 
 This should work (but see [debugging](#debugging)).
@@ -840,6 +841,10 @@ This is free software; see the source for copying conditions. There is NO
 warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 ```
 
+### Docker
+
+For advanced users, we have a [docker](docker) directory with builds that emulate the base set of VMs that are intended to be used with them. It would be good if Microsoft wanted to provide more production bases for us :)
+
 ### Debugging
 
 Depending on your environment, terraform (e.g., `make` or `make destroy`) doesn't always work. I get this error from the Azure Cloud Shell:
````
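The updated README selects one `script` at a time by reassigning the variable. As a hedged illustration of the same staging pattern, a loop over both install scripts (names taken from the diff above; the `echo` stands in for the real `scp`/`pssh` calls, which require the running VM scale set):

```shell
# Illustrative only: iterate over both install scripts instead of editing
# the `script` variable by hand. The echo stands in for scp/pssh, which
# need the Azure VMs and SSH key from the tutorial.
for script in install_osu.sh install_lammps.sh; do
    echo "stage ./install/${script} -> /tmp/${script}"
done
```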

tutorial/azure/docker/README.md

Lines changed: 29 additions & 0 deletions
New file:

# Azure Docker Builds

We hope that Microsoft can eventually provide container bases, but until then we need to make a best effort to do that ourselves. This attempts to mirror the logic and match versions of their Azure HPC image builds.

## Base

The base image has core dependencies like hpcx and flux.

```bash
cd ./base
docker build -t ghcr.io/converged-computing/flux-tutorials:azurehpc-2204 .
docker push ghcr.io/converged-computing/flux-tutorials:azurehpc-2204
```

## OSU

```bash
cd ./osu
docker build -t ghcr.io/converged-computing/flux-tutorials:azurehpc-2204-osu .
docker push ghcr.io/converged-computing/flux-tutorials:azurehpc-2204-osu
```

## LAMMPS

```bash
cd ./lammps-reax
docker build -t ghcr.io/converged-computing/flux-tutorials:azurehpc-2204-lammps-reax .
docker push ghcr.io/converged-computing/flux-tutorials:azurehpc-2204-lammps-reax
```
Lines changed: 205 additions & 0 deletions

New file (the base image Dockerfile):

```dockerfile
FROM ubuntu:22.04

# docker build -t ghcr.io/converged-computing/flux-tutorials:azurehpc-2204 .
# docker push ghcr.io/converged-computing/flux-tutorials:azurehpc-2204

WORKDIR /opt
RUN apt-get update && apt-get install -y munge git curl wget unzip gpg debian-archive-keyring \
    pkg-config vim ubuntu-keyring systemctl && apt-get clean
RUN export VERSION="1.2.2" && \
    curl -LO "https://github.com/oras-project/oras/releases/download/v${VERSION}/oras_${VERSION}_linux_amd64.tar.gz" && \
    mkdir -p oras-install/ && \
    tar -zxf oras_${VERSION}_*.tar.gz -C oras-install/ && \
    mv oras-install/oras /usr/local/bin/ && \
    rm -rf oras_${VERSION}_*.tar.gz oras-install/

# Azure hpc-images deps added here - not clear if all of these are needed
RUN apt-get update && apt-get install -y numactl rpm libnuma-dev libmpc-dev libmpfr-dev libxml2-dev m4 byacc \
    libnl-3-dev libnl-route-3-dev libnl-3-200 libnl-genl-3-dev libnl-genl-3-200 libnl-route-3-200 bison \
    libsecret-1-0 dkms libyaml-dev libreadline-dev libkeyutils1 libkeyutils-dev libmount-dev nfs-common pssh \
    libvulkan1 hwloc selinux-policy-dev nvme-cli && apt-get clean  # vulkan is for nvidia gpu driver
ENV DEBIAN_FRONTEND=noninteractive

# OSU Benchmarks in hpcx-v2.19-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/ompi/tests/
RUN oras pull ghcr.io/converged-computing/rdma-infiniband:ubuntu-22.04-tgz --output /opt && \
    cd /opt && \
    tar -xzvf MLNX_OFED_LINUX-24.04-0.7.0.0-ubuntu22.04-x86_64.tgz && \
    cd MLNX_OFED_LINUX-24.04-0.7.0.0-ubuntu22.04-x86_64/DEBS/ && \
    dpkg -i mpitests_3.2.23-45a045b.2404066_amd64.deb && \
    dpkg -i libibverbs1* && \
    dpkg -i ibverbs-providers* && \
    dpkg -i libibverbs* && \
    dpkg -i librdmacm* && \
    dpkg -i ucx_1.17.0-1.2404066_amd64.deb && \
    dpkg -i libibumad3* && \
    dpkg -i sharp_3.7.0.MLNX20240421.48444036-1.2404066_amd64.deb && \
    dpkg -i hcoll_4.8.3227-1.2404066_amd64.deb

# This was extracted into separate lines, below, to avoid one large layer (and debug each)
# RUN ./install.sh
ENV GPU=NVIDIA

# Install only what we need as we go (so a change to a single file doesn't require a complete rebuild)
WORKDIR /opt/azhpc-images/ubuntu/ubuntu-22.x/ubuntu-22.04-hpc
COPY ./azhpc-images/ubuntu/ubuntu-22.x/ubuntu-22.04-hpc/install_prerequisites.sh ./install_prerequisites.sh

# install pre-requisites
RUN ./install_prerequisites.sh
COPY ./azhpc-images/versions.json /opt/azhpc-images/versions.json

COPY ./azhpc-images/ubuntu/ubuntu-22.x/ubuntu-22.04-hpc/set_properties.sh ./
COPY ./azhpc-images/ubuntu/ubuntu-22.x/ubuntu-22.04-hpc/install_utils.sh ./
COPY ./azhpc-images/ubuntu/common/remove_unused_packages.sh /opt/azhpc-images/ubuntu/common/remove_unused_packages.sh
COPY ./azhpc-images/ubuntu/common/install_utils.sh /opt/azhpc-images/ubuntu/common/install_utils.sh
COPY ./azhpc-images/ubuntu/common/install_pmix.sh /opt/azhpc-images/ubuntu/common/install_pmix.sh
COPY ./azhpc-images/ubuntu/common/install_mpis.sh /opt/azhpc-images/ubuntu/common/install_mpis.sh
COPY ./azhpc-images/common/ /opt/azhpc-images/common/
COPY ./azhpc-images/tools/ /opt/azhpc-images/tools/

# remove packages requiring Ubuntu Pro for security updates
RUN . ./set_properties.sh && \
    /bin/bash $UBUNTU_COMMON_DIR/remove_unused_packages.sh && \
    ./install_utils.sh

COPY ./azhpc-images/ubuntu/common/install_docker.sh /opt/azhpc-images/ubuntu/common/install_docker.sh
COPY ./azhpc-images/ubuntu/common/* /opt/azhpc-images/ubuntu/common/
RUN . ./set_properties.sh && \
    /bin/bash $UBUNTU_COMMON_DIR/install_docker.sh

# install diagnostic script, optimizations
RUN . ./set_properties.sh && \
    /bin/bash $COMMON_DIR/install_hpcdiag.sh && \
    /bin/bash $COMMON_DIR/install_azure_persistent_rdma_naming.sh

RUN . ./set_properties.sh && \
    /bin/bash $UBUNTU_COMMON_DIR/hpc-tuning.sh

COPY ./azhpc-images/tests/ /opt/azhpc-images/tests
COPY ./azhpc-images/customizations/ /opt/azhpc-images/customizations
COPY ./azhpc-images/topology/ /opt/azhpc-images/topology

RUN . ./set_properties.sh && \
    /bin/bash $COMMON_DIR/copy_test_file.sh && \
    /bin/bash $COMMON_DIR/install_monitoring_tools.sh && \
    /bin/bash $COMMON_DIR/install_amd_libs.sh

RUN . ./set_properties.sh && \
    /bin/bash $COMMON_DIR/setup_sku_customizations.sh

RUN . ./set_properties.sh && \
    /bin/bash $UBUNTU_COMMON_DIR/install_pmix.sh

# For some reason this command, when moved higher up, was flaky.
# Watch it and make sure it doesn't skip (if it does, the build will fail later)
RUN . ./set_properties.sh && \
    /bin/bash $UBUNTU_COMMON_DIR/install_mpis.sh

# This would match the VM exactly (you'd need to change the source script, etc.)
# RUN mv /opt/hpcx-v2.19-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64 /opt/hpcx-v2.15-gcc-MLNX_OFED_LINUX-5-ubuntu22.04-cuda12-gdrcopy2-nccl2.17-x86_64/

# cleanup downloaded tarballs - clear some space
RUN rm -rf *.tgz *.bz2 *.tbz *.tar.gz *.run *.deb *_offline.sh && \
    rm -rf /tmp/MLNX_OFED_LINUX* /tmp/*conf* && \
    rm -rf /var/intel/ /var/cache/* && \
    rm -Rf -- */

# INFO: Building OMPI with HCOLL
# Ready to rebuild
# HPCX_ROOT: /opt/hpcx-v2.19-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64
# OMPI PREFIX: /opt/hpcx-v2.19-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/hpcx-rebuild
# UCX location: /opt/hpcx-v2.19-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/ucx
# UCC location: /opt/hpcx-v2.19-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/ucc
# HCOLL location: /opt/hpcx-v2.19-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/hcoll

ENV LANG=C.UTF-8
RUN apt-get update && apt-get install -y locales && locale-gen en_US.UTF-8 && apt-get clean

# Add an azureuser
ARG USER=azureuser
ARG UID=1000
ARG GID=1000
RUN set -x && groupadd -g $GID $USER && \
    useradd -g $USER -u $UID -d /home/$USER -m $USER && \
    printf "$USER ALL= NOPASSWD: ALL\\n" >> /etc/sudoers

# flux security
WORKDIR /opt/flux

RUN apt-get update && \
    apt-get install -y man flex ssh sudo vim luarocks munge lcov ccache lua5.4 \
    valgrind build-essential pkg-config autotools-dev libtool \
    libffi-dev autoconf automake make clang clang-tidy \
    gcc g++ libpam-dev apt-utils lua-posix \
    libsodium-dev libzmq3-dev libczmq-dev libjansson-dev libmunge-dev \
    libncursesw5-dev liblua5.4-dev liblz4-dev libsqlite3-dev uuid-dev \
    libhwloc-dev libs3-dev libevent-dev libarchive-dev \
    libboost-graph-dev libboost-system-dev libboost-filesystem-dev \
    libboost-regex-dev libyaml-cpp-dev libedit-dev uidmap dbus-user-session python3-cffi && apt-get clean

COPY ./azhpc-images/source-hpcx.sh /source-hpcx.sh
RUN . /source-hpcx.sh && hpcx_load && \
    wget https://github.com/flux-framework/flux-security/releases/download/v0.13.0/flux-security-0.13.0.tar.gz && \
    tar -xzvf flux-security-0.13.0.tar.gz && \
    mv flux-security-0.13.0 /opt/flux/flux-security && \
    cd /opt/flux/flux-security && \
    ./configure --prefix=/usr --sysconfdir=/etc && \
    make -j && make install

# The VMs will share the same munge key
RUN mkdir -p /var/run/munge && \
    dd if=/dev/urandom bs=1 count=1024 > munge.key && \
    mv munge.key /etc/munge/munge.key && \
    chown -R munge /etc/munge/munge.key /var/run/munge && \
    chmod 600 /etc/munge/munge.key

# Make the flux run directory
RUN mkdir -p /home/azureuser/run/flux && chown azureuser /home/azureuser
RUN python3 -m pip install jsonschema --upgrade

# Flux core
RUN . /source-hpcx.sh && hpcx_load && \
    wget https://github.com/flux-framework/flux-core/releases/download/v0.68.0/flux-core-0.68.0.tar.gz && \
    tar -xzvf flux-core-0.68.0.tar.gz && \
    mv flux-core-0.68.0 /opt/flux/flux-core && \
    cd /opt/flux/flux-core && \
    ./configure --prefix=/usr --sysconfdir=/etc --with-flux-security && \
    make clean && \
    make -j && make install

# Flux sched (later than this requires newer gcc and clang)
RUN . /source-hpcx.sh && hpcx_load && \
    wget https://github.com/flux-framework/flux-sched/releases/download/v0.37.0/flux-sched-0.37.0.tar.gz && \
    tar -xzvf flux-sched-0.37.0.tar.gz && \
    mv flux-sched-0.37.0 /opt/flux/flux-sched && \
    cd /opt/flux/flux-sched && \
    mkdir build && \
    cd build && \
    cmake ../ && make -j && make install && ldconfig && \
    echo "DONE flux build"

# Flux curve.cert
# Ensure we have a shared curve certificate
RUN flux keygen /tmp/curve.cert && \
    mkdir -p /etc/flux/system && \
    cp /tmp/curve.cert /etc/flux/system/curve.cert && \
    chown azureuser /etc/flux/system/curve.cert && \
    chmod o-r /etc/flux/system/curve.cert && \
    chmod g-r /etc/flux/system/curve.cert && \
    # Permissions for imp
    chmod u+s /usr/libexec/flux/flux-imp && \
    chmod 4755 /usr/libexec/flux/flux-imp && \
    # /var/lib/flux needs to be owned by the instance owner
    mkdir -p /var/lib/flux && \
    chown azureuser -R /var/lib/flux && \
    # clean up (and make space)
    cd /opt && \
    rm -rf /opt/flux

# Ensure we source the environment.
RUN echo ". /opt/hpcx-v2.15-gcc-MLNX_OFED_LINUX-5-ubuntu22.04-cuda12-gdrcopy2-nccl2.17-x86_64/hpcx-mt-init.sh" >> /root/.bashrc && \
    echo ". /opt/hpcx-v2.15-gcc-MLNX_OFED_LINUX-5-ubuntu22.04-cuda12-gdrcopy2-nccl2.17-x86_64/hpcx-mt-init.sh" >> /home/azureuser/.bashrc && \
    echo "hpcx_load" >> /root/.bashrc && \
    echo "hpcx_load" >> /home/azureuser/.bashrc && \
    echo "FLUX_URI DEFAULT=local:///opt/run/flux/local" >> ./environment && \
    mv ./environment /etc/security/pam_env.conf
WORKDIR /opt
```
Lines changed: 18 additions & 0 deletions

New file (Dockerfile.ucx):

```dockerfile
FROM ghcr.io/converged-computing/flux-tutorials:azurehpc-2204

# Note that this isn't currently built or used, as ucx is provided in the base.
# However, we are anticipating a bug we had earlier with needing to build ucx with
# a different flag, and we are preserving this example for that.

# Let's compile UCX without GPU checking.
# This is the one we wound up using for our experiments.

# docker build -f Dockerfile.ucx -t ghcr.io/converged-computing/azurehpc:flux-slim-nogpu .
# docker push ghcr.io/converged-computing/azurehpc:flux-slim-nogpu

# Get build flags with ucx_info -b
RUN wget https://github.com/openucx/ucx/releases/download/v1.15.0/ucx-1.15.0.tar.gz && \
    tar -xzvf ucx-1.15.0.tar.gz && \
    cd ucx-1.15.0 && \
    ./configure --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --prefix=/usr --enable-examples --without-java --without-go --without-xpmem --without-cuda && \
    make -j4 && make install && ldconfig
```
Lines changed: 1 addition & 0 deletions

New file:

```
*.* eol=lf
```
Lines changed: 21 additions & 0 deletions

New file (MIT License):

```
MIT License

Copyright (c) Microsoft Corporation.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```
Lines changed: 25 additions & 0 deletions

New file (an Azure Pipelines configuration; indentation reconstructed):

```yaml
trigger: none

pr:
- master

pool:
  name: 1ES-hosted-pool-scrub1

jobs:
- job: queue_azdo
  timeoutInMinutes: 360
  steps:
  - bash: |
      echo $(System.PullRequest.PullRequestNumber)
    displayName: Print PR Num

  - task: Bash@3
    inputs:
      targetType: 'filePath'
      filePath: './azure-pipelines/queue_ado.sh'
      failOnStderr: true
    env:
      SYSTEM_ACCESSTOKEN: $(System.AccessToken)
      PR_NUM: $(System.PullRequest.PullRequestNumber)
    displayName: Queue Validation Build and Monitor Status
```

0 commit comments
