
Commit 8986d3d

update isolated docs

1 parent 966033e

2 files changed: +85 −214 lines

ansible/extras.yml

Lines changed: 0 additions & 1 deletion
@@ -32,7 +32,6 @@
   hosts: eessi
   tags: eessi
   become: true
-  environment: "{{ appliances_remote_environment_vars }}"
   gather_facts: false
   tasks:
   - name: Install / configure EESSI
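
For context, Ansible's play-level `environment:` keyword injects environment variables into every task of a play on the managed host; the removed line had applied the appliance's remote proxy variables to the EESSI play. A minimal sketch of that mechanism (the play and values below are illustrative, not from this repo):

```yaml
# Illustrative play: a play-level environment makes these variables visible to
# every task, e.g. so package downloads honour an HTTP proxy without the proxy
# settings being written to the host.
- hosts: eessi
  environment:
    http_proxy: "http://squid.example.org:3128"
    https_proxy: "http://squid.example.org:3128"
  tasks:
    - name: Show that tasks see the injected variables
      ansible.builtin.command: printenv http_proxy
      register: result
      changed_when: false
```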

docs/experimental/isolated-clusters.md

Lines changed: 85 additions & 213 deletions
@@ -1,216 +1,89 @@
 # Isolated Clusters
 
-By default, the appliance assumes and requires that there is outbound internet
-access, possibly via a [proxy](../../ansible/roles/proxy/) . However it is
-possible to create clusters in more restrictive environments, with some
-limitations on functionality.
-
-## No outbound internet
-
-A cluster can be deployed using the upstream image (or one derived from it) without any outbound internet at all.
-
-At present, this supports all roles/groups enabled:
-- Directly in the `common` environment
-- In the `environments/$ENV/inventory/groups` file created by cookiecutter for
-  a new environment (from the "everything template").
-
-plus some additional roles/groups not enabled by default listed below.
-
-Note that the `hpl` test from the `ansible/adhoc/hpctests.yml` playbook is not
-functional and must be skipped using:
-
-```shell
-ansible-playbook ansible/adhoc/hpctests.yml --skip-tags hpl-solo
-```
-
-The full list of supported roles/groups is below, with those marked "*"
-enabled by default in the common environment or "everything template":
-- alertmanager *
-- ansible_init *
-- basic_users *
-- cacerts
-- chrony
-- eessi *
-- etc_hosts *
-- filebeat *
-- grafana *
-- mysql *
-- nfs *
-- node_exporter *
-- openhpc *
-- opensearch *
-- podman *
-- prometheus *
-- proxy
-- rebuild
-- selinux **
-- slurm_exporter *
-- slurm_stats *
-- systemd **
-- tuned
-- fail2ban *
-- firewalld *
-- hpctests *
-- openondemand *
-- persist_hostkeys *
-- compute_init
-- nhc *
-- openondemand_desktop *
-
-Note that for this to work, all dnf repositories are disabled at the end of
-image builds, so that `ansible.builtin.dnf` tasks work when running against
-packages already installed in the image.
-
-## Outbound internet via proxy not available to cluster users
-If additional functionality is required it is possible configure Ansible to use
-an authenticated http/https proxy (e.g. [squid](https://www.squid-cache.org/)).
-The proxy credentials are not written to the cluster nodes so the proxy cannot
-be used by cluster users.
-
-To do this the proxy variables required in the remote environment must be
-defined for the Ansible variable `appliances_remote_environment_vars`. Note
-some default proxy variables are provided in `environments/common/inventory/group_vars/all/proxy.yml` so generally it will be sufficient set the proxy user, password and address and to add these to the remote environment:
-
-```yaml
-# environments/site/inventory/group_vars/all/proxy.yml:
-proxy_basic_user: my_squid_user
-proxy_basic_password: "{{ vault_proxy_basic_password }}"
-proxy_http_address: squid.mysite.org
-
-# environments/site/inventory/group_vars/all/vault_proxy.yml:
-# NB: ansible vault-encrypt this file
-vault_proxy_basic_password: super-secret-password
-
-# environments/site/inventory/group_vars/all/default.yml:
-appliances_remote_environment_vars:
-  http_proxy: "{{ proxy_http_proxy }}"
-  https_proxy: "{{ proxy_http_proxy }}"
-```
-
-TODO: Do we need to set `no_proxy`??
-
-This uses Ansible's [remote environment support](https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_environment.html). Currrently this is suported for the following roles/groups:
-- eessi: TODO: is this right though??
-- manila
-
-
-Although EESSI will install with the above configuration, as there is no
-outbound internet access except for Ansible tasks, making it functional will
-require [configuring a proxy for CVMFS](https://multixscale.github.io/cvmfs-tutorial-hpc-best-practices/access/proxy/#client-system-configuration).
-
-
-
-## Deploying Squid using the appliance
-If an external squid is not available, one can be deployed by the cluster on a
-dual-homed host. See [docs/networks.md#proxies](../networks.md#proxies) for
-guidance, but note a separate host should be used rather than a Slurm node, to
-avoid users on that node getting direct access.
-
-If the deploy host is RockyLinux, this could be used as the squid host by adding
-it to inventory:
-
-```ini
-# environments/$ENV/inventory/squid
-[squid]
-# configure squid on deploy host
-localhost ansible_host=10.20.0.121 ansible_connection=local
-```
-
-The IP address should be the deploy hosts's IP on the cluster network and is used
-later to define the proxy address. Other connection variables (e.g. `ansible_user`)
-could be set if required.
-
-## Using Squid with basic authentication
-
-First create usernames/passwords on the squid host (tested on RockyLinux 8.9):
-
-```shell
-SQUID_USER=rocky
-dnf install -y httpd-tools
-htpasswd -c /etc/squid/passwords $SQUID_USER # enter pasword at prompt
-sudo chown squid /etc/squid/passwords
-sudo chmod u=rw,go= /etc/squid/passwords
-```
-
-This can be tested by running:
-```
-/usr/lib64/squid/basic_ncsa_auth /etc/squid/passwords
-```
-
-and entering `$SQUID_USER PASSWORD`, which should respond `OK`.
-
-If using the appliance to deploy squid, override the default `squid`
-configuration to use basic auth:
-
-```yaml
-# environments/$ENV/inventory/group_vars/all/squid.yml:
-squid_acls:
-  - acl ncsa_users proxy_auth REQUIRED
-squid_auth_param: |
-  auth_param basic program /usr/lib64/squid/basic_ncsa_auth /etc/squid/passwords
-  auth_param basic children 5
-  auth_param basic credentialsttl 1 minute
-```
-
-See the [squid docs](https://wiki.squid-cache.org/ConfigExamples/Authenticate/Ncsa) for more information.
-
-## Proxy Configuration
-
-Configure the appliance to configure proxying on all cluster nodes:
-
-```ini
-# environments/.stackhpc/inventory/groups:
-...
-[proxy:children]
-cluster
-...
-```
-
-Now configure the appliance to set proxy variables via remote environment
-rather than by writing it to the host, and provide the basic authentication
-credentials:
-
-```yaml
-#environments/$ENV/inventory/group_vars/all/proxy.yml:
-proxy_basic_user: $SQUID_USER
-proxy_basic_password: "{{ vault_proxy_basic_password }}"
-proxy_plays_only: true
-```
-
-```yaml
-#environments/$ENV/inventory/group_vars/all/vault_proxy.yml:
-vault_proxy_basic_password: $SECRET
-```
-This latter file should be vault-encrypted.
-
-If using an appliance-deployed squid then the other [proxy role variables](../../ansible/roles/proxy/README.md)
-will be automatically constructed (see environments/common/inventory/group_vars/all/proxy.yml).
-You may need to override `proxy_http_address` if the hostname of the squid node
-is not resolvable by the cluster. This is typically the case if squid is deployed
-to the deploy host, in which case the IP address may be specified instead using
-the above example inventory as:
-
-```
-proxy_http_address: "{{ hostvars[groups['squid'] | first].ansible_host }}"
-```
-
-If using an external squid, at a minimum set `proxy_http_address`. You may
-also need to set `proxy_http_port` or any other [proxy role's variables](../../ansible/roles/proxy/README.md)
-if the calculated parameters are not appropriate.
-
-## Image build
-
-TODO: describe proxy setup for that
-
-## EESSI
-
+Full functionality of the appliance requires that there is outbound internet
+access from all nodes, possibly via a [proxy](../../ansible/roles/proxy/).
+
+However, many features (as defined by Ansible inventory groups/roles) will work
+if the cluster network(s) provide no outbound access. Currently this includes
+all "default" features, i.e. roles/groups which are enabled either in the
+`common` environment or in the `environments/$ENV/inventory/groups` file
+created by cookiecutter for a new environment.
+
+The full list of features, and whether they are functional on such an "isolated" network, is shown in the table below. Note that:
+
+1. The `hpl` test from the `ansible/adhoc/hpctests.yml` playbook is not
+   functional and must be skipped using:
+
+   ```shell
+   ansible-playbook ansible/adhoc/hpctests.yml --skip-tags hpl-solo
+   ```
+
+2. Using [EESSI](https://www.eessi.io/docs/) necessarily requires outbound
+   network access for the CernVM File System. However, this can be provided
+   via an authenticated proxy, as sketched after this list. While the proxy
+   configuration on the cluster node is readable by all users, this proxy can
+   provide access only to EESSI's CVMFS Stratum 1 servers.
+
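A hedged sketch of what the client-side CVMFS configuration for such a proxy might look like (the proxy address and credentials below are placeholders, and whether credentials may be embedded in `CVMFS_HTTP_PROXY` should be checked against the CVMFS documentation):

```shell
# Illustrative only: point the CVMFS client at an authenticated squid proxy.
# The hostname, port and credentials are placeholders, not values from this repo.
cat > /etc/cvmfs/default.local <<'EOF'
CVMFS_HTTP_PROXY="http://my_squid_user:super-secret-password@squid.mysite.org:3128"
EOF
cvmfs_config reload  # apply the new client configuration
```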
+## Support by feature for isolated networks
+
+See above for the definition of "default" features. In the "Isolated?" column:
+- "Y": Feature works without outbound internet access.
+- "N": Known not to work.
+- "?": Not investigated at present.
+
+| Inventory group/role | Default? | Isolated? |
+| -------------------- | -------- | --------- |
+| alertmanager | Y | Y |
+| ansible_init | Y | Y |
+| basic_users | Y | Y |
+| block_devices | Y | N (deprecated) |
+| cacerts | - | Y |
+| chrony | - | Y |
+| compute_init | - | Y |
+| cuda | - | ? |
+| eessi | Y | Y - see above |
+| etc_hosts | Y | Y |
+| extra_packages | - | N |
+| fail2ban | Y | Y |
+| filebeat | Y | Y |
+| firewalld | Y | Y |
+| gateway | n/a | n/a - build only |
+| grafana | Y | Y |
+| hpctests | Y | Y - except hpl-solo, see above |
+| k3s_agent | - | ? |
+| k3s_server | - | ? |
+| k9s | - | ? |
+| lustre | - | ? |
+| manila | Y | Y |
+| mysql | Y | Y |
+| nfs | Y | Y |
+| nhc | Y | Y |
+| node_exporter | Y | Y |
+| openhpc | Y | Y |
+| openondemand | Y | Y |
+| openondemand_desktop | Y | Y |
+| openondemand_jupyter | Y | Y |
+| opensearch | Y | Y |
+| podman | Y | Y |
+| persist_hostkeys | Y | Y |
+| prometheus | Y | Y |
+| proxy | - | Y |
+| resolv_conf | - | ? |
+| slurm_exporter | Y | Y |
+| slurm_stats | Y | Y |
+| squid | - | ? |
+| sshd | - | ? |
+| sssd | - | ? |
+| systemd | Y | Y |
+| tuned | - | Y |
+| update | - | N |
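
Non-default roles in this table are enabled by adding hosts to the matching inventory group. A minimal sketch, assuming the appliance's usual pattern of making `cluster` a child of the feature group (the file below is illustrative):

```ini
# environments/$ENV/inventory/groups (illustrative):
# enable the chrony role, which the table above marks as working when isolated
[chrony:children]
cluster
```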
 
 ## Network considerations
 
-Note that even when outbound internet access is not required, the following
-(shown as OpenStack security groups/rules as displayed by Horizon) outbound access from nodes is still required to enable deployment
+Even when outbound internet access is not required, nodes do require some outbound access, as well as connectivity inbound from the deploy host and
+inbound connectivity for users. This section documents the minimal connectivity required, in the form of the minimally permissive security group rules. Often default security groups are less restrictive than these.
 
-Assuming nodes have a security group `isolated` applied:
+Assuming nodes and the deploy host have a security group `isolated` applied, then the following rules are required:
 
     # allow outbound DNS
     ALLOW IPv4 53/tcp to 0.0.0.0/0
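
These rules translate directly into OpenStack CLI commands. A minimal sketch for the first outbound rule, assuming the `openstack` client is available (this is illustrative, not part of the commit):

```shell
# Create the minimal "isolated" security group and its DNS egress rule.
openstack security group create isolated --description "minimal connectivity for isolated clusters"
openstack security group rule create --egress --protocol tcp --dst-port 53 --remote-ip 0.0.0.0/0 isolated
```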
@@ -223,19 +96,18 @@ Assuming nodes have a security group `isolated` applied:
     # allow hosts to reach metadata server (e.g. for cloud-init keys):
     ALLOW IPv4 80/tcp to 169.254.169.254/32
 
-    # allow hosts to reach squid proxy:
+    # optionally: allow hosts to reach squid proxy for EESSI:
     ALLOW IPv4 3128/tcp to <squid cidr>
 
-Note that DNS is required (and is configured by OpenStack if the subnet has
-a gateway) because name resolution happens on the hosts, not on the proxy.
+Note that name resolution happens on the hosts, not on the proxy, hence DNS is required for nodes even with a proxy.
 
-For nodes running OpenOndemand, inbound ssh and https are also required:
+For nodes running OpenOndemand, inbound ssh and https are also required
+(e.g. in a security group called `isolated-ssh-https`):
 
     ALLOW IPv4 443/tcp from 0.0.0.0/0
     ALLOW IPv4 22/tcp from 0.0.0.0/0
 
-Note the OpenTofu variables `login_security_groups` and
-`nonlogin_security_groups` can be used to set security groups if requried:
+If non-default security groups are required, then the OpenTofu variables `login_security_groups` and `nonlogin_security_groups` can be used to set these, e.g.:
 
 ```terraform
 # environments/site/tofu/cluster.auto.tfvars:
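
For illustration, a sketch of how these OpenTofu variables might be set, using the security group names from above (the values are assumptions, not taken from the commit):

```terraform
# environments/site/tofu/cluster.auto.tfvars (illustrative):
login_security_groups    = ["isolated", "isolated-ssh-https"] # login nodes also need inbound ssh/https
nonlogin_security_groups = ["isolated"]                       # other nodes need only the minimal rules
```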
