13 changes: 10 additions & 3 deletions common/configuration/puppet.yaml.tftpl
@@ -29,6 +29,8 @@ users:
%{ endfor ~}

runcmd:
- "flag_cloud_init_failed () { echo \"$${BASH_COMMAND}\" >> /run/cloud-init-failed; }"
- trap 'flag_cloud_init_failed' ERR
- chmod 755 /etc # avoid issue with Rocky 9.4
- test ! -d /${sudoer_username} && userdel -f -r ${sudoer_username} && cloud-init clean -r
%{ if cloud_provider != "incus" }
@@ -47,12 +49,13 @@ runcmd:
- |
if ! test -f /etc/magic-castle-release; then
# Enable fastest mirror for distribution using dnf package manager
dnf -y install dnf-plugins-core
dnf config-manager --setopt=fastestmirror=True --save
# If the image does not have openssh-server installed but sshd_config still exists,
# installing the new RPM will not overwrite the file, and depending on the file
# content this might be catastrophic (some sshd_config files are empty, some miss essential lines).
# Therefore, when openssh-server is not installed, we remove sshd_config before installing it.
"[ -z $(rpm -qa openssh-server) ] && rm -f /etc/ssh/sshd_config"
[ -z "$(rpm -qa openssh-server)" ] && rm -f /etc/ssh/sshd_config
dnf -y install openssh openssh-server rsync
echo -e "Include /etc/ssh/sshd_config.d/50-authenticationmethods.conf" >> /etc/ssh/sshd_config
sed -i '/HostKey \/etc\/ssh\/ssh_host_ecdsa_key/ s/^#*/#/' /etc/ssh/sshd_config
@@ -73,9 +76,11 @@ runcmd:
dnf -y install openvox-agent-8.23.1
install -m 700 /dev/null /opt/puppetlabs/bin/postrun
# kernel configuration
%{ if cloud_provider != "incus" ~}
systemctl disable kdump
grubby --update-kernel=ALL --args="rd.driver.blacklist=nouveau nouveau.modeset=0 crashkernel=0M"
grub2-mkconfig -o /boot/grub2/grub.cfg
%{ endif ~}
fi
%{ if contains(tags, "puppet") }
# Install puppetserver
@@ -156,6 +161,7 @@ runcmd:
# If the current image has already been configured with Magic Castle Puppet environment,
# we can start puppet and skip reboot, reducing the delay for bringing the node up.
- test -f /etc/magic-castle-release && systemctl start puppet || true
- test -f /run/cloud-init-failed && echo 'WARNING - some cloud-init runcmd steps failed; they are listed in /run/cloud-init-failed. Manual fixing and rebooting required.' | tee /etc/motd || true

write_files:
# If the ip addresses of the puppet servers are not known in advance, we cannot restrict the ssh connection to them.
@@ -172,7 +178,8 @@ write_files:
facts : {
blocklist : [
"EC2", "az_metadata", "cloud.provider", "hypervisors"
%{ if cloud_provider != "gcp" },"GCE",%{ endif }
%{ if cloud_provider != "gcp" },"GCE"%{ endif }
%{ if cloud_provider == "incus" },"kmods"%{ endif }
],
}
path: /etc/puppetlabs/facter/facter.conf
@@ -255,7 +262,7 @@ output: { all: "| tee -a /var/log/cloud-init-output.log" }
power_state:
delay: now
mode: reboot
condition: test ! -f /etc/magic-castle-release
condition: test ! -f /etc/magic-castle-release && test ! -f /run/cloud-init-failed

# Configure owner of /var/log/cloud-init.log
syslog_fix_perms: root:systemd-journal
4 changes: 4 additions & 0 deletions common/variables.tf
@@ -82,6 +82,10 @@ variable "config_git_url" {
variable "config_version" {
type = string
description = "Tag, branch, or commit that specifies which Puppet configuration revision is to be used"
validation {
condition = length(var.config_version) >= 1
error_message = "The config_version variable cannot be an empty string. It must match a commit hash, a tag or a branch."
}
}

variable "hieradata" {
76 changes: 52 additions & 24 deletions docs/README.md
@@ -335,10 +335,14 @@ destroy the cluster or change it manually on the Puppet server.

Since Magic Castle configuration is managed with git, it is possible to specify
which version of the configuration you wish to use. Typically, it will match the
version number of the release you have downloaded (i.e: `9.3`).
version number of the release you have downloaded (e.g. `15.1.0`).

**Requirement**: Must refer to a git commit, tag or branch existing
in the git repository pointed by `config_git_url`.
in the git repository pointed to by `config_git_url`. It cannot be an empty string.

**Warning**: The validity of the string as a git reference is not verified. If it
is invalid, Magic Castle falls back to the latest release tag available
and logs a warning in the Puppet server message of the day (`/etc/motd`).
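
As a hedged illustration of checking a reference before deploying, the sketch below relies on `git ls-remote --exit-code`, which exits with a non-zero status when no matching ref is found. The function name `config_ref_exists` is hypothetical and not part of Magic Castle. The check only covers tags and branches, since `ls-remote` cannot look up an arbitrary commit hash.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Magic Castle): check that a tag or branch
# named $2 exists in the git repository at URL or path $1.
config_ref_exists() {
    local repo_url="$1" ref="$2"
    # Mirror the Terraform validation: an empty reference is never valid
    if [ -z "$ref" ]; then
        echo "config_version cannot be an empty string" >&2
        return 1
    fi
    # --exit-code makes ls-remote return a non-zero status when no ref matches
    git ls-remote --exit-code "$repo_url" "$ref" > /dev/null
}
```

A possible use is `config_ref_exists "$CONFIG_GIT_URL" 15.1.0 && terraform apply`, where `CONFIG_GIT_URL` is a placeholder shell variable holding the same value as `config_git_url`.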

**Post build modification effect**: none. To change the Puppet configuration version,
destroy the cluster or change it manually on the Puppet server.
@@ -617,7 +621,7 @@ available models per region

##### Incus

- `target`: name of the [specific cluster member](https://linuxcontainers.org/incus/docs/main/howto/cluster_manage_instance/#launch-an-instance-on-a-specific-cluster-member) to deploy the instance. **Only use with Incus cluster.**
- `target`: name of the [specific cluster member](https://linuxcontainers.org/incus/docs/main/howto/cluster_manage_instance/#launch-an-instance-on-a-specific-cluster-member) to deploy the instance. **Only use with Incus cluster.**

#### 4.7.3 Post build modification effect

@@ -1383,35 +1387,59 @@ for more information.

## 8. Deployment

To create the resources defined by your main, enter the following command
```
To create the resources defined in your Terraform configuration, run:

```bash
terraform apply
```

The command will produce the same output as the `plan` command, but after
the output it will ask for a confirmation to perform the proposed actions.
Enter `yes`.
This command will first display the execution plan (equivalent to `terraform plan`) and then prompt you to confirm the proposed actions. Type `yes` to proceed.

Terraform will then create the infrastructure resources defined in the configuration. This step typically takes a few minutes. Once completed, Terraform will output:

- Guest account usernames and passwords
- The sudo-enabled username
- The floating IP address of the login node

### Important: Cluster Readiness

Although Terraform reports completion once the connection information is displayed,
**the cluster is not immediately ready for use**.

Instance creation is only the first phase of the cluster build. A second, automated configuration phase follows, during which Magic Castle creates user accounts and installs and configures core services such as FreeIPA, Slurm and JupyterHub.

Terraform will then proceed to create the resources defined by the
configuration file. It should take a few minutes. Once the creation process
is completed, Terraform will output the guest account usernames and password,
the sudoer username and the floating ip of the login
node.
This configuration phase typically takes **approximately 15 minutes** after the instances are created.

**Warning**: although the instance creation process is finished once Terraform
outputs the connection information, you will not be able to
connect and use the cluster immediately. The instance creation is only the
first phase of the cluster-building process. The configuration: the
creation of the user accounts, installation of FreeIPA, Slurm, configuration
of JupyterHub, etc.; takes around 15 minutes after the instances are created.
### Instance Configuration Process

Each instance goes through a two-stage configuration process:

1. **cloud-init**
- Upgrades operating system packages
- Installs Puppet
2. **Puppet**
- Installs and configures software based on the instance role, as defined by instance tags (e.g. `node`)

#### Logs and Troubleshooting

Logs for each stage are available at:

1. **cloud-init**: `/var/log/cloud-init-output.log`
2. **Puppet**: `journalctl -u puppet`

If an error occurs during the first (cloud-init) stage, a warning is displayed in the instance's
message of the day (`/etc/motd`). The failed commands are recorded in:

```
/run/cloud-init-failed
```

Once it is booted, you can follow an instance configuration process by looking at:
Because successful completion of the first stage is required for the second stage to proceed, the configuration process halts if cloud-init fails.

* `/var/log/cloud-init-output.log`
* `journalctl -u puppet`
You may resume the configuration by manually re-running the failed commands listed in `/run/cloud-init-failed` once the underlying issue has been resolved.
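
Before re-running anything, it can help to list what was recorded. The helper below is a small sketch, not a Magic Castle tool: the function name `report_cloud_init_failures` is hypothetical, and the optional argument exists only so the path can be overridden.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report whether the cloud-init stage recorded failures.
# $1 optionally overrides the default marker file path.
report_cloud_init_failures() {
    local failed_file="${1:-/run/cloud-init-failed}"
    if [ -s "$failed_file" ]; then
        echo "cloud-init recorded failed commands:"
        # Number each recorded command so it can be reviewed and re-run by hand
        nl -ba "$failed_file"
        return 1
    fi
    echo "no cloud-init failures recorded"
    return 0
}
```

Run without arguments on an instance, it checks `/run/cloud-init-failed` and returns a non-zero status when the file exists and is not empty.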

If unexpected problems occur during configuration, you can provide these
logs to the authors of Magic Castle to help you debug.
Failures during the first stage are rare and are most often caused by external dependencies, such as temporary unavailability of GitHub or package repositories.

### 8.1 Deployment Customization
