
Commit e55b6bd

MM-60255: Address once and for all the remote-exec error (#973)
* Migrate instance provisioning to cloud-init Replace Terraform `file` + `remote-exec` provisioner blocks across all instance types (app, metrics, proxy, agent, job, keycloak, etc.) with a single `user_data.sh.tpl` template that embeds the provisioner scripts inline via cloud-init. This eliminates the need for SSH connection blocks during `terraform apply` and makes provisioning more reliable. * Set USER and HOME in cloud-init user_data template Cloud-init runs user_data as root with a minimal environment, so $HOME and $USER are unset or point to root. Export them as the AMI user so provisioner scripts (nvm, otel-collector, keycloak) install files to the correct locations. Also replaces `$(whoami)` calls in RHEL scripts with `${USER}` for consistency. * Upload license via SSH instead of file provisioner Replace the Terraform `file` provisioner blocks for the license file on app and job servers with an explicit SSH upload in `setupAppServer`. * Remove connection blocks from Terraform instances These `connection` blocks were used by `remote-exec` and `file` provisioners which have been replaced by cloud-init user_data. They are now unused and can be safely removed. * Remove connection blocks from Terraform instances These `connection` blocks were used by `remote-exec` and `file` provisioners which have been replaced by cloud-init user_data. They are now unused and can be safely removed. * Use templated user for otelcol-contrib service The sed commands that configure the otelcol-contrib systemd service were hardcoding User=ubuntu and Group=ubuntu. Use ${USER} instead so the service runs as the same runtime user as the rest of the provisioner script. * Write failure sentinel when provisioner.sh exits non-zero On failure, write the exit code to provisioning-exitcode and touch provisioning-failed. The waitForProvisioning poller now checks for the failure sentinel each iteration and aborts immediately instead of waiting for the full timeout. 
* Add per-attempt dial timeout to SSH retry loop NewClientWithRetry now computes remaining time each iteration and passes min(backoff, remaining) as the dial timeout via a new newClientWithTimeout helper that sets ssh.ClientConfig.Timeout, preventing a single ssh.Dial from blocking past the deadline. * Capture provisioner exit code before branching Run /tmp/provisioner.sh with || to capture its exit code into a variable before the conditional, rather than relying on $? inside the else branch which is fragile if any command is inserted between the condition and the echo. * Use ${USER} in Debian provisioner sed commands The agent.sh and rhel/common.sh provisioners already use ${USER} for the otelcol-contrib service unit. Apply the same pattern to proxy.sh, job.sh, and app.sh so the service runs as the configured AMI user. * Fix double-budget issue in provisioning timeout Track elapsed time during SSH connection so the polling loop uses only the remaining budget instead of starting a fresh full timeout.
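The exit-code-capture point above can be sketched as follows. This is an illustrative sketch, not the repository's code: the stand-in provisioner function and the temp-dir sentinel location are assumptions; the commit only specifies the `|| rc=$?` pattern and the sentinel file names.

```shell
#!/bin/sh
# Illustrative sketch (not the repository's code): capture the
# provisioner's exit code at the call site with `|| rc=$?` instead of
# reading $? inside the else branch, where any inserted command would
# clobber it. On failure, write the sentinel files the poller checks.
set -u

sentinel_dir=$(mktemp -d)

# Stand-in for /tmp/provisioner.sh; fails with a known code.
run_provisioner() {
    return 42
}

rc=0
run_provisioner || rc=$?

if [ "$rc" -ne 0 ]; then
    echo "$rc" > "$sentinel_dir/provisioning-exitcode"
    touch "$sentinel_dir/provisioning-failed"
fi

cat "$sentinel_dir/provisioning-exitcode"   # prints 42
```

Capturing the status on the same line as the command is what makes this robust: `$?` reflects the last command executed, so any statement added between the call and the branch would silently change what gets recorded.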
1 parent 7cdbb5f commit e55b6bd
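The min(backoff, remaining) dial timeout from the commit message can be illustrated with simple arithmetic. The real implementation is Go (NewClientWithRetry / newClientWithTimeout); the numbers below are made up for demonstration.

```shell
#!/bin/sh
# Illustrative arithmetic for min(backoff, remaining): each SSH attempt's
# dial timeout is the backoff, capped by whatever is left of the overall
# budget, so no single attempt can outlive the deadline. Values are made up.
set -u

budget=25      # total seconds allowed for connecting
elapsed=18     # seconds already spent on earlier attempts
backoff=10     # current retry backoff

remaining=$(( budget - elapsed ))
timeout=$backoff
if [ "$remaining" -lt "$timeout" ]; then
    timeout=$remaining
fi

echo "$timeout"   # prints 7
```

Without the cap, the final attempt would be allowed the full 10-second backoff even though only 7 seconds of budget remain.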

File tree

17 files changed: +276 -259 lines changed

deployment/terraform/assets/cluster.tf

Lines changed: 42 additions & 183 deletions
@@ -51,12 +51,6 @@ resource "aws_instance" "app_server" {
     Name = "${var.cluster_name}-app-${count.index}"
   }
 
-  connection {
-    type = "ssh"
-    host = var.connection_type == "public" ? self.public_ip : self.private_ip
-    user = var.aws_ami_user
-  }
-
   ami = var.aws_ami
   instance_type = var.app_instance_type
   key_name = aws_key_pair.key.id
@@ -75,29 +69,12 @@ resource "aws_instance" "app_server" {
     volume_type = var.block_device_type
   }
 
-  provisioner "file" {
-    source      = var.mattermost_license_file
-    destination = "/home/${var.aws_ami_user}/mattermost.mattermost-license"
-  }
-
-  provisioner "file" {
-    source      = "provisioners/${var.operating_system_kind}/common.sh"
-    destination = "/tmp/common.sh"
-  }
-
-  provisioner "file" {
-    source      = "provisioners/${var.operating_system_kind}/app.sh"
-    destination = "/tmp/provisioner.sh"
-  }
-
-  provisioner "remote-exec" {
-    inline = [
-      "cd /tmp",
-      "chmod +x /tmp/common.sh",
-      "chmod +x /tmp/provisioner.sh",
-      "/tmp/provisioner.sh",
-    ]
-  }
+  user_data_replace_on_change = true
+  user_data = templatefile("${path.module}/user_data.sh.tpl", {
+    common_sh      = file("${path.module}/provisioners/${var.operating_system_kind}/common.sh")
+    provisioner_sh = file("${path.module}/provisioners/${var.operating_system_kind}/app.sh")
+    ami_user       = var.aws_ami_user
+  })
 }
 data "aws_vpc" "selected" {
   tags = {
@@ -204,12 +181,6 @@ resource "aws_instance" "metrics_server" {
     Name = "${var.cluster_name}-metrics"
   }
 
-  connection {
-    type = "ssh"
-    host = var.connection_type == "public" ? self.public_ip : self.private_ip
-    user = var.aws_ami_user
-  }
-
   ami = var.aws_ami
   instance_type = var.metrics_instance_type
   count = var.enable_metrics_instance ? 1 : 0
@@ -228,25 +199,12 @@ resource "aws_instance" "metrics_server" {
     volume_type = var.block_device_type
   }
 
-  provisioner "file" {
-    source      = "provisioners/${var.operating_system_kind}/common.sh"
-    destination = "/tmp/common.sh"
-  }
-
-  provisioner "file" {
-    source      = "provisioners/${var.operating_system_kind}/metrics.sh"
-    destination = "/tmp/provisioner.sh"
-  }
-
-  provisioner "remote-exec" {
-    inline = [
-      "cd /tmp",
-      "chmod +x /tmp/common.sh",
-      "chmod +x /tmp/provisioner.sh",
-      "/tmp/provisioner.sh",
-    ]
-  }
-
+  user_data_replace_on_change = true
+  user_data = templatefile("${path.module}/user_data.sh.tpl", {
+    common_sh      = file("${path.module}/provisioners/${var.operating_system_kind}/common.sh")
+    provisioner_sh = file("${path.module}/provisioners/${var.operating_system_kind}/metrics.sh")
+    ami_user       = var.aws_ami_user
+  })
 }
 
 resource "aws_instance" "proxy_server" {
@@ -271,31 +229,12 @@ resource "aws_instance" "proxy_server" {
     volume_type = var.block_device_type
   }
 
-  connection {
-    type = "ssh"
-    user = var.aws_ami_user
-    host = var.connection_type == "public" ? self.public_ip : self.private_ip
-  }
-
-  provisioner "file" {
-    source      = "provisioners/${var.operating_system_kind}/common.sh"
-    destination = "/tmp/common.sh"
-  }
-
-  provisioner "file" {
-    source      = "provisioners/${var.operating_system_kind}/proxy.sh"
-    destination = "/tmp/provisioner.sh"
-  }
-
-  provisioner "remote-exec" {
-    inline = [
-      "cd /tmp",
-      "chmod +x /tmp/common.sh",
-      "chmod +x /tmp/provisioner.sh",
-      "/tmp/provisioner.sh",
-    ]
-  }
-
+  user_data_replace_on_change = true
+  user_data = templatefile("${path.module}/user_data.sh.tpl", {
+    common_sh      = file("${path.module}/provisioners/${var.operating_system_kind}/common.sh")
+    provisioner_sh = file("${path.module}/provisioners/${var.operating_system_kind}/proxy.sh")
+    ami_user       = var.aws_ami_user
+  })
 }
 
 resource "aws_iam_user" "s3user" {
@@ -448,12 +387,6 @@ resource "aws_instance" "loadtest_agent" {
     Name = "${var.cluster_name}-agent-${count.index}"
   }
 
-  connection {
-    type = "ssh"
-    host = var.connection_type == "public" ? self.public_ip : self.private_ip
-    user = var.aws_ami_user
-  }
-
   ami = var.aws_ami
   instance_type = var.agent_instance_type
   key_name = aws_key_pair.key.id
@@ -470,38 +403,19 @@ resource "aws_instance" "loadtest_agent" {
     volume_type = var.block_device_type
   }
 
-  provisioner "file" {
-    source      = "provisioners/${var.operating_system_kind}/common.sh"
-    destination = "/tmp/common.sh"
-  }
-
-  provisioner "file" {
-    source      = "provisioners/${var.operating_system_kind}/agent.sh"
-    destination = "/tmp/provisioner.sh"
-  }
-
-  provisioner "remote-exec" {
-    inline = [
-      "cd /tmp",
-      "chmod +x /tmp/common.sh",
-      "chmod +x /tmp/provisioner.sh",
-      "/tmp/provisioner.sh",
-    ]
-  }
-
+  user_data_replace_on_change = true
+  user_data = templatefile("${path.module}/user_data.sh.tpl", {
+    common_sh      = file("${path.module}/provisioners/${var.operating_system_kind}/common.sh")
+    provisioner_sh = file("${path.module}/provisioners/${var.operating_system_kind}/agent.sh")
+    ami_user       = var.aws_ami_user
+  })
 }
 
 resource "aws_instance" "loadtest_browser_agent" {
   tags = {
     Name = "${var.cluster_name}-browser-agent-${count.index}"
   }
 
-  connection {
-    type = "ssh"
-    host = var.connection_type == "public" ? self.public_ip : self.private_ip
-    user = var.aws_ami_user
-  }
-
   ami = var.aws_ami
   instance_type = var.browser_agent_instance_type
   key_name = aws_key_pair.key.id
@@ -518,25 +432,12 @@ resource "aws_instance" "loadtest_browser_agent" {
     volume_type = var.block_device_type
   }
 
-  provisioner "file" {
-    source      = "provisioners/${var.operating_system_kind}/common.sh"
-    destination = "/tmp/common.sh"
-  }
-
-  provisioner "file" {
-    source      = "provisioners/${var.operating_system_kind}/agent.sh"
-    destination = "/tmp/provisioner.sh"
-  }
-
-  provisioner "remote-exec" {
-    inline = [
-      "cd /tmp",
-      "chmod +x /tmp/common.sh",
-      "chmod +x /tmp/provisioner.sh",
-      "/tmp/provisioner.sh",
-    ]
-  }
-
+  user_data_replace_on_change = true
+  user_data = templatefile("${path.module}/user_data.sh.tpl", {
+    common_sh      = file("${path.module}/provisioners/${var.operating_system_kind}/common.sh")
+    provisioner_sh = file("${path.module}/provisioners/${var.operating_system_kind}/agent.sh")
+    ami_user       = var.aws_ami_user
+  })
 }
 
 resource "aws_security_group" "app" {
@@ -936,12 +837,6 @@ resource "aws_instance" "job_server" {
     Name = "${var.cluster_name}-job-server-${count.index}"
   }
 
-  connection {
-    type = "ssh"
-    host = var.connection_type == "public" ? self.public_ip : self.private_ip
-    user = var.aws_ami_user
-  }
-
   ami = var.aws_ami
   instance_type = var.job_server_instance_type
   key_name = aws_key_pair.key.id
@@ -958,30 +853,12 @@ resource "aws_instance" "job_server" {
     volume_type = var.block_device_type
   }
 
-  provisioner "file" {
-    source      = var.mattermost_license_file
-    destination = "/home/${var.aws_ami_user}/mattermost.mattermost-license"
-  }
-
-
-  provisioner "file" {
-    source      = "provisioners/${var.operating_system_kind}/common.sh"
-    destination = "/tmp/common.sh"
-  }
-
-  provisioner "file" {
-    source      = "provisioners/${var.operating_system_kind}/job.sh"
-    destination = "/tmp/provisioner.sh"
-  }
-
-  provisioner "remote-exec" {
-    inline = [
-      "cd /tmp",
-      "chmod +x /tmp/common.sh",
-      "chmod +x /tmp/provisioner.sh",
-      "/tmp/provisioner.sh",
-    ]
-  }
+  user_data_replace_on_change = true
+  user_data = templatefile("${path.module}/user_data.sh.tpl", {
+    common_sh      = file("${path.module}/provisioners/${var.operating_system_kind}/common.sh")
+    provisioner_sh = file("${path.module}/provisioners/${var.operating_system_kind}/job.sh")
+    ami_user       = var.aws_ami_user
+  })
 }
 
 locals {
@@ -1002,12 +879,6 @@ resource "aws_instance" "keycloak" {
     Name = "${var.cluster_name}-keycloak"
   }
 
-  connection {
-    type = "ssh"
-    host = var.connection_type == "public" ? self.public_ip : self.private_ip
-    user = var.aws_ami_user
-  }
-
   ami = var.aws_ami
   instance_type = var.keycloak_instance_type
   count = var.keycloak_enabled ? 1 : 0
@@ -1024,24 +895,12 @@ resource "aws_instance" "keycloak" {
     volume_type = var.block_device_type
  }
 
-  provisioner "file" {
-    source      = "provisioners/${var.operating_system_kind}/common.sh"
-    destination = "/tmp/common.sh"
-  }
-
-  provisioner "file" {
-    source      = "provisioners/${var.operating_system_kind}/keycloak.sh"
-    destination = "/tmp/provisioner.sh"
-  }
-
-  provisioner "remote-exec" {
-    inline = [
-      "cd /tmp",
-      "chmod +x /tmp/common.sh",
-      "chmod +x /tmp/provisioner.sh",
-      "/tmp/provisioner.sh",
-    ]
-  }
+  user_data_replace_on_change = true
+  user_data = templatefile("${path.module}/user_data.sh.tpl", {
+    common_sh      = file("${path.module}/provisioners/${var.operating_system_kind}/common.sh")
+    provisioner_sh = file("${path.module}/provisioners/${var.operating_system_kind}/keycloak.sh")
+    ami_user       = var.aws_ami_user
+  })
 }
 
 resource "aws_security_group" "keycloak" {
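The `user_data.sh.tpl` file itself is not part of this page's diff. Based on the template variables passed above (`common_sh`, `provisioner_sh`, `ami_user`) and the commit notes about exporting USER and HOME, a plausible minimal shape could look like the sketch below. This is an inferred illustration, not the repository's actual template; the sentinel paths are hypothetical, and the `${...}` expressions are Terraform `templatefile` interpolations, substituted before the script ever runs.

```shell
#!/bin/bash
# Inferred sketch of user_data.sh.tpl -- NOT the actual template.
# cloud-init runs this as root with a minimal environment, so USER and
# HOME are exported as the AMI user before the embedded scripts run.

export USER="${ami_user}"
export HOME="/home/${ami_user}"

# Materialize the provisioner scripts that were previously uploaded
# via `file` provisioners. Terraform substitutes their contents here.
cat > /tmp/common.sh <<'EOF_COMMON'
${common_sh}
EOF_COMMON

cat > /tmp/provisioner.sh <<'EOF_PROVISIONER'
${provisioner_sh}
EOF_PROVISIONER

chmod +x /tmp/common.sh /tmp/provisioner.sh
cd /tmp

# Capture the exit code at the call site and leave failure sentinels for
# the poller, per the commit message (sentinel paths are illustrative).
rc=0
/tmp/provisioner.sh || rc=$?
if [ "$rc" -ne 0 ]; then
    echo "$rc" > /tmp/provisioning-exitcode
    touch /tmp/provisioning-failed
fi
```

The quoted heredoc delimiters keep the shell from re-expanding anything inside the embedded scripts; Terraform's interpolation happens regardless of shell quoting, since it runs before the file reaches the instance.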

deployment/terraform/assets/ldap.tf

Lines changed: 6 additions & 24 deletions
@@ -4,12 +4,6 @@ resource "aws_instance" "openldap" {
     Name = "${var.cluster_name}-openldap"
   }
 
-  connection {
-    type = "ssh"
-    host = var.connection_type == "public" ? self.public_ip : self.private_ip
-    user = var.aws_ami_user
-  }
-
   ami = var.aws_ami
   instance_type = var.openldap_instance_type
   count = var.openldap_enabled ? 1 : 0
@@ -26,24 +20,12 @@ resource "aws_instance" "openldap" {
     volume_type = var.block_device_type
   }
 
-  provisioner "file" {
-    source      = "provisioners/${var.operating_system_kind}/common.sh"
-    destination = "/tmp/common.sh"
-  }
-
-  provisioner "file" {
-    source      = "provisioners/${var.operating_system_kind}/openldap.sh"
-    destination = "/tmp/provisioner.sh"
-  }
-
-  provisioner "remote-exec" {
-    inline = [
-      "cd /tmp",
-      "chmod +x /tmp/common.sh",
-      "chmod +x /tmp/provisioner.sh",
-      "/tmp/provisioner.sh",
-    ]
-  }
+  user_data_replace_on_change = true
+  user_data = templatefile("${path.module}/user_data.sh.tpl", {
+    common_sh      = file("${path.module}/provisioners/${var.operating_system_kind}/common.sh")
+    provisioner_sh = file("${path.module}/provisioners/${var.operating_system_kind}/openldap.sh")
+    ami_user       = var.aws_ami_user
+  })
 }
 
 resource "aws_security_group" "openldap" {

deployment/terraform/assets/provisioners/debian/agent.sh

Lines changed: 4 additions & 9 deletions
@@ -2,12 +2,6 @@
 
 set -euo pipefail
 
-# Wait for boot to be finished (e.g. networking to be up).
-while [ ! -f /var/lib/cloud/instance/boot-finished ]; do
-  echo 'Waiting for cloud-init...'
-  sleep 1
-done
-
 # Retry loop (up to 3 times)
 n=0
 until [ "$n" -ge 3 ]; do
@@ -26,12 +20,13 @@ until [ "$n" -ge 3 ]; do
     nvm install 24.11 &&
     nvm use 24.11 &&
     echo "Node.js installed successfully with version $(node --version)" &&
-    # Install OpenTelemetry collector, using ubuntu user to avoid permission issues
+    # Install OpenTelemetry collector, using current user to avoid permission issues
     wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.120.0/otelcol-contrib_0.120.0_linux_amd64.deb &&
    sudo dpkg -i otelcol-contrib_0.120.0_linux_amd64.deb &&
-    sudo sed -i 's/User=.*/User=ubuntu/g' /lib/systemd/system/otelcol-contrib.service &&
-    sudo sed -i 's/Group=.*/Group=ubuntu/g' /lib/systemd/system/otelcol-contrib.service &&
+    sudo sed -i "s/User=.*/User=${USER}/g" /lib/systemd/system/otelcol-contrib.service &&
+    sudo sed -i "s/Group=.*/Group=${USER}/g" /lib/systemd/system/otelcol-contrib.service &&
     sudo systemctl daemon-reload && sudo systemctl restart otelcol-contrib &&
+    sudo chown -R ${USER}:${USER} ${HOME}/.nvm &&
     exit 0
   n=$((n + 1))
   sleep 2
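The fail-fast polling described in the commit message (abort on the failure sentinel instead of waiting out the full timeout) can be sketched in shell. The real poller (waitForProvisioning) is Go; the directory, file names, and timings below are illustrative, and a background subshell stands in for a failing provisioner.

```shell
#!/bin/sh
# Sketch of the fail-fast poll: check for the failure sentinel on every
# iteration and abort immediately rather than waiting out the deadline.
# Paths and timings are illustrative, not the repository's.
set -u

dir=$(mktemp -d)
deadline=$(( $(date +%s) + 10 ))

# Simulate a provisioner that fails shortly after boot.
( sleep 1; echo 42 > "$dir/provisioning-exitcode"; touch "$dir/provisioning-failed" ) &

status=timeout
while [ "$(date +%s)" -lt "$deadline" ]; do
    if [ -f "$dir/provisioning-failed" ]; then
        status="failed:$(cat "$dir/provisioning-exitcode")"
        break
    fi
    if [ -f "$dir/boot-finished" ]; then
        status=ok
        break
    fi
    sleep 1
done

echo "$status"   # prints failed:42 well before the 10s deadline
```

Checking the failure sentinel first means a broken provisioner surfaces within one poll interval, instead of the poller spinning until the deadline and reporting a generic timeout.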
