187 changes: 187 additions & 0 deletions hack/simulate-disaster.sh
@@ -0,0 +1,187 @@
#!/usr/bin/env bash

set -o errexit
set -o pipefail

script_dir=$( cd -- "$(dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd)
project_dir="${script_dir}/.."
fixtures_dir="${project_dir}/e2e/fixtures"
fixture_variant="${FIXTURE_VARIANT:-large}"
fixture_extra_vars="${FIXTURE_EXTRA_VARS}"

# Simulates losing a Docker Swarm node, but retains all Etcd data so that we can
# focus on just the Docker Swarm and database instance recovery steps
simulate_swarm_node_loss() {
local host_id

for host_id in $@; do

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat -n hack/simulate-disaster.sh

Repository: pgEdge/control-plane

Length of output: 6082


🌐 Web query:

shell script quoting $@ best practices word splitting globbing

💡 Result:

Best practices for $@ (and avoiding word splitting + globbing)

  • Forward all arguments exactly:
    Use "$@" (quoted). In bash, "$@" expands to separate words: "$1" "$2" ... (preserves spaces, tabs, newlines, and literal * ? [ characters in each argument). [1][2]

  • Don’t use $* for forwarding:
    $* (especially unquoted) merges args and then undergoes splitting/globbing; "$*" turns all args into one string separated by the first char of IFS. Prefer "$@". [1][3]

  • Loop over args safely:
    Prefer either:

    for arg in "$@"; do ...; done

    or (bash) simply:

    for arg; do ...; done

    Avoid for arg in $@; ... because unquoted $@ triggers word splitting and pathname expansion (globbing). [2]

  • Quote variable expansions by default:
    Unquoted expansions are subject to field splitting and then pathname expansion (globbing). Quoting suppresses both. [2][4]

  • When mixing variables with globs, quote only the variable parts:

    for f in "${dir}"/*.txt; do ...; done

    not "${dir}/*.txt" (that would disable the glob). [5]

  • If you actually want splitting, do it explicitly (not via unquoted $var/$@):
    Use read -r -a / mapfile (bash) or controlled IFS handling, rather than relying on “sloppy” splitting + globbing. [2]

Sources: Bash Reference Manual on special parameters @/* and quoting [1], ShellCheck guidance on quoting and "$@" [2][3][5], POSIX shell quoting/expansion rules [4].
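
To make the splitting behavior concrete, here is a minimal sketch. The two helper functions are hypothetical and not part of `hack/simulate-disaster.sh`; each one just counts how many words its loop sees when given the same arguments.

```shell
#!/usr/bin/env bash

# Hypothetical helpers for illustration only: each reports how many words
# its for-loop iterates over.
count_unquoted() {
    local n=0 arg
    # Intentionally unquoted (this is what SC2068 warns about).
    for arg in $@; do
        n=$((n + 1))
    done
    echo "${n}"
}

count_quoted() {
    local n=0 arg
    for arg in "$@"; do
        n=$((n + 1))
    done
    echo "${n}"
}

# "host 1" is a single argument that contains a space.
count_unquoted "host 1" host-2   # prints 3: "host 1" was split in two
count_quoted "host 1" host-2     # prints 2: arguments preserved as given
```

Lima host IDs such as `host-1` contain no whitespace or glob characters, so the unquoted form happens to work for the examples in this script. The quoted form costs nothing and removes the dependency on that assumption.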


Fix quoting of $@ expansions and correct the function name on line 172.

Unquoted $@ expansions in this script cause word-splitting and globbing issues, making them fragile with inputs containing spaces or glob characters. Additionally, line 172 calls simulate_full_node_loss which doesn't exist; the function is defined as simulate_full_loss on line 70, causing a runtime error.

Apply these fixes:

  • Line 17: for host_id in "$@"; do
  • Line 39: for host_id in "$@"; do
  • Line 73: for host_id in "$@"; do
  • Line 166: simulate_swarm_node_loss "${@:2}"
  • Line 169: simulate_etcd_node_loss "${@:2}"
  • Line 172: simulate_full_loss "${@:2}" (corrected function name and quoted)
  • Line 187: main "$@"
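
The `"${@:2}"` form used in the dispatch fixes above can be sketched in isolation. `show_hosts` and `dispatch` are hypothetical stand-ins for the `simulate_*` functions and `main`; they only echo what they receive.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for a simulate_* function: prints each host ID it
# receives on its own line, wrapped in brackets.
show_hosts() {
    local host_id
    for host_id in "$@"; do
        echo "[${host_id}]"
    done
}

# Hypothetical dispatcher: $1 is the subcommand, the rest are host IDs.
dispatch() {
    # "${@:2}" expands to "$2" "$3" ... as separate, unsplit words.
    show_hosts "${@:2}"
}

dispatch swarm host-1 "host 3"
# prints:
# [host-1]
# [host 3]
```

If only the subcommand is passed, `"${@:2}"` expands to nothing, so the callee's loop body never runs.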
🧰 Tools
🪛 Shellcheck (0.11.0)

[error] 17-17: Double quote array expansions to avoid re-splitting elements.

(SC2068)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/simulate-disaster.sh` at line 17, Fix the unquoted $@ expansions in the
for-loops and correct the misnamed function call: change the three loops that
iterate host_id (the ones using for host_id in $@) to use quoted expansion (for
host_id in "$@") to prevent word-splitting and globbing, update the calls that
forward args to use quoted slices where shown (simulate_swarm_node_loss
"${@:2}", simulate_etcd_node_loss "${@:2}"), replace the invalid
simulate_full_node_loss call with the actual function name simulate_full_loss
and pass quoted args (simulate_full_loss "${@:2}"), and ensure the script's
entry call uses main "$@" instead of unquoted arguments.

echo "=== simulating swarm node loss on ${host_id} ==="
echo

ssh -T -F ~/.lima/${host_id}/ssh.config lima-${host_id} <<-'EOF'
if [[ $(docker info --format '{{.Swarm.LocalNodeState}}') == "active" ]]; then
docker swarm leave --force
else
echo "node already left swarm"
fi
echo "removing instances data directory"
sudo rm -rf /data/control-plane/instances
EOF
echo
done
}

# Simulates losing an Etcd node, but retains Docker Swarm so that we can focus
# on just the Control Plane and database instance recovery steps
simulate_etcd_node_loss() {
local host_id

for host_id in $@; do
echo "=== simulating etcd node loss on ${host_id} ==="
echo

# We're using xargs here to gracefully ignore when the services do not
# exist
ssh -T -F ~/.lima/host-1/ssh.config lima-host-1 <<-EOF
echo "removing control-plane swarm service"
docker service ls \
--filter 'name=control-plane_${host_id}' \
--format '{{ .Name }}' \
| xargs -r docker service rm

echo "removing all database swarm services"
docker service ls \
--filter 'label=pgedge.host.id=${host_id}' \
--format '{{ .Name }}' \
| xargs -r docker service rm
EOF

ssh -T -F ~/.lima/${host_id}/ssh.config lima-${host_id} <<-EOF
echo "removing control-plane data directory"
sudo rm -rf /data/control-plane
EOF
echo
done
}

# This is most similar to a real disaster recovery scenario. We're losing the
# entire machine as well as all of its storage. Whatever replacement machine we
# start up may or may not have the same IP address.
simulate_full_loss() {
local host_id

for host_id in $@; do
echo "=== simulating full loss of ${host_id} ==="
echo

limactl stop ${host_id}
limactl delete ${host_id}

echo
done
}

# Resets Swarm and the Control Plane on all hosts and returns the Control Plane
# to an uninitialized state.
reset() {
echo "=== resetting all hosts ==="
echo

VARIANT="${fixture_variant}" \
EXTRA_VARS="${fixture_extra_vars}" \
make -C "${fixtures_dir}" \
deploy-lima-machines

for host_id in $(limactl ls | awk '$1~/^host-/ && $2 == "Running" { print $1 }'); do
echo "resetting swarm on ${host_id}"

ssh -T -F ~/.lima/${host_id}/ssh.config lima-${host_id} <<-'EOF'
if [[ $(docker info --format '{{.Swarm.LocalNodeState}}') == "active" ]]; then
docker swarm leave --force
else
echo "node already left swarm"
fi
echo "removing control-plane data directory"
sudo rm -rf /data/control-plane
EOF
done

make -C "${project_dir}" goreleaser-build

VARIANT="${fixture_variant}" \
EXTRA_VARS="${fixture_extra_vars}" \
make -C "${fixtures_dir}" \
setup-lima-hosts \
teardown-lima-control-plane \
deploy-lima-control-plane
}

usage() {
cat <<EOF
Usage: $1 <swarm|etcd|full> <host-id> [host-id ...]

Simulates disasters against the Lima test fixtures. Supports three different
different types of disasters to enable us to develop some recovery steps in
parallel:
Comment on lines +121 to +125

⚠️ Potential issue | 🟡 Minor

Update usage text (typo + include reset).

The synopsis omits reset, and the description says “different different”.

✏️ Suggested fix
-Usage: $1 <swarm|etcd|full> <host-id> [host-id ...]
+Usage: $1 <swarm|etcd|full|reset> <host-id> [host-id ...]
@@
-Simulates disasters against the Lima test fixtures. Supports three different
-different types of disasters to enable us to develop some recovery steps in
+Simulates disasters against the Lima test fixtures. Supports three different
+types of disasters to enable us to develop some recovery steps in
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/simulate-disaster.sh` around lines 121 - 125, Update the usage text in
the simulate-disaster.sh header: fix the duplicated word "different different"
and include the missing `reset` option in the synopsis string (change "Usage: $1
<swarm|etcd|full> <host-id> [host-id ...]" to include `reset`, e.g. "Usage: $1
<swarm|etcd|full|reset> <host-id> [host-id ...]") and adjust the descriptive
paragraph to remove the duplicate word so it reads "three different types of
disasters" (or similar). Ensure you update both the usage line and the
description near that header.
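
One thing to verify before applying the synopsis change: in the script's own examples, `reset` is invoked without any host IDs (`$1 reset`), so listing it alongside the subcommands that require `<host-id>` may be misleading. A possible alternative, sketched below, gives `reset` its own synopsis line. The function name `usage_sketch` and the two-line layout are assumptions for illustration, not the author's wording.

```shell
#!/usr/bin/env bash

# Hypothetical variant of usage(): the name and the two-line synopsis are
# assumptions for illustration only.
usage_sketch() {
    cat <<EOF
Usage: $1 <swarm|etcd|full> <host-id> [host-id ...]
       $1 reset
EOF
}

usage_sketch simulate-disaster.sh
```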


- swarm: simulates losing a Swarm node and database instance data without losing
Etcd data
- etcd: simulates losing a Control Plane/Etcd instance without losing Swarm
quorum.
- full: simulates losing an entire host, affecting both Swarm and Control
Plane/Etcd.

NOTE: This is only intended to be run against swarm manager/etcd server hosts.

Examples:
# Simulate losing Swarm on one host
$1 swarm host-1

# Simulate losing Swarm on two hosts in order to lose quorum
$1 swarm host-1 host-3

# Simulate losing Control Plane/Etcd on one host
$1 etcd host-1

# Simulate losing Control Plane/Etcd on two hosts in order to lose quorum
$1 etcd host-1 host-3

# Simulate full loss of one host
$1 full host-1

# Simulate full loss of two hosts to lose quorum
$1 full host-1 host-3

# Reset the fixture back to its initial state
$1 reset

# Remember to include the fixture variant if you're using a non-default one
FIXTURE_VARIANT=small $1 reset
EOF
}

main() {
case $1 in
swarm)
simulate_swarm_node_loss ${@:2}
;;
etcd)
simulate_etcd_node_loss ${@:2}
;;
full)
simulate_full_node_loss ${@:2}
;;
Comment on lines +171 to +173

⚠️ Potential issue | 🔴 Critical

Fix the full dispatch: undefined function call.

full currently calls simulate_full_node_loss, but the defined function is simulate_full_loss, so the full path will fail at runtime.

✅ Suggested fix
-		full)
-			simulate_full_node_loss ${@:2}
+		full)
+			simulate_full_loss "${@:2}"
 			;;
🧰 Tools
🪛 Shellcheck (0.11.0)

[error] 172-172: Double quote array expansions to avoid re-splitting elements.

(SC2068)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/simulate-disaster.sh` around lines 171 - 173, The dispatch case labeled
"full" calls a nonexistent function simulate_full_node_loss; update that case to
call the actual function simulate_full_loss (replace simulate_full_node_loss
with simulate_full_loss) so the "full" branch invokes the defined function and
won't fail at runtime.
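
Because bash resolves function names only when a call executes, a misnamed call like this goes unnoticed until the `full` branch is actually run. As a rough sketch of how the failure surfaces, the hypothetical `call_if_defined` helper below uses the bash builtin `declare -F` to check for a function before calling it; it is not part of the script, and the stub `simulate_full_loss` only echoes its arguments.

```shell
#!/usr/bin/env bash

# Stub for illustration; the real function stops and deletes Lima machines.
simulate_full_loss() {
    echo "full loss: $*"
}

# Hypothetical helper: call a function by name only if it is defined.
call_if_defined() {
    local fn=$1
    # declare -F NAME succeeds only when NAME is a defined shell function.
    if ! declare -F "${fn}" > /dev/null; then
        echo "undefined function: ${fn}" >&2
        return 127
    fi
    "${fn}" "${@:2}"
}

call_if_defined simulate_full_loss host-1
# prints: full loss: host-1

call_if_defined simulate_full_node_loss host-1 || echo "exit status: $?"
# prints "undefined function: simulate_full_node_loss" to stderr,
# then: exit status: 127
```

The 127 here mirrors the status bash itself returns for a command that cannot be found.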

reset)
reset
;;
--help|-h)
usage $0
;;
*)
usage $0
exit 1
;;
esac
}

main $@