Commit 7dbbdc6

Merge pull request #879 from nebius/feature/automate-fs-csi-driver-install

Automate filesystem CSI driver installation, tests, and cleanup

2 parents 39bd07c + f3ec9fc, commit 7dbbdc6

File tree

17 files changed: +1003 −16 lines

k8s-training/README.md

Lines changed: 18 additions & 0 deletions

@@ -127,6 +127,24 @@

For more information on how to access storage in K8s, refer [here](#accessing-storage).

### Shared filesystem CSI automation

When a shared filesystem is present, either because this stack created it or because `existing_filestore` was provided, Terraform can also install the Nebius Shared Filesystem CSI driver and promote its StorageClass to the cluster default.

```hcl
enable_filestore     = true
existing_filestore   = "" # or an existing filesystem ID
filestore_mount_path = "/mnt/data"

filesystem_csi = {
  chart_version                       = "0.1.5"
  namespace                           = "kube-system"
  make_default_storage_class          = true
  previous_default_storage_class_name = "compute-csi-default-sc"
}
```

This Terraform automation installs the CSI driver and configures the StorageClass only. Verification, pod-level validation, and cleanup remain in `filesystem-csi-validation/` as an explicit opt-in workflow.
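After `terraform apply`, the promotion can be confirmed from kubectl. A minimal sketch, with assumptions: the helper function is invented for illustration, the JSON shape it greps for is the standard default-class annotation as printed by `kubectl get ... -o json`, and the StorageClass name placeholder stands in for whatever the chart installs.

```shell
# Hypothetical helper: given the JSON of a StorageClass on stdin (for example
# from `kubectl get storageclass <name> -o json`), succeed when it carries the
# standard default-class annotation.
is_default_storage_class() {
  grep -q '"storageclass.kubernetes.io/is-default-class": *"true"'
}

# Illustrative usage (requires cluster access; <csi-storage-class-name> is a placeholder):
#   kubectl get storageclass <csi-storage-class-name> -o json | is_default_storage_class \
#     && echo "is the cluster default"
```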
## Connecting to the cluster

### Preparing the environment
Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@

.state/
Lines changed: 147 additions & 0 deletions

@@ -0,0 +1,147 @@

#!/usr/bin/env bash
# -----------------------------------------------------------------------------
# File: 01-verify-node-filesystem-mounts.sh
# Purpose:
#   Verify that the Nebius Shared Filesystem is mounted on every Kubernetes
#   node at the expected host path before any pod-level storage testing begins.
#
# Why We Run This:
#   The Nebius CSI workflow in this repo depends on the shared filesystem
#   already being attached and mounted on each node. If a node is missing the
#   host mount, later PVC or pod checks can fail in ways that are harder to
#   diagnose.
#
# Reference Docs:
#   https://docs.nebius.com/kubernetes/storage/filesystem-over-csi
#
# Repo Sources of Truth:
#   - ../../modules/cloud-init/k8s-cloud-init.tftpl
#   - ../main.tf
#
# What This Script Checks:
#   - The mount exists at /mnt/data (or the value of MOUNT_POINT)
#   - The mount is present in /etc/fstab
#   - The mounted filesystem reports capacity via df
#   - The target directory exists on the host
#
# Usage:
#   ./01-verify-node-filesystem-mounts.sh
#
# Optional Environment Variables:
#   TEST_NAMESPACE    Namespace used for the temporary node-debugger pods.
#                     Defaults to the current kubectl namespace or "default".
#   MOUNT_POINT       Host path to validate. Defaults to the Terraform mount.
#   DEBUG_IMAGE       Image used by kubectl debug. Defaults to "ubuntu".
#   VERIFY_ALL_NODES  When "true", validates every node in the cluster.
#                     Defaults to "false".
#   TARGET_NODE       Specific node to validate. Accepts either node/<name>
#                     or <name>. Overrides VERIFY_ALL_NODES.
#
# Created By: Aaron Fagan
# Created On: 2026-03-17
# Version: 0.1.0
# -----------------------------------------------------------------------------
set -euo pipefail

SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
source "${SCRIPT_DIR}/common.sh"

DEBUG_IMAGE="${DEBUG_IMAGE:-ubuntu}"
VERIFY_ALL_NODES="${VERIFY_ALL_NODES:-false}"
TARGET_NODE="${TARGET_NODE:-}"
FAILED=0

normalize_node_name() {
  local node_name="$1"
  if [[ "${node_name}" == node/* ]]; then
    printf '%s\n' "${node_name}"
  else
    printf 'node/%s\n' "${node_name}"
  fi
}

log_step "Starting Nebius Shared Filesystem mount verification"
log_info "Namespace for temporary debug pods: ${TEST_NAMESPACE}"
log_info "Expected mount point: ${MOUNT_POINT}"
log_info "Debug image: ${DEBUG_IMAGE}"

log_step "Checking required local dependencies"
require_command kubectl
require_command awk
require_command mktemp
log_pass "Required local commands for node mount verification are available"

log_step "Preparing local state for debugger pod cleanup"
ensure_state_dir
touch "${DEBUG_POD_RECORD_FILE}"
log_info "Debugger pod record file: ${DEBUG_POD_RECORD_FILE}"
log_info "New debugger pods from this run will be appended for later cleanup"

log_step "Selecting which nodes to validate"
ALL_NODES=()
while IFS= read -r node; do
  [[ -n "${node}" ]] && ALL_NODES+=("${node}")
done < <(kubectl get nodes -o name)

if [[ "${#ALL_NODES[@]}" -eq 0 ]]; then
  log_fail "No Kubernetes nodes were returned by kubectl"
  exit 1
fi

if [[ -n "${TARGET_NODE}" ]]; then
  TARGET_NODE="$(normalize_node_name "${TARGET_NODE}")"
  NODES_TO_CHECK=("${TARGET_NODE}")
  log_info "Using explicitly requested node: ${TARGET_NODE}"
elif [[ "${VERIFY_ALL_NODES}" == "true" ]]; then
  NODES_TO_CHECK=("${ALL_NODES[@]}")
  log_info "VERIFY_ALL_NODES=true, so every node will be checked"
else
  NODES_TO_CHECK=("${ALL_NODES[0]}")
  log_info "Defaulting to a single-node validation using: ${NODES_TO_CHECK[0]}"
fi

log_pass "Selected ${#NODES_TO_CHECK[@]} node(s) for shared filesystem mount validation"

log_step "Checking Nebius Shared Filesystem mounts on the selected Kubernetes nodes"
for node in "${NODES_TO_CHECK[@]}"; do
  echo
  echo "------------------------------------------------------------"
  echo "=== ${node} ==="
  output_file="$(mktemp)"
  if ! kubectl debug -n "${TEST_NAMESPACE}" "${node}" \
      --attach=true \
      --quiet \
      --image="${DEBUG_IMAGE}" \
      --profile=sysadmin -- \
      chroot /host sh -lc "
        set -eu
        echo '[check] Verifying that the Nebius Shared Filesystem is actively mounted at ${MOUNT_POINT}'
        mount | awk '\$3 == \"${MOUNT_POINT}\" { print; found=1 } END { exit found ? 0 : 1 }'
        echo '[check] Verifying that the mount is persisted in /etc/fstab for node reboot safety'
        awk '\$2 == \"${MOUNT_POINT}\" { print; found=1 } END { exit found ? 0 : 1 }' /etc/fstab
        echo '[check] Verifying that the mounted filesystem reports capacity and is readable'
        df -h ${MOUNT_POINT}
        echo '[check] Verifying that the target directory exists on the host'
        test -d ${MOUNT_POINT}
        echo '[result] PASS: shared filesystem host mount is active and healthy at ${MOUNT_POINT} on this node'
      " 2>&1 | tee "${output_file}"; then
    FAILED=1
    echo "[result] FAIL: ${node} does not have a healthy shared filesystem mount at ${MOUNT_POINT}" >&2
  fi

  debug_pod_name="$(awk '/Creating debugging pod / { print $4 }' "${output_file}" | tail -n 1)"
  if [[ -n "${debug_pod_name}" ]]; then
    printf '%s %s\n' "${TEST_NAMESPACE}" "${debug_pod_name}" >> "${DEBUG_POD_RECORD_FILE}"
  fi
  rm -f "${output_file}"
done

if [[ "${FAILED}" -eq 0 ]]; then
  log_step "Shared filesystem mount verification completed successfully"
  log_info "All checked nodes reported a healthy mount at ${MOUNT_POINT}"
else
  log_step "Shared filesystem mount verification completed with failures"
  log_info "Review the node output above for the failing mount checks"
fi

exit "${FAILED}"
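The awk membership checks the script runs on each node can be exercised locally against sample input. A sketch under stated assumptions: the fstab content is fabricated, and `awk -v` is used here in place of the script's inline `${MOUNT_POINT}` interpolation.

```shell
MOUNT_POINT="/mnt/data"

# Fabricated /etc/fstab content standing in for the node's real file.
fstab_sample='fs-id /mnt/data virtiofs defaults 0 0
/dev/vda1 / ext4 defaults 0 1'

# Same membership test as the script: field 2 must equal the mount point,
# and the END block turns "found" into the exit status.
if printf '%s\n' "$fstab_sample" \
    | awk -v mp="$MOUNT_POINT" '$2 == mp { found=1 } END { exit found ? 0 : 1 }'; then
  echo "fstab entry present for $MOUNT_POINT"
else
  echo "fstab entry missing for $MOUNT_POINT"
fi
```

The same pattern with `$3` instead of `$2` covers the `mount` output check, since `mount` prints the mount point as the third field.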
Lines changed: 98 additions & 0 deletions

@@ -0,0 +1,98 @@

#!/usr/bin/env bash
# -----------------------------------------------------------------------------
# File: 02-run-csi-smoke-test.sh
# Purpose:
#   Run a minimal end-to-end validation using one PVC and one pod that mounts
#   the shared volume at /data.
#
# Why We Run This:
#   This is the fastest proof that the Terraform-managed default StorageClass
#   works, the PVC binds, and a pod can read and write data through the
#   shared filesystem exposed through CSI.
#
# Reference Docs:
#   https://docs.nebius.com/kubernetes/storage/filesystem-over-csi
#
# What This Script Does:
#   - Applies the single-pod smoke test manifest
#   - Waits for the PVC to bind
#   - Verifies that the PVC inherited the expected default StorageClass
#   - Waits for the pod to become ready
#   - Writes and reads a small probe file inside /data
#
# Usage:
#   ./02-run-csi-smoke-test.sh
#
# Optional Environment Variables:
#   TEST_NAMESPACE  Namespace where the validation resources should be created.
#                   Defaults to the current kubectl namespace or "default".
#
# Manifest Used:
#   manifests/01-csi-smoke-test.yaml
#
# Created By: Aaron Fagan
# Created On: 2026-03-17
# Version: 0.1.0
# -----------------------------------------------------------------------------
set -euo pipefail

SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
source "${SCRIPT_DIR}/common.sh"

log_step "Starting single-pod shared filesystem smoke test"
log_info "Namespace: ${TEST_NAMESPACE}"
log_info "Manifest: ${FILESYSTEM_SMOKE_MANIFEST_PATH}"
log_info "PVC name: ${FILESYSTEM_SMOKE_PVC_NAME}"
log_info "Pod name: ${FILESYSTEM_SMOKE_POD_NAME}"
log_info "Expected default StorageClass: ${FILESYSTEM_DEFAULT_STORAGE_CLASS_NAME}"

log_step "Checking required local dependencies"
require_command kubectl
log_pass "Required local commands for the smoke test are available"

log_step "Applying the smoke test manifest"
kubectl apply -n "${TEST_NAMESPACE}" -f "${FILESYSTEM_SMOKE_MANIFEST_PATH}"
log_pass "Smoke test manifest applied in namespace '${TEST_NAMESPACE}'"

log_step "Waiting for the smoke test PVC to bind"
kubectl wait -n "${TEST_NAMESPACE}" \
  --for=jsonpath='{.status.phase}'=Bound \
  "pvc/${FILESYSTEM_SMOKE_PVC_NAME}" \
  --timeout=120s
log_info "PVC '${FILESYSTEM_SMOKE_PVC_NAME}' is bound"
log_pass "Smoke test PVC '${FILESYSTEM_SMOKE_PVC_NAME}' bound successfully"

log_step "Verifying that the smoke test PVC inherited the default StorageClass"
SMOKE_STORAGE_CLASS_NAME="$(kubectl get pvc -n "${TEST_NAMESPACE}" "${FILESYSTEM_SMOKE_PVC_NAME}" -o jsonpath='{.spec.storageClassName}')"
if [[ -z "${SMOKE_STORAGE_CLASS_NAME}" ]]; then
  log_fail "Smoke test PVC '${FILESYSTEM_SMOKE_PVC_NAME}' did not receive a StorageClass from the cluster default"
  exit 1
fi
if [[ "${SMOKE_STORAGE_CLASS_NAME}" != "${FILESYSTEM_DEFAULT_STORAGE_CLASS_NAME}" ]]; then
  log_fail "Smoke test PVC '${FILESYSTEM_SMOKE_PVC_NAME}' used StorageClass '${SMOKE_STORAGE_CLASS_NAME}', expected '${FILESYSTEM_DEFAULT_STORAGE_CLASS_NAME}'"
  exit 1
fi
log_info "PVC '${FILESYSTEM_SMOKE_PVC_NAME}' was assigned StorageClass '${SMOKE_STORAGE_CLASS_NAME}'"
log_pass "Smoke test PVC '${FILESYSTEM_SMOKE_PVC_NAME}' inherited the expected default StorageClass"

log_step "Waiting for the smoke test pod to become ready"
kubectl wait -n "${TEST_NAMESPACE}" \
  --for=condition=Ready \
  "pod/${FILESYSTEM_SMOKE_POD_NAME}" \
  --timeout=120s
log_info "Pod '${FILESYSTEM_SMOKE_POD_NAME}' is ready"
log_pass "Smoke test pod '${FILESYSTEM_SMOKE_POD_NAME}' reached Ready state"

log_step "Writing and reading a probe file through the mounted volume"
kubectl exec -n "${TEST_NAMESPACE}" "${FILESYSTEM_SMOKE_POD_NAME}" -- sh -lc '
  set -eu
  df -h /data
  echo ok > /data/probe.txt
  ls -l /data
  cat /data/probe.txt
'
log_pass "Pod '${FILESYSTEM_SMOKE_POD_NAME}' successfully wrote and read the probe file on the shared volume"

log_step "Smoke test completed successfully"
log_info "The PVC inherited the cluster default StorageClass and the mounted shared filesystem accepted a write and returned the probe file"
log_pass "Single-pod shared filesystem smoke test confirmed default StorageClass behavior and working storage access"
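The script's two-branch StorageClass verification (empty name, then name mismatch) could be factored into a single reusable check. A sketch, with assumptions: the function name is invented, and in the script the first argument would come from the `kubectl get pvc ... -o jsonpath='{.spec.storageClassName}'` call.

```shell
# Hypothetical refactor of the script's PVC StorageClass verification.
# $1: StorageClass the PVC actually received; $2: expected cluster default.
verify_pvc_storage_class() {
  actual="$1"
  expected="$2"
  if [ -z "${actual}" ]; then
    echo "FAIL: PVC did not receive a StorageClass from the cluster default" >&2
    return 1
  fi
  if [ "${actual}" != "${expected}" ]; then
    echo "FAIL: PVC used StorageClass '${actual}', expected '${expected}'" >&2
    return 1
  fi
  echo "PASS: PVC was assigned the expected StorageClass '${actual}'"
}
```

The same helper would serve the RWX script below it, which repeats the identical two-branch check under different variable names.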
Lines changed: 119 additions & 0 deletions

@@ -0,0 +1,119 @@

#!/usr/bin/env bash
# -----------------------------------------------------------------------------
# File: 03-run-csi-rwx-cross-node-test.sh
# Purpose:
#   Validate ReadWriteMany behavior across nodes by mounting the same PVC into
#   two pods scheduled onto different hosts.
#
# Why We Run This:
#   A single-pod test proves basic functionality, but shared filesystems are
#   most valuable when data written from one node can be read from another.
#   This script confirms that cross-node sharing works in practice.
#
# Reference Docs:
#   https://docs.nebius.com/kubernetes/storage/filesystem-over-csi
#
# What This Script Does:
#   - Applies a RWX PVC plus reader/writer pod manifest
#   - Uses pod anti-affinity to encourage placement on different nodes
#   - Waits for the PVC and both pods to become ready
#   - Verifies that the PVC inherited the expected default StorageClass
#   - Writes a file from one pod and reads it from the other
#
# Usage:
#   ./03-run-csi-rwx-cross-node-test.sh
#
# Optional Environment Variables:
#   TEST_NAMESPACE  Namespace where the validation resources should be created.
#                   Defaults to the current kubectl namespace or "default".
#
# Manifest Used:
#   manifests/02-csi-rwx-cross-node.yaml
#
# Created By: Aaron Fagan
# Created On: 2026-03-17
# Version: 0.1.0
# -----------------------------------------------------------------------------
set -euo pipefail

SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
source "${SCRIPT_DIR}/common.sh"

log_step "Starting cross-node RWX validation"
log_info "Namespace: ${TEST_NAMESPACE}"
log_info "Manifest: ${FILESYSTEM_RWX_MANIFEST_PATH}"
log_info "PVC name: ${FILESYSTEM_RWX_PVC_NAME}"
log_info "Writer pod: ${FILESYSTEM_RWX_WRITER_POD_NAME}"
log_info "Reader pod: ${FILESYSTEM_RWX_READER_POD_NAME}"
log_info "Expected default StorageClass: ${FILESYSTEM_DEFAULT_STORAGE_CLASS_NAME}"

log_step "Checking required local dependencies"
require_command kubectl
log_pass "Required local commands for the RWX validation are available"

log_step "Applying the RWX validation manifest"
kubectl apply -n "${TEST_NAMESPACE}" -f "${FILESYSTEM_RWX_MANIFEST_PATH}"
log_pass "RWX validation manifest applied in namespace '${TEST_NAMESPACE}'"

log_step "Waiting for the RWX PVC to bind"
kubectl wait -n "${TEST_NAMESPACE}" \
  --for=jsonpath='{.status.phase}'=Bound \
  "pvc/${FILESYSTEM_RWX_PVC_NAME}" \
  --timeout=120s
log_info "PVC '${FILESYSTEM_RWX_PVC_NAME}' is bound"
log_pass "RWX PVC '${FILESYSTEM_RWX_PVC_NAME}' bound successfully"

log_step "Verifying that the RWX PVC inherited the default StorageClass"
RWX_STORAGE_CLASS_NAME="$(kubectl get pvc -n "${TEST_NAMESPACE}" "${FILESYSTEM_RWX_PVC_NAME}" -o jsonpath='{.spec.storageClassName}')"
if [[ -z "${RWX_STORAGE_CLASS_NAME}" ]]; then
  log_fail "RWX PVC '${FILESYSTEM_RWX_PVC_NAME}' did not receive a StorageClass from the cluster default"
  exit 1
fi
if [[ "${RWX_STORAGE_CLASS_NAME}" != "${FILESYSTEM_DEFAULT_STORAGE_CLASS_NAME}" ]]; then
  log_fail "RWX PVC '${FILESYSTEM_RWX_PVC_NAME}' used StorageClass '${RWX_STORAGE_CLASS_NAME}', expected '${FILESYSTEM_DEFAULT_STORAGE_CLASS_NAME}'"
  exit 1
fi
log_info "PVC '${FILESYSTEM_RWX_PVC_NAME}' was assigned StorageClass '${RWX_STORAGE_CLASS_NAME}'"
log_pass "RWX PVC '${FILESYSTEM_RWX_PVC_NAME}' inherited the expected default StorageClass"

log_step "Waiting for both RWX test pods to become ready"
kubectl wait -n "${TEST_NAMESPACE}" \
  --for=condition=Ready \
  "pod/${FILESYSTEM_RWX_WRITER_POD_NAME}" \
  --timeout=180s
kubectl wait -n "${TEST_NAMESPACE}" \
  --for=condition=Ready \
  "pod/${FILESYSTEM_RWX_READER_POD_NAME}" \
  --timeout=180s
log_info "Both RWX test pods are ready"
log_pass "RWX writer and reader pods both reached Ready state"

log_step "Checking the node placement for the reader and writer pods"
WRITER_NODE="$(kubectl get pod -n "${TEST_NAMESPACE}" "${FILESYSTEM_RWX_WRITER_POD_NAME}" -o jsonpath='{.spec.nodeName}')"
READER_NODE="$(kubectl get pod -n "${TEST_NAMESPACE}" "${FILESYSTEM_RWX_READER_POD_NAME}" -o jsonpath='{.spec.nodeName}')"

echo "writer node: ${WRITER_NODE}"
echo "reader node: ${READER_NODE}"

kubectl get pods -n "${TEST_NAMESPACE}" "${FILESYSTEM_RWX_WRITER_POD_NAME}" "${FILESYSTEM_RWX_READER_POD_NAME}" -o wide
log_pass "RWX pod placement details collected for both nodes"

log_step "Writing shared data from the writer pod"
kubectl exec -n "${TEST_NAMESPACE}" "${FILESYSTEM_RWX_WRITER_POD_NAME}" -- sh -lc '
  set -eu
  echo "shared-check" > /data/shared.txt
  cat /data/shared.txt
'
log_pass "Writer pod '${FILESYSTEM_RWX_WRITER_POD_NAME}' wrote shared data to the mounted volume"

log_step "Reading the same shared data from the reader pod"
kubectl exec -n "${TEST_NAMESPACE}" "${FILESYSTEM_RWX_READER_POD_NAME}" -- sh -lc '
  set -eu
  ls -l /data
  cat /data/shared.txt
'
log_pass "Reader pod '${FILESYSTEM_RWX_READER_POD_NAME}' read the shared file created by the writer pod"

log_step "Cross-node RWX validation completed successfully"
log_info "The PVC inherited the cluster default StorageClass and the same file was visible from both pods through the shared volume"
log_pass "Cross-node ReadWriteMany storage behavior and default StorageClass inheritance confirmed"
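One detail worth noting about the script above: pod anti-affinity only encourages separate placement, and the script records the two node names without comparing them. A hedged sketch of an optional follow-up check (the function name is invented; this is not part of the committed script):

```shell
# Hypothetical extra check: flag the case where both RWX pods landed on the
# same node, since then cross-node sharing was not actually exercised.
check_cross_node_placement() {
  writer_node="$1"
  reader_node="$2"
  if [ "${writer_node}" = "${reader_node}" ]; then
    echo "WARN: both pods ran on '${writer_node}'; RWX was only exercised on one node" >&2
    return 1
  fi
  echo "Writer and reader pods ran on distinct nodes: ${writer_node} and ${reader_node}"
}

# In the script this would run after WRITER_NODE and READER_NODE are read,
# downgraded to a warning so a small cluster does not fail the whole test:
#   check_cross_node_placement "${WRITER_NODE}" "${READER_NODE}" || true
```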
