Skip to content

test: add multi-node k8s test script for cross-node gRPC transfers#164

Open
nicolasnoble wants to merge 2 commits intomainfrom
nnoble/multinode-k8s-tests
Open

test: add multi-node k8s test script for cross-node gRPC transfers#164
nicolasnoble wants to merge 2 commits intomainfrom
nnoble/multinode-k8s-tests

Conversation

@nicolasnoble
Copy link
Contributor

@nicolasnoble nicolasnoble commented Mar 13, 2026

Summary

  • Adds Rust integration tests (workspace-tests/tests/k8s_multinode_tests.rs) using the kube crate for multi-node k8s testing, all #[ignore] by default
  • 5 tests: cross-node gRPC transfer with default/64KB/4MB chunk sizes, transfer integrity via md5 content hash comparison, multi-replica independent cache verification
  • Minimal shell wrapper (test_multinode_k8s.sh) that builds the Docker image, distributes via microk8s images import, and runs cargo test
  • Adds kube (ws feature), k8s-openapi, and anyhow to workspace-tests dependencies

Test plan

  • All 5 tests verified on 3-node microk8s cluster
  • cargo clippy passes with no warnings
  • pre-commit run --all-files passes
  • CI passes (tests are #[ignore], won't run in CI)

Summary by CodeRabbit

  • Tests

    • Added comprehensive multi-node Kubernetes integration test suite with automated test runner and orchestration capabilities.
  • Chores

    • Added testing-related dependencies to workspace configuration.

@coderabbitai
Copy link

coderabbitai bot commented Mar 13, 2026

Walkthrough

The PR introduces a multi-node Kubernetes integration testing framework for ModelExpress. It adds a Bash script to orchestrate test execution, extends workspace dependencies with Kubernetes client libraries, and provides comprehensive Rust test cases for cross-node data transfer and integrity verification scenarios.

Changes

Cohort / File(s) Summary
Test Orchestration
test_multinode_k8s.sh
New Bash script to build the multinode Docker image, distribute to Kubernetes nodes if microk8s is available, and execute Rust integration tests with cargo test, supporting skip-build flag and passthrough of cargo arguments.
Workspace Dependencies
workspace-tests/Cargo.toml
Added three new dependencies: kube (with ws feature), k8s-openapi, and anyhow for Kubernetes API interactions and error handling.
Integration Tests
workspace-tests/tests/k8s_multinode_tests.rs
New comprehensive test suite implementing multi-node Kubernetes integration scenarios, including node selection, namespace management, server/client deployment orchestration, cross-node data transfer validation, and content integrity verification via md5 checksums.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A rabbit's ode to Kubernetes dreams,
Nodes dance together in test-driven schemes,
From script to code, with K8s in hand,
Multi-node magic across the land! 🚀✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: adding a multi-node Kubernetes test script for cross-node gRPC transfers, which aligns with the three new files (test script, Cargo.toml updates, and comprehensive test suite).
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can validate your CodeRabbit configuration file in your editor.

If your editor has YAML language server, you can enable auto-completion and validation by adding # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json at the top of your CodeRabbit configuration file.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
test_multinode_k8s.sh (1)

24-25: Enable pipefail for the image import pipeline.

A failing docker save can be masked here because only the last command in the pipeline controls the exit status. set -euo pipefail makes this fail closed instead of running the tests against a stale or partial image.

Suggested change
-set -e
+set -euo pipefail

Also applies to: 43-45

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test_multinode_k8s.sh` around lines 24 - 25, Replace the existing bare "set
-e" invocations in the script (the top-level shell options and the later block
around the image import pipeline) with a stricter shell option set to prevent
masked failures: use "set -euo pipefail" instead of "set -e". Update the
occurrences referenced (the initial script header where cd "$(dirname "$0")"
follows, and the later image import pipeline block currently using "set -e") so
that any failing command in a pipeline (e.g., docker save | docker import)
causes the script to exit and unset variables are treated as errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test_multinode_k8s.sh`:
- Around line 17-22: The script's CARGO_ARGS are being appended after the
literal "--" (so they become libtest/test-binary args), which breaks Cargo
options like "--test"; update the script (test_multinode_k8s.sh) to either (A)
pass CARGO_ARGS to Cargo before the "--" separator (move where CARGO_ARGS is
injected into the final command) so Cargo sees them as Cargo flags, or (B)
rename/document CARGO_ARGS as TEST_BINARY_ARGS and leave them after "--" while
adding a separate CARGO_ARGS slot that is inserted before "--"; locate the code
that constructs the test invocation (the place that appends "-- $CARGO_ARGS")
and modify it accordingly and update the example/help text to reflect the
correct flag placement.

In `@workspace-tests/Cargo.toml`:
- Line 25: Move the "ws" feature declaration for the kube dependency to the root
workspace manifest so member crates inherit it: add "ws" to the features array
for the kube dependency in the root Cargo.toml (so tests using AttachParams and
.exec() get the feature enabled) and update the workspace-tests Cargo.toml line
referencing kube to simply kube = { workspace = true } so it inherits the root
configuration.

In `@workspace-tests/tests/k8s_multinode_tests.rs`:
- Around line 42-65: select_nodes currently returns metadata.name and treats
that as the kubernetes.io/hostname, and it also ignores NoSchedule taints and
PodReady status; update select_nodes to instead pick nodes by their actual
scheduling identity and readiness: for each Node in node_list.items only include
nodes where the Ready condition is True, there is no NoSchedule taint in
spec.taints, and then prefer spec.node_name (or if you truly need the hostname
label, use metadata.labels.get("kubernetes.io/hostname")) when returning the two
identifiers; adjust the returned tuple and any callers that assume metadata.name
== hostname to use the chosen value (reference function select_nodes, variable
schedulable, and look at places currently using the hostname selector) so pods
are scheduled onto truly ready, untainted nodes.
- Around line 68-82: create_namespace currently uses fixed names and a
fire-and-forget delete (ns_api.delete + sleep) which causes collisions and
Terminating namespaces; change create_namespace to generate a unique namespace
name per test (e.g., append a UUID or random suffix to the provided name), and
replace the pre-create fire-and-forget delete with an awaited deletion sequence:
call ns_api.delete(&name, &DeleteParams::default()).await when a namespace
exists, then poll ns_api.get(&name).await (or handle 404) until the namespace is
fully removed (not in Terminating) before calling
ns_api.create(&PostParams::default(), &ns). Also remove reliance on
NamespaceGuard::drop() spawning un-awaited deletes — add an explicit async
cleanup method on NamespaceGuard (e.g., async fn delete_and_wait(&self, client:
&Client)) that performs ns_api.delete(...) .await and polls until deletion
completes, and invoke that in test teardown so deletions are awaited instead of
fire-and-forget.

---

Nitpick comments:
In `@test_multinode_k8s.sh`:
- Around line 24-25: Replace the existing bare "set -e" invocations in the
script (the top-level shell options and the later block around the image import
pipeline) with a stricter shell option set to prevent masked failures: use "set
-euo pipefail" instead of "set -e". Update the occurrences referenced (the
initial script header where cd "$(dirname "$0")" follows, and the later image
import pipeline block currently using "set -e") so that any failing command in a
pipeline (e.g., docker save | docker import) causes the script to exit and unset
variables are treated as errors.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4db5ff5a-ce29-44db-95e4-ffaa1c9f9720

📥 Commits

Reviewing files that changed from the base of the PR and between 48b393e and ecfaecc.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • test_multinode_k8s.sh
  • workspace-tests/Cargo.toml
  • workspace-tests/tests/k8s_multinode_tests.rs

Comment on lines +68 to +82
/// Create a namespace, returning a guard that deletes it on drop.
async fn create_namespace(client: &Client, name: &str) -> Result<NamespaceGuard> {
let ns_api: Api<Namespace> = Api::all(client.clone());
let ns: Namespace = serde_json::from_value(serde_json::json!({
"apiVersion": "v1",
"kind": "Namespace",
"metadata": { "name": name }
}))?;

// Delete if leftover from a previous failed run
let _ = ns_api.delete(name, &DeleteParams::default()).await;
tokio::time::sleep(Duration::from_secs(2)).await;

ns_api.create(&PostParams::default(), &ns).await?;

Copy link

@coderabbitai coderabbitai bot Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Static namespace names plus fire-and-forget cleanup will collide across runs.

These tests pass fixed namespace names into a helper that deletes any existing namespace with that name, sleeps two seconds, and recreates it. Because NamespaceGuard::drop() only spawns the delete and never waits for completion, reruns can hit namespaces stuck in Terminating, and concurrent runs can delete each other's resources. Use a unique namespace per test and perform awaited deletion/polling instead of fire-and-forget cleanup.

Also applies to: 94-103, 487-488, 510-511, 541-542, 574-575, 610-611

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@workspace-tests/tests/k8s_multinode_tests.rs` around lines 68 - 82,
create_namespace currently uses fixed names and a fire-and-forget delete
(ns_api.delete + sleep) which causes collisions and Terminating namespaces;
change create_namespace to generate a unique namespace name per test (e.g.,
append a UUID or random suffix to the provided name), and replace the pre-create
fire-and-forget delete with an awaited deletion sequence: call
ns_api.delete(&name, &DeleteParams::default()).await when a namespace exists,
then poll ns_api.get(&name).await (or handle 404) until the namespace is fully
removed (not in Terminating) before calling
ns_api.create(&PostParams::default(), &ns). Also remove reliance on
NamespaceGuard::drop() spawning un-awaited deletes — add an explicit async
cleanup method on NamespaceGuard (e.g., async fn delete_and_wait(&self, client:
&Client)) that performs ns_api.delete(...) .await and polls until deletion
completes, and invoke that in test teardown so deletions are awaited instead of
fire-and-forget.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nicolasnoble could you address this comment? Claude also pointed the same issue and suggested polling until the namespace is gone (404 on GET).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

- workspace-tests/tests/k8s_multinode_tests.rs: Rust integration tests using
  the kube crate, all #[ignore] by default. 5 tests:
  - Cross-node gRPC transfer (default, 64KB, 4MB chunk sizes)
  - Transfer integrity with md5 content hash comparison
  - Multi-replica independent cache verification
- test_multinode_k8s.sh: minimal wrapper that builds the Docker image,
  distributes it via microk8s images import, and runs cargo test
- Adds kube ws feature and anyhow to workspace-tests dependencies

Signed-off-by: Nicolas 'Pixel' Noble <nicolas@nobis-crew.org>
- Rename CARGO_ARGS to TEST_ARGS in shell script (they're libtest args, not cargo args)
- Use nodeName instead of nodeSelector with kubernetes.io/hostname label for pod placement (metadata.name is the node's API name, not necessarily its hostname label)
- Use set -euo pipefail for stricter shell error handling

Signed-off-by: Nicolas 'Pixel' Noble <nicolas@nobis-crew.org>
@nicolasnoble
Copy link
Contributor Author

Ready for review. @AndyDai-nv @zhengluo-nv @nv-hwoo @KavinKrishnan

@codecov
Copy link

codecov bot commented Mar 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Comment on lines +25 to +27
kube = { workspace = true, features = ["ws"] }
k8s-openapi = { workspace = true }
anyhow = { workspace = true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should move these to under dev dependencies as they are only used for testing.

if [ $? -ne 0 ]; then echo "RESULT=FAILED"; exit 1; fi

echo "CHECKSUMS_START"
find /cache -type f -exec md5sum {{}} \; | awk '{{print $1}}' | sort -u
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why client does not filter .lock and refs/ while the server checksum does? Should we add the same filter on the client side as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nicolasnoble for adding these. These are great value adds since we need more k8s testings. It seems like this test is to be ran manually by devs locally (correct me if I'm wrong). Do you think we can also make these to be triggered by our CI and make it run on the company CI clusters? (unless that requires some support from Ops team)

let svc = deploy_server(&client, ns, "mx-server", &server_node).await?;
submit_client_job(&client, ns, "dl-default", &client_node, &svc, "").await?;

let success = timeout(TIMEOUT, wait_for_job(&client, ns, "dl-default")).await??;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the outer timeouts when wait_for_job() already has a timeout within the function?

Comment on lines +68 to +82
/// Create a namespace, returning a guard that deletes it on drop.
async fn create_namespace(client: &Client, name: &str) -> Result<NamespaceGuard> {
let ns_api: Api<Namespace> = Api::all(client.clone());
let ns: Namespace = serde_json::from_value(serde_json::json!({
"apiVersion": "v1",
"kind": "Namespace",
"metadata": { "name": name }
}))?;

// Delete if leftover from a previous failed run
let _ = ns_api.delete(name, &DeleteParams::default()).await;
tokio::time::sleep(Duration::from_secs(2)).await;

ns_api.create(&PostParams::default(), &ns).await?;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nicolasnoble could you address this comment? Claude also pointed the same issue and suggested polling until the namespace is gone (404 on GET).

// ---------------------------------------------------------------------------

/// Pick two different schedulable nodes from the cluster.
async fn select_nodes(client: &Client) -> Result<(String, String)> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when there are no nodes available? Does it wait?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants