test: add multi-node k8s test script for cross-node gRPC transfers by nicolasnoble · Pull Request #164 · ai-dynamo/modelexpress

nicolasnoble · 2026-03-13T06:12:36Z

Summary

- Adds Rust integration tests (workspace-tests/tests/k8s_multinode_tests.rs) using the kube crate for multi-node k8s testing, all #[ignore] by default
- 5 tests: cross-node gRPC transfer with default/64KB/4MB chunk sizes, transfer integrity via md5 content hash comparison, multi-replica independent cache verification
- Minimal shell wrapper (test_multinode_k8s.sh) that builds the Docker image, distributes it via microk8s images import, and runs cargo test
- Adds kube (ws feature), k8s-openapi, and anyhow to workspace-tests dependencies

Test plan

- All 5 tests verified on a 3-node microk8s cluster
- cargo clippy passes with no warnings
- pre-commit run --all-files passes
- CI passes (tests are #[ignore], so they won't run in CI)

Summary by CodeRabbit

Tests
- Added comprehensive multi-node Kubernetes integration test suite with automated test runner and orchestration capabilities.
Chores
- Added testing-related dependencies to workspace configuration.

coderabbitai · 2026-03-13T06:25:01Z

Walkthrough

The PR introduces a multi-node Kubernetes integration testing framework for ModelExpress. It adds a Bash script to orchestrate test execution, extends workspace dependencies with Kubernetes client libraries, and provides comprehensive Rust test cases for cross-node data transfer and integrity verification scenarios.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Test Orchestration `test_multinode_k8s.sh` | New Bash script to build the multinode Docker image, distribute to Kubernetes nodes if microk8s is available, and execute Rust integration tests with cargo test, supporting skip-build flag and passthrough of cargo arguments. |
| Workspace Dependencies `workspace-tests/Cargo.toml` | Added three new dependencies: kube (with ws feature), k8s-openapi, and anyhow for Kubernetes API interactions and error handling. |
| Integration Tests `workspace-tests/tests/k8s_multinode_tests.rs` | New comprehensive test suite implementing multi-node Kubernetes integration scenarios, including node selection, namespace management, server/client deployment orchestration, cross-node data transfer validation, and content integrity verification via md5 checksums. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A rabbit's ode to Kubernetes dreams,
Nodes dance together in test-driven schemes,
From script to code, with K8s in hand,
Multi-node magic across the land! 🚀✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding a multi-node Kubernetes test script for cross-node gRPC transfers, which aligns with the three new files (test script, Cargo.toml updates, and comprehensive test suite). |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (1)

test_multinode_k8s.sh (1)
24-25: Enable pipefail for the image import pipeline.

A failing docker save can be masked here because only the last command in the pipeline controls the exit status. set -euo pipefail makes this fail closed instead of running the tests against a stale or partial image.
Suggested change
-set -e
+set -euo pipefail
Also applies to: 43-45
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test_multinode_k8s.sh` around lines 24 - 25, Replace the existing bare "set
-e" invocations in the script (the top-level shell options and the later block
around the image import pipeline) with a stricter shell option set to prevent
masked failures: use "set -euo pipefail" instead of "set -e". Update the
occurrences referenced (the initial script header where cd "$(dirname "$0")"
follows, and the later image import pipeline block currently using "set -e") so
that any failing command in a pipeline (e.g., docker save | microk8s images import)
causes the script to exit and unset variables are treated as errors.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test_multinode_k8s.sh`:
- Around line 17-22: The script's CARGO_ARGS are being appended after the
literal "--" (so they become libtest/test-binary args), which breaks Cargo
options like "--test"; update the script (test_multinode_k8s.sh) to either (A)
pass CARGO_ARGS to Cargo before the "--" separator (move where CARGO_ARGS is
injected into the final command) so Cargo sees them as Cargo flags, or (B)
rename/document CARGO_ARGS as TEST_BINARY_ARGS and leave them after "--" while
adding a separate CARGO_ARGS slot that is inserted before "--"; locate the code
that constructs the test invocation (the place that appends "-- $CARGO_ARGS")
and modify it accordingly and update the example/help text to reflect the
correct flag placement.

In `@workspace-tests/Cargo.toml`:
- Line 25: Move the "ws" feature declaration for the kube dependency to the root
workspace manifest so member crates inherit it: add "ws" to the features array
for the kube dependency in the root Cargo.toml (so tests using AttachParams and
.exec() get the feature enabled) and update the workspace-tests Cargo.toml line
referencing kube to simply kube = { workspace = true } so it inherits the root
configuration.

In `@workspace-tests/tests/k8s_multinode_tests.rs`:
- Around line 42-65: select_nodes currently returns metadata.name and treats
that as the kubernetes.io/hostname, and it also ignores NoSchedule taints and
the node's Ready condition; update select_nodes to instead pick nodes by their
actual scheduling identity and readiness: for each Node in node_list.items only
include nodes where the Ready condition is True and there is no NoSchedule taint
in spec.taints, and then either place pods by setting the pod's spec.nodeName to
the node's metadata.name or (if you truly need the hostname label) use
metadata.labels.get("kubernetes.io/hostname") when returning the two
identifiers; adjust the returned tuple and any callers that assume metadata.name
== hostname to use the chosen value (reference function select_nodes, variable
schedulable, and look at places currently using the hostname selector) so pods
are scheduled onto truly ready, untainted nodes.
- Around line 68-82: create_namespace currently uses fixed names and a
fire-and-forget delete (ns_api.delete + sleep) which causes collisions and
Terminating namespaces; change create_namespace to generate a unique namespace
name per test (e.g., append a UUID or random suffix to the provided name), and
replace the pre-create fire-and-forget delete with an awaited deletion sequence:
call ns_api.delete(&name, &DeleteParams::default()).await when a namespace
exists, then poll ns_api.get(&name).await (or handle 404) until the namespace is
fully removed (not in Terminating) before calling
ns_api.create(&PostParams::default(), &ns). Also remove reliance on
NamespaceGuard::drop() spawning un-awaited deletes — add an explicit async
cleanup method on NamespaceGuard (e.g., async fn delete_and_wait(&self, client:
&Client)) that performs ns_api.delete(...) .await and polls until deletion
completes, and invoke that in test teardown so deletions are awaited instead of
fire-and-forget.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4db5ff5a-ce29-44db-95e4-ffaa1c9f9720

📥 Commits

Reviewing files that changed from the base of the PR and between 48b393e and ecfaecc.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (3)

test_multinode_k8s.sh
workspace-tests/Cargo.toml
workspace-tests/tests/k8s_multinode_tests.rs

coderabbitai · 2026-03-13T06:25:04Z

workspace-tests/tests/k8s_multinode_tests.rs

+/// Create a namespace, returning a guard that deletes it on drop.
+async fn create_namespace(client: &Client, name: &str) -> Result<NamespaceGuard> {
+    let ns_api: Api<Namespace> = Api::all(client.clone());
+    let ns: Namespace = serde_json::from_value(serde_json::json!({
+        "apiVersion": "v1",
+        "kind": "Namespace",
+        "metadata": { "name": name }
+    }))?;
+
+    // Delete if leftover from a previous failed run
+    let _ = ns_api.delete(name, &DeleteParams::default()).await;
+    tokio::time::sleep(Duration::from_secs(2)).await;
+
+    ns_api.create(&PostParams::default(), &ns).await?;
+


⚠️ Potential issue | 🟠 Major

Static namespace names plus fire-and-forget cleanup will collide across runs.

These tests pass fixed namespace names into a helper that deletes any existing namespace with that name, sleeps two seconds, and recreates it. Because NamespaceGuard::drop() only spawns the delete and never waits for completion, reruns can hit namespaces stuck in Terminating, and concurrent runs can delete each other's resources. Use a unique namespace per test and perform awaited deletion/polling instead of fire-and-forget cleanup.

Also applies to: 94-103, 487-488, 510-511, 541-542, 574-575, 610-611
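One way to get a unique namespace per test can be sketched as below. This is a minimal, standard-library-only illustration and not the PR's code: the function name `unique_namespace`, the base name `mx-test`, and the suffix scheme (process ID, sub-second clock, in-process counter) are all assumptions made for the example. A UUID crate would serve equally well.

```rust
use std::process;
use std::sync::atomic::{AtomicU32, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

// In-process counter so two calls in the same nanosecond still differ.
static NAMESPACE_SEQ: AtomicU32 = AtomicU32::new(0);

/// Append a per-run suffix to a base namespace name (hypothetical helper).
/// Assumes `base` is already a valid, short, lowercase DNS label.
fn unique_namespace(base: &str) -> String {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.subsec_nanos())
        .unwrap_or(0);
    let seq = NAMESPACE_SEQ.fetch_add(1, Ordering::Relaxed);
    let name = format!("{}-{:x}-{:x}-{:x}", base, process::id(), nanos, seq);
    // Kubernetes namespace names are DNS labels, limited to 63 characters.
    name.chars().take(63).collect()
}
```

Because each run uses a different name, a namespace left in Terminating by an earlier run no longer blocks the next one, and concurrent runs stop deleting each other's resources.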

@nicolasnoble could you address this comment? Claude also pointed out the same issue and suggested polling until the namespace is gone (404 on GET).

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

- workspace-tests/tests/k8s_multinode_tests.rs: Rust integration tests using the kube crate, all #[ignore] by default. 5 tests:
  - Cross-node gRPC transfer (default, 64KB, 4MB chunk sizes)
  - Transfer integrity with md5 content hash comparison
  - Multi-replica independent cache verification
- test_multinode_k8s.sh: minimal wrapper that builds the Docker image, distributes it via microk8s images import, and runs cargo test
- Adds kube ws feature and anyhow to workspace-tests dependencies

Signed-off-by: Nicolas 'Pixel' Noble <nicolas@nobis-crew.org>

- Rename CARGO_ARGS to TEST_ARGS in shell script (they're libtest args, not cargo args)
- Use nodeName instead of nodeSelector with kubernetes.io/hostname label for pod placement (metadata.name is the node's API name, not necessarily its hostname label)
- Use set -euo pipefail for stricter shell error handling

Signed-off-by: Nicolas 'Pixel' Noble <nicolas@nobis-crew.org>
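The distinction behind the CARGO_ARGS to TEST_ARGS rename can be illustrated with a small sketch. The helper `build_cargo_test_args` is hypothetical and only models how an argument list is assembled; the actual wrapper is a shell script. Anything before `--` is parsed by Cargo (for example `--test <name>`), while anything after it goes to the libtest harness (for example `--ignored` or `--nocapture`).

```rust
/// Build the argument vector for a `cargo test` invocation (illustrative only).
/// `cargo_args` are placed before the `--` separator, so Cargo parses them;
/// `test_args` are placed after it, so the libtest harness receives them.
fn build_cargo_test_args(cargo_args: &[&str], test_args: &[&str]) -> Vec<String> {
    let mut args = vec!["test".to_string()];
    args.extend(cargo_args.iter().map(|s| s.to_string()));
    if !test_args.is_empty() {
        // Everything after this separator is opaque to Cargo.
        args.push("--".to_string());
        args.extend(test_args.iter().map(|s| s.to_string()));
    }
    args
}
```

With this split, a Cargo flag such as `--test k8s_multinode_tests` would belong in the first slice and `--ignored` in the second.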

nicolasnoble · 2026-03-17T18:51:22Z

Ready for review. @AndyDai-nv @zhengluo-nv @nv-hwoo @KavinKrishnan

codecov · 2026-03-17T18:54:39Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

nv-hwoo · 2026-03-18T00:23:02Z

workspace-tests/Cargo.toml

+kube = { workspace = true, features = ["ws"] }
+k8s-openapi = { workspace = true }
+anyhow = { workspace = true }


I think we should move these under [dev-dependencies], since they are only used for testing.

nv-hwoo · 2026-03-18T00:26:43Z

workspace-tests/tests/k8s_multinode_tests.rs

+        if [ $? -ne 0 ]; then echo "RESULT=FAILED"; exit 1; fi
+
+        echo "CHECKSUMS_START"
+        find /cache -type f -exec md5sum {{}} \; | awk '{{print $1}}' | sort -u


Is there a reason why the client does not filter .lock and refs/ while the server checksum does? Should we add the same filter on the client side as well?
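If both sides were to apply the same filter, the rule could be expressed once as a shared predicate, roughly as sketched here. This is an assumption-laden illustration: the function name `include_in_checksums` is hypothetical, and it interprets "filter .lock and refs/" as skipping files whose name ends in `.lock` and any path with a `refs` directory component. The PR itself does this filtering in shell, which is not shown in this thread.

```rust
use std::path::Path;

/// Return true when a cache file should contribute to the checksum set.
/// Assumed rule: skip `*.lock` files and anything under a `refs` directory.
fn include_in_checksums(path: &Path) -> bool {
    // Match on the file name so a file literally named `.lock` is also skipped.
    let is_lock = path
        .file_name()
        .and_then(|name| name.to_str())
        .map_or(false, |name| name.ends_with(".lock"));
    // Any path component equal to `refs` excludes the file.
    let under_refs = path.components().any(|c| c.as_os_str() == "refs");
    !is_lock && !under_refs
}
```

Applying one predicate on both the server and the client keeps the two checksum sets comparable; otherwise files that exist on only one side can cause a spurious mismatch.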

nv-hwoo · 2026-03-18T00:38:40Z

workspace-tests/tests/k8s_multinode_tests.rs

Thanks @nicolasnoble for adding these. They add a lot of value, since we need more k8s testing. It seems like these tests are meant to be run manually by devs locally (correct me if I'm wrong). Do you think we could also have them triggered by our CI and run on the company CI clusters (unless that requires some support from the Ops team)?

nv-hwoo · 2026-03-18T00:42:35Z

workspace-tests/tests/k8s_multinode_tests.rs

+    let svc = deploy_server(&client, ns, "mx-server", &server_node).await?;
+    submit_client_job(&client, ns, "dl-default", &client_node, &svc, "").await?;
+
+    let success = timeout(TIMEOUT, wait_for_job(&client, ns, "dl-default")).await??;


Do we need the outer timeouts when wait_for_job() already has a timeout within the function?

nv-hwoo · 2026-03-18T00:48:39Z

workspace-tests/tests/k8s_multinode_tests.rs

+/// Create a namespace, returning a guard that deletes it on drop.
+async fn create_namespace(client: &Client, name: &str) -> Result<NamespaceGuard> {
+    let ns_api: Api<Namespace> = Api::all(client.clone());
+    let ns: Namespace = serde_json::from_value(serde_json::json!({
+        "apiVersion": "v1",
+        "kind": "Namespace",
+        "metadata": { "name": name }
+    }))?;
+
+    // Delete if leftover from a previous failed run
+    let _ = ns_api.delete(name, &DeleteParams::default()).await;
+    tokio::time::sleep(Duration::from_secs(2)).await;
+
+    ns_api.create(&PostParams::default(), &ns).await?;
+


@nicolasnoble could you address this comment? Claude also pointed out the same issue and suggested polling until the namespace is gone (404 on GET).
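The suggested "poll until gone" behaviour can be sketched independently of the Kubernetes client. The version below is synchronous and standard-library-only so it stays self-contained; `Probe`, `wait_until_gone`, and the closure-based probe are hypothetical names for this example. In the real test, the probe would presumably be an awaited GET on the namespace, with a 404 response mapped to `Gone`, and the sleep would be `tokio::time::sleep`.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Result of one existence check; `Gone` stands in for a 404 on GET.
#[derive(Debug, PartialEq)]
enum Probe {
    Present,
    Gone,
}

/// Call `probe` repeatedly until it reports `Gone` or `deadline` elapses.
/// Returns the number of probes made, or an error message on timeout.
fn wait_until_gone<F>(mut probe: F, interval: Duration, deadline: Duration) -> Result<u32, String>
where
    F: FnMut() -> Probe,
{
    let start = Instant::now();
    let mut attempts: u32 = 0;
    loop {
        attempts += 1;
        if probe() == Probe::Gone {
            return Ok(attempts);
        }
        if start.elapsed() >= deadline {
            return Err(format!("still present after {} probes", attempts));
        }
        sleep(interval);
    }
}
```

Compared with a fixed two-second sleep, this waits only as long as deletion actually takes and fails loudly if the namespace is stuck in Terminating.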

nv-hwoo · 2026-03-18T00:50:49Z

workspace-tests/tests/k8s_multinode_tests.rs

+// ---------------------------------------------------------------------------
+
+/// Pick two different schedulable nodes from the cluster.
+async fn select_nodes(client: &Client) -> Result<(String, String)> {


What happens when there are no nodes available? Does it wait?
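One possible answer, shown here purely as a sketch and not as a description of what the PR's select_nodes does, is to fail fast with a clear error when fewer than two nodes qualify. `NodeInfo` and `select_two_nodes` are hypothetical stand-ins that model only the fields discussed in this review (the node's API name, its Ready condition, and its taint effects); the real code would read these from the kube `Node` objects.

```rust
/// Minimal stand-in for the Node fields that matter for selection.
#[derive(Debug)]
struct NodeInfo {
    name: String,               // metadata.name, usable as a pod's nodeName
    ready: bool,                // whether the Ready condition is True
    taint_effects: Vec<String>, // effects of the node's taints
}

/// Pick the first two ready nodes that carry no NoSchedule taint.
/// Returns an error immediately instead of waiting when too few qualify.
fn select_two_nodes(nodes: &[NodeInfo]) -> Result<(String, String), String> {
    let mut schedulable = nodes
        .iter()
        .filter(|n| n.ready && !n.taint_effects.iter().any(|e| e == "NoSchedule"))
        .map(|n| n.name.clone());
    match (schedulable.next(), schedulable.next()) {
        (Some(first), Some(second)) => Ok((first, second)),
        _ => Err(format!(
            "need 2 schedulable nodes, found fewer among {} total",
            nodes.len()
        )),
    }
}
```

Failing fast suits a manually run test, where a misconfigured cluster should be reported at once; a CI setup that provisions nodes on demand might prefer to retry instead.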

pull-request-size bot added the size/XL label Mar 13, 2026

nicolasnoble had a problem deploying to GITLAB March 13, 2026 06:12 — with GitHub Actions Failure

github-actions bot added the test label Mar 13, 2026

coderabbitai bot reviewed Mar 13, 2026

nicolasnoble force-pushed the nnoble/multinode-k8s-tests branch from ecfaecc to 84f5814 Compare March 13, 2026 11:04

nicolasnoble had a problem deploying to GITLAB March 13, 2026 11:04 — with GitHub Actions Failure

nicolasnoble requested review from AndyDai-nv, KavinKrishnan, nv-hwoo and zhengluo-nv March 17, 2026 18:18

nicolasnoble temporarily deployed to GITLAB March 17, 2026 18:46 — with GitHub Actions Inactive

nv-hwoo reviewed Mar 18, 2026
