test: add multi-node k8s test script for cross-node gRPC transfers #164
nicolasnoble wants to merge 2 commits into main from
Conversation
Walkthrough

The PR introduces a multi-node Kubernetes integration testing framework for ModelExpress. It adds a Bash script to orchestrate test execution, extends workspace dependencies with Kubernetes client libraries, and provides comprehensive Rust test cases for cross-node data transfer and integrity verification scenarios.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 passed

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 4
🧹 Nitpick comments (1)
test_multinode_k8s.sh (1)
24-25: Enable `pipefail` for the image import pipeline.

A failing `docker save` can be masked here because only the last command in the pipeline controls the exit status. `set -euo pipefail` makes this fail closed instead of running the tests against a stale or partial image.

Suggested change:

```diff
-set -e
+set -euo pipefail
```

Also applies to: 43-45
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test_multinode_k8s.sh` around lines 24 - 25, Replace the existing bare "set -e" invocations in the script (the top-level shell options and the later block around the image import pipeline) with a stricter shell option set to prevent masked failures: use "set -euo pipefail" instead of "set -e". Update the occurrences referenced (the initial script header where cd "$(dirname "$0")" follows, and the later image import pipeline block currently using "set -e") so that any failing command in a pipeline (e.g., docker save | docker import) causes the script to exit and unset variables are treated as errors.
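As a quick illustration of why the stricter options matter, the sketch below shows how `pipefail` changes a pipeline's exit status when its first stage fails; the `docker save | docker import` pipeline in the script behaves the same way. (Standalone example, not the script itself.)

```shell
# Without pipefail, a pipeline's exit status is the LAST command's,
# so a failing first stage (like a broken `docker save`) is masked.
false | true
masked=$?                      # the failure of `false` is invisible here

# With pipefail (run in a subshell to keep it scoped), any failing
# stage makes the whole pipeline fail.
caught=0
( set -o pipefail; false | true ) || caught=$?

echo "masked=$masked caught=$caught"
```

Combined with `-e` (exit on error) and `-u` (error on unset variables), this is what makes the script fail closed.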
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test_multinode_k8s.sh`:
- Around line 17-22: The script's CARGO_ARGS are being appended after the
literal "--" (so they become libtest/test-binary args), which breaks Cargo
options like "--test"; update the script (test_multinode_k8s.sh) to either (A)
pass CARGO_ARGS to Cargo before the "--" separator (move where CARGO_ARGS is
injected into the final command) so Cargo sees them as Cargo flags, or (B)
rename/document CARGO_ARGS as TEST_BINARY_ARGS and leave them after "--" while
adding a separate CARGO_ARGS slot that is inserted before "--"; locate the code
that constructs the test invocation (the place that appends "-- $CARGO_ARGS")
and modify it accordingly and update the example/help text to reflect the
correct flag placement.
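A minimal sketch of the two argument slots (variable names and flag values here are illustrative): everything before `--` is consumed by Cargo itself, everything after it is passed through to the compiled test binary (libtest).

```shell
# Cargo flags (e.g. selecting a test target) must come before `--`;
# libtest flags (e.g. --ignored, --nocapture) go after it.
CARGO_ARGS="--test k8s_multinode_tests"   # consumed by cargo
TEST_ARGS="--ignored --nocapture"         # consumed by the test binary

cmd="cargo test $CARGO_ARGS -- $TEST_ARGS"
echo "$cmd"
```

Appending `CARGO_ARGS` after `--`, as the script originally did, would hand `--test` to libtest, which rejects it.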
In `@workspace-tests/Cargo.toml`:
- Line 25: Move the "ws" feature declaration for the kube dependency to the root
workspace manifest so member crates inherit it: add "ws" to the features array
for the kube dependency in the root Cargo.toml (so tests using AttachParams and
.exec() get the feature enabled) and update the workspace-tests Cargo.toml line
referencing kube to simply kube = { workspace = true } so it inherits the root
configuration.
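A sketch of what that split might look like, assuming the root manifest already carries a `[workspace.dependencies]` table (the version number is illustrative, not taken from the repo):

```toml
# Root Cargo.toml: declare the feature once, workspace-wide.
[workspace.dependencies]
kube = { version = "0.99", features = ["ws"] }

# workspace-tests/Cargo.toml: members simply inherit the configuration.
[dev-dependencies]
kube = { workspace = true }
```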
In `@workspace-tests/tests/k8s_multinode_tests.rs`:
- Around line 42-65: select_nodes currently returns metadata.name and treats
that as the kubernetes.io/hostname, and it also ignores NoSchedule taints and
PodReady status; update select_nodes to instead pick nodes by their actual
scheduling identity and readiness: for each Node in node_list.items only include
nodes where the Ready condition is True, there is no NoSchedule taint in
spec.taints, and then prefer spec.node_name (or if you truly need the hostname
label, use metadata.labels.get("kubernetes.io/hostname")) when returning the two
identifiers; adjust the returned tuple and any callers that assume metadata.name
== hostname to use the chosen value (reference function select_nodes, variable
schedulable, and look at places currently using the hostname selector) so pods
are scheduled onto truly ready, untainted nodes.
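The readiness/taint filtering itself is plain data logic. A simplified sketch, using hand-rolled stand-in types instead of the real `k8s_openapi` `Node` (which is far more deeply nested) so the predicate is easy to see:

```rust
// Hypothetical stand-ins for the relevant fields of
// k8s_openapi::api::core::v1::Node; names are illustrative.
struct Taint {
    effect: String,
}

struct NodeView {
    name: String,
    ready: bool,
    taints: Vec<Taint>,
}

/// A node is schedulable if its Ready condition is True and it
/// carries no NoSchedule taint.
fn schedulable(n: &NodeView) -> bool {
    n.ready && !n.taints.iter().any(|t| t.effect == "NoSchedule")
}

/// Pick two distinct schedulable nodes, or None if the cluster has fewer.
fn pick_two(nodes: &[NodeView]) -> Option<(String, String)> {
    let mut it = nodes.iter().filter(|n| schedulable(n));
    Some((it.next()?.name.clone(), it.next()?.name.clone()))
}
```

Returning `None` (mapped to an error in the real test) means a cluster with fewer than two usable nodes fails fast rather than hanging.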
- Around line 68-82: create_namespace currently uses fixed names and a
fire-and-forget delete (ns_api.delete + sleep) which causes collisions and
Terminating namespaces; change create_namespace to generate a unique namespace
name per test (e.g., append a UUID or random suffix to the provided name), and
replace the pre-create fire-and-forget delete with an awaited deletion sequence:
call ns_api.delete(&name, &DeleteParams::default()).await when a namespace
exists, then poll ns_api.get(&name).await (or handle 404) until the namespace is
fully removed (not in Terminating) before calling
ns_api.create(&PostParams::default(), &ns). Also remove reliance on
NamespaceGuard::drop() spawning un-awaited deletes — add an explicit async
cleanup method on NamespaceGuard (e.g., async fn delete_and_wait(&self, client:
&Client)) that performs ns_api.delete(...) .await and polls until deletion
completes, and invoke that in test teardown so deletions are awaited instead of
fire-and-forget.
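The uniqueness half of that fix needs no extra crates. A sketch of the name generation (the suffix scheme is illustrative; a UUID would work equally well), keeping Kubernetes's DNS-label constraints in mind:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Append a per-run suffix so concurrent or rerun tests never share a
/// namespace. Namespace names must be DNS labels: lowercase
/// alphanumerics and '-', at most 63 characters.
fn unique_namespace(base: &str) -> String {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before epoch")
        .subsec_nanos();
    let pid = std::process::id();
    // Hex keeps the suffix short and within the allowed alphabet.
    let name = format!("{base}-{pid:x}-{nanos:x}");
    name.chars().take(63).collect()
}
```

The awaited-deletion half would then poll `ns_api.get(&name)` until it returns 404 before recreating, as the prompt describes.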
---
Nitpick comments:
In `@test_multinode_k8s.sh`:
- Around line 24-25: Replace the existing bare "set -e" invocations in the
script (the top-level shell options and the later block around the image import
pipeline) with a stricter shell option set to prevent masked failures: use "set
-euo pipefail" instead of "set -e". Update the occurrences referenced (the
initial script header where cd "$(dirname "$0")" follows, and the later image
import pipeline block currently using "set -e") so that any failing command in a
pipeline (e.g., docker save | docker import) causes the script to exit and unset
variables are treated as errors.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4db5ff5a-ce29-44db-95e4-ffaa1c9f9720
⛔ Files ignored due to path filters (1)
Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
- test_multinode_k8s.sh
- workspace-tests/Cargo.toml
- workspace-tests/tests/k8s_multinode_tests.rs
```rust
/// Create a namespace, returning a guard that deletes it on drop.
async fn create_namespace(client: &Client, name: &str) -> Result<NamespaceGuard> {
    let ns_api: Api<Namespace> = Api::all(client.clone());
    let ns: Namespace = serde_json::from_value(serde_json::json!({
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": { "name": name }
    }))?;

    // Delete if leftover from a previous failed run
    let _ = ns_api.delete(name, &DeleteParams::default()).await;
    tokio::time::sleep(Duration::from_secs(2)).await;

    ns_api.create(&PostParams::default(), &ns).await?;
```
Static namespace names plus fire-and-forget cleanup will collide across runs.
These tests pass fixed namespace names into a helper that deletes any existing namespace with that name, sleeps two seconds, and recreates it. Because NamespaceGuard::drop() only spawns the delete and never waits for completion, reruns can hit namespaces stuck in Terminating, and concurrent runs can delete each other's resources. Use a unique namespace per test and perform awaited deletion/polling instead of fire-and-forget cleanup.
Also applies to: 94-103, 487-488, 510-511, 541-542, 574-575, 610-611
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@workspace-tests/tests/k8s_multinode_tests.rs` around lines 68 - 82,
create_namespace currently uses fixed names and a fire-and-forget delete
(ns_api.delete + sleep) which causes collisions and Terminating namespaces;
change create_namespace to generate a unique namespace name per test (e.g.,
append a UUID or random suffix to the provided name), and replace the pre-create
fire-and-forget delete with an awaited deletion sequence: call
ns_api.delete(&name, &DeleteParams::default()).await when a namespace exists,
then poll ns_api.get(&name).await (or handle 404) until the namespace is fully
removed (not in Terminating) before calling
ns_api.create(&PostParams::default(), &ns). Also remove reliance on
NamespaceGuard::drop() spawning un-awaited deletes — add an explicit async
cleanup method on NamespaceGuard (e.g., async fn delete_and_wait(&self, client:
&Client)) that performs ns_api.delete(...) .await and polls until deletion
completes, and invoke that in test teardown so deletions are awaited instead of
fire-and-forget.
@nicolasnoble could you address this comment? Claude also pointed the same issue and suggested polling until the namespace is gone (404 on GET).
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
- workspace-tests/tests/k8s_multinode_tests.rs: Rust integration tests using the kube crate, all #[ignore] by default. 5 tests:
  - Cross-node gRPC transfer (default, 64KB, 4MB chunk sizes)
  - Transfer integrity with md5 content hash comparison
  - Multi-replica independent cache verification
- test_multinode_k8s.sh: minimal wrapper that builds the Docker image, distributes it via microk8s images import, and runs cargo test
- Adds kube ws feature and anyhow to workspace-tests dependencies

Signed-off-by: Nicolas 'Pixel' Noble <nicolas@nobis-crew.org>
Force-pushed from ecfaecc to 84f5814 (Compare)
- Rename CARGO_ARGS to TEST_ARGS in shell script (they're libtest args, not cargo args)
- Use nodeName instead of nodeSelector with kubernetes.io/hostname label for pod placement (metadata.name is the node's API name, not necessarily its hostname label)
- Use set -euo pipefail for stricter shell error handling

Signed-off-by: Nicolas 'Pixel' Noble <nicolas@nobis-crew.org>
Ready for review. @AndyDai-nv @zhengluo-nv @nv-hwoo @KavinKrishnan
Codecov Report: ✅ All modified and coverable lines are covered by tests.
```toml
kube = { workspace = true, features = ["ws"] }
k8s-openapi = { workspace = true }
anyhow = { workspace = true }
```
I think we should move these under [dev-dependencies], as they are only used for testing.
```sh
if [ $? -ne 0 ]; then echo "RESULT=FAILED"; exit 1; fi

echo "CHECKSUMS_START"
find /cache -type f -exec md5sum {{}} \; | awk '{{print $1}}' | sort -u
```
Is there a reason why client does not filter .lock and refs/ while the server checksum does? Should we add the same filter on the client side as well?
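If the intent is to mirror the server side, a client-side checksum with the same exclusions might look like the sketch below. The `/cache` layout is fabricated here in a temp directory purely for illustration; the filter patterns are assumptions based on the question above, not taken from the server code.

```shell
# Recreate a toy cache layout: one payload file plus the bookkeeping
# files (.lock, refs/) that the server-side checksum excludes.
cache=$(mktemp -d)
mkdir -p "$cache/refs"
printf 'model-bytes' > "$cache/model.bin"
printf 'lockfile'    > "$cache/model.lock"
printf 'ref'         > "$cache/refs/main"

# Same find pipeline as the client, with the exclusions added.
sums=$(find "$cache" -type f ! -name '*.lock' ! -path '*/refs/*' \
        -exec md5sum {} \; | awk '{print $1}' | sort -u)

echo "$sums"
rm -rf "$cache"
```

With the filters in place, only the payload file contributes a checksum, so both sides compare the same set of hashes.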
Thanks @nicolasnoble for adding these. These are great value adds since we need more k8s testing. It seems like this test is to be run manually by devs locally (correct me if I'm wrong). Do you think we can also make these triggered by our CI and run on the company CI clusters? (unless that requires some support from the Ops team)
```rust
let svc = deploy_server(&client, ns, "mx-server", &server_node).await?;
submit_client_job(&client, ns, "dl-default", &client_node, &svc, "").await?;

let success = timeout(TIMEOUT, wait_for_job(&client, ns, "dl-default")).await??;
```
Do we need the outer timeouts when wait_for_job() already has a timeout within the function?
```rust
// ---------------------------------------------------------------------------

/// Pick two different schedulable nodes from the cluster.
async fn select_nodes(client: &Client) -> Result<(String, String)> {
```
What happens when there are no nodes available? Does it wait?
Summary

- Adds Rust integration tests (workspace-tests/tests/k8s_multinode_tests.rs) using the kube crate for multi-node k8s testing, all #[ignore] by default
- Adds a wrapper script (test_multinode_k8s.sh) that builds the Docker image, distributes it via microk8s images import, and runs cargo test
- Adds kube (ws feature), k8s-openapi, and anyhow to workspace-tests dependencies

Test plan

- cargo clippy passes with no warnings
- pre-commit run --all-files passes
- Tests are #[ignore], so they won't run in CI

Summary by CodeRabbit
Tests
Chores