refactor(kubernetes): simplify cleanLogsCollector to skip if daemonset is not found #2915

Open
barnabasbusa wants to merge 2 commits into main from bbusa/fix-log-collector

barnabasbusa (Collaborator) commented Feb 25, 2026

PR Summary: Fix kurtosis clean -a hanging on k8s clusters with tainted/unhealthy nodes

Problem

kurtosis clean -a hangs indefinitely on Kubernetes clusters where some nodes have taints (e.g. DiskPressure, smc). The fluentbit logs collector's
Clean method creates a remove-dir-pod cleanup pod targeted at each node, but the scheduler will not place these pods on tainted nodes. The
waitForPodAvailability function then blocks for 15 minutes per unschedulable pod, and because nodes are processed sequentially, the delay
accumulates with every tainted node.

Changes (5 files, +46/-19)

  1. kubernetes_manager.go — waitForPodAvailability now:
    - Respects context cancellation (was ignoring ctx.Done())
    - Detects PodReasonUnschedulable and returns immediately instead of waiting 15 minutes
  2. fluentbit_logs_collector_daemonset.go — Clean method now:
    - Returns nil instead of error when zero pods found
    - Makes WaitForPodTermination best-effort (warn, don't fail)
    - Makes RemoveDirPathFromNode best-effort with 2-minute per-node timeout (skips tainted nodes)
    - Makes waitForAtLeastOneActivePodManagedByDaemonSet best-effort
  3. kubernetes_kurtosis_backend_enclave_functions.go — CleanLogsCollector and CleanLogsAggregator errors downgraded from fatal to best-effort
    warnings
  4. clean_logs_collector.go — Calls getLogsCollectorKubernetesResourcesForCluster directly, adds nil check for missing DaemonSet
  5. shared_helpers.go — Two fixes:
    - namespace.Namespace → namespace.Name (was always empty for k8s Namespace objects, causing cross-namespace service account lookups and "found 2"
    errors)
    - Zero pods case returns Stopped status with warning instead of error
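
To illustrate changes 1 and 2, here is a minimal sketch of a poll loop that respects context cancellation, returns early on an unschedulable pod, and is driven by a best-effort per-node loop with its own timeout. It is not the code from this PR: podState, podGetter, waitForPod, cleanNodes and runDemo are hypothetical stand-ins so the example runs with the standard library only, and the demo uses a 50 ms timeout where the PR uses 2 minutes per node.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// podState is a hypothetical stand-in for the real Kubernetes pod status.
type podState struct {
	Running bool
	// Unschedulable models a PodScheduled condition whose reason is "Unschedulable".
	Unschedulable bool
}

// podGetter is a hypothetical lookup; real code would query the API server.
type podGetter func(ctx context.Context, podName string) (podState, error)

var errUnschedulable = errors.New("pod is unschedulable")

// waitForPod polls until the pod is running, the pod is reported
// unschedulable, or ctx is cancelled or times out.
func waitForPod(ctx context.Context, get podGetter, podName string, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		state, err := get(ctx, podName)
		if err != nil {
			return fmt.Errorf("getting pod %q: %w", podName, err)
		}
		if state.Running {
			return nil
		}
		if state.Unschedulable {
			// Fail fast: waiting longer will not help on a tainted node.
			return fmt.Errorf("pod %q: %w", podName, errUnschedulable)
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("waiting for pod %q: %w", podName, ctx.Err())
		case <-ticker.C:
		}
	}
}

// cleanNodes treats cleanup as best-effort: each node gets its own timeout,
// and a failure is logged as a warning instead of aborting the whole clean.
func cleanNodes(ctx context.Context, get podGetter, nodes []string, perNodeTimeout time.Duration) (cleaned, skipped []string) {
	for _, node := range nodes {
		nodeCtx, cancel := context.WithTimeout(ctx, perNodeTimeout)
		err := waitForPod(nodeCtx, get, "remove-dir-pod-"+node, 5*time.Millisecond)
		cancel()
		if err != nil {
			fmt.Printf("warning: skipping node %s: %v\n", node, err)
			skipped = append(skipped, node)
			continue
		}
		cleaned = append(cleaned, node)
	}
	return cleaned, skipped
}

// runDemo simulates one healthy node, one tainted node, and one node whose
// pod stays pending until the per-node timeout expires.
func runDemo() (cleaned, skipped []string) {
	states := map[string]podState{
		"remove-dir-pod-node-a": {Running: true},
		"remove-dir-pod-node-b": {Unschedulable: true},
		"remove-dir-pod-node-c": {},
	}
	get := func(_ context.Context, name string) (podState, error) {
		return states[name], nil
	}
	nodes := []string{"node-a", "node-b", "node-c"}
	return cleanNodes(context.Background(), get, nodes, 50*time.Millisecond)
}

func main() {
	cleaned, skipped := runDemo()
	fmt.Println("cleaned:", cleaned, "skipped:", skipped)
}
```

In this sketch the tainted node is skipped on the first poll and the pending node is skipped once its timeout fires, so one bad node can no longer stall the rest of the loop.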

Result

kurtosis clean -a completes in ~40 seconds even with tainted/unhealthy nodes, instead of hanging indefinitely.

…t is not found

refactor(kubernetes): treat logs collector as stopped if no pods are found for daemonset
@barnabasbusa barnabasbusa added this pull request to the merge queue Feb 27, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 27, 2026