
Conversation

Collaborator

@nicktrn nicktrn commented Mar 26, 2025

The pod cleaner will continuously clean up succeeded pods. It's much more efficient and more configurable than the existing one. I evaluated Kubernetes Jobs again so we could leverage pod TTL, but the additional run creation overhead makes them impractical.

The failed pod handler watches for failed pods, extracts failure reasons, and then deletes them. We'll have to adapt error extraction to the new runners, so extraction isn't enabled yet; deletion is.

There are tests for both.
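
For a sense of the moving parts, here's a minimal sketch of the failed pod handler's flow using the stock @kubernetes/client-node informer API (the namespace and wiring are illustrative; the real implementation lives in apps/supervisor/src/services/failedPodHandler.ts):

import * as k8s from "@kubernetes/client-node";

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const core = kc.makeApiClient(k8s.CoreV1Api);
const namespace = "default"; // illustrative

// Informer that lists and watches pods in the namespace
const informer = k8s.makeInformer(
  kc,
  `/api/v1/namespaces/${namespace}/pods`,
  () => core.listNamespacedPod(namespace)
);

informer.on("add", async (pod: k8s.V1Pod) => {
  if (pod.status?.phase !== "Failed") return;
  // Error extraction would inspect pod.status here (not enabled yet).
  await core.deleteNamespacedPod(pod.metadata!.name!, namespace);
});

await informer.start();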

Summary by CodeRabbit

  • New Features

    • Introduced enhanced pod management capabilities, including configurable options for monitoring, cleaning, and handling pods, as well as improved metrics collection.
    • Improved Kubernetes integration with refined filtering options for better operational control.
  • Chores

    • Streamlined startup and development commands.
    • Updated dependencies and applied configuration patches for enhanced reliability.
  • Tests

    • Added comprehensive integration tests to validate the new pod management functionalities.


changeset-bot bot commented Mar 26, 2025

⚠️ No Changeset found

Latest commit: 0a4f034

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets


Contributor

coderabbitai bot commented Mar 26, 2025

Walkthrough

This pull request implements several changes across different parts of the application. In the supervisor package, script commands are updated and dependencies are added for metrics collection and pod management. New environment variables are defined to control metrics collection, pod cleaning, and failed pod handling. The Kubernetes client receives a new utility function for creating informers with additional filtering options. New service classes, including a pod cleaner and a failed pod handler, are introduced along with corresponding integration tests. A patch for the Kubernetes client is also applied to support an extra parameter.

Changes

  • apps/supervisor/package.json: Updated the dev and start scripts by removing --experimental-sqlite; added a test:watch script; added the dependency "prom-client": "^15.1.0".
  • apps/supervisor/src/clients/kubernetes.ts: Added a new generic makeInformer function inside createK8sApi, with new imports (Informer, ListPromise, KubernetesObject) from @kubernetes/client-node.
  • apps/supervisor/src/env.ts: Added new environment variables: METRICS_COLLECT_DEFAULTS, POD_CLEANER_ENABLED, POD_CLEANER_INTERVAL_MS, POD_CLEANER_BATCH_SIZE, FAILED_POD_HANDLER_ENABLED, and FAILED_POD_HANDLER_RECONNECT_INTERVAL_MS.
  • apps/supervisor/src/index.ts: Integrated metrics collection by importing collectDefaultMetrics and register; introduced new podCleaner and failedPodHandler properties in ManagedSupervisor; updated the start and stop methods to invoke start/stop on these services.
  • apps/supervisor/src/metrics.ts: New file exporting a register constant as a Prometheus Registry instance.
  • apps/supervisor/src/services/failedPodHandler.ts and failedPodHandler.test.ts: New FailedPodHandler class and corresponding integration tests. The class monitors failed pods via an informer, processes them, and updates metrics accordingly.
  • apps/supervisor/src/services/podCleaner.ts and podCleaner.test.ts: New PodCleaner class and corresponding integration tests. The class deletes completed pods using batch processing and heartbeat-based scheduling while tracking relevant metrics.
  • package.json: Added a patched dependency entry for "@kubernetes/[email protected]" pointing to patches/@[email protected].
  • patches/@[email protected]: Updated patch for ListWatch and makeInformer to include a new fieldSelector parameter, allowing field-based filtering in addition to label selectors.

Sequence Diagram(s)

sequenceDiagram
  participant S as ManagedSupervisor
  participant PC as PodCleaner
  participant FPH as FailedPodHandler
  participant M as Metrics Collector
  
  S->>S: Check env.METRICS_COLLECT_DEFAULTS
  alt Metrics enabled
    S->>M: Initiate default metrics collection
  end
  S->>S: Check env.POD_CLEANER_ENABLED
  alt PodCleaner enabled
    S->>PC: Instantiate PodCleaner with config
  end
  S->>S: Check env.FAILED_POD_HANDLER_ENABLED
  alt FailedPodHandler enabled
    S->>FPH: Instantiate FailedPodHandler with config
  end
  S->>PC: Call podCleaner.start()
  S->>FPH: Call failedPodHandler.start()
sequenceDiagram
  participant FPH as FailedPodHandler
  participant I as Informer
  participant API as K8sApi
  
  FPH->>I: Start informer monitoring pods
  I->>FPH: Emit "pod added" event
  FPH->>FPH: onPodCompleted -> check pod status
  alt Pod Failed
    FPH->>API: Process pod deletion
    API-->>FPH: Confirm deletion & update metrics
  else Pod Succeeded
    FPH->>FPH: Process success logic (if applicable)
  end

Suggested reviewers

  • ericallam
  • matt-aitken

Poem

I'm just a rabbit hopping by,
Celebrating changes way up high.
New scripts and handlers now take flight,
Cleaning pods by day and testing by night.
Metrics and patches, oh what a sight!
With carrots and code, we keep things light.
Hop along and code on tight!



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (17)
apps/supervisor/src/index.ts (2)

302-308: Consider adding error handling for service startup

While the implementation correctly starts the pod lifecycle handlers when enabled, it might be good to add try/catch blocks to ensure that failures in one service don't prevent others from starting.

 if (this.podCleaner) {
-  await this.podCleaner.start();
+  try {
+    await this.podCleaner.start();
+  } catch (error) {
+    this.logger.error("[ManagedWorker] Failed to start pod cleaner", { error });
+  }
 }

 if (this.failedPodHandler) {
-  await this.failedPodHandler.start();
+  try {
+    await this.failedPodHandler.start();
+  } catch (error) {
+    this.logger.error("[ManagedWorker] Failed to start failed pod handler", { error });
+  }
 }

330-336: Consider adding error handling for service shutdown

Similar to the start method, it would be good to add error handling for the stop method to ensure graceful shutdown even if one service fails.

 if (this.podCleaner) {
-  await this.podCleaner.stop();
+  try {
+    await this.podCleaner.stop();
+  } catch (error) {
+    this.logger.error("[ManagedWorker] Failed to stop pod cleaner", { error });
+  }
 }

 if (this.failedPodHandler) {
-  await this.failedPodHandler.stop();
+  try {
+    await this.failedPodHandler.stop();
+  } catch (error) {
+    this.logger.error("[ManagedWorker] Failed to stop failed pod handler", { error });
+  }
 }
apps/supervisor/src/services/podCleaner.test.ts (1)

273-472: Well-designed test helper functions

The helper functions encapsulate common test operations and make the tests more readable. They include robust timeout handling and clear error messages.

Consider extracting these helper functions to a shared test utilities file if they might be useful for other test files, particularly for the failed pod handler tests.

apps/supervisor/src/services/failedPodHandler.ts (5)

9-16: Ensure appropriate defaults and validations for user-supplied options.

The interface FailedPodHandlerOptions allows several optional properties. It may be beneficial to validate user inputs (e.g., verifying namespace is non-empty) to avoid potential runtime errors or silent misconfigurations.
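
A sketch of the kind of guard meant here, run at construction time (option names mirror FailedPodHandlerOptions; the exact shape is assumed):

function validateOptions(opts: { namespace?: string; reconnectIntervalMs?: number }) {
  if (opts.namespace !== undefined && opts.namespace.trim() === "") {
    throw new Error("FailedPodHandler: namespace must be a non-empty string");
  }
  if (opts.reconnectIntervalMs !== undefined && opts.reconnectIntervalMs <= 0) {
    throw new Error("FailedPodHandler: reconnectIntervalMs must be positive");
  }
}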


37-66: Provide more clarity on reconnection intervals and logging.

In the constructor, you set this.reconnectIntervalMs = opts.reconnectIntervalMs ?? 1000; without directly signalling that the handler will retry reconnection every second on failure. Consider adding logs or documentation to make the reconnection behaviour visible when repeated failures occur.


150-185: Consider renaming onPodCompleted for clarity.

The current name onPodCompleted handles both Succeeded and Failed pods. To avoid confusion, you could rename this to onPodPhaseChange or a similar term that aligns with the multiple end states (Succeeded, Failed).


195-205: Validate presence of additional error details for failed pods.

Within onPodFailed, you call processFailedPod and then always delete the pod. If future requirements need analyzing the cause of failure, consider capturing pod.status.reason or containerStatuses[].state for advanced debugging or reporting before removal.
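
For reference, a sketch of the extraction meant here, reading the terminated container state before deletion (extractFailureReason is an illustrative name, not a function from this PR):

import type { V1Pod } from "@kubernetes/client-node";

function extractFailureReason(pod: V1Pod): string {
  // Prefer the terminated container's reason (e.g. "OOMKilled", "Error"),
  // falling back to the pod-level reason (e.g. "Evicted").
  const terminated = pod.status?.containerStatuses?.find((c) => c.state?.terminated)
    ?.state?.terminated;
  return terminated?.reason ?? pod.status?.reason ?? "Unknown";
}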


239-251: Handle repeated error events cautiously.

Restarting the informer in onError is valid, but if the error persists, the restart loop can churn indefinitely. Consider incremental backoff or a maximum retry count to prevent continuous restarts under persistent failures.
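
A sketch of the backoff suggested above (constants and wiring are illustrative):

// Exponential backoff with a cap, to avoid hot-looping on persistent errors.
let attempt = 0;

async function restartWithBackoff(restart: () => Promise<void>, baseMs = 1_000, maxMs = 60_000) {
  const delayMs = Math.min(baseMs * 2 ** attempt, maxMs);
  attempt++;
  await new Promise((resolve) => setTimeout(resolve, delayMs));
  await restart();
  attempt = 0; // reset once a restart succeeds
}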

apps/supervisor/src/services/podCleaner.ts (4)

1-14: Avoid potential edge cases with minimal intervals or overly large batch sizes.

PodCleanerOptions offers intervalMs and batchSize but lacks explicit validations. If a user sets an extremely large batch size or minimal interval, it might cause stress on the Kubernetes cluster. A sanity check or documentation update might help.
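
One option is to clamp at construction time; a sketch with illustrative bounds and defaults:

const MIN_INTERVAL_MS = 1_000; // illustrative floor
const MAX_BATCH_SIZE = 500;    // illustrative ceiling

function clampOptions(opts: { intervalMs?: number; batchSize?: number }) {
  return {
    intervalMs: Math.max(opts.intervalMs ?? 10_000, MIN_INTERVAL_MS),
    batchSize: Math.min(opts.batchSize ?? 100, MAX_BATCH_SIZE),
  };
}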


16-57: Leverage consistent logging for clearer ops insights.

The constructor logs fewer details about the cleaner’s configuration compared to the FailedPodHandler. Consider adding logs reflecting key configuration parameters (e.g., batchSize, intervalMs) for enhanced visibility and debugging.


59-66: Add state checks to prevent double starts/stops.

Similar to FailedPodHandler, consider verifying whether PodCleaner is already running before calling start, and likewise whether it is already stopped before calling stop, to avoid unintentional restarts or log spam.
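
A sketch of the guard (illustrative, not the actual class):

class Lifecycle {
  private running = false;

  async start() {
    if (this.running) return; // ignore double starts
    this.running = true;
    // ...begin heartbeat/informer here
  }

  async stop() {
    if (!this.running) return; // ignore double stops
    this.running = false;
    // ...tear down here
  }
}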


111-117: Confirm metric coverage for additional error scenarios.

getMetrics returns counters and gauges which track cycles and timestamps. However, if certain cleanup fails mid-cycle, the existing counters might not fully reflect partial deletions. Consider adding a metric for partial successes versus full runs to improve observability.
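
With prom-client this could be a single labelled counter; a sketch (metric name and labels are illustrative):

import { Counter, Registry } from "prom-client";

const register = new Registry();

const deletionCycles = new Counter({
  name: "pod_cleaner_deletion_cycles_total",
  help: "Deletion cycles by outcome",
  labelNames: ["outcome"], // "full" | "partial" | "failed"
  registers: [register],
});

// e.g. after a cycle that deleted some but not all candidate pods:
deletionCycles.inc({ outcome: "partial" });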

apps/supervisor/src/services/failedPodHandler.test.ts (5)

38-137: Validate performance under multiple simultaneous pods.

The test verifies correct behavior for 2 failed pods. Consider a scenario with more pods failing concurrently; this can reveal load issues around the informer, event handling, or resource contention in real K8s clusters.


178-223: Recommend verifying partial states for pods in deletion.

The test ensures pods being deleted are skipped. Great approach—this prevents double-handling or spurious metrics. Consider adding more logging or metrics in FailedPodHandler that explicitly track pods recognized as “in deletion.”


319-418: Consolidate pod creation and deletion utilities in a test helper.

createTestPods, waitForPodsDeletion, and podExists are valuable utilities. You might move them into a shared testing helper module if they’re used across multiple test files for better maintainability and reusability.


420-459: Increase test variety for different phases beyond Failed.

waitForPodsPhase checks a single phase at a time, which is good. For more comprehensive coverage, you might stage transitions from "Pending" → "Running" → "Failed" in a single test (see the sketch below) to verify the multi-phase transitions typical of real clusters.
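
A sketch of such a staged test (helper signatures are assumed; adapt to the actual createTestPods/waitForPodsPhase utilities in this file):

it("observes Running → Failed transitions", async () => {
  // Hypothetical helper arguments: a pod that runs briefly, then exits non-zero.
  // Pending is often too short-lived to observe reliably, so start at Running.
  const names = await createTestPods({ count: 1, command: ["sh", "-c", "sleep 5; exit 1"] });
  await waitForPodsPhase(names, "Running");
  await waitForPodsPhase(names, "Failed");
});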


461-478: Validate force-deletion or orphaned pods.

deleteAllPodsInNamespace is thorough, but consider testing forced deletions or resource finalizers. Finalizers can block deletion in some contexts, requiring additional handling logic.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 82251e2 and 0a4f034.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (11)
  • apps/supervisor/package.json (1 hunks)
  • apps/supervisor/src/clients/kubernetes.ts (1 hunks)
  • apps/supervisor/src/env.ts (1 hunks)
  • apps/supervisor/src/index.ts (5 hunks)
  • apps/supervisor/src/metrics.ts (1 hunks)
  • apps/supervisor/src/services/failedPodHandler.test.ts (1 hunks)
  • apps/supervisor/src/services/failedPodHandler.ts (1 hunks)
  • apps/supervisor/src/services/podCleaner.test.ts (1 hunks)
  • apps/supervisor/src/services/podCleaner.ts (1 hunks)
  • package.json (1 hunks)
  • patches/@[email protected] (1 hunks)
🧰 Additional context used
🧬 Code Definitions (4)
apps/supervisor/src/services/podCleaner.test.ts (3)
apps/supervisor/src/clients/kubernetes.ts (3)
  • k8s (52-52)
  • createK8sApi (9-29)
  • K8sApi (31-31)
apps/supervisor/src/metrics.ts (1)
  • register (3-3)
apps/supervisor/src/services/podCleaner.ts (1)
  • PodCleaner (16-118)
apps/supervisor/src/services/failedPodHandler.test.ts (1)
apps/supervisor/src/services/failedPodHandler.ts (1)
  • FailedPodHandler (18-281)
apps/supervisor/src/services/podCleaner.ts (2)
apps/supervisor/src/clients/kubernetes.ts (2)
  • K8sApi (31-31)
  • createK8sApi (9-29)
apps/supervisor/src/metrics.ts (1)
  • register (3-3)
apps/supervisor/src/index.ts (2)
apps/supervisor/src/services/podCleaner.ts (1)
  • PodCleaner (16-118)
apps/supervisor/src/services/failedPodHandler.ts (1)
  • FailedPodHandler (18-281)
⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: units / 🧪 Unit Tests
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (24)
apps/supervisor/src/metrics.ts (1)

1-3: Good implementation of metrics registry.

The code correctly sets up a metrics registry using the prom-client library, which will be essential for the pod lifecycle handlers mentioned in the PR objectives.

package.json (1)

81-82: Appropriate patch for Kubernetes client.

The addition of the patched dependency for @kubernetes/client-node is necessary to support the fieldSelector parameter in the makeInformer function, which will enable more granular filtering for pod lifecycle handlers.
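
For context, pnpm wires this up via the patchedDependencies field in the root package.json (shape per pnpm's convention, mirroring the change summary above):

{
  "pnpm": {
    "patchedDependencies": {
      "@kubernetes/[email protected]": "patches/@[email protected]"
    }
  }
}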

apps/supervisor/src/clients/kubernetes.ts (3)

2-4: Good approach with separate imports.

Breaking out the specific imports from @kubernetes/client-node improves code clarity and documentation of dependencies.


12-19: Well-implemented informer creation function.

The makeInformer function is well-typed and provides a clean abstraction over the Kubernetes client library. The inclusion of the fieldSelector parameter allows for more precise filtering of Kubernetes objects, which will be valuable for the pod cleaner and failed pod handler mentioned in the PR objectives.
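
A sketch of how a consumer might call the patched informer to watch only failed task-run pods (argument order and the core client property are assumed, not taken from the diff):

const k8sApi = createK8sApi();
const namespace = "default"; // illustrative

const informer = k8sApi.makeInformer(
  `/api/v1/namespaces/${namespace}/pods`,
  () => k8sApi.core.listNamespacedPod(namespace),
  "app=task-run",        // labelSelector
  "status.phase=Failed"  // fieldSelector, enabled by the patch
);

informer.on("add", (pod) => {
  // handle the failed pod
});
await informer.start();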


25-25: Good API extension.

Adding makeInformer to the exported API object follows the established pattern in this file and makes the function easily accessible to consumers.

apps/supervisor/package.json (3)

11-11: Good addition of test watch script.

Adding a test:watch script with vitest is helpful for development workflow and supports the testing of the new pod lifecycle handlers.


19-19: Appropriate prometheus client dependency.

The addition of the prom-client dependency aligns with the new metrics functionality implemented in metrics.ts and supports the monitoring of pod lifecycle handlers.


9-10: Removed experimental flag - verify compatibility (❓ Verification inconclusive)

The removal of the --experimental-sqlite flag from both scripts suggests that SQLite functionality is no longer experimental or needed. Please ensure this doesn't break existing functionality if SQLite is still being used.


🏁 Script executed:

#!/bin/bash
# Check if SQLite is still used elsewhere in the codebase
echo "Searching for SQLite references in the codebase..."
rg -i "sqlite" --type ts --type js apps/supervisor/

Length of output: 162


Below is a revised review comment based on the initial search results. However, because the output was ambiguous (i.e. no matches were returned, which might be due to the limited scope of the search), please run the additional verification below to be sure that no parts of the codebase rely on the experimental SQLite flag.

Action Required: Verify Compatibility of SQLite Integration

The --experimental-sqlite flag was removed from the npm scripts in apps/supervisor/package.json (lines 9–10). An initial search within apps/supervisor did not return any references to SQLite. However, since that search produced no output (and therefore supports only a low-confidence inference), please verify whether any file in the repository still references the flag or uses SQLite functionality. This will ensure the removal does not break existing features.

  • File to review: apps/supervisor/package.json (lines 9–10)
  • Next step: Run an additional comprehensive check across the repository for occurrences of the --experimental-sqlite flag.

For your convenience, please run the following script from the repository root:

#!/bin/bash
echo "Searching for '--experimental-sqlite' flag occurrences in the entire repository..."
rg --ignore-case -e '--experimental-sqlite'  # -e lets the pattern start with a dash

Once confirmed that the flag is no longer used anywhere, you can safely consider this change verified.

apps/supervisor/src/env.ts (1)

54-65: Well-structured environment variable additions

The new environment variables follow the established pattern in the codebase with appropriate types and sensible defaults. The grouping by functionality (metrics, pod cleaner, failed pod handler) makes the configuration options clear and maintainable.
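
A sketch of how these might look in a zod-based schema (defaults are assumed for illustration; env.ts is authoritative):

import { z } from "zod";

// "true"/"1" parse to true; z.coerce.boolean() would treat "false" as true.
const BoolEnv = z
  .string()
  .default("true")
  .transform((v) => ["true", "1"].includes(v.toLowerCase()));

const EnvSchema = z.object({
  METRICS_COLLECT_DEFAULTS: BoolEnv,
  POD_CLEANER_ENABLED: BoolEnv,
  POD_CLEANER_INTERVAL_MS: z.coerce.number().int().default(10_000),
  POD_CLEANER_BATCH_SIZE: z.coerce.number().int().default(500),
  FAILED_POD_HANDLER_ENABLED: BoolEnv,
  FAILED_POD_HANDLER_RECONNECT_INTERVAL_MS: z.coerce.number().int().default(1_000),
});

export const env = EnvSchema.parse(process.env);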

apps/supervisor/src/index.ts (3)

21-29: Well-organized imports and metrics initialization

The imports are logically structured, and the conditional metrics collection setup is clean. This ensures metrics are only collected when explicitly enabled through environment variables.


40-42: Good addition of private properties for pod lifecycle handlers

The private property declarations for the pod cleaner and failed pod handler follow the existing pattern in the class.


51-64: Clean conditional instantiation of pod lifecycle handlers

The handlers are only instantiated when enabled via environment variables, and they're configured with appropriate parameters from the environment.

apps/supervisor/src/services/podCleaner.test.ts (5)

7-34: Well-structured test setup and teardown

The test setup creates a dedicated test namespace and cleans up resources after each test, which is good practice for integration tests. The registry clearing prevents metric conflicts between tests.


35-75: Comprehensive test for pod cleanup functionality

This test case effectively verifies the core functionality of the pod cleaner by creating a test pod, waiting for it to succeed, and then confirming it gets deleted.


77-114: Good verification of metrics tracking

The test properly verifies that metrics are updated when pods are deleted, which is important for observability.


116-194: Thorough testing of batch size configurations

Testing both small and large batch sizes ensures the pod cleaner handles different workloads correctly. The metrics verification confirms the expected number of deletion cycles based on batch size.


196-270: Important negative test cases

These tests verify that the pod cleaner doesn't delete pods it shouldn't (neither pods without the right label nor running pods), which is critical for preventing unintended deletions.

patches/@[email protected] (4)

9-16: Good addition of fieldSelector parameter to ListWatch constructor

The patch correctly adds the fieldSelector parameter to the constructor and stores it as a class property, following the same pattern as the existing labelSelector.


20-26: Properly handling fieldSelector in query parameters

The implementation correctly adds the fieldSelector to the query parameters when making watch requests, similar to how labelSelector is handled.


33-38: Updated type declaration for makeInformer

The type declaration is properly updated to include the new fieldSelector parameter.


47-52: Properly updated makeInformer implementation

The implementation is correctly updated to pass the fieldSelector parameter to the ListWatch constructor.

apps/supervisor/src/services/failedPodHandler.test.ts (3)

12-28: Ensure beforeAll logic cleans up thoroughly.

While you're creating the namespace if missing, the subsequent call to deleteAllPodsInNamespace should handle leftover pods. In cases where the user lacks permissions or pods are in a terminating state, it may cause flakes. Confirm that the cluster/CI environment properly handles these states.


139-176: Good check for label-based filtering.

Your test ensures that only pods labeled app=task-run get processed. The logic and metrics validation here appear sound.


225-316: Commendable coverage for dynamic failures.

Confirming that newly failing pods are handled after the handler starts is crucial. The test scenario is well-structured and covers typical asynchronous event flows.

@nicktrn nicktrn merged commit 4fe1d49 into main Mar 26, 2025
12 checks passed
@nicktrn nicktrn deleted the feat/re2-podcleaner branch March 26, 2025 10:22