Add v4 pod lifecycle handlers #1819
Conversation
Walkthrough

This pull request implements several changes across different parts of the application. In the supervisor package, script commands are updated and dependencies are added for metrics collection and pod management. New environment variables are defined to control metrics collection, pod cleaning, and failed pod handling. The Kubernetes client receives a new utility function for creating informers with additional filtering options. New service classes, including a pod cleaner and a failed pod handler, are introduced along with corresponding integration tests. A patch for the Kubernetes client is also applied to support an extra parameter.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant S as ManagedSupervisor
    participant PC as PodCleaner
    participant FPH as FailedPodHandler
    participant M as Metrics Collector
    S->>S: Check env.METRICS_COLLECT_DEFAULTS
    alt Metrics enabled
        S->>M: Initiate default metrics collection
    end
    S->>S: Check env.POD_CLEANER_ENABLED
    alt PodCleaner enabled
        S->>PC: Instantiate PodCleaner with config
    end
    S->>S: Check env.FAILED_POD_HANDLER_ENABLED
    alt FailedPodHandler enabled
        S->>FPH: Instantiate FailedPodHandler with config
    end
    S->>PC: Call podCleaner.start()
    S->>FPH: Call failedPodHandler.start()
```
```mermaid
sequenceDiagram
    participant FPH as FailedPodHandler
    participant I as Informer
    participant API as K8sApi
    FPH->>I: Start informer monitoring pods
    I->>FPH: Emit "pod added" event
    FPH->>FPH: onPodCompleted -> check pod status
    alt Pod Failed
        FPH->>API: Process pod deletion
        API-->>FPH: Confirm deletion & update metrics
    else Pod Succeeded
        FPH->>FPH: Process success logic (if applicable)
    end
```
Actionable comments posted: 2
🧹 Nitpick comments (17)
apps/supervisor/src/index.ts (2)
302-308: Consider adding error handling for service startup
While the implementation correctly starts the pod lifecycle handlers when enabled, it might be good to add try/catch blocks to ensure that failures in one service don't prevent others from starting.
```diff
 if (this.podCleaner) {
-  await this.podCleaner.start();
+  try {
+    await this.podCleaner.start();
+  } catch (error) {
+    this.logger.error("[ManagedWorker] Failed to start pod cleaner", { error });
+  }
 }
 if (this.failedPodHandler) {
-  await this.failedPodHandler.start();
+  try {
+    await this.failedPodHandler.start();
+  } catch (error) {
+    this.logger.error("[ManagedWorker] Failed to start failed pod handler", { error });
+  }
 }
```
330-336: Consider adding error handling for service shutdown
Similar to the start method, it would be good to add error handling to the stop method to ensure graceful shutdown even if one service fails.
```diff
 if (this.podCleaner) {
-  await this.podCleaner.stop();
+  try {
+    await this.podCleaner.stop();
+  } catch (error) {
+    this.logger.error("[ManagedWorker] Failed to stop pod cleaner", { error });
+  }
 }
 if (this.failedPodHandler) {
-  await this.failedPodHandler.stop();
+  try {
+    await this.failedPodHandler.stop();
+  } catch (error) {
+    this.logger.error("[ManagedWorker] Failed to stop failed pod handler", { error });
+  }
 }
```
apps/supervisor/src/services/podCleaner.test.ts (1)
273-472: Well-designed test helper functions
The helper functions encapsulate common test operations and make the tests more readable. They include robust timeout handling and clear error messages.
Consider extracting these helper functions to a shared test utilities file if they might be useful for other test files, particularly for the failed pod handler tests.
apps/supervisor/src/services/failedPodHandler.ts (5)
9-16: Ensure appropriate defaults and validations for user-supplied options.
The interface FailedPodHandlerOptions allows several optional properties. It may be beneficial to validate user inputs (e.g., verifying namespace is non-empty) to avoid potential runtime errors or silent misconfigurations. A sketch follows below.
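For illustration, a minimal sketch of such a guard. The option names mirror the ones discussed in this review; the defaults and the trimmed-down interface are assumptions, not the actual class:

```ts
// Sketch only: option names follow this review; defaults are illustrative.
interface FailedPodHandlerOptions {
  namespace?: string;
  reconnectIntervalMs?: number;
}

function validateOptions(opts: FailedPodHandlerOptions): Required<FailedPodHandlerOptions> {
  if (opts.namespace !== undefined && opts.namespace.trim().length === 0) {
    throw new Error("namespace must be non-empty when provided");
  }
  if (opts.reconnectIntervalMs !== undefined && opts.reconnectIntervalMs <= 0) {
    throw new Error("reconnectIntervalMs must be a positive number of milliseconds");
  }
  return {
    namespace: opts.namespace ?? "default",
    reconnectIntervalMs: opts.reconnectIntervalMs ?? 1000,
  };
}
```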
37-66: Provide more clarity on reconnection intervals and logging.
In the constructor, you set this.reconnectIntervalMs = opts.reconnectIntervalMs ?? 1000; without telling users that the handler will retry reconnections every second on failure. Consider adding logs or documentation to highlight the reconnection behavior when repeated failures occur.
150-185: Consider renaming onPodCompleted for clarity.
The current name onPodCompleted handles both Succeeded and Failed pods. To avoid confusion, you could rename this to onPodPhaseChange or a similar term that aligns with the multiple end states (Succeeded, Failed).
195-205: Validate presence of additional error details for failed pods.
Within onPodFailed, you call processFailedPod and then always delete the pod. If future requirements need analyzing the cause of failure, consider capturing pod.status.reason or containerStatuses[].state for advanced debugging or reporting before removal; a sketch follows below.
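As a sketch of what that capture could look like, using the standard V1Pod status fields from @kubernetes/client-node (the function name and return shape are hypothetical):

```ts
import type { V1Pod } from "@kubernetes/client-node";

// Collect failure details from a pod before deleting it (sketch).
function extractFailureDetails(pod: V1Pod) {
  const containerErrors = (pod.status?.containerStatuses ?? [])
    .map((cs) => cs.state?.terminated)
    .filter((t): t is NonNullable<typeof t> => t !== undefined)
    .map((t) => ({ reason: t.reason, exitCode: t.exitCode, message: t.message }));

  return {
    podReason: pod.status?.reason, // e.g. "Evicted"
    podMessage: pod.status?.message,
    containerErrors, // e.g. [{ reason: "OOMKilled", exitCode: 137 }]
  };
}
```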
239-251: Handle repeated error events cautiously.
Restarting the informer in onError is valid, but if the error is persistent, the loop might become churn-heavy. Consider incremental backoff or a maximum retry count to prevent continuous restarts in persistent failure scenarios; a sketch follows below.
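A sketch of capped exponential backoff for such restarts; all names here are illustrative:

```ts
// Retry an informer restart with capped exponential backoff (sketch).
async function restartWithBackoff(
  start: () => Promise<void>,
  baseMs = 1000,
  maxMs = 30_000,
  maxAttempts = 10
): Promise<void> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await start();
      return; // informer is running again
    } catch (error) {
      const delayMs = Math.min(baseMs * 2 ** attempt, maxMs);
      console.error(`restart attempt ${attempt + 1} failed, retrying in ${delayMs}ms`, error);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error("informer failed to restart after maximum attempts");
}
```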
apps/supervisor/src/services/podCleaner.ts (4)
1-14: Avoid potential edge cases with minimal intervals or overly large batch sizes.
PodCleanerOptions offers intervalMs and batchSize but lacks explicit validations. If a user sets an extremely large batch size or a minimal interval, it might cause stress on the Kubernetes cluster. A sanity check or documentation update might help.
16-57: Leverage consistent logging for clearer ops insights.
The constructor logs fewer details about the cleaner's configuration compared to the FailedPodHandler. Consider adding logs reflecting key configuration parameters (e.g., batchSize, intervalMs) for enhanced visibility and debugging.
59-66: Add state checks to prevent double starts/stops.
Similar to FailedPodHandler, consider verifying whether PodCleaner is already running before calling start, and likewise whether it is already stopped before calling stop, to avoid unintentional restarts or log spam. A sketch follows below.
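A minimal sketch of such an idempotency guard; the class shape is illustrative, not the actual PodCleaner:

```ts
// Idempotent start/stop guard (sketch).
class GuardedService {
  private running = false;

  async start(): Promise<void> {
    if (this.running) return; // already started: avoid duplicate timers and log spam
    this.running = true;
    // ... create the interval / informer here
  }

  async stop(): Promise<void> {
    if (!this.running) return; // already stopped
    this.running = false;
    // ... clear the interval / stop the informer here
  }
}
```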
111-117: Confirm metric coverage for additional error scenarios.
getMetrics returns counters and gauges which track cycles and timestamps. However, if certain cleanup fails mid-cycle, the existing counters might not fully reflect partial deletions. Consider adding a metric for partial successes versus full runs to improve observability; a sketch follows below.
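One way to express that with prom-client; the metric name and label are hypothetical:

```ts
import { Counter, Registry } from "prom-client";

const register = new Registry();

// Hypothetical counter distinguishing full cleanup cycles from partial ones.
const cleanupCycles = new Counter({
  name: "pod_cleaner_cycles_total",
  help: "Cleanup cycles, labelled by whether every targeted pod was deleted",
  labelNames: ["outcome"], // "full" | "partial"
  registers: [register],
});

// After each cycle, record the observed outcome:
cleanupCycles.inc({ outcome: "partial" });
```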
apps/supervisor/src/services/failedPodHandler.test.ts (5)
38-137: Validate performance under multiple simultaneous pods.
The test verifies correct behavior for 2 failed pods. Consider a scenario with more pods failing concurrently; this can reveal potential load issues around the informer, event handling, or resource contention in real K8s clusters.
178-223: Recommend verifying partial states for pods in deletion.
The test ensures pods being deleted are skipped. Great approach; this prevents double-handling or spurious metrics. Consider adding more logging or metrics in FailedPodHandler that explicitly track pods recognized as "in deletion."
319-418: Consolidate pod creation and deletion utilities in a test helper.
createTestPods, waitForPodsDeletion, and podExists are valuable utilities. You might move them into a shared testing helper module if they're used across multiple test files for better maintainability and reusability.
420-459: Increase test variety for different phases beyond Failed.
waitForPodsPhase checks a single phase at a time, which is good. For more comprehensive tests, you might stage transitions from "Pending" → "Running" → "Failed" in a single test to verify multi-phase transitions as they reflect typical real cluster scenarios.
461-478: Validate force-deletion or orphaned pods.
deleteAllPodsInNamespace is thorough, but consider testing forced deletions or resource finalizers. Certain contexts or finalizers might block deletion, requiring additional logic or extra knowledge from the user.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (11)
- apps/supervisor/package.json (1 hunks)
- apps/supervisor/src/clients/kubernetes.ts (1 hunks)
- apps/supervisor/src/env.ts (1 hunks)
- apps/supervisor/src/index.ts (5 hunks)
- apps/supervisor/src/metrics.ts (1 hunks)
- apps/supervisor/src/services/failedPodHandler.test.ts (1 hunks)
- apps/supervisor/src/services/failedPodHandler.ts (1 hunks)
- apps/supervisor/src/services/podCleaner.test.ts (1 hunks)
- apps/supervisor/src/services/podCleaner.ts (1 hunks)
- package.json (1 hunks)
- patches/@[email protected] (1 hunks)
🧰 Additional context used
🧬 Code Definitions (4)
apps/supervisor/src/services/podCleaner.test.ts (3)
- apps/supervisor/src/clients/kubernetes.ts (3): k8s (52-52), createK8sApi (9-29), K8sApi (31-31)
- apps/supervisor/src/metrics.ts (1): register (3-3)
- apps/supervisor/src/services/podCleaner.ts (1): PodCleaner (16-118)
apps/supervisor/src/services/failedPodHandler.test.ts (1)
- apps/supervisor/src/services/failedPodHandler.ts (1): FailedPodHandler (18-281)
apps/supervisor/src/services/podCleaner.ts (2)
- apps/supervisor/src/clients/kubernetes.ts (2): K8sApi (31-31), createK8sApi (9-29)
- apps/supervisor/src/metrics.ts (1): register (3-3)
apps/supervisor/src/index.ts (2)
- apps/supervisor/src/services/podCleaner.ts (1): PodCleaner (16-118)
- apps/supervisor/src/services/failedPodHandler.ts (1): FailedPodHandler (18-281)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: typecheck / typecheck
- GitHub Check: units / 🧪 Unit Tests
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (24)
apps/supervisor/src/metrics.ts (1)
1-3: Good implementation of metrics registry.
The code correctly sets up a metrics registry using the prom-client library, which will be essential for the pod lifecycle handlers mentioned in the PR objectives. A sketch of the pattern follows below.
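The file is only a few lines, so the likely pattern looks something like this. This is a sketch: per this review, the default-metrics call is actually gated behind env.METRICS_COLLECT_DEFAULTS in index.ts, but it is shown inline here for illustration:

```ts
import { Registry, collectDefaultMetrics } from "prom-client";

// One shared registry so the pod cleaner and failed pod handler
// can register their metrics in a single place.
export const register = new Registry();

// Default Node.js process metrics (event loop lag, heap usage, ...).
// In this PR this call is conditional on env.METRICS_COLLECT_DEFAULTS.
collectDefaultMetrics({ register });
```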
package.json (1)
81-82: Appropriate patch for Kubernetes client.
The addition of the patched dependency for @kubernetes/client-node is necessary to support the fieldSelector parameter in the makeInformer function, which will enable more granular filtering for pod lifecycle handlers.
apps/supervisor/src/clients/kubernetes.ts (3)
2-4: Good approach with separate imports.
Breaking out the specific imports from @kubernetes/client-node improves code clarity and documentation of dependencies.
12-19: Well-implemented informer creation function.
The makeInformer function is well-typed and provides a clean abstraction over the Kubernetes client library. The inclusion of the fieldSelector parameter allows for more precise filtering of Kubernetes objects, which will be valuable for the pod cleaner and failed pod handler mentioned in the PR objectives. A usage sketch follows below.
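A usage sketch under the patched signature, assuming the pre-1.x @kubernetes/client-node API where makeInformer takes a labelSelector and, with this patch, a trailing fieldSelector; the namespace and selectors are illustrative:

```ts
import * as k8s from "@kubernetes/client-node";

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const coreApi = kc.makeApiClient(k8s.CoreV1Api);
const namespace = "default"; // illustrative

// Narrow the watch server-side: only pods labelled app=task-run
// that have already reached the Failed phase.
const informer = k8s.makeInformer(
  kc,
  `/api/v1/namespaces/${namespace}/pods`,
  () => coreApi.listNamespacedPod(namespace),
  "app=task-run", // labelSelector
  "status.phase=Failed" // fieldSelector, added by the patch
);

informer.on("add", (pod) => console.log("failed pod:", pod.metadata?.name));
informer.start().catch(console.error);
```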
25-25: Good API extension.
Adding makeInformer to the exported API object follows the established pattern in this file and makes the function easily accessible to consumers.
apps/supervisor/package.json (3)
11-11: Good addition of test watch script.
Adding a test:watch script with vitest is helpful for the development workflow and supports the testing of the new pod lifecycle handlers.
19-19: Appropriate Prometheus client dependency.
The addition of the prom-client dependency aligns with the new metrics functionality implemented in metrics.ts and supports the monitoring of pod lifecycle handlers.
9-10: ❓ Verification inconclusive
Removed experimental flag - verify compatibility.
The removal of the --experimental-sqlite flag from both scripts suggests that SQLite functionality is no longer experimental or needed. Please ensure this doesn't break existing functionality if SQLite is still being used.
🏁 Script executed:

```sh
#!/bin/bash
# Check if SQLite is still used elsewhere in the codebase
echo "Searching for SQLite references in the codebase..."
rg -i "sqlite" --type ts --type js apps/supervisor/
```

Length of output: 162
Below is a revised review comment based on the initial search results. However, because the output was ambiguous (i.e. no matches were returned, which might be due to the limited scope of the search), please run the additional verification below to be sure that no parts of the codebase rely on the experimental SQLite flag.
Action Required: Verify Compatibility of SQLite Integration
The --experimental-sqlite flag was removed from the npm scripts in apps/supervisor/package.json (lines 9–10). An initial search within apps/supervisor did not return any references to SQLite. However, since that search produced no output (a low-confidence signal), please verify whether any file in the repository still references the flag or uses SQLite functionality. This will ensure that the removal does not break existing features.
- File to review: apps/supervisor/package.json (lines 9–10)
- Next step: Run an additional comprehensive check across the repository for occurrences of the --experimental-sqlite flag.

For your convenience, please run the following script from the repository root:
```sh
#!/bin/bash
echo "Searching for '--experimental-sqlite' flag occurrences in the entire repository..."
# -e keeps rg from parsing the pattern's leading dashes as a flag
rg --ignore-case -e '--experimental-sqlite'
```

Once confirmed that the flag is no longer used anywhere, you can safely consider this change verified.
apps/supervisor/src/env.ts (1)
54-65: Well-structured environment variable additions
The new environment variables follow the established pattern in the codebase with appropriate types and sensible defaults. The grouping by functionality (metrics, pod cleaner, failed pod handler) makes the configuration options clear and maintainable. A sketch of the pattern follows below.
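For reference, the flag names from this PR in a zod-style schema. This is a sketch: the boolean-parsing helper and the defaults are assumptions, not the repo's actual helpers:

```ts
import { z } from "zod";

// Parse "1"/"true" (any case) as true; anything else, including unset, as false.
const BoolEnv = z
  .string()
  .optional()
  .transform((v) => v !== undefined && ["1", "true"].includes(v.toLowerCase()));

const Env = z.object({
  METRICS_COLLECT_DEFAULTS: BoolEnv,
  POD_CLEANER_ENABLED: BoolEnv,
  FAILED_POD_HANDLER_ENABLED: BoolEnv,
});

export const env = Env.parse(process.env);
```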
apps/supervisor/src/index.ts (3)
21-29: Well-organized imports and metrics initialization
The imports are logically structured, and the conditional metrics collection setup is clean. This ensures metrics are only collected when explicitly enabled through environment variables.
40-42: Good addition of private properties for pod lifecycle handlers
The private property declarations for the pod cleaner and failed pod handler follow the existing pattern in the class.
51-64: Clean conditional instantiation of pod lifecycle handlers
The handlers are only instantiated when enabled via environment variables, and they're configured with appropriate parameters from the environment. A sketch follows below.
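The pattern being praised, as a self-contained sketch; the constructor options and the KUBERNETES_NAMESPACE variable name are assumptions:

```ts
// Stand-ins so the sketch is self-contained; the real classes live in ./services/*.
interface LifecycleService { start(): Promise<void>; stop(): Promise<void>; }
declare const PodCleaner: new (opts: { namespace: string }) => LifecycleService;
declare const FailedPodHandler: new (opts: { namespace: string }) => LifecycleService;
declare const env: {
  POD_CLEANER_ENABLED: boolean;
  FAILED_POD_HANDLER_ENABLED: boolean;
  KUBERNETES_NAMESPACE: string; // hypothetical variable name
};

// Only construct a handler when its flag is enabled.
const podCleaner = env.POD_CLEANER_ENABLED
  ? new PodCleaner({ namespace: env.KUBERNETES_NAMESPACE })
  : undefined;
const failedPodHandler = env.FAILED_POD_HANDLER_ENABLED
  ? new FailedPodHandler({ namespace: env.KUBERNETES_NAMESPACE })
  : undefined;

await podCleaner?.start();
await failedPodHandler?.start();
```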
apps/supervisor/src/services/podCleaner.test.ts (5)
7-34: Well-structured test setup and teardown
The test setup creates a dedicated test namespace and cleans up resources after each test, which is good practice for integration tests. The registry clearing prevents metric conflicts between tests.
35-75: Comprehensive test for pod cleanup functionality
This test case effectively verifies the core functionality of the pod cleaner by creating a test pod, waiting for it to succeed, and then confirming it gets deleted.
77-114: Good verification of metrics tracking
The test properly verifies that metrics are updated when pods are deleted, which is important for observability.
116-194: Thorough testing of batch size configurations
Testing both small and large batch sizes ensures the pod cleaner handles different workloads correctly. The metrics verification confirms the expected number of deletion cycles based on batch size.
196-270: Important negative test cases
These tests verify that the pod cleaner doesn't delete pods it shouldn't (neither pods without the right label nor running pods), which is critical for preventing unintended deletions.
patches/@[email protected] (4)
9-16: Good addition of fieldSelector parameter to ListWatch constructor
The patch correctly adds the fieldSelector parameter to the constructor and stores it as a class property, following the same pattern as the existing labelSelector.
20-26: Properly handling fieldSelector in query parameters
The implementation correctly adds the fieldSelector to the query parameters when making watch requests, similar to how labelSelector is handled.
33-38: Updated type declaration for makeInformer
The type declaration is properly updated to include the new fieldSelector parameter.
47-52: Properly updated makeInformer implementation
The implementation is correctly updated to pass the fieldSelector parameter to the ListWatch constructor.
apps/supervisor/src/services/failedPodHandler.test.ts (3)
12-28: Ensure beforeAll logic cleans up thoroughly.
While you're creating the namespace if missing, the subsequent call to deleteAllPodsInNamespace should handle leftover pods. In cases where the user lacks permissions or pods are in a terminating state, it may cause flakes. Confirm that the cluster/CI environment properly handles these states.
139-176: Good check for label-based filtering.
Your test ensures that only pods labeled app=task-run get processed. The logic and metrics validation here appear sound.
225-316: Commendable coverage for dynamic failures.
Confirming that newly failing pods are handled after the handler starts is crucial. The test scenario is well-structured and covers typical asynchronous event flows.
The pod cleaner will continuously clean up succeeded pods. It's much more efficient and more configurable than the existing one. I evaluated jobs again to leverage pod TTL, but the additional run creation overhead makes it impractical.
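The core of that approach can be sketched as a periodic batch delete, assuming the pre-1.x @kubernetes/client-node positional API; the namespace, selectors, and interval are illustrative, and the parameter order should be checked against the client version in use:

```ts
import * as k8s from "@kubernetes/client-node";

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const coreApi = kc.makeApiClient(k8s.CoreV1Api);

// Every minute, delete succeeded task-run pods in one collection call.
setInterval(async () => {
  try {
    await coreApi.deleteCollectionNamespacedPod(
      "default", // namespace
      undefined, // pretty
      undefined, // _continue
      undefined, // dryRun
      "status.phase=Succeeded", // fieldSelector
      undefined, // gracePeriodSeconds
      "app=task-run" // labelSelector
    );
  } catch (error) {
    console.error("pod cleanup cycle failed", error);
  }
}, 60_000);
```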
The failed pod handler is there to watch for failed pods, extract failure reasons, and then delete the pods. We'll have to adapt error extraction to the new runners so that's not enabled yet. Deletion is on.
There are tests for both.
Summary by CodeRabbit

New Features
- Pod cleaner service that continuously deletes succeeded pods.
- Failed pod handler that watches for failed pods and deletes them.
- New environment variables to control metrics collection, pod cleaning, and failed pod handling.

Chores
- Patched @kubernetes/client-node to support a fieldSelector parameter in makeInformer.

Tests
- Integration tests for the pod cleaner and failed pod handler.