fix(Pod/Job/mechanic-agent-6fe85a2890c9): mechanic-agent pod OOMKilled - requires manual intervention #1619
k8s-mendabot[bot] wants to merge 1 commit into main
Mechanic re-investigation (fingerprint:
Re-validated the OOMKilled issue for the mechanic-agent pod.
Current Status:
Confirmed Constraints:
Verification:
Root cause analysis in the original PR remains accurate. This issue requires manual intervention to increase AGENT_MEM_LIMIT in the mechanic Helm release.
Mechanic re-investigation (fingerprint:
Status: Root cause confirmed - issue persists
Updated Findings: The mechanic-agent pod (mechanic-agent-6fe85a2890c9-dmx8z) continues to be OOMKilled with exit code 137. This is a recursive finding: the mechanic agent is investigating itself (the finding fingerprint ac0c0496d323 corresponds to the mechanic-agent pod that was investigating another issue).
Key Details:
Confirmed Root Cause:
No GitOps Fix Possible:
Recommendation:
Mechanic re-investigation (fingerprint:
Re-validated OOMKilled issue for pod:
Evidence Gathered:
Root Cause Confirmed:
Constraints:
Recursive Finding Pattern:
Mechanic re-investigation (fingerprint:
Updated findings from new pod instance:
This is a recurring issue: the mechanic deployment continues to spawn agent pods with the same 512Mi memory limit, which is insufficient for the opencode workload. The mechanic deployment remains outside GitOps control, preventing a GitOps-based remediation.
Mechanic re-investigation (fingerprint:
Confirmed the existing analysis. Key findings from re-investigation:
Recommendation remains the same: This requires a manual Helm upgrade of the mechanic deployment, not a GitOps change. Set
Summary
The mechanic-agent pod (mechanic-agent-6fe85a2890c9-dmx8z) is being terminated with exit code 137 (OOMKilled). The pod has insufficient memory (512Mi limit) to complete its investigation task. The mechanic deployment is managed by Helm but is not tracked in the GitOps repository, preventing a GitOps-based fix.
Finding
Evidence
Pod State (kubectl describe):
Job State:
Deployment Information:
Investigation Context:
Root Cause
The mechanic-agent pod is configured with a 512Mi memory limit (set by the mechanic-watcher when `AGENT_MEM_LIMIT` is not specified). The opencode agent requires more than 512Mi of memory to run its investigation workflow, including:
When the agent attempts to allocate memory beyond the 512Mi limit, the kernel kills the container with SIGKILL (exit code 137).
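As a brief illustration of how exit code 137 maps to SIGKILL: container runtimes conventionally report a signal-terminated process as 128 plus the signal number, so 137 - 128 = 9. The sketch below assumes a Linux host (where signal 9 is SIGKILL), and the helper name `describe_exit_code` is hypothetical. The exit code alone does not prove an OOM kill; the `OOMKilled` reason in the pod status is what confirms it.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a readable description.

    Codes above 128 conventionally mean "killed by signal (code - 128)".
    """
    if code > 128:
        signum = code - 128
        try:
            # Signal names are platform dependent; Linux is assumed here.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```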
Additional Issue - Recursive Failure:
The mechanic-agent is a remediation tool designed to investigate pod failures. When the mechanic-agent pod itself fails due to OOM, it creates a recursive situation where:
Fix
Increase the memory limit for mechanic-agent pods from 512Mi to a higher value (recommended: 1Gi or 2Gi, depending on workload).
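To make the proposed sizes concrete, the following minimal sketch converts Kubernetes-style binary memory quantities to bytes and compares the recommended values against the current 512Mi limit. The helper `parse_memory_quantity` is hypothetical and handles only integer values with the `Ki`/`Mi`/`Gi`/`Ti` suffixes, not the full Kubernetes quantity syntax.

```python
# Binary suffixes as used in Kubernetes memory quantities.
BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def parse_memory_quantity(quantity: str) -> int:
    """Convert a quantity such as '512Mi' to a byte count."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: treat as a plain byte count


current = parse_memory_quantity("512Mi")
for candidate in ("1Gi", "2Gi"):
    size = parse_memory_quantity(candidate)
    print(f"{candidate}: {size} bytes ({size / current:.0f}x the current limit)")
```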
Implementation Required:
The fix requires updating the Helm values for the mechanic deployment by setting the `AGENT_MEM_LIMIT` environment variable on the `mechanic` deployment.
However, the mechanic deployment is NOT managed via GitOps. An extensive search of the GitOps repository (`/workspace/repo/kubernetes/`) found no mechanic-related manifests. The deployment appears to have been installed manually via Helm or through a bootstrapping process outside of GitOps.
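For illustration only, the environment-variable change described above can be expressed as a `kubectl set env` invocation. The sketch below only builds and prints the command; it does not run it. The namespace `example-namespace` and the helper name are placeholders, since the PR does not state where the deployment lives, and a change applied directly to the deployment may drift from the Helm release values, so it should be treated as a stopgap rather than the fix itself.

```python
import shlex


def build_set_env_command(deployment: str, name: str, value: str, namespace: str) -> list[str]:
    """Build (but do not execute) a kubectl command that sets one env var."""
    return [
        "kubectl", "set", "env",
        f"deployment/{deployment}",
        f"{name}={value}",
        "--namespace", namespace,
    ]


# "example-namespace" is a placeholder; substitute the real namespace.
cmd = build_set_env_command("mechanic", "AGENT_MEM_LIMIT", "1Gi", "example-namespace")
print(shlex.join(cmd))
```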
Manual Fix Steps:
Confidence
Low confidence in GitOps-based fix - The mechanic deployment is not tracked in the GitOps repository, so this issue cannot be resolved through GitOps changes.
High confidence in root cause - The evidence clearly shows OOMKilled due to insufficient memory (512Mi limit).
Notes
Recursive Finding Risk: The mechanic-agent pod failure creates a recursive finding situation. Consider adding logic to prevent mechanic from creating remediation jobs for its own failed pods.
Resource Limits: The current defaults (512Mi memory, 500m CPU) appear to be insufficient for opencode agent workloads, which involve LLM inference and repository operations.
Monitoring: Consider adding memory usage monitoring for mechanic-agent pods to detect usage approaching the limit before an OOM kill occurs.
Investigation of mechanic Deployment: Determine where the mechanic deployment is managed and whether it should be migrated to GitOps for better observability and change tracking.
Helm Values: If the mechanic deployment is to remain outside GitOps, ensure the values are documented and stored in version control somewhere (even if not managed by Flux).
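The recursive-finding guard suggested in the notes above could be as simple as a name-based filter applied before a remediation job is created. This is a minimal sketch, not mechanic's actual logic: the function name, the prefix check, and the second pod name are hypothetical, and a real implementation might match on labels or owner references instead.

```python
# Prefix taken from the agent pod name reported in this PR.
AGENT_POD_PREFIX = "mechanic-agent-"


def should_skip_finding(pod_name: str) -> bool:
    """Return True when the failing pod is itself a mechanic agent."""
    return pod_name.startswith(AGENT_POD_PREFIX)


# The second pod name is a made-up example of an unrelated workload.
findings = ["mechanic-agent-6fe85a2890c9-dmx8z", "example-app-abc123"]
actionable = [pod for pod in findings if not should_skip_finding(pod)]
print(actionable)
```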
Opened automatically by mechanic