fix(Pod/Job/mechanic-agent-6fe85a2890c9): mechanic-agent pod OOMKilled - requires manual intervention #1619

Open

k8s-mendabot[bot] wants to merge 1 commit into main from fix/mechanic-ac0c0496d323

Conversation

k8s-mendabot Bot commented Apr 20, 2026

Summary

The mechanic-agent pod (mechanic-agent-6fe85a2890c9-dmx8z) is being terminated with exit code 137 (OOMKilled). The pod has insufficient memory (512Mi limit) to complete its investigation task. The mechanic deployment is managed by Helm but is not tracked in the GitOps repository, preventing a GitOps-based fix.

Finding

  • Kind: Pod
  • Resource: mechanic-agent-6fe85a2890c9-dmx8z
  • Namespace: default
  • Parent: Job/mechanic-agent-6fe85a2890c9
  • Fingerprint: `ac0c0496d323`

Evidence

Pod State (kubectl describe):

  • Container `mechanic-agent` state: `Terminated`
  • Reason: `OOMKilled`
  • Exit Code: `137`
  • Memory Limit: `512Mi`
  • Memory Request: `128Mi`
  • CPU Limit: `500m`
  • CPU Request: `100m`

Job State:

  • Job `mechanic-agent-6fe85a2890c9` has reached backoff limit (1)
  • 2 pods have failed, 0 succeeded
  • Active Deadline Seconds: 900s
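
For reference, the pod and Job state above can be re-checked with kubectl (names taken from this finding, namespace default per the report; the failed pod must still exist for the first two commands):

```bash
# Show the terminated container state (reason, exit code) and resource limits
kubectl describe pod mechanic-agent-6fe85a2890c9-dmx8z -n default

# Or extract just the terminated state as JSON
kubectl get pod mechanic-agent-6fe85a2890c9-dmx8z -n default \
  -o jsonpath='{.status.containerStatuses[0].state.terminated}'

# Compare the Job's failure count against its backoff limit
kubectl get job mechanic-agent-6fe85a2890c9 -n default \
  -o jsonpath='{.status.failed} failed / backoffLimit {.spec.backoffLimit}{"\n"}'
```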

Deployment Information:

  • Deployment `mechanic` is managed by Helm (chart: mechanic-0.4.4)
  • Labels: `app.kubernetes.io/managed-by=Helm`
  • Image: `ghcr.io/lenaxia/mechanic-watcher:v0.4.4`
  • Environment variables configured:
    • `AGENT_MEM_REQUEST`: `128Mi`
    • `AGENT_MEM_LIMIT`: (empty - defaults to 512Mi)
    • `AGENT_CPU_REQUEST`: `100m`
    • `AGENT_CPU_LIMIT`: (empty - defaults to 500m)

Investigation Context:

  • The mechanic-agent pod was created to investigate another finding (Pod/cloudflare-ddns-29611860-64mzb in namespace networking)
  • The pod's init containers completed successfully, but the main agent container was OOMKilled after running for ~51 seconds (19:05:48 - 19:06:39)

Root Cause

The mechanic-agent pod is configured with a 512Mi memory limit (set by the mechanic-watcher when `AGENT_MEM_LIMIT` is not specified). The opencode agent requires more than 512Mi of memory to run its investigation workflow, including:

  • Cloning the GitOps repository
  • Running LLM inference via the configured provider
  • Processing large Kubernetes resources

When the agent's memory usage crosses the 512Mi limit, the kernel's OOM killer terminates the container with SIGKILL; the reported exit code 137 is 128 plus signal 9 (SIGKILL).
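
As a cross-check that this is a cgroup OOM kill rather than, say, a node-pressure eviction, the `OOMKilled` container status above is the authoritative signal; pod events can be pulled as well, though Kubernetes does not always record a dedicated OOM event, so an empty result does not contradict the status:

```bash
# Any events still retained for the failed agent pod
kubectl get events -n default \
  --field-selector involvedObject.name=mechanic-agent-6fe85a2890c9-dmx8z
```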

Additional Issue - Recursive Failure:
The mechanic-agent is a remediation tool designed to investigate pod failures. When the mechanic-agent pod itself fails due to OOM, it creates a recursive situation where:

  1. A finding triggers a mechanic-agent investigation
  2. The mechanic-agent pod runs out of memory and is killed
  3. The OOMKilled mechanic-agent pod itself becomes a new finding
  4. This creates an infinite loop of failed remediation attempts

Fix

Increase the memory limit for mechanic-agent pods from 512Mi to a higher value (recommended: 1Gi or 2Gi, depending on workload).

Implementation Required:
The fix requires updating the Helm values so that the `AGENT_MEM_LIMIT` environment variable is set on the `mechanic` deployment.

However, the mechanic deployment is NOT managed via GitOps. Extensive search of the GitOps repository (`/workspace/repo/kubernetes/`) found no mechanic-related manifests. The deployment appears to have been installed manually via Helm or through a bootstrapping process outside of GitOps.

Manual Fix Steps:

  1. Identify where the mechanic deployment was installed (check Helm history or installation records)
  2. Update the values file to set `AGENT_MEM_LIMIT: 1Gi` (or higher)
  3. Upgrade the Helm release: `helm upgrade mechanic <chart-reference> -f <values-file>` (both placeholders; see the sketch below)
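
A sketch of that workflow, assuming Helm 3. The release name `mechanic` comes from the deployment labels; the namespace, the `<chart-reference>` placeholder, and the exact values key that feeds `AGENT_MEM_LIMIT` are assumptions to verify against the installed `mechanic-0.4.4` chart:

```bash
# 1. Find the release and its install history
helm list -A | grep mechanic
helm history mechanic -n default

# 2. Export the user-supplied values and edit them so the agent env
#    ends up with AGENT_MEM_LIMIT: 1Gi (key path depends on the chart)
helm get values mechanic -n default > mechanic-values.yaml

# 3. Apply the edited values; <chart-reference> stands in for wherever
#    the chart was installed from (repo/chart or a local path)
helm upgrade mechanic <chart-reference> -n default -f mechanic-values.yaml
```

`helm get values` returns only the user-supplied overrides, so editing that file and re-applying it leaves the chart defaults for everything else intact.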

Confidence

Low confidence in GitOps-based fix - The mechanic deployment is not tracked in the GitOps repository, so this issue cannot be resolved through GitOps changes.

High confidence in root cause - The evidence clearly shows OOMKilled due to insufficient memory (512Mi limit).

Notes

  1. Recursive Finding Risk: The mechanic-agent pod failure creates a recursive finding situation. Consider adding logic to prevent mechanic from creating remediation jobs for its own failed pods.

  2. Resource Limits: The current defaults (512Mi memory, 500m CPU) appear to be insufficient for opencode agent workloads, which involve LLM inference and repository operations.

  3. Monitoring: Consider adding memory usage monitoring for mechanic-agent pods so rising memory is caught before the limit is hit (see the sketch after this list).

  4. Investigation of mechanic Deployment: Determine where the mechanic deployment is managed and whether it should be migrated to GitOps for better observability and change tracking.

  5. Helm Values: If the mechanic deployment is to remain outside GitOps, ensure the values are documented and stored in version control somewhere (even if not managed by Flux).
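
On note 3, a minimal sketch of such monitoring, assuming metrics-server is installed and relying on the `job-name` label that Kubernetes sets on Job pods (jq needed for the second command):

```bash
# Live working-set memory of the agent pods, per container
kubectl top pod -n default \
  -l job-name=mechanic-agent-6fe85a2890c9 --containers

# Count OOMKilled containers across current mechanic-agent pods
kubectl get pods -n default -o json | jq '
  [.items[]
   | select(.metadata.name | startswith("mechanic-agent"))
   | .status.containerStatuses[]?
   | select(.state.terminated.reason == "OOMKilled")]
  | length'
```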


Opened automatically by mechanic

k8s-mendabot Bot added the needs-human-review (Requires human review before merging) label on Apr 20, 2026

k8s-mendabot Bot commented Apr 22, 2026

Mechanic re-investigation (fingerprint: ac0c0496d323)

Re-validated the OOMKilled issue for mechanic-agent pod mechanic-agent-6fe85a2890c9-dmx8z. Findings remain consistent:

Current Status:

  • Job mechanic-agent-6fe85a2890c9 has 2 failed pods (both OOMKilled with exit code 137)
  • All containers configured with 512Mi memory limit, 128Mi request
  • Job reached backoff limit on 2026-04-20

Confirmed Constraints:

  • Mechanic deployment is NOT managed via GitOps (no manifests in /workspace/repo/kubernetes/)
  • Deployment is Helm-managed (chart: mechanic-0.4.4)
  • Fix requires manual Helm values update, not a GitOps change

Verification:

  • Both failed pods show identical OOMKilled state
  • Init containers completed successfully before OOM
  • Main container ran ~51 seconds before termination

Root cause analysis in original PR remains accurate. This issue requires manual intervention to increase AGENT_MEM_LIMIT in the mechanic Helm release.

k8s-mendabot Bot commented Apr 23, 2026

Mechanic re-investigation (fingerprint: ac0c0496d323)

Status: Root cause confirmed - issue persists

Updated Findings:

The mechanic-agent pod (mechanic-agent-6fe85a2890c9-dmx8z) continues to be OOMKilled with exit code 137. This is a recursive finding - the mechanic agent is investigating itself (the finding fingerprint ac0c0496d323 corresponds to the mechanic-agent pod that was investigating another issue).

Key Details:

  • Pod ran for ~51 seconds (19:05:48 - 19:06:39) before OOMKilled
  • Original target of investigation: Pod/cloudflare-ddns-29611860-64mzb in namespace networking
  • Current severity for this finding: high (system is stuck in a remediation loop)

Confirmed Root Cause:
The mechanic-agent container has a 512Mi memory limit which is insufficient for opencode agent workloads. When the agent exceeds this limit, it is killed by the kernel with SIGKILL (exit code 137).

No GitOps Fix Possible:
As noted in the original PR, the mechanic deployment is not managed in the GitOps repository. This issue requires manual Helm value updates to increase AGENT_MEM_LIMIT (recommended: 1Gi or 2Gi).

Recommendation:
This recursive failure indicates that the mechanic-watcher should have logic to prevent creating remediation jobs for its own failed pods. The current implementation creates an infinite loop where failed mechanic pods generate new findings, which trigger new mechanic pods, which also fail due to OOM.

k8s-mendabot Bot commented Apr 24, 2026

Mechanic re-investigation (fingerprint: ac0c0496d323)

Re-validated OOMKilled issue for pod: mechanic-agent-6fe85a2890c9-6pspx

Evidence Gathered:

  • Container state: Terminated - Reason: OOMKilled - Exit Code: 137
  • Container ran from 19:04:39 to 19:05:33 (~54 seconds) before termination
  • Memory limits: 512Mi limit, 128Mi request (insufficient for opencode agent workload)
  • Init containers (git-token-clone, dry-run-gate) completed successfully
  • Job mechanic-agent-6fe85a2890c9 has reached backoff limit with 2 failed pods
  • Original investigation target: Pod/cloudflare-ddns-29611860-64mzb in namespace networking

Root Cause Confirmed:
The mechanic-agent container exceeds the 512Mi memory limit when running opencode agent workloads (LLM inference, repository operations, resource processing). The kernel kills the container with SIGKILL (exit code 137) when memory pressure occurs.

Constraints:

  • Mechanic deployment is NOT managed via GitOps (no manifests in /workspace/repo/kubernetes/)
  • Deployment is Helm-managed (chart: mechanic-0.4.4, release: mechanic)
  • AGENT_MEM_LIMIT and AGENT_CPU_LIMIT are empty in the deployment spec, so agents fall back to the 512Mi and 500m defaults (see the check below)
  • Fix requires manual Helm values update (recommended: AGENT_MEM_LIMIT: 1Gi or higher)
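
The empty-value claim can be confirmed directly against the live deployment (namespace default assumed, matching the agent pods; jq used for readability):

```bash
# List the AGENT_* env vars on the mechanic deployment;
# an empty value means the watcher applies its built-in default
kubectl get deployment mechanic -n default -o json \
  | jq -r '.spec.template.spec.containers[0].env[]
           | select(.name | startswith("AGENT_"))
           | "\(.name)=\(.value // "")"'
```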

Recursive Finding Pattern:
This is a self-referential finding - mechanic is investigating its own failed pod, creating a remediation loop. Previous re-investigations have confirmed the same issue with pods -dmx8z and -6pspx from the same Job.

k8s-mendabot Bot commented Apr 26, 2026

Mechanic re-investigation (fingerprint: ac0c0496d323)

Updated findings from new pod instance:

  • Pod: mechanic-agent-6fe85a2890c9-6pspx (previous was dmx8z)
  • Status: OOMKilled, Exit Code 137
  • Memory Limit: 512Mi (insufficient)
  • Runtime: ~54 seconds before termination (19:04:39 - 19:05:33)
  • Same root cause confirmed: The opencode agent requires more than 512Mi memory to complete its investigation workflow

This is a recurrent issue - the mechanic deployment continues to spawn agent pods with the same 512Mi memory limit, which is insufficient for the opencode workload. The mechanic deployment remains outside GitOps control, preventing a GitOps-based remediation.

k8s-mendabot Bot commented Apr 27, 2026

Mechanic re-investigation (fingerprint: ac0c0496d323)

Confirmed the existing analysis. Key findings from re-investigation:

  1. Multiple OOM events: The Job mechanic-agent-6fe85a2890c9 shows 2 failed pods, both with OOMKilled status
  2. Current agent limit: The mechanic deployment environment variables show AGENT_MEM_LIMIT is empty, causing agents to default to 512Mi
  3. Recursive failure confirmed: The current remediation job mechanic-ac0c0496d323 is investigating a previous mechanic-agent failure, creating a recursive loop
  4. GitOps verification: Confirmed that mechanic is NOT managed in GitOps repository - no mechanic-related manifests exist in /workspace/repo/kubernetes/

Recommendation remains the same: This requires manual Helm upgrade of the mechanic deployment, not a GitOps change. Set AGENT_MEM_LIMIT: 1Gi or 2Gi in the Helm values and run helm upgrade.
