[Internal]: Add critical first steps to ECK troubleshooting workflow #3928

@damianpfister

Description

Add a "First Steps" section at the beginning of the troubleshooting workflow to prevent common misdiagnoses and reduce investigation time.

What: We are adding a new section that emphasizes three critical checks users should perform before detailed troubleshooting:

  1. Collect eck-diagnostics immediately (events expire after ~1 hour)
  2. Check Kubernetes security policies and permissions (most common blocker)
  3. Verify pod status before investigating application errors (causality awareness)

Why: Many ECK deployment issues are caused by Kubernetes admission layer blocks (security policies, quotas, admission webhooks) rather than application configuration. Without checking the Kubernetes layer first, users spend days investigating symptoms (operator errors, authentication failures) instead of the root cause (pods never created).

Details users need to know:

  • eck-diagnostics captures the namespace events that reveal pod creation failures
  • UP-TO-DATE: 0 in the deployment status indicates that Kubernetes is blocking pod creation (not an application failure)
  • Operator errors (401, 503, connection refused) often occur because the pods don't exist
  • Security policy violations appear in events.json, not in pod logs
  • Events expire after about an hour, so collect diagnostics early

Proposed Content

Section Title: First Steps: Critical Checks Before Detailed Investigation

Location: Add as first major section after page introduction, before existing troubleshooting steps

Content:

## First Steps: Critical Checks Before Detailed Investigation

Perform these checks first to catch common issues and prevent unnecessary investigation:

### Step 1: Collect eck-diagnostics

Collect diagnostics immediately for any ECK deployment issue. Kubernetes keeps events for only about one hour, so evidence of failed pod creation disappears quickly.

```bash
# Download from https://github.com/elastic/eck-diagnostics/releases/latest
./eck-diagnostics -o <operator-namespace> -r <resource-namespace>

# Check for pod creation failures
unzip -p eck-diagnostics-*.zip <namespace>/events.json | \
  jq '.items[] | select(.reason=="FailedCreate")'
```

**When to collect:**

- Deployments show `READY: 0/1` or `UP-TO-DATE: 0`
- Reports of "no pods deployed"
- Any new ECK deployment issue

### Step 2: Check Kubernetes Security Policies

Most "no pods created" issues stem from Kubernetes security policies blocking admission.

```bash
# Check namespace Pod Security labels
kubectl get namespace <namespace> -o yaml | grep pod-security

# Check for FailedCreate events
kubectl get events -n <namespace> | grep FailedCreate

# Check deployment status
kubectl get deployment -n <namespace>
```

**Common patterns:**

| Symptom | Likely Cause | Action |
|---|---|---|
| `UP-TO-DATE: 0` | Kubernetes blocking pod creation | Check events for `FailedCreate` |
| "violates PodSecurity" in events | Security policy violation | See Kubernetes troubleshooting page |
| "exceeded quota" in events | Resource quota limit | Run `kubectl describe quota` |

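The patterns in the table above can be turned into a small triage helper. The sketch below is illustrative only: `classify_event` is a hypothetical function name, and the two substrings it matches are the ones quoted in the table, not an exhaustive list of Kubernetes event messages.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ECK or kubectl): map an event message
# to the likely cause, using the substrings listed in the table above.
classify_event() {
  local message="$1"
  case "$message" in
    *"violates PodSecurity"*) echo "security-policy" ;;
    *"exceeded quota"*)       echo "resource-quota" ;;
    *)                        echo "unknown" ;;
  esac
}

# Example with a made-up message, for illustration only:
classify_event 'pods "example-pod" is forbidden: exceeded quota: example-quota'
```

One possible use is to feed it each `FailedCreate` message from `events.json` or from `kubectl get events`, then follow the matching action from the table. Messages classified as `unknown` still need a manual read.
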
### Step 3: Verify Pod Status First

Operator errors are often symptoms of pods not existing.

```bash
kubectl get pods -n <namespace>
```

**Decision point:**

- No pods (`UP-TO-DATE: 0`)? → Kubernetes-layer issue (check events, security policies)
- Pods exist but failing? → Application-layer issue (check pod logs)

**Important:** Don't investigate operator errors (401, 503) before verifying that pods exist.
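
The decision point above can be sketched as a short shell function. This is a minimal illustration, assuming the pod count has already been obtained; `triage_layer` is a hypothetical name, and the commented `kubectl` invocation assumes a configured cluster context.

```shell
#!/usr/bin/env bash
# Hypothetical helper: given the number of pods in the namespace,
# report which layer to investigate first.
triage_layer() {
  local pod_count="$1"
  if [ "$pod_count" -eq 0 ]; then
    echo "kubernetes-layer: check events and security policies"
  else
    echo "application-layer: check pod status and logs"
  fi
}

# Possible usage against a live cluster (assumes kubectl is configured):
#   triage_layer "$(kubectl get pods -n <namespace> --no-headers 2>/dev/null | wc -l)"
triage_layer 0
```

A count of zero only says that no pods exist. The events collected in Step 1 are still needed to find out why.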


---

## Rationale

**Problem:** Users often investigate application-layer errors (authentication, connectivity) for days without first checking if pods were ever created. Kubernetes security policies silently block pod creation at the admission layer.

**Impact:** This addition provides a clear entry point that catches Kubernetes-layer issues immediately, reducing multi-day investigations to hours.

**Placement:** Placing the section at the top of the troubleshooting workflow ensures that all users see these critical checks first.


### Resources

**Target Page:**  
https://www.elastic.co/docs/troubleshoot/deployments/cloud-on-k8s/troubleshooting-methods


### Which documentation set does this change impact?

Elastic On-Prem only

### Feature differences

N/A

### What release is this request related to?

9.1

### Serverless release

N/A

### Collaboration model

The documentation team

### Point of contact.

**Main contact:** @eedugon 

**Stakeholders:** @damianpfister 
