Improve AutoOpsAgentPolicy Status Reporting#9095

Open
moukoublen wants to merge 1 commit into elastic:main from moukoublen:improve_autoops_status_reporting

@moukoublen moukoublen commented Feb 3, 2026

Note: This description was generated with AI assistance (Claude Opus 4.5)

Improve AutoOpsAgentPolicy Status Reporting

Summary

  • Add per-resource status tracking with detailed error information for AutoOpsAgentPolicy
  • Introduce human-readable status fields for better observability
  • Enhance error reporting with specific messages for RBAC and reconciliation failures
  • Change the default columns of kubectl get autoopsagentpolicies to show the ready ratio, phase, and message

New Status Fields

Skipped (int)

The number of Elasticsearch resources excluded from monitoring because of RBAC permission issues. When the operator is configured with --enforce-rbac-on-refs and the specified serviceAccountName lacks permission to access an Elasticsearch resource in a different namespace, that resource is counted as skipped rather than as an error.

Example: skipped: 2 indicates 2 Elasticsearch clusters couldn't be monitored due to insufficient RBAC permissions.

ReadyCount (string)

A human-readable string showing the ratio of ready monitored resources to total monitored resources in the format Ready/Resources. This provides an at-a-glance view of the policy's health without needing to compare separate numeric fields.

Example: readyCount: "3/5" indicates 3 out of 5 matched Elasticsearch clusters have healthy AutoOps agents deployed.
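
A minimal sketch of how the Ready/Resources string could be produced. The function name and its parameters are hypothetical and are not taken from the PR diff; only the output format comes from the description above.

```go
package main

import "fmt"

// formatReadyCount renders the "Ready/Resources" ratio.
// Hypothetical helper: the name and signature are assumptions.
func formatReadyCount(ready, resources int) string {
	return fmt.Sprintf("%d/%d", ready, resources)
}

func main() {
	fmt.Println(formatReadyCount(3, 5))
	fmt.Println(formatReadyCount(0, 0))
}
```

Keeping the ratio as a single string lets it be shown directly in a kubectl column without the reader having to combine two numeric fields.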

Message (string)

A human-readable summary of the current status, combining information about ready resources, errors, and skipped resources into a single descriptive string. The message is dynamically generated based on non-zero counts.

Examples:

  • "3 resource ready" - all resources healthy
  • "2 resource ready, 1 error" - partial success with errors
  • "1 resource ready, 1 error, 2 skipped due to RBAC" - mixed status with RBAC issues
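
The message assembly could look roughly like the sketch below, which emits one fragment per non-zero count and joins them with commas. The function name buildMessage is hypothetical, and the fragment wording is copied from the examples above; whether the real implementation pluralizes "resource" or "error" is not stated here, so the sketch does not.

```go
package main

import (
	"fmt"
	"strings"
)

// buildMessage returns a summary containing only the non-zero counts.
// Hypothetical helper: name, signature, and exact wording are assumptions
// based on the example strings in the PR description.
func buildMessage(ready, errCount, skipped int) string {
	var parts []string
	if ready > 0 {
		parts = append(parts, fmt.Sprintf("%d resource ready", ready))
	}
	if errCount > 0 {
		parts = append(parts, fmt.Sprintf("%d error", errCount))
	}
	if skipped > 0 {
		parts = append(parts, fmt.Sprintf("%d skipped due to RBAC", skipped))
	}
	// With all counts at zero the message is empty.
	return strings.Join(parts, ", ")
}

func main() {
	fmt.Println(buildMessage(3, 0, 0))
	fmt.Println(buildMessage(1, 1, 2))
}
```

Under this sketch, a policy with nothing ready and no errors or skips produces an empty message, which would be consistent with the blank MESSAGE column in the 0/1 and 0/0 kubectl examples in this description.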

Details (map[string]ResourceStatus)

A map providing per-resource status information, keyed by resource identifier in the format namespace/name. Only resources with non-ready states (errors or skipped) are included in this map to keep the status lightweight. Each entry contains:

  • Phase (ResourcePhase): Either "Error" or "Skipped"
  • Message (string): Human-readable explanation, set for skipped resources (e.g., "RBAC access denied for service account my-sa")
  • Error (string): Detailed error information, set only for error states (e.g., "Failed to create AutoOps ES CA secret: secret not found")

Example:

details:
  production/es-cluster-1:
    phase: Error
    error: "Failed to create AutoOps ES API key: connection refused"
  staging/es-cluster-2:
    phase: Skipped
    message: "RBAC access denied for service account autoops-sa"

New Types

ResourceStatus (struct)

A lightweight struct for per-resource status information:

  • Phase: The resource phase (Error or Skipped)
  • Message: Human-readable explanation (only for non-ready states)
  • Error: Error details (only for error states)

ResourcePhase (string)

An enumeration for resource-level phases:

  • ErrorResourcePhase ("Error"): Resource reconciliation failed
  • SkippedResourcePhase ("Skipped"): Resource skipped due to RBAC
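
For orientation, the two types might be declared along these lines. The type, constant, and field names come from this description; the JSON tags and omitempty choices are assumptions inferred from the YAML example, and the encode helper exists only for the illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ResourcePhase enumerates the resource-level phases.
type ResourcePhase string

const (
	ErrorResourcePhase   ResourcePhase = "Error"
	SkippedResourcePhase ResourcePhase = "Skipped"
)

// ResourceStatus holds per-resource status details.
// The JSON tags are assumed, not copied from the PR.
type ResourceStatus struct {
	Phase   ResourcePhase `json:"phase"`
	Message string        `json:"message,omitempty"`
	Error   string        `json:"error,omitempty"`
}

// encode is an illustrative helper that serializes one entry.
func encode(rs ResourceStatus) string {
	b, err := json.Marshal(rs)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	details := map[string]ResourceStatus{
		"production/es-cluster-1": {
			Phase: ErrorResourcePhase,
			Error: "Failed to create AutoOps ES API key: connection refused",
		},
		"staging/es-cluster-2": {
			Phase:   SkippedResourcePhase,
			Message: "RBAC access denied for service account autoops-sa",
		},
	}
	for key, rs := range details {
		fmt.Println(key, encode(rs))
	}
}
```

With omitempty on Message and Error, an error entry serializes without a message field and a skipped entry without an error field, matching the shape of the details example.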

Other Changes

Renamed Method

  • CalculateFinalPhase() → Finalize(): Now also generates the human-readable Message and ReadyCount fields at the end of reconciliation

Enhanced Error Tracking

Replaced generic MarkResourceError() calls with specific error methods that capture the failing operation:

  • ResourceRBACError(es): RBAC access denied
  • ResourceError(es, message, err): Captures specific failures for CA secret, API key, config map, and deployment operations

kubectl get autoopsagentpolicies

Examples:

NAMESPACE   NAME                  READY   PHASE                        MESSAGE   AGE
e2e-venus   autoops-policy-gx76   0/1     MonitoredResourcesNotReady             3s

---

NAMESPACE   NAME                  READY   PHASE                   MESSAGE   AGE
e2e-venus   autoops-policy-gx76   0/1     AutoOpsAgentsNotReady             82s

---

NAMESPACE   NAME                  READY   PHASE   MESSAGE            AGE
e2e-venus   autoops-policy-gx76   1/1     Ready   1 resource ready   109s

---

NAMESPACE   NAME                  READY   PHASE                  MESSAGE   AGE
e2e-venus   autoops-policy-gx76   0/0     NoMonitoredResources             114s

@moukoublen moukoublen linked an issue Feb 3, 2026 that may be closed by this pull request

prodsecmachine commented Feb 3, 2026

Snyk checks have passed. No issues have been found so far.

Scanner                Critical   High   Medium   Low   Total (0)
Open Source Security   0          0      0        0     0 issues
Licenses               0          0      0        0     0 issues

Labels

>enhancement Enhancement of existing functionality v3.4.0 (next next)
