Improve the deletion process of the last_error_event from the error history of a machine #599

@simcod

The last_error_event of a machine should be cleared from its error history after some time (about 6 days).
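To make the request concrete, here is a minimal sketch of the expiry rule I have in mind. It is not taken from metal-api: the function name, the parameters, and the exact 6-day window are assumptions for illustration only, and the real fix would have to live wherever the error events are stored.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed retention window; the issue only asks for "about 6 days".
ERROR_EVENT_TTL = timedelta(days=6)


def should_clear_last_error_event(
    last_error_time: Optional[datetime],
    now: datetime,
    ttl: timedelta = ERROR_EVENT_TTL,
) -> bool:
    """Return True if a stored last_error_event is old enough to be cleared.

    Hypothetical helper: last_error_time is the timestamp of the stored
    error event, or None if the machine has no error event.
    """
    if last_error_time is None:
        # Nothing stored, so there is nothing to clear.
        return False
    return now - last_error_time >= ttl


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(should_clear_last_error_event(now - timedelta(days=7), now))
    print(should_clear_last_error_event(now - timedelta(hours=3), now))
```

Passing `now` in explicitly keeps the check easy to test and leaves the choice of clock to the caller.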

While deploying metal-stack on the new Supermicro nodes, we encountered the following problem: machines that were already allocated (and integrated into a Kubernetes cluster) showed a last_error_event of : unexpectedly received in state pxe booting.

metal-api-liveliness is running in the metal-control-plane namespace. Its logs do not show any errors for machines:

{... "msg":"machine liveliness was requested"}
{... "msg":"machine liveliness evaluated","alive":x,"dead":0,"unknown":0,"errors":0}

However, listing the machines with metalctl machine ls returns some allocated machines with a ⭕ crashloop issue.
