Add retries for publishing metrics & health checks #4105

@strategicpause

Summary

This is a request to add retries for cases where the agent fails to publish metrics or health check messages to TACS.

Description

In my logs I see cases where the ECS agent emits the message "Error publishing metrics". From looking at the code, tcsClientServer.publishMessages reads metrics and health metrics from a channel and logs an error if they could not be published. As a result, whenever sending a message to TACS fails, the metrics or health checks in that message are not reported to TACS. For example, this can occur when the server closes the WS connection, which results in the client initiating a new connection.

Expected Behavior

I would expect some kind of retry mechanism that attempts to resend the metrics or health checks over the connection. I don't see any retry logic further down the stack either, e.g. in ClientServerImpl.MakeRequest.

Observed Behavior

The following log line:

05:20:14.273 | {"level":"warn","time":"2024-03-02T05:20:14.032","msg":"Error publishing metrics","error":"websocket: close sent"}

Environment Details

Running on AL2 with kernel 5.10

Supporting Log Snippets

See above.
