@papazach papazach commented Feb 13, 2025

This PR introduces two new enums.

1. AgentExitReason

The values are taken from the EXIT_REASON enum definition.

A new array field called exit_reasons is added to the UpdateAgentConnection proto to let the Cloud know about all the reasons that led to the latest Agent shutdown/disconnection.

Those values will be added to relevant feed events to facilitate investigations (we could also surface them in the UI if needed).

2. NodeInstanceConnectivityReason

This enum aims to notify Cloud about important node instance disconnection reasons and enable it to make smarter decisions. Note that these improvements are highly sought after by our users.

So apart from the NODE_INSTANCE_CONNECTIVITY_REASON_UNSPECIFIED and NODE_INSTANCE_CONNECTIVITY_REASON_ONLINE, the following two values (that are about disconnections) are added:

  • NODE_INSTANCE_CONNECTIVITY_REASON_NO_RETENTION: This will enable Cloud to immediately remove node instances that are rotated out of the Agent's DB. Currently those get disconnected and marked as pruned, and 60 days need to go by until the Cloud can remove them. In the meantime users see them as offline.

  • NODE_INSTANCE_CONNECTIVITY_REASON_AGENT_UPDATE: This will enable Cloud to not send reachability notifications when node instances go offline due to Agent updates.

This enum is added to UpdateNodeInstanceConnection messages.
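
As an illustration of the behaviour described above, here is a minimal Python sketch of how Cloud might branch on the reason. It is not the actual Cloud implementation: `plan_cloud_actions` and the action names are hypothetical, the enum members drop the `NODE_INSTANCE_CONNECTIVITY_REASON_` prefix for brevity, and only the numbers 0 and 1 come from the diff (2 and 3 are assumed).

```python
from enum import IntEnum


class NodeInstanceConnectivityReason(IntEnum):
    # 0 and 1 match the diff under review; 2 and 3 are assumed numbering.
    UNSPECIFIED = 0
    ONLINE = 1
    NO_RETENTION = 2
    AGENT_UPDATE = 3


def plan_cloud_actions(liveness: bool, reason: NodeInstanceConnectivityReason) -> list:
    """Return the (hypothetical) actions Cloud would take for one update."""
    if liveness:
        return ["mark_online"]
    if reason == NodeInstanceConnectivityReason.NO_RETENTION:
        # Rotated out of the Agent's DB: remove now instead of waiting 60 days.
        return ["remove_node_instance"]
    if reason == NodeInstanceConnectivityReason.AGENT_UPDATE:
        # Expected downtime: go offline without a reachability notification.
        return ["mark_offline"]
    # Any other disconnection keeps today's behaviour.
    return ["mark_offline", "send_reachability_notification"]
```

With this shape, an unknown or unspecified reason falls through to the existing behaviour, so Agents that do not send the field would be handled as they are today.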

@papazach papazach changed the title from "WIP: introduce ExitReason enum and use it in Agent/Node Instance connectivity update messages" to "WIP: introduce AgentExitReason and NodeInstanceConnectivityReason enums and use them in conn update messages" Feb 24, 2025
@papazach papazach requested review from a team, ktsaou and stelfrag February 24, 2025 12:38
@papazach papazach marked this pull request as ready for review February 28, 2025 12:12
enum NodeInstanceConnectivityReason {
  // NODE_INSTANCE_CONNECTIVITY_REASON_UNSPECIFIED acts as default value
  NODE_INSTANCE_CONNECTIVITY_REASON_UNSPECIFIED = 0;
  NODE_INSTANCE_CONNECTIVITY_REASON_ONLINE = 1;
Contributor

What's the use case for this reason? We already have liveness property on the message.

Contributor Author

Yep, we could remove it.

The thinking behind this was that this enum is about Connectivity Reasons, not disconnection reasons. So apart from the default (unspecified case), I added the Online reason to be used when UpdateNodeInstanceConnection has liveness true.

I think that if we were to remove it we could rename this as NodeInstanceDisconnectionReason and have only:

  • Unspecified
  • No Retention
    and
  • Agent Update

cases. Wdyt @car12o

Contributor

@car12o car12o Jun 13, 2025

I also think about it as Connectivity Reasons which should include connections and disconnections.

> I added the Online reason to be used when UpdateNodeInstanceConnection has liveness true.

I think it makes sense, but wouldn't we also need Offline for when a node disconnects and it's neither No Retention nor Agent Update? Perhaps instead of NODE_INSTANCE_CONNECTIVITY_REASON_ONLINE we could have something like NODE_INSTANCE_CONNECTIVITY_REASON_NORMAL, which could be used when a node gracefully connects or disconnects and the reason is neither No Retention nor Agent Update.

wdyt?

Contributor Author

Hmm, this is the point where ConnectionUpdateSource comes into play and further complicates things, because CONNECTION_UPDATE_SOURCE_AGENT also means that this was a normal connection/disconnection.

Contributor

Couldn't this be mapped to NODE_INSTANCE_CONNECTIVITY_REASON_NORMAL or do we need a specific Connectivity Reason for that?

Member

@juacker juacker Jun 13, 2025

> I think that if we were to remove it we could rename this as NodeInstanceDisconnectionReason and have only:
>
>   • Unspecified
>   • No Retention
>     and
>   • Agent Update

I agree on keeping only these three. I'd initially add only the ones we have use cases for, keeping UNSPECIFIED for other situations since it's the default value anyway.

Regarding the NodeInstanceConnectivityReason name, I think it's OK; I understand it can apply to both connections and disconnections.

Contributor Author

Sorry, I forgot to answer this.
Sounds good to me, guys. I removed the online case and kept the other 3 :)

Contributor

I'm ok with having just these 3 but personally I would consider UNSPECIFIED != NORMAL.
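
The UNSPECIFIED != NORMAL concern can be illustrated with a small Python sketch. It assumes the reason is a plain proto3 enum field without explicit presence, so an unset field reads back as the zero value. The dataclass is a simplified, hypothetical stand-in for UpdateNodeInstanceConnection (only two fields are modelled), and the numbering of the two non-default values is assumed.

```python
from dataclasses import dataclass
from enum import IntEnum


class NodeInstanceConnectivityReason(IntEnum):
    UNSPECIFIED = 0   # default value, as in the diff
    NO_RETENTION = 1  # assumed numbering
    AGENT_UPDATE = 2  # assumed numbering


@dataclass
class UpdateNodeInstanceConnection:
    # Simplified stand-in for the real message; only two fields are modelled.
    liveness: bool
    reason: NodeInstanceConnectivityReason = NodeInstanceConnectivityReason.UNSPECIFIED


# An Agent that predates the field never sets it ...
from_old_agent = UpdateNodeInstanceConnection(liveness=False)
# ... and a newer Agent reports a plain disconnection with the default value.
plain_disconnect = UpdateNodeInstanceConnection(
    liveness=False, reason=NodeInstanceConnectivityReason.UNSPECIFIED
)

# Cloud receives identical messages and cannot tell the two cases apart.
indistinguishable = from_old_agent == plain_disconnect
```

If Cloud ever needs to distinguish "reason not reported" from "nothing special happened", a dedicated NORMAL value (as suggested earlier in the thread) would be one way to do it.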

5 participants