Skip to content

Conversation

@aryanjassal
Copy link
Contributor

@aryanjassal aryanjassal commented Apr 1, 2025

Description

Sometimes, it is possible for a system to finish authentication before another system, and it can start sending RPC messages before the other system has fully processed the authentication. This will crash the program. This PR aims to resolve the crash.

Issues Fixed

Tasks

  • 1. Investigate three-step authentication and duplex rework approaches
  • 2. Rework network authentication to apply these changes

Final checklist

  • Domain specific tests
  • Full tests
  • Updated inline-comment documentation
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build

@aryanjassal aryanjassal self-assigned this Apr 1, 2025
@linear
Copy link

linear bot commented Apr 1, 2025

ENG-560

@aryanjassal
Copy link
Contributor Author

So far I have made some changes which should help the resources to be freed upon an abort. Basically, this should again help the agent to shut down without holding itself open for too long. I will merge these changes alongside the ones in the issue.

@aryanjassal
Copy link
Contributor Author

Theoretically, reworking network authentication to be duplex streams seems to be the more reliable approach here. I will rework the network authentication step for this.

Sometimes, if the authentication is rejected, then the node will not re-attempt any connections at all. There needs to be retry limit and maybe some nodes like the seednode can have a perpetual re-authentication attempt. I will make an issue tracking this spec, and after the rework, might just implement it in this PR as well.

@CMCDragonkai
Copy link
Member

We are still using unary calls?

@aryanjassal
Copy link
Contributor Author

I'm still reading and understanding the code, so my understanding might not be complete yet.

Yes. At the time of adding a new connection, both nodes make a RPC call to the other node with a unary call to the authentication handler, which triggers the reverse authentication on the receiver node. If both nodes return a successful reverse authentication response, then the connection becomes authenticated.

image

The trigger for forward authentication happens simultaneously when the connection gets added. However, they both can end at different times. This can lead to a race condition where the connection is considered authenticated from a node, and sends RPC requests to the node, but the other node has not yet completed the authorisation, leading to this issue.

With the duplex streaming rework, this would be more like a handshake instead, and subsequent code will only run after the authentication has either been successful or has failed.

@CMCDragonkai
Copy link
Member

You should review your understanding with @tegefaulkes first.

@aryanjassal aryanjassal force-pushed the feature-eng-560-leaked-rpc-authentication branch 2 times, most recently from 411b3c6 to de01723 Compare April 8, 2025 01:51
@aryanjassal
Copy link
Contributor Author

image

This is my plan on implementing the duplex stream for authentication. In the RPC method for authentication, the connecting side will call this. It will give a forward auth message token, which the other node will apply the reverse auth to verify the authorisation. Then, after accepting, the node will return another token from the other node which will be reverse-authenticated by the current node. Once both nodes have returned a success authentication result, then the connection will be authenticated.

This prevents any race conditions which can happen from state desync, as the RPC method can only resolve once both the sides have authenticated their connections.

@CMCDragonkai
Copy link
Member

image

This is my plan on implementing the duplex stream for authentication. In the RPC method for authentication, the connecting side will call this. It will give a forward auth message token, which the other node will apply the reverse auth to verify the authorisation. Then, after accepting, the node will return another token from the other node which will be reverse-authenticated by the current node. Once both nodes have returned a success authentication result, then the connection will be authenticated.

This prevents any race conditions which can happen from state desync, as the RPC method can only resolve once both the sides have authenticated their connections.

Can this be done more optimally? Is there a more efficient way to achieve this with minimal back and forth? Is there some crypto tricks here to use - including taking advantage of context of the comms?

@aryanjassal
Copy link
Contributor Author

Can this be done more optimally? Is there a more efficient way to achieve this with minimal back and forth? Is there some crypto tricks here to use - including taking advantage of context of the comms?

What do you mean by crypto tricks? I asked brian and even he was unsure about how we can do this more optimally. At minimum, we need the following steps in the caller side:

  • Send forward auth message
  • Send reverse auth message
  • Ack and close

The RPC handler also does similar steps.

  • Send reverse auth message
  • Send forward auth message
  • Ack and close

This is the minimum requirement of this communication. How else would you propose we handle this?

@aryanjassal
Copy link
Contributor Author

aryanjassal commented Apr 11, 2025

https://github.com/MatrixAI/Polykey/actions/runs/14396425288/job/40372816693

@brynblack the linting job fails here, but the workflow still passes. This shouldn't happen.

@aryanjassal aryanjassal requested a review from tegefaulkes April 11, 2025 05:59
@aryanjassal
Copy link
Contributor Author

This PR is otherwise done. I have asked Brian to do a final review before merging it.

@aryanjassal aryanjassal marked this pull request as ready for review April 14, 2025 02:10
@CMCDragonkai
Copy link
Member

Is this before or after esm migration?

@aryanjassal
Copy link
Contributor Author

Is this before or after esm migration?

Do you mean this issue or the rpc version bump? This issue was encountered prior to ESM migration. The RPC version bump is after ESM.

Copy link
Contributor

@tegefaulkes tegefaulkes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small stuff but otherwise good.

chore: rebased onto staging for esm

fix: build

wip: cleaning up duplex auth handler
[ci skip]
chore: updated documentation

fix: lint

fix: import order

fix: all tests now passing

fix: auth handler hanging until timeout if errored
@aryanjassal aryanjassal force-pushed the feature-eng-560-leaked-rpc-authentication branch from 2ade4ec to 7d4506e Compare April 17, 2025 04:40
@aryanjassal
Copy link
Contributor Author

I have addressed all review comments and added all the required features. I have also rebased it to latest staging, so this PR is now basically done. After the checks pass, this can be merged.

@aryanjassal aryanjassal merged commit 76cedfb into staging Apr 17, 2025
36 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

Leaked RPC authentication error in polykey agent

4 participants