@samlurye samlurye commented Aug 7, 2025

Summary:
The purpose of this diff is to handle the following scenario:

  1. Process A starts serving a NetRx.
  2. Process B creates a NetTx that connects to process A's NetRx.
  3. B sends a few messages to A, and the messages are acked.
  4. Process A dies/is killed, while B stays alive.
  5. A new Process C starts serving a NetRx on the same channel as from step 1.
  6. B's NetTx connects to C's NetRx, with no way of knowing it has connected to a different process than before.
  7. B sends messages to C, starting from where it left off with A.
  8. C rejects all of B's messages because of invalid sequence numbers.
  9. B's NetTx eventually times out after a long time with no acks.
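The failure in steps 6 through 8 can be reproduced with a small model. This is only a sketch: `RxState`, `TxState`, and `RxReply` are hypothetical names, and the strict "accept only the next expected sequence number" rule is an assumption made for illustration, not a description of the actual NetRx logic.

```rust
/// Hypothetical receiver state: the next sequence number it will accept.
/// A freshly started receiver (process C) begins again at 0.
#[derive(Debug, Default)]
struct RxState {
    next_seq: u64,
}

#[derive(Debug, PartialEq, Eq)]
enum RxReply {
    Ack(u64),
    Reject { expected: u64, got: u64 },
}

impl RxState {
    /// Assumed rule: accept only the exact sequence number expected next.
    fn receive(&mut self, seq: u64) -> RxReply {
        if seq == self.next_seq {
            self.next_seq += 1;
            RxReply::Ack(seq)
        } else {
            RxReply::Reject { expected: self.next_seq, got: seq }
        }
    }
}

/// Hypothetical sender state: it keeps its own counter across reconnects.
#[derive(Debug, Default)]
struct TxState {
    next_seq: u64,
}

impl TxState {
    /// Send the next message; advance the counter only when it is acked.
    fn send(&mut self, rx: &mut RxState) -> RxReply {
        let reply = rx.receive(self.next_seq);
        if let RxReply::Ack(_) = reply {
            self.next_seq += 1;
        }
        reply
    }
}
```

In this model, a sender that has two acked messages with receiver A and then talks to a default-constructed receiver C keeps offering sequence number 2 while C keeps expecting 0, so every send is rejected.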

To distinguish connections from different NetTx instances to the same NetRx instance, each NetTx generates a random unique session id. The NetTx sends this session id to the NetRx in an initial handshake before it starts sending normal messages.
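As a rough illustration of that existing tx-side handshake, the sketch below models a sender that announces its session id at the start of every connection. The `Frame` enum, the `Tx` struct, and the use of `RandomState` as a dependency-free source of a random `u64` are all assumptions for this example; the real wire format and id type are not shown in this PR description.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

/// Hypothetical frames sent from NetTx to NetRx.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Frame {
    /// First frame on every connection: announces the sender's session id.
    Init { tx_session_id: u64 },
    /// A normal message with its sequence number.
    Message { seq: u64, payload: Vec<u8> },
}

/// Stand-in for "random unique": RandomState is seeded randomly by std,
/// so hashing nothing yields an unpredictable u64 without extra crates.
fn new_session_id() -> u64 {
    RandomState::new().build_hasher().finish()
}

/// Hypothetical sender: one session id for its whole lifetime.
struct Tx {
    session_id: u64,
    next_seq: u64,
}

impl Tx {
    fn new() -> Self {
        Tx { session_id: new_session_id(), next_seq: 0 }
    }

    /// Called on every (re)connect; the session id does not change.
    fn on_connect(&self) -> Frame {
        Frame::Init { tx_session_id: self.session_id }
    }

    /// Wrap a payload in the next sequence-numbered message frame.
    fn next_message(&mut self, payload: Vec<u8>) -> Frame {
        let seq = self.next_seq;
        self.next_seq += 1;
        Frame::Message { seq, payload }
    }
}
```

Because the session id is fixed per `Tx` and resent on each reconnect, a receiver can tell a reconnecting sender from a new one. Nothing in this direction tells the sender whether the receiver has changed.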

Currently, though, NetTx doesn't wait for a handshake response before it starts sending messages, so it has no opportunity to learn which process it is talking to. To resolve the issue described above, this diff introduces a global (per-process) "rx session id". When a NetTx connects to a NetRx, the NetRx responds with its rx session id as part of the handshake, and the NetTx waits for that response and extracts the rx session id. On its first connection, the NetTx stores the rx session id. On each subsequent connection attempt, the NetTx validates the rx session id it receives against the one it previously stored; if there is a mismatch, the NetTx returns the appropriate error to its caller.
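The store-then-validate rule can be sketched as follows. The type and function names (`TxHandshakeState`, `on_handshake_response`, `HandshakeError::RxSessionMismatch`) and the `u64` id type are hypothetical; the PR description only says that "the appropriate error" is returned on a mismatch.

```rust
/// Hypothetical error returned to the NetTx caller on a mismatch.
#[derive(Debug, PartialEq, Eq)]
enum HandshakeError {
    RxSessionMismatch { expected: u64, got: u64 },
}

/// Hypothetical tx-side state: the rx session id seen on the first connect.
#[derive(Debug, Default)]
struct TxHandshakeState {
    known_rx_session_id: Option<u64>,
}

impl TxHandshakeState {
    /// Called with the rx session id extracted from the handshake response.
    fn on_handshake_response(&mut self, rx_session_id: u64) -> Result<(), HandshakeError> {
        match self.known_rx_session_id {
            // First connection: remember which receiver we are talking to.
            None => {
                self.known_rx_session_id = Some(rx_session_id);
                Ok(())
            }
            // Reconnected to the same receiver process: safe to continue.
            Some(expected) if expected == rx_session_id => Ok(()),
            // A different process now serves the channel: fail fast.
            Some(expected) => Err(HandshakeError::RxSessionMismatch {
                expected,
                got: rx_session_id,
            }),
        }
    }
}
```

With a check like this, the scenario above would end at step 6 with an immediate error rather than at step 9 with a timeout. The stored id is left unchanged on a mismatch in this sketch, which is a design choice of the example rather than something the description specifies.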

Differential Revision: D79607092

@meta-cla meta-cla bot added the CLA Signed label Aug 7, 2025
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D79607092

samlurye added a commit to samlurye/monarch-1 that referenced this pull request Aug 7, 2025
…eta-pytorch#793)

samlurye added a commit to samlurye/monarch-1 that referenced this pull request Aug 7, 2025
…eta-pytorch#793)

samlurye added a commit to samlurye/monarch-1 that referenced this pull request Aug 7, 2025
…eta-pytorch#793)

samlurye added a commit to samlurye/monarch-1 that referenced this pull request Aug 8, 2025
…eta-pytorch#793)

samlurye added a commit to samlurye/monarch-1 that referenced this pull request Aug 8, 2025
…eta-pytorch#793)

samlurye added a commit to samlurye/monarch-1 that referenced this pull request Aug 8, 2025
…eta-pytorch#793)

samlurye added a commit to samlurye/monarch-1 that referenced this pull request Aug 8, 2025
…eta-pytorch#793)


samlurye added a commit to samlurye/monarch-1 that referenced this pull request Aug 8, 2025
…eta-pytorch#793)

Summary:
Pull Request resolved: meta-pytorch#793


samlurye added a commit to samlurye/monarch-1 that referenced this pull request Aug 8, 2025
…eta-pytorch#793)

@samlurye samlurye force-pushed the export-D79607092 branch 2 times, most recently from c817b33 to 2241ac2 Compare August 12, 2025 19:18
samlurye added a commit to samlurye/monarch-1 that referenced this pull request Aug 12, 2025
…eta-pytorch#793)

samlurye added a commit to samlurye/monarch-1 that referenced this pull request Aug 12, 2025
…eta-pytorch#793)

samlurye added a commit to samlurye/monarch-1 that referenced this pull request Aug 12, 2025
…eta-pytorch#793)

samlurye added a commit to samlurye/monarch-1 that referenced this pull request Aug 12, 2025
…eta-pytorch#793)


samlurye added a commit to samlurye/monarch-1 that referenced this pull request Aug 12, 2025
…eta-pytorch#793)


samlurye added a commit to samlurye/monarch-1 that referenced this pull request Aug 12, 2025
…eta-pytorch#793)

samlurye added a commit to samlurye/monarch-1 that referenced this pull request Aug 13, 2025
…eta-pytorch#793)

Reviewed By: mariusae

Differential Revision: D79607092
samlurye added a commit to samlurye/monarch-1 that referenced this pull request Aug 13, 2025
…eta-pytorch#793)

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79607092

samlurye added a commit to samlurye/monarch-1 that referenced this pull request Aug 13, 2025
…eta-pytorch#793)

Summary:
Pull Request resolved: meta-pytorch#793


samlurye added a commit to samlurye/monarch-1 that referenced this pull request Aug 13, 2025
…eta-pytorch#793)

@samlurye samlurye force-pushed the export-D79607092 branch 2 times, most recently from 31d9002 to 1c87225 Compare August 19, 2025 19:40
samlurye added a commit to samlurye/monarch-1 that referenced this pull request Aug 19, 2025
…eta-pytorch#793)

samlurye added a commit to samlurye/monarch-1 that referenced this pull request Aug 19, 2025
…eta-pytorch#793)

samlurye added a commit to samlurye/monarch-1 that referenced this pull request Aug 19, 2025
…eta-pytorch#793)


samlurye added a commit to samlurye/monarch-1 that referenced this pull request Aug 19, 2025
…eta-pytorch#793)

samlurye added a commit to samlurye/monarch-1 that referenced this pull request Aug 19, 2025
…eta-pytorch#793)

