Mesh slice supervision #824

Closed

thomasywang wants to merge 1 commit into meta-pytorch:main from thomasywang:export-D79821712

+167 −21

Contributor

thomasywang commented Aug 12, 2025

Summary:
Suppose we have a ProcMesh with 4 GPUs. On this mesh we spawn an ActorMesh A and an ActorMesh B. We create a slice of ActorMesh A, SliceA_1, containing only GPU 4, and a slice SliceA_2 containing only GPU 1. If Actor A on GPU 4 crashes, we should have the following health states:

- ActorMesh A is unhealthy (contains A gpu=4)
- SliceA_1 is unhealthy (contains A gpu=4)
- SliceA_2 is healthy (does not contain A gpu=4)
- ActorMesh B is healthy (contains gpu=4 but not Actor A)
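
The four expected states can be checked with a small std-only sketch. `MeshView`, `Crash`, `is_unhealthy`, and `view` are hypothetical names made up for illustration, not types from this codebase, and GPUs are numbered 1 to 4 to match the example above.

```rust
use std::collections::BTreeSet;

/// A mesh or slice: the actor mesh it belongs to plus the ranks (GPUs) it covers.
/// Hypothetical stand-in, not one of the project's real types.
struct MeshView {
    actor_mesh: &'static str,
    ranks: BTreeSet<usize>,
}

/// A crash of one actor, identified by its actor mesh name and rank.
struct Crash {
    actor_mesh: &'static str,
    rank: usize,
}

/// A view is unhealthy only if the crashed actor belongs to the same
/// actor mesh AND its rank lies inside the view.
fn is_unhealthy(view: &MeshView, crash: &Crash) -> bool {
    view.actor_mesh == crash.actor_mesh && view.ranks.contains(&crash.rank)
}

/// Convenience constructor used to build the views in the example.
fn view(actor_mesh: &'static str, ranks: &[usize]) -> MeshView {
    MeshView {
        actor_mesh,
        ranks: ranks.iter().copied().collect(),
    }
}
```

With a crash of Actor A on rank 4, `is_unhealthy` holds for `view("A", &[1, 2, 3, 4])` and `view("A", &[4])`, and does not hold for `view("A", &[1])` or `view("B", &[1, 2, 3, 4])`.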

Implementation:

1. All supervision event streams are created when a RootActorMesh is spawned. A tx-rx pair is created and the tx is inserted into a map acting as a router. This diff changes the channel from an mpsc to a broadcast channel.
2. The spawned RootActorMesh gets a copy of the tx. For broadcast channels, new subscribers are created by calling tx.subscribe().
3. Instead of taking the event stream from a RootActorMesh, we now create a new subscriber.
4. PythonActorMeshes contain a monitor, which is just a loop that listens for the next supervision event from a stream. If a ::Crashed event comes in, we update an Arc<> that tracks the health state. This monitor now also takes in the shape of the mesh it is monitoring, and only updates the health state to ::Crashed if the crashed Actor is within the shape.
5. When a PythonActorMesh is sliced, a PythonActorMeshRef is created. We add a monitor to PythonActorMeshRefs. It is an Option<> because once the ref has been serialized and deserialized, we can no longer monitor it.
6. When we cast to a PythonActorMeshRef, we first check the health state and return a SupervisionError if the mesh is unhealthy.
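
A rough sketch of the monitor loop and the pre-cast health check, under several assumptions: it uses a plain `std::sync::mpsc` channel and a thread rather than the broadcast channel and async task described above, and `Crashed`, `SupervisionError`, `Health`, `spawn_monitor`, and `cast` are hypothetical stand-ins, not the real types or functions in this diff.

```rust
use std::collections::HashSet;
use std::sync::mpsc::Receiver;
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// Hypothetical supervision event: the rank of an actor that crashed.
#[derive(Clone, Debug, PartialEq)]
struct Crashed {
    rank: usize,
}

/// Hypothetical error returned when casting to an unhealthy mesh.
#[derive(Debug, PartialEq)]
struct SupervisionError {
    rank: usize,
}

/// Shared health state: None while healthy, Some(event) once a relevant crash is seen.
type Health = Arc<Mutex<Option<Crashed>>>;

/// Spawn a monitor loop that records a crash only if its rank is inside `shape`.
fn spawn_monitor(
    events: Receiver<Crashed>,
    shape: HashSet<usize>,
    health: Health,
) -> JoinHandle<()> {
    thread::spawn(move || {
        // The loop ends once every sender has been dropped.
        for event in events {
            if shape.contains(&event.rank) {
                *health.lock().unwrap() = Some(event);
            }
        }
    })
}

/// Check the health state first; refuse to cast to an unhealthy mesh.
fn cast(health: &Health, message: &str) -> Result<String, SupervisionError> {
    match &*health.lock().unwrap() {
        Some(crash) => Err(SupervisionError { rank: crash.rank }),
        None => Ok(format!("cast: {}", message)),
    }
}
```

A monitor whose shape is `{1}` ignores a crash on rank 4, so `cast` still succeeds; a monitor whose shape is `{4}` records it, and `cast` returns the error.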

Differential Revision: D79821712

meta-cla bot added the CLA Signed label

Contributor

facebook-github-bot commented Aug 12, 2025

This pull request was exported from Phabricator. Differential Revision: D79821712

facebook-github-bot added the fb-exported label

thomasywang added a commit to thomasywang/monarch-1 that referenced this pull request


          Mesh slice supervision (meta-pytorch#824)

ccae6a8

Summary:

Suppose we have a ProcMesh with 4 GPUs. On this mesh we spawn an ActorMesh A and an ActorMesh B. We create a slice of ActorMesh A, SliceA_1, containing only GPU 4, and a slice SliceA_2 containing only GPU 1. If Actor A on GPU 4 crashes, we should have the following health states:
- ActorMesh A is unhealthy (contains A gpu=4)
- SliceA_1 is unhealthy (contains A gpu=4)
- SliceA_2 is healthy (does not contain A gpu=4)
- ActorMesh B is healthy (contains gpu=4 but not Actor A)

Implementation:
1. All supervision event streams are created when a RootActorMesh is spawned. A tx-rx pair is created and the tx is inserted into a map acting as a router.
2. The router now holds a Vec of senders instead of a single sender. For each Actor mesh in the router, we can call bind to create another tx-rx pair. The router manages the tx and, every time there is a supervision event, sends a message through every tx registered for the given Actor mesh name. The rx is returned to be used by mesh slices.
3. The spawned RootActorMesh gets a copy of the Arc holding the router so that mesh slices can bind to it.
4. PythonActorMeshes contain a monitor, which is just a loop that listens for the next supervision event from a stream. If a ::Crashed event comes in, we update an Arc<> that tracks the health state. This monitor now also takes in the shape of the mesh it is monitoring, and only updates the health state to ::Crashed if the crashed Actor is within the shape.
5. When a PythonActorMesh is sliced, a PythonActorMeshRef is created. We add a monitor to PythonActorMeshRefs. It is an Option<> because once the ref has been serialized and deserialized, we can no longer monitor it.
6. When we cast to a PythonActorMeshRef, we first check the health state and return a SupervisionError if the mesh is unhealthy.
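
The router in step 2 might look roughly like the following std-only sketch. `Router`, `SupervisionEvent`, and `route` are hypothetical names for illustration; only `bind` is named in the summary, and its real signature is not shown there. Dropping senders whose receiver is gone is also an assumption of this sketch.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Mutex;

/// Hypothetical supervision event carrying the crashed actor's rank.
#[derive(Clone, Debug, PartialEq)]
struct SupervisionEvent {
    rank: usize,
}

/// Router keyed by actor mesh name; each mesh has one sender per bound listener.
#[derive(Default)]
struct Router {
    senders: Mutex<HashMap<String, Vec<Sender<SupervisionEvent>>>>,
}

impl Router {
    /// Create another tx-rx pair for `mesh`: the router keeps the tx,
    /// the caller (for example a mesh slice) gets the rx.
    fn bind(&self, mesh: &str) -> Receiver<SupervisionEvent> {
        let (tx, rx) = channel();
        self.senders
            .lock()
            .unwrap()
            .entry(mesh.to_string())
            .or_default()
            .push(tx);
        rx
    }

    /// Send `event` through every tx registered for `mesh`,
    /// dropping senders whose receiver has gone away.
    fn route(&self, mesh: &str, event: &SupervisionEvent) {
        if let Some(txs) = self.senders.lock().unwrap().get_mut(mesh) {
            txs.retain(|tx| tx.send(event.clone()).is_ok());
        }
    }
}
```

Wrapping the `Router` in an `Arc` and handing a clone to the root mesh, as step 3 describes, would let each slice call `bind` with the mesh name to get its own receiver.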

Differential Revision: D79821712

thomasywang force-pushed the export-D79821712 branch from 843077b to ccae6a8 Compare

August 13, 2025 20:24

Contributor

facebook-github-bot commented Aug 13, 2025

This pull request was exported from Phabricator. Differential Revision: D79821712

thomasywang added a commit to thomasywang/monarch-1 that referenced this pull request


          Mesh slice supervision (meta-pytorch#824)

c4b9a87

thomasywang force-pushed the export-D79821712 branch from ccae6a8 to c4b9a87 Compare

August 13, 2025 20:33

Contributor

facebook-github-bot commented Aug 13, 2025

This pull request was exported from Phabricator. Differential Revision: D79821712

thomasywang added a commit to thomasywang/monarch-1 that referenced this pull request


          Mesh slice supervision (meta-pytorch#824)

8f6ab58

thomasywang force-pushed the export-D79821712 branch from c4b9a87 to 8f6ab58 Compare

August 21, 2025 21:12

Contributor

facebook-github-bot commented Aug 21, 2025

This pull request was exported from Phabricator. Differential Revision: D79821712

thomasywang added a commit to thomasywang/monarch-1 that referenced this pull request


          Mesh slice supervision (meta-pytorch#824)

54d25b4

thomasywang force-pushed the export-D79821712 branch 2 times, most recently from 54d25b4 to 481689a Compare

August 22, 2025 02:15

thomasywang added a commit to thomasywang/monarch-1 that referenced this pull request


          Mesh slice supervision (meta-pytorch#824)

481689a

Contributor

facebook-github-bot commented Aug 22, 2025

This pull request was exported from Phabricator. Differential Revision: D79821712

thomasywang added a commit to thomasywang/monarch-1 that referenced this pull request


          Mesh slice supervision (meta-pytorch#824)

456db48

thomasywang force-pushed the export-D79821712 branch from 481689a to 456db48 Compare

August 22, 2025 02:18

thomasywang added a commit to thomasywang/monarch-1 that referenced this pull request


          Mesh slice supervision (meta-pytorch#824)

062e838

thomasywang force-pushed the export-D79821712 branch from 456db48 to 062e838 Compare

August 22, 2025 02:38

Contributor

facebook-github-bot commented Aug 22, 2025

This pull request was exported from Phabricator. Differential Revision: D79821712

thomasywang force-pushed the export-D79821712 branch from 062e838 to 557a383 Compare

August 22, 2025 02:42

thomasywang added a commit to thomasywang/monarch-1 that referenced this pull request


          Mesh slice supervision (meta-pytorch#824)

557a383


          Mesh slice supervision (meta-pytorch#824)

cbfd5e4

Summary:

Suppose we have a ProcMesh with 4 GPUs. On this mesh we spawn an ActorMesh A and an ActorMesh B. We create a slice of ActorMesh A, SliceA_1, containing only GPU 4, and a slice SliceA_2 containing only GPU 1. If Actor A on GPU 4 crashes, we should have the following health states:
- ActorMesh A is unhealthy (contains A gpu=4)
- SliceA_1 is unhealthy (contains A gpu=4)
- SliceA_2 is healthy (does not contain A gpu=4)
- ActorMesh B is healthy (contains gpu=4 but not Actor A)

Implementation:
1. An `Arc<DashMap<usize, ActorSupervisionEvent>>` is created to track all `Actor` crashes. This is necessary because the `UnhealthyState` only tracks the latest event. This `Arc` is called `crashed_actors`.
2. `crashed_actors` is passed into the monitor loop and updated when an `Actor` crashes.
3. Before casting to a `PythonActorMeshRef`, we check `crashed_actors` and return a `SupervisionError` for the first rank of the mesh that is found in `crashed_actors`.
4. When monitoring supervision events through the `PortListener`, the `PythonTask` loops and skips over any `ActorSupervisionEvents` that do not affect ranks inside the mesh.
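
A std-only sketch of the `crashed_actors` bookkeeping, with loud assumptions: a `Mutex<HashMap>` stands in for `DashMap` so the example has no external dependencies, the `ActorSupervisionEvent` fields are invented, and `record_crash`, `check_before_cast`, and `is_relevant` are hypothetical helper names, not functions from this diff.

```rust
use std::collections::{BTreeSet, HashMap};
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for `ActorSupervisionEvent`; the fields are assumptions.
#[derive(Clone, Debug, PartialEq)]
struct ActorSupervisionEvent {
    rank: usize,
    reason: String,
}

/// Hypothetical error type returned by a refused cast.
#[derive(Debug, PartialEq)]
struct SupervisionError(ActorSupervisionEvent);

/// Std-only stand-in for `Arc<DashMap<usize, ActorSupervisionEvent>>`.
type CrashedActors = Arc<Mutex<HashMap<usize, ActorSupervisionEvent>>>;

/// Called from the monitor loop: remember every crash, keyed by rank.
fn record_crash(crashed: &CrashedActors, event: ActorSupervisionEvent) {
    crashed.lock().unwrap().insert(event.rank, event);
}

/// Called before a cast: fail with the first crashed rank found inside the slice.
fn check_before_cast(
    crashed: &CrashedActors,
    slice: &BTreeSet<usize>,
) -> Result<(), SupervisionError> {
    let guard = crashed.lock().unwrap();
    // BTreeSet iterates in ascending order, so "first" means the lowest rank.
    let first = slice.iter().find_map(|rank| guard.get(rank)).cloned();
    match first {
        Some(event) => Err(SupervisionError(event)),
        None => Ok(()),
    }
}

/// Used while listening for events: true if the event touches a rank in the slice.
fn is_relevant(event: &ActorSupervisionEvent, slice: &BTreeSet<usize>) -> bool {
    slice.contains(&event.rank)
}
```

Keeping a map of every crash, rather than only the latest event, is what lets a slice that does not contain the most recent crash still report an earlier crash on one of its own ranks.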

Reviewed By: pzhan9

Differential Revision: D79821712

thomasywang force-pushed the export-D79821712 branch from 557a383 to cbfd5e4 Compare

August 22, 2025 19:03

Contributor

facebook-github-bot commented Aug 22, 2025

This pull request was exported from Phabricator. Differential Revision: D79821712

facebook-github-bot closed this in

04ae355

facebook-github-bot added the Merged label

Contributor

facebook-github-bot commented Aug 23, 2025

This pull request has been merged in 04ae355.

Labels

CLA Signed fb-exported Merged