Description
from monarch.actor import Actor, current_rank, endpoint
from monarch.actor import this_proc
import torch


class MyActor(Actor):
    @endpoint
    def my_function(self, chunks):
        self.chunks = chunks
        for chunk in self.chunks:
            print(list(map(str, [chunk.size(), chunk[-1]])))
        return 0


if __name__ == "__main__":
    procs = this_proc()
    chunks = []
    for _ in range(10):  # Works up until 9
        chunks.append(torch.empty(1024 * 1024 * 1024, dtype=torch.uint8))
    my_actor = procs.spawn("MyActor", MyActor)
    print(my_actor.my_function.call(chunks).get())
This fails with:
result, i = await PythonTask.select_one([self._monitor.task(), awaitable])
monarch._rust_bindings.monarch_hyperactor.supervision.SupervisionError: Actor unix:@sOfau7MSwxVhtIHq6FoE4mQb,root_client_proc_mesh_0_17XAxuyfKKRV,MyActor_16CbU1KQ6T6G[0] exited because of the following reason: <PyActorSupervisionEvent: The actor <root>.<__main__.MyActor MyActor> and all its descendants have failed.
This occurred because the actor itself failed.
The error was:
stopped>
Raising HYPERACTOR_CODEC_MAX_FRAME_LENGTH resolves this, but finding that fix took a long time and help from core developers. Could this failure be caught and surfaced as a clearer error instead of crashing the actor?
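
For anyone hitting the same failure, here is a minimal sketch of the workaround together with the kind of early check being asked for. It makes several assumptions that are not confirmed above: that HYPERACTOR_CODEC_MAX_FRAME_LENGTH is an environment variable, that its value is in bytes, and that it is read at startup. The helpers `required_frame_length` and `check_payload_fits` are hypothetical and not part of Monarch, and the 25% headroom is an arbitrary margin for serialization overhead.

```python
import os

ENV_VAR = "HYPERACTOR_CODEC_MAX_FRAME_LENGTH"  # name taken from the issue text
GIB = 1024 ** 3


def required_frame_length(payload_nbytes, headroom_percent=25):
    """Total payload bytes plus an arbitrary margin for serialization overhead."""
    total = sum(payload_nbytes)
    return total * (100 + headroom_percent) // 100


def check_payload_fits(payload_nbytes):
    """Hypothetical client-side guard: fail early with a readable message."""
    limit = os.environ.get(ENV_VAR)
    if limit is None:
        return  # the built-in default is not known here, so there is nothing to compare
    total = sum(payload_nbytes)
    if total > int(limit):
        raise ValueError(
            f"payload is {total} bytes but {ENV_VAR}={limit}; "
            f"raise {ENV_VAR} or send the data in smaller pieces"
        )


# Ten 1 GiB uint8 tensors, as in the reproduction. With real tensors each size
# would be chunk.numel() * chunk.element_size().
sizes = [GIB] * 10

# Assumption: the value is in bytes and is read from the environment at startup,
# so it is set here before monarch would be imported.
os.environ[ENV_VAR] = str(required_frame_length(sizes))
check_payload_fits(sizes)  # passes: 10 GiB fits within the 12.5 GiB limit
```

A check like this only estimates the raw tensor bytes, so the real serialized message may be somewhat larger. The point is that a size comparison before sending could produce a message naming the setting to change, rather than a bare "stopped" supervision event.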