Actor hard exits with inscrutable error when exceeding HYPERACTOR_CODEC_MAX_FRAME_LENGTH #2041

@cpuhrsch

from monarch.actor import Actor, current_rank, endpoint
from monarch.actor import this_proc
import torch

class MyActor(Actor):

    @endpoint
    def my_function(self, chunks):
        # Store the received tensors and print each one's size and last element.
        self.chunks = chunks
        for chunk in self.chunks:
            print(list(map(str, [chunk.size(), chunk[-1]])))
        return 0

if __name__ == "__main__":
    procs = this_proc()
    chunks = []
    # Ten 1 GiB uint8 tensors trigger the failure; range(9) still works.
    for _ in range(10):
        chunks.append(torch.empty(1024 * 1024 * 1024, dtype=torch.uint8))
    my_actor = procs.spawn("MyActor", MyActor)
    # Sending all chunks in a single call is what exceeds the frame limit.
    print(my_actor.my_function.call(chunks).get())

This fails with:

    result, i = await PythonTask.select_one([self._monitor.task(), awaitable])  
monarch._rust_bindings.monarch_hyperactor.supervision.SupervisionError: Actor unix:@sOfau7MSwxVhtIHq6FoE4mQb,root_client_proc_mesh_0_17XAxuyfKKRV,MyActor_16CbU1KQ6T6G[0] exited because of the following reason: <PyActorSupervisionEvent: The actor <root>.<__main__.MyActor MyActor> and all its descendants have failed.  
This occurred because the actor itself failed.  
The error was:  
 stopped>

Raising HYPERACTOR_CODEC_MAX_FRAME_LENGTH resolves the failure, but finding that fix took a long time and help from core developers, because the error message says nothing about message size. Could the oversized message be caught and surfaced as a clearer error instead of crashing the actor?
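To illustrate the kind of error that would have helped, here is a minimal sketch of a client-side pre-flight check. It is not part of Monarch: `max_frame_length` and `check_payload_fits` are hypothetical helpers, and the sketch assumes the limit is given in bytes via the `HYPERACTOR_CODEC_MAX_FRAME_LENGTH` environment variable. The 10 GiB fallback is only a guess consistent with the repro above (9 GiB works, 10 GiB fails), not a confirmed default.

```python
import os

# Assumption: 10 GiB is an illustrative fallback, not a confirmed Monarch default.
ASSUMED_DEFAULT_MAX_FRAME_LENGTH = 10 * 1024**3


def max_frame_length() -> int:
    """Return the frame limit in bytes from the environment, or the assumed fallback."""
    raw = os.environ.get("HYPERACTOR_CODEC_MAX_FRAME_LENGTH")
    return int(raw) if raw else ASSUMED_DEFAULT_MAX_FRAME_LENGTH


def check_payload_fits(payload_nbytes: int, overhead_nbytes: int = 0) -> None:
    """Raise a descriptive error if the payload would exceed the frame limit.

    overhead_nbytes is a caller-supplied estimate of serialization overhead;
    the real overhead is unknown to this sketch.
    """
    limit = max_frame_length()
    total = payload_nbytes + overhead_nbytes
    if total > limit:
        raise ValueError(
            f"Message of about {total} bytes exceeds the frame limit of "
            f"{limit} bytes. Raise HYPERACTOR_CODEC_MAX_FRAME_LENGTH or "
            f"send the data in smaller messages."
        )
```

In the repro, this could be called before `my_actor.my_function.call(chunks)` with `sum(c.numel() * c.element_size() for c in chunks)` as the payload size. A check like this only approximates what the codec enforces, so the real fix would still be for the runtime itself to report the frame-length violation.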
