0.1.0

@colin2328 released this 22 Oct 05:02

🦋 Monarch v0.1.0 — Initial Release
We’re excited to announce the first public release of Monarch, a distributed programming framework for PyTorch built around scalable actor messaging and direct memory access.
Monarch brings together ideas from actor-based concurrency, fault-tolerant supervision, and high-performance tensor communication to make distributed training simpler, more explicit, and faster.

🚀 Highlights

  1. Actor-Based Programming for PyTorch
    Define Python classes that run remotely as actors, send them messages, and coordinate distributed work using a clean, imperative API.
```python
from monarch.actor import Actor, endpoint, this_host

# Spawn one process per GPU on the local host.
training_procs = this_host().spawn_procs({"gpus": 8})

class Trainer(Actor):
    @endpoint
    def train(self, step: int): ...

# Spawn one Trainer actor per process, then message them all and wait.
trainers = training_procs.spawn("trainers", Trainer)
trainers.train.call(step=0).get()
```
  2. Scalable Messaging and Meshes
    Actors are organized into meshes — collections that support broadcast, gather, and other scalable communication primitives (see the first sketch after this list).
  3. Supervision and Fault Tolerance
    Monarch adopts supervision trees for error handling and recovery. Failures propagate predictably, allowing fine-grained restarts and robust distributed workflows (see the second sketch after this list).
  4. High-Performance RDMA Transfers
    Full RDMA integration for CPU and GPU memory via libibverbs, providing zero-copy, one-sided tensor communication across processes and hosts.
  5. Distributed Tensors
    Native support for tensors sharded across processes — enabling distributed compute without custom data movement code.
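
To make the mesh primitives concrete, here is a minimal sketch built on the same spawn pattern as the example above. The Worker actor and its endpoints are illustrative; the call adverb (message every actor in the mesh and gather the replies) appears in the example above, while the broadcast adverb (fire-and-forget to the whole mesh) is an assumption, not confirmed by this release note.

```python
from monarch.actor import Actor, endpoint, this_host

class Worker(Actor):
    @endpoint
    def square(self, x: int) -> int:
        # Runs in every actor that receives the message.
        return x * x

    @endpoint
    def set_lr(self, lr: float) -> None:
        self.lr = lr

# One process per GPU on this host; one Worker actor per process.
procs = this_host().spawn_procs({"gpus": 8})
workers = procs.spawn("workers", Worker)

# call(): message every actor in the mesh and gather one reply per actor.
replies = workers.square.call(x=3).get()

# broadcast(): send the same message to every actor without collecting
# replies (adverb name assumed for illustration).
workers.set_lr.broadcast(lr=3e-4)
```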
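
And a sketch of how supervision can surface in caller code, assuming an exception raised inside an endpoint is re-raised when the caller collects the result. The Flaky actor, the generic Exception handler, and the simple retry are illustrative, not a prescribed recovery pattern; consult the docs for the exact exception types and restart hooks.

```python
from monarch.actor import Actor, endpoint, this_host

class Flaky(Actor):
    @endpoint
    def work(self, step: int) -> int:
        if step == 0:
            # Simulate a failure inside the remote actor.
            raise RuntimeError("bad step")
        return step

procs = this_host().spawn_procs({"gpus": 8})
flaky = procs.spawn("flaky", Flaky)

try:
    flaky.work.call(step=0).get()
except Exception as err:
    # The remote failure propagates back to the caller instead of hanging
    # the job; recovery (retry, respawn, abort) is the caller's choice.
    print(f"remote actors failed: {err}")
    flaky.work.call(step=1).get()
```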

⚠️ Early Development Notice
Monarch is experimental and under active development.
Expect incomplete APIs, rapid iteration, and evolving interfaces.
We welcome contributions — please discuss significant changes or ideas via issues before submitting PRs.