Auto-track and globally shut down all Forge actors and services #357
Conversation
src/forge/controller/actor.py
Outdated
```python
This method is used by `Service` to teardown a replica.
"""
if not quiet:
    logger.info(f"Shutting down actor {getattr(actor, 'name', cls.__name__)}")
```
Adding this `quiet` check because otherwise, when we shut down a service, it would call `actor.shutdown` and print the log twice.
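The double-logging issue can be illustrated with a minimal sketch. The `Actor`/`Service` classes below are stand-ins for the real Forge classes, and the `quiet` flag mirrors the one discussed here; none of this is the actual Forge code:

```python
import logging

logger = logging.getLogger("forge.sketch")

class Actor:
    def __init__(self, name: str):
        self.name = name

    @classmethod
    def shutdown(cls, actor: "Actor", quiet: bool = False) -> None:
        # Only log when called directly; a Service passes quiet=True so
        # tearing down a replica does not print the message a second time.
        if not quiet:
            logger.info("Shutting down actor %s", getattr(actor, "name", cls.__name__))

class Service:
    def __init__(self, replicas: list[Actor]):
        self.replicas = replicas

    def shutdown(self) -> None:
        logger.info("Shutting down service with %d replica(s)", len(self.replicas))
        for replica in self.replicas:
            # quiet=True: the service-level log line above already covers these.
            Actor.shutdown(replica, quiet=True)
```

Without `quiet`, the service's own log line plus each replica's `actor.shutdown` log would report the same teardown twice.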
src/forge/controller/provisioner.py
Outdated
```python
async def track_allocation(self, alloc: Any):
    """Tracks an allocation for cleanup."""
    self._allocations.append(alloc)
```
Hmmm, I think an even simpler approach is to just track the proc meshes, right? We can just do `await proc_mesh.stop()` and I think everything inside of it should shut down neatly. Let me know if that doesn't work though.
I think it would work for an actor, since shutting down an actor is essentially stopping the proc_mesh. But for a service, it involves other operations such as stopping the replicas and the health loop.
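The difference between the two teardown paths can be sketched as follows. `ProcMesh.stop`, `Service.shutdown`, and `health_loop_running` here are illustrative stand-ins for the real Monarch/Forge APIs, not their actual signatures:

```python
import asyncio

class ProcMesh:
    """Stand-in for a Monarch proc mesh (hypothetical API)."""
    def __init__(self):
        self.stopped = False

    async def stop(self) -> None:
        self.stopped = True

class Actor:
    def __init__(self, proc_mesh: ProcMesh):
        self.proc_mesh = proc_mesh

    async def shutdown(self) -> None:
        # For a plain actor, shutdown is essentially stopping its proc mesh.
        await self.proc_mesh.stop()

class Service:
    def __init__(self, replicas: list[Actor]):
        self.replicas = replicas
        self.health_loop_running = True

    async def shutdown(self) -> None:
        # A service has state beyond a single proc mesh: the health loop
        # must be stopped and every replica torn down individually.
        self.health_loop_running = False
        for replica in self.replicas:
            await replica.shutdown()
```

This is why stopping the proc mesh alone suffices for actors but not for services.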
ok, I understand - here are a few suggestions:
- In provisioner.py, let's add in a `register_service` and `register_actor` on the Provisioner
- Also add top level `register_service` and `register_actor` functions (like `get_proc_mesh`)
- Within `.as_service(...)` and `.as_actor(...)` we call `register_service` and `register_actor` respectively
@allenwang28 Done!
src/forge/controller/provisioner.py
Outdated
```python
def register_actor(self, actor: "ForgeActor") -> None:
    """Registers a single actor allocation for cleanup."""
    from monarch._src.actor.actor_mesh import ActorMesh
```
can we do this at the toplevel? And use `from monarch.actor import ActorMesh`
I got

```
ImportError: cannot import name 'ActorMesh' from 'monarch.actor' (/home/dxie/.fbpkg_conda_envs/forge-e146614/lib/python3.10/site-packages/monarch/actor/__init__.py)
```

Sure, I can do it at the top level.
""" | ||
Shut down the underlying Service. | ||
""" | ||
logger.info(f"Shutting down service {self.actor_def.__name__}") |
since we're quietly shutting down actors/replicas, in this log here can we mention how many actors we're shutting down?
Done. Now the log only says `Shutting down 3 service(s) and 4 actor(s)...`
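The aggregated log line plus reverse-order teardown might be sketched like this. The dict-based allocations and this `shutdown_all_allocations` body are illustrative, not the actual method on Forge's `Provisioner`:

```python
class Provisioner:
    def __init__(self):
        self._allocations = []  # services and actors, in spawn order

    def register(self, alloc: dict) -> None:
        self._allocations.append(alloc)

    def shutdown_all_allocations(self) -> list:
        services = [a for a in self._allocations if a["kind"] == "service"]
        actors = [a for a in self._allocations if a["kind"] == "actor"]
        # One summary line instead of a per-replica log for each actor.
        print(f"Shutting down {len(services)} service(s) and {len(actors)} actor(s)...")
        # Tear down in reverse allocation order: last spawned, first stopped.
        order = list(reversed(self._allocations))
        self._allocations.clear()
        return order
```

Reverse order matters because later allocations may depend on earlier ones (e.g. a service spawned after the actors it talks to).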
apps/grpo/main.py
Outdated
```python
# give mlogger time to shutdown backends, otherwise they can stay running.
# TODO (felipemello) find more elegant solution
await mlogger.shutdown.call_one()
```
@DNXie @felipemello1 maybe we can just move the mlogger shutdown into the global shutdown as well?
I moved it into `shutdown()`.
thanks Danning!
```python
async def shutdown_metric_logger():
    """Shutdown the global metric logger and all its backends."""
    from forge.observability.metric_actors import get_or_create_metric_logger
```
if this doesn't need to be inline imported, can we do this at the toplevel? This should be a general thing
This will cause a circular dependency issue. This is also what Felipe did: https://github.com/meta-pytorch/forge/blob/main/src/forge/controller/provisioner.py#L325-L327
ok sg! I will just assume this to be the case moving forward
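The deferred-import pattern referenced here is a standard way to break an import cycle: the import statement runs when the function is called, after both modules have finished loading, rather than at module import time. A generic sketch, where `json` stands in for the module that would otherwise complete the cycle (in Forge it is `forge.observability.metric_actors`):

```python
def shutdown_metric_logger_sketch() -> str:
    """Resolve the dependency at call time, not at module load time.

    If this import sat at the top of the module, and that module in turn
    imported this one, Python would hit a partially-initialized module
    and raise ImportError. Inside the function body, the import only
    executes on the first call, when both modules are fully loaded.
    """
    import json  # deferred import: stands in for the cyclic dependency
    return json.dumps({"status": "shutdown"})
```

The cost is a tiny per-call lookup (cached in `sys.modules` after the first call); the benefit is that module load order no longer matters.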
Codecov Report

❌ Patch coverage is

```
@@ Coverage Diff @@
##           main     #357   +/-   ##
=======================================
  Coverage      ?   64.63%
=======================================
  Files         ?       79
  Lines         ?     7736
  Branches      ?        0
=======================================
  Hits          ?     5000
  Misses        ?     2736
  Partials      ?        0
```
Context: #360

This PR introduces centralized allocation tracking and unified shutdown for all `ForgeActor` and `ServiceInterface` instances managed by the global `Provisioner`. Actors and services are now automatically registered with the provisioner when spawned via `.as_actor()` or `.as_service()`. This enables a single, global shutdown sequence (`await shutdown()`) that gracefully terminates all registered allocations in reverse order, no manual teardown required.

Automatic Registration
- Added `register_service()` / `register_actor()` to the `Provisioner` with type checks.
- Added top-level `register_service()` and `register_actor()` functions (like `get_proc_mesh`, `stop_proc_mesh`).
- `ForgeActor.as_service()` and `.as_actor()` now automatically register their proxies after initialization.

Centralized Shutdown
- Added `Provisioner.shutdown_all_allocations()` for unified teardown.
- Gracefully shuts down all registered `ServiceInterface` and `ForgeActor` instances in reverse allocation order.
- Invoked via the global `shutdown()`.

Test:
The log after ctrl+C (cleaned version):