@DNXie DNXie commented Oct 9, 2025

Context: #360

This PR introduces centralized allocation tracking and unified shutdown for all ForgeActor and ServiceInterface instances managed by the global Provisioner.

Actors and services are now automatically registered with the provisioner when spawned via .as_actor() or .as_service().
This enables a single, global shutdown sequence (await shutdown()) that gracefully terminates all registered allocations in reverse order, with no manual teardown required.

Automatic Registration

  • Added register_service() / register_actor() to the Provisioner with type checks.
  • Added matching top-level register_service() / register_actor() helpers, for consistency with the existing get_proc_mesh and stop_proc_mesh.
  • ForgeActor.as_service() and .as_actor() now automatically register their proxies after initialization.
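
For readers following along, the registration flow in the list above can be sketched roughly as follows. This is a simplified illustration, not the code from this PR: `FakeService` and `FakeActor` are hypothetical placeholders for ServiceInterface and ForgeActor proxies, and `SketchProvisioner` only mimics the idea of type-checked register_service() / register_actor() methods.

```python
class FakeService:
    """Hypothetical stand-in for a ServiceInterface proxy."""


class FakeActor:
    """Hypothetical stand-in for a ForgeActor proxy."""


class SketchProvisioner:
    """Tracks spawned services and actors so they can be torn down later."""

    def __init__(self) -> None:
        self._registered_services: list[FakeService] = []
        self._registered_actors: list[FakeActor] = []

    def register_service(self, service: FakeService) -> None:
        # Type check: reject anything that is not a service proxy.
        if not isinstance(service, FakeService):
            raise TypeError(f"Expected FakeService, got {type(service).__name__}")
        self._registered_services.append(service)

    def register_actor(self, actor: FakeActor) -> None:
        # Type check: reject anything that is not an actor proxy.
        if not isinstance(actor, FakeActor):
            raise TypeError(f"Expected FakeActor, got {type(actor).__name__}")
        self._registered_actors.append(actor)


provisioner = SketchProvisioner()
provisioner.register_service(FakeService())
provisioner.register_actor(FakeActor())
```

In this sketch, passing a service to register_actor() raises TypeError, which is one plausible reading of "with type checks".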

Centralized Shutdown

  • Added Provisioner.shutdown_all_allocations() for unified teardown.
  • Gracefully shuts down all tracked ServiceInterface and ForgeActor instances in reverse allocation order.
  • Moved the metric logger shutdown into the centralized shutdown() function.
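
The reverse-order teardown described above might look something like the sketch below. It assumes each tracked allocation exposes an async shutdown() method; `FakeAllocation`, `shutdown_all`, and the allocation names are hypothetical and are not taken from the PR.

```python
import asyncio


class FakeAllocation:
    """Hypothetical stand-in for a registered service or actor."""

    def __init__(self, name: str, log: list[str]) -> None:
        self.name = name
        self._log = log

    async def shutdown(self) -> None:
        # A real allocation would stop replicas or a proc mesh here.
        self._log.append(self.name)


async def shutdown_all(allocations: list[FakeAllocation]) -> None:
    """Shut down allocations newest-first (reverse allocation order)."""
    for alloc in reversed(allocations):
        await alloc.shutdown()


shutdown_order: list[str] = []
allocations = [
    FakeAllocation(name, shutdown_order)
    for name in ("trainer", "policy", "reward")  # hypothetical names
]
asyncio.run(shutdown_all(allocations))
```

Tearing down in reverse means that anything allocated later, which may depend on earlier allocations, is stopped first.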

Test:

python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml

The log after Ctrl+C (cleaned version):

Shutting down...
Shutting down metric logger...
Shutting down provisioner..
Shutting down 3 service(s) and 4 actor(s)...
Health loop stopped gracefully.
Health loop stopped gracefully.
Shutdown completed successfully

@meta-cla meta-cla bot added the CLA Signed label Oct 9, 2025
@DNXie DNXie changed the title from "Auto-track and globally shut down all Forge actors and services" to "[WIP] Auto-track and globally shut down all Forge actors and services" Oct 9, 2025
    This method is used by `Service` to teardown a replica.
    """
    if not quiet:
        logger.info(f"Shutting down actor {getattr(actor, 'name', cls.__name__)}")
Member Author

Adding this quiet check because otherwise, when we shut down a service, it would call actor.shutdown and print the log twice.
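
A minimal sketch of the idea behind the quiet flag, assuming a service-level shutdown that logs once and then silences the per-actor message. The function names and signatures here are hypothetical simplifications, not the actual ForgeActor or Service methods.

```python
import logging

logger = logging.getLogger("shutdown_sketch")


def shutdown_actor(name: str, quiet: bool = False) -> None:
    """Hypothetical actor teardown; logs unless the caller asks for quiet."""
    if not quiet:
        logger.info("Shutting down actor %s", name)
    # ... real teardown work would go here ...


def shutdown_service(name: str, replica_names: list[str]) -> None:
    """Hypothetical service teardown that logs once for the whole service."""
    logger.info("Shutting down service %s", name)
    for replica in replica_names:
        # The service already logged, so suppress the per-actor message.
        shutdown_actor(replica, quiet=True)
```

Calling shutdown_actor() directly still logs, so standalone actors keep their message while replicas torn down by a service stay silent.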

@DNXie DNXie changed the title from "[WIP] Auto-track and globally shut down all Forge actors and services" to "Auto-track and globally shut down all Forge actors and services" Oct 9, 2025
@DNXie DNXie marked this pull request as ready for review October 9, 2025 19:18

async def track_allocation(self, alloc: Any):
    """Tracks an allocation for cleanup."""
    self._allocations.append(alloc)
Contributor

Hmm, I think an even simpler approach is to just track the proc meshes, right? We can just do await proc_mesh.stop(), and I think everything inside it should shut down neatly. Let me know if that doesn't work, though.

Member Author

I think it would work for actors, since shutting down an actor is essentially stopping its proc_mesh.
But shutting down a service involves other operations, such as stopping the replicas and the health loop.
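
To illustrate the distinction drawn here, the sketch below models a service whose shutdown has to stop a background health loop before stopping each replica, which is more than a single proc_mesh.stop() call. Everything in it (`FakeProcMesh`, `FakeService`, the flag-based loop) is a hypothetical simplification and not the actual Service implementation.

```python
import asyncio
from typing import Optional


class FakeProcMesh:
    """Hypothetical stand-in for a proc mesh with an async stop()."""

    def __init__(self) -> None:
        self.stopped = False

    async def stop(self) -> None:
        self.stopped = True


class FakeService:
    """Hypothetical service owning replicas plus a background health loop."""

    def __init__(self, num_replicas: int) -> None:
        self.replica_meshes = [FakeProcMesh() for _ in range(num_replicas)]
        self._shutdown_requested = False
        self._health_task: Optional[asyncio.Task] = None

    async def _health_loop(self) -> None:
        while not self._shutdown_requested:
            await asyncio.sleep(0.01)  # a real loop would check replica health

    def start(self) -> None:
        self._health_task = asyncio.create_task(self._health_loop())

    async def shutdown(self) -> None:
        # Step 1: ask the health loop to exit and wait for it to finish.
        self._shutdown_requested = True
        if self._health_task is not None:
            await self._health_task
        # Step 2: stop every replica's proc mesh.
        for mesh in self.replica_meshes:
            await mesh.stop()


async def main() -> FakeService:
    service = FakeService(num_replicas=2)
    service.start()
    await asyncio.sleep(0.02)
    await service.shutdown()
    return service


service = asyncio.run(main())
```

Stopping only the proc meshes in this model would leave the health loop task running, which is the kind of extra state the comment above refers to.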

@DNXie DNXie requested a review from allenwang28 October 10, 2025 23:45
Contributor

@allenwang28 allenwang28 left a comment

ok, I understand - here are a few suggestions:

  • In provisioner.py, let's add in a register_service and register_actor on the Provisioner
  • Also add top level register_service and register_actor functions (like get_proc_mesh)
  • Within .as_service(...) and .as_actor(...) we call register_service and register_actor respectively

Member Author
DNXie commented Oct 13, 2025

@allenwang28 Done!

@DNXie DNXie requested a review from allenwang28 October 13, 2025 18:10

def register_actor(self, actor: "ForgeActor") -> None:
    """Registers a single actor allocation for cleanup."""
    from monarch._src.actor.actor_mesh import ActorMesh
Contributor

can we do this at the toplevel? And use

from monarch.actor import ActorMesh

Member Author

@DNXie DNXie Oct 13, 2025

I got

ImportError: cannot import name 'ActorMesh' from 'monarch.actor' (/home/dxie/.fbpkg_conda_envs/forge-e146614/lib/python3.10/site-packages/monarch/actor/__init__.py)

Sure, I can do it at the top level.

    """
    Shut down the underlying Service.
    """
    logger.info(f"Shutting down service {self.actor_def.__name__}")
Contributor

since we're quietly shutting down actors/replicas, in this log here can we mention how many actors we're shutting down?

Member Author

Done. Now the log only says

Shutting down 3 service(s) and 4 actor(s)...

@DNXie DNXie requested a review from allenwang28 October 13, 2025 18:49

# give mlogger time to shutdown backends, otherwise they can stay running.
# TODO (felipemello) find more elegant solution
await mlogger.shutdown.call_one()
Contributor

@DNXie @felipemello1 maybe we can just move the mlogger shutdown into the global shutdown as well?

Member Author

@DNXie DNXie Oct 13, 2025

I moved it into shutdown().

@DNXie DNXie requested a review from allenwang28 October 13, 2025 23:03
Contributor

@allenwang28 allenwang28 left a comment

thanks Danning!


async def shutdown_metric_logger():
    """Shutdown the global metric logger and all its backends."""
    from forge.observability.metric_actors import get_or_create_metric_logger
Contributor

if this doesn't need to be inline imported, can we do this at the toplevel? This should be a general thing

Member Author

This will cause a circular dependency issue. This is also what Felipe did: https://github.com/meta-pytorch/forge/blob/main/src/forge/controller/provisioner.py#L325-L327

Contributor

ok sg! I will just assume this to be the case moving forward
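
For context on why an inline import sidesteps the circular dependency mentioned in this thread, here is a small self-contained sketch. The module names `sketch_a` and `sketch_b` and their contents are hypothetical and do not reflect the actual forge module layout; the point is only that a function-level import is resolved at call time, after both modules have finished loading.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# sketch_a imports sketch_b at the top level.
MOD_A = textwrap.dedent('''
    import sketch_b  # top-level import: sketch_a depends on sketch_b

    def get_logger_name():
        return "metric_logger"
''')

# sketch_b needs something from sketch_a, but only inside a function.
MOD_B = textwrap.dedent('''
    def shutdown_metric_logger():
        # Deferred import: resolved when the function is called, so loading
        # sketch_b never triggers a half-initialized sketch_a.
        from sketch_a import get_logger_name
        return f"stopped {get_logger_name()}"
''')

# Write both hypothetical modules to a temporary directory and make it importable.
tmp_dir = tempfile.mkdtemp()
Path(tmp_dir, "sketch_a.py").write_text(MOD_A)
Path(tmp_dir, "sketch_b.py").write_text(MOD_B)
sys.path.insert(0, tmp_dir)

sketch_b = importlib.import_module("sketch_b")
result = sketch_b.shutdown_metric_logger()
```

If sketch_b instead imported get_logger_name at the top level, importing sketch_a first would fail, because sketch_a would still be partially initialized when sketch_b asked for the name.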

@codecov-commenter

Codecov Report

❌ Patch coverage is 46.15385% with 28 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@4c14792).

Files with missing lines               Patch %   Lines
src/forge/controller/provisioner.py    39.13%    28 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #357   +/-   ##
=======================================
  Coverage        ?   64.63%           
=======================================
  Files           ?       79           
  Lines           ?     7736           
  Branches        ?        0           
=======================================
  Hits            ?     5000           
  Misses          ?     2736           
  Partials        ?        0           

@DNXie DNXie merged commit f19eb1b into meta-pytorch:main Oct 14, 2025
9 checks passed