[Service] Turns Service into an Actor and splits service into its own files #69
Merged
28 commits
dd2cfa8
initial commit for replica
d3677ba
clean up
d4f5660
phase out service for service v2
4202883
remove v2
efe1806
remove v2 from spawn
7d6b247
more minor cleanups
2054d63
Merge branch 'main' into replica
8392ae6
remove comment
41d71da
remove comment
e6519ee
initial commit of ServiceEndpoint
b9759aa
tests work
0de554b
simplify and unify replica initialization
e18c125
stop the underlying service proc
a1d1b4c
Merge branch 'replica' into service_in_proc
271b224
split out components into their own files
f142250
address comments
a2e58a9
address comments
2ccdcb1
add capacity semaphore
099b19e
merge conflict
c883e13
rebasing changes
b956737
fix test
355e97d
merge conflict
4853692
logger changes
388a217
fix sess_id kwarg
62dd3c7
Merge branch 'main' into service_in_proc
4358614
makes _call its own implementation
b3abdd6
docstring fix
087a687
add comment on serviceinterface
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,184 @@ | ||
# Copyright (c) Meta Platforms, Inc. and affiliates. | ||
# All rights reserved. | ||
# | ||
# This source code is licensed under the BSD-style license found in the | ||
# LICENSE file in the root directory of this source tree. | ||
""" | ||
Service interface and session management. | ||
|
||
This module provides the user-facing API for interacting with distributed services, | ||
including session management, context propagation, and dynamic endpoint registration. | ||
""" | ||
|
||
import contextvars | ||
import logging | ||
from dataclasses import dataclass | ||
from typing import Generic, List, ParamSpec, TypeVar | ||
|
||
from monarch._src.actor.endpoint import EndpointProperty | ||
|
||
logger = logging.getLogger(__name__) | ||
logger.setLevel(logging.DEBUG) | ||
|
||
P = ParamSpec("P") | ||
R = TypeVar("R") | ||
|
||
|
||
@dataclass | ||
class Session: | ||
"""Simple session data holder.""" | ||
|
||
session_id: str | ||
|
||
|
||
# Context variable for session state | ||
_session_context = contextvars.ContextVar("session_context") | ||
|
||
|
||
class SessionContext: | ||
""" | ||
Async context manager for stateful service sessions with automatic lifecycle management. | ||
|
||
Provides a convenient way to maintain stateful connections to replicas across multiple | ||
requests. Sessions ensure that all requests within the context are routed to the same | ||
replica, enabling stateful interactions while handling session lifecycle automatically. | ||
|
||
Example: | ||
|
||
>>> async with service.session() as session: | ||
... # All calls within this block use the same replica | ||
... result1 = await service.my_endpoint(arg1) | ||
... result2 = await service.another_endpoint(result1) | ||
|
||
""" | ||
|
||
def __init__(self, service: "ServiceInterface"): | ||
self.service = service | ||
self.session_id: str | None = None | ||
self._token = None | ||
|
||
async def __aenter__(self): | ||
"""Start a session and set context variables.""" | ||
self.session_id = await self.service.start_session() | ||
# Set context for this async task | ||
context_value = {"session_id": self.session_id} | ||
self._token = _session_context.set(context_value) | ||
return self | ||
|
||
async def __aexit__(self, exc_type, exc_val, exc_tb): | ||
"""Terminate the session and restore context.""" | ||
if self._token is not None: | ||
_session_context.reset(self._token) | ||
self._token = None | ||
if self.session_id is not None: | ||
await self.service.terminate_session(self.session_id) | ||
self.session_id = None | ||
|
||
|
||
class ServiceEndpoint(Generic[P, R]): | ||
"""An endpoint object specific to services. | ||
|
||
This loosely mimics the Endpoint APIs exposed in Monarch, with | ||
a few key differences: | ||
- Only choose and call are retained (dropping stream and call_one) | ||
- Call returns a list directly rather than a ValueMesh. | ||
|
||
These changes are made with Forge use cases in mind, but can | ||
certainly be expanded/adapted in the future. | ||
|
||
""" | ||
|
||
def __init__(self, actor_mesh, endpoint_name: str): | ||
self.actor_mesh = actor_mesh | ||
self.endpoint_name = endpoint_name | ||
|
||
async def choose( | ||
self, *args: P.args, sess_id: str | None = None, **kwargs: P.kwargs | ||
) -> R: | ||
"""Chooses a replica to call based on context and load balancing strategy.""" | ||
return await self.actor_mesh._call.call_one( | ||
sess_id, self.endpoint_name, *args, **kwargs | ||
) | ||
|
||
async def call(self, *args: P.args, **kwargs: P.kwargs) -> List[R]: | ||
"""Broadcasts a request to all healthy replicas and returns the results as a list.""" | ||
result = await self.actor_mesh._call_all.call_one( | ||
self.endpoint_name, *args, **kwargs | ||
) | ||
return result | ||
|
||
|
||
class ServiceInterface: | ||
""" | ||
A lightweight interface to a Service Actor running on a single-node mesh. | ||
|
||
This interface holds references to the proc_mesh and actor_mesh (both of size 1) | ||
and exposes its user-defined actor endpoints as ServiceEndpoint objects that | ||
route through the Service Actor's _call and _call_all endpoints. | ||
|
||
The ServiceInterface acts as the handle that is returned to end clients, | ||
providing a simple interface that makes actual calls to the Service Actor. | ||
""" | ||
|
||
def __init__(self, _proc_mesh, _service, actor_def): | ||
self._proc_mesh = _proc_mesh | ||
self._service = _service | ||
self.actor_def = actor_def | ||
|
||
# Dynamically create ServiceEndpoint objects for user's actor endpoints | ||
# Inspect the actor_def directly to find endpoints | ||
for attr_name in dir(actor_def): | ||
attr_value = getattr(actor_def, attr_name) | ||
if isinstance(attr_value, EndpointProperty): | ||
# Create a ServiceEndpoint that will route through the Service Actor | ||
endpoint = ServiceEndpoint(self._service, attr_name) | ||
setattr(self, attr_name, endpoint) | ||
|
||
# Session management methods - handled by ServiceInterface | ||
async def start_session(self) -> str: | ||
"""Starts a new session for stateful request handling.""" | ||
return await self._service.start_session.call_one() | ||
|
||
async def terminate_session(self, sess_id: str): | ||
"""Terminates an active session and cleans up associated resources.""" | ||
return await self._service.terminate_session.call_one(sess_id) | ||
|
||
def session(self) -> "SessionContext": | ||
"""Returns a context manager for session-based calls.""" | ||
return SessionContext(self) | ||
|
||
# Service control methods - forwarded to Service Actor | ||
async def stop(self): | ||
"""Stops the service gracefully.""" | ||
# First stop the service | ||
await self._service.stop.call_one() | ||
# Then stop its underlying proc | ||
await self._proc_mesh.stop() | ||
|
||
# Metrics methods - forwarded to Service Actor | ||
async def get_metrics(self): | ||
"""Get comprehensive service metrics for monitoring and analysis.""" | ||
return await self._service.get_metrics.call_one() | ||
|
||
async def get_metrics_summary(self): | ||
"""Get a summary of key metrics for monitoring and debugging.""" | ||
return await self._service.get_metrics_summary.call_one() | ||
|
||
# Testing method - forwarded to Service Actor | ||
def _get_internal_state(self): | ||
""" | ||
Get comprehensive internal state for testing purposes. | ||
|
||
Returns: | ||
dict: Complete internal state including sessions, replicas, and metrics | ||
""" | ||
return self._service._get_internal_state.call_one() | ||
|
||
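def __getattr__(self, name: str): | ||
"""Forward all other attribute access to the underlying Service Actor.""" | ||
# Guard against infinite recursion when _service is not set yet (e.g. during copy or unpickling) | ||
if name == "_service": | ||
raise AttributeError(name) | ||
# Forward everything else to the _service | ||
if hasattr(self._service, name): | ||
return getattr(self._service, name) | ||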
|
||
raise AttributeError( | ||
f"'{self.__class__.__name__}' object has no attribute '{name}'" | ||
) |
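The session handling in this file can be tried in isolation. The sketch below is a minimal, self-contained approximation and not the code from this PR: `FakeService` is a hypothetical in-memory stand-in for the Service Actor (no Monarch involved), `current_session_id` is a helper invented for illustration, and the context variable is given `default=None` so that reading it outside a session returns `None` instead of raising `LookupError`.

```python
import asyncio
import contextvars
import uuid

# Assumption: default=None is added here so .get() is safe outside a session.
_session_context = contextvars.ContextVar("session_context", default=None)


class SessionContext:
    """Async context manager that starts a session on entry and ends it on exit."""

    def __init__(self, service):
        self.service = service
        self.session_id = None
        self._token = None

    async def __aenter__(self):
        self.session_id = await self.service.start_session()
        # Remember the token so the previous context value can be restored.
        self._token = _session_context.set({"session_id": self.session_id})
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        if self._token is not None:
            _session_context.reset(self._token)
            self._token = None
        if self.session_id is not None:
            await self.service.terminate_session(self.session_id)
            self.session_id = None


class FakeService:
    """Hypothetical stand-in for the Service Actor; tracks sessions in memory."""

    def __init__(self):
        self.active = set()

    async def start_session(self) -> str:
        sess_id = str(uuid.uuid4())
        self.active.add(sess_id)
        return sess_id

    async def terminate_session(self, sess_id: str) -> None:
        self.active.discard(sess_id)

    def session(self) -> SessionContext:
        return SessionContext(self)


def current_session_id():
    """Hypothetical helper: read the session id visible to the current task."""
    ctx = _session_context.get()
    return ctx["session_id"] if ctx else None


async def demo():
    service = FakeService()
    before = current_session_id()
    async with service.session() as session:
        matches = current_session_id() == session.session_id
        active_inside = len(service.active)
    after = current_session_id()
    return before, matches, active_inside, after, len(service.active)
```

Running `asyncio.run(demo())` exercises the full lifecycle: the context variable is empty before the block, carries the session id inside it, and is restored afterwards while the stand-in service drops the session.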
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
# Copyright (c) Meta Platforms, Inc. and affiliates. | ||
# All rights reserved. | ||
# | ||
# This source code is licensed under the BSD-style license found in the | ||
# LICENSE file in the root directory of this source tree. | ||
""" | ||
Service metrics collection and aggregation. | ||
|
||
This module provides comprehensive metrics tracking for distributed services, | ||
including per-replica performance data, service-wide aggregations, and | ||
health status information. | ||
""" | ||
|
||
from dataclasses import dataclass, field | ||
from typing import Dict, List | ||
|
||
from forge.controller.replica import ReplicaMetrics | ||
|
||
|
||
# TODO - tie this into metrics logger when it exists. | ||
@dataclass | ||
class ServiceMetrics: | ||
""" | ||
Aggregated metrics collection for the entire service. | ||
|
||
Provides service-wide visibility into performance, health, and scaling metrics | ||
by aggregating data from all replica instances. | ||
|
||
Attributes: | ||
replica_metrics: Per-replica metrics indexed by replica ID | ||
total_sessions: Number of active sessions across all replicas | ||
healthy_replicas: Number of currently healthy replicas | ||
total_replicas: Total number of replicas (healthy + unhealthy) | ||
last_scale_event: Timestamp of the last scaling operation | ||
""" | ||
|
||
# Replica metrics | ||
replica_metrics: Dict[int, ReplicaMetrics] = field(default_factory=dict) | ||
# Service-level metrics | ||
total_sessions: int = 0 | ||
healthy_replicas: int = 0 | ||
total_replicas: int = 0 | ||
# Time-based metrics | ||
last_scale_event: float = 0.0 | ||
|
||
def get_total_request_rate(self, window_seconds: float = 60.0) -> float: | ||
"""Get total requests per second across all replicas.""" | ||
return sum( | ||
metrics.get_request_rate(window_seconds) | ||
for metrics in self.replica_metrics.values() | ||
) | ||
|
||
def get_avg_queue_depth(self, replicas: List) -> float: | ||
"""Get average queue depth across all healthy replicas.""" | ||
healthy_replicas = [r for r in replicas if r.healthy] | ||
if not healthy_replicas: | ||
return 0.0 | ||
total_queue_depth = sum(r.request_queue.qsize() for r in healthy_replicas) | ||
return total_queue_depth / len(healthy_replicas) | ||
|
||
def get_avg_capacity_utilization(self, replicas: List) -> float: | ||
"""Get average capacity utilization across all healthy replicas.""" | ||
healthy_replicas = [r for r in replicas if r.healthy] | ||
if not healthy_replicas: | ||
return 0.0 | ||
total_utilization = sum(r.capacity_utilization for r in healthy_replicas) | ||
return total_utilization / len(healthy_replicas) | ||
|
||
def get_sessions_per_replica(self) -> float: | ||
"""Get average sessions per replica.""" | ||
if self.total_replicas == 0: | ||
return 0.0 | ||
return self.total_sessions / self.total_replicas |
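The healthy-replica averaging used by `get_avg_queue_depth` and `get_avg_capacity_utilization` can be checked with stand-in objects. This is a rough sketch under stated assumptions, not the PR's code: `FakeReplica` is a hypothetical class exposing only the three attributes those methods read (`healthy`, `capacity_utilization`, `request_queue`), and it uses the standard library `queue.Queue` purely because it offers `qsize()`; the real replica type lives in `forge.controller.replica` and may differ.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class FakeReplica:
    """Hypothetical replica exposing only the attributes the metrics read."""

    healthy: bool
    capacity_utilization: float
    request_queue: queue.Queue = field(default_factory=queue.Queue)


def avg_queue_depth(replicas) -> float:
    """Average queue depth over healthy replicas; 0.0 when none are healthy."""
    healthy = [r for r in replicas if r.healthy]
    if not healthy:
        return 0.0
    return sum(r.request_queue.qsize() for r in healthy) / len(healthy)


def avg_capacity_utilization(replicas) -> float:
    """Average capacity utilization over healthy replicas; 0.0 when none are healthy."""
    healthy = [r for r in replicas if r.healthy]
    if not healthy:
        return 0.0
    return sum(r.capacity_utilization for r in healthy) / len(healthy)


def build_replicas():
    """Two healthy replicas (3 and 1 queued requests) and one unhealthy (10 queued)."""
    replicas = [
        FakeReplica(healthy=True, capacity_utilization=0.5),
        FakeReplica(healthy=True, capacity_utilization=1.0),
        FakeReplica(healthy=False, capacity_utilization=0.0),
    ]
    for replica, pending in zip(replicas, (3, 1, 10)):
        for i in range(pending):
            replica.request_queue.put(i)
    return replicas
```

With the replicas from `build_replicas()`, the unhealthy replica's ten queued requests are ignored, so the averages are taken over the two healthy replicas only, and an empty replica list yields `0.0` rather than a division error.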
Why is this needed?
added a comment in the docstring, but a few reasons: