-
Notifications
You must be signed in to change notification settings - Fork 165
refactor(BA-3799): unify scheduler handlers with SessionLifecycleHandler pattern #7867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
HyeockJinKim
wants to merge
23
commits into
main
Choose a base branch
from
feat/scheduler-refactoring-repository
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Contributor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 43 out of 43 changed files in this pull request and generated no comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
6eeaba8 to
1c50854
Compare
Implement the DeploymentCoordinator pattern for ScheduleCoordinator: - Add SessionLifecycleHandler ABC with declarative status methods: - target_statuses(), target_kernel_statuses() - success_status(), failure_status(), stale_status() - Add new result types in results.py: - HandlerKernelData, HandlerSessionData for handler input - SessionExecutionError, SessionExecutionResult for handler output - Add handler-specific DB methods in db_source.py: - fetch_sessions_for_handler() - update_sessions_status_bulk() - update_kernels_status_bulk() - Add process_lifecycle_schedule() to coordinator: - Iterates over scaling groups - Queries sessions based on handler's target statuses - Executes handler logic - Applies status transitions via _handle_status_transitions() This enables gradual migration of existing handlers to the new pattern. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add SessionLifecycleHandler implementations for all 4 progress handlers: - CheckPullingProgressLifecycleHandler (PREPARING/PULLING -> PREPARED) - CheckCreatingProgressLifecycleHandler (CREATING -> RUNNING, with hooks) - CheckTerminatingProgressLifecycleHandler (TERMINATING -> TERMINATED, with hooks) - CheckRunningSessionTerminationLifecycleHandler (RUNNING -> TERMINATING) - Register new lifecycle handlers in ScheduleCoordinator._init_lifecycle_handlers() - Export new handlers from handlers/__init__.py - Keep legacy handlers for backward compatibility (marked as DEPRECATED) Each lifecycle handler implements the DeploymentCoordinator pattern: - Coordinator queries sessions based on target_statuses/target_kernel_statuses - Handler executes business logic and returns successes/failures/stales - Coordinator applies status transitions via success_status/failure_status/stale_status 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add comprehensive tests for the new DeploymentCoordinator-style pattern: test_lifecycle_handlers.py: - Tests for CheckPullingProgressLifecycleHandler - Tests for CheckCreatingProgressLifecycleHandler - Tests for CheckTerminatingProgressLifecycleHandler - Tests for CheckRunningSessionTerminationLifecycleHandler - Tests for SessionExecutionResult merge/post-processing test_coordinator.py: - Tests for process_lifecycle_schedule method - Tests for _handle_status_transitions (success/failure/stale) - Tests for scaling group iteration - Tests for result merging from multiple scaling groups 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
When a lifecycle handler exists for a schedule type, process_schedule now delegates to process_lifecycle_schedule which uses the new DeploymentCoordinator pattern with scaling-group iteration and centralized status transitions. Activated handlers: - CHECK_PULLING_PROGRESS - CHECK_CREATING_PROGRESS - CHECK_TERMINATING_PROGRESS - CHECK_RUNNING_SESSION_TERMINATION 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add SessionConditions, KernelConditions, SessionOrders, KernelOrders for BatchQuerier-based query filtering - Add NoPagination class for scheduler batch operations - Implement search_sessions, search_sessions_with_kernels, search_sessions_with_kernels_and_user methods in DBSource and Repository - Add SessionSearchResult, SessionWithKernelsSearchResult, SessionWithKernelsAndUserSearchResult types - Update CheckPreconditionLifecycleHandler with repository dependency - Add trigger_image_pulling method to SessionLauncher 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add StartSessionsLifecycleHandler class with SessionLifecycleHandler interface - Add start_sessions_for_handler method to SessionLauncher - Handler queries additional session data with user info via Repository - Coordinator will handle status transitions (PREPARED → CREATING) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
…dler - Document why SessionTransitionData is used instead of SessionDataForPull - session_type field is required for hook registry lookup 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add TerminateSessionsLifecycleHandler class with SessionLifecycleHandler interface - Handler delegates to existing SessionTerminator for actual RPC calls - No status transition needed (handled by agent events) - success_status is None since kernel termination is async 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Export StartSessionsLifecycleHandler and TerminateSessionsLifecycleHandler from handlers - Register CheckPreconditionLifecycleHandler, StartSessionsLifecycleHandler, TerminateSessionsLifecycleHandler in Coordinator's lifecycle_handlers mapping - Now 7 schedule types use new lifecycle handler pattern: CHECK_PRECONDITION, CHECK_PULLING_PROGRESS, START, CHECK_CREATING_PROGRESS, TERMINATE, CHECK_TERMINATING_PROGRESS, CHECK_RUNNING_SESSION_TERMINATION 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
…ry/maintenance handlers Add SessionLifecycleHandler implementations following the DeploymentCoordinator pattern: Recovery handlers: - RetryPreparingLifecycleHandler: Retry stuck PREPARING/PULLING sessions - RetryCreatingLifecycleHandler: Retry stuck CREATING sessions Maintenance handlers: - SweepSessionsLifecycleHandler: Sweep stale PENDING sessions - SweepLostAgentKernelsLifecycleHandler: Sweep sessions with lost agents - SweepStaleKernelsLifecycleHandler: Sweep sessions with stale kernel presence Lifecycle handlers: - ScheduleSessionsLifecycleHandler: Schedule PENDING sessions per scaling group All handlers implement: - target_statuses(): Session statuses to query - target_kernel_statuses(): Kernel status filter - success_status()/failure_status()/stale_status(): Status transitions - execute(): Scaling group-specific processing - post_process(): Post-processing actions 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Register 6 new SessionLifecycleHandler implementations in Coordinator: - ScheduleSessionsLifecycleHandler (SCHEDULE) - RetryPreparingLifecycleHandler (RETRY_PREPARING) - RetryCreatingLifecycleHandler (RETRY_CREATING) - SweepSessionsLifecycleHandler (SWEEP) - SweepLostAgentKernelsLifecycleHandler (SWEEP_LOST_AGENT_KERNELS) - SweepStaleKernelsLifecycleHandler (SWEEP_STALE_KERNELS) Update handlers/__init__.py to export all new LifecycleHandler classes. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Phase 5 of ScheduleCoordinator refactoring: - Add handler-specific methods that accept pre-fetched session data - Add get_sessions_for_start_by_ids() and get_terminating_sessions_by_ids() repository methods for handlers to fetch detailed data by session IDs - Add retry_preparing_for_handler() and retry_creating_for_handler() to Launcher - Add terminate_sessions_for_handler() to Terminator - Update RetryPreparingLifecycleHandler, RetryCreatingLifecycleHandler, and TerminateSessionsLifecycleHandler to use new handler-specific methods - Export TerminatingSessionData from scheduler repository module This follows the DeploymentCoordinator pattern where: 1. Coordinator provides HandlerSessionData to handlers 2. Handler fetches detailed data using *_by_ids() repository methods 3. Handler calls underlying methods with pre-fetched data 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Modify Sweep handlers to iterate by scaling group instead of being global operations. This avoids reading too many sessions at once by processing sessions in smaller batches per scaling group. Changes: - Add get_pending_timeout_sessions_by_ids() for SweepSessions handler - Add get_terminating_kernels_with_lost_agents_by_ids() for SweepLostAgentKernels - Add get_sessions_for_transition_by_ids() for SweepStaleKernels - Add handler-specific methods to Terminator: - sweep_stale_sessions_for_handler() - sweep_lost_agent_kernels_for_handler() - sweep_stale_kernels_for_handler() - Update SweepSessionsLifecycleHandler to use Terminator - Update SweepLostAgentKernelsLifecycleHandler to use Terminator - Update SweepStaleKernelsLifecycleHandler to use Terminator - Export SweptSessionInfo from scheduler repository module Following the DeploymentCoordinator pattern: 1. Coordinator provides HandlerSessionData per scaling group 2. Handler fetches detailed data using *_by_ids() repository methods 3. Handler calls Terminator's handler-specific methods 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
…rnels Consolidate intermediate data types (HandlerSessionData, SessionTransitionData) into unified SessionWithKernels type. Add _v2 methods to all concrete hook implementations for the new type. Update handler execute signature to execute(scaling_group, sessions) and update tests accordingly. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
… methods - Remove SchedulerHandler ABC and all handler implementations - Remove Scheduler delegation methods (terminate_sessions, sweep_*, etc.) - Remove v1 hook interface (SessionTransitionData-based), keep v2 (SessionWithKernels) - Rename hook methods from *_v2 to standard names (on_transition_to_running, etc.) - Remove SessionTransitionData, KernelTransitionData and related repository methods - Clean up coordinator to use only SessionLifecycleHandler pattern All handler logic now flows through SessionLifecycleHandler interface with Coordinator managing scaling group iteration, session querying, and status transitions. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
…ataclass - Remove Scheduler class and SchedulerArgs dataclass - Add SchedulerComponents dataclass as a simple container - Add create_scheduler_components factory function - Update ScheduleCoordinator to receive SchedulerComponents directly - Update SokovanOrchestrator to use scheduler_components parameter - Update server.py to use create_default_scheduler_components - Remove unused lock_factory and valkey_schedule parameters from factory This simplifies the scheduler architecture by replacing the Scheduler class (which was just a pass-through container) with a plain dataclass. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add test coverage for ScheduleCoordinator including: - Kernel event handlers (pulling, creating, running, terminated, etc.) - Image update methods (update_kernels_to_pulling_for_image, etc.) - SchedulerTaskSpec (create_if_needed_event, create_process_event) - Task spec creation (create_task_specs method) - Lock acquisition behavior for handlers Update existing tests to work with SchedulerComponents dataclass: - Replace mock_scheduler with mock_scheduler_components - Update test fixtures to match new coordinator interface - Fix ScheduleType enum value format (underscore not hyphen) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
…ther - Refactor process_lifecycle_schedule to process each scaling group independently - Add _process_scaling_group method for per-scaling-group processing: - RecorderContext scoped per scaling group - Status transitions applied immediately per scaling group - Metrics emitted per scaling group - Post-processing executed per scaling group - Use asyncio.gather with return_exceptions=True for parallel execution - Add error logging for failed scaling groups (BaseException check) - Update tests to verify per-scaling-group behavior - Add test for parallel error handling 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Move the recorder module to sokovan/recorder for reuse across coordinators (scheduler, deployment, route). Key changes: - Make TransitionRecorder generic with EntityIdT bound to UUID - Implement non-nestable phase/step constraints with custom exceptions - Restructure to Phase/Step hierarchy using Pydantic models - Simplify API with RecorderContext.scope() and entity() context managers - Add comprehensive tests for recorder functionality 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add scope(), entity(), current_pool() API for RecorderContext lifecycle. Implement shared_phase() and shared_step() context managers for batch operations that apply to all entities. Use Python 3.10+ parenthesized context manager syntax for cleaner code. Co-Authored-By: Claude Opus 4.5 <[email protected]>
1c50854 to
38fd6af
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
resolves #NNN (BA-MMM)
Checklist: (if applicable)
ai.backend.testdocsdirectory