Skip to content

Conversation

@HyeockJinKim
Copy link
Collaborator

resolves #NNN (BA-MMM)

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • Installer updates including:
    • Fixtures for db schema changes
    • New mandatory config options
  • Update of end-to-end CLI integration tests in ai.backend.test
  • API server-client counterparts (e.g., manager API -> client SDK)
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

Copilot AI review requested due to automatic review settings January 8, 2026 04:45
@github-actions github-actions bot added size:XL 500~ LoC comp:manager Related to Manager component labels Jan 8, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 43 out of 43 changed files in this pull request and generated no comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@HyeockJinKim HyeockJinKim force-pushed the feat/scheduler-refactoring-repository branch from 6eeaba8 to 1c50854 Compare January 8, 2026 05:23
HyeockJinKim and others added 22 commits January 10, 2026 15:31
Implement the DeploymentCoordinator pattern for ScheduleCoordinator:

- Add SessionLifecycleHandler ABC with declarative status methods:
  - target_statuses(), target_kernel_statuses()
  - success_status(), failure_status(), stale_status()

- Add new result types in results.py:
  - HandlerKernelData, HandlerSessionData for handler input
  - SessionExecutionError, SessionExecutionResult for handler output

- Add handler-specific DB methods in db_source.py:
  - fetch_sessions_for_handler()
  - update_sessions_status_bulk()
  - update_kernels_status_bulk()

- Add process_lifecycle_schedule() to coordinator:
  - Iterates over scaling groups
  - Queries sessions based on handler's target statuses
  - Executes handler logic
  - Applies status transitions via _handle_status_transitions()

This enables gradual migration of existing handlers to the new pattern.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add SessionLifecycleHandler implementations for all 4 progress handlers:
  - CheckPullingProgressLifecycleHandler (PREPARING/PULLING -> PREPARED)
  - CheckCreatingProgressLifecycleHandler (CREATING -> RUNNING, with hooks)
  - CheckTerminatingProgressLifecycleHandler (TERMINATING -> TERMINATED, with hooks)
  - CheckRunningSessionTerminationLifecycleHandler (RUNNING -> TERMINATING)

- Register new lifecycle handlers in ScheduleCoordinator._init_lifecycle_handlers()
- Export new handlers from handlers/__init__.py
- Keep legacy handlers for backward compatibility (marked as DEPRECATED)

Each lifecycle handler implements the DeploymentCoordinator pattern:
- Coordinator queries sessions based on target_statuses/target_kernel_statuses
- Handler executes business logic and returns successes/failures/stales
- Coordinator applies status transitions via success_status/failure_status/stale_status

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add comprehensive tests for the new DeploymentCoordinator-style pattern:

test_lifecycle_handlers.py:
- Tests for CheckPullingProgressLifecycleHandler
- Tests for CheckCreatingProgressLifecycleHandler
- Tests for CheckTerminatingProgressLifecycleHandler
- Tests for CheckRunningSessionTerminationLifecycleHandler
- Tests for SessionExecutionResult merge/post-processing

test_coordinator.py:
- Tests for process_lifecycle_schedule method
- Tests for _handle_status_transitions (success/failure/stale)
- Tests for scaling group iteration
- Tests for result merging from multiple scaling groups

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
When a lifecycle handler exists for a schedule type, process_schedule
now delegates to process_lifecycle_schedule which uses the new
DeploymentCoordinator pattern with scaling-group iteration and
centralized status transitions.

Activated handlers:
- CHECK_PULLING_PROGRESS
- CHECK_CREATING_PROGRESS
- CHECK_TERMINATING_PROGRESS
- CHECK_RUNNING_SESSION_TERMINATION

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add SessionConditions, KernelConditions, SessionOrders, KernelOrders
  for BatchQuerier-based query filtering
- Add NoPagination class for scheduler batch operations
- Implement search_sessions, search_sessions_with_kernels,
  search_sessions_with_kernels_and_user methods in DBSource and Repository
- Add SessionSearchResult, SessionWithKernelsSearchResult,
  SessionWithKernelsAndUserSearchResult types
- Update CheckPreconditionLifecycleHandler with repository dependency
- Add trigger_image_pulling method to SessionLauncher

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add StartSessionsLifecycleHandler class with SessionLifecycleHandler interface
- Add start_sessions_for_handler method to SessionLauncher
- Handler queries additional session data with user info via Repository
- Coordinator will handle status transitions (PREPARED → CREATING)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…dler

- Document why SessionTransitionData is used instead of SessionDataForPull
- session_type field is required for hook registry lookup

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add TerminateSessionsLifecycleHandler class with SessionLifecycleHandler interface
- Handler delegates to existing SessionTerminator for actual RPC calls
- No status transition needed (handled by agent events)
- success_status is None since kernel termination is async

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Export StartSessionsLifecycleHandler and TerminateSessionsLifecycleHandler from handlers
- Register CheckPreconditionLifecycleHandler, StartSessionsLifecycleHandler,
  TerminateSessionsLifecycleHandler in Coordinator's lifecycle_handlers mapping
- Now 7 schedule types use new lifecycle handler pattern:
  CHECK_PRECONDITION, CHECK_PULLING_PROGRESS, START, CHECK_CREATING_PROGRESS,
  TERMINATE, CHECK_TERMINATING_PROGRESS, CHECK_RUNNING_SESSION_TERMINATION

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…ry/maintenance handlers

Add SessionLifecycleHandler implementations following the DeploymentCoordinator pattern:

Recovery handlers:
- RetryPreparingLifecycleHandler: Retry stuck PREPARING/PULLING sessions
- RetryCreatingLifecycleHandler: Retry stuck CREATING sessions

Maintenance handlers:
- SweepSessionsLifecycleHandler: Sweep stale PENDING sessions
- SweepLostAgentKernelsLifecycleHandler: Sweep sessions with lost agents
- SweepStaleKernelsLifecycleHandler: Sweep sessions with stale kernel presence

Lifecycle handlers:
- ScheduleSessionsLifecycleHandler: Schedule PENDING sessions per scaling group

All handlers implement:
- target_statuses(): Session statuses to query
- target_kernel_statuses(): Kernel status filter
- success_status()/failure_status()/stale_status(): Status transitions
- execute(): Scaling group-specific processing
- post_process(): Post-processing actions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Register 6 new SessionLifecycleHandler implementations in Coordinator:
- ScheduleSessionsLifecycleHandler (SCHEDULE)
- RetryPreparingLifecycleHandler (RETRY_PREPARING)
- RetryCreatingLifecycleHandler (RETRY_CREATING)
- SweepSessionsLifecycleHandler (SWEEP)
- SweepLostAgentKernelsLifecycleHandler (SWEEP_LOST_AGENT_KERNELS)
- SweepStaleKernelsLifecycleHandler (SWEEP_STALE_KERNELS)

Update handlers/__init__.py to export all new LifecycleHandler classes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Phase 5 of ScheduleCoordinator refactoring:
- Add handler-specific methods that accept pre-fetched session data
- Add get_sessions_for_start_by_ids() and get_terminating_sessions_by_ids()
  repository methods for handlers to fetch detailed data by session IDs
- Add retry_preparing_for_handler() and retry_creating_for_handler() to Launcher
- Add terminate_sessions_for_handler() to Terminator
- Update RetryPreparingLifecycleHandler, RetryCreatingLifecycleHandler, and
  TerminateSessionsLifecycleHandler to use new handler-specific methods
- Export TerminatingSessionData from scheduler repository module

This follows the DeploymentCoordinator pattern where:
1. Coordinator provides HandlerSessionData to handlers
2. Handler fetches detailed data using *_by_ids() repository methods
3. Handler calls underlying methods with pre-fetched data

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Modify Sweep handlers to iterate by scaling group instead of being
global operations. This avoids reading too many sessions at once by
processing sessions in smaller batches per scaling group.

Changes:
- Add get_pending_timeout_sessions_by_ids() for SweepSessions handler
- Add get_terminating_kernels_with_lost_agents_by_ids() for SweepLostAgentKernels
- Add get_sessions_for_transition_by_ids() for SweepStaleKernels
- Add handler-specific methods to Terminator:
  - sweep_stale_sessions_for_handler()
  - sweep_lost_agent_kernels_for_handler()
  - sweep_stale_kernels_for_handler()
- Update SweepSessionsLifecycleHandler to use Terminator
- Update SweepLostAgentKernelsLifecycleHandler to use Terminator
- Update SweepStaleKernelsLifecycleHandler to use Terminator
- Export SweptSessionInfo from scheduler repository module

Following the DeploymentCoordinator pattern:
1. Coordinator provides HandlerSessionData per scaling group
2. Handler fetches detailed data using *_by_ids() repository methods
3. Handler calls Terminator's handler-specific methods

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…rnels

Consolidate intermediate data types (HandlerSessionData, SessionTransitionData)
into unified SessionWithKernels type. Add _v2 methods to all concrete hook
implementations for the new type. Update handler execute signature to
execute(scaling_group, sessions) and update tests accordingly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
… methods

- Remove SchedulerHandler ABC and all handler implementations
- Remove Scheduler delegation methods (terminate_sessions, sweep_*, etc.)
- Remove v1 hook interface (SessionTransitionData-based), keep v2 (SessionWithKernels)
- Rename hook methods from *_v2 to standard names (on_transition_to_running, etc.)
- Remove SessionTransitionData, KernelTransitionData and related repository methods
- Clean up coordinator to use only SessionLifecycleHandler pattern

All handler logic now flows through SessionLifecycleHandler interface with
Coordinator managing scaling group iteration, session querying, and status transitions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…ataclass

- Remove Scheduler class and SchedulerArgs dataclass
- Add SchedulerComponents dataclass as a simple container
- Add create_scheduler_components factory function
- Update ScheduleCoordinator to receive SchedulerComponents directly
- Update SokovanOrchestrator to use scheduler_components parameter
- Update server.py to use create_default_scheduler_components
- Remove unused lock_factory and valkey_schedule parameters from factory

This simplifies the scheduler architecture by replacing the Scheduler class
(which was just a pass-through container) with a plain dataclass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add test coverage for ScheduleCoordinator including:
- Kernel event handlers (pulling, creating, running, terminated, etc.)
- Image update methods (update_kernels_to_pulling_for_image, etc.)
- SchedulerTaskSpec (create_if_needed_event, create_process_event)
- Task spec creation (create_task_specs method)
- Lock acquisition behavior for handlers

Update existing tests to work with SchedulerComponents dataclass:
- Replace mock_scheduler with mock_scheduler_components
- Update test fixtures to match new coordinator interface
- Fix ScheduleType enum value format (underscore not hyphen)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…ther

- Refactor process_lifecycle_schedule to process each scaling group independently
- Add _process_scaling_group method for per-scaling-group processing:
  - RecorderContext scoped per scaling group
  - Status transitions applied immediately per scaling group
  - Metrics emitted per scaling group
  - Post-processing executed per scaling group
- Use asyncio.gather with return_exceptions=True for parallel execution
- Add error logging for failed scaling groups (BaseException check)
- Update tests to verify per-scaling-group behavior
- Add test for parallel error handling

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Move the recorder module to sokovan/recorder for reuse across coordinators
(scheduler, deployment, route). Key changes:

- Make TransitionRecorder generic with EntityIdT bound to UUID
- Implement non-nestable phase/step constraints with custom exceptions
- Restructure to Phase/Step hierarchy using Pydantic models
- Simplify API with RecorderContext.scope() and entity() context managers
- Add comprehensive tests for recorder functionality

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add scope(), entity(), current_pool() API for RecorderContext lifecycle.
Implement shared_phase() and shared_step() context managers for batch
operations that apply to all entities. Use Python 3.10+ parenthesized
context manager syntax for cleaner code.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@HyeockJinKim HyeockJinKim force-pushed the feat/scheduler-refactoring-repository branch from 1c50854 to 38fd6af Compare January 10, 2026 06:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp:manager Related to Manager component size:XL 500~ LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants