Handling Garbage Collector's ServiceWaitingForManualInterventionError and Stale DS-Scheduler State and Directorv2 Interaction #8040

@pcrespov

Which deploy/s?

production aws (e.g. osparc.io)

Current Behavior

  • The garbage collector is reporting the following error dynamic_scheduler.errors.ServiceWaitingForManualInterventionError on a specific service while running GC on the orphan services.
  • Looking at the scheduler monitoring (e.g. https://monitoring.*/dynamic-scheduler/) we find
    Image
  • Q: We do not know why the service shows UNEXPECTED_OUTCOME! How can this be debugged?
  • Portainer shows that this service was constantly pending
    Image
    • Probably the machine was removed but the scheduler did not update its state?
  • We remove the service from the scheduler. The monitoring does not show any problem with the service anymore
  • But the GC continues logging the error
  • We restart the directorv2
  • Now the GC does not log the error anymore
  • This is strange, since we see that the GC is getting this information from the scheduler and not from the directorv2
  • We wonder whether the scheduler state still depends on something living in the directorv2, which is definitely misbehaving

An error related to garbage collection (GC) of orphan services was observed in the system. The root cause appeared to be a stale scheduler state that was not cleared properly, potentially due to inconsistencies between the dynamic-scheduler and directorv2 components. Manual intervention and a component restart were ultimately required to resolve the issue.

  • Initial Symptom

    • The garbage collector reported the following recurring error:
      dynamic_scheduler.errors.ServiceWaitingForManualInterventionError
      
    • This occurred during GC runs on orphan services.
      Image
  • Monitoring Clues

    • Inspection of the scheduler monitoring dashboard (https://monitoring.*/dynamic-scheduler/) revealed a failed service in an unexpected state:
      Image
      UNEXPECTED_OUTCOME!
      
    • This status was not self-explanatory and raised questions about its origin and meaning. We did not know how to debug it.
  • Container State

    • Portainer indicated that the service in question remained in a "pending" state indefinitely.
    • This suggested a possible issue where the node hosting the service might have been removed, yet the scheduler had not updated its internal state accordingly.
      Image
  • Manual Cleanup Attempt

    • The affected service was manually removed from the scheduler.
    • After this action, the scheduler’s monitoring interface no longer displayed any issues related to the service.
  • Persistence of GC Errors

    • Despite the manual cleanup, the garbage collector continued logging the same ServiceWaitingForManualInterventionError.
  • Directorv2 Restart

    • Restarting the directorv2 component led to the GC error disappearing.
    • This was unexpected, given that GC operations were believed to rely solely on the scheduler’s state and not directorv2.

Analysis and Open Questions

  • The error seems rooted in a mismatch or stale state between the dynamic-scheduler and directorv2.
  • Although the scheduler was manually corrected, the GC continued to fail, suggesting some latent state still depended on directorv2.
  • The resolution via directorv2 restart implies that the scheduler might query or synchronize with directorv2, directly or indirectly, for orphan service validation.
  • The meaning and origin of the UNEXPECTED_OUTCOME! state are unclear. Diagnosing such an unexpected event requires more information (e.g. well-defined exception handling and logging).
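
To illustrate what "well-defined exception handling and logging" could look like, here is a minimal sketch of an error type that records why a service ended up in the unexpected state. All names in it (`OutcomeReason`, `UnexpectedOutcomeError`, `report`, the logger name) are hypothetical and are not taken from the dynamic-scheduler code base. The real scheduler may track different states and reasons.

```python
import logging
from enum import Enum

logger = logging.getLogger("scheduler_sketch")  # hypothetical logger name


class OutcomeReason(str, Enum):
    """Hypothetical reasons; the real scheduler may distinguish others."""

    NODE_MISSING = "node_missing"
    START_TIMEOUT = "start_timeout"
    UNKNOWN = "unknown"


class UnexpectedOutcomeError(RuntimeError):
    """Hypothetical error that records why a service reached an unexpected state."""

    def __init__(
        self,
        service_id: str,
        last_known_state: str,
        reason: OutcomeReason = OutcomeReason.UNKNOWN,
    ) -> None:
        self.service_id = service_id
        self.last_known_state = last_known_state
        self.reason = reason
        super().__init__(
            f"service {service_id} reached UNEXPECTED_OUTCOME "
            f"(last_known_state={last_known_state}, reason={reason.value})"
        )


def report(error: UnexpectedOutcomeError) -> str:
    """Log the error with structured fields and return the operator-facing message."""
    logger.error(
        "%s",
        error,
        extra={"service_id": error.service_id, "reason": error.reason.value},
    )
    return str(error)
```

With something along these lines, the monitoring page could show the reason next to the status instead of a bare UNEXPECTED_OUTCOME!, and the same fields would be searchable in the logs.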

Recommendations

  • Clarify and document the interaction between dynamic-scheduler, directorv2, and the GC system.
  • Investigate how GC fetches service state and whether it caches any data from directorv2.
  • Improve error messages like UNEXPECTED_OUTCOME! to be more informative, or link them to the internal state that caused them.
  • Consider adding automatic reconciliation logic to detect and handle removed/missing nodes, instead of requiring "manual intervention".
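
The reconciliation idea in the last recommendation could be sketched roughly as follows. This is only an illustration under assumptions: `TrackedService`, `find_stale_services`, the reason strings and the 600-second pending threshold are all hypothetical, and the real scheduler would have to obtain the tracked services and the list of live nodes from its own state and from the cluster.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class TrackedService:
    """Hypothetical view of a service as tracked by the scheduler."""

    service_id: str
    node_id: Optional[str]  # None while the service is still pending
    pending_seconds: float = 0.0


def find_stale_services(
    tracked: Iterable[TrackedService],
    live_node_ids: Iterable[str],
    max_pending_seconds: float = 600.0,  # assumed threshold, not a real setting
) -> Dict[str, str]:
    """Map each stale service id to the reason it is considered stale."""
    live = set(live_node_ids)
    stale: Dict[str, str] = {}
    for svc in tracked:
        if svc.node_id is not None and svc.node_id not in live:
            # The node the service was placed on no longer exists.
            stale[svc.service_id] = "node_missing"
        elif svc.node_id is None and svc.pending_seconds > max_pending_seconds:
            # The service never got placed and has been pending for too long.
            stale[svc.service_id] = "pending_too_long"
    return stale
```

A periodic task could call such a function and remove (or at least flag with a reason) the returned services, so that a service stuck in "pending" after its machine was removed does not stay in the scheduler until someone intervenes manually.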

Labels

bug: buggy, it does not work as expected
