Open
Labels
bug (buggy, it does not work as expected)
Description
Which deploy/s?
production aws (e.g. osparc.io)
Current Behavior
- The garbage collector (GC) is reporting the error dynamic_scheduler.errors.ServiceWaitingForManualInterventionError on a specific service while running GC on the orphan services.
- Looking at the scheduler monitoring (e.g. https://monitoring.*/dynamic-scheduler/) we find the service in the UNEXPECTED_OUTCOME state.
- Q: We do not know why it ended up in UNEXPECTED_OUTCOME. How can this be debugged?
- Portainer shows that this service was constantly pending.
- Probably the machine was removed but the scheduler did not update its state?
- We remove the service from the scheduler. The monitoring no longer shows any problem with the service.
- But the GC continues logging the error.
- We restart the directorv2.
- Now the GC does not log the error anymore.
- This is strange, since we see that the GC gets this information from the scheduler and not from the directorv2.
- We wonder whether the scheduler state still depends on something living in the directorv2, which is definitely misbehaving.
An error related to garbage collection (GC) of orphan services was observed in the system. The root cause appeared to be a stale scheduler state that was not cleared properly, potentially due to inconsistencies between the dynamic-scheduler and directorv2 components. Manual intervention and a component restart were ultimately required to resolve the issue.
- Initial Symptom
- Monitoring Clues
- Container State
- Manual Cleanup Attempt
  - The affected service was manually removed from the scheduler.
  - After this action, the scheduler’s monitoring interface no longer displayed any issues related to the service.
- Persistence of GC Errors
  - Despite the manual cleanup, the garbage collector continued logging the same ServiceWaitingForManualInterventionError.
- Directorv2 Restart
  - Restarting the directorv2 component led to the GC error disappearing.
  - This was unexpected, given that GC operations were believed to rely solely on the scheduler’s state and not directorv2.
Analysis and Open Questions
- The error seems rooted in a mismatch or stale state between the dynamic-scheduler and directorv2.
- Although the scheduler was manually corrected, the GC continued to fail, suggesting some latent state still depended on directorv2.
- The resolution via directorv2 restart implies that the scheduler might query or synchronize with directorv2, directly or indirectly, for orphan service validation.
- The meaning and origin of the UNEXPECTED_OUTCOME state is unclear. Diagnosing the unexpected event requires more information (e.g. well-defined exception handling and logging).
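The stale-state hypothesis can be pictured with the toy model below. It is only a sketch under stated assumptions: `SchedulerStore`, `CachingStatusReader`, and the status string `WAITING_FOR_MANUAL_INTERVENTION` are hypothetical names invented for illustration and do not correspond to actual dynamic-scheduler or directorv2 code. The model assumes that whatever the GC reads from keeps an in-process copy of the service status, so removing the service from the scheduler changes nothing for the GC until the caching process restarts.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


class ServiceWaitingForManualInterventionError(Exception):
    """Local stand-in for the error seen in the GC logs (not the real class)."""


@dataclass
class SchedulerStore:
    """Hypothetical source of truth: service id -> status."""

    statuses: dict[str, str] = field(default_factory=dict)

    def remove(self, service_id: str) -> None:
        self.statuses.pop(service_id, None)


@dataclass
class CachingStatusReader:
    """Hypothetical intermediate component that caches statuses in memory."""

    store: SchedulerStore
    _cache: dict[str, str] = field(default_factory=dict, init=False, repr=False)

    def get_status(self, service_id: str) -> Optional[str]:
        if service_id not in self._cache:
            status = self.store.statuses.get(service_id)
            if status is None:
                return None
            self._cache[service_id] = status
        # A cached entry is returned without re-checking the store.
        return self._cache[service_id]

    def restart(self) -> None:
        # A process restart drops all in-memory state.
        self._cache.clear()


def gc_reports_error(reader: CachingStatusReader, service_id: str) -> bool:
    """Return True if a GC pass would log the manual-intervention error."""
    try:
        if reader.get_status(service_id) == "WAITING_FOR_MANUAL_INTERVENTION":
            raise ServiceWaitingForManualInterventionError(service_id)
    except ServiceWaitingForManualInterventionError:
        return True
    return False


def replay_incident() -> list[bool]:
    """Replay the three observations: initial, after removal, after restart."""
    store = SchedulerStore({"svc-1": "WAITING_FOR_MANUAL_INTERVENTION"})
    reader = CachingStatusReader(store)
    observations = [gc_reports_error(reader, "svc-1")]
    store.remove("svc-1")  # manual cleanup in the scheduler
    observations.append(gc_reports_error(reader, "svc-1"))
    reader.restart()  # analogous to restarting the caching component
    observations.append(gc_reports_error(reader, "svc-1"))
    return observations
```

Under these assumptions `replay_incident()` reproduces the reported sequence: the error is logged, keeps being logged after the manual removal, and disappears only after the restart. Whether any such cache actually exists between the GC, the scheduler, and directorv2 is exactly the open question.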
Recommendations
- Clarify and document the interaction between dynamic-scheduler, directorv2, and the GC system.
- Investigate how GC fetches service state and whether it caches any data from directorv2.
- Improve error messages like UNEXPECTED_OUTCOME to be more informative, or link them to the internal state that caused them.
- Consider adding automatic reconciliation logic to detect and handle removed/missing nodes (instead of "manual intervention").
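As a rough illustration of the last two recommendations, the sketch below compares tracked services against the set of nodes that are still alive and emits an action with an explicit reason for each service whose node is gone. All names here (`TrackedService`, `ReconcileAction`, `reconcile`, and the `node_id` field) are hypothetical. The sketch assumes the scheduler knows which node each service was last placed on and can obtain the list of live nodes from the cluster. It is not the existing scheduler API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TrackedService:
    service_id: str
    node_id: str  # node the service was last scheduled on (assumed to be known)
    state: str  # e.g. "RUNNING" or "UNEXPECTED_OUTCOME"


@dataclass(frozen=True)
class ReconcileAction:
    service_id: str
    action: str
    reason: str  # human-readable cause, meant to end up in logs and monitoring


def reconcile(
    tracked: list[TrackedService], live_nodes: set[str]
) -> list[ReconcileAction]:
    """Propose removal of services whose node is no longer in the cluster."""
    actions: list[ReconcileAction] = []
    for svc in tracked:
        if svc.node_id not in live_nodes:
            actions.append(
                ReconcileAction(
                    service_id=svc.service_id,
                    action="remove",
                    reason=(
                        f"node {svc.node_id!r} is no longer part of the cluster "
                        f"(last known state: {svc.state})"
                    ),
                )
            )
    return actions
```

For example, with one service on a removed node and one on a live node, only the first is proposed for removal:

```python
tracked = [
    TrackedService("svc-1", "node-a", "UNEXPECTED_OUTCOME"),
    TrackedService("svc-2", "node-b", "RUNNING"),
]
for action in reconcile(tracked, live_nodes={"node-b"}):
    print(action.service_id, action.action, "-", action.reason)
```

Carrying the reason alongside the action is one way to make a state like UNEXPECTED_OUTCOME traceable to its cause. A real implementation would also need to decide when automatic removal is safe.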