fix(ai): fix orchestrator suspension for AI jobs #3393
leszko merged 16 commits into livepeer:master from
Conversation
* refactor: some minor code changes
* refactor: fix transient error naming
* refactor: make isRetryableError case insensitive
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff             @@
##              master      #3393       +/-   ##
===================================================
- Coverage   32.18719%   32.14408%   -0.04311%
===================================================
  Files            147         147
  Lines          40687       40754       +67
===================================================
+ Hits           13096       13100        +4
- Misses         26818       26880       +62
- Partials         773         774        +1
```

... and 2 files with indirect coverage changes.
```go
if ok {
	penalty -= lastCount
}
pool.suspender.suspend(sess.Transcoder(), penalty)
```
How does this penalty actually work for AI jobs (and AI video)? I see that the Orchestrator gets penalized every time it's removed from the pool. So, for example, if we call an O and it has "insufficient capacity", then it will get a penalty. Right?
I don't think that's actually correct. Maybe we should detect what kind of error is returned and only penalize certain errors. Wdyt?
This pull request serves as an intermediate fix after PR #3033 was reverted. While it’s not the ideal solution for batch or generative AI jobs, I’m unsure how it will impact real-time jobs. However, it’s still an improvement over the previous issue where a broken orchestrator wasn’t being suspended at all.
> I don't think it's actually correct. Maybe we should detect what kind of error is returned and only penalize certain errors. Wdyt?
A similar approach was taken in PR #3033, but the issue was that an orchestrator with insufficient capacity needed to be temporarily deprioritized, giving other orchestrators a chance to be tried first. A better approach would be:
- If an orchestrator has no capacity, it shouldn’t be suspended for transient errors.
- Instead, it should be deprioritized while other orchestrators are attempted.
- If all orchestrators have been tried, it should be retried.
This suggests that the previous logic was close to a proper fix, but we needed to refine it slightly to handle no-capacity cases more effectively.
I agree that "insufficient capacity" returned by the Orchestrator should not be penalized beyond the current request. A couple of other errors, ticket nonce count too high and ticket params expired, should also not trigger suspension.
A possibly lighter-weight solution is to add a function to the AISessionManager that removes the session from the selector without suspending it, then adds the session back into the pool and selector after the request timeout (defaults to 2 seconds). This could also be some multiple of the timeout (maybe 2-3x); the right value likely depends on the average time to process requests.
I put together a first pass at this but have not tested it yet. Let me know what you think: ad-astra-video@fb248cf
The harder solution is creating a selector for each request. This would allow removing and suspending sessions from per-request selectors without impacting the main selector tracking the sessions. Since I think we have updating selection to allow hardware preferences on the horizon as well, this would be better tackled when we look at adding hardware information to the selection algo (and possibly allow latency-score-only selection). I started exploring this option here but have not fully tested it yet: https://github.com/ad-astra-video/go-livepeer/tree/av-new-selector-for-each-ai-request
Ok, let's merge this PR as it is. And @ad-astra-video, the change you suggested (ad-astra-video@fb248cf) would be a nice addition. Would you mind sending it as a separate PR so we can discuss it there?
What does this pull request do? Explain your changes. (required)
Follow-up PR to #3033 and #3392 (which reverted #3033), removing the part that caused issues in selection. That part will be moved to a separate PR.
Specific updates (required)
How did you test each of these updates (required)
Does this pull request close any open issues?
Checklist:
- [x] `make` runs successfully
- [x] All tests in `./test.sh` pass