Skip to content

fix(ai): fix orchestrator suspension for AI jobs#3393

Merged
leszko merged 16 commits intolivepeer:masterfrom
ad-astra-video:ai-video-fix-selection-pr
Feb 18, 2025
Merged

fix(ai): fix orchestrator suspension for AI jobs#3393
leszko merged 16 commits intolivepeer:masterfrom
ad-astra-video:ai-video-fix-selection-pr

Conversation

@ad-astra-video
Copy link
Copy Markdown
Collaborator

@ad-astra-video ad-astra-video commented Feb 14, 2025

What does this pull request do? Explain your changes. (required)

Follow up PR to #3033 and #3392 reverting #3033 to remove the part that caused issues in selection. Will move that part to a separate PR.

Specific updates (required)

How did you test each of these updates (required)

Does this pull request close any open issues?

Checklist:

@github-actions github-actions bot added go Pull requests that update Go code AI Issues and PR related to the AI-video branch. labels Feb 14, 2025
@codecov
Copy link
Copy Markdown

codecov bot commented Feb 14, 2025

Codecov Report

Attention: Patch coverage is 2.50000% with 78 lines in your changes missing coverage. Please review.

Project coverage is 32.14408%. Comparing base (d84c0c6) to head (98e06db).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
server/handlers.go 2.27273% 42 Missing and 1 partial ⚠️
server/ai_session.go 0.00000% 35 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@                 Coverage Diff                 @@
##              master       #3393         +/-   ##
===================================================
- Coverage   32.18719%   32.14408%   -0.04311%     
===================================================
  Files            147         147                 
  Lines          40687       40754         +67     
===================================================
+ Hits           13096       13100          +4     
- Misses         26818       26880         +62     
- Partials         773         774          +1     
Files with missing lines Coverage Δ
server/ai_process.go 0.59222% <ø> (ø)
server/webserver.go 95.87629% <100.00000%> (+0.04296%) ⬆️
server/ai_session.go 2.33333% <0.00000%> (-0.18466%) ⬇️
server/handlers.go 52.19092% <2.27273%> (-1.77991%) ⬇️

... and 2 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d84c0c6...98e06db. Read the comment docs.

Files with missing lines Coverage Δ
server/ai_process.go 0.59222% <ø> (ø)
server/webserver.go 95.87629% <100.00000%> (+0.04296%) ⬆️
server/ai_session.go 2.33333% <0.00000%> (-0.18466%) ⬇️
server/handlers.go 52.19092% <2.27273%> (-1.77991%) ⬇️

... and 2 files with indirect coverage changes

if ok {
penalty -= lastCount
}
pool.suspender.suspend(sess.Transcoder(), penalty)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this penalty actually work for AI Jobs (and AI Video), because I see that the Orchestrato gets penalized every time it's removed from the pool. So, for example, if we call O and it has "insufficient capacity", then it will get penalty. Right?

I don't think it's actually correct. Maybe we should detect what kind of error is returned and only penalize certain errors. Wdyt?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request serves as an intermediate fix after PR #3033 was reverted. While it’s not the ideal solution for batch or generative AI jobs, I’m unsure how it will impact real-time jobs. However, it’s still an improvement over the previous issue where a broken orchestrator wasn’t being suspended at all.

I don't think it's actually correct. Maybe we should detect what kind of error is returned and only penalize certain errors. Wdyt?

A similar approach was taken in PR #3033, but the issue was that an orchestrator with insufficient capacity needed to be temporarily deprioritized, giving other orchestrators a chance to be tried first. A better approach would be:

  • If an orchestrator has no capacity, it shouldn’t be suspended for transient errors.
  • Instead, it should be deprioritized while other orchestrators are attempted.
  • If all orchestrators have been tried, it should be retried.
    This suggests that the previous logic was close to a proper fix, but we needed to refine it slightly to handle no-capacity cases more effectively.

Copy link
Copy Markdown
Collaborator Author

@ad-astra-video ad-astra-video Feb 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree insufficient capacity returned by the Orchestrator should not be penalized beyond the current request. A couple other errors for ticket nonce count too high and ticket params expired I think should also not be suspended as well.

A possibly lighter weight solution to this is add a function to the AISessionManager that removes the session from the selector without suspending it and adds the session back into the pool and selector after the request timeout (defaults to 2 seconds). This could be some multiple of the timeout as well (maybe 2-3x) and the right timeout is likely dependent on average time to process the requests.

I put together a shot at this but have not tested yet. Let me know what you think: ad-astra-video@fb248cf

The harder solution is creating a selector for each request. This would allow to remove and suspend sessions from the selectors without impacting the main selector tracking the sessions. since I think we have on the horizon updating selection to allow including hardware preferences as well I think this would be better tackled when we look at adding hardware information to the selection algo (and possible allow for latency score only selection if possible). I started exploring this option here but not fully tested it yet: https://github.com/ad-astra-video/go-livepeer/tree/av-new-selector-for-each-ai-request

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's merge this PR as it is. And @ad-astra-video This change you suggested ad-astra-video@fb248cf would be a nice addition. Would you mind sending it as a separate PR and we can discuss there?

@leszko leszko merged commit 39db9b6 into livepeer:master Feb 18, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI Issues and PR related to the AI-video branch. go Pull requests that update Go code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants