fix(ai): fix orchestrator suspension for AI jobs #3393
leszko merged 16 commits into livepeer:master from
Conversation
* refactor: some minor code changes
* refactor: fix transient error naming
* refactor: make isRetryableError case insensitive
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff             @@
##              master      #3393       +/-   ##
===================================================
- Coverage   32.18719%   32.14408%   -0.04311%
===================================================
  Files            147         147
  Lines          40687       40754       +67
===================================================
+ Hits           13096       13100        +4
- Misses         26818       26880       +62
- Partials         773         774        +1
```

... and 2 files with indirect coverage changes.
```go
if ok {
	penalty -= lastCount
}
pool.suspender.suspend(sess.Transcoder(), penalty)
```
How does this penalty actually work for AI jobs (and AI video)? I see that the Orchestrator gets penalized every time it's removed from the pool. So, for example, if we call an O and it has "insufficient capacity", then it will get a penalty. Right?
I don't think that's actually correct. Maybe we should detect what kind of error is returned and only penalize certain errors. Wdyt?
This pull request serves as an intermediate fix after PR #3033 was reverted. While it’s not the ideal solution for batch or generative AI jobs, I’m unsure how it will impact real-time jobs. However, it’s still an improvement over the previous issue where a broken orchestrator wasn’t being suspended at all.
> I don't think it's actually correct. Maybe we should detect what kind of error is returned and only penalize certain errors. Wdyt?
A similar approach was taken in PR #3033, but the issue was that an orchestrator with insufficient capacity needed to be temporarily deprioritized, giving other orchestrators a chance to be tried first. A better approach would be:
- If an orchestrator has no capacity, it shouldn’t be suspended for transient errors.
- Instead, it should be deprioritized while other orchestrators are attempted.
- If all orchestrators have been tried, it should be retried.
This suggests that the previous logic was close to a proper fix, but we needed to refine it slightly to handle no-capacity cases more effectively.
I agree that "insufficient capacity" returned by the Orchestrator should not be penalized beyond the current request. A couple of other errors, ticket nonce count too high and ticket params expired, should also not trigger suspension.
A possibly lighter-weight solution is to add a function to the AISessionManager that removes the session from the selector without suspending it, then adds the session back into the pool and selector after the request timeout (defaults to 2 seconds). This could also be some multiple of the timeout (maybe 2-3x); the right value likely depends on the average time to process requests.
I put together a first pass at this but have not tested it yet. Let me know what you think: ad-astra-video@fb248cf
The harder solution is creating a selector for each request. This would allow removing and suspending sessions from per-request selectors without impacting the main selector tracking the sessions. Since I think we have updating selection to allow hardware preferences on the horizon as well, this would be better tackled when we look at adding hardware information to the selection algo (and possibly allow latency-score-only selection). I started exploring this option here but have not fully tested it yet: https://github.com/ad-astra-video/go-livepeer/tree/av-new-selector-for-each-ai-request
Ok, let's merge this PR as it is. And @ad-astra-video, the change you suggested (ad-astra-video@fb248cf) would be a nice addition. Would you mind sending it as a separate PR so we can discuss it there?
What does this pull request do? Explain your changes. (required)
Follow-up PR to #3033 and #3392 (which reverted #3033), removing the part that caused issues in selection. That part will be moved to a separate PR.
Specific updates (required)
How did you test each of these updates (required)
Does this pull request close any open issues?
Checklist:
- [x] `make` runs successfully
- [x] All tests in `./test.sh` pass