Do not re-throw error when multiple workflow invocations race to complete the workflow#1118
Do not re-throw error when multiple workflow invocations race to complete the workflow#1118VaguelySerious merged 3 commits intomainfrom
Conversation
…lete the workflow Signed-off-by: Peter Wielander <mittgfu@gmail.com>
🦋 Changeset detectedLatest commit: 5effb22 The changes in this PR will be included in the next version bump. This PR includes changesets to release 14 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
🧪 E2E Test Results❌ Some tests failed Summary
❌ Failed Tests🌍 Community Worlds (45 failed)turso (45 failed):
Details by Category✅ ▲ Vercel Production
✅ 💻 Local Development
✅ 📦 Local Production
✅ 🐘 Local Postgres
✅ 🪟 Windows
❌ 🌍 Community Worlds
✅ 📋 Other
|
📊 Benchmark Results
workflow with no steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) workflow with 1 step💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) workflow with 10 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) workflow with 25 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) workflow with 50 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) Promise.all with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) Promise.all with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) Promise.all with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) Promise.race with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) Promise.race with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Express | Nitro Promise.race with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) Stream Benchmarks (includes TTFB metrics)workflow with stream💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Express | Nitro SummaryFastest Framework by WorldWinner determined by most benchmark wins
Fastest World by FrameworkWinner determined by most benchmark wins
Column Definitions
Worlds:
|
| return; | ||
| } | ||
| throw err; | ||
| } |
There was a problem hiding this comment.
This solves an issue where completing the workflow fails with 409, followed by another workflow invocations that then tries to mark the workflow as "failed", which then also 409s, etc.
There was a problem hiding this comment.
are you sure this is the right approach? are there valid 409s? what about 410 response? are there valid cases where there IS actually indicative of an error?
worth investigating a bit
separately i also think the error codes are getting a bit out of hand to manage and correlate between server and runtime
either this pr future we might want to have well defined workflow error responses from the server (rather than overloading http codes) and have the world-vercel convert them into client errors from the errors package that all worlds can share
i think the runtime should then handle those errors instead of dealing with less-descript HTTP error codes
There was a problem hiding this comment.
Since I'm specifically wrapping the run_completed event creation, we only care about 409s for that specifically, which currently only happens if the the run is already terminal (had claude confirm this + looked myself). This makes me think it's fine to ignore this specific invocation, since even if something did go wrong, it would have gone wrong in the original invocation, and now that the run is terminal, there are no remedial actions we can take.
This begs the question of what possible future 409s we could add to this specific invocation, which I can't come up with any good ones for run_completed that wouldn't want the same handling.
There was a problem hiding this comment.
do we also need to do the same handling for run_failed and run_cancelled then?
we return 409 on any invalid state transition iirc
but for hooks i think we return 410 when the run is completed
but yeah that's also what i mean by its a little hard to track the server vs client error states based on http codes :P
claude's also more likely to miss http codes were used compared to more explicit error returns like "invalid_run_state_transition" or something
There was a problem hiding this comment.
Maybe for run_failed too. It's much less likely to race, but can happen just the same. Updated the code.
I think run_cancelled should error visibly since it's a mostly user-UI or programmatic action and shouldn't race.
There was a problem hiding this comment.
awesome! makes sense. if otel is tidied up and correct - let's also maybe add a test for the changes to prevent regression in the errors
messages and ship :)
| }, | ||
| }); | ||
| } catch (err) { | ||
| if (WorkflowAPIError.is(err) && err.status === 409) { |
There was a problem hiding this comment.
When we hit this path, are we potentially missing setting the span attributes (on line 287) cos we would never hit that path? It might end up not reporting in o11y for these traces.
There was a problem hiding this comment.
We are, I think intentionally as we don't actually complete the run in this case (and don't know the run state, could be "failed") so we'd rather not add wrong information
Signed-off-by: Peter Wielander <mittgfu@gmail.com>
Signed-off-by: Peter Wielander <mittgfu@gmail.com>
pranaygp
left a comment
There was a problem hiding this comment.
would love tests on this too :)
…p_retrying events When multiple step invocations race to complete/fail/retry a step, the server returns a 409 Conflict. Previously this would bubble up as an unhandled error. Now these are caught and logged as warnings, matching the pattern established for run_completed/run_failed in #1118. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
No description provided.