-
Notifications
You must be signed in to change notification settings - Fork 469
Fix GrpcWorkerChannel.StartWorkerProcessAsync timeout #10937
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
await _rpcWorkerProcess.StartProcessAsync(); | ||
_state = _state | RpcWorkerChannelState.Initializing; | ||
await _workerInitTask.Task; | ||
await _rpcWorkerProcess.StartProcessAsync(cancellationToken); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the important change - we will now wait on either worker fully initialized (gRPC connection established) or worker exits (in which case, we will re-throw any failures the worker experience).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good -- would just like a test added
/azp run host.integration-tests |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run host.public |
Azure Pipelines successfully started running 1 pipeline(s). |
@jviau @mattchenderson I was just investigating this yesterday! Turns out this is the same reason that we are stuck for 60s when trying to exit the CLI during startup (Azure/azure-functions-core-tools#4355). Glad to see there is a already a fix before I opened one :) I'm going to test your changes out with core tools and see if it addresses the issue like I expect. Looks like it works?
Before this change, it would hang for 60 seconds waiting on the timeout from PendingItem.
For my version of the fix, I was considering passing a CT to PendingItem and registering an edit: more logs - this is outside of my debug session - it takes 5 seconds to shutdown after we cancel:
For core tools at least, I would love to reduce this time but I can probably just inject options in the webhost to reduce the flush logs timer to a 1 second or something like that. Anywho not related to this PR
|
d499b02
to
d9d1cd6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR improves the ScriptHost startup experience with faulty workers by implementing timeout optimizations. The key issue addressed is that when a worker crashes or exits immediately after startup, GrpcWorkerChannel.StartWorkerProcessAsync
would block on the worker initialization task until it times out, potentially causing host faults during debugging.
Key Changes
- Added
WaitForExitAsync
method toIWorkerProcess
to wait for process exit with proper exception propagation - Modified
GrpcWorkerChannel.StartWorkerProcessAsync
to race between worker initialization and process exit - Enhanced test coverage for worker process lifecycle and failure scenarios
Reviewed Changes
Copilot reviewed 25 out of 26 changed files in this pull request and generated 3 comments.
Show a summary per file
File | Description |
---|---|
src/WebJobs.Script/Workers/ProcessManagement/IWorkerProcess.cs |
Added WaitForExitAsync method signature with cancellation token support |
src/WebJobs.Script/Workers/ProcessManagement/WorkerProcess.cs |
Implemented WaitForExitAsync with TaskCompletionSource for custom exception propagation |
src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs |
Modified StartWorkerProcessAsync to race worker init against process exit |
test/WebJobs.Script.Tests/Workers/Rpc/RpcWorkerProcessTests.cs |
Added comprehensive tests for WaitForExitAsync functionality |
test/WebJobs.Script.Tests/Workers/Rpc/GrpcWorkerChannelTests.cs |
Added tests for process exit scenarios and timeout improvements |
ca68897
to
16650da
Compare
This reverts commit a816a7f.
{ | ||
_processExit.TrySetResult(Process.ExitCode); | ||
UnregisterFromProcessMonitor(); | ||
Process.Close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-E2E test step in CI is hanging because some child process is inheriting STDIO and leaving it open. Looking at my PR, the changes to process management seem like the most likely culprit. But I don't understand why -- we were never disposing/closing the process in all cases before. Wouldn't this be a leak in the process restart scenario?
@fabiocav, any thoughts?
d141cb4
to
4f1a2b7
Compare
4f1a2b7
to
de9970f
Compare
Issue describing the changes in this PR
resolves #issue_for_this_pr
Pull request checklist
IMPORTANT: Currently, changes must be backported to the
in-proc
branch to be included in Core Tools and non-Flex deployments.in-proc
branch is not requiredrelease_notes.md
-- TODOAdditional information
This PR improves the ScriptHost startup experience with a bad worker. Today, if a worker crashes or exits immediately after startup, then the
GrpcWorkerChannel.StartWorkerProcessAsync
will block on_workerInitTask.Task
until it eventually times out. This tends to fault the entire host (at least during debugging).To address this, a
WorkerProcess.WaitForExitAsync
is added andGrpcWorkerChannel.StartWorkerProcessAsync
will also wait on that, improving the responsiveness to a worker exiting before connecting gRPC events.