[v3.x] Figure out FUNCTIONS_WORKER_RUNTIME from function app content if Environment variable FUNCTIONS_WORKER_RUNTIME is not set (#8212)
* Taking worker runtime from files if not in Env setting
* Added tests
* Code cleanup
* Added tests in Utility
* minor restructuring in utility
* RpcFunctionInvocationDispatcher: capture async void errors
In test runs we're seeing failures like this:
```
The active test run was aborted. Reason: Test host process crashed : Unhandled exception. System.Threading.Tasks.TaskCanceledException: A task was canceled.
at Microsoft.Azure.WebJobs.Script.Grpc.GrpcWorkerChannel.StartWorkerProcessAsync(CancellationToken cancellationToken) in /_/src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs:line 155
at Microsoft.Azure.WebJobs.Script.Workers.Rpc.RpcFunctionInvocationDispatcher.InitializeJobhostLanguageWorkerChannelAsync(Int32 attemptCount) in /_/src/WebJobs.Script/Workers/Rpc/FunctionRegistration/RpcFunctionInvocationDispatcher.cs:line 119
at Microsoft.Azure.WebJobs.Script.Workers.Rpc.RpcFunctionInvocationDispatcher.StartWorkerChannel(String runtime) in /_/src/WebJobs.Script/Workers/Rpc/FunctionRegistration/RpcFunctionInvocationDispatcher.cs:line 535
at Microsoft.Azure.WebJobs.Script.Workers.Rpc.RpcFunctionInvocationDispatcher.StartWorkerChannel(String runtime) in /_/src/WebJobs.Script/Workers/Rpc/FunctionRegistration/RpcFunctionInvocationDispatcher.cs:line 548
at Microsoft.Azure.WebJobs.Script.Workers.Rpc.RpcFunctionInvocationDispatcher.DisposeAndRestartWorkerChannel(String runtime, String workerId, Exception workerException) in /_/src/WebJobs.Script/Workers/Rpc/FunctionRegistration/RpcFunctionInvocationDispatcher.cs:line 497
at Microsoft.Azure.WebJobs.Script.Workers.Rpc.RpcFunctionInvocationDispatcher.WorkerError(WorkerErrorEvent workerError) in /_/src/WebJobs.Script/Workers/Rpc/FunctionRegistration/RpcFunctionInvocationDispatcher.cs:line 442
at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__128_1(Object state)
at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
at System.Threading.Thread.StartCallback()
```
This comes down to the `async void` usage in the event handlers for the `ScriptEvent` pipe. An `async void` method kicks off work that can fail but returns no `Task` for a caller to observe, so when an exception escapes it, the only thing the runtime can do is crash. Our options are:
1. Let the runtime crash (what happens today)
2. Capture only the `TaskCanceledException` case (what this PR does)
3. Capture and log all errors
The problem with option 3 is that it introduces a state I'm too naïve to reason about: what happens when an error occurs _and we fail to restart the worker_? This could be a downstream net win from a state standpoint of "well, we restart everything and recover". It's not fast though, vs. restarting just the worker. If that's _not_ a new state and it's handled correctly, 3 is the better/global option.
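To make the difference between options 2 and 3 concrete, here is a minimal sketch of the two handler shapes. All names (`HandlerOptions`, `CatchCancellationOnly`, `CatchAll`, the `restart` and `log` delegates) are hypothetical stand-ins for illustration, not the dispatcher's actual code.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical illustration only; not the real RpcFunctionInvocationDispatcher.
public static class HandlerOptions
{
    // Option 2: swallow only cancellation. Any other exception still
    // escapes the async void method and takes the process down.
    public static async void CatchCancellationOnly(Func<Task> restart, Action<string> log)
    {
        try
        {
            await restart();
        }
        catch (TaskCanceledException)
        {
            log("ignored cancellation");
        }
    }

    // Option 3: catch and log everything. The process survives, but the
    // caller is left to decide what a failed restart means for host state.
    public static async void CatchAll(Func<Task> restart, Action<string> log)
    {
        try
        {
            await restart();
        }
        catch (Exception ex)
        {
            log("logged " + ex.GetType().Name);
        }
    }
}
```

With option 2 the handler stays as strict as it is today for everything except cancellation, which is why it is the smaller change.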
The interaction here with the test suite is:
1. We throw a worker error (in testing)
2. An event is triggered kicking off an `async void` to restart the worker
3. We dispose of the Rpc bits (when the test finishes)
4. That background task is cancelled, throwing an error and crashing our runtime
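The sequence above can be sketched in isolation. This is a simplified model under the assumption that teardown cancels a token the in-flight restart is awaiting; `FakeDispatcher`, `OnWorkerError`, and `Handled` are hypothetical names, and `Task.Delay` stands in for starting the worker process.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical model of the race; not the real dispatcher.
public sealed class FakeDispatcher : IDisposable
{
    private readonly CancellationTokenSource _cts = new CancellationTokenSource();
    private readonly TaskCompletionSource<bool> _handled = new TaskCompletionSource<bool>();

    // Completes once the handler has observed the cancellation.
    public Task<bool> Handled => _handled.Task;

    // Step 2: the event handler kicks off a restart from an async void method.
    public async void OnWorkerError()
    {
        try
        {
            // Stand-in for the restart; it only ends when the token is cancelled.
            await Task.Delay(Timeout.Infinite, _cts.Token);
        }
        catch (TaskCanceledException)
        {
            // Step 4, with the fix: the cancellation is caught here
            // instead of escaping and crashing the process.
            _handled.TrySetResult(true);
        }
    }

    // Step 3: teardown cancels whatever restart is still in flight.
    public void Dispose()
    {
        _cts.Cancel();
        _cts.Dispose();
    }
}
```

Without the `try`/`catch`, the same `Dispose` call would surface the `TaskCanceledException` as an unhandled exception, which matches the crash in the stack trace above.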
I think it's worth noting that this fixes this class only. Similar usages (grep `async void`) exist in other areas and are something we should address globally. Writing this up to make sure the direction is correct/agreeable as a first step.
Co-authored-by: Surgupta <[email protected]>
Co-authored-by: Nick Craver <[email protected]>
+                    _logger.LogDebug("Received WorkerErrorEvent for runtime:{runtime}, workerId:{workerId}", workerError.Language, workerError.WorkerId);
+                    _logger.LogDebug("WorkerErrorEvent runtime:{runtime} does not match current runtime:{currentRuntime}. Failed with: {exception}", workerError.Language, _workerRuntime, workerError.Exception);
+                }
             }
-            else
+            catch (TaskCanceledException)
             {
-                _logger.LogDebug("Received WorkerErrorEvent for runtime:{runtime}, workerId:{workerId}", workerError.Language, workerError.WorkerId);
-                _logger.LogDebug("WorkerErrorEvent runtime:{runtime} does not match current runtime:{currentRuntime}. Failed with: {exception}", workerError.Language, _workerRuntime, workerError.Exception);
+                // Specifically in the "we were torn down while trying to restart" case, we want to catch here and ignore
+                // If we don't catch the exception from an async void method, we'll end up tearing down the entire runtime instead
+                // It's possible we want to catch *all* exceptions and log or ignore here, but taking the minimal change first
+                // For example if we capture and log, we're left in a worker-less state with a working Host runtime - is that desired? Will it self recover elsewhere?
             }
         }
 
@@ -454,8 +465,18 @@ public async void WorkerRestart(WorkerRestartEvent workerRestart)
                 return;
             }
 
-            _logger.LogDebug("Handling WorkerRestartEvent for runtime:{runtime}, workerId:{workerId}", workerRestart.Language, workerRestart.WorkerId);