[v3.x] Figure out FUNCTIONS_WORKER_RUNTIME from function app content if Environment variable FUNCTIONS_WORKER_RUNTIME is not set (#8212)

github-actions[bot] · surgupta-msft · NickCraver · fabiocav · commit 5d42b96335ac · 2022-03-08T18:55:39.000-08:00
* Taking worker runtime from files if not in Env setting
* Added tests
* Code cleanup
* Added tests in Utility
* minor restructuring in utility
* RpcFunctionInvocationDispatcher: capture async void errors

In test runs we're seeing failures like this:
```
The active test run was aborted. Reason: Test host process crashed : Unhandled exception. System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at Microsoft.Azure.WebJobs.Script.Grpc.GrpcWorkerChannel.StartWorkerProcessAsync(CancellationToken cancellationToken) in /_/src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs:line 155
   at Microsoft.Azure.WebJobs.Script.Workers.Rpc.RpcFunctionInvocationDispatcher.InitializeJobhostLanguageWorkerChannelAsync(Int32 attemptCount) in /_/src/WebJobs.Script/Workers/Rpc/FunctionRegistration/RpcFunctionInvocationDispatcher.cs:line 119
   at Microsoft.Azure.WebJobs.Script.Workers.Rpc.RpcFunctionInvocationDispatcher.StartWorkerChannel(String runtime) in /_/src/WebJobs.Script/Workers/Rpc/FunctionRegistration/RpcFunctionInvocationDispatcher.cs:line 535
   at Microsoft.Azure.WebJobs.Script.Workers.Rpc.RpcFunctionInvocationDispatcher.StartWorkerChannel(String runtime) in /_/src/WebJobs.Script/Workers/Rpc/FunctionRegistration/RpcFunctionInvocationDispatcher.cs:line 548
   at Microsoft.Azure.WebJobs.Script.Workers.Rpc.RpcFunctionInvocationDispatcher.DisposeAndRestartWorkerChannel(String runtime, String workerId, Exception workerException) in /_/src/WebJobs.Script/Workers/Rpc/FunctionRegistration/RpcFunctionInvocationDispatcher.cs:line 497
   at Microsoft.Azure.WebJobs.Script.Workers.Rpc.RpcFunctionInvocationDispatcher.WorkerError(WorkerErrorEvent workerError) in /_/src/WebJobs.Script/Workers/Rpc/FunctionRegistration/RpcFunctionInvocationDispatcher.cs:line 442
   at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__128_1(Object state)
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()
```

This is overall because of the `async void` usage in the event handlers for the `ScriptEvent` pipe. An `async void` method kicks off a task that could error, and since no caller can observe that task, its only option is to crash the runtime when an error happens. We have the options of:
1. Let the runtime crash (what happens today)
2. Capture only the `TaskCanceledException` case (what this PR does)
3. Capture and log all errors

The problem with #3 is that it introduces a state I'm too naïve to reason about: what happens when an error happens _and we fail to restart the worker_? Letting the runtime crash could be a downstream net win from a state standpoint of "well, we restart everything and recover". It's not fast though, vs. restarting just the worker. If that's _not_ a new state and it's handled correctly, 3 is the better/global option.

The interaction here with the test suite is:
1. We throw a worker error (in testing)
2. An event is triggered, kicking off an `async void` to restart the worker
3. We dispose of the Rpc bits (when the test finishes)
4. That background task is canceled, throwing an error and crashing our runtime

I think it's worth noting this fixes this class only. Similar usages (grep `async void`) exist in other areas and are something we should address globally. Writing this up to make sure the direction is correct/agreeable as a first step.

Co-authored-by: Surgupta <surgupta@microsoft.com>
Co-authored-by: Nick Craver <nrcraver@gmail.com>
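
To make option 2 concrete, here is a minimal standalone sketch of the pattern, not the dispatcher code itself. `AsyncVoidSketch`, `RestartWorkerAsync`, and `OnWorkerError` are hypothetical names invented for illustration; `RestartWorkerAsync` only stands in for `DisposeAndRestartWorkerChannel`, and the `TaskCompletionSource` exists purely so the sketch can report what happened.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncVoidSketch
{
    // Hypothetical stand-in for DisposeAndRestartWorkerChannel: it waits on a
    // token that the owner cancels during teardown.
    private static async Task RestartWorkerAsync(CancellationToken token)
    {
        // Task.Delay throws TaskCanceledException once the token is canceled.
        await Task.Delay(Timeout.Infinite, token);
    }

    // async void: the caller gets no Task back, so an exception that escapes
    // this method cannot be observed by anyone and ends the process instead.
    public static async void OnWorkerError(CancellationToken token, TaskCompletionSource<string> outcome)
    {
        try
        {
            await RestartWorkerAsync(token);
            outcome.TrySetResult("restarted");
        }
        catch (TaskCanceledException)
        {
            // "Torn down while trying to restart": catch and ignore, so the
            // cancellation never escapes the async void method.
            outcome.TrySetResult("canceled-and-ignored");
        }
    }

    public static async Task Main()
    {
        using var cts = new CancellationTokenSource();
        var outcome = new TaskCompletionSource<string>();

        OnWorkerError(cts.Token, outcome); // fire-and-forget, like an event handler
        cts.Cancel();                      // simulate disposing the Rpc bits

        Console.WriteLine(await outcome.Task);
    }
}
```

Removing the `try`/`catch` from `OnWorkerError` in this sketch should reproduce the failure mode described above: the cancellation surfaces as an unhandled exception on a thread pool thread rather than at the call site.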
diff --git a/src/WebJobs.Script/Utility.cs b/src/WebJobs.Script/Utility.cs
@@ -631,9 +631,21 @@ internal static bool IsSingleLanguage(IEnumerable<FunctionMetadata> functions, s
             return ContainsFunctionWithWorkerRuntime(filteredFunctions, workerRuntime);
         }
 
-        internal static string GetWorkerRuntime(IEnumerable<FunctionMetadata> functions)
+        internal static string GetWorkerRuntime(IEnumerable<FunctionMetadata> functions, IEnvironment environment = null)
         {
-            if (IsSingleLanguage(functions, null))
+            string workerRuntime = null;
+
+            if (environment != null)
+            {
+                workerRuntime = environment.GetEnvironmentVariable(EnvironmentSettingNames.FunctionWorkerRuntime);
+
+                if (!string.IsNullOrEmpty(workerRuntime))
+                {
+                    return workerRuntime;
+                }
+            }
+
+            if (functions != null && IsSingleLanguage(functions, null))
             {
                 var filteredFunctions = functions?.Where(f => !f.IsCodeless());
                 string functionLanguage = filteredFunctions.FirstOrDefault()?.Language;
diff --git a/src/WebJobs.Script/Workers/Rpc/FunctionRegistration/RpcFunctionInvocationDispatcher.cs b/src/WebJobs.Script/Workers/Rpc/FunctionRegistration/RpcFunctionInvocationDispatcher.cs
@@ -217,7 +217,8 @@ public async Task InitializeAsync(IEnumerable<FunctionMetadata> functions, Cance
                 return;
             }
 
-            _workerRuntime = _workerRuntime ?? _environment.GetEnvironmentVariable(EnvironmentSettingNames.FunctionWorkerRuntime);
+            _workerRuntime = _workerRuntime ?? Utility.GetWorkerRuntime(functions, _environment);
+
             if (string.IsNullOrEmpty(_workerRuntime) || _workerRuntime.Equals(RpcWorkerConstants.DotNetLanguageWorkerName, StringComparison.InvariantCultureIgnoreCase))
             {
                 // Shutdown any placeholder channels for empty function apps or dotnet function apps.
@@ -434,16 +435,26 @@ public async void WorkerError(WorkerErrorEvent workerError)
                 return;
             }
 
-            if (string.Equals(_workerRuntime, workerError.Language))
+            try
             {
-                _logger.LogDebug("Handling WorkerErrorEvent for runtime:{runtime}, workerId:{workerId}. Failed with: {exception}", workerError.Language, _workerRuntime, workerError.Exception);
-                AddOrUpdateErrorBucket(workerError);
-                await DisposeAndRestartWorkerChannel(workerError.Language, workerError.WorkerId, workerError.Exception);
+                if (string.Equals(_workerRuntime, workerError.Language))
+                {
+                    _logger.LogDebug("Handling WorkerErrorEvent for runtime:{runtime}, workerId:{workerId}. Failed with: {exception}", workerError.Language, _workerRuntime, workerError.Exception);
+                    AddOrUpdateErrorBucket(workerError);
+                    await DisposeAndRestartWorkerChannel(workerError.Language, workerError.WorkerId, workerError.Exception);
+                }
+                else
+                {
+                    _logger.LogDebug("Received WorkerErrorEvent for runtime:{runtime}, workerId:{workerId}", workerError.Language, workerError.WorkerId);
+                    _logger.LogDebug("WorkerErrorEvent runtime:{runtime} does not match current runtime:{currentRuntime}. Failed with: {exception}", workerError.Language, _workerRuntime, workerError.Exception);
+                }
             }
-            else
+            catch (TaskCanceledException)
             {
-                _logger.LogDebug("Received WorkerErrorEvent for runtime:{runtime}, workerId:{workerId}", workerError.Language, workerError.WorkerId);
-                _logger.LogDebug("WorkerErrorEvent runtime:{runtime} does not match current runtime:{currentRuntime}. Failed with: {exception}", workerError.Language, _workerRuntime, workerError.Exception);
+                // Specifically in the "we were torn down while trying to restart" case, we want to catch here and ignore
+                // If we don't catch the exception from an async void method, we'll end up tearing down the entire runtime instead
+                // It's possible we want to catch *all* exceptions and log or ignore here, but taking the minimal change first
+                // For example if we capture and log, we're left in a worker-less state with a working Host runtime - is that desired? Will it self recover elsewhere?
             }
         }
 
@@ -454,8 +465,18 @@ public async void WorkerRestart(WorkerRestartEvent workerRestart)
                 return;
             }
 
-            _logger.LogDebug("Handling WorkerRestartEvent for runtime:{runtime}, workerId:{workerId}", workerRestart.Language, workerRestart.WorkerId);
-            await DisposeAndRestartWorkerChannel(workerRestart.Language, workerRestart.WorkerId);
+            try
+            {
+                _logger.LogDebug("Handling WorkerRestartEvent for runtime:{runtime}, workerId:{workerId}", workerRestart.Language, workerRestart.WorkerId);
+                await DisposeAndRestartWorkerChannel(workerRestart.Language, workerRestart.WorkerId);
+            }
+            catch (TaskCanceledException)
+            {
+                // Specifically in the "we were torn down while trying to restart" case, we want to catch here and ignore
+                // If we don't catch the exception from an async void method, we'll end up tearing down the entire runtime instead
+                // It's possible we want to catch *all* exceptions and log or ignore here, but taking the minimal change first
+                // For example if we capture and log, we're left in a worker-less state with a working Host runtime - is that desired? Will it self recover elsewhere?
+            }
         }
 
         public async Task StartWorkerChannel()
diff --git a/test/WebJobs.Script.Tests/UtilityTests.cs b/test/WebJobs.Script.Tests/UtilityTests.cs
@@ -595,6 +595,27 @@ public void IsSupported_Returns_True(string language, string funcMetadataLanguag
             Assert.True(Utility.IsFunctionMetadataLanguageSupportedByWorkerRuntime(func1, language));
         }
 
+        [Theory]
+        [InlineData(null)]
+        [InlineData("java")]
+        public void GetWorkerRuntimeTests(string workerRuntime)
+        {
+            FunctionMetadata func1 = new FunctionMetadata()
+            {
+                Name = "func1",
+                Language = workerRuntime
+            };
+
+            IEnumerable<FunctionMetadata> functionMetadatas = new List<FunctionMetadata>
+            {
+                 func1
+            };
+
+            var testEnv = new TestEnvironment();
+            testEnv.SetEnvironmentVariable(EnvironmentSettingNames.FunctionWorkerRuntime, workerRuntime);
+            Assert.True(Utility.GetWorkerRuntime(functionMetadatas, testEnv) == workerRuntime);
+        }
+
         [Theory]
         [InlineData("node", "java")]
         [InlineData("java", "node")]
diff --git a/test/WebJobs.Script.Tests/Workers/Rpc/RpcFunctionInvocationDispatcherTests.cs b/test/WebJobs.Script.Tests/Workers/Rpc/RpcFunctionInvocationDispatcherTests.cs
@@ -126,6 +126,41 @@ public async Task WorkerIndexing_Setting_ChannelInitializationState_Succeeds()
             Assert.Equal(expectedProcessCount, initializedChannelsCount);
         }
 
+        [Theory]
+        [InlineData(null)]
+        [InlineData(RpcWorkerConstants.JavaLanguageWorkerName)]
+        public async Task WorkerRuntime_Setting_ChannelInitializationState_Succeeds(string workerRuntime)
+        {
+            _testLoggerProvider.ClearAllLogMessages();
+            int expectedProcessCount = 1;
+            RpcFunctionInvocationDispatcher functionDispatcher = GetTestFunctionDispatcher(expectedProcessCount, false, runtime: workerRuntime, workerIndexing: true);
+
+            // create channels and ensure that they aren't ready for invocation requests yet
+            await functionDispatcher.InitializeAsync(new List<FunctionMetadata>());
+
+            if (!string.IsNullOrEmpty(workerRuntime))
+            {
+                int createdChannelsCount = await WaitForJobhostWorkerChannelsToStartup(functionDispatcher, expectedProcessCount, false);
+                Assert.Equal(expectedProcessCount, createdChannelsCount);
+
+                IEnumerable<IRpcWorkerChannel> channels = await functionDispatcher.GetInitializedWorkerChannelsAsync();
+                Assert.Equal(0, channels.Count());
+
+                // set up invocation buffers, send load requests, and ensure that the channels are now set up for invocation requests
+                var functions = GetTestFunctionsList(RpcWorkerConstants.JavaLanguageWorkerName);
+                await functionDispatcher.FinishInitialization(functions);
+                int initializedChannelsCount = await WaitForJobhostWorkerChannelsToStartup(functionDispatcher, expectedProcessCount, true);
+                Assert.Equal(expectedProcessCount, initializedChannelsCount);
+            }
+            else
+            {
+                foreach (var currChannel in functionDispatcher.JobHostLanguageWorkerChannelManager.GetChannels())
+                {
+                    Assert.True(((TestRpcWorkerChannel)currChannel).ExecutionContexts.Count == 0);
+                }
+            }
+        }
+
         [Fact]
         public async Task Starting_MultipleJobhostChannels_Failed()
         {