Skip to content

Race condition in SecretManager secret caching #11556

@mathewc

Description

@mathewc

In investigating CRI https://portal.microsofticm.com/imp/v5/incidents/details/51000000793837/summary, telemetry shows that the caching logic we have here for host secrets isn't working correctly under concurrent access. What can happen (and what we saw in the CRI) is that 72 threads can all come in and acquire the lock serially, meaning each thread will log the "Loading host secrets" message then one at a time go through the process to load the secrets from storage. The last one will win. We need to employ a double check strategy to prevent this:

public async virtual Task<HostSecretsInfo> GetHostSecretsAsync()
{
    if (_hostSecrets == null)
    {
        using (_metricsLogger.LatencyEvent(GetMetricEventName(MetricEventNames.SecretManagerGetHostSecrets)))
        {
            await _hostSecretsLock.WaitAsync();

            try
            {
                // FIX: Double-check after acquiring lock
                if (_hostSecrets is not null)
                {
                    return _hostSecrets;
                }

                _logger.LogDebug("Loading host secrets");

                HostSecrets hostSecrets = await LoadSecretsAsync<HostSecrets>();
                
                // ... rest of the method

This will prevent all the unnecessary work that is happening (and the log noise). Note that the same fix should also be applied to GetFunctionSecretsAsync.

In addition, I suspect that this will also address a DataProtection issue that also happened in this CRI. For that issue, the host is showing the following error and stack trace:

Non-decryptable host secrets detected. Refreshing secrets. Exception: System.Security.Cryptography.CryptographicException: The provided payload could not be decrypted. Refer to the inner exception for more information. For more information go to http://aka.ms/dataprotectionwarning
 ---> System.InvalidOperationException: The key ring does not contain a valid default protection key. The data protection system cannot create a new key because auto-generation of keys is disabled. For more information go to http://aka.ms/dataprotectionwarning
   at Microsoft.AspNetCore.DataProtection.KeyManagement.KeyRingProvider.CreateCacheableKeyRingCore(DateTimeOffset now, IKey keyJustAdded)
   at Microsoft.AspNetCore.DataProtection.KeyManagement.KeyRingProvider.Microsoft.AspNetCore.DataProtection.KeyManagement.Internal.ICacheableKeyRingProvider.GetCacheableKeyRing(DateTimeOffset now)
   at Microsoft.AspNetCore.DataProtection.KeyManagement.KeyRingProvider.GetCurrentKeyRingCore(DateTime utcNow, Boolean forceRefresh)
   at Microsoft.AspNetCore.DataProtection.KeyManagement.KeyRingProvider.GetCurrentKeyRing()
   at Microsoft.AspNetCore.DataProtection.KeyManagement.KeyRingBasedDataProtector.UnprotectCore(Byte[] protectedData, Boolean allowOperationsOnRevokedKeys, UnprotectStatus& status)
   --- End of inner exception stack trace ---
   at Microsoft.AspNetCore.DataProtection.KeyManagement.KeyRingBasedDataProtector.UnprotectCore(Byte[] protectedData, Boolean allowOperationsOnRevokedKeys, UnprotectStatus& status)
   at Microsoft.AspNetCore.DataProtection.KeyManagement.KeyRingBasedDataProtector.DangerousUnprotect(Byte[] protectedData, Boolean ignoreRevocationErrors, Boolean& requiresMigration, Boolean& wasRevoked)
   at Microsoft.Azure.WebJobs.Script.WebHost.DataProtectionKeyValueConverter.Unprotect(Key key) in /_/src/WebJobs.Script.WebHost/Security/KeyManagement/DataProtectionKeyValueConverter.cs:line 42
   at Microsoft.Azure.WebJobs.Script.WebHost.DataProtectionKeyValueConverter.ReadValue(Key key) in /_/src/WebJobs.Script.WebHost/Security/KeyManagement/DataProtectionKeyValueConverter.cs:line 29
   at Microsoft.Azure.WebJobs.Script.WebHost.SecretManager.ReadHostSecrets(HostSecrets hostSecrets) in /_/src/WebJobs.Script.WebHost/Security/KeyManagement/SecretManager.cs:line 741
   at Microsoft.Azure.WebJobs.Script.WebHost.SecretManager.GetHostSecretsAsync() in /_/src/WebJobs.Script.WebHost/Security/KeyManagement/SecretManager.cs:line 123.

It might be that the high volume of calls to the DP library caused this failure.

Metadata

Metadata

Labels

Type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions