Regression: STSAssumeRoleWebIdentityCredentialsProvider not caching credentials: STS congestion and crashes #3558

@jukka-aro-navvis

Bug Description

The STSAssumeRoleWebIdentityCredentialsProvider in AWS SDK C++ bypasses the credential caching layer, causing severe performance issues and connection crashes in multi-threaded applications using Kubernetes IAM Roles for Service Accounts (IRSA).

This is a regression: credential caching worked correctly in AWS SDK C++ version 1.11.621 and was broken by commit c6ed3e3.

Observed Symptoms

  1. STS Congestion: Occasional very high latency when calling GetObject for S3 data access, caused by STS congestion from a large number of concurrent calls issued by different threads
  2. Connection Crashes: Recurring crashes in aws-c-http connection handling at https://github.com/awslabs/aws-c-http/blob/main/source/connection.c#L429, suspected to be a double release bug in reference counting

Environment

  • AWS SDK C++ Version: 1.11.653 (regression introduced in 1.11.622+)
  • Last Working Version: 1.11.621
  • Operating System: Ubuntu 22.04
  • Authentication Method: AssumeRoleWithWebIdentity (Kubernetes IRSA)
  • Cloud Setup: IAM role-based authentication with Kubernetes service accounts annotated with AWS IAM roles
  • Application: Multi-threaded application using Aws::S3Crt::S3CrtClient making concurrent AWS S3 operations

Important: This issue is particularly relevant when using the S3 CRT Client (Aws::S3Crt::S3CrtClient) as it may have different credential provider behavior compared to the standard S3 client.

Root Cause Analysis

Primary Issue: Missing Credential Caching

File: src/aws-cpp-sdk-core/source/auth/STSCredentialsProvider.cpp (line 39)

// Current problematic implementation - creates uncached provider
m_credentialsProvider = Aws::Crt::Auth::CredentialsProvider::CreateCredentialsProviderSTSWebIdentity(stsConfig);

Problem: The STSAssumeRoleWebIdentityCredentialsProvider creates a raw, uncached STS Web Identity provider, so every call to GetAWSCredentials() results in a direct network call to AWS STS, with variable latency typically ranging from 5 to 15 seconds. This is a regression introduced by commit c6ed3e3, which removed the previous credential caching without adding equivalent caching in the CRT layer.

Impact:

  • Every AWS S3 operation (GetObject, PutObject, ListObjects) triggers a fresh STS call
  • Multi-threaded applications create concurrent STS request bursts leading to congestion
  • Connection pool strain from numerous simultaneous HTTP connections to STS
  • No benefit from credential caching despite 1-hour STS credential lifetime
  • Suspected connection reference counting issues leading to crashes in aws-c-http
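
To illustrate why the missing cache matters, here is a minimal, self-contained sketch of time-based caching around an expensive fetch. It is not the SDK's or the CRT's implementation: FakeCredentials, TtlCachedFetcher, and CountFetches are hypothetical names, and the "fetch" is a local counter rather than a real STS call.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for a credential set; not an SDK type.
struct FakeCredentials {
    std::string accessKeyId;
};

// Wraps an expensive fetch function and reuses its result until the TTL elapses.
class TtlCachedFetcher {
public:
    using Clock = std::chrono::steady_clock;
    using FetchFn = std::function<FakeCredentials()>;

    TtlCachedFetcher(FetchFn fetch, std::chrono::milliseconds ttl)
        : m_fetch(std::move(fetch)), m_ttl(ttl) {
        assert(m_fetch && "fetch function must be callable");
    }

    // Returns the cached value; calls the fetch function only when the
    // cache is empty or stale. The mutex is held during the fetch, so
    // concurrent callers share one refresh instead of each starting their own.
    FakeCredentials Get(Clock::time_point now = Clock::now()) {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (!m_hasValue || now >= m_expiresAt) {
            m_value = m_fetch();
            m_expiresAt = now + m_ttl;
            m_hasValue = true;
        }
        return m_value;
    }

private:
    FetchFn m_fetch;
    std::chrono::milliseconds m_ttl;
    std::mutex m_mutex;
    FakeCredentials m_value;
    Clock::time_point m_expiresAt{};
    bool m_hasValue = false;
};

// Counts upstream fetches for `calls` lookups spaced `step` apart,
// using a simulated clock so the result does not depend on wall time.
inline int CountFetches(int calls, std::chrono::milliseconds step,
                        std::chrono::milliseconds ttl) {
    int fetches = 0;
    TtlCachedFetcher cache(
        [&fetches] { ++fetches; return FakeCredentials{"AKIDEXAMPLE"}; }, ttl);
    auto now = TtlCachedFetcher::Clock::time_point{};
    for (int i = 0; i < calls; ++i) {
        cache.Get(now);
        now += step;
    }
    return fetches;
}
```

Under these assumptions, CountFetches(1000, std::chrono::seconds(1), std::chrono::minutes(50)) performs a single upstream fetch, while a TTL of zero fetches on every call, which mirrors the uncached behavior described above.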

Secondary Issue: Race Condition in Thread Synchronization

File: src/aws-cpp-sdk-core/source/auth/STSCredentialsProvider.cpp (GetAWSCredentials method)

AWSCredentials STSAssumeRoleWebIdentityCredentialsProvider::GetAWSCredentials() {
  AWSCredentials credentials{};
  auto refreshDone = false;  // LOCAL VARIABLE - NOT THREAD SAFE
  
  m_credentialsProvider->GetCredentials(
    [this, &credentials, &refreshDone](std::shared_ptr<Aws::Crt::Auth::Credentials> crtCredentials, int errorCode) -> void {
      const std::unique_lock<std::mutex> lock{m_refreshMutex};
      // Process credentials...
      refreshDone = true;  // SETS LOCAL VARIABLE OF CALLING THREAD ONLY
    });

  std::unique_lock<std::mutex> lock{m_refreshMutex};
  m_refreshSignal.wait_for(lock, m_providerFuturesTimeoutMs, [&refreshDone]() -> bool { 
    return refreshDone; // Each thread waits on its own local variable
  });
  
  return credentials;
}

Race Condition: Each thread has its own local refreshDone variable, causing multiple concurrent threads to make separate STS calls instead of sharing results.

Regression Issue

  • Select this option if this issue appears to be a regression.

Expected Behavior

Credentials should be cached for approximately 50-55 minutes (standard STS credential lifetime minus refresh buffer) with:

  • Cache hits completing in <40ms
  • Fresh STS calls only when credentials are near expiry
  • Thread-safe coordination between concurrent requests
  • Reduced connection pressure on aws-c-http layer
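
The "lifetime minus refresh buffer" rule can be written as a small predicate. This is a sketch only: NeedsRefresh is a hypothetical helper, and the 60-minute lifetime and 10-minute buffer used in the note below are illustrative values, not constants taken from the SDK.

```cpp
#include <cassert>
#include <chrono>

// Returns true when credentials that expire at `expiry` should be refreshed
// at time `now`, i.e. when less than `buffer` of their lifetime remains.
inline bool NeedsRefresh(std::chrono::system_clock::time_point now,
                         std::chrono::system_clock::time_point expiry,
                         std::chrono::seconds buffer) {
    assert(buffer.count() >= 0 && "refresh buffer must not be negative");
    return now >= expiry - buffer;
}
```

With a 60-minute lifetime and a 10-minute buffer, a request 49 minutes after issuance is served from the cache, and a request at 50 minutes or later triggers a refresh.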

Current Behavior

Performance Impact

  • Occasional high latency: STS congestion due to concurrent request bursts from multiple threads
  • Connection strain: Multiple simultaneous HTTP connections to STS service
  • Stability issues: Suspected double release bugs in aws-c-http connection reference counting

Reproduction Steps

Minimal Working Example

This example demonstrates the issue using Aws::S3Crt::S3CrtClient with multiple concurrent threads:

#include <aws/core/Aws.h>
#include <aws/core/auth/STSCredentialsProvider.h>
#include <aws/s3-crt/S3CrtClient.h>
#include <aws/s3-crt/model/ListObjectsV2Request.h>
#include <chrono>
#include <iostream>
#include <memory>
#include <string>
#include <thread>
#include <vector>

// Demonstrates the issue with S3 CRT Client and concurrent credential requests
void demonstrateIssue() {
    Aws::SDKOptions options;
    // Enable some logging to see credential provider activity
    options.loggingOptions.logLevel = Aws::Utils::Logging::LogLevel::Warn;
    Aws::InitAPI(options);

    // Configure S3 CRT Client (important: this is different from regular S3Client)
    Aws::S3Crt::ClientConfiguration config;
    config.region = "us-east-1";  // Replace with your region
    config.throughputTargetGbps = 10.0;
    config.partSize = 8 * 1024 * 1024; // 8MB parts
    
    auto s3CrtClient = std::make_shared<Aws::S3Crt::S3CrtClient>(config);

    // Simulate multi-threaded S3 operations to demonstrate STS congestion
    std::vector<std::thread> threads;
    const int numThreads = 20;  // High thread count to show concurrent STS calls
    const std::string bucketName = "your-test-bucket";  // Replace with actual bucket
    
    std::cout << "Starting " << numThreads << " concurrent S3 operations..." << std::endl;
    std::cout << "This will create concurrent STS calls leading to occasional high latency" << std::endl;
    
    auto overallStart = std::chrono::steady_clock::now();
    
    for (int i = 0; i < numThreads; ++i) {
        threads.emplace_back([s3CrtClient, bucketName, i]() {
            auto start = std::chrono::steady_clock::now();
            
            // This will trigger credential resolution for each thread
            Aws::S3Crt::Model::ListObjectsV2Request request;
            request.SetBucket(bucketName);
            request.SetMaxKeys(1); // Minimize data transfer to focus on credential timing
            
            auto outcome = s3CrtClient->ListObjectsV2(request);
            
            auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(
                std::chrono::steady_clock::now() - start
            );
            
            std::cout << "Thread " << i << " completed in " << duration.count() << "ms";
            if (outcome.IsSuccess()) {
                std::cout << " (SUCCESS)" << std::endl;
            } else {
                std::cout << " (ERROR: " << outcome.GetError().GetMessage() << ")" << std::endl;
            }
        });
    }

    for (auto& thread : threads) {
        thread.join();
    }
    
    auto overallDuration = std::chrono::duration_cast<std::chrono::seconds>(
        std::chrono::steady_clock::now() - overallStart
    );
    
    std::cout << "\nTotal execution time: " << overallDuration.count() << " seconds" << std::endl;
    std::cout << "Without caching: Expect occasional very high latency due to STS congestion" << std::endl;
    std::cout << "With caching: Most operations should complete quickly using cached credentials" << std::endl;

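    // Destroy the client before shutting down the SDK; SDK objects must not outlive ShutdownAPI.
    s3CrtClient.reset();
    Aws::ShutdownAPI(options);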
}

int main() {
    // Ensure environment variables are set for IRSA:
    // AWS_REGION, AWS_ROLE_ARN, AWS_WEB_IDENTITY_TOKEN_FILE
    // (These are typically set automatically in Kubernetes IRSA environments)
    
    demonstrateIssue();
    return 0;
}

What to observe when running this example:

  • Without the fix: Occasional very high latency on multiple threads due to STS service congestion from concurrent requests
  • With proper caching: Consistent fast performance with only occasional STS calls when credentials need refresh
  • Reduced risk of connection crashes in aws-c-http due to lower connection pressure

Recommended Solution

Implement CRT-Level Credential Caching

See the attached working patch, which can be applied directly to tag 1.11.653.

AWS_SDK_CPP_BUG_REPORT_PATCH.patch

// Create underlying STS provider
auto stsProvider = Aws::Crt::Auth::CredentialsProvider::CreateCredentialsProviderSTSWebIdentity(stsConfig);
if (!stsProvider || !stsProvider->IsValid()) {
  AWS_LOGSTREAM_WARN(STS_LOG_TAG, "Failed to create underlying STS credentials provider");
  return;
}

// Wrap with caching provider (50 minutes TTL to refresh before 1-hour STS expiry)
Aws::Crt::Auth::CredentialsProviderCachedConfig cachedConfig;
cachedConfig.Provider = stsProvider;
cachedConfig.CachedCredentialTTL = std::chrono::minutes(50);

m_credentialsProvider = Aws::Crt::Auth::CredentialsProvider::CreateCredentialsProviderCached(cachedConfig);
if (m_credentialsProvider && m_credentialsProvider->IsValid()) {
  m_state = STATE::INITIALIZED;
  AWS_LOGSTREAM_INFO(STS_LOG_TAG, "STS credentials provider initialized with 50-minute cache TTL");
} else {
  AWS_LOGSTREAM_WARN(STS_LOG_TAG, "Failed to create cached STS credentials provider");
}

Fix Thread Synchronization Race Condition

class STSAssumeRoleWebIdentityCredentialsProvider : public AWSCredentialsProvider {
private:
    // Thread-safe credential fetch coordination
    mutable std::atomic<bool> m_refreshInProgress{false};
    mutable std::shared_ptr<AWSCredentials> m_pendingCredentials;
    mutable std::mutex m_refreshMutex;
    mutable std::condition_variable m_refreshSignal;

    // Helper methods for credential retrieval
    AWSCredentials waitForSharedCredentials(std::chrono::steady_clock::time_point requestStartTime) const;
    AWSCredentials extractCredentialsFromCrt(const Aws::Crt::Auth::Credentials& crtCredentials) const;
    AWSCredentials fetchCredentialsAsync(std::chrono::steady_clock::time_point requestStartTime);
};

AWSCredentials STSAssumeRoleWebIdentityCredentialsProvider::GetAWSCredentials() {
  if (m_state != STATE::INITIALIZED) {
    return AWSCredentials{};
  }

  auto requestStartTime = std::chrono::steady_clock::now();

  // Thread-safe check: If another thread is already fetching, wait for its result
  auto expected = false;
  if (!m_refreshInProgress.compare_exchange_strong(expected, true)) {
    return waitForSharedCredentials(requestStartTime);
  }

  // This thread will fetch the credentials
  auto credentials = fetchCredentialsAsync(requestStartTime);
  
  if (!credentials.IsEmpty()) {
    credentials.AddUserAgentFeature(Aws::Client::UserAgentFeature::CREDENTIALS_STS_WEB_IDENTITY_TOKEN);
  }

  return credentials;
}
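
The helper bodies are not shown above, so the following is only a sketch of the coordination pattern they imply, not the code from the attached patch. SingleFlight, Waiters, and CountFetchesUnderConcurrency are hypothetical names, the fetch returns a plain string instead of AWSCredentials, and the sketch assumes the fetch function does not throw.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Coalesces concurrent calls: one caller runs the fetch, the others wait
// for that fetch to finish and share its result.
class SingleFlight {
public:
    explicit SingleFlight(std::function<std::string()> fetch)
        : m_fetch(std::move(fetch)) {
        assert(m_fetch && "fetch function must be callable");
    }

    std::string Get() {
        std::unique_lock<std::mutex> lock(m_mutex);
        if (m_inFlight) {
            // Another thread is fetching: wait for its generation to complete.
            const auto waitingFor = m_generation;
            ++m_waiters;
            m_done.wait(lock, [&] { return m_generation != waitingFor; });
            --m_waiters;
            return m_result;
        }
        m_inFlight = true;
        lock.unlock();                  // do not hold the lock across the slow call
        std::string fresh = m_fetch();
        lock.lock();
        m_result = std::move(fresh);
        m_inFlight = false;
        ++m_generation;
        m_done.notify_all();
        return m_result;
    }

    // Number of threads currently parked waiting for an in-flight fetch.
    int Waiters() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_waiters;
    }

private:
    std::function<std::string()> m_fetch;
    mutable std::mutex m_mutex;
    std::condition_variable m_done;
    std::string m_result;
    unsigned long m_generation = 0;
    int m_waiters = 0;
    bool m_inFlight = false;
};

// Starts `threadCount` concurrent Get() calls against a fake fetch that is
// held open until every other thread is waiting; returns how many fetches ran.
inline int CountFetchesUnderConcurrency(int threadCount) {
    std::atomic<int> fetches{0};
    std::atomic<bool> release{false};
    SingleFlight flight([&] {
        ++fetches;
        while (!release.load()) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        return std::string("token");
    });
    std::vector<std::thread> threads;
    for (int i = 0; i < threadCount; ++i) {
        threads.emplace_back([&] { flight.Get(); });
    }
    while (flight.Waiters() < threadCount - 1) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    release = true;
    for (auto& t : threads) {
        t.join();
    }
    return fetches.load();
}
```

In this sketch, eight concurrent callers result in one fetch. A production version would also need a timeout on the wait and a way to clear the in-flight flag if the fetch fails.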

Additional Context

This issue is particularly severe in Kubernetes environments using IRSA (IAM Roles for Service Accounts) where:

  • Applications typically run multiple worker threads (10-50+ concurrent operations)
  • Each thread may independently access AWS services using S3 CRT Client
  • The lack of credential caching creates concurrent request bursts to STS
  • Connection crashes occur due to suspected double release bugs in aws-c-http connection reference counting under high concurrent load
  • The S3 CRT Client may have different credential provider instantiation patterns than the regular S3 client

The fix aligns with AWS SDK credential caching patterns used in other providers like DefaultAWSCredentialsProviderChain, which properly wraps underlying providers with caching layers, reducing both latency and connection pressure on the underlying HTTP layer.

Files Affected

  • src/aws-cpp-sdk-core/include/aws/core/auth/STSCredentialsProvider.h
  • src/aws-cpp-sdk-core/source/auth/STSCredentialsProvider.cpp

Environment

AWS CPP SDK version used

1.11.622+ (currently 1.11.653)

Compiler and Version used

g++ (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0

Operating System and version

Ubuntu 22.04 LTS

Labels

  • bug: This issue is a bug.
  • p1: This is a high priority issue.
  • pending-release: This issue will be fixed by an approved PR that hasn't been released yet.
