Bug Description
The STSAssumeRoleWebIdentityCredentialsProvider in AWS SDK C++ bypasses the credential caching layer, causing severe performance issues and connection crashes in multi-threaded applications using Kubernetes IAM Roles for Service Accounts (IRSA).
This is a regression: credential caching worked correctly in AWS SDK C++ version 1.11.621 and was broken by commit c6ed3e3.
Observed Symptoms
- STS Congestion: Occasional very high latency when calling GetObject for S3 data access, caused by many threads calling STS concurrently
- Connection Crashes: Recurring crashes in aws-c-http connection handling at https://github.com/awslabs/aws-c-http/blob/main/source/connection.c#L429, suspected to be a double release bug in reference counting
Environment
- AWS SDK C++ Version: 1.11.653 (regression introduced in 1.11.622+)
- Last Working Version: 1.11.621
- Operating System: Ubuntu 22.04
- Authentication Method: AssumeRoleWithWebIdentity (Kubernetes IRSA)
- Cloud Setup: IAM role-based authentication with Kubernetes service accounts annotated with AWS IAM roles
- Application: Multi-threaded application using Aws::S3Crt::S3CrtClient making concurrent AWS S3 operations
Important: This issue is particularly relevant when using the S3 CRT Client (Aws::S3Crt::S3CrtClient) as it may have different credential provider behavior compared to the standard S3 client.
Root Cause Analysis
Primary Issue: Missing Credential Caching
File: src/aws-cpp-sdk-core/source/auth/STSCredentialsProvider.cpp (line 39)
// Current problematic implementation - creates uncached provider
m_credentialsProvider = Aws::Crt::Auth::CredentialsProvider::CreateCredentialsProviderSTSWebIdentity(stsConfig);
Problem: The STSAssumeRoleWebIdentityCredentialsProvider creates a raw, uncached STS Web Identity provider. Every call to GetAWSCredentials() results in a direct network call to AWS STS, with variable latency typically ranging from 5-15 seconds. This behavior is a regression introduced in commit c6ed3e3, which removed the previous credential caching without adding equivalent caching at the CRT layer.
Impact:
- Every AWS S3 operation (GetObject, PutObject, ListObjects) triggers a fresh STS call
- Multi-threaded applications create concurrent STS request bursts leading to congestion
- Connection pool strain from numerous simultaneous HTTP connections to STS
- No benefit from credential caching despite 1-hour STS credential lifetime
- Suspected connection reference counting issues leading to crashes in aws-c-http
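To make the cost of the missing cache concrete, here is a minimal, SDK-independent sketch of a time-based cache around an expensive fetch. `FakeCredentials`, `CachedFetcher`, and `CountFetches` are hypothetical names used only for illustration, and the fetch callback merely stands in for an STS call; none of this is the SDK's actual implementation.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for a set of temporary credentials.
struct FakeCredentials {
    std::string accessKeyId;
};

// Wraps an expensive fetch (standing in for an STS call) with a TTL cache.
class CachedFetcher {
public:
    using Clock = std::chrono::steady_clock;
    using Fetch = std::function<FakeCredentials()>;
    using Now = std::function<Clock::time_point()>;

    CachedFetcher(Fetch fetch, std::chrono::milliseconds ttl,
                  Now now = [] { return Clock::now(); })
        : m_fetch(std::move(fetch)), m_ttl(ttl), m_now(std::move(now)) {
        assert(static_cast<bool>(m_fetch));
    }

    FakeCredentials Get() {
        std::lock_guard<std::mutex> lock(m_mutex);
        const auto now = m_now();
        if (!m_valid || now >= m_expiry) {
            m_cached = m_fetch();  // the only place the expensive call happens
            m_expiry = now + m_ttl;
            m_valid = true;
        }
        return m_cached;
    }

private:
    Fetch m_fetch;
    std::chrono::milliseconds m_ttl;
    Now m_now;
    std::mutex m_mutex;
    FakeCredentials m_cached;
    Clock::time_point m_expiry{};
    bool m_valid = false;
};

// Counts the fetches triggered by `calls` consecutive Get() calls, advancing
// an injected fake clock by `stepPerCall` after each call.
int CountFetches(int calls, std::chrono::milliseconds ttl,
                 std::chrono::milliseconds stepPerCall) {
    int fetches = 0;
    CachedFetcher::Clock::time_point fakeNow{};
    CachedFetcher fetcher(
        [&fetches] { ++fetches; return FakeCredentials{"AKIA-EXAMPLE"}; },
        ttl,
        [&fakeNow] { return fakeNow; });
    for (int i = 0; i < calls; ++i) {
        fetcher.Get();
        fakeNow += stepPerCall;
    }
    return fetches;
}
```

With a TTL of zero, which models the uncached provider, every call reaches the fetch callback. With a 50-minute TTL and one call per simulated minute, only the calls that find the entry expired do.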
Secondary Issue: Race Condition in Thread Synchronization
File: src/aws-cpp-sdk-core/source/auth/STSCredentialsProvider.cpp (GetAWSCredentials method)
AWSCredentials STSAssumeRoleWebIdentityCredentialsProvider::GetAWSCredentials() {
    AWSCredentials credentials{};
    auto refreshDone = false;  // LOCAL VARIABLE - NOT THREAD SAFE
    m_credentialsProvider->GetCredentials(
        [this, &credentials, &refreshDone](std::shared_ptr<Aws::Crt::Auth::Credentials> crtCredentials, int errorCode) -> void {
            const std::unique_lock<std::mutex> lock{m_refreshMutex};
            // Process credentials...
            refreshDone = true;  // SETS LOCAL VARIABLE OF CALLING THREAD ONLY
        });
    std::unique_lock<std::mutex> lock{m_refreshMutex};
    m_refreshSignal.wait_for(lock, m_providerFuturesTimeoutMs, [&refreshDone]() -> bool {
        return refreshDone;  // Each thread waits on its own local variable
    });
    return credentials;
}
Race Condition: Each thread has its own local refreshDone variable, causing multiple concurrent threads to make separate STS calls instead of sharing results.
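For contrast, the following sketch shows what "sharing results" between concurrent callers could look like in plain standard C++. It is an illustration under assumptions, not the SDK's code: `FakeCredentials`, `SharedFetch`, and `CountFetchesAcrossThreads` are hypothetical names, the fetch callback stands in for the STS call, and expiry and error handling are left out.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a set of temporary credentials.
struct FakeCredentials {
    std::string accessKeyId;
};

// The first caller performs the fetch; every other caller waits on the same
// shared_future and receives the same result.
class SharedFetch {
public:
    explicit SharedFetch(std::function<FakeCredentials()> fetch)
        : m_fetch(std::move(fetch)) {
        assert(static_cast<bool>(m_fetch));
    }

    FakeCredentials Get() {
        std::promise<FakeCredentials> promise;
        std::shared_future<FakeCredentials> result;
        bool isLeader = false;
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            if (!m_result.valid()) {
                // No fetch started yet: this caller becomes the leader.
                m_result = promise.get_future().share();
                isLeader = true;
            }
            result = m_result;  // each caller keeps its own copy of the future
        }
        if (isLeader) {
            // Runs outside the lock so followers can queue up meanwhile.
            // If m_fetch throws, followers observe a broken_promise error;
            // a real implementation would forward the exception instead.
            promise.set_value(m_fetch());
        }
        return result.get();
    }

private:
    std::function<FakeCredentials()> m_fetch;
    std::mutex m_mutex;
    std::shared_future<FakeCredentials> m_result;
};

// Starts numThreads concurrent callers and returns how many fetches ran.
int CountFetchesAcrossThreads(int numThreads) {
    std::atomic<int> fetches{0};
    SharedFetch shared([&fetches] {
        ++fetches;
        std::this_thread::sleep_for(std::chrono::milliseconds(50));  // simulated latency
        return FakeCredentials{"AKIA-EXAMPLE"};
    });
    std::vector<std::thread> threads;
    for (int i = 0; i < numThreads; ++i) {
        threads.emplace_back([&shared] { shared.Get(); });
    }
    for (auto& t : threads) {
        t.join();
    }
    return fetches.load();
}
```

Because the shared state lives in the object rather than on each caller's stack, the number of fetches no longer depends on the number of threads.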
Regression Issue
- Select this option if this issue appears to be a regression.
Expected Behavior
Credentials should be cached for approximately 50-55 minutes (standard STS credential lifetime minus refresh buffer) with:
- Cache hits completing in <40ms
- Fresh STS calls only when credentials are near expiry
- Thread-safe coordination between concurrent requests
- Reduced connection pressure on aws-c-http layer
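The 50-55 minute figure follows from simple arithmetic on the credential lifetime and a refresh buffer. A small sketch, assuming a 1-hour lifetime and buffers of 5 to 10 minutes (`CacheTtl` and `NeedsRefresh` are hypothetical helper names, not SDK functions):

```cpp
#include <cassert>
#include <chrono>

// Cache TTL = credential lifetime minus a safety buffer, clamped at zero.
std::chrono::seconds CacheTtl(std::chrono::seconds lifetime,
                              std::chrono::seconds refreshBuffer) {
    assert(lifetime.count() >= 0 && refreshBuffer.count() >= 0);
    if (refreshBuffer >= lifetime) {
        return std::chrono::seconds(0);  // buffer swallows the lifetime: never cache
    }
    return lifetime - refreshBuffer;
}

// A cached entry of the given age must be refreshed once it reaches the TTL.
bool NeedsRefresh(std::chrono::seconds age, std::chrono::seconds ttl) {
    return age >= ttl;
}
```

For example, `CacheTtl(std::chrono::hours(1), std::chrono::minutes(10))` yields 50 minutes, so credentials that are 49 minutes old are still served from cache and credentials that are 50 minutes old trigger a refresh.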
Current Behavior
Impact:
- Every AWS S3 operation (GetObject, PutObject, ListObjects) triggers a fresh STS call
- Multi-threaded applications create concurrent STS request bursts leading to congestion
- Connection pool strain from numerous simultaneous HTTP connections to STS
- No benefit from credential caching despite 1-hour STS credential lifetime
- Suspected connection reference counting issues leading to crashes in aws-c-http
Performance Impact
- Occasional high latency: STS congestion due to concurrent request bursts from multiple threads
- Connection strain: Multiple simultaneous HTTP connections to STS service
- Stability issues: Suspected double release bugs in aws-c-http connection reference counting
Reproduction Steps
Minimal Working Example
This example demonstrates the issue using Aws::S3Crt::S3CrtClient with multiple concurrent threads:
#include <aws/core/Aws.h>
#include <aws/core/auth/STSCredentialsProvider.h>
#include <aws/s3-crt/S3CrtClient.h>
#include <aws/s3-crt/model/ListObjectsV2Request.h>
#include <chrono>
#include <iostream>
#include <memory>
#include <string>
#include <thread>
#include <vector>

// Demonstrates the issue with S3 CRT Client and concurrent credential requests
void demonstrateIssue() {
    Aws::SDKOptions options;
    // Enable some logging to see credential provider activity
    options.loggingOptions.logLevel = Aws::Utils::Logging::LogLevel::Warn;
    Aws::InitAPI(options);
    {
        // Inner scope: the client must be destroyed before Aws::ShutdownAPI is called
        // Configure S3 CRT Client (important: this is different from regular S3Client)
        Aws::S3Crt::ClientConfiguration config;
        config.region = "us-east-1";  // Replace with your region
        config.throughputTargetGbps = 10.0;
        config.partSize = 8 * 1024 * 1024;  // 8MB parts
        auto s3CrtClient = std::make_shared<Aws::S3Crt::S3CrtClient>(config);

        // Simulate multi-threaded S3 operations to demonstrate STS congestion
        std::vector<std::thread> threads;
        const int numThreads = 20;  // High thread count to show concurrent STS calls
        const std::string bucketName = "your-test-bucket";  // Replace with actual bucket
        std::cout << "Starting " << numThreads << " concurrent S3 operations..." << std::endl;
        std::cout << "This will create concurrent STS calls leading to occasional high latency" << std::endl;
        auto overallStart = std::chrono::steady_clock::now();

        for (int i = 0; i < numThreads; ++i) {
            threads.emplace_back([s3CrtClient, bucketName, i]() {
                auto start = std::chrono::steady_clock::now();
                // This will trigger credential resolution for each thread
                Aws::S3Crt::Model::ListObjectsV2Request request;
                request.SetBucket(bucketName);
                request.SetMaxKeys(1);  // Minimize data transfer to focus on credential timing
                auto outcome = s3CrtClient->ListObjectsV2(request);
                auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(
                    std::chrono::steady_clock::now() - start
                );
                std::cout << "Thread " << i << " completed in " << duration.count() << "ms";
                if (outcome.IsSuccess()) {
                    std::cout << " (SUCCESS)" << std::endl;
                } else {
                    std::cout << " (ERROR: " << outcome.GetError().GetMessage() << ")" << std::endl;
                }
            });
        }
        for (auto& thread : threads) {
            thread.join();
        }

        auto overallDuration = std::chrono::duration_cast<std::chrono::seconds>(
            std::chrono::steady_clock::now() - overallStart
        );
        std::cout << "\nTotal execution time: " << overallDuration.count() << " seconds" << std::endl;
        std::cout << "Without caching: Expect occasional very high latency due to STS congestion" << std::endl;
        std::cout << "With caching: Most operations should complete quickly using cached credentials" << std::endl;
    }
    Aws::ShutdownAPI(options);
}

int main() {
    // Ensure environment variables are set for IRSA:
    // AWS_REGION, AWS_ROLE_ARN, AWS_WEB_IDENTITY_TOKEN_FILE
    // (These are typically set automatically in Kubernetes IRSA environments)
    demonstrateIssue();
    return 0;
}
What to observe when running this example:
- Without the fix: Occasional very high latency on multiple threads due to STS service congestion from concurrent requests
- With proper caching: Consistent fast performance with only occasional STS calls when credentials need refresh
- Reduced risk of connection crashes in aws-c-http due to lower connection pressure
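When comparing runs, it may help to summarize the per-thread timings rather than read them line by line. The helper below is a hypothetical addition, not part of the reproduction above: it assumes the caller has collected each thread's duration in milliseconds into a vector.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Minimum, median, and maximum of a set of latency samples in milliseconds.
struct LatencySummary {
    long long minMs;
    long long medianMs;
    long long maxMs;
};

// Takes the samples by value so the caller's vector is left unsorted.
LatencySummary Summarize(std::vector<long long> samplesMs) {
    assert(!samplesMs.empty());
    std::sort(samplesMs.begin(), samplesMs.end());
    // For an even number of samples this picks the upper of the two middle values.
    return LatencySummary{samplesMs.front(),
                          samplesMs[samplesMs.size() / 2],
                          samplesMs.back()};
}
```

A maximum far above the median would be consistent with the occasional high latency described above, while the two values should sit closer together once credentials are served from cache.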
Recommended Solution
Implement CRT-Level Credential Caching
See attached working patch that can be applied directly to tag 1.11.653.
AWS_SDK_CPP_BUG_REPORT_PATCH.patch
// Create underlying STS provider
auto stsProvider = Aws::Crt::Auth::CredentialsProvider::CreateCredentialsProviderSTSWebIdentity(stsConfig);
if (!stsProvider || !stsProvider->IsValid()) {
    AWS_LOGSTREAM_WARN(STS_LOG_TAG, "Failed to create underlying STS credentials provider");
    return;
}
// Wrap with caching provider (50 minutes TTL to refresh before 1-hour STS expiry)
Aws::Crt::Auth::CredentialsProviderCachedConfig cachedConfig;
cachedConfig.Provider = stsProvider;
cachedConfig.CachedCredentialTTL = std::chrono::minutes(50);
m_credentialsProvider = Aws::Crt::Auth::CredentialsProvider::CreateCredentialsProviderCached(cachedConfig);
if (m_credentialsProvider && m_credentialsProvider->IsValid()) {
    m_state = STATE::INITIALIZED;
    AWS_LOGSTREAM_INFO(STS_LOG_TAG, "STS credentials provider initialized with 50-minute cache TTL");
} else {
    AWS_LOGSTREAM_WARN(STS_LOG_TAG, "Failed to create cached STS credentials provider");
}
Fix Thread Synchronization Race Condition
class STSAssumeRoleWebIdentityCredentialsProvider : public AWSCredentialsProvider {
private:
    // Thread-safe credential fetch coordination
    mutable std::atomic<bool> m_refreshInProgress{false};
    mutable std::shared_ptr<AWSCredentials> m_pendingCredentials;
    mutable std::mutex m_refreshMutex;
    mutable std::condition_variable m_refreshSignal;
    // Helper methods for credential retrieval
    AWSCredentials waitForSharedCredentials(std::chrono::steady_clock::time_point requestStartTime) const;
    AWSCredentials extractCredentialsFromCrt(const Aws::Crt::Auth::Credentials& crtCredentials) const;
    AWSCredentials fetchCredentialsAsync(std::chrono::steady_clock::time_point requestStartTime);
};

AWSCredentials STSAssumeRoleWebIdentityCredentialsProvider::GetAWSCredentials() {
    if (m_state != STATE::INITIALIZED) {
        return AWSCredentials{};
    }
    auto requestStartTime = std::chrono::steady_clock::now();
    // Thread-safe check: If another thread is already fetching, wait for its result
    auto expected = false;
    if (!m_refreshInProgress.compare_exchange_strong(expected, true)) {
        return waitForSharedCredentials(requestStartTime);
    }
    // This thread will fetch the credentials
    auto credentials = fetchCredentialsAsync(requestStartTime);
    if (!credentials.IsEmpty()) {
        credentials.AddUserAgentFeature(Aws::Client::UserAgentFeature::CREDENTIALS_STS_WEB_IDENTITY_TOKEN);
    }
    return credentials;
}
Additional Context
This issue is particularly severe in Kubernetes environments using IRSA (IAM Roles for Service Accounts) where:
- Applications typically run multiple worker threads (10-50+ concurrent operations)
- Each thread may independently access AWS services using S3 CRT Client
- The lack of credential caching creates concurrent request bursts to STS
- Connection crashes occur due to suspected double release bugs in aws-c-http connection reference counting under high concurrent load
- The S3 CRT Client may have different credential provider instantiation patterns than the regular S3 client
The fix aligns with AWS SDK credential caching patterns used in other providers like DefaultAWSCredentialsProviderChain, which properly wraps underlying providers with caching layers, reducing both latency and connection pressure on the underlying HTTP layer.
Files Affected
- src/aws-cpp-sdk-core/include/aws/core/auth/STSCredentialsProvider.h
- src/aws-cpp-sdk-core/source/auth/STSCredentialsProvider.cpp
Environment
AWS CPP SDK version used
1.11.622+ (currently 1.11.653)
Compiler and Version used
g++ (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0
Operating System and version
Ubuntu 22.04 LTS