Commit 9c0ad28

fix(sagemaker): Adjust retry configuration for StartSession (aws#8219)
## Problem

When a user tries to connect to a SageMaker Space that is in the Stopped status (i.e., the underlying App is Deleted or has not been created), the toolkit automatically starts the Space before attempting the connection. In some cases, the Space reaches the Running status (i.e., the App reaches the InService status) but the remote access capability, which starts asynchronously, is not yet ready. The SageMaker:StartSession API then returns an Internal Failure response. The client already retries, but the retries happen too quickly, before remote access becomes ready.

## Solution

Adjust the SageMaker client retry configuration for StartSession calls made from the detached server (called via the `sagemaker_connect` script) to spread out the retries over multiple seconds.

---

- Treat all work as PUBLIC. Private `feature/x` branches will not be squash-merged at release time.
- Your code changes must meet the guidelines in [CONTRIBUTING.md](https://github.com/aws/aws-toolkit-vscode/blob/master/CONTRIBUTING.md#guidelines).
- License: I confirm that my contribution is made under the terms of the Apache 2.0 license.

Co-authored-by: Laxman Reddy <[email protected]>
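The timing problem described above can be illustrated with a small model. This is only a sketch: `firstSuccessfulAttempt`, the readiness time of 3000 ms, and the "quick" delays of 100 ms and 200 ms are hypothetical values chosen for illustration, not the SDK's actual defaults, and request latency is ignored.

```typescript
// Hypothetical model: remote access becomes ready `readyAtMs` after the App is InService.
// Returns the 1-based attempt number that succeeds, or undefined if every attempt fails.
function firstSuccessfulAttempt(retryDelaysMs: number[], readyAtMs: number): number | undefined {
    let elapsedMs = 0 // the first attempt is sent immediately
    for (let attempt = 1; attempt <= retryDelaysMs.length + 1; attempt++) {
        if (elapsedMs >= readyAtMs) {
            return attempt
        }
        // Wait before the next retry, if a delay is left.
        elapsedMs += retryDelaysMs[attempt - 1] ?? 0
    }
    return undefined
}

// Retries bunched within a fraction of a second never reach the ready state.
const quickResult = firstSuccessfulAttempt([100, 200], 3000)
// Retries spread over several seconds give remote access time to come up.
const spreadResult = firstSuccessfulAttempt([1500, 2250], 3000)

console.log(quickResult) // undefined
console.log(spreadResult) // 3
```

Under these assumed numbers, the quick schedule exhausts its attempts after 300 ms, while the spread schedule is still retrying at 3750 ms, after the assumed readiness point.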
1 parent d34ddfe commit 9c0ad28

1 file changed: +9 -1 lines changed
  • packages/core/src/awsService/sagemaker/detached-server

packages/core/src/awsService/sagemaker/detached-server/utils.ts

Lines changed: 9 additions & 1 deletion
@@ -13,6 +13,7 @@ import os from 'os'
 import { join } from 'path'
 import { SpaceMappings } from '../types'
 import open from 'open'
+import { ConfiguredRetryStrategy } from '@smithy/util-retry'
 export { open }
 
 export const mappingFilePath = join(os.homedir(), '.aws', '.sagemaker-space-profiles')
@@ -22,6 +23,13 @@ const tempFilePath = `${mappingFilePath}.tmp`
 let isWriting = false
 const writeQueue: Array<() => Promise<void>> = []
 
+// Currently SSM registration happens asynchronously with App launch, which can lead to
+// StartSession Internal Failure when connecting to a freshly-started Space.
+// To mitigate, spread out retries over multiple seconds instead of sending all retries within a second.
+// Backoff sequence: 1500ms, 2250ms, 3375ms
+// Retry timing: 1500ms, 3750ms, 7125ms
+const startSessionRetryStrategy = new ConfiguredRetryStrategy(3, (attempt: number) => 1000 * 1.5 ** attempt)
+
 /**
  * Reads the local endpoint info file (default or via env) and returns pid & port.
  * @throws Error if the file is missing, invalid JSON, or missing fields
@@ -83,7 +91,7 @@ export function parseArn(arn: string): { region: string; accountId: string; spac
 
 export async function startSagemakerSession({ region, connectionIdentifier, credentials }: any) {
     const endpoint = process.env.SAGEMAKER_ENDPOINT || `https://sagemaker.${region}.amazonaws.com`
-    const client = new SageMakerClient({ region, credentials, endpoint })
+    const client = new SageMakerClient({ region, credentials, endpoint, retryStrategy: startSessionRetryStrategy })
     const command = new StartSessionCommand({ ResourceIdentifier: connectionIdentifier })
     return client.send(command)
 }
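
The backoff and retry-timing figures quoted in the code comment can be reproduced from the delay function alone. The sketch below assumes the strategy passes a retry count starting at 1 to the delay function, as the comment implies; `backoffMs`, `delays`, and `cumulative` are illustrative names, and how many of these delays are actually used depends on how the strategy counts its maximum attempts.

```typescript
// Same delay function as the one passed to ConfiguredRetryStrategy in the diff.
const backoffMs = (attempt: number): number => 1000 * 1.5 ** attempt

// Delay before each retry, for retry counts 1 through 3 (assumed numbering).
const delays = [1, 2, 3].map(backoffMs)

// Running total: time elapsed since the first attempt when each retry is sent,
// ignoring the duration of the requests themselves.
const cumulative = delays.map((_, i) => delays.slice(0, i + 1).reduce((sum, d) => sum + d, 0))

console.log(delays) // [ 1500, 2250, 3375 ]
console.log(cumulative) // [ 1500, 3750, 7125 ]
```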
