Conversation

@dylanraws
Contributor

Problem

When a user tries to connect to a SageMaker Space that is in the Stopped status (i.e., the underlying App is Deleted or has not been created), the toolkit automatically starts the Space before attempting the connection. In some cases, the Space reaches the Running status (i.e., the App reaches the InService status) before the remote access capability is ready, because that capability starts asynchronously. The SageMaker:StartSession API then returns an Internal Failure response. The client already retries, but the retries happen too quickly and finish before remote access becomes ready.

Solution

Adjust the SageMaker client retry configuration for StartSession calls made from the detached server (called via the sagemaker_connect script) to spread out the retries over multiple seconds.
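For readers unfamiliar with the idea, here is a minimal sketch of spreading retries over several seconds with capped exponential backoff. It is not the code from this PR and does not use the AWS SDK's retry configuration; the names `backoffDelayMs` and `retryWithBackoff`, the 500 ms base, the 8 s cap, and the 5-attempt limit are all assumptions chosen for illustration.

```typescript
// Hypothetical sketch only: names and numbers are illustrative assumptions,
// not the toolkit's actual retry configuration.
type DelayFn = (ms: number) => Promise<void>

const sleep: DelayFn = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms))

/** Delay before the retry that follows failed attempt `attempt` (1-based): base * 2^(attempt-1), capped. */
function backoffDelayMs(attempt: number, baseMs: number = 500, capMs: number = 8000): number {
    return Math.min(capMs, baseMs * Math.pow(2, attempt - 1))
}

/**
 * Runs `operation` up to `maxAttempts` times, waiting longer after each failure.
 * Errors for which `isRetryable` returns false are rethrown immediately.
 */
async function retryWithBackoff<T>(
    operation: () => Promise<T>,
    isRetryable: (err: unknown) => boolean,
    maxAttempts: number = 5,
    delay: DelayFn = sleep
): Promise<T> {
    let lastError: unknown
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return await operation()
        } catch (err) {
            lastError = err
            // Give up on non-retryable errors or once the attempt budget is spent.
            if (!isRetryable(err) || attempt === maxAttempts) {
                throw err
            }
            await delay(backoffDelayMs(attempt))
        }
    }
    // Only reached when maxAttempts < 1, in which case lastError is undefined.
    throw lastError
}
```

With these assumed defaults, the four waits between five attempts are 500, 1000, 2000, and 4000 ms, so the retries span about 7.5 seconds rather than firing back to back. That gives an asynchronously starting capability time to become ready.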


  • Treat all work as PUBLIC. Private feature/x branches will not be squash-merged at release time.
  • Your code changes must meet the guidelines in CONTRIBUTING.md.
  • License: I confirm that my contribution is made under the terms of the Apache 2.0 license.

@dylanraws dylanraws requested a review from a team as a code owner October 21, 2025 23:47
@amazon-inspector-ohio

⏳ I'm reviewing this pull request for security vulnerabilities and code quality issues. I'll provide an update when I'm done

@github-actions

  • This pull request modifies code in src/* but no tests were added/updated.
    • Confirm whether tests should be added or ensure the PR description explains why tests are not required.
  • This pull request implements a feat or fix, so it must include a changelog entry (unless the fix is for an unreleased feature). Review the changelog guidelines.
    • Note: beta or "experiment" features that have active users should announce fixes in the changelog.
    • If this is not a feature or fix, use an appropriate type from the title guidelines. For example, telemetry-only changes should use the telemetry type.

@amazon-inspector-ohio

✅ I finished the code review, and didn't find any security or code quality issues.

Contributor

@vpbhargav vpbhargav left a comment

Thanks for the fix! Hope we tested across both SM AI and SMUS to ensure no regressions as well.

@laileni-aws laileni-aws enabled auto-merge (squash) October 22, 2025 16:58
@laileni-aws
Contributor

/retryBuilds

@laileni-aws laileni-aws merged commit 9c0ad28 into aws:master Oct 22, 2025
45 of 46 checks passed
5 participants