fix(smus): Improve error handling when the Space takes too long to start #8277

PotatoWKY · 2025-11-11T04:53:23Z

Problem

Occasionally, when a user clicks the Connect button for a Space that is in the Stopped status, the corresponding App that gets created takes too long to become Running, so the user is shown the following error message:

Remote connection failed: Timed out waiting for app "default-b97e54b8-e0e1-70b7-a216-856fcbb3cc61" to reach "InService" status. | Timed out waiting for app "default-b97e54b8-e0e1-70b7-a216-856fcbb3cc61" to reach "InService" status.

We can't prevent this from happening as it depends on the SageMaker platform, but we can improve the user experience around this.

Solution

- Add more time to the hard timeout.
- Update the progress messages when the App takes longer than usual to connect.
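
The two-stage timeout described above could look roughly like the sketch below. It is illustrative only and is not the code in this PR: `waitForAppInService`, `describeStatus`, `report`, and `sleep` are hypothetical names, and the 5-second polling interval is an assumption inferred from the retry counts discussed in review (12 retries for 1 minute, 120 retries for 10 minutes).

```typescript
// Hypothetical sketch of a soft/hard timeout polling loop; not the PR's actual implementation.
interface PollDeps {
    describeStatus: () => Promise<string> // assumed status lookup, e.g. 'Pending' or 'InService'
    report: (message: string) => void // assumed progress reporter
    sleep: (ms: number) => Promise<void> // injected so the loop can run without real delays
}

const pollIntervalMs = 5_000 // assumption: 12 retries is roughly 1 minute
const softTimeoutRetries = 12
const hardTimeoutRetries = 120

async function waitForAppInService(appName: string, deps: PollDeps): Promise<void> {
    for (let attempt = 0; attempt < hardTimeoutRetries; attempt++) {
        const status = await deps.describeStatus()
        if (status === 'InService') {
            return
        }
        // Once the soft timeout is reached, keep waiting but tell the user it is slow.
        if (attempt + 1 === softTimeoutRetries) {
            deps.report('Starting the Space is taking longer than usual. The space will connect when ready')
        }
        await deps.sleep(pollIntervalMs)
    }
    // Only the hard timeout surfaces an error to the user.
    throw new Error(`Timed out waiting for app "${appName}" to reach "InService" status.`)
}
```

Injecting `sleep` and `describeStatus` is a design choice made for this sketch so that the loop can be exercised in a test without waiting 10 minutes.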

Appearance

currently:

Screenshot 2025-11-12 at 12 13 53 PM (3)

(2 min 30 sec)

->

this change:

(1 min)

->

Note: Based on @dylanraws' and ricokyle@'s suggestions, the exact wording is changed to "Connecting to testX: Starting the Space is taking longer than usual. The space will connect when ready"

(9 min)

->

Treat all work as PUBLIC. Private feature/x branches will not be squash-merged at release time.
Your code changes must meet the guidelines in CONTRIBUTING.md.
License: I confirm that my contribution is made under the terms of the Apache 2.0 license.

amazon-inspector-ohio · 2025-11-11T04:53:27Z

⏳ I'm reviewing this pull request for security vulnerabilities and code quality issues. I'll provide an update when I'm done

github-actions · 2025-11-11T04:53:39Z

This pull request implements a feat or fix, so it must include a changelog entry (unless the fix is for an unreleased feature). Review the changelog guidelines.
- Note: beta or "experiment" features that have active users should announce fixes in the changelog.
- If this is not a feature or fix, use an appropriate type from the title guidelines. For example, telemetry-only changes should use the telemetry type.

amazon-inspector-ohio · 2025-11-11T04:53:58Z

✅ I finished the code review, and didn't find any security or code quality issues.

dylanraws · 2025-11-12T00:32:12Z

packages/core/src/awsService/sagemaker/sagemakerSpace.ts

+        // Edge case when this.spaceApp.App is null, returned by ListApp API for a Space that is not connected to for over 24 hours
+        if (!this.spaceApp.App) {
+            this.spaceApp.App = spaceApp.App
+        }


Regarding the comment:

This is not an "edge" case. This will happen to users often. Even for users who access their Space every weekday, this will happen after every weekend.

The technical details we are describing here (usage of ListApps; 24 hour TTL) are probably not going to be easily understood by the reader/maintainer of this code. The relevant point to get across is that the App may not exist and that is expected/normal behavior.

This change is already merged and should not be in this PR, but I can update this inline comment. Thanks for the clarification!
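
As a rough illustration of the reviewer's point, the inline comment could state that a missing App is normal rather than describe API details. The sketch below uses simplified, hypothetical shapes (`App`, `SpaceApp`) and a hypothetical helper `mergeApp`; the real types and the mutation style in the PR may differ.

```typescript
// Hypothetical, simplified shapes; the real types in the PR may differ.
interface App {
    AppName?: string
    Status?: string
}

interface SpaceApp {
    SpaceName: string
    App?: App
}

function mergeApp(current: SpaceApp, latest: SpaceApp): SpaceApp {
    // A Space may have no App yet. This is expected, so adopt the latest App when one is missing.
    if (!current.App) {
        return { ...current, App: latest.App }
    }
    return current
}
```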

dylanraws · 2025-11-12T17:22:07Z

packages/core/src/shared/clients/sagemaker.ts

+            .flatten()
+            .promise()
+        return appsList[0] // At most one App for one SagemakerSpace
+    }


nit: listApp* (without an s) is a confusing name, easily interpreted as a mistake which could distract the reader. Possible clearer names:

getAppViaListApps

getAppForSpaceViaListApps
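
A minimal sketch of what the second suggested name might wrap is shown below. It assumes an injected `listApps` function standing in for the paginated ListApps call, and `AppSummary` is a simplified, hypothetical stand-in for the SDK's app details type; neither is taken from the PR.

```typescript
// Hypothetical, simplified stand-in for the SDK's app details type.
interface AppSummary {
    AppName?: string
    SpaceName?: string
    Status?: string
}

// listApps is an assumed injected function standing in for a paginated ListApps call.
async function getAppForSpaceViaListApps(
    spaceName: string,
    listApps: (spaceName: string) => Promise<AppSummary[]>
): Promise<AppSummary | undefined> {
    const apps = await listApps(spaceName)
    // At most one App is expected per Space; undefined means no App exists yet.
    return apps[0]
}
```

The name makes it explicit that a single App is returned even though the underlying call lists Apps.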

dylanraws · 2025-11-12T17:22:09Z

packages/core/src/shared/clients/sagemaker.ts

+        softTimeoutRetries = 12, // 1 minute
+        hardTimeoutRetries = 120, // 10 minutes


Are these parameters ever passed by the caller? If not, consider simply making them constants at the top of the file. Unused parameters increase the cognitive load of the reader/maintainer as they need to think about potential usage of those parameters.
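
One way to read this suggestion is sketched below with module-level constants. The constant names, the helper `retriesToMinutes`, and the 5-second polling interval are assumptions; the interval is inferred from the inline comments (12 retries for 1 minute, 120 retries for 10 minutes) rather than stated in the diff.

```typescript
// Hypothetical module-level constants; names and the 5-second interval are assumptions.
const appPollIntervalMs = 5_000
const softTimeoutRetries = 12
const hardTimeoutRetries = 120

/** Converts a retry count into the wall-clock minutes it represents. */
function retriesToMinutes(retries: number, intervalMs: number = appPollIntervalMs): number {
    return (retries * intervalMs) / 60_000
}

console.log(retriesToMinutes(softTimeoutRetries)) // 1
console.log(retriesToMinutes(hardTimeoutRetries)) // 10
console.log(retriesToMinutes(30)) // 2.5, matching the previous maxRetries default of 30
```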

bhavya2109sharma · 2025-11-12T20:45:54Z

packages/core/src/shared/clients/sagemaker.ts

        appType: string,
-        maxRetries = 30,
+        progress?: vscode.Progress<{ message?: string; increment?: number }>,
+        softTimeoutRetries = 12, // 1 minute


Can we keep the original 2.5-minute timeout and display a message if it exceeds that time?

I think making the soft timeout shorter than the original timeout is reasonable, since there is still a long wait available after the soft timeout.
SMUS Spaces start faster than SMAI Spaces, so I have been testing with SMAI Spaces. They typically start in 30 to 40 seconds unless the Space is very large: an ml.m5.16xlarge (64 vCPU, 256 GiB) took 4 min 30 sec to start.

What I meant is that we can decide on a new soft timeout now, since it is different from the original 2.5-minute timeout, which is a hard timeout.

If your regression testing shows that the current reporting time is acceptable, that's fine. However, future changes might increase or decrease this time; I am not sure. By showing the "taking too long" message after just 1 minute, we may be setting the wrong expectations from a user's perspective. That's why I suggested keeping the threshold at the original 2.5 minutes. If everyone else is fine with 1 minute, then there is no need to change it.

amazon-inspector-ohio · 2025-11-13T02:17:39Z

⏳ I'm reviewing this pull request for security vulnerabilities and code quality issues. I'll provide an update when I'm done

amazon-inspector-ohio · 2025-11-13T02:18:10Z

✅ I finished the code review, and didn't find any security or code quality issues.

PotatoWKY requested a review from a team as a code owner November 11, 2025 04:53

dylanraws reviewed Nov 12, 2025

PotatoWKY force-pushed the timeout branch from af8c663 to 9794447 Compare November 12, 2025 18:25

dylanraws approved these changes Nov 12, 2025

bhavya2109sharma reviewed Nov 12, 2025

bhavya2109sharma approved these changes Nov 12, 2025

PotatoWKY force-pushed the timeout branch 3 times, most recently from ea1ae49 to 117d483 Compare November 13, 2025 01:27

PotatoWKY closed this Nov 13, 2025

PotatoWKY reopened this Nov 13, 2025

fix(smus): Improve error handling when the Space takes too long to start

773f62d

PotatoWKY force-pushed the timeout branch from 117d483 to 773f62d Compare November 13, 2025 02:30

laileni-aws approved these changes Nov 13, 2025

Will-ShaoHua approved these changes Nov 13, 2025

Will-ShaoHua merged commit e369ff3 into aws:master Nov 13, 2025
31 checks passed
