@PotatoWKY PotatoWKY commented Nov 11, 2025

Problem

Occasionally, when a user clicks the Connect button for a Space that is in the Stopped status, the App that gets created takes too long to become Running, and the user is shown the following error message:

Remote connection failed: Timed out waiting for app "default-b97e54b8-e0e1-70b7-a216-856fcbb3cc61" to reach "InService" status. | Timed out waiting for app "default-b97e54b8-e0e1-70b7-a216-856fcbb3cc61" to reach "InService" status.

We can't prevent this from happening as it depends on the SageMaker platform, but we can improve the user experience around this.

Solution

  • Add more time to the hard timeout
  • Update the progress message when the App takes longer than usual to connect
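
The two changes above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `waitForAppInService`, `WaitOptions`, `AppStatus`, and the injected `getStatus`, `report`, and `sleep` callbacks are hypothetical names, and the 5-second poll interval is an assumption inferred from "12 retries = 1 minute".

```typescript
// Hypothetical sketch: poll until the App is InService, warn once after a
// soft timeout, and fail only after a much longer hard timeout.
type AppStatus = 'Pending' | 'InService' | 'Failed'

interface WaitOptions {
    getStatus: () => Promise<AppStatus> // assumed status lookup
    report: (message: string) => void // assumed progress reporter
    sleep: (ms: number) => Promise<void> // injected so tests need no real delay
    pollIntervalMs?: number // assumption: 5 seconds per retry
    softTimeoutRetries?: number // 12 retries = 1 minute
    hardTimeoutRetries?: number // 120 retries = 10 minutes
}

async function waitForAppInService(appName: string, opts: WaitOptions): Promise<void> {
    const {
        getStatus,
        report,
        sleep,
        pollIntervalMs = 5000,
        softTimeoutRetries = 12,
        hardTimeoutRetries = 120,
    } = opts

    for (let attempt = 0; attempt < hardTimeoutRetries; attempt++) {
        const status = await getStatus()
        if (status === 'InService') {
            return
        }
        if (status === 'Failed') {
            throw new Error(`App "${appName}" failed to start.`)
        }
        // Soft timeout: tell the user once, then keep waiting.
        if (attempt === softTimeoutRetries) {
            report('Starting the Space is taking longer than usual. The space will connect when ready')
        }
        await sleep(pollIntervalMs)
    }
    // Hard timeout: give up with the same error the user sees today.
    throw new Error(`Timed out waiting for app "${appName}" to reach "InService" status.`)
}
```

In real use, `sleep` could be something like `(ms) => new Promise((resolve) => setTimeout(resolve, ms))`, and `report` could forward to a `vscode.Progress` instance.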

Appearance

currently:

Screenshot 2025-11-12 at 12 13 53 PM (3) (2 min 30 sec)

->

Screenshot 2025-11-12 at 12 05 16 PM

this change:

Screenshot 2025-11-12 at 12 13 53 PM (3) (1 min)

->

Note: Based on suggestions from @dylanraws and ricokyle@, the exact wording was changed to "Connecting to testX: Starting the Space is taking longer than usual. The space will connect when ready"
Screenshot 2025-11-11 at 10 02 37 AM
(9 min)

->

Screenshot 2025-11-12 at 12 05 16 PM
  • Treat all work as PUBLIC. Private feature/x branches will not be squash-merged at release time.
  • Your code changes must meet the guidelines in CONTRIBUTING.md.
  • License: I confirm that my contribution is made under the terms of the Apache 2.0 license.

@PotatoWKY PotatoWKY requested a review from a team as a code owner November 11, 2025 04:53
@amazon-inspector-ohio

⏳ I'm reviewing this pull request for security vulnerabilities and code quality issues. I'll provide an update when I'm done

@github-actions

  • This pull request implements a feat or fix, so it must include a changelog entry (unless the fix is for an unreleased feature). Review the changelog guidelines.
    • Note: beta or "experiment" features that have active users should announce fixes in the changelog.
    • If this is not a feature or fix, use an appropriate type from the title guidelines. For example, telemetry-only changes should use the telemetry type.

@amazon-inspector-ohio

✅ I finished the code review, and didn't find any security or code quality issues.

Comment on lines 40 to 43
// Edge case when this.spaceApp.App is null, returned by ListApp API for a Space that is not connected to for over 24 hours
if (!this.spaceApp.App) {
    this.spaceApp.App = spaceApp.App
}
Contributor

Regarding the comment:

  1. This is not an "edge" case. This will happen to users often. Even for users who access their Space every weekday, this will happen after every weekend.
  2. The technical details we are describing here (usage of ListApps; 24 hour TTL) are probably not going to be easily understood by the reader/maintainer of this code. The relevant point to get across is that the App may not exist and that is expected/normal behavior.
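
One possible shape for that comment, as a self-contained sketch. `SpaceApp`, `AppDetails`, and `withRefreshedApp` are hypothetical stand-ins for the code under review, which mutates `this.spaceApp.App` in place.

```typescript
// Hypothetical types standing in for the real SageMaker space model.
interface AppDetails {
    AppName: string
    Status: string
}

interface SpaceApp {
    SpaceName: string
    App?: AppDetails
}

function withRefreshedApp(current: SpaceApp, refreshed: SpaceApp): SpaceApp {
    // The App may not exist yet for this Space. That is expected, normal
    // behavior (for example, after the Space has been idle), so fall back
    // to the App from the refreshed Space details.
    if (!current.App) {
        return { ...current, App: refreshed.App }
    }
    return current
}
```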

Contributor Author

@PotatoWKY PotatoWKY Nov 12, 2025

This change is already merged and should not be in this PR, but I can update the inline comment. Thanks for the clarification!

.flatten()
.promise()
return appsList[0] // At most one App for one SagemakerSpace
}
Contributor

nit: listApp* (without an s) is a confusing name, easily interpreted as a mistake which could distract the reader. Possible clearer names:

  • getAppViaListApps
  • getAppForSpaceViaListApps
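
As a rough illustration of the second name, assuming a hypothetical `listAllApps` helper that has already flattened the paginated ListApps results (the real method uses the toolkit's own collection helpers, which are not reproduced here):

```typescript
// Hypothetical shape of an App entry returned by ListApps.
interface ListedApp {
    AppName: string
    SpaceName: string
}

// Assumed helper: returns every App for the Space, pagination already flattened.
type ListAllApps = (spaceName: string) => Promise<ListedApp[]>

// The name states both what is returned (one App) and how it is found (ListApps).
async function getAppForSpaceViaListApps(
    listAllApps: ListAllApps,
    spaceName: string
): Promise<ListedApp | undefined> {
    const appsList = await listAllApps(spaceName)
    return appsList[0] // At most one App for one Space; undefined if none exists
}
```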

Comment on lines 368 to 369
softTimeoutRetries = 12, // 1 minute
hardTimeoutRetries = 120, // 10 minutes
Contributor

Are these parameters ever passed by the caller? If not, consider simply making them constants at the top of the file. Unused parameters increase the cognitive load of the reader/maintainer as they need to think about potential usage of those parameters.
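
A sketch of the constants variant, assuming a 5-second poll interval (implied by "12 retries = 1 minute" but not shown in the diff). `pollIntervalMs` and `retriesToMinutes` are hypothetical names added only to make the arithmetic explicit.

```typescript
// Assumption: each retry waits 5 seconds, which makes 12 retries equal 1 minute.
const pollIntervalMs = 5_000

// File-level constants replace the defaulted parameters that no caller overrides.
const softTimeoutRetries = 12 // 1 minute
const hardTimeoutRetries = 120 // 10 minutes

// Hypothetical helper that converts a retry budget to minutes.
function retriesToMinutes(retries: number): number {
    return (retries * pollIntervalMs) / 60_000
}
```

With these values, `retriesToMinutes(softTimeoutRetries)` is 1 and `retriesToMinutes(hardTimeoutRetries)` is 10, matching the inline comments in the diff.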

appType: string,
maxRetries = 30,
progress?: vscode.Progress<{ message?: string; increment?: number }>,
softTimeoutRetries = 12, // 1 minute
Contributor

Can we keep the original 2.5-minute timeout and display a message if it exceeds that time?

Contributor Author

I think making the soft timeout shorter than the original timeout is reasonable, since there is still a long wait available after the soft timeout before the hard timeout is reached.
SMUS spaces start faster than SMAI spaces, so I have been testing with SMAI spaces. They usually start in 30 to 40 seconds unless the instance is very large: an ml.m5.16xlarge (64 vCPU, 256 GiB) took 4 min 30 sec to start.

Contributor Author

I mean that we can decide on a new soft timeout now, since it is different from the original 2.5-minute timeout, which was a hard timeout.

Contributor

If your regression testing shows that the current reporting time is acceptable, that's fine. However, future changes might increase or decrease this time; I am not sure. By showing the 'taking too long' message after just 1 minute, we may be setting the wrong expectations from the user's perspective. That's why I suggested keeping the threshold at the original 2.5 minutes. If everyone else is fine with 1 minute, then there is no need to change it.

@PotatoWKY PotatoWKY force-pushed the timeout branch 3 times, most recently from ea1ae49 to 117d483 Compare November 13, 2025 01:27
@PotatoWKY PotatoWKY closed this Nov 13, 2025
@PotatoWKY PotatoWKY reopened this Nov 13, 2025
@amazon-inspector-ohio

⏳ I'm reviewing this pull request for security vulnerabilities and code quality issues. I'll provide an update when I'm done

@amazon-inspector-ohio

✅ I finished the code review, and didn't find any security or code quality issues.

@Will-ShaoHua Will-ShaoHua merged commit e369ff3 into aws:master Nov 13, 2025
31 checks passed