-
Notifications
You must be signed in to change notification settings - Fork 734
fix(smus): Improve error handling when the Space takes too long to start #8277
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
⏳ I'm reviewing this pull request for security vulnerabilities and code quality issues. I'll provide an update when I'm done |
|
|
✅ I finished the code review, and didn't find any security or code quality issues. |
| // Edge case when this.spaceApp.App is null, returned by ListApp API for a Space that is not connected to for over 24 hours | ||
| if (!this.spaceApp.App) { | ||
| this.spaceApp.App = spaceApp.App | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the comment:
- This is not an "edge" case. This will happen to users often. Even for users who access their Space every weekday, this will happen after every weekend.
- The technical details we are describing here (usage of ListApps; 24 hour TTL) are probably not going to be easily understood by the reader/maintainer of this code. The relevant point to get across is that the App may not exist and that is expected/normal behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this change is already merged and should not be in this PR. But I can update this inline comment, Thanks for clarification!
| .flatten() | ||
| .promise() | ||
| return appsList[0] // At most one App for one SagemakerSpace | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: listApp* (without an s) is a confusing name, easily interpreted as a mistake which could distract the reader. Possible clearer names:
getAppViaListAppsgetAppForSpaceViaListApps
| softTimeoutRetries = 12, // 1 minute | ||
| hardTimeoutRetries = 120, // 10 minutes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these parameters ever passed by the caller? If not, consider simply making them constants at the top of the file. Unused parameters increase the cognitive load of the reader/maintainer as they need to think about potential usage of those parameters.
| appType: string, | ||
| maxRetries = 30, | ||
| progress?: vscode.Progress<{ message?: string; increment?: number }>, | ||
| softTimeoutRetries = 12, // 1 minute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we keep the original 2.5-minute timeout and display a message if it exceeds that time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think making the softTimeout shorter than the original timeout is reasonable. We do have a long time after this softTimeout.
SMUS space starts faster than SMAI spaces, so I been testing using SMAI spaces and likely they'll start in 30 - 40 seconds, unless the space is too large like ml.m5.16xlarge, 64, 256 GiB took 4min 30sec to start
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean I was thinking that we can decide on a new soft-time out now. As it is different from the original 2.5 min timeout which is a hard-timeout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If your regression testing shows that the current reporting time is acceptable, that's fine. However, future changes might increase this time or decrease it I am not sure. By showing the 'taking too long' message after just 1 minute, we may be setting wrong expectations from a user's perspective. That's why I suggested keeping the threshold at original 2.5 minutes. If everyone else is fine with 1 min then fine no need to change.
ea1ae49 to
117d483
Compare
|
⏳ I'm reviewing this pull request for security vulnerabilities and code quality issues. I'll provide an update when I'm done |
|
✅ I finished the code review, and didn't find any security or code quality issues. |
Problem
Occasionally, then a user clicks the Connect button for a Space that is in the Stopped status, the corresponding App that gets created eventually takes too long to become Running, so the user is shown the following error message.
We can't prevent this from happening as it depends on the SageMaker platform, but we can improve the user experience around this.
Solution
Appearance
currently:
->
this change:
->
Note: Based on @dylanraws' and ricokyle@'s suggestions, the exact wording is changed to "Connecting to testX: Starting the Space is taking longer than usual. The space will connect when ready"

(9 min)
->
feature/xbranches will not be squash-merged at release time.