Skip to content

Conversation

@bhavya2109sharma
Copy link
Contributor

@bhavya2109sharma bhavya2109sharma commented Nov 12, 2025

Problem

When attempting to connect to a SageMaker Space, if the user's SSH config had issues (like the old User '%r' directive),
users would see an unhelpful error :

ssh check against host failed: 255

This happened because:

  1. SSH validation ran before checking if the config was outdated
  2. The outdated config caused SSH validation to fail
  3. Users had no opportunity to fix the issue before seeing the error

Solution

Implemented a proactive SSH config validation that checks for outdated config before running SSH validation
Prompts users to update outdated config automatically
Provides helpful error messages with line numbers for external errors
Validates before showing progress dialogs (better UX)

Changes Made

New Files Created

  1. packages/core/src/awsService/sagemaker/sshConfig.ts : Created SageMakerSshConfig class that extends the base SshConfig class with SageMaker specific SSH logic.
  2. packages/core/src/test/awsService/sagemaker/sshConfig.test.ts- test suite

Modified

  1. packages/core/src/awsService/sagemaker/model.ts: Removed ensureValid() from prepareDevEnvConnection() and added comment explaining validation happens in commands.ts

  2. packages/core/src/awsService/sagemaker/commands.ts : Added validateSshConfig() which will validate before showing progress dialogs so that users can fix issues before any connection attempt starts

Testing

  1. SSH Config Exists, No sm_* Section
  2. Happy path : Up-to-Date sm_* Section
  3. Outdated or has error sm_* Section (e.g., has User '%r')
  4. Syntax Error in SSH Config (Outside sm_* Section)
  • All the testing is done for SMUS and SM-AI (with deeplink)
  • Tested changes for Platforms : Mac, Win, Linux

  • Treat all work as PUBLIC. Private feature/x branches will not be squash-merged at release time.
  • Your code changes must meet the guidelines in CONTRIBUTING.md.
  • License: I confirm that my contribution is made under the terms of the Apache 2.0 license.

@bhavya2109sharma bhavya2109sharma requested a review from a team as a code owner November 12, 2025 22:54
@bhavya2109sharma bhavya2109sharma changed the title Fix(Sagemaker) : Improve error handling when the Space SSH host check fails Fix(Sagemaker): Improve error handling when the Space SSH host check fails Nov 12, 2025
@bhavya2109sharma bhavya2109sharma changed the title Fix(Sagemaker): Improve error handling when the Space SSH host check fails fix(Sagemaker): Improve error handling when the Space SSH host check fails Nov 12, 2025
@github-actions
Copy link

  • This pull request implements a feat or fix, so it must include a changelog entry (unless the fix is for an unreleased feature). Review the changelog guidelines.
    • Note: beta or "experiment" features that have active users should announce fixes in the changelog.
    • If this is not a feature or fix, use an appropriate type from the title guidelines. For example, telemetry-only changes should use the telemetry type.

@bhavya2109sharma bhavya2109sharma changed the title fix(Sagemaker): Improve error handling when the Space SSH host check fails fix(sagemaker): improve error handling when the Space SSH host check fails Nov 12, 2025
Copy link
Contributor

@vpbhargav vpbhargav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think worth testing this change in Widows/Mac and Linux if possible. Hard to validate regex validations and some of these platforms have casing, newline peculiarities.


// Check against known versions
const knownVersions = [
this.createSSHConfigSection(proxyCommand).trim(), // Current version
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary, but something to think about: Instead of generating all possible config values (every time), you could store a list of functions here. Instead of iterating the values, you iterate the functions, and for each one, you call the function and then check for the result in the config. If found, then you shouldn't need to proceed to the next function, avoiding generating additional strings. This isn't necessary in this case because these strings are relatively short, and there won't be very many, so the performance impact is negligible, but it's worth being aware of this "lazy" pattern for future situations.

Comment on lines +279 to +280
let startIndex = -1
let endIndex = -1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it necessary to use these pointers? This feels over-complicated. If we stick to exact string matching (using the list of past expected sm_* sections), we can easily determine the start and end indices if we need to make a replacement.

const startIndex = configContent.indexOf(pastSection)
if (startIndex == -1) {
    // Not found
}

const endIndex = startIndex + pastSection.length

// Then slice and generate the new content string

We can probably simplify even further, as I don't imagine we need the indices.

if (configContent.includes(pastSection)) {
    const newContent = configContent.replace(pastSection, newSection)
    // Then write newContent to the file
}

errorDetails = `\n\nError: ${updateError.message}`
}

const message = `Failed to update your ${sshConfigPath} file automatically.${errorDetails}\n\nOpen the file to fix the issue manually.`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have to tell the user we failed to make an update.

When we tell the user "fix the issue manually", will it be clear what they should fix? For example, will we be displaying the ssh -G sm_test error output? If this applies to the case where the sm_* section needs to be deleted by the user, we can specify that in the message.

await this.removeSshConfigSection()
// Write the new section
await this.writeSectionToConfig(proxyCommand)
logger.info('Successfully updated sm_* section')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use a constant or variable in place of sm_*? It's possible we will need to use a different string for other IDEs (since the global storage path for the sagemaker_connect script will be different) so it's better not to hard-code this.

* This is shown when the host section exists but is outdated.
*/
private async promptToUpdateSshConfig(): Promise<boolean> {
logger.warn(`Section is outdated for ${this.configHostName}`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of these comments don't specifically mention the SSH Config. Previously, you had manually prefixed them all with SSH Config, but now that we are using a logger with the sagemaker prefix, SSH Config context is not present, making it hard to understand what these log messages are about. To improve, we can either add more context to the logger (e.g., sagemaker SSH config), or update the wording of each log message.

try {
await this.promptOtherSshConfigError(sshError)
const configOpenedError = new ToolkitError(SshConfigOpenedForEditMessage(), {
code: 'SshConfigOpenedForEdit',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw a previous comment about defining constants for these error codes, but looks like that feedback wasn't yet implemented. Not required, but I figured I would remind you in case you planned on doing that.

return Result.err(externalConfigError)
}
return Result.err(
ToolkitError.chain(e, 'Unexpected error while handling SSH config error', {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"error while handling ... error" sounds odd. Makes me wonder about how this logic is structured.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants