Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 9 additions & 3 deletions packages/core/src/awsService/ec2/model.ts
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ import {
testSshConnection,
} from '../../shared/extensions/ssh'
import { getLogger } from '../../shared/logger/logger'
import { CancellationError, Timeout } from '../../shared/utilities/timeoutUtils'
import { CancellationError, Timeout, waitUntil } from '../../shared/utilities/timeoutUtils'
import { showMessageWithCancel } from '../../shared/utilities/messages'
import { SshConfig } from '../../shared/sshConfig'
import { SshKeyPair } from './sshKeyPair'
Expand Down Expand Up @@ -150,8 +150,14 @@ export class Ec2Connecter implements vscode.Disposable {
}
}

private async checkForInstanceSsmError(selection: Ec2Selection): Promise<void> {
const isSsmAgentRunning = (await this.ssmClient.getInstanceAgentPingStatus(selection.instanceId)) === 'Online'
public async checkForInstanceSsmError(
selection: Ec2Selection,
options?: Partial<{ interval: number; timeout: number }>
): Promise<void> {
const isSsmAgentRunning = await waitUntil(
async () => (await this.ssmClient.getInstanceAgentPingStatus(selection.instanceId)) === 'Online',
{ interval: options?.interval ?? 500, timeout: options?.timeout ?? 5000 }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you look at telemetry to figure out these numbers or were these just numbers that kind of just make sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

waitUntil has backoff, we should probably use that here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose these somewhat arbitrarily. I didn't want to make the timeout too long since a long loading bar feels bad, but it is cancellable so perhaps I could extend it.

Based on experience, the agent can take a while to startup and sometimes restarting it can make it faster. I still do not fully understand why. I've dug around a little bit before and found this is usually caused by permissions problems, but in our case we explicitly check for the correct permissions before checking the agent status. Usually if the ec2 instance has been running for more than ~10 minutes, the agent is also running.

Its possible adding this retry is insignificant, but seems like an easy thing to try. Next steps would likely be reaching out to the relevant internal channels to see if someone can provide guidance.

)

if (!isSsmAgentRunning) {
this.throwConnectionError('Is SSM Agent running on the target instance?', selection, {
Expand Down
11 changes: 11 additions & 0 deletions packages/core/src/test/awsService/ec2/model.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -125,6 +125,17 @@ describe('Ec2ConnectClient', function () {
}
})

it('retries if agent status is not online', async function () {
const instanceAgentStatus = sinon.stub(SsmClient.prototype, 'getInstanceAgentPingStatus')
instanceAgentStatus.onFirstCall().resolves('Offline')
instanceAgentStatus.onSecondCall().resolves('Online')
try {
await client.checkForInstanceSsmError(instanceSelection, { interval: 10, timeout: 100 })
} catch (err) {
assert.ok(false, `checkForInstanceSsmError failed with error '${err}'`)
}
})

it('does not throw an error if all checks pass', async function () {
sinon.stub(Ec2Connecter.prototype, 'isInstanceRunning').resolves(true)
sinon.stub(Ec2Connecter.prototype, 'getAttachedIamRole').resolves({ Arn: 'testRole' } as IAM.Role)
Expand Down
Loading