Skip to content

Commit 7cbd9e0

Browse files
stuartp44npalmCopilotgithub-aws-runners-pr|bot
authored
feat: Add last chance check before orphan termination (#4595)
This pull request introduces enhancements to the AWS Lambda functions for scaling GitHub Actions runners. Key changes include the addition of functionality to manage runner tags (tagging and untagging). - For JIT-instances the GitHub runner id is is added in the tag `ghr:github_runner_id` during registrartion in the scale-up lambda - For non-JIT-instances the tag `ghr:github_runner_id` is added in in the start script. - The IAM permission for the runner instances now allows to create only for the instance self to create the tag `ghr:github_runner_id`. - The scale down function is using the runner id to efficient check the status of a runner. - Orphan tags are removed when a runner seems to be active (again). fix: #4391 ### Functionality Enhancements: * **Tag Management:** - Added `untag` function to remove tags from EC2 instances. (`lambdas/functions/control-plane/src/aws/runners.ts`, [lambdas/functions/control-plane/src/aws/runners.tsR117-R122](diffhunk://#diff-69076cf041b0bcf70b90ba1dde121bfc3cb3b57bc9d293d29f7da72affcf0041R117-R122)) - Updated orphan runner handling to include untagging logic when a runner is no longer orphaned. (`lambdas/functions/control-plane/src/scale-runners/scale-down.ts`, [lambdas/functions/control-plane/src/scale-runners/scale-down.tsL197-R261](diffhunk://#diff-2bc035e9625a802da709edd04410265c80ad992a4ead3d76024bcdb001f34d2eL197-R261)) * **JIT Runner Support:** - Introduced logic to handle JIT runners by checking their status and activity before untagging or terminating. (`lambdas/functions/control-plane/src/scale-runners/scale-down.ts`, [lambdas/functions/control-plane/src/scale-runners/scale-down.tsL197-R261](diffhunk://#diff-2bc035e9625a802da709edd04410265c80ad992a4ead3d76024bcdb001f34d2eL197-R261)) - Added `runnerId` to runner information for better identification and handling of JIT runners. (`lambdas/functions/control-plane/src/aws/runners.ts`, [lambdas/functions/control-plane/src/aws/runners.tsR95](diffhunk://#diff-69076cf041b0bcf70b90ba1dde121bfc3cb3b57bc9d293d29f7da72affcf0041R95)) ### Testing Improvements: * **Expanded Test Coverage:** - Added tests for the new `untag` functionality. (`lambdas/functions/control-plane/src/aws/runners.test.ts`, [lambdas/functions/control-plane/src/aws/runners.test.tsL232-R295](diffhunk://#diff-a41db796ea6eeeb679f8cda3eb91243ffc16e68b8ebbeaea6837fbf85e7241baL232-R295)) - Enhanced scale-down tests to cover JIT runner scenarios, including untagging and termination logic. (`lambdas/functions/control-plane/src/scale-runners/scale-down.test.ts`, [lambdas/functions/control-plane/src/scale-runners/scale-down.test.tsR354-R405](diffhunk://#diff-5ede55c4f52b3aa79e856e42bcd9f5047d38f4ea4edb3903db7bce9fdb65cc38R354-R405)) * **Mock Updates:** - Updated mocks to include the new `untag` function for testing purposes. (`lambdas/functions/control-plane/src/scale-runners/scale-down.test.ts`, [lambdas/functions/control-plane/src/scale-runners/scale-down.test.tsR36](diffhunk://#diff-5ede55c4f52b3aa79e856e42bcd9f5047d38f4ea4edb3903db7bce9fdb65cc38R36)) These changes improve the scalability and robustness of the control-plane functions, ensuring better handling of dynamic runner configurations and edge cases. --------- Co-authored-by: Niek Palm <[email protected]> Co-authored-by: Copilot <[email protected]> Co-authored-by: github-aws-runners-pr|bot <github-aws-runners-pr[bot]@users.noreply.github.com>
1 parent cbeea89 commit 7cbd9e0

File tree

15 files changed

+469
-23
lines changed

15 files changed

+469
-23
lines changed

lambdas/functions/control-plane/src/aws/runners.d.ts

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -10,6 +10,7 @@ export interface RunnerList {
1010
repo?: string;
1111
org?: string;
1212
orphan?: boolean;
13+
runnerId?: string;
1314
}
1415

1516
export interface RunnerInfo {

lambdas/functions/control-plane/src/aws/runners.test.ts

Lines changed: 63 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,7 @@ import {
44
CreateFleetInstance,
55
CreateFleetResult,
66
CreateTagsCommand,
7+
DeleteTagsCommand,
78
DefaultTargetCapacityType,
89
DescribeInstancesCommand,
910
DescribeInstancesResult,
@@ -17,7 +18,7 @@ import { mockClient } from 'aws-sdk-client-mock';
1718
import 'aws-sdk-client-mock-jest/vitest';
1819

1920
import ScaleError from './../scale-runners/ScaleError';
20-
import { createRunner, listEC2Runners, tag, terminateRunner } from './runners';
21+
import { createRunner, listEC2Runners, tag, untag, terminateRunner } from './runners';
2122
import { RunnerInfo, RunnerInputParameters, RunnerType } from './runners.d';
2223
import { describe, it, expect, beforeEach, vi } from 'vitest';
2324

@@ -53,14 +54,34 @@ const mockRunningInstances: DescribeInstancesResult = {
5354
},
5455
],
5556
};
57+
const mockRunningInstancesJit: DescribeInstancesResult = {
58+
Reservations: [
59+
{
60+
Instances: [
61+
{
62+
LaunchTime: new Date('2020-10-10T14:48:00.000+09:00'),
63+
InstanceId: 'i-1234',
64+
Tags: [
65+
{ Key: 'ghr:Application', Value: 'github-action-runner' },
66+
{ Key: 'ghr:runner_name_prefix', Value: RUNNER_NAME_PREFIX },
67+
{ Key: 'ghr:created_by', Value: 'scale-up-lambda' },
68+
{ Key: 'ghr:Type', Value: 'Org' },
69+
{ Key: 'ghr:Owner', Value: 'CoderToCat' },
70+
{ Key: 'ghr:github_runner_id', Value: '9876543210' },
71+
],
72+
},
73+
],
74+
},
75+
],
76+
};
5677

5778
describe('list instances', () => {
5879
beforeEach(() => {
5980
vi.resetModules();
6081
vi.clearAllMocks();
6182
});
6283

63-
it('returns a list of instances', async () => {
84+
it('returns a list of instances (Non JIT)', async () => {
6485
mockEC2Client.on(DescribeInstancesCommand).resolves(mockRunningInstances);
6586
const resp = await listEC2Runners();
6687
expect(resp.length).toBe(1);
@@ -73,6 +94,20 @@ describe('list instances', () => {
7394
});
7495
});
7596

97+
it('returns a list of instances (JIT)', async () => {
98+
mockEC2Client.on(DescribeInstancesCommand).resolves(mockRunningInstancesJit);
99+
const resp = await listEC2Runners();
100+
expect(resp.length).toBe(1);
101+
expect(resp).toContainEqual({
102+
instanceId: 'i-1234',
103+
launchTime: new Date('2020-10-10T14:48:00.000+09:00'),
104+
type: 'Org',
105+
owner: 'CoderToCat',
106+
orphan: false,
107+
runnerId: '9876543210',
108+
});
109+
});
110+
76111
it('check orphan tag.', async () => {
77112
const instances: DescribeInstancesResult = mockRunningInstances;
78113
instances.Reservations![0].Instances![0].Tags!.push({ Key: 'ghr:orphan', Value: 'true' });
@@ -229,11 +264,35 @@ describe('tag runner', () => {
229264
owner: 'owner-2',
230265
type: 'Repo',
231266
};
232-
await tag(runner.instanceId, [{ Key: 'ghr:orphan', Value: 'truer' }]);
267+
await tag(runner.instanceId, [{ Key: 'ghr:orphan', Value: 'true' }]);
268+
269+
expect(mockEC2Client).toHaveReceivedCommandWith(CreateTagsCommand, {
270+
Resources: [runner.instanceId],
271+
Tags: [{ Key: 'ghr:orphan', Value: 'true' }],
272+
});
273+
});
274+
});
233275

276+
describe('untag runner', () => {
277+
beforeEach(() => {
278+
vi.clearAllMocks();
279+
});
280+
it('removing extra tag', async () => {
281+
mockEC2Client.on(DeleteTagsCommand).resolves({});
282+
const runner: RunnerInfo = {
283+
instanceId: 'instance-2',
284+
owner: 'owner-2',
285+
type: 'Repo',
286+
};
287+
await tag(runner.instanceId, [{ Key: 'ghr:orphan', Value: 'true' }]);
234288
expect(mockEC2Client).toHaveReceivedCommandWith(CreateTagsCommand, {
235289
Resources: [runner.instanceId],
236-
Tags: [{ Key: 'ghr:orphan', Value: 'truer' }],
290+
Tags: [{ Key: 'ghr:orphan', Value: 'true' }],
291+
});
292+
await untag(runner.instanceId, [{ Key: 'ghr:orphan', Value: 'true' }]);
293+
expect(mockEC2Client).toHaveReceivedCommandWith(DeleteTagsCommand, {
294+
Resources: [runner.instanceId],
295+
Tags: [{ Key: 'ghr:orphan', Value: 'true' }],
237296
});
238297
});
239298
});

lambdas/functions/control-plane/src/aws/runners.ts

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,7 @@ import {
22
CreateFleetCommand,
33
CreateFleetResult,
44
CreateTagsCommand,
5+
DeleteTagsCommand,
56
DescribeInstancesCommand,
67
DescribeInstancesResult,
78
EC2Client,
@@ -91,6 +92,7 @@ function getRunnerInfo(runningInstances: DescribeInstancesResult) {
9192
repo: i.Tags?.find((e) => e.Key === 'ghr:Repo')?.Value as string,
9293
org: i.Tags?.find((e) => e.Key === 'ghr:Org')?.Value as string,
9394
orphan: i.Tags?.find((e) => e.Key === 'ghr:orphan')?.Value === 'true',
95+
runnerId: i.Tags?.find((e) => e.Key === 'ghr:github_runner_id')?.Value as string,
9496
});
9597
}
9698
}
@@ -112,6 +114,12 @@ export async function tag(instanceId: string, tags: Tag[]): Promise<void> {
112114
await ec2.send(new CreateTagsCommand({ Resources: [instanceId], Tags: tags }));
113115
}
114116

117+
export async function untag(instanceId: string, tags: Tag[]): Promise<void> {
118+
logger.debug(`Untagging '${instanceId}'`, { tags });
119+
const ec2 = getTracedAWSV3Client(new EC2Client({ region: process.env.AWS_REGION }));
120+
await ec2.send(new DeleteTagsCommand({ Resources: [instanceId], Tags: tags }));
121+
}
122+
115123
function generateFleetOverrides(
116124
subnetIds: string[],
117125
instancesTypes: string[],

lambdas/functions/control-plane/src/scale-runners/scale-down.test.ts

Lines changed: 59 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@ import nock from 'nock';
44

55
import { RunnerInfo, RunnerList } from '../aws/runners.d';
66
import * as ghAuth from '../github/auth';
7-
import { listEC2Runners, terminateRunner, tag } from './../aws/runners';
7+
import { listEC2Runners, terminateRunner, tag, untag } from './../aws/runners';
88
import { githubCache } from './cache';
99
import { newestFirstStrategy, oldestFirstStrategy, scaleDown } from './scale-down';
1010
import { describe, it, expect, beforeEach, vi } from 'vitest';
@@ -33,6 +33,7 @@ vi.mock('./../aws/runners', async (importOriginal) => {
3333
return {
3434
...actual,
3535
tag: vi.fn(),
36+
untag: vi.fn(),
3637
terminateRunner: vi.fn(),
3738
listEC2Runners: vi.fn(),
3839
};
@@ -62,6 +63,7 @@ const mockedInstallationAuth = vi.mocked(ghAuth.createGithubInstallationAuth);
6263
const mockCreateClient = vi.mocked(ghAuth.createOctokitClient);
6364
const mockListRunners = vi.mocked(listEC2Runners);
6465
const mockTagRunners = vi.mocked(tag);
66+
const mockUntagRunners = vi.mocked(untag);
6567
const mockTerminateRunners = vi.mocked(terminateRunner);
6668

6769
export interface TestData {
@@ -312,7 +314,7 @@ describe('Scale down runners', () => {
312314
checkNonTerminated(runners);
313315
});
314316

315-
it(`Should terminate orphan.`, async () => {
317+
it(`Should terminate orphan (Non JIT)`, async () => {
316318
// setup
317319
const orphanRunner = createRunnerTestData('orphan-1', type, MINIMUM_BOOT_TIME + 1, false, false, false);
318320
const idleRunner = createRunnerTestData('idle-1', type, MINIMUM_BOOT_TIME + 1, true, false, false);
@@ -334,6 +336,7 @@ describe('Scale down runners', () => {
334336
Value: 'true',
335337
},
336338
]);
339+
337340
expect(mockTagRunners).not.toHaveBeenCalledWith(idleRunner.instanceId, expect.anything());
338341

339342
// next cycle, update test data set orphan to true and terminate should be true
@@ -348,6 +351,58 @@ describe('Scale down runners', () => {
348351
checkNonTerminated(runners);
349352
});
350353

354+
it('Should test if orphaned runner, untag if online and busy, else terminate (JIT)', async () => {
355+
// arrange
356+
const orphanRunner = createRunnerTestData(
357+
'orphan-jit',
358+
type,
359+
MINIMUM_BOOT_TIME + 1,
360+
false,
361+
true,
362+
false,
363+
undefined,
364+
1234567890,
365+
);
366+
const runners = [orphanRunner];
367+
368+
mockGitHubRunners([]);
369+
mockAwsRunners(runners);
370+
371+
if (type === 'Repo') {
372+
mockOctokit.actions.getSelfHostedRunnerForRepo.mockResolvedValueOnce({
373+
data: { id: 1234567890, name: orphanRunner.instanceId, busy: true, status: 'online' },
374+
});
375+
} else {
376+
mockOctokit.actions.getSelfHostedRunnerForOrg.mockResolvedValueOnce({
377+
data: { id: 1234567890, name: orphanRunner.instanceId, busy: true, status: 'online' },
378+
});
379+
}
380+
381+
// act
382+
await scaleDown();
383+
384+
// assert
385+
expect(mockUntagRunners).toHaveBeenCalledWith(orphanRunner.instanceId, [{ Key: 'ghr:orphan', Value: 'true' }]);
386+
expect(mockTerminateRunners).not.toHaveBeenCalledWith(orphanRunner.instanceId);
387+
388+
// arrange
389+
if (type === 'Repo') {
390+
mockOctokit.actions.getSelfHostedRunnerForRepo.mockResolvedValueOnce({
391+
data: { runnerId: 1234567890, name: orphanRunner.instanceId, busy: true, status: 'offline' },
392+
});
393+
} else {
394+
mockOctokit.actions.getSelfHostedRunnerForOrg.mockResolvedValueOnce({
395+
data: { runnerId: 1234567890, name: orphanRunner.instanceId, busy: true, status: 'offline' },
396+
});
397+
}
398+
399+
// act
400+
await scaleDown();
401+
402+
// assert
403+
expect(mockTerminateRunners).toHaveBeenCalledWith(orphanRunner.instanceId);
404+
});
405+
351406
it(`Should ignore errors when termination orphan fails.`, async () => {
352407
// setup
353408
const orphanRunner = createRunnerTestData('orphan-1', type, MINIMUM_BOOT_TIME + 1, false, true, true);
@@ -625,6 +680,7 @@ function createRunnerTestData(
625680
orphan: boolean,
626681
shouldBeTerminated: boolean,
627682
owner?: string,
683+
runnerId?: number,
628684
): RunnerTestItem {
629685
return {
630686
instanceId: `i-${name}-${type.toLowerCase()}`,
@@ -638,5 +694,6 @@ function createRunnerTestData(
638694
registered,
639695
orphan,
640696
shouldBeTerminated,
697+
runnerId: runnerId !== undefined ? String(runnerId) : undefined,
641698
};
642699
}

lambdas/functions/control-plane/src/scale-runners/scale-down.ts

Lines changed: 60 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -1,9 +1,10 @@
11
import { Octokit } from '@octokit/rest';
2+
import { Endpoints } from '@octokit/types';
23
import { createChildLogger } from '@aws-github-runner/aws-powertools-util';
34
import moment from 'moment';
45

56
import { createGithubAppAuth, createGithubInstallationAuth, createOctokitClient } from '../github/auth';
6-
import { bootTimeExceeded, listEC2Runners, tag, terminateRunner } from './../aws/runners';
7+
import { bootTimeExceeded, listEC2Runners, tag, untag, terminateRunner } from './../aws/runners';
78
import { RunnerInfo, RunnerList } from './../aws/runners.d';
89
import { GhRunners, githubCache } from './cache';
910
import { ScalingDownConfig, getEvictionStrategy, getIdleRunnerCount } from './scale-down-config';
@@ -12,6 +13,10 @@ import { getGitHubEnterpriseApiUrl } from './scale-up';
1213

1314
const logger = createChildLogger('scale-down');
1415

16+
type OrgRunnerList = Endpoints['GET /orgs/{org}/actions/runners']['response']['data']['runners'];
17+
type RepoRunnerList = Endpoints['GET /repos/{owner}/{repo}/actions/runners']['response']['data']['runners'];
18+
type RunnerState = OrgRunnerList[number] | RepoRunnerList[number];
19+
1520
async function getOrCreateOctokit(runner: RunnerInfo): Promise<Octokit> {
1621
const key = runner.owner;
1722
const cachedOctokit = githubCache.clients.get(key);
@@ -46,7 +51,11 @@ async function getOrCreateOctokit(runner: RunnerInfo): Promise<Octokit> {
4651
return octokit;
4752
}
4853

49-
async function getGitHubRunnerBusyState(client: Octokit, ec2runner: RunnerInfo, runnerId: number): Promise<boolean> {
54+
async function getGitHubSelfHostedRunnerState(
55+
client: Octokit,
56+
ec2runner: RunnerInfo,
57+
runnerId: number,
58+
): Promise<RunnerState> {
5059
const state =
5160
ec2runner.type === 'Org'
5261
? await client.actions.getSelfHostedRunnerForOrg({
@@ -58,12 +67,15 @@ async function getGitHubRunnerBusyState(client: Octokit, ec2runner: RunnerInfo,
5867
owner: ec2runner.owner.split('/')[0],
5968
repo: ec2runner.owner.split('/')[1],
6069
});
61-
62-
logger.info(`Runner '${ec2runner.instanceId}' - GitHub Runner ID '${runnerId}' - Busy: ${state.data.busy}`);
63-
6470
metricGitHubAppRateLimit(state.headers);
6571

66-
return state.data.busy;
72+
return state.data;
73+
}
74+
75+
async function getGitHubRunnerBusyState(client: Octokit, ec2runner: RunnerInfo, runnerId: number): Promise<boolean> {
76+
const state = await getGitHubSelfHostedRunnerState(client, ec2runner, runnerId);
77+
logger.info(`Runner '${ec2runner.instanceId}' - GitHub Runner ID '${runnerId}' - Busy: ${state.busy}`);
78+
return state.busy;
6779
}
6880

6981
async function listGitHubRunners(runner: RunnerInfo): Promise<GhRunners> {
@@ -194,24 +206,59 @@ async function evaluateAndRemoveRunners(
194206
async function markOrphan(instanceId: string): Promise<void> {
195207
try {
196208
await tag(instanceId, [{ Key: 'ghr:orphan', Value: 'true' }]);
197-
logger.info(`Runner '${instanceId}' marked as orphan.`);
209+
logger.info(`Runner '${instanceId}' tagged as orphan.`);
198210
} catch (e) {
199-
logger.error(`Failed to mark runner '${instanceId}' as orphan.`, { error: e });
211+
logger.error(`Failed to tag runner '${instanceId}' as orphan.`, { error: e });
200212
}
201213
}
202214

215+
async function unMarkOrphan(instanceId: string): Promise<void> {
216+
try {
217+
await untag(instanceId, [{ Key: 'ghr:orphan', Value: 'true' }]);
218+
logger.info(`Runner '${instanceId}' untagged as orphan.`);
219+
} catch (e) {
220+
logger.error(`Failed to un-tag runner '${instanceId}' as orphan.`, { error: e });
221+
}
222+
}
223+
224+
async function lastChanceCheckOrphanRunner(runner: RunnerList): Promise<boolean> {
225+
const client = await getOrCreateOctokit(runner as RunnerInfo);
226+
const runnerId = parseInt(runner.runnerId || '0');
227+
const ec2Instance = runner as RunnerInfo;
228+
const state = await getGitHubSelfHostedRunnerState(client, ec2Instance, runnerId);
229+
let isOrphan = false;
230+
logger.debug(
231+
`Runner '${runner.instanceId}' is '${state.status}' and is currently '${state.busy ? 'busy' : 'idle'}'.`,
232+
);
233+
const isOfflineAndBusy = state.status === 'offline' && state.busy;
234+
if (isOfflineAndBusy) {
235+
isOrphan = true;
236+
}
237+
logger.info(`Runner '${runner.instanceId}' is judged to ${isOrphan ? 'be' : 'not be'} orphaned.`);
238+
return isOrphan;
239+
}
240+
203241
async function terminateOrphan(environment: string): Promise<void> {
204242
try {
205243
const orphanRunners = await listEC2Runners({ environment, orphan: true });
206244

207245
for (const runner of orphanRunners) {
208-
logger.info(`Terminating orphan runner '${runner.instanceId}'`);
209-
await terminateRunner(runner.instanceId).catch((e) => {
210-
logger.error(`Failed to terminate orphan runner '${runner.instanceId}'`, { error: e });
211-
});
246+
if (runner.runnerId) {
247+
const isOrphan = await lastChanceCheckOrphanRunner(runner);
248+
if (isOrphan) {
249+
await terminateRunner(runner.instanceId);
250+
} else {
251+
await unMarkOrphan(runner.instanceId);
252+
}
253+
} else {
254+
logger.info(`Terminating orphan runner '${runner.instanceId}'`);
255+
await terminateRunner(runner.instanceId).catch((e) => {
256+
logger.error(`Failed to terminate orphan runner '${runner.instanceId}'`, { error: e });
257+
});
258+
}
212259
}
213260
} catch (e) {
214-
logger.warn(`Failure during orphan runner termination.`, { error: e });
261+
logger.warn(`Failure during orphan termination processing.`, { error: e });
215262
}
216263
}
217264

0 commit comments

Comments
 (0)