Commit 9aa3203

Authored by npalm and github-aws-runners-pr[bot]
fix: Tag non jit instances with runner id and add docs (#4667)
⚠️ This PR builds on #4539.

This pull request is an enhancement on top of #4539. It introduces tagging via the start scripts for non-JIT configured instances, and it adds a state diagram documenting the ever-growing complexity of the scale-down logic that terminates instances.

### Instance Tagging Enhancements

* Add a function to tag runners with the runner ID for non-JIT instances. The functions are designed to ignore errors, so a tagging failure cannot crash the runner registration process.
* As a side effect, the instance is allowed to create the tag `ghr:github_runner_id`, and only on itself.

### Added docs

* State diagram for scale-down added.

<img width="2084" height="3840" alt="Untitled diagram _ Mermaid Chart-2025-07-20-140732" src="https://github.com/user-attachments/assets/e88af647-98e5-46cb-8da4-28110c608d8d" />

---------

Co-authored-by: github-aws-runners-pr[bot] <github-aws-runners-pr[bot]@users.noreply.github.com>
1 parent 1a59d77 commit 9aa3203

File tree

7 files changed

+258
-0
lines changed

mkdocs.yaml

Lines changed: 5 additions & 0 deletions
@@ -46,6 +46,11 @@ markdown_extensions:
   - admonition
   - pymdownx.details
   - pymdownx.superfences
+  - pymdownx.superfences:
+      custom_fences:
+        - name: mermaid
+          class: mermaid
+          format: !!python/name:pymdownx.superfences.fence_code_format

 nav:
   - Introduction: index.md

modules/runners/README.md

Lines changed: 3 additions & 0 deletions
@@ -18,6 +18,8 @@ The scale up lambda is triggered by events on a SQS queue. Events on this queue

 The scale down lambda is triggered via a CloudWatch event. The event is triggered by a cron expression defined in the variable `scale_down_schedule_expression` (https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/ScheduledEvents.html). For scaling down GitHub does not provide a good API yet, therefore we run the scaling down based on this event every x minutes. Each time the lambda is triggered it tries to remove all runners older than x minutes (configurable) managed in this deployment. In case the runner can be removed from GitHub, which means it is not executing a workflow, the lambda will terminate the EC2 instance.

+--8<-- "modules/runners/scale-down-state-diagram.md:mkdocs_scale_down_state_diagram"
+
 ## Lambda Function

 The Lambda function is written in [TypeScript](https://www.typescriptlang.org/) and requires Node 12.x and yarn. Sources are located in [./lambdas/runners]. Two lambda functions share the same sources, there is one entry point for `scaleDown` and another one for `scaleUp`.
@@ -85,6 +87,7 @@ yarn run dist
 | [aws_iam_role.scale_up](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role) | resource |
 | [aws_iam_role.ssm_housekeeper](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role) | resource |
 | [aws_iam_role_policy.cloudwatch](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy) | resource |
+| [aws_iam_role_policy.create_tag](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy) | resource |
 | [aws_iam_role_policy.describe_tags](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy) | resource |
 | [aws_iam_role_policy.dist_bucket](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy) | resource |
 | [aws_iam_role_policy.ec2](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy) | resource |

modules/runners/policies-runner.tf

Lines changed: 6 additions & 0 deletions
@@ -57,6 +57,12 @@ resource "aws_iam_role_policy" "describe_tags" {
   policy = file("${path.module}/policies/instance-describe-tags-policy.json")
 }

+resource "aws_iam_role_policy" "create_tag" {
+  name = "runner-create-tags"
+  role = aws_iam_role.runner.name
+  policy = templatefile("${path.module}/policies/instance-create-tags-policy.json", {})
+}
+
 resource "aws_iam_role_policy_attachment" "managed_policies" {
   count = length(var.runner_iam_role_managed_policy_arns)
   role = aws_iam_role.runner.name
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+{
+    "Version": "2012-10-17",
+    "Statement": [
+        {
+            "Action": "ec2:CreateTags",
+            "Condition": {
+                "ForAllValues:StringEquals": {
+                    "aws:TagKeys": [
+                        "ghr:github_runner_id"
+                    ]
+                },
+                "StringEquals": {
+                    "aws:ARN": "$${ec2:SourceInstanceARN}"
+                }
+            },
+            "Effect": "Allow",
+            "Resource": "arn:*:ec2:*:*:instance/*"
+        }
+    ]
+}
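
As a rough mental model of this policy, the sketch below restates its two conditions as a plain function. It is a simplification rather than a description of how IAM evaluates requests, and every name in it (`CreateTagsRequest`, `isCreateTagsAllowed`, the example ARNs) is hypothetical. It also assumes, following the commit description, that the `StringEquals` condition restricts the call to the instance tagging itself.

```typescript
// Simplified model of the two conditions in the create-tags policy.
// Illustration only: real IAM evaluation involves many more rules.

const ALLOWED_TAG_KEYS = ["ghr:github_runner_id"];

interface CreateTagsRequest {
  sourceInstanceArn: string; // instance making the call (ec2:SourceInstanceARN)
  targetInstanceArn: string; // instance being tagged (assumed meaning of the aws:ARN comparison)
  tagKeys: string[];         // tag keys in the request (aws:TagKeys)
}

function isCreateTagsAllowed(req: CreateTagsRequest): boolean {
  // ForAllValues:StringEquals: every requested key must be in the allowed set.
  // An empty key list also satisfies a ForAllValues condition.
  const keysOk = req.tagKeys.every((k) => ALLOWED_TAG_KEYS.indexOf(k) !== -1);
  // StringEquals: the caller and the tagged resource must be the same instance.
  const selfOnly = req.sourceInstanceArn === req.targetInstanceArn;
  return keysOk && selfOnly;
}

// Hypothetical ARNs for illustration.
const selfArn = "arn:aws:ec2:eu-west-1:123456789012:instance/i-0abc";
const otherArn = "arn:aws:ec2:eu-west-1:123456789012:instance/i-0def";

console.log(isCreateTagsAllowed({
  sourceInstanceArn: selfArn,
  targetInstanceArn: otherArn,
  tagKeys: ["ghr:github_runner_id"],
})); // false: an instance may not tag another instance
```

Under this model a request is rejected if it targets a different instance or carries any tag key other than `ghr:github_runner_id`.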
Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
+# GitHub Actions Runner Scale-Down State Diagram
+
+<!-- --8<-- [start:mkdocs_scale_down_state_diagram] -->
+
+The scale-down Lambda function runs on a scheduled basis (every 5 minutes by default) to manage GitHub Actions runner instances. It performs a two-phase cleanup process: first terminating confirmed orphaned instances, then evaluating active runners to maintain the desired idle capacity while removing unnecessary instances.
+
+```mermaid
+stateDiagram-v2
+    [*] --> ScheduledExecution : Cron Trigger every 5 min
+
+    ScheduledExecution --> Phase1_OrphanTermination : Start Phase 1
+
+    state Phase1_OrphanTermination {
+        [*] --> ListOrphanInstances : Query EC2 for ghr orphan true
+
+        ListOrphanInstances --> CheckOrphanType : For each orphan
+
+        state CheckOrphanType <<choice>>
+        CheckOrphanType --> HasRunnerIdTag : Has ghr github runner id
+        CheckOrphanType --> TerminateOrphan : No runner ID tag
+
+        HasRunnerIdTag --> LastChanceCheck : Query GitHub API
+
+        state LastChanceCheck <<choice>>
+        LastChanceCheck --> ConfirmedOrphan : Offline and busy
+        LastChanceCheck --> FalsePositive : Exists and not problematic
+
+        ConfirmedOrphan --> TerminateOrphan
+        FalsePositive --> RemoveOrphanTag
+
+        TerminateOrphan --> NextOrphan : Continue processing
+        RemoveOrphanTag --> NextOrphan
+
+        NextOrphan --> CheckOrphanType : More orphans?
+        NextOrphan --> Phase2_ActiveRunners : All processed
+    }
+
+    Phase1_OrphanTermination --> Phase2_ActiveRunners : Phase 1 Complete
+
+    state Phase2_ActiveRunners {
+        [*] --> ListActiveRunners : Query non-orphan EC2 instances
+
+        ListActiveRunners --> GroupByOwner : Sort by owner and repo
+
+        GroupByOwner --> ProcessOwnerGroup : For each owner
+
+        state ProcessOwnerGroup {
+            [*] --> SortByStrategy : Apply eviction strategy
+            SortByStrategy --> ProcessRunner : Oldest first or newest first
+
+            ProcessRunner --> QueryGitHub : Get GitHub runners for owner
+
+            QueryGitHub --> MatchRunner : Find runner by instance ID suffix
+
+            state MatchRunner <<choice>>
+            MatchRunner --> FoundInGitHub : Runner exists in GitHub
+            MatchRunner --> NotFoundInGitHub : Runner not in GitHub
+
+            state FoundInGitHub {
+                [*] --> CheckMinimumTime : Has minimum runtime passed?
+
+                state CheckMinimumTime <<choice>>
+                CheckMinimumTime --> TooYoung : Runtime less than minimum
+                CheckMinimumTime --> CheckIdleQuota : Runtime greater than or equal to minimum
+
+                TooYoung --> NextRunner
+
+                state CheckIdleQuota <<choice>>
+                CheckIdleQuota --> KeepIdle : Idle quota available
+                CheckIdleQuota --> CheckBusyState : Quota full
+
+                KeepIdle --> NextRunner
+
+                state CheckBusyState <<choice>>
+                CheckBusyState --> KeepBusy : Runner busy
+                CheckBusyState --> TerminateIdle : Runner idle
+
+                KeepBusy --> NextRunner
+                TerminateIdle --> DeregisterFromGitHub
+                DeregisterFromGitHub --> TerminateInstance
+                TerminateInstance --> NextRunner
+            }
+
+            state NotFoundInGitHub {
+                [*] --> CheckBootTime : Has boot time exceeded?
+
+                state CheckBootTime <<choice>>
+                CheckBootTime --> StillBooting : Boot time less than threshold
+                CheckBootTime --> MarkOrphan : Boot time greater than or equal to threshold
+
+                StillBooting --> NextRunner
+                MarkOrphan --> TagAsOrphan : Set ghr orphan true
+                TagAsOrphan --> NextRunner
+            }
+
+            NextRunner --> ProcessRunner : More runners in group?
+            NextRunner --> NextOwnerGroup : Group complete
+        }
+
+        NextOwnerGroup --> ProcessOwnerGroup : More owner groups?
+        NextOwnerGroup --> ExecutionComplete : All groups processed
+    }
+
+    Phase2_ActiveRunners --> ExecutionComplete : Phase 2 Complete
+
+    ExecutionComplete --> [*] : Wait for next cron trigger
+
+    note right of LastChanceCheck
+        Uses ghr github runner id tag
+        for precise GitHub API lookup
+    end note
+
+    note right of MatchRunner
+        Matches GitHub runner name
+        ending with EC2 instance ID
+    end note
+
+    note right of CheckMinimumTime
+        Minimum running time in minutes
+        (Linux: 5min, Windows: 15min)
+    end note
+
+    note right of CheckBootTime
+        Runner boot time in minutes
+        Default configuration value
+    end note
+```
+<!-- --8<-- [end:mkdocs_scale_down_state_diagram] -->
+
+
+## Key Decision Points
+
+| State | Condition | Action |
+|-------|-----------|--------|
+| **Orphan w/ Runner ID** | GitHub: offline + busy | Terminate (confirmed orphan) |
+| **Orphan w/ Runner ID** | GitHub: exists + healthy | Remove orphan tag (false positive) |
+| **Orphan w/o Runner ID** | Always | Terminate (no way to verify) |
+| **Active Runner Found** | Runtime < minimum | Keep (too young) |
+| **Active Runner Found** | Idle quota available | Keep as idle |
+| **Active Runner Found** | Quota full + idle | Terminate + deregister |
+| **Active Runner Found** | Quota full + busy | Keep running |
+| **Active Runner Missing** | Boot time exceeded | Mark as orphan |
+| **Active Runner Missing** | Still booting | Wait |
+
+## Configuration Parameters
+
+- **Cron Schedule**: `cron(*/5 * * * ? *)` (every 5 minutes)
+- **Minimum Runtime**: Linux 5min, Windows 15min
+- **Boot Timeout**: Configurable via `runner_boot_time_in_minutes`
+- **Idle Config**: Per-environment configuration for desired idle runners
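
The decision table and diagram added in this file can be read as three small pure functions. The sketch below is one possible restatement, not the Lambda's actual code: all type and function names (`Action`, `GitHubRunner`, `matchRunner`, `decideOrphan`, `decideActive`, `ActiveRunnerContext`) are hypothetical. The table does not cover an orphan whose runner ID is unknown to GitHub or whose state is anything other than offline and busy, so the sketch assumes every such case counts as a false positive.

```typescript
// Sketch of the "Key Decision Points" table. All names are hypothetical.

type Action = "terminate" | "remove-orphan-tag" | "keep" | "mark-orphan" | "wait";

interface GitHubRunner {
  runnerName: string;
  online: boolean;
  busy: boolean;
}

// Phase 2 lookup: find the GitHub runner whose name ends with the EC2 instance ID.
function matchRunner(runners: GitHubRunner[], instanceId: string): GitHubRunner | undefined {
  return runners.filter((r) => r.runnerName.slice(-instanceId.length) === instanceId)[0];
}

// Phase 1: the instance already carries the orphan tag.
function decideOrphan(runner: GitHubRunner | "no-runner-id-tag"): Action {
  if (runner === "no-runner-id-tag") return "terminate"; // no way to verify
  if (!runner.online && runner.busy) return "terminate"; // confirmed orphan
  return "remove-orphan-tag";                            // assumed: everything else is a false positive
}

interface ActiveRunnerContext {
  runner?: GitHubRunner;         // undefined when the runner is not found in GitHub
  ageMinutes: number;            // minutes since the instance was launched
  minimumRuntimeMinutes: number; // e.g. 5 on Linux, 15 on Windows
  bootTimeMinutes: number;       // boot timeout threshold
  idleSlotsLeft: number;         // remaining idle quota for this owner group
}

// Phase 2: the instance is not tagged as an orphan.
function decideActive(ctx: ActiveRunnerContext): Action {
  if (!ctx.runner) {
    return ctx.ageMinutes >= ctx.bootTimeMinutes ? "mark-orphan" : "wait";
  }
  if (ctx.ageMinutes < ctx.minimumRuntimeMinutes) return "keep"; // too young
  if (ctx.idleSlotsLeft > 0) return "keep";                      // keep as idle
  return ctx.runner.busy ? "keep" : "terminate";                 // quota full
}

const decision = decideActive({
  runner: { runnerName: "runner-i-0abc", online: true, busy: false },
  ageMinutes: 20,
  minimumRuntimeMinutes: 5,
  bootTimeMinutes: 5,
  idleSlotsLeft: 0,
});
console.log(decision); // "terminate"
```

Here `"terminate"` for an idle runner stands for the deregister-then-terminate sequence shown in the diagram. The checks run in the same order as the diagram: minimum runtime, then idle quota, then busy state.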

modules/runners/templates/start-runner.ps1

Lines changed: 41 additions & 0 deletions
@@ -1,6 +1,44 @@

 ## Retrieve instance metadata

+function Tag-InstanceWithRunnerId {
+    Write-Host "Checking for .runner file to extract agent ID"
+
+    $runnerFilePath = "$pwd\.runner"
+    if (-not (Test-Path $runnerFilePath)) {
+        Write-Host "Warning: .runner file not found"
+        return $true
+    }
+
+    Write-Host "Found .runner file, extracting agent ID"
+    try {
+        $runnerConfig = Get-Content $runnerFilePath | ConvertFrom-Json
+        $agentId = $runnerConfig.agentId
+
+        if (-not $agentId -or $agentId -eq $null) {
+            Write-Host "Warning: Could not extract agent ID from .runner file"
+            return $true
+        }
+
+        Write-Host "Tagging instance with GitHub runner agent ID: $agentId"
+        $tagResult = aws ec2 create-tags --region "$Region" --resources "$InstanceId" --tags "Key=ghr:github_runner_id,Value=$agentId" 2>&1
+
+        if ($LASTEXITCODE -eq 0) {
+            Write-Host "Successfully tagged instance with agent ID: $agentId"
+            return $true
+        } else {
+            Write-Host "Warning: Failed to tag instance with agent ID - $tagResult"
+            return $true
+        }
+    }
+    catch {
+        Write-Host "Warning: Error processing .runner file - $($_.Exception.Message)"
+        return $true
+    }
+}
+
+## Retrieve instance metadata
+
 Write-Host "Retrieving TOKEN from AWS API"
 $token=Invoke-RestMethod -Method PUT -Uri "http://169.254.169.254/latest/api/token" -Headers @{"X-aws-ec2-metadata-token-ttl-seconds" = "180"}
 if ( ! $token ) {
@@ -122,6 +160,9 @@ if ($enable_jit_config -eq "false" -or $agent_mode -ne "ephemeral") {
     $configCmd = ".\config.cmd --unattended --name $runner_name_prefix$InstanceId --work `"_work`" $runnerExtraOptions $config"
     Write-Host "Configure GH Runner (non ephmeral / no JIT) as user $run_as"
     Invoke-Expression $configCmd
+
+    # Tag instance with GitHub runner agent ID for non-JIT runners
+    Tag-InstanceWithRunnerId
 }

 $jsonBody = @(

modules/runners/templates/start-runner.sh

Lines changed: 33 additions & 0 deletions
@@ -58,6 +58,36 @@ create_xray_error_segment() {
   echo "$SEGMENT_DOC"
 }

+tag_instance_with_runner_id() {
+  echo "Checking for .runner file to extract agent ID"
+
+  if [[ ! -f "/opt/actions-runner/.runner" ]]; then
+    echo "Warning: .runner file not found"
+    return 0
+  fi
+
+  echo "Found .runner file, extracting agent ID"
+  local agent_id
+  agent_id=$(jq -r '.agentId' /opt/actions-runner/.runner 2>/dev/null || echo "")
+
+  if [[ -z "$agent_id" || "$agent_id" == "null" ]]; then
+    echo "Warning: Could not extract agent ID from .runner file"
+    return 0
+  fi
+
+  echo "Tagging instance with GitHub runner agent ID: $agent_id"
+  if aws ec2 create-tags \
+    --region "$region" \
+    --resources "$instance_id" \
+    --tags Key=ghr:github_runner_id,Value="$agent_id"; then
+    echo "Successfully tagged instance with agent ID: $agent_id"
+    return 0
+  else
+    echo "Warning: Failed to tag instance with agent ID"
+    return 0
+  fi
+}
+
 cleanup() {
   local exit_code="$1"
   local error_location="$2"
@@ -225,6 +255,9 @@ if [[ "$enable_jit_config" == "false" || $agent_mode != "ephemeral" ]]; then
     extra_flags=""
   fi
   sudo --preserve-env=RUNNER_ALLOW_RUNASROOT -u "$run_as" -- ./config.sh $${extra_flags} --unattended --name "$runner_name_prefix$instance_id" --work "_work" $${config}
+
+  # Tag instance with GitHub runner agent ID for non-JIT runners
+  tag_instance_with_runner_id
 fi

 create_xray_success_segment "$SEGMENT"
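
Both start scripts follow the same pattern: read `agentId` from the `.runner` file, give up quietly if anything is missing or malformed, and otherwise build a single tag argument. The sketch below restates that extraction step in a language-neutral way. The function names `extractAgentId` and `buildTagArgument` are hypothetical and do not appear in the scripts.

```typescript
// Restatement of the error-tolerant extraction step used by the start scripts.
// Hypothetical helper names; the scripts themselves use jq / ConvertFrom-Json.

// Returns the agent ID as a string, or undefined if it cannot be determined.
// Never throws, mirroring the "ignore errors" design of the tagging functions.
function extractAgentId(runnerFileContent: string): string | undefined {
  try {
    const parsed = JSON.parse(runnerFileContent);
    if (parsed === null || typeof parsed !== "object") return undefined;
    const id = parsed.agentId;
    if (id === undefined || id === null || id === "") return undefined;
    return String(id);
  } catch {
    // Malformed JSON is treated the same as a missing agent ID.
    return undefined;
  }
}

// Builds the value passed to `aws ec2 create-tags --tags`.
function buildTagArgument(agentId: string): string {
  return "Key=ghr:github_runner_id,Value=" + agentId;
}

const agentId = extractAgentId('{"agentId": 42, "agentName": "example"}');
if (agentId !== undefined) {
  console.log(buildTagArgument(agentId)); // Key=ghr:github_runner_id,Value=42
}
```

Returning `undefined` instead of throwing keeps the caller's control flow simple: the registration process continues whether or not a tag could be built.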
