
Commit c17ec7d

[CI] Add postmortem info on latest issue (#448)
We now have much better information on the issue of runner sets not scaling. Add some documentation about this to the issues page for the benefit of those who come along and read it later.
1 parent 97c01e2 commit c17ec7d


1 file changed: +17 -0 lines changed


premerge/issues.md

Lines changed: 17 additions & 0 deletions
@@ -153,3 +153,20 @@ prodding while investigating the Linux issues. The issue on the Windows
 side was fixed in the same way, by uninstalling the helm charts, deleting
 dangling resources, deleting the namespaces, and then reinstalling the
 helm charts.
+
+### Postmortem
+
+We ended up running into this issue several more times, hitting an instance
+once every couple of weeks.
+
+After some further investigation, it turns out this is mostly intended behavior
+of GitHub ARC. When an `ephemeralrunner` object fails to start more than five
+times, it goes into a failure state permanently, with the idea that manual
+intervention and notification are beneficial. Simply deleting the failed
+`ephemeralrunner` objects allows GitHub ARC to recreate them, after which they
+schedule normally. This is how later incidents were resolved.
+
+These runners are most likely failing due to image pull failures, which was one
+of our original hypotheses on the issue. Recent changes to GitHub ARC
+in https://github.com/actions/actions-runner-controller/pull/4059 should help
+with this issue, although further testing is needed.
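
For anyone repeating the manual fix described in the postmortem, the cleanup could be sketched roughly as below. This is not part of the commit and not the procedure the authors ran. It assumes that the permanent failure state shows up as `status.phase == "Failed"` on the `ephemeralrunner` object, and the helper names (`fetch_runner_list`, `failed_runner_refs`, `delete_command`) are hypothetical. Verify the status field against the cluster before relying on it.

```python
import json
import subprocess
from typing import Any


def fetch_runner_list() -> dict[str, Any]:
    """List ephemeralrunner objects in all namespaces via kubectl (needs cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "ephemeralrunners", "-A", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def failed_runner_refs(runner_list: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (namespace, name) for each runner that looks permanently failed."""
    refs = []
    for item in runner_list.get("items", []):
        status = item.get("status") or {}
        # Assumption: the permanent failure state is reported as phase "Failed".
        if status.get("phase") == "Failed":
            meta = item["metadata"]
            refs.append((meta["namespace"], meta["name"]))
    return refs


def delete_command(namespace: str, name: str) -> list[str]:
    """Build (but do not run) the kubectl command that deletes one runner object."""
    return ["kubectl", "delete", "ephemeralrunner", name, "-n", namespace]
```

Passing the result of `fetch_runner_list()` to `failed_runner_refs` and printing each `delete_command` gives a list of commands to review before running anything. The commands are only built, not executed, because the postmortem notes that the failure state exists to prompt manual intervention.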
