Commit c43ea62

konard and claude committed
docs: add case study for issue #60 — manual CI/CD did not produce new releases
Analysis of why v1.3.11 was never released after PR #58 merged, and why the manual workflow_dispatch created v1.3.12 instead of retrying v1.3.11.

Root causes identified:
1. Transient network timeouts on GitHub-hosted runners caused the push-triggered build for v1.3.11 to fail (ghcr.io/Docker Hub connectivity issues)
2. GITHUB_TOKEN-based pushes (from apply-changesets) don't trigger new on:push workflows, so when the build fails, there is no automatic retry path
3. No release-only workflow_dispatch mode: bump-and-release always creates extra version increment

Includes full CI logs for both relevant workflow runs, timeline reconstruction, root cause analysis, online research, and proposed solutions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 859b274 commit c43ea62

3 files changed

+78421
-0
lines changed

Lines changed: 246 additions & 0 deletions
@@ -0,0 +1,246 @@
1+
# Case Study: Manual CI/CD Did Not Produce Any New Releases (Issue #60)
2+
3+
## Issue Reference
4+
- **Issue**: https://github.com/link-foundation/sandbox/issues/60
5+
- **Commit referenced in issue**: https://github.com/link-foundation/sandbox/commit/7f7671300d152cf110b5d0cf2a9f4e16b3982dab
6+
- **Actions page**: https://github.com/link-foundation/sandbox/actions
7+
8+
---
9+
10+
## Executive Summary
11+
12+
After PR #58 merged on 2026-02-26 (containing the issue-57 fix), the CI/CD pipeline:
13+
1. Correctly bumped the version from 1.3.10 → 1.3.11 via changeset
14+
2. Built Docker images tagged `1.3.11` for most components
15+
3. **Failed to complete the Docker build due to transient network timeouts** on GitHub-hosted runners when connecting to ghcr.io and Docker Hub
16+
4. As a result, **v1.3.11 Docker images were never fully published** and no GitHub Release was created for v1.3.11
17+
18+
The user then manually triggered `workflow_dispatch` with `bump-and-release` mode, which:
19+
1. Bumped 1.3.11 → 1.3.12 (introducing an unintentional version skip)
20+
2. Successfully built and released v1.3.12
21+
22+
**v1.3.11 was never released.** The root cause was **transient network timeouts** on GitHub-hosted runners, compounded by a secondary issue: a commit pushed with `GITHUB_TOKEN` does not trigger its own `on: push` workflow run, so a failed build for that commit is never retried automatically.
23+
24+
---
25+
26+
## Timeline of Events
27+
28+
### 2026-02-26T14:05:54Z — Merge Commit `b7462ab` Pushed to `main`
29+
30+
- **Event**: PR #58 merged to `main`, containing the issue-57 fix
31+
- **Trigger**: `push` event on `main`
32+
- **Workflow Run**: [22445566687](https://github.com/link-foundation/sandbox/actions/runs/22445566687) (`Build and Release Docker Image`): **FAILED**
33+
34+
**What happened inside this run:**
35+
36+
**Job: Apply Changesets** (14:05:57Z - 14:06:00Z) — ✅ SUCCESS
37+
- Found `.changeset/fix-du-exit-code-regression.md` (bump: patch)
38+
- Bumped version 1.3.10 → 1.3.11
39+
- Committed as `e20cf46`: `"1.3.11: Fix CI failure caused by du exit code regression..."`
40+
- **Pushed `e20cf46` to `main` at 14:05:59Z** using GITHUB_TOKEN
41+
42+
**Job: detect-changes** (14:06:05Z - 14:06:08Z) — ✅ SUCCESS
43+
- Checked out `refs/heads/main` and saw HEAD as `e20cf46` (v1.3.11) — correctly fetched the new commit
44+
- Detected: `Detected version: 1.3.11`
45+
- Detected VERSION file changed → `should-build=true`
46+
47+
**Jobs: `build-js-arm64`, `build-js-amd64`, `build-essentials-*`, `build-languages-*`** (14:06:23Z - 14:18+Z) — ❌ FAILED
48+
- Started building v1.3.11 Docker images (tags included `1.3.11-amd64`, `1.3.11-arm64`)
49+
- JS images: Successfully built and pushed `sandbox-js:1.3.11-*`
50+
- Essentials images: Successfully built and pushed
51+
- Language images: **FAILED** due to transient network timeouts:
52+
  - `ruby` (amd64): `DeadlineExceeded: Post "https://results-receiver.actions.githubusercontent.com/...": dial tcp 140.82.112.22:443: i/o timeout`
53+
  - `java` (amd64): `Error response from daemon: Get "https://ghcr.io/v2/": context deadline exceeded`
54+
  - `java` (arm64): `DeadlineExceeded: failed to fetch oauth token: Post "https://ghcr.io/token": dial tcp 140.82.112.34:443: i/o timeout`
55+
  - `rust` (arm64): `net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)`
56+
57+
**Jobs: create-release, docker-manifest** — NEVER RAN (workflow failed before reaching them)
58+
59+
### Why `e20cf46` Never Got Its Own Workflow Run
60+
61+
After `apply-changesets` pushed commit `e20cf46` using `GITHUB_TOKEN`, **GitHub Actions did NOT create a new `on: push` workflow run for that commit**. This is an intentional GitHub restriction to prevent infinite workflow loops.
62+
63+
- **GitHub Documentation**: "If an action pushes code using the repository's GITHUB_TOKEN, a new workflow will not run even when the repository contains a workflow configured to run when push events occur." — [Triggering a workflow from a workflow](https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/triggering-a-workflow#triggering-a-workflow-from-a-workflow)
64+
- **Confirmation**: Zero GitHub Actions workflow runs exist with `head_sha == e20cf4648c446c54bf9492b99bb4ed2356b0a2e9`
65+
66+
This means: if the original push-triggered run fails, there is **no automatic mechanism to retry the build for the version bump commit**. The version bump commit is effectively "orphaned" from CI/CD.
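
The confirmation check above can be reproduced offline against the JSON returned by GitHub's "list workflow runs" REST endpoint, where each run carries a `head_sha` field. The following is a minimal sketch only: the helper name `runs_for_commit` and the sample payload are assumptions made for illustration, and the SHA for `b7462ab` stays abbreviated because only its short form appears in this case study.

```python
from typing import Any


def runs_for_commit(runs: list[dict[str, Any]], sha_prefix: str) -> list[dict[str, Any]]:
    """Return the workflow runs whose head_sha starts with sha_prefix."""
    return [run for run in runs if str(run.get("head_sha", "")).startswith(sha_prefix)]


# Illustrative payload shaped like the "workflow_runs" array of the REST response.
# Only fields used here are included; the head_sha is abbreviated (assumption).
sample_runs = [
    {"id": 22445566687, "event": "push", "head_sha": "b7462ab", "conclusion": "failure"},
]

# The merge commit has a run; the GITHUB_TOKEN-pushed bump commit has none.
merge_commit_runs = runs_for_commit(sample_runs, "b7462ab")
bump_commit_runs = runs_for_commit(sample_runs, "e20cf4648c446c54bf9492b99bb4ed2356b0a2e9")
```

An empty result for the version bump commit is what marks it as "orphaned" in the sense described above.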
67+
68+
### 2026-02-26T20:05:35Z — User Manually Triggers `workflow_dispatch`
69+
70+
- **Actor**: `konard`
71+
- **Mode**: `bump-and-release`, `bump_type: patch`, `description: "Test patch release"`
72+
- **Workflow Run**: [22459099802](https://github.com/link-foundation/sandbox/actions/runs/22459099802): **SUCCESS**
73+
74+
**What happened:**
75+
- Read current version (1.3.11, from the `apply-changesets` commit)
76+
- Bumped 1.3.11 → **1.3.12** (this was not the user's intention — v1.3.11 was already bumped, just not released)
77+
- Committed `7f76713`: "1.3.12: Test patch release"
78+
- Successfully built and pushed all Docker images tagged `1.3.12`
79+
- Created GitHub Release `v1.3.12`
80+
81+
**The user's intention** was likely to force a release, but the `workflow_dispatch` with `bump-and-release` mode performed another version bump instead of releasing the already-bumped v1.3.11.
82+
83+
---
84+
85+
## Version/Release Outcome
86+
87+
| Version | Commit | Docker Images Released | GitHub Release | Notes |
88+
|---------|---------|----------------------|----------------|-------|
89+
| 1.3.10 | f274bfa | ✅ Released | ✅ v1.3.10 | Previous release |
90+
| 1.3.11 | e20cf46 | ⚠️ Partial (JS + essentials only, language images failed) | ❌ None | Changeset bump succeeded, build failed |
91+
| 1.3.12 | 7f76713 | ✅ Released | ✅ v1.3.12 | Manual dispatch created extra bump |
92+
93+
---
94+
95+
## Root Cause Analysis
96+
97+
### Root Cause 1 (Primary): Transient Network Timeouts on GitHub-Hosted Runners
98+
99+
**Description**: The build jobs for language images failed due to transient network timeouts when connecting to GitHub Container Registry (ghcr.io) and Docker Hub from GitHub-hosted runners.
100+
101+
**Errors observed** (from run 22445566687):
102+
```
103+
build-languages-amd64 (ruby):
104+
##[error]buildx failed with: ERROR: failed to build: failed to solve:
105+
DeadlineExceeded: Post "https://results-receiver.actions.githubusercontent.com/...":
106+
dial tcp 140.82.112.22:443: i/o timeout
107+
108+
build-languages-amd64 (java):
109+
##[error]Error response from daemon: Get "https://ghcr.io/v2/":
110+
context deadline exceeded
111+
112+
build-languages-arm64 (java):
113+
##[error]buildx failed with: ERROR: failed to build: failed to solve:
114+
DeadlineExceeded: failed to fetch oauth token: Post "https://ghcr.io/token":
115+
dial tcp 140.82.112.34:443: i/o timeout
116+
117+
build-languages-arm64 (rust):
118+
##[error]Error response from daemon: Get "https://ghcr.io/v2/":
119+
net/http: request canceled while waiting for connection
120+
(Client.Timeout exceeded while awaiting headers)
121+
```
122+
123+
**This is a known issue**: GitHub-hosted runners, ARM64 in particular, have known network instability when connecting to external registries. In this run, both amd64 and arm64 jobs were affected. See:
124+
- [actions/runner-images#11886](https://github.com/actions/runner-images/issues/11886) — Ubuntu network instability on GitHub Actions runners
125+
- Previous case study: [issue-53 case study](../issue-53/CASE-STUDY.md) documented similar ARM64 network issues
126+
127+
**Impact**: Since language builds failed, the `create-release` and `docker-manifest` jobs never ran. No GitHub Release was created for v1.3.11.
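
Whether a failure like the ones above is worth retrying depends on recognizing it as transient. The sketch below is a hedged illustration of that classification, using only the error fragments observed in run 22445566687; the function name `is_transient_network_error` and the pattern list are assumptions, not part of the repository's workflow.

```python
import re

# Substrings taken from the errors observed in the failed run.
# The list is illustrative and would likely need tuning in practice.
TRANSIENT_PATTERNS = [
    re.compile(r"i/o timeout"),
    re.compile(r"context deadline exceeded"),
    re.compile(r"Client\.Timeout exceeded"),
    re.compile(r"DeadlineExceeded"),
]


def is_transient_network_error(log_line: str) -> bool:
    """Return True if the log line matches a known transient-timeout pattern."""
    return any(pattern.search(log_line) for pattern in TRANSIENT_PATTERNS)
```

A line such as `dial tcp 140.82.112.22:443: i/o timeout` matches, while an ordinary build error does not, so only the former would qualify for an automatic retry.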
128+
129+
### Root Cause 2 (Secondary): No Retry Mechanism for GITHUB_TOKEN-Pushed Version Bump Commits
130+
131+
**Description**: When `apply-changesets` pushes a version bump commit using `GITHUB_TOKEN`, GitHub Actions deliberately does NOT trigger a new `on: push` workflow run for that commit. This is by design to prevent infinite loops.
132+
133+
**Consequence**: If the push-triggered workflow run fails (as it did here), there is no automatic mechanism to retry the build for the version bump commit `e20cf46`. The commit is effectively "orphaned" — it exists in git history with the correct version but no CI/CD will ever automatically build/release it.
134+
135+
**GitHub's documentation on this**:
136+
> "When you use the GITHUB_TOKEN to perform tasks, events triggered by the GITHUB_TOKEN, with the exception of workflow_dispatch and repository_dispatch, will not create a new workflow run."
137+
> [GitHub Docs](https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/triggering-a-workflow#triggering-a-workflow-from-a-workflow)
138+
139+
**The design gap**: The `workflow_dispatch` `bump-and-release` mode was intended as a manual workaround, but it performs a NEW version bump rather than retrying the failed release of the already-bumped version.
140+
141+
---
142+
143+
## Contributing Factors
144+
145+
### Factor: `workflow_dispatch` bump-and-release Doesn't Retry — It Bumps Again
146+
147+
When the user triggered `workflow_dispatch` with `bump-and-release`, the workflow read the current version (1.3.11) and bumped it to 1.3.12. This created an unintentional extra version increment:
148+
- v1.3.11 was meant to be the release of the issue-57 fix
149+
- v1.3.12 was created as a "Test patch release" but its actual content is identical to v1.3.11
150+
151+
There is no `release-only` mode that would build and release the current HEAD version without bumping it.
152+
153+
---
154+
155+
## Online Research
156+
157+
### GitHub Actions GITHUB_TOKEN Push Limitation
158+
159+
Sources confirming that GITHUB_TOKEN pushes don't trigger subsequent workflows:
160+
- [GitHub Community Discussion #25702](https://github.com/orgs/community/discussions/25702): "Push from Action does not trigger subsequent action" — confirmed intentional
161+
- [GitHub Community Discussion #37103](https://github.com/orgs/community/discussions/37103): "Push by workflow does not trigger another workflow anymore"
162+
- [GitHub Community Discussion #33804](https://github.com/orgs/community/discussions/33804): "GitHub-actions bot not triggering Actions"
163+
- [GitHub Docs: Triggering a workflow from a workflow](https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/triggering-a-workflow#triggering-a-workflow-from-a-workflow)
164+
165+
### Known Workarounds for Triggering Workflows from Workflows
166+
167+
1. **Use a Personal Access Token (PAT)**: Pushes via PAT DO trigger subsequent `on: push` workflows. Requires maintaining a secret PAT.
168+
169+
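2. **Use a GitHub App Token** (`tibdex/github-app-token`): More robust than a PAT. The token is not tied to a user account, and a short-lived installation token is generated on each run, so there is no long-lived token to rotate (the app's private key still has to be stored as a secret).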
170+
171+
3. **Use `repository_dispatch`**: Fire a repository dispatch event after the push. Works with GITHUB_TOKEN.
172+
173+
4. **Consolidate build into the same workflow run**: Don't rely on a separate push event — run the build pipeline directly after the version bump within the same run (this is how `workflow_dispatch` mode already works successfully).
174+
175+
5. **Add retry logic**: Use the GitHub CLI's `gh run rerun` command or a monitoring workflow to retry failed runs.
176+
177+
### Network Timeout Issues on GitHub Runners
178+
179+
- [actions/runner-images#11886](https://github.com/actions/runner-images/issues/11886): Reports of network instability on GitHub-hosted runners
180+
- Previous case study [issue-53](../issue-53/CASE-STUDY.md): Documented ARM64 runner network issues causing build hangs
181+
182+
---
183+
184+
## Proposed Solutions
185+
186+
### Solution A (High Priority): Add `release-only` Mode to `workflow_dispatch`
187+
188+
Add a new `release_mode` option `release-only` that builds and releases the **current HEAD version** without performing a version bump. This would allow:
189+
```
190+
workflow_dispatch → release_mode: release-only
191+
→ Reads current VERSION (1.3.11)
192+
→ Builds Docker images with 1.3.11 tags
193+
→ Creates GitHub Release v1.3.11
194+
```
195+
196+
This directly addresses the "stuck version" problem without creating unwanted version increments.
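
The difference between the two modes can be sketched as a small version-selection function. This is an illustration under stated assumptions: `bump` and `version_to_release` are hypothetical names, and the actual workflow's scripts may compute the version differently.

```python
def bump(version: str, bump_type: str) -> str:
    """Increment a MAJOR.MINOR.PATCH version string according to bump_type."""
    major, minor, patch = (int(part) for part in version.split("."))
    if bump_type == "major":
        return f"{major + 1}.0.0"
    if bump_type == "minor":
        return f"{major}.{minor + 1}.0"
    if bump_type == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump type: {bump_type}")


def version_to_release(current: str, release_mode: str, bump_type: str = "patch") -> str:
    """Pick the version a workflow_dispatch run would build and release."""
    if release_mode == "release-only":
        # Proposed mode: release the version already recorded in VERSION.
        return current
    if release_mode == "bump-and-release":
        # Existing mode: always increments before releasing.
        return bump(current, bump_type)
    raise ValueError(f"unknown release mode: {release_mode}")
```

With the current version at 1.3.11, `bump-and-release` selects 1.3.12 (what happened in run 22459099802), whereas `release-only` would select 1.3.11.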
197+
198+
### Solution B (High Priority): Add Retry Capability for Failed Builds
199+
200+
Implement automatic or semi-automatic retry for failed build runs:
201+
- **Option 1**: A scheduled monitoring workflow that detects failed `push`-triggered runs and re-runs failed jobs
202+
- **Option 2**: Clear documentation telling users to use `gh run rerun <run-id>` to retry a failed run
203+
- **Option 3**: Add retry logic to build jobs. GitHub Actions has no native step-level retry, but the `nick-fields/retry` action can re-run a failing shell command
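
For Option 1, the selection step of a monitoring workflow might look like the sketch below. It assumes run objects shaped like the REST API's workflow run records (`event`, `head_branch`, `status`, `conclusion`, `run_attempt`); the function name `select_runs_to_rerun`, the attempt cap, and the sample data are hypothetical. Each selected ID could then be passed to `gh run rerun <run-id> --failed`.

```python
from typing import Any

# Conclusions treated as retryable in this sketch (an assumption).
RETRYABLE_CONCLUSIONS = {"failure", "timed_out"}


def select_runs_to_rerun(
    runs: list[dict[str, Any]], branch: str = "main", max_attempts: int = 3
) -> list[int]:
    """Return IDs of failed push-triggered runs on branch that are still under the attempt cap."""
    selected = []
    for run in runs:
        if (
            run.get("event") == "push"
            and run.get("head_branch") == branch
            and run.get("status") == "completed"
            and run.get("conclusion") in RETRYABLE_CONCLUSIONS
            and run.get("run_attempt", 1) < max_attempts
        ):
            selected.append(run["id"])
    return selected


# Illustrative data: the failed push run from this case study, the successful
# manual dispatch, and a hypothetical run (id 1) that already used its retries.
sample_runs = [
    {"id": 22445566687, "event": "push", "head_branch": "main",
     "status": "completed", "conclusion": "failure", "run_attempt": 1},
    {"id": 22459099802, "event": "workflow_dispatch", "head_branch": "main",
     "status": "completed", "conclusion": "success", "run_attempt": 1},
    {"id": 1, "event": "push", "head_branch": "main",
     "status": "completed", "conclusion": "failure", "run_attempt": 3},
]

to_rerun = select_runs_to_rerun(sample_runs)
```

The attempt cap keeps a persistently broken build from being re-run indefinitely.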
204+
205+
### Solution C (Medium Priority): Address Network Timeout Failures
206+
207+
To mitigate the network timeouts that caused the original failure:
208+
- Increase timeout settings on Docker build/push operations
209+
- Add retry logic around registry pushes
210+
- Consider caching strategies to reduce registry interaction
211+
- Already partially addressed in issue-53 (timeout reduction for ARM64 language builds)
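
The "retry logic around registry pushes" item can be illustrated with a generic exponential-backoff wrapper. This is a sketch, not the repository's implementation: `retry_with_backoff` and its parameters are assumptions, and a real push step would need to translate registry timeouts into the `TimeoutError` caught here.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying on TimeoutError with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last timeout.
            # Delay doubles each time: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Only timeouts are retried; any other exception propagates immediately, so genuine build errors still fail fast. The `sleep` parameter is injectable so the delays can be observed or skipped in tests.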
212+
213+
### Solution D (Low Priority): Use PAT for `apply-changesets` Push
214+
215+
Replace the GITHUB_TOKEN push in `apply-changesets.sh` with a PAT to enable triggering of subsequent workflows. This would cause a new `on: push` workflow run when the version bump is committed, eliminating the "orphaned version bump commit" problem.
216+
217+
**Trade-off**: Requires managing a PAT secret with rotation. If the PAT expires, changesets stop working.
218+
219+
---
220+
221+
## Summary of Issues Found
222+
223+
| Issue | Severity | Type |
224+
|-------|----------|------|
225+
| Transient network timeouts caused build failure for v1.3.11 | High | Infrastructure/Reliability |
226+
| No `release-only` workflow_dispatch mode — bump-and-release always creates extra version increment | High | Design Gap |
227+
| GITHUB_TOKEN push doesn't trigger new workflow — no auto-retry for failed version bump builds | Medium | GitHub Platform Limitation |
228+
| v1.3.11 Docker images partially published (JS+essentials but not language images) | Medium | Side Effect of Failure |
229+
230+
---
231+
232+
## Data Files
233+
234+
- `ci-logs/workflow-dispatch-22459099802.log` — The successful `workflow_dispatch` run that created v1.3.12
235+
- `ci-logs/push-failure-22445566687.log` — The failed `push`-triggered run for `b7462ab` that was supposed to release v1.3.11
236+
237+
---
238+
239+
## References
240+
241+
- [GitHub Docs: Triggering a workflow from a workflow](https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/triggering-a-workflow#triggering-a-workflow-from-a-workflow)
242+
- [GitHub Community #25702: Push from Action does not trigger subsequent action](https://github.com/orgs/community/discussions/25702)
243+
- [GitHub Community #37103: Push by workflow does not trigger another workflow anymore](https://github.com/orgs/community/discussions/37103)
244+
- [GitHub Community #33804: GitHub-actions bot not triggering Actions](https://github.com/orgs/community/discussions/33804)
245+
- [actions/runner-images#11886: Network instability on GitHub-hosted runners](https://github.com/actions/runner-images/issues/11886)
246+
- [Issue #53 Case Study: PHP ARM64 build timeouts](../issue-53/CASE-STUDY.md)
