|
| 1 | +# Case Study: Manual CI/CD Did Not Produce Any New Releases (Issue #60) |
| 2 | + |
| 3 | +## Issue Reference |
| 4 | +- **Issue**: https://github.com/link-foundation/sandbox/issues/60 |
| 5 | +- **Commit referenced in issue**: https://github.com/link-foundation/sandbox/commit/7f7671300d152cf110b5d0cf2a9f4e16b3982dab |
| 6 | +- **Actions page**: https://github.com/link-foundation/sandbox/actions |
| 7 | + |
| 8 | +--- |
| 9 | + |
| 10 | +## Executive Summary |
| 11 | + |
| 12 | +After PR #58 merged on 2026-02-26 (containing the issue-57 fix), the CI/CD pipeline: |
| 13 | +1. Correctly bumped the version from 1.3.10 → 1.3.11 via changeset |
| 14 | +2. Built Docker images tagged `1.3.11` for most components |
| 15 | +3. **Failed to complete the Docker build due to transient network timeouts** on GitHub-hosted runners when connecting to ghcr.io and Docker Hub |
| 16 | +4. As a result, **v1.3.11 Docker images were never fully published** and no GitHub Release was created for v1.3.11 |
| 17 | + |
| 18 | +The user then manually triggered `workflow_dispatch` with `bump-and-release` mode, which: |
| 19 | +1. Bumped 1.3.11 → 1.3.12 (introducing an unintentional version skip) |
| 20 | +2. Successfully built and released v1.3.12 |
| 21 | + |
| 22 | +**v1.3.11 was never released.** The root cause was **transient network timeouts** on GitHub-hosted runners, compounded by a secondary issue: a commit pushed with `GITHUB_TOKEN` does not trigger its own workflow run, so after the failure nothing automatically retried the build for the version bump commit. |
| 23 | + |
| 24 | +--- |
| 25 | + |
| 26 | +## Timeline of Events |
| 27 | + |
| 28 | +### 2026-02-26T14:05:54Z — Merge Commit `b7462ab` Pushed to `main` |
| 29 | + |
| 30 | +- **Event**: PR #58 merged to `main`, containing the issue-57 fix |
| 31 | +- **Trigger**: `push` event on `main` |
| 32 | +- **Workflow Run**: [22445566687](https://github.com/link-foundation/sandbox/actions/runs/22445566687) — `Build and Release Docker Image` — **FAILED** |
| 33 | + |
| 34 | +**What happened inside this run:** |
| 35 | + |
| 36 | +**Job: Apply Changesets** (14:05:57Z - 14:06:00Z) — ✅ SUCCESS |
| 37 | +- Found `.changeset/fix-du-exit-code-regression.md` (bump: patch) |
| 38 | +- Bumped version 1.3.10 → 1.3.11 |
| 39 | +- Committed as `e20cf46`: `"1.3.11: Fix CI failure caused by du exit code regression..."` |
| 40 | +- **Pushed `e20cf46` to `main` at 14:05:59Z** using GITHUB_TOKEN |
| 41 | + |
| 42 | +**Job: detect-changes** (14:06:05Z - 14:06:08Z) — ✅ SUCCESS |
| 43 | +- Checked out `refs/heads/main` and saw HEAD as `e20cf46` (v1.3.11) — correctly fetched the new commit |
| 44 | +- Detected: `Detected version: 1.3.11` |
| 45 | +- Detected VERSION file changed → `should-build=true` |
| 46 | + |
| 47 | +**Jobs: `build-js-arm64`, `build-js-amd64`, `build-essentials-*`, `build-languages-*`** (14:06:23Z - 14:18+Z) — ❌ FAILED |
| 48 | +- Started building v1.3.11 Docker images (tags included `1.3.11-amd64`, `1.3.11-arm64`) |
| 49 | +- JS images: Successfully built and pushed `sandbox-js:1.3.11-*` |
| 50 | +- Essentials images: Successfully built and pushed |
| 51 | +- Language images: **FAILED** due to transient network timeouts: |
| 52 | + - `ruby` (amd64): `DeadlineExceeded: Post "https://results-receiver.actions.githubusercontent.com/...": dial tcp 140.82.112.22:443: i/o timeout` |
| 53 | + - `java` (amd64): `Error response from daemon: Get "https://ghcr.io/v2/": context deadline exceeded` |
| 54 | + - `java` (arm64): `DeadlineExceeded: failed to fetch oauth token: Post "https://ghcr.io/token": dial tcp 140.82.112.34:443: i/o timeout` |
| 55 | + - `rust` (arm64): `net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)` |
| 56 | + |
| 57 | +**Jobs: create-release, docker-manifest** — NEVER RAN (workflow failed before reaching them) |
| 58 | + |
| 59 | +### Why `e20cf46` Never Got Its Own Workflow Run |
| 60 | + |
| 61 | +After `apply-changesets` pushed commit `e20cf46` using `GITHUB_TOKEN`, **GitHub Actions did NOT create a new `on: push` workflow run for that commit**. This is an intentional GitHub restriction to prevent infinite workflow loops. |
| 62 | + |
| 63 | +- **GitHub Documentation**: "If an action pushes code using the repository's GITHUB_TOKEN, a new workflow will not run even when the repository contains a workflow configured to run when push events occur." — [Triggering a workflow from a workflow](https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/triggering-a-workflow#triggering-a-workflow-from-a-workflow) |
| 64 | +- **Confirmation**: Zero GitHub Actions workflow runs exist with `head_sha == e20cf4648c446c54bf9492b99bb4ed2356b0a2e9` |
| 65 | + |
| 66 | +This means: if the original push-triggered run fails, there is **no automatic mechanism to retry the build for the version bump commit**. The version bump commit is effectively "orphaned" from CI/CD. |
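The "zero runs for this SHA" confirmation can be reproduced against the GitHub REST API, which accepts a `head_sha` filter on the "list workflow runs for a repository" endpoint. The following is a minimal sketch, not the tooling used for this case study: the helper names `runs_url` and `is_orphaned` are hypothetical, and the payload is assumed to have the documented `workflow_runs` array.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.github.com"


def runs_url(owner: str, repo: str, head_sha: str) -> str:
    """Build the 'list workflow runs' URL filtered by commit SHA."""
    query = urlencode({"head_sha": head_sha, "per_page": 100})
    return f"{API_ROOT}/repos/{owner}/{repo}/actions/runs?{query}"


def is_orphaned(payload: dict, head_sha: str) -> bool:
    """Return True when no run in the API response belongs to head_sha."""
    runs = payload.get("workflow_runs", [])
    return not any(run.get("head_sha") == head_sha for run in runs)


if __name__ == "__main__":
    sha = "e20cf4648c446c54bf9492b99bb4ed2356b0a2e9"
    print(runs_url("link-foundation", "sandbox", sha))
    # Hand-written stand-in for the JSON body returned by the URL above.
    empty_response = {"total_count": 0, "workflow_runs": []}
    print("orphaned:", is_orphaned(empty_response, sha))
```

Fetching the URL (for example with `gh api` or any HTTP client) and passing the decoded JSON to `is_orphaned` would flag a version bump commit that never received its own run.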
| 67 | + |
| 68 | +### 2026-02-26T20:05:35Z — User Manually Triggers `workflow_dispatch` |
| 69 | + |
| 70 | +- **Actor**: `konard` |
| 71 | +- **Mode**: `bump-and-release`, `bump_type: patch`, `description: "Test patch release"` |
| 72 | +- **Workflow Run**: [22459099802](https://github.com/link-foundation/sandbox/actions/runs/22459099802) — **SUCCESS** |
| 73 | + |
| 74 | +**What happened:** |
| 75 | +- Read current version (1.3.11, from the `apply-changesets` commit) |
| 76 | +- Bumped 1.3.11 → **1.3.12** (likely not what the user intended: v1.3.11 was already bumped, just not released) |
| 77 | +- Committed `7f76713`: "1.3.12: Test patch release" |
| 78 | +- Successfully built and pushed all Docker images tagged `1.3.12` |
| 79 | +- Created GitHub Release `v1.3.12` |
| 80 | + |
| 81 | +**The user's intention** was likely to force a release, but the `workflow_dispatch` with `bump-and-release` mode performed another version bump instead of releasing the already-bumped v1.3.11. |
| 82 | + |
| 83 | +--- |
| 84 | + |
| 85 | +## Version/Release Outcome |
| 86 | + |
| 87 | +| Version | Commit | Docker Images Released | GitHub Release | Notes | |
| 88 | +|---------|---------|----------------------|----------------|-------| |
| 89 | +| 1.3.10 | f274bfa | ✅ Released | ✅ v1.3.10 | Previous release | |
| 90 | +| 1.3.11 | e20cf46 | ⚠️ Partial (JS + essentials only, language images failed) | ❌ None | Changeset bump succeeded, build failed | |
| 91 | +| 1.3.12 | 7f76713 | ✅ Released | ✅ v1.3.12 | Manual dispatch created extra bump | |
| 92 | + |
| 93 | +--- |
| 94 | + |
| 95 | +## Root Cause Analysis |
| 96 | + |
| 97 | +### Root Cause 1 (Primary): Transient Network Timeouts on GitHub-Hosted Runners |
| 98 | + |
| 99 | +**Description**: The build jobs for language images failed due to transient network timeouts when connecting to GitHub Container Registry (ghcr.io) and Docker Hub from GitHub-hosted runners. |
| 100 | + |
| 101 | +**Errors observed** (from run 22445566687): |
| 102 | +``` |
| 103 | +build-languages-amd64 (ruby): |
| 104 | + ##[error]buildx failed with: ERROR: failed to build: failed to solve: |
| 105 | + DeadlineExceeded: Post "https://results-receiver.actions.githubusercontent.com/...": |
| 106 | + dial tcp 140.82.112.22:443: i/o timeout |
| 107 | + |
| 108 | +build-languages-amd64 (java): |
| 109 | + ##[error]Error response from daemon: Get "https://ghcr.io/v2/": |
| 110 | + context deadline exceeded |
| 111 | + |
| 112 | +build-languages-arm64 (java): |
| 113 | + ##[error]buildx failed with: ERROR: failed to build: failed to solve: |
| 114 | + DeadlineExceeded: failed to fetch oauth token: Post "https://ghcr.io/token": |
| 115 | + dial tcp 140.82.112.34:443: i/o timeout |
| 116 | + |
| 117 | +build-languages-arm64 (rust): |
| 118 | + ##[error]Error response from daemon: Get "https://ghcr.io/v2/": |
| 119 | + net/http: request canceled while waiting for connection |
| 120 | + (Client.Timeout exceeded while awaiting headers) |
| 121 | +``` |
| 122 | + |
| 123 | +**This is a known issue**: GitHub-hosted runners, ARM64 runners in particular, have documented network instability when connecting to external registries. See: |
| 124 | +- [actions/runner-images#11886](https://github.com/actions/runner-images/issues/11886) — Ubuntu network instability on GitHub Actions runners |
| 125 | +- Previous case study: [issue-53 case study](../issue-53/CASE-STUDY.md) documented similar ARM64 network issues |
| 126 | + |
| 127 | +**Impact**: Since language builds failed, the `create-release` and `docker-manifest` jobs never ran. No GitHub Release was created for v1.3.11. |
| 128 | + |
| 129 | +### Root Cause 2 (Secondary): No Retry Mechanism for GITHUB_TOKEN-Pushed Version Bump Commits |
| 130 | + |
| 131 | +**Description**: When `apply-changesets` pushes a version bump commit using `GITHUB_TOKEN`, GitHub Actions deliberately does NOT trigger a new `on: push` workflow run for that commit. This is by design to prevent infinite loops. |
| 132 | + |
| 133 | +**Consequence**: If the push-triggered workflow run fails (as it did here), there is no automatic mechanism to retry the build for the version bump commit `e20cf46`. The commit is effectively "orphaned" — it exists in git history with the correct version but no CI/CD will ever automatically build/release it. |
| 134 | + |
| 135 | +**GitHub's documentation on this**: |
| 136 | +> "When you use the GITHUB_TOKEN to perform tasks, events triggered by the GITHUB_TOKEN, with the exception of workflow_dispatch and repository_dispatch, will not create a new workflow run." |
| 137 | +> — [GitHub Docs](https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/triggering-a-workflow#triggering-a-workflow-from-a-workflow) |
| 138 | + |
| 139 | +**The design gap**: The `workflow_dispatch` `bump-and-release` mode was intended as a manual workaround, but it performs a NEW version bump rather than retrying the failed release of the already-bumped version. |
| 140 | + |
| 141 | +--- |
| 142 | + |
| 143 | +## Contributing Factors |
| 144 | + |
| 145 | +### Factor: `workflow_dispatch` bump-and-release Doesn't Retry — It Bumps Again |
| 146 | + |
| 147 | +When the user triggered `workflow_dispatch` with `bump-and-release`, the workflow read the current version (1.3.11) and bumped it to 1.3.12. This created an unintentional extra version increment: |
| 148 | +- v1.3.11 was meant to be the release of the issue-57 fix |
| 149 | +- v1.3.12 was created as a "Test patch release" but its actual content is identical to v1.3.11 |
| 150 | + |
| 151 | +There is no `release-only` mode that would build and release the current HEAD version without bumping. |
| 152 | + |
| 153 | +--- |
| 154 | + |
| 155 | +## Online Research |
| 156 | + |
| 157 | +### GitHub Actions GITHUB_TOKEN Push Limitation |
| 158 | + |
| 159 | +Sources confirming that GITHUB_TOKEN pushes don't trigger subsequent workflows: |
| 160 | +- [GitHub Community Discussion #25702](https://github.com/orgs/community/discussions/25702): "Push from Action does not trigger subsequent action" — confirmed intentional |
| 161 | +- [GitHub Community Discussion #37103](https://github.com/orgs/community/discussions/37103): "Push by workflow does not trigger another workflow anymore" |
| 162 | +- [GitHub Community Discussion #33804](https://github.com/orgs/community/discussions/33804): "GitHub-actions bot not triggering Actions" |
| 163 | +- [GitHub Docs: Triggering a workflow from a workflow](https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/triggering-a-workflow#triggering-a-workflow-from-a-workflow) |
| 164 | + |
| 165 | +### Known Workarounds for Triggering Workflows from Workflows |
| 166 | + |
| 167 | +1. **Use a Personal Access Token (PAT)**: Pushes via PAT DO trigger subsequent `on: push` workflows. Requires maintaining a secret PAT. |
| 168 | + |
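| 169 | +2. **Use a GitHub App Token** (`tibdex/github-app-token`, or GitHub's `actions/create-github-app-token`): More robust than a PAT because it is not tied to a user account. Each token is short-lived and minted per run from the app's credentials. |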
| 170 | + |
| 171 | +3. **Use `repository_dispatch`**: Fire a repository dispatch event after the push. Works with GITHUB_TOKEN. |
| 172 | + |
| 173 | +4. **Consolidate build into the same workflow run**: Don't rely on a separate push event — run the build pipeline directly after the version bump within the same run (this is how `workflow_dispatch` mode already works successfully). |
| 174 | + |
| 175 | +5. **Add retry logic**: Use GitHub's `gh run rerun` command or a monitoring workflow to retry failed runs. |
| 176 | + |
| 177 | +### Network Timeout Issues on GitHub Runners |
| 178 | + |
| 179 | +- [actions/runner-images#11886](https://github.com/actions/runner-images/issues/11886): Reports of network instability on GitHub-hosted runners |
| 180 | +- Previous case study [issue-53](../issue-53/CASE-STUDY.md): Documented ARM64 runner network issues causing build hangs |
| 181 | + |
| 182 | +--- |
| 183 | + |
| 184 | +## Proposed Solutions |
| 185 | + |
| 186 | +### Solution A (High Priority): Add `release-only` Mode to `workflow_dispatch` |
| 187 | + |
| 188 | +Add a new `release_mode` option `release-only` that builds and releases the **current HEAD version** without performing a version bump. This would allow: |
| 189 | +``` |
| 190 | +workflow_dispatch → release_mode: release-only |
| 191 | +→ Reads current VERSION (1.3.11) |
| 192 | +→ Builds Docker images with 1.3.11 tags |
| 193 | +→ Creates GitHub Release v1.3.11 |
| 194 | +``` |
| 195 | + |
| 196 | +This directly addresses the "stuck version" problem without creating unwanted version increments. |
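The difference between the existing `bump-and-release` mode and the proposed `release-only` mode comes down to which version the run decides to publish. The sketch below illustrates that decision only. The function names `bump` and `version_to_release` are hypothetical and do not mirror the repository's actual scripts.

```python
def bump(version: str, bump_type: str) -> str:
    """Return the next semantic version for a major, minor, or patch bump."""
    major, minor, patch = (int(part) for part in version.split("."))
    if bump_type == "major":
        return f"{major + 1}.0.0"
    if bump_type == "minor":
        return f"{major}.{minor + 1}.0"
    if bump_type == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump type: {bump_type}")


def version_to_release(current: str, release_mode: str, bump_type: str = "patch") -> str:
    """Pick the version a manually dispatched run should build and publish."""
    if release_mode == "release-only":
        # Proposed mode: publish whatever VERSION already says.
        return current
    if release_mode == "bump-and-release":
        # Existing mode: always increments before building.
        return bump(current, bump_type)
    raise ValueError(f"unknown release mode: {release_mode}")


if __name__ == "__main__":
    print(version_to_release("1.3.11", "bump-and-release"))  # 1.3.12
    print(version_to_release("1.3.11", "release-only"))      # 1.3.11
```

With VERSION at 1.3.11, only the `release-only` branch would have produced the release the failed push run was meant to create.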
| 197 | + |
| 198 | +### Solution B (High Priority): Add Retry Capability for Failed Builds |
| 199 | + |
| 200 | +Implement automatic or semi-automatic retry for failed build runs: |
| 201 | +- **Option 1**: A scheduled monitoring workflow that detects failed `push`-triggered runs and re-runs failed jobs |
| 202 | +- **Option 2**: Clear documentation telling users to use `gh run rerun <run-id>` to retry a failed run |
| 203 | +- **Option 3**: Add step-level retries to the build jobs. GitHub Actions has no native retry setting, but the `nick-fields/retry` action can wrap a command and re-run it on failure |
| 204 | + |
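Option 1 could start from a filter over the run objects returned by the GitHub REST API. The sketch below assumes the documented run fields `id`, `event`, `head_branch`, `status`, `conclusion`, and `run_attempt`. The helper names and the `max_attempts` cap are illustrative assumptions, added so a persistently failing run is not re-run forever.

```python
def failed_push_runs(runs: list[dict], branch: str = "main", max_attempts: int = 3) -> list[int]:
    """Return IDs of completed, failed push runs that may still be retried."""
    return [
        run["id"]
        for run in runs
        if run.get("event") == "push"
        and run.get("head_branch") == branch
        and run.get("status") == "completed"
        and run.get("conclusion") == "failure"
        and run.get("run_attempt", 1) < max_attempts
    ]


def rerun_failed_jobs_url(owner: str, repo: str, run_id: int) -> str:
    """Build the REST endpoint that re-runs only the failed jobs of a run (POST)."""
    return f"https://api.github.com/repos/{owner}/{repo}/actions/runs/{run_id}/rerun-failed-jobs"


if __name__ == "__main__":
    # Hand-written sample shaped like the two runs in this case study.
    sample = [
        {"id": 22445566687, "event": "push", "head_branch": "main",
         "status": "completed", "conclusion": "failure", "run_attempt": 1},
        {"id": 22459099802, "event": "workflow_dispatch", "head_branch": "main",
         "status": "completed", "conclusion": "success", "run_attempt": 1},
    ]
    for run_id in failed_push_runs(sample):
        print(rerun_failed_jobs_url("link-foundation", "sandbox", run_id))
```

A scheduled workflow could send a POST to each printed URL. The manual equivalent for Option 2 is `gh run rerun <run-id> --failed`.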
| 205 | +### Solution C (Medium Priority): Address Network Timeout Failures |
| 206 | + |
| 207 | +To mitigate the network timeouts that caused the original failure: |
| 208 | +- Increase timeout settings on Docker build/push operations |
| 209 | +- Add retry logic around registry pushes |
| 210 | +- Consider caching strategies to reduce registry interaction |
| 211 | +- Already partially addressed in issue-53 (timeout reduction for ARM64 language builds) |
| 212 | + |
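The "retry logic around registry pushes" item can be sketched as a retry loop with exponential backoff. This is an illustration of the technique under stated assumptions, not the project's implementation: `retry` is a hypothetical helper, the default delays are arbitrary, and the registry push is stood in for by any callable that raises `TimeoutError` or `ConnectionError` on a transient failure.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying transient network errors with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except (TimeoutError, ConnectionError):
            if attempt == attempts:
                raise  # Out of attempts: surface the last error.
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_push() -> str:
        # Simulated push that times out twice before succeeding.
        state["calls"] += 1
        if state["calls"] < 3:
            raise TimeoutError("dial tcp: i/o timeout")
        return "pushed"

    print(retry(flaky_push, base_delay=0.1))
```

Injecting `sleep` keeps the helper testable without real waiting. In a workflow, the same idea applies to the shell command that performs the push.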
| 213 | +### Solution D (Low Priority): Use PAT for `apply-changesets` Push |
| 214 | + |
| 215 | +Replace the GITHUB_TOKEN push in `apply-changesets.sh` with a PAT to enable triggering of subsequent workflows. This would cause a new `on: push` workflow run when the version bump is committed, eliminating the "orphaned version bump commit" problem. |
| 216 | + |
| 217 | +**Trade-off**: Requires managing a PAT secret with rotation. If the PAT expires, changesets stop working. |
| 218 | + |
| 219 | +--- |
| 220 | + |
| 221 | +## Summary of Issues Found |
| 222 | + |
| 223 | +| Issue | Severity | Type | |
| 224 | +|-------|----------|------| |
| 225 | +| Transient network timeouts caused build failure for v1.3.11 | High | Infrastructure/Reliability | |
| 226 | +| No `release-only` `workflow_dispatch` mode: `bump-and-release` always creates an extra version increment | High | Design Gap | |
| 227 | +| GITHUB_TOKEN push doesn't trigger new workflow — no auto-retry for failed version bump builds | Medium | GitHub Platform Limitation | |
| 228 | +| v1.3.11 Docker images partially published (JS+essentials but not language images) | Medium | Side Effect of Failure | |
| 229 | + |
| 230 | +--- |
| 231 | + |
| 232 | +## Data Files |
| 233 | + |
| 234 | +- `ci-logs/workflow-dispatch-22459099802.log` — The successful `workflow_dispatch` run that created v1.3.12 |
| 235 | +- `ci-logs/push-failure-22445566687.log` — The failed `push`-triggered run for `b7462ab` that was supposed to release v1.3.11 |
| 236 | + |
| 237 | +--- |
| 238 | + |
| 239 | +## References |
| 240 | + |
| 241 | +- [GitHub Docs: Triggering a workflow from a workflow](https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/triggering-a-workflow#triggering-a-workflow-from-a-workflow) |
| 242 | +- [GitHub Community #25702: Push from Action does not trigger subsequent action](https://github.com/orgs/community/discussions/25702) |
| 243 | +- [GitHub Community #37103: Push by workflow does not trigger another workflow anymore](https://github.com/orgs/community/discussions/37103) |
| 244 | +- [GitHub Community #33804: GitHub-actions bot not triggering Actions](https://github.com/orgs/community/discussions/33804) |
| 245 | +- [actions/runner-images#11886: Network instability on GitHub-hosted runners](https://github.com/actions/runner-images/issues/11886) |
| 246 | +- [Issue #53 Case Study: PHP ARM64 build timeouts](../issue-53/CASE-STUDY.md) |