| 1 | +# Decision: Exit Code Zero for Docker Security Scanning |
| 2 | + |
| 3 | +## Status |
| 4 | + |
| 5 | +Accepted |
| 6 | + |
| 7 | +## Date |
| 8 | + |
| 9 | +2025-12-23 |
| 10 | + |
| 11 | +## Context |
| 12 | + |
| 13 | +When implementing automated Docker vulnerability scanning with Trivy in GitHub Actions, we faced a critical decision about how the CI/CD pipeline should respond to discovered vulnerabilities. |
| 14 | + |
| 15 | +Traditional approaches make CI fail when vulnerabilities are found, blocking all development until issues are resolved. However, this creates several problems: |
| 16 | + |
| 17 | +1. **False Positives**: Security scanners can report issues that don't apply to our context or are accepted risks |
| 18 | +2. **Third-Party Dependencies**: We cannot immediately fix vulnerabilities in upstream images (mysql, prometheus, grafana) |
| 19 | +3. **Scanner Quirks**: Trivy occasionally exits with code 1 even when no vulnerabilities are found |
| 20 | +4. **Development Flow**: Security findings should not block unrelated development work |
| 21 | +5. **Policy Enforcement**: Security decisions should be made by security teams, not automated tooling |
| 22 | +6. **Partial Data Loss**: If CI fails early, later scans never run and we lose visibility into other images |
| 23 | + |
| 24 | +The initial implementation used `exit-code: "1"`, which caused the workflow to fail on any HIGH or CRITICAL vulnerability, including when scanning third-party production images with known CVEs that we cannot immediately fix. |
| 25 | + |
| 26 | +## Decision |
| 27 | + |
| 28 | +Implement a **security-first philosophy** where: |
| 29 | + |
| 30 | +1. **Exit Code Zero Everywhere**: All Trivy scan steps use `exit-code: "0"`, so discovered vulnerabilities never fail the CI pipeline |
| 31 | +2. **Dual Output Strategy**: |
| 32 | + - Human-readable table format in workflow logs for immediate visibility |
| 33 | + - SARIF format uploaded to GitHub Security tab for tracking and alerting |
| 34 | +3. **Separation of Concerns**: |
| 35 | + - Trivy's role: **Detect** vulnerabilities and provide data |
| 36 | + - GitHub Security's role: **Decide** enforcement policies and alert routing |
| 37 | + - CI's role: **Stay green** and maintain development velocity |
| 38 | +4. **Always Run Policy**: The upload job uses `if: always()` to ensure partial results are never lost |
| 39 | +5. **Unique Categories**: Each image gets a unique SARIF category for proper alert tracking and deduplication |
| 40 | +6. **Scheduled Scanning**: Daily cron ensures continuous monitoring without blocking code changes |
| 41 | + |
| 42 | +This philosophy is summarized as: **"Trivy detects, GitHub Security decides, CI stays green"** |
| 43 | + |
| 44 | +## Consequences |
| 45 | + |
| 46 | +### Positive |
| 47 | + |
| 48 | +- **No False Failures**: Development work never blocked by scanner quirks or edge cases |
| 49 | +- **Continuous Visibility**: All scans complete even if one fails, providing a complete security picture |
| 50 | +- **Flexible Enforcement**: Security team can configure GitHub Security policies without changing code |
| 51 | +- **Third-Party Tolerance**: Known vulnerabilities in upstream images don't block development |
| 52 | +- **Developer Experience**: Green builds maintain team velocity while security team reviews findings |
| 53 | +- **Policy Separation**: Security enforcement decoupled from CI/CD implementation |
| 54 | +- **Audit Trail**: All findings recorded in GitHub Security tab for compliance and tracking |
| 55 | +- **Incremental Improvement**: Can address vulnerabilities based on priority without CI pressure |
| 56 | + |
| 57 | +### Negative |
| 58 | + |
| 59 | +- **Potential Complacency**: Green CI might lead to ignoring security findings (mitigated by GitHub Security alerts) |
| 60 | +- **Requires Monitoring**: Security team must actively monitor GitHub Security tab |
| 61 | +- **Policy Configuration**: Requires additional GitHub Security policy setup for enforcement |
| 62 | +- **Learning Curve**: Non-traditional approach may confuse developers expecting red builds for vulnerabilities |
| 63 | + |
| 64 | +### Risks Introduced |
| 65 | + |
| 66 | +- **Missed Critical Issues**: If GitHub Security is not properly configured or monitored, critical vulnerabilities might go unaddressed |
| 67 | + - **Mitigation**: Daily scheduled scans ensure consistent monitoring; GitHub can send email notifications for new alerts, depending on each recipient's notification settings |
| 68 | +- **Organizational Resistance**: Some organizations mandate CI failure on security issues |
| 69 | + - **Mitigation**: GitHub Security can be configured to block PRs or deployments if needed |
| 70 | + |
| 71 | +## Alternatives Considered |
| 72 | + |
| 73 | +### 1. Exit Code 1 (Fail on Vulnerabilities) |
| 74 | + |
| 75 | +**Approach**: Use `exit-code: "1"` to fail CI when HIGH/CRITICAL vulnerabilities are found. |
| 76 | + |
| 77 | +**Rejected Because**: |
| 78 | + |
| 79 | +- Blocks development on third-party image vulnerabilities we cannot fix immediately |
| 80 | +- Scanner quirks cause false CI failures even with zero vulnerabilities |
| 81 | +- No flexibility for security team to make risk-based decisions |
| 82 | +- Partial data loss when early scans fail |
| 83 | + |
| 84 | +### 2. Mixed Exit Codes (Project vs Third-Party) |
| 85 | + |
| 86 | +**Approach**: Use `exit-code: "1"` for project images but `exit-code: "0"` for third-party images. |
| 87 | + |
| 88 | +**Rejected Because**: |
| 89 | + |
| 90 | +- Inconsistent philosophy creates confusion |
| 91 | +- Project images can have legitimate accepted risks |
| 92 | +- Still susceptible to scanner quirks on project images |
| 93 | +- Doesn't solve the fundamental policy enforcement problem |
| 94 | + |
| 95 | +### 3. Continue-on-Error Pattern |
| 96 | + |
| 97 | +**Approach**: Use `exit-code: "1"` but add `continue-on-error: true` to allow the workflow to proceed. |
| 98 | + |
| 99 | +**Rejected Because**: |
| 100 | + |
| 101 | +- Shows misleading "failed" status even though workflow continues |
| 102 | +- Scanner errors appear as failures in UI, creating noise |
| 103 | +- Doesn't fundamentally change the enforcement model |
| 104 | +- Confusing to developers seeing "failed" steps that don't actually fail |
| 105 | + |
| 106 | +### 4. CodeQL Action with Single Category |
| 107 | + |
| 108 | +**Approach**: Upload all SARIF files using `github/codeql-action/upload-sarif` with the same category. |
| 109 | + |
| 110 | +**Rejected Because**: |
| 111 | + |
| 112 | +- CodeQL Action rejects multiple SARIF uploads with identical categories (as of July 2025) |
| 113 | +- Results in a "multiple SARIF runs with same category" error |
| 114 | +- Cannot distinguish alerts between different images |
| 115 | + |
| 116 | +## Related Decisions |
| 117 | + |
| 118 | +- [GitHub Actions Workflow Structure](https://github.com/torrust/torrust-tracker-deployer/pull/256) - How the three-job structure enables this philosophy |
| 119 | +- Future: Security Policy Configuration (to be documented when GitHub Security policies are configured) |
| 120 | + |
| 121 | +## References |
| 122 | + |
| 123 | +- [Issue #251: Implement basic Trivy scanning workflow](https://github.com/torrust/torrust-tracker-deployer/issues/251) |
| 124 | +- [Pull Request #256: Implement Basic Trivy Scanning Workflow](https://github.com/torrust/torrust-tracker-deployer/pull/256) |
| 125 | +- [Trivy Action Documentation](https://github.com/aquasecurity/trivy-action) |
| 126 | +- [GitHub Code Scanning Documentation](https://docs.github.com/en/code-security/code-scanning) |
| 127 | +- [GitHub Security Policy Enforcement](https://docs.github.com/en/code-security/code-scanning/managing-code-scanning-alerts) |
| 128 | +- [Security-First Philosophy Discussion](https://github.com/torrust/torrust-tracker-deployer/pull/256#discussion) - External review recommending exit-code 0 approach |
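
As a rough illustration of the Decision section in the ADR above, the following workflow fragment sketches how its six points could fit together. It is a sketch under stated assumptions, not the workflow from PR #256: the job names, matrix entries, image tags, artifact names, SARIF category names, cron time, trigger branch, and pinned action versions are all hypothetical. Only `exit-code: "0"`, `if: always()`, the table and SARIF outputs, one category per image, and a daily schedule come from the ADR text.

```yaml
# Hypothetical sketch; names, tags, versions, and times are placeholders.
name: Docker Security Scan (sketch)

on:
  push:
    branches: [main]
  schedule:
    - cron: "0 6 * * *" # daily; the time of day is arbitrary

jobs:
  scan:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false # a failing matrix entry should not cancel the others
      matrix:
        include:
          - name: mysql
            image: mysql:8.0
          - name: grafana
            image: grafana/grafana:latest
    steps:
      # Output 1: human-readable table in the workflow log
      - name: Scan image (table)
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ matrix.image }}
          format: table
          severity: HIGH,CRITICAL
          exit-code: "0" # findings do not fail the step

      # Output 2: SARIF file for the GitHub Security tab
      - name: Scan image (SARIF)
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ matrix.image }}
          format: sarif
          output: trivy-${{ matrix.name }}.sarif
          severity: HIGH,CRITICAL
          exit-code: "0"

      - name: Keep the SARIF file for the upload job
        uses: actions/upload-artifact@v4
        with:
          name: sarif-${{ matrix.name }}
          path: trivy-${{ matrix.name }}.sarif

  upload:
    needs: scan
    if: always() # run even when some scan entries failed
    runs-on: ubuntu-latest
    permissions:
      security-events: write # needed to upload code scanning results
      actions: read
      contents: read
    strategy:
      fail-fast: false
      matrix:
        name: [mysql, grafana]
    steps:
      - name: Fetch the SARIF file produced by the scan job
        uses: actions/download-artifact@v4
        with:
          name: sarif-${{ matrix.name }}

      - name: Upload to GitHub code scanning
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: trivy-${{ matrix.name }}.sarif
          category: docker-${{ matrix.name }} # one category per image
```

In this layout a scan entry that errors out produces no artifact, so the matching upload entry would fail at the download step while the other images still upload. Whether that is acceptable, or whether the download step needs its own guard, depends on how the real workflow is structured.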