| 1 | +# E2E Tests Split: Provision vs Configuration |
| 2 | + |
| 3 | +## Summary |
| 4 | + |
| 5 | +### Problem |
| 6 | + |
| 7 | +The current E2E tests fail on GitHub Actions runners because LXD virtual machines have no network connectivity there. Provisioning succeeds, but the VMs then cannot download and install dependencies. |
| 8 | + |
| 9 | +**Related Issues:** |
| 10 | + |
| 11 | +- [GitHub Actions Runner Images Issue #13003](https://github.com/actions/runner-images/issues/13003) - Network connectivity issues with LXD VMs on GitHub runners |
| 12 | +- [Reproduction Repository](https://github.com/josecelano/test-docker-install-inside-vm-in-runner) - Test repository demonstrating the network connectivity issues |
| 13 | +- [Virtualization Support Research](https://github.com/josecelano/github-actions-virtualization-support) - Comprehensive testing of virtualization tools on GitHub Actions, demonstrating Docker feasibility |
| 14 | +- [Original Virtualization Investigation](https://github.com/actions/runner-images/issues/12933) - Background context on GitHub Actions virtualization support |
| 15 | + |
| 16 | +### Current Deployment Phases |
| 17 | + |
| 18 | +Our deployment workflow consists of these sequential phases: |
| 19 | + |
| 20 | +1. **Provision** - Create infrastructure (VMs/containers) using OpenTofu/LXD |
| 21 | +2. **Configure** - Install and configure software using Ansible |
| 22 | +3. **Release** - Deploy application artifacts |
| 23 | +4. **Run** - Start and validate running services |
| 24 | + |
| 25 | +Currently, all phases are tested together in a single E2E test suite, which fails due to the network connectivity issue in phase 2 (Configure). |
| 26 | + |
| 27 | +## Solution |
| 28 | + |
| 29 | +Split the E2E testing into two independent test suites: |
| 30 | + |
| 31 | +### 1. E2E Provision Tests (`e2e-provision`) |
| 32 | + |
| 33 | +- **Scope**: Test only the provisioning phase |
| 34 | +- **Technology**: Continue using LXD VMs on GitHub Actions runners |
| 35 | +- **Coverage**: |
| 36 | + - VM/container creation |
| 37 | + - Cloud-init completion |
| 38 | + - Basic infrastructure validation |
| 39 | +- **Success Criteria**: VM is created and cloud-init has finished successfully |
| 40 | + |
| 41 | +### 2. E2E Configuration Tests (`e2e-config`) |
| 42 | + |
| 43 | +- **Scope**: Test configuration, release, and run phases |
| 44 | +- **Technology**: Use Docker containers instead of VMs (proven feasible per [virtualization research](https://github.com/josecelano/github-actions-virtualization-support)) |
| 45 | +- **Coverage**: |
| 46 | + - Ansible playbook execution |
| 47 | + - Software installation (Docker, Docker Compose, etc.) |
| 48 | + - Application deployment |
| 49 | + - Service validation |
| 50 | +- **Success Criteria**: All software is installed and services are running correctly |
| 51 | + |
| 52 | +### Benefits |
| 53 | + |
| 54 | +1. **Reliability**: Provision tests continue working on GitHub Actions |
| 55 | +2. **Speed**: Configuration tests should run faster in Docker containers, which start without booting a full VM |
| 56 | +3. **Isolation**: Issues in one test suite don't block the other |
| 57 | +4. **Maintainability**: Each test suite has a single, focused responsibility |
| 58 | +5. **Debugging**: Easier to identify whether issues are in provisioning or configuration |
| 59 | + |
| 60 | +## Implementation Plan |
| 61 | + |
| 62 | +### Phase A: Create E2E Provision Tests |
| 63 | + |
| 64 | +#### A.1: Define naming and structure |
| 65 | + |
| 66 | +- [ ] **Task**: Define binary and workflow names |
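| 67 | + - Binary: `e2e-provision-tests` (declare it as a `[[bin]]` target in `Cargo.toml`; Cargo's auto-discovery would otherwise name it `e2e_provision_tests` after the source file) |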
| 68 | + - Workflow: `.github/workflows/test-e2e-provision.yml` |
| 69 | + - Purpose: Test infrastructure provisioning only |
| 70 | + |
| 71 | +#### A.2: Create provision-only workflow |
| 72 | + |
| 73 | +- [ ] **Task**: Create `.github/workflows/test-e2e-provision.yml` |
| 74 | + - Copy structure from existing `test-e2e.yml` |
| 75 | + - Use `cargo run --bin e2e-provision-tests` |
| 76 | + - Keep all LXD/OpenTofu setup steps |
| 77 | + - Remove Ansible installation (not needed for provision-only tests) |
| 78 | + |
| 79 | +#### A.3: Create provision-only binary |
| 80 | + |
| 81 | +- [ ] **Task**: Create `src/bin/e2e_provision_tests.rs` |
| 82 | + - Copy code from `src/bin/e2e_tests.rs` |
| 83 | + - Remove the `configure_infrastructure` call from `run_full_deployment_test()` |
| 84 | + - Focus only on: |
| 85 | + - `cleanup_lingering_resources()` |
| 86 | + - `provision_infrastructure()` |
| 87 | + - Basic validation that VM is created and cloud-init completed |
| 88 | + - `cleanup_infrastructure()` |
| 89 | + |
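The step order in A.3 can be sketched as a small runner. This is a minimal illustration, not the project's code: the `Step` enum and `run_provision_test` are hypothetical names, and the real `cleanup_lingering_resources()`, `provision_infrastructure()`, and `cleanup_infrastructure()` signatures are not shown in this plan, so each step is passed in as a closure. The one design point it demonstrates is that the final cleanup runs even when provisioning or validation fails.

```rust
/// Steps of the provision-only run, in execution order (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Step {
    CleanupLingering,
    Provision,
    Validate,
    Cleanup,
}

/// Runs the provision-only flow through the caller-supplied `run_step`.
///
/// If the initial cleanup fails, the run aborts before anything is created.
/// Otherwise the final cleanup always runs, and the first error wins.
pub fn run_provision_test<F>(mut run_step: F) -> Result<(), String>
where
    F: FnMut(Step) -> Result<(), String>,
{
    run_step(Step::CleanupLingering)?;

    // Validation only makes sense if provisioning succeeded.
    let outcome = run_step(Step::Provision).and_then(|()| run_step(Step::Validate));

    // Always tear down, even when provisioning or validation failed.
    let cleanup = run_step(Step::Cleanup);

    // Report the provisioning/validation error first, else the cleanup result.
    outcome.and(cleanup)
}
```

In the real binary, the closure would dispatch each `Step` to the corresponding existing function.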
| 90 | +#### A.4: Update provision test validation |
| 91 | + |
| 92 | +- [ ] **Task**: Modify validation logic in provision tests |
| 93 | + - Check VM/container exists and is running |
| 94 | + - Verify cloud-init has completed successfully |
| 95 | + - Validate basic network interface setup |
| 96 | + - Skip application-level validations |
| 97 | + |
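One way to implement the cloud-init check in A.4 is to run `cloud-init status` inside the instance and parse its `status: ...` line. The sketch below assumes the instance is reachable through `lxc exec`; the type and function names are hypothetical, and only the `done`, `running`, and `error` values are mapped explicitly. Any other value is passed through unchanged rather than guessed at.

```rust
use std::process::Command;

/// Outcome reported by `cloud-init status` (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum CloudInitStatus {
    Done,
    Running,
    Error,
    /// Any status value this sketch does not map explicitly.
    Other(String),
}

/// Extracts the status from output containing a `status: <value>` line.
/// Returns `None` when no such line is present.
pub fn parse_cloud_init_status(output: &str) -> Option<CloudInitStatus> {
    let value = output
        .lines()
        .find_map(|line| line.trim().strip_prefix("status:"))?
        .trim();
    Some(match value {
        "done" => CloudInitStatus::Done,
        "running" => CloudInitStatus::Running,
        "error" => CloudInitStatus::Error,
        other => CloudInitStatus::Other(other.to_string()),
    })
}

/// Queries cloud-init inside an LXD instance via `lxc exec`.
/// Assumes the `lxc` client is installed and `instance` names an existing instance.
pub fn query_cloud_init_status(instance: &str) -> std::io::Result<Option<CloudInitStatus>> {
    let output = Command::new("lxc")
        .args(["exec", instance, "--", "cloud-init", "status"])
        .output()?;
    Ok(parse_cloud_init_status(&String::from_utf8_lossy(&output.stdout)))
}
```

The provision test would pass only when the query returns `Some(CloudInitStatus::Done)`.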
| 98 | +#### A.5: Test and commit provision workflow |
| 99 | + |
| 100 | +- [ ] **Task**: Verify provision-only workflow works |
| 101 | + - Test locally: `cargo run --bin e2e-provision-tests` |
| 102 | + - Commit changes with conventional commit format |
| 103 | + - Verify new GitHub workflow passes |
| 104 | + - Update workflow status badges in README if needed |
| 105 | + |
| 106 | +### Phase B: Create E2E Configuration Tests |
| 107 | + |
| 108 | +#### B.1: Research Docker container approach |
| 109 | + |
| 110 | +- [ ] **Task**: Design Docker-based test environment |
| 111 | + - **Reference**: Use proven approach from [virtualization support research](https://github.com/josecelano/github-actions-virtualization-support) |
| 112 | + - Create Ubuntu 24.04 base container configuration |
| 113 | + - Investigate cloud-init support in Docker (or alternative initialization) |
| 114 | + - Research testcontainers integration for Rust |
| 115 | + - Document container networking requirements for Ansible |
| 116 | + - **Advantage**: Docker is well-established and reliable on GitHub Actions |
| 117 | + |
| 118 | +#### B.2: Create Docker configuration |
| 119 | + |
| 120 | +- [ ] **Task**: Create `docker/test-ubuntu/Dockerfile` |
| 121 | + - Ubuntu 24.04 base image |
| 122 | + - Cloud-init installation (if feasible) or alternative init system |
| 123 | + - SSH server configuration for Ansible connectivity |
| 124 | + - Network configuration for container accessibility |
| 125 | + - Required system dependencies |
| 126 | + |
| 127 | +#### B.3: Create configuration-only binary |
| 128 | + |
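| 129 | +- [ ] **Task**: Create `src/bin/e2e_config_tests.rs` (declared as the `e2e-config-tests` `[[bin]]` target in `Cargo.toml`; Cargo's auto-discovery would otherwise name it `e2e_config_tests`) |
| 130 | + - Copy code from the original `src/bin/e2e_tests.rs` (the full-deployment version, not the provision-only copy) |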
| 131 | + - Replace LXD VM provisioning with Docker container setup |
| 132 | + - Implement Docker container lifecycle management |
| 133 | + - Keep all configuration, release, and run phase testing |
| 134 | + - Update infrastructure cleanup to handle Docker containers |
| 135 | + |
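The container lifecycle management in B.3 could follow the direct Docker CLI approach, with a guard value that removes the container when it goes out of scope. This is a sketch under assumptions: `TestContainer`, the helper functions, the container name, the image tag, and the published SSH port are all hypothetical. Only standard `docker run` and `docker rm` flags are used.

```rust
use std::io;
use std::process::Command;

/// Arguments for `docker run`: detached, named, with container port 22
/// (SSH, for Ansible) published on `host_ssh_port`.
pub fn docker_run_args(name: &str, image: &str, host_ssh_port: u16) -> Vec<String> {
    vec![
        "run".to_string(),
        "--detach".to_string(),
        "--name".to_string(),
        name.to_string(),
        "--publish".to_string(),
        format!("{}:22", host_ssh_port),
        image.to_string(),
    ]
}

/// Arguments for `docker rm --force`, which stops and removes the container.
pub fn docker_rm_args(name: &str) -> Vec<String> {
    vec!["rm".to_string(), "--force".to_string(), name.to_string()]
}

/// Guard that owns a running test container and removes it on drop.
pub struct TestContainer {
    name: String,
}

impl TestContainer {
    /// Starts the container; fails if `docker run` exits with a non-zero status.
    pub fn start(name: &str, image: &str, host_ssh_port: u16) -> io::Result<Self> {
        let status = Command::new("docker")
            .args(docker_run_args(name, image, host_ssh_port))
            .status()?;
        if status.success() {
            Ok(Self { name: name.to_string() })
        } else {
            Err(io::Error::new(
                io::ErrorKind::Other,
                format!("docker run failed: {}", status),
            ))
        }
    }
}

impl Drop for TestContainer {
    fn drop(&mut self) {
        // Best-effort cleanup; errors are ignored so drop never panics.
        let _ = Command::new("docker").args(docker_rm_args(&self.name)).status();
    }
}
```

Tying cleanup to `Drop` means the container is removed on early returns from a failed test phase as well as on success. It is not removed if the process is killed, so the plan's separate cleanup of lingering resources is still needed.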
| 136 | +#### B.4: Integrate testcontainers (optional) |
| 137 | + |
| 138 | +- [ ] **Task**: Evaluate and potentially integrate testcontainers-rs |
| 139 | + - Add `testcontainers` crate dependency if beneficial |
| 140 | + - Implement container management through testcontainers API |
| 141 | + - Compare with direct Docker CLI approach |
| 142 | + - Document decision and rationale |
| 143 | + |
| 144 | +#### B.5: Test configuration workflow locally |
| 145 | + |
| 146 | +- [ ] **Task**: Validate configuration tests work locally |
| 147 | + - Test: `cargo run --bin e2e-config-tests` |
| 148 | + - Verify container creation and networking |
| 149 | + - Validate Ansible connectivity to container |
| 150 | + - Confirm all configuration/release/run phases complete |
| 151 | + - Test cleanup procedures |
| 152 | + |
| 153 | +#### B.6: Create configuration workflow |
| 154 | + |
| 155 | +- [ ] **Task**: Create `.github/workflows/test-e2e-config.yml` |
| 156 | + - Copy structure from existing `test-e2e.yml`, removing the LXD/OpenTofu setup steps |
| 157 | + - Keep Ansible installation |
| 158 | + - Add Docker setup if needed |
| 159 | + - Use `cargo run --bin e2e-config-tests` |
| 160 | + - Configure appropriate timeout limits |
| 161 | + |
| 162 | +#### B.7: Test and commit configuration workflow |
| 163 | + |
| 164 | +- [ ] **Task**: Verify configuration workflow on GitHub Actions |
| 165 | + - Commit configuration test changes |
| 166 | + - Verify new GitHub workflow passes |
| 167 | + - Test that Docker containers work correctly in GitHub Actions |
| 168 | + - Validate all software installation steps complete |
| 169 | + |
| 170 | +### Phase C: Integration and Documentation |
| 171 | + |
| 172 | +#### C.1: Update documentation |
| 173 | + |
| 174 | +- [ ] **Task**: Update relevant documentation |
| 175 | + - Update `docs/e2e-testing.md` to reflect new split approach |
| 176 | + - Document how to run each test suite independently |
| 177 | + - Update `README.md` workflow badges for both test suites |
| 178 | + - Add troubleshooting guide for each test type |
| 179 | + |
| 180 | +#### C.2: Update legacy workflow |
| 181 | + |
| 182 | +- [ ] **Task**: Update or deprecate original E2E workflow |
| 183 | + - Option 1: Remove `.github/workflows/test-e2e.yml` entirely |
| 184 | + - Option 2: Convert it to a meta-workflow that runs both new test suites |
| 185 | + - Update any CI dependencies or status checks |
| 186 | + |
| 187 | +#### C.3: Cleanup old binary (optional) |
| 188 | + |
| 189 | +- [ ] **Task**: Remove or repurpose `src/bin/e2e_tests.rs` |
| 190 | + - Remove if no longer needed |
| 191 | + - Or repurpose as meta-test runner for both suites |
| 192 | + - Update any related documentation |
| 193 | + |
| 194 | +#### C.4: Validate complete solution |
| 195 | + |
| 196 | +- [ ] **Task**: End-to-end validation |
| 197 | + - Verify both test suites pass independently |
| 198 | + - Test that they can run in parallel without conflicts |
| 199 | + - Validate comprehensive coverage across all deployment phases |
| 200 | + - Confirm GitHub Actions reliability improvements |
| 201 | + |
| 202 | +## Success Criteria |
| 203 | + |
| 204 | +1. **Provision Tests**: Consistently pass on GitHub Actions, testing VM creation and cloud-init |
| 205 | +2. **Configuration Tests**: Consistently pass on GitHub Actions, testing software installation and deployment |
| 206 | +3. **Independence**: Each test suite can run independently without interference |
| 207 | +4. **Coverage**: Combined test suites provide equivalent or better coverage than original tests |
| 208 | +5. **Performance**: Overall test execution time is no longer than that of the original suite |
| 209 | +6. **Maintainability**: Clear separation of concerns makes debugging and maintenance easier |
| 210 | + |
| 211 | +## Risks and Mitigations |
| 212 | + |
| 213 | +### Risk: Docker environment differs from LXD VMs |
| 214 | + |
| 215 | +- **Mitigation**: Carefully configure Docker container to match LXD VM environment |
| 216 | +- **Validation**: Cross-reference configurations between Docker and LXD templates |
| 217 | + |
| 218 | +### Risk: Testcontainers adds complexity |
| 219 | + |
| 220 | +- **Mitigation**: Start with the direct Docker CLI approach and add testcontainers only if clearly beneficial |
| 221 | +- **Fallback**: Direct Docker CLI integration is simpler and well-documented |
| 222 | + |
| 223 | +### Risk: Loss of end-to-end coverage |
| 224 | + |
| 225 | +- **Mitigation**: Ensure that provision tests validate that the infrastructure is ready for configuration |
| 226 | +- **Validation**: Document the interface contract between provision and configuration phases |
| 227 | + |
| 228 | +### Risk: Increased maintenance burden |
| 229 | + |
| 230 | +- **Mitigation**: Share common code between test suites through library modules |
| 231 | +- **Best Practice**: Keep test configurations as similar as possible between suites |