|
| 1 | +# Case Study: Issue #5 - Benchmark is Broken |
| 2 | + |
| 3 | +## Executive Summary |
| 4 | + |
| 5 | +The benchmark workflow reports false success even when the benchmark fails with a panic. This case study analyzes the root causes, timeline of events, and proposes solutions. |
| 6 | + |
| 7 | +**Primary Issue:** GitHub Actions workflow incorrectly reports success despite `cargo bench` failing with a panic. |
| 8 | + |
| 9 | +**Root Causes Identified:** |
| 10 | +1. **Pipeline Exit Code Masking:** Using `| tee out.txt` in the benchmark command masks the exit code |
| 11 | +2. **Benchmark Logic Bug:** The delete benchmark attempts to delete links that don't exist due to shared state between benchmarks |
| 12 | + |
| 13 | +## Timeline of Events |
| 14 | + |
| 15 | +| Timestamp (UTC) | Event | |
| 16 | +|-----------------|-------| |
| 17 | +| 2025-12-22 19:49:32 | PR #4 merged: "Fix benchmark failure: Transaction implementation was a broken stub" | |
| 18 | +| 2025-12-22 19:49:39 | CI run #20442351221 starts | |
| 19 | +| 2025-12-22 19:55:19 | `cargo bench` begins execution | |
| 20 | +| 2025-12-22 20:18:16 | Create/Neo4j_NonTransaction completes | |
| 21 | +| 2025-12-22 20:39:35 | Create/Neo4j_Transaction completes | |
| 22 | +| 2025-12-22 20:40:26 | All Doublets Create benchmarks complete | |
| 23 | +| 2025-12-22 21:05:20 | Delete/Neo4j_NonTransaction completes (1988ms/iter) | |
| 24 | +| 2025-12-22 21:05:47 | **PANIC:** `NotExists(4000)` at `delete.rs:23` | |
| 25 | +| 2025-12-22 21:05:47 | `error: bench failed` message displayed | |
| 26 | +| 2025-12-22 21:05:56 | "Prepare benchmark results" step runs (despite failure) | |
| 27 | +| 2025-12-22 21:05:56 | Commit 5134f36: "Update benchmark results" pushed to main | |
| 28 | +| 2025-12-22 21:05:57 | CI run completes with **SUCCESS** status | |
| 29 | + |
| 30 | +## Root Cause Analysis |
| 31 | + |
| 32 | +### Root Cause #1: Pipeline Exit Code Masking |
| 33 | + |
| 34 | +**Location:** `.github/workflows/rust.yml:60` |
| 35 | + |
| 36 | +```yaml |
| 37 | +run: cargo bench --bench bench -- --output-format bencher | tee out.txt |
| 38 | +``` |
| 39 | +
|
| 40 | +**Problem:** When a command is piped to `tee`, the exit code of the entire pipeline is determined by the last command (`tee`), not the first command (`cargo bench`). Since `tee` successfully writes output to the file, it returns exit code 0 even though `cargo bench` returned exit code 101. |
| 41 | + |
| 42 | +**Evidence:** |
| 43 | +- CI logs show `error: bench failed` message |
| 44 | +- The step "Run benchmark" shows `conclusion: success` in GitHub Actions API |
| 45 | +- Subsequent steps (Prepare benchmark results) execute as if nothing failed |
| 46 | + |
| 47 | +**Solution:** Use `set -o pipefail` to ensure the pipeline returns the exit code of the first failing command: |
| 48 | + |
| 49 | +```yaml |
| 50 | +run: | |
| 51 | + set -o pipefail |
| 52 | + cargo bench --bench bench -- --output-format bencher | tee out.txt |
| 53 | +``` |
| 54 | + |
| 55 | +Or use bash-specific syntax: |
| 56 | +```yaml |
| 57 | +shell: bash |
| 58 | +run: | |
| 59 | + set -o pipefail |
| 60 | + cargo bench --bench bench -- --output-format bencher | tee out.txt |
| 61 | +``` |
| 62 | + |
| 63 | +### Root Cause #2: Benchmark Logic Bug (NotExists Error) |
| 64 | + |
| 65 | +**Location:** `rust/benches/benchmarks/delete.rs:23-28` |
| 66 | + |
| 67 | +```rust |
| 68 | +for id in (BACKGROUND_LINKS..=BACKGROUND_LINKS + 1_000).rev() { |
| 69 | + let _ = elapsed! {fork.delete(id)?}; |
| 70 | +} |
| 71 | +``` |
| 72 | + |
| 73 | +**Problem:** The delete benchmark tries to delete links with IDs from 3000 to 4001, but the Transaction benchmark shares state with the Client benchmark. Here's the sequence: |
| 74 | + |
| 75 | +1. `Delete/Neo4j_NonTransaction` runs successfully, creating links with IDs 1-4001 and deleting them |
| 76 | +2. `Delete/Neo4j_Transaction` starts - but it uses a shared `Client` that was initialized during `create` benchmarks |
| 77 | +3. The client's `next_id` counter has been modified by previous benchmarks |
| 78 | +4. When trying to delete ID 4000, it doesn't exist (was never created or was deleted) |
| 79 | + |
| 80 | +**Evidence from logs:** |
| 81 | +``` |
| 82 | +thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: NotExists(4000)', benches/benchmarks/delete.rs:23:9 |
| 83 | +``` |
| 84 | +
|
| 85 | +**Solution Options:** |
| 86 | +1. Ensure Transaction cleanup properly resets state between benchmarks |
| 87 | +2. Use proper test isolation with separate database instances |
| 88 | +3. Handle the `NotExists` error gracefully in benchmarks |
| 89 | +
|
| 90 | +### Root Cause #3: Flawed Transaction Implementation |
| 91 | +
|
| 92 | +**Location:** `rust/src/benched.rs:94-108` |
| 93 | +
|
| 94 | +```rust |
| 95 | +impl<'a, T: LinkType> Benched for Exclusive<Transaction<'a, T>> { |
| 96 | + // ... |
| 97 | + fn fork(&mut self) -> Fork<Self> { |
| 98 | + Fork(self) // Does not call create_table()! |
| 99 | + } |
| 100 | +
|
| 101 | + unsafe fn unfork(&mut self) { |
| 102 | + // Transaction cleanup handled by client |
| 103 | + } |
| 104 | +} |
| 105 | +``` |
| 106 | + |
| 107 | +**Problem:** Unlike the `Client` implementation which calls `create_table()` in `fork()`, the `Transaction` implementation does not set up a clean state. This means: |
| 108 | +- Data from previous benchmarks persists |
| 109 | +- The `next_id` counter is not reset |
| 110 | +- Links created by previous tests affect the Transaction benchmark |
| 111 | + |
| 112 | +**Contrast with Client implementation:** |
| 113 | +```rust |
| 114 | +impl<T: LinkType> Benched for Exclusive<Client<T>> { |
| 115 | + fn fork(&mut self) -> Fork<Self> { |
| 116 | + let _ = self.create_table(); // Proper setup |
| 117 | + Fork(self) |
| 118 | + } |
| 119 | + |
| 120 | + unsafe fn unfork(&mut self) { |
| 121 | + let _ = self.drop_table(); // Proper cleanup |
| 122 | + } |
| 123 | +} |
| 124 | +``` |
| 125 | + |
| 126 | +## Proposed Solutions |
| 127 | + |
| 128 | +### Solution 1: Fix Pipeline Exit Code (Critical) |
| 129 | + |
| 130 | +**File:** `.github/workflows/rust.yml` |
| 131 | + |
| 132 | +Change: |
| 133 | +```yaml |
| 134 | +- name: Run benchmark |
| 135 | + working-directory: rust |
| 136 | + env: |
| 137 | + NEO4J_URI: bolt://localhost:7687 |
| 138 | + NEO4J_USER: neo4j |
| 139 | + NEO4J_PASSWORD: password |
| 140 | + run: cargo bench --bench bench -- --output-format bencher | tee out.txt |
| 141 | +``` |
| 142 | +
|
| 143 | +To: |
| 144 | +```yaml |
| 145 | +- name: Run benchmark |
| 146 | + working-directory: rust |
| 147 | + env: |
| 148 | + NEO4J_URI: bolt://localhost:7687 |
| 149 | + NEO4J_USER: neo4j |
| 150 | + NEO4J_PASSWORD: password |
| 151 | + run: | |
| 152 | + set -o pipefail |
| 153 | + cargo bench --bench bench -- --output-format bencher | tee out.txt |
| 154 | +``` |
| 155 | +
|
| 156 | +### Solution 2: Fix Transaction Benchmark Isolation |
| 157 | +
|
| 158 | +**File:** `rust/src/benched.rs` |
| 159 | + |
| 160 | +Update the Transaction `fork()` method to properly clean state: |
| 161 | + |
| 162 | +```rust |
| 163 | +impl<'a, T: LinkType> Benched for Exclusive<Transaction<'a, T>> { |
| 164 | + // ... |
| 165 | + fn fork(&mut self) -> Fork<Self> { |
| 166 | + let _ = self.drop_table(); // Clean previous state |
| 167 | + Fork(self) |
| 168 | + } |
| 169 | +
|
| 170 | + unsafe fn unfork(&mut self) { |
| 171 | + let _ = self.drop_table(); // Clean after benchmark |
| 172 | + } |
| 173 | +} |
| 174 | +``` |
| 175 | + |
| 176 | +### Solution 3: Make Benchmarks Independent (Long-term) |
| 177 | + |
| 178 | +Consider restructuring the benchmarks to: |
| 179 | +1. Each benchmark creates its own isolated database/table |
| 180 | +2. Use unique prefixes or namespaces for each benchmark run |
| 181 | +3. Add proper cleanup verification before each benchmark |
| 182 | + |
| 183 | +## Impact Assessment |
| 184 | + |
| 185 | +| Impact Area | Severity | Description | |
| 186 | +|-------------|----------|-------------| |
| 187 | +| CI Reliability | High | False success leads to pushing broken code | |
| 188 | +| Data Integrity | Medium | Incorrect benchmark results published to README | |
| 189 | +| Developer Trust | High | Can't rely on CI to catch benchmark failures | |
| 190 | + |
| 191 | +## Files Changed in This Incident |
| 192 | + |
| 193 | +1. `Docs/bench_rust.png` - Updated with partial results |
| 194 | +2. `Docs/bench_rust_log_scale.png` - Updated with partial results |
| 195 | +3. `README.md` - Results table updated with incomplete data showing N/A for failed benchmarks |
| 196 | + |
| 197 | +## References |
| 198 | + |
| 199 | +- [GitHub Actions: Handling Step Errors](https://www.kenmuse.com/blog/how-to-handle-step-and-job-errors-in-github-actions/) |
| 200 | +- [Bash pipefail option](https://www.gnu.org/software/bash/manual/bash.html#The-Set-Builtin) |
| 201 | +- [cargo bench documentation](https://doc.rust-lang.org/cargo/commands/cargo-bench.html) |
| 202 | +- [CI Run #20442351221](https://github.com/linksplatform/Comparisons.Neo4jVSDoublets/actions/runs/20442351221/job/58738054640) |
| 203 | +- [Commit 5134f36 (pushed with broken benchmarks)](https://github.com/linksplatform/Comparisons.Neo4jVSDoublets/commit/5134f36730caa5a6b571692e4b67b1e3cb523e3e) |
| 204 | + |
| 205 | +## Appendix: Full Error Log Extract |
| 206 | + |
| 207 | +``` |
| 208 | +Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 1450.3s, or reduce sample count to 10. |
| 209 | +thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: NotExists(4000)', benches/benchmarks/delete.rs:23:9 |
| 210 | +note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace |
| 211 | +test Delete/Neo4j_Transaction ... |
| 212 | +error: bench failed |
| 213 | +``` |
0 commit comments