
Commit 963a7ee

Expanding our guide to flaky tests
1 parent e8dee04 commit 963a7ee

File tree

1 file changed: +149 -33 lines changed


tools/flakeguard/e2e-flaky-test-guide.md

@@ -1,42 +1,133 @@
# Finding the Root Cause of Test Flakes in Go

Flaky tests can arise from many sources and can be frustrating to fix. Here's a non-exhaustive guide to help you find and resolve common causes of flakes in Go.

## The Test Only Flakes 0.xx% of the Time, Why Bother Fixing It?

You bother to fix it because of **MATH!**

Let's imagine a large repo with 10,000 tests, and let's imagine only 100 (1%) of them are flaky. Let's further imagine that each of those flaky tests has a chance of flaking 1% of the time. If you are a responsible dev who requires all of your tests to pass in CI before you merge, flaky tests have now become a massive headache.

$$P(\text{at least one flaky test}) = 1 - (1 - 0.01)^{100}$$

$$P(\text{at least one flaky test}) \approx 63.40\%$$

Even a few tests with a tiny chance of flaking can cause massive damage to a repo that a lot of devs work on.
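
If you want to check the arithmetic yourself, here's a quick sketch you can run; the suite size and flake rate are just the made-up numbers from above:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	flakyTests := 100.0 // flaky tests in the suite
	flakeRate := 0.01   // chance each one fails on any given run

	// P(at least one flaky failure) = 1 - P(every flaky test passes)
	pAtLeastOne := 1 - math.Pow(1-flakeRate, flakyTests)
	fmt.Printf("P(at least one flaky failure per CI run): %.2f%%\n", pAtLeastOne*100)
}
```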

## General Tips

Ideally, if you're dealing with a flaky test, you'll already have some examples of it flaking in front of you, so you can dig through logs and stack traces and figure it out that way. If that's not the case, or you'd like some more evidence, or you're just stumped, try reproducing the flake. How you reproduce the flake is often the best clue as to why it's flaking.

For repos that have [flakeguard](https://github.com/smartcontractkit/chainlink-testing-framework/tree/main/tools/flakeguard) configured (like chainlink), you can try running it locally.

```sh
make run_flakeguard_validate_unit_tests
```

You can also try some more precise configurations below.
### 1. Run the Test in Isolation

As we saw above, flaky tests become issues even when their chance of flaking is tiny. You might be hunting down a flake that only happens 0.5% of the time, so your only real solution is to run the test over and over.

```sh
# Run just that test 1,000 times, stopping after the first failure (anchor the pattern so only TestName runs)
go test ./package -run '^TestName$' -count 1000 -failfast
```
### 2. Run the Test Package

Tests rarely run in isolation in the real world. If you can't get the flake to happen when isolated, try running the whole package on repeat.

```sh
# Run all tests in the package over and over.
go test ./package -count 1000 -failfast
```
If you get the test to fail here, but not independently, it's likely that it depends on the execution of other tests in the package. Look for global resources your test could be sharing with others, and do your best to isolate all of your unit tests.
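
To make that concrete, here's a minimal, hypothetical sketch of the kind of coupling to look for; the `example` package and the `activeUsers` variable are invented for illustration:

```go
package example

import "testing"

// Package-level state shared by every test in this package.
var activeUsers = map[string]bool{}

func TestAddUser(t *testing.T) {
	activeUsers["alice"] = true // leaks state into whichever test runs next

	if !activeUsers["alice"] {
		t.Fatal("expected alice to be active")
	}
}

// Passes on its own, but fails once TestAddUser has already polluted the map.
func TestNoActiveUsersAtStart(t *testing.T) {
	if len(activeUsers) != 0 {
		t.Fatalf("expected no active users, got %d", len(activeUsers))
	}
}

// Safer: give each test its own state and undo any mutations when it finishes.
func TestAddUserIsolated(t *testing.T) {
	activeUsers = map[string]bool{}
	t.Cleanup(func() { activeUsers = map[string]bool{} })

	activeUsers["alice"] = true
	if len(activeUsers) != 1 {
		t.Fatalf("expected 1 active user, got %d", len(activeUsers))
	}
}
```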
### 3. Randomize Test Order

If that's still not doing the job, or you're still scratching your head, try randomizing the test order. By default, Go runs tests in a deterministic order (the order they're defined in the source), so a test can quietly come to depend on whatever ran before it.

```sh
# -shuffle randomizes test order
go test ./package -shuffle on -count 1000 -failfast
# You can supply your own int value to shuffle as a seed
go test ./package -shuffle 15 -count 1000 -failfast
```
### 4. Check for Races

If your test only fails when it runs alongside other tests or under load, it's possible there's a race condition it's getting caught on. Go's `-race` flag isn't guaranteed to catch all races every time. Just like flakes, you sometimes just need to get lucky (unlucky?).

```sh
# Tests with -race detection take longer to run, and aren't always going to catch issues, especially in large test suites.
go test ./package -race -shuffle on -count 100 -failfast
```
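
If you're not sure what a race like this looks like in a test, here's a minimal, hypothetical example of the kind of unsynchronized access `-race` can flag, along with one way to fix it:

```go
package example

import (
	"sync"
	"testing"
)

// A racy test: many goroutines mutate the same variable with no synchronization.
// `go test -race` reports the data race; without -race the final count is simply
// unpredictable, which is what makes the test flaky.
func TestCounterRacy(t *testing.T) {
	counter := 0
	var wg sync.WaitGroup

	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			counter++ // unsynchronized read-modify-write
		}()
	}
	wg.Wait()

	if counter != 100 {
		t.Fatalf("expected 100, got %d", counter)
	}
}

// The fix: protect the shared variable (a mutex or sync/atomic both work).
func TestCounterFixed(t *testing.T) {
	counter := 0
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			counter++
			mu.Unlock()
		}()
	}
	wg.Wait()

	if counter != 100 {
		t.Fatalf("expected 100, got %d", counter)
	}
}
```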
### 5. Emulate Your Target System

Tests will often fail in CI, but not locally. You can try re-running the test in CI, but this might take a long time, cost a lot of money, or generally be annoying. There are a few tricks you can use to emulate CI environments locally.

#### 5.1 Play with -cpu and -parallel

You can artificially constrain or expand parallel execution directly in Go. [GOMAXPROCS](https://pkg.go.dev/runtime#hdr-Environment_Variables) defaults to the number of CPUs your system has, and controls how many OS threads can run Go code at once. You can manipulate this value, or otherwise adjust how many tests run at once, to figure out whether resource constraints are hurting your tests.

```sh
# Use -cpu to change GOMAXPROCS. You can supply a list of values to try several at once
go test ./package -shuffle 15 -count 1000 -failfast -cpu 1,2,4
# Use -parallel to set the max number of tests allowed to run in parallel at once (only affects tests that call t.Parallel)
go test ./package -shuffle 15 -count 1000 -failfast -parallel 4
```
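
One detail worth remembering: `-parallel` only limits tests that opt in with `t.Parallel()`; everything else still runs one at a time. A minimal, hypothetical sketch:

```go
package example

import (
	"testing"
	"time"
)

// These two tests opt in to parallel execution; -parallel (default GOMAXPROCS)
// caps how many of them run at the same time.
func TestFetchA(t *testing.T) {
	t.Parallel()
	time.Sleep(50 * time.Millisecond) // stand-in for real work
}

func TestFetchB(t *testing.T) {
	t.Parallel()
	time.Sleep(50 * time.Millisecond)
}

// This test never calls t.Parallel, so it always runs on its own and the
// -parallel flag has no effect on it.
func TestSequential(t *testing.T) {
	time.Sleep(50 * time.Millisecond)
}
```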
#### 5.2 Use Docker

Docker can help you emulate your CI environment a little better. You can look up what type of GitHub Actions runner your CI workflow uses by matching it to the lists [here](https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories) and [here](https://docs.github.com/en/actions/using-github-hosted-runners/using-larger-runners/about-larger-runners#specifications-for-general-larger-runners). You can then package your Go tests in a Docker container and run them with varying resources.

```sh
# Run the default 4-core-16GB ubuntu-latest image used for public GitHub repos
docker run -it --cpus=4 --memory="16g" ubuntu:24.04
```

You can also try using [dockexec](https://github.com/mvdan/dockexec) for convenience, but I've never personally tried it.
#### 5.3 Use act

[act](https://github.com/nektos/act) is a project that lets you emulate your GitHub Actions workflows locally. It's not perfect, and can be tricky to set up for more complex workflows, but it is a nice option if you suspect issues are further back in the workflow and don't want to run the full CI process.

### 6. Use Your Target System

Sometimes you can only discover the truth by going directly to the source. Before you do so, please double check what `runs-on` systems your workflows use. If you're only using `ubuntu-latest` runners, these runs should be free. `8-core`, `16-core`, and `32-core` workflows can become very expensive, very quickly. Please use caution and discretion when running these workflows repeatedly.

### 7. Fix It!

Maybe you've found the source of the flake and are now drilling down into the reasons why. Whatever those reasons might be, I urge you to, at least briefly, reframe the problem and ask whether the test is actually working as intended and revealing flaky behavior in your application instead. This might be an opportunity to fix a rare bug instead of forcing a test to conform to it.

### 8. Give Up

It's not my favorite answer, but sometimes this truly is the solution. It's hard to know exactly when that point comes. I hope to eventually gather enough data on dev productivity and how flaky tests affect it that I can give you absolute rules, but until then, we'll have to go off vibes. Here's your checklist for when you feel ready to collapse in defeat.
#### 8.1 Evaluate the Importance of the Test

* What does the test actually check? Is it a critical path?
* Is the test flaking because it's a bad test? Or is it trying to test behavior that shouldn't or can't be tested? TODO:

#### 8.2 How Flaky is the Test?

Flakeguard should give you a good idea of the test's percentage chance of flaking. Remember from above that even a tiny chance of flaking adds up quickly across a large test suite.
## Chainlink E2E Tests

At CLL, we have specially designed E2E tests that run in Docker and Kubernetes environments. They're more thorough validations of our systems, and much more complex than typical unit tests.

### 1. Find Flakes

You should already have examples of the flake thanks to flakeguard. TODO:

### 2. Reproduce Flakes
For E2E tests, run them 5–10 times consecutively to expose intermittent issues. To run the tests with flakeguard validation, execute the following command from the `chainlink-core/` directory:

```sh
@@ -53,16 +144,37 @@ You’ll be prompted to provide:
- **Chainlink version** (default: develop)
- **Branch name** (default: develop)

This is generally enough to reproduce most flakes.

### 3. Check Resource Constraints

GitHub provides **hosted runners** with specific CPU, memory, and disk allocations. If your tests require more resources than these runners can provide, you may encounter intermittent failures.

By default, we run tests on **`ubuntu-latest`**, as it is **free for public repositories** and the **most cost-effective option for private repositories**. However, this runner has limited resources, which can lead to intermittent failures in resource-intensive tests.
> **Note:** `ubuntu-latest` for **private repositories** has weaker hardware compared to `ubuntu-latest` for **public repositories**. You can learn more about this distinction in [GitHub's documentation](https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories).

#### 3.1 Available GitHub Runners

Below are some of the GitHub-hosted runners available in our organization:

| Runner Name                    | CPU     | Memory    | Disk       |
| ------------------------------ | ------- | --------- | ---------- |
| `ubuntu-22.04-4cores-16GB`     | 4 cores | 16 GB RAM | 150 GB SSD |
| `ubuntu-latest-4cores-16GB`    | 4 cores | 16 GB RAM | 150 GB SSD |
| `ubuntu-22.04-8cores-32GB`     | 8 cores | 32 GB RAM | 300 GB SSD |
| `ubuntu-latest-8cores-32GB`    | 8 cores | 32 GB RAM | 300 GB SSD |
| `ubuntu-22.04-8cores-32GB-ARM` | 8 cores | 32 GB RAM | 300 GB SSD |
#### 3.2 Tips for Low-Resource Environments

- **Profile your tests** to understand their CPU and memory usage.
- **Optimize**: Only spin up what you need.
- **If resources are insufficient**, consider redesigning your tests to run in smaller, independent chunks.
- **If needed**, you can configure CI workflows to use a higher-tier runner, but this comes at an additional cost.
- **Run with debug logs** or the Delve debugger. For more details, check out the [CTF Debug Docs](https://smartcontractkit.github.io/chainlink-testing-framework/framework/components/debug.html).

---

## 3. Testing Locally Under CPU and Memory Constraints

@@ -75,6 +187,7 @@ If CPU throttling or resource contention is suspected, here's how you can approa
### Setting Global Limits (Docker Desktop)

If you are using **Docker Desktop** on **macOS or Windows**, you can globally limit Docker's resource usage:

1. Open **Docker Desktop**.
@@ -86,6 +199,7 @@ This setting caps the **total** resources Docker can use on your machine, ensuri
### Observing Test Behavior Under Constraints

- **Run your E2E tests repeatedly** with different global resource settings.
- Watch for flakiness: If tests start failing more under tighter limits, suspect CPU throttling or memory starvation.
- **Examine logs/metrics** to pinpoint if insufficient resources are causing sporadic failures.
@@ -94,13 +208,15 @@ By setting global limits, you can simulate resource-constrained environments sim
## 4. Common Pitfalls and “Gotchas”

1. **Resource Starvation**: Heavy tests on minimal hardware lead to timeouts or slow responses.
2. **External Dependencies**: Network latency, rate limits, or third-party service issues can cause sporadic failures.
3. **Shared State**: Race conditions arise if tests share databases or global variables in parallel runs.
4. **Timeouts**: Overly tight time limits can fail tests on slower environments (see the sketch below).
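
For the shared-state and timeout pitfalls in particular, a common fix is to stop asserting after a fixed sleep and instead poll for the condition with a generous deadline (testify's `require.Eventually` does the same job if you already use it). A minimal, hypothetical sketch, with `startSlowJob` and `jobDone` standing in for whatever your test actually waits on:

```go
package example

import (
	"testing"
	"time"
)

// Flaky pattern: assert after a fixed sleep that is "usually" long enough.
func TestJobFinishesFlaky(t *testing.T) {
	startSlowJob()
	time.Sleep(2 * time.Second) // too short on a starved CI runner, wasted time on a fast laptop
	if !jobDone() {
		t.Fatal("job did not finish")
	}
}

// Sturdier pattern: poll for the condition with a generous deadline.
func TestJobFinishes(t *testing.T) {
	startSlowJob()

	deadline := time.After(30 * time.Second)
	tick := time.NewTicker(100 * time.Millisecond)
	defer tick.Stop()

	for {
		select {
		case <-deadline:
			t.Fatal("job did not finish within 30s")
		case <-tick.C:
			if jobDone() {
				return
			}
		}
	}
}

// Stand-ins for whatever your test is actually waiting on.
func startSlowJob() {}
func jobDone() bool { return true }
```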

## 5. Key Takeaways

Tackle flakiness systematically:

1. **Attempt local reproduction** (e.g., Docker + limited resources).
2. **Run multiple iterations** on GitHub runners.
