
Commit 9cd0fd7

cr changes, more tests, improved docs, real world example
1 parent 22a0417 commit 9cd0fd7

25 files changed: +2990 −147 lines

book/src/SUMMARY.md

Lines changed: 1 addition & 0 deletions
@@ -78,6 +78,7 @@
 - [Standard Prometheus metrics](./libs/wasp/benchspy/prometheus_std.md)
 - [Custom Prometheus metrics](./libs/wasp/benchspy/prometheus_custom.md)
 - [To Loki or not to Loki?](./libs/wasp/benchspy/loki_dillema.md)
+- [Real world example](./libs/wasp/benchspy/real_world.md)
 - [Reports](./libs/wasp/benchspy/reports/overview.md)
 - [Standard Report](./libs/wasp/benchspy/reports/standard_report.md)
 - [Adding new QueryExecutor](./libs/wasp/benchspy/reports/new_executor.md)

book/src/libs/wasp/benchspy/first_test.md

Lines changed: 1 addition & 0 deletions
@@ -67,6 +67,7 @@ require.NoError(t, storeErr, "failed to store baseline report", path)
 > For now, it's enough to know that the standard metrics provided by `StandardQueryExecutor_Direct` include:
 > - Median latency
 > - P95 latency (95th percentile)
+> - Max latency
 > - Error rate
 
 ### Step 3: Run the Test Again and Compare Reports

book/src/libs/wasp/benchspy/loki_dillema.md

Lines changed: 8 additions & 4 deletions
@@ -4,13 +4,11 @@ You might be wondering whether to use the `Loki` or `Direct` query executor if a
 
 ## Rule of Thumb
 
-If all you need is a single number, such as the median latency or error rate, and you're not interested in:
+You should opt for the `Direct` query executor if all you need is a single number, such as the median latency or error rate, and you're not interested in:
 - Comparing time series directly,
-- Examining minimum or maximum values, or
+- Examining minimum or maximum values over time, or
 - Performing advanced calculations on raw data,
 
-then you should opt for the `Direct` query executor.
-
 ## Why Choose `Direct`?
 
 The `Direct` executor returns a single value for each standard metric using the same raw data that Loki would use. It accesses data stored in the `WASP` generator, which is later pushed to Loki.
@@ -31,5 +29,11 @@ By using `Direct`, you save resources and simplify the process when advanced ana
 > - In the **`Direct` QueryExecutor**, the p95 is calculated across all raw data points, capturing the true variability of the dataset, including any extreme values or spikes.
 > - In the **`Loki` QueryExecutor**, the p95 is calculated over aggregated data (i.e. using the 10-second window). As a result, the raw values within each window are smoothed into a single representative value, potentially lowering or altering the calculated p95. For example, an outlier that would significantly affect the p95 in the `Direct` calculation might be averaged out in the `Loki` window, leading to a slightly lower percentile value.
 
+> #### Direct caveats:
+> - **buffer limitations:** `WASP` generators use a [StringBuffer](https://github.com/smartcontractkit/chainlink-testing-framework/blob/main/wasp/buffer.go) of fixed size to store the responses. Once full capacity is reached,
+> the oldest entries are replaced with incoming ones. The size of the buffer can be set in the generator's config. By default, it is limited to 50k entries to reduce resource consumption and avoid potential OOMs.
+>
+> - **sampling:** `WASP` generators support optional sampling of successful responses. It is disabled by default, but if you enable it, the calculations will no longer be done over the full dataset.
+
 > #### Key Takeaway:
 > The difference arises because `Direct` prioritizes precision by using raw data, while `Loki` prioritizes efficiency and scalability by using aggregated data. When interpreting results, it’s essential to consider how the smoothing effect of `Loki` might impact the representation of variability or extremes in the dataset. This is especially important for metrics like percentiles, where such details can significantly influence the outcome.

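To make the smoothing effect described above concrete, here is a small, self-contained Go sketch (not part of the commit; the latency values and window size are invented for illustration) that compares a p95 computed over raw data points with a p95 computed over per-window averages, mirroring the `Direct` vs `Loki` difference:

```go
package main

import (
	"fmt"
	"sort"
)

// p95 returns the 95th-percentile value of xs using the nearest-rank method.
func p95(xs []float64) float64 {
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	idx := int(0.95 * float64(len(s)))
	if idx >= len(s) {
		idx = len(s) - 1
	}
	return s[idx]
}

func main() {
	// Synthetic latencies in ms: ten "10-second windows", each with nine
	// ~50 ms responses and one 500 ms spike.
	var raw []float64
	var windowAvgs []float64
	for w := 0; w < 10; w++ {
		sum := 0.0
		for i := 0; i < 10; i++ {
			v := 50.0
			if i == 9 {
				v = 500.0 // one outlier per window
			}
			raw = append(raw, v)
			sum += v
		}
		// Loki-style aggregation: one representative value per window.
		windowAvgs = append(windowAvgs, sum/10.0)
	}

	fmt.Printf("p95 over raw data points:  %.1f ms\n", p95(raw))        // 500.0 (spikes preserved)
	fmt.Printf("p95 over window averages:  %.1f ms\n", p95(windowAvgs)) // 95.0 (spikes smoothed away)
}
```

Running it prints a raw p95 of 500 ms but a windowed p95 of 95 ms: the per-window averaging hides the spikes entirely.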
book/src/libs/wasp/benchspy/loki_std.md

Lines changed: 21 additions & 16 deletions
@@ -62,7 +62,6 @@ require.NoError(t, storeErr, "failed to store baseline report", path)
 ```
 
 ## Step 3: Skip to Metrics Comparison
-
 Since the next steps are very similar to those in the first test, we’ll skip them and go straight to metrics comparison.
 
 By default, the `LokiQueryExecutor` returns results as the `[]string` data type. Let’s use dedicated convenience functions to cast them from `interface{}` to string slices:
@@ -74,40 +73,46 @@ previousAsStringSlice := benchspy.MustAllLokiResults(previousReport)
 
 ## Step 4: Compare Metrics
 
-Now, let’s compare metrics. Since we have `[]string`, we’ll first convert it to `[]float64`, calculate the median, and ensure the difference between the medians is less than 1%. Again, this is just an example—you should decide the best way to validate your metrics.
+Now, let’s compare metrics. Since we have `[]string`, we’ll first convert each series to `[]float64`, aggregate it to a single number, and ensure the difference between those numbers is less than 1%. Again, this is just an example—you should decide the best way to validate your metrics. Here we explicitly aggregate each metric using an average to get a single-number representation, but for your case a median, a percentile, or some other aggregate might be more appropriate.
 
 ```go
-var compareMedian = func(metricName string) {
-	require.NotEmpty(t, currentAsStringSlice[metricName], "%s results were missing from current report", metricName)
-	require.NotEmpty(t, previousAsStringSlice[metricName], "%s results were missing from previous report", metricName)
+var compareAverages = func(t *testing.T, metricName string, currentAsStringSlice, previousAsStringSlice map[string][]string) {
+	require.NotEmpty(t, currentAsStringSlice[metricName], "%s results were missing from current report", metricName)
+	require.NotEmpty(t, previousAsStringSlice[metricName], "%s results were missing from previous report", metricName)
 
 	currentFloatSlice, err := benchspy.StringSliceToFloat64Slice(currentAsStringSlice[metricName])
 	require.NoError(t, err, "failed to convert %s results to float64 slice", metricName)
-	currentMedian := benchspy.CalculatePercentile(currentFloatSlice, 0.5)
+	currentAverage, err := stats.Mean(currentFloatSlice)
+	require.NoError(t, err, "failed to calculate average for %s results", metricName)
 
 	previousFloatSlice, err := benchspy.StringSliceToFloat64Slice(previousAsStringSlice[metricName])
 	require.NoError(t, err, "failed to convert %s results to float64 slice", metricName)
-	previousMedian := benchspy.CalculatePercentile(previousFloatSlice, 0.5)
+	previousAverage, err := stats.Mean(previousFloatSlice)
+	require.NoError(t, err, "failed to calculate average for %s results", metricName)
 
 	var diffPercentage float64
-	if previousMedian != 0.0 && currentMedian != 0.0 {
-		diffPercentage = (currentMedian - previousMedian) / previousMedian * 100
-	} else if previousMedian == 0.0 && currentMedian == 0.0 {
+	if previousAverage != 0.0 && currentAverage != 0.0 {
+		diffPercentage = (currentAverage - previousAverage) / previousAverage * 100
+	} else if previousAverage == 0.0 && currentAverage == 0.0 {
 		diffPercentage = 0.0
 	} else {
 		diffPercentage = 100.0
 	}
-	assert.LessOrEqual(t, math.Abs(diffPercentage), 1.0, "%s medians are more than 1% different", metricName, fmt.Sprintf("%.4f", diffPercentage))
+	assert.LessOrEqual(t, math.Abs(diffPercentage), 1.0, "%s averages differ by more than 1%% (%.4f%%)", metricName, diffPercentage)
 }
 
-compareMedian(string(benchspy.MedianLatency))
-compareMedian(string(benchspy.Percentile95Latency))
-compareMedian(string(benchspy.ErrorRate))
+compareAverages(t, string(benchspy.MedianLatency), currentAsStringSlice, previousAsStringSlice)
+compareAverages(t, string(benchspy.Percentile95Latency), currentAsStringSlice, previousAsStringSlice)
+compareAverages(t, string(benchspy.MaxLatency), currentAsStringSlice, previousAsStringSlice)
+compareAverages(t, string(benchspy.ErrorRate), currentAsStringSlice, previousAsStringSlice)
 ```
 
 > [!WARNING]
 > Standard Loki metrics are all calculated using a 10-second moving window, which results in smoothing of values due to aggregation.
 > To learn what that means in detail, please refer to the [To Loki or Not to Loki](./loki_dillema.md) chapter.
+>
+> Also, due to the HTTP API endpoint used, namely `query_range`, all query results **are always returned as a slice**. Execution of **instant queries**
+> that return a single data point is currently **not supported**.
 
 ## What’s Next?
 

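A side note on the `stats` package that appears in the snippet above: it is assumed here to be [montanaflynn/stats](https://github.com/montanaflynn/stats), the same statistics library the `Direct` executor relies on (see `stats.Percentile` and `stats.Max` in the `wasp/benchspy/direct.go` diff below). Because `query_range` always returns a series of strings, any single-number check needs a parse-and-aggregate step first; a minimal standalone sketch of that step, with made-up sample values:

```go
package main

import (
	"fmt"
	"strconv"

	"github.com/montanaflynn/stats"
)

func main() {
	// Loki's query_range returns each metric as a series of string values,
	// e.g. median_latency samples in milliseconds.
	raw := []string{"51.2", "49.8", "50.5", "52.1"}

	// Parse the strings into floats.
	floats := make([]float64, 0, len(raw))
	for _, s := range raw {
		f, err := strconv.ParseFloat(s, 64)
		if err != nil {
			panic(err)
		}
		floats = append(floats, f)
	}

	// Aggregate the series into a single comparable number.
	avg, err := stats.Mean(floats)
	if err != nil {
		panic(err)
	}
	fmt.Printf("aggregated metric: %.2f ms\n", avg)
}
```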
book/src/libs/wasp/benchspy/overview.md

Lines changed: 6 additions & 2 deletions
@@ -10,6 +10,10 @@ BenchSpy (short for Benchmark Spy) is a [WASP](../overview.md)-coupled tool desi
 - **Standard/pre-defined metrics** for each data source.
 - **Ease of extensibility** with custom metrics.
 - **Ability to load the latest performance report** based on Git history.
-- **88% unit test coverage**.
 
 BenchSpy does not include any built-in comparison logic beyond ensuring that performance reports are comparable (e.g., they measure the same metrics in the same way), offering complete freedom to the user for interpretation and analysis.
+
+## Why might you need it?
+`BenchSpy` was created with two main goals in mind:
+* **measuring application performance programmatically**, and
+* **finding performance-related changes or regressions between different commits or releases**.

book/src/libs/wasp/benchspy/prometheus_std.md

Lines changed: 3 additions & 1 deletion
@@ -16,7 +16,7 @@ This constructor loads the URL from the environment variable `PROMETHEUS_URL` an
 
 > [!WARNING]
 > This example assumes that you have both the observability stack and basic node set running.
-> If you have the `CTF CLI`, you can start it by running: `ctf b ns`.
+> If you have the [CTF CLI](../../../framework/getting_started.md), you can start it by running: `ctf b ns`.
 
 > [!NOTE]
 > Matching containers **by name** should work for most k8s and Docker setups using the `CTFv2` observability stack.
@@ -53,8 +53,10 @@ require.NoError(t, storeErr, "failed to store baseline report", path)
 > Standard Prometheus metrics include:
 > - `median_cpu_usage`
 > - `median_mem_usage`
+> - `max_cpu_usage`
 > - `p95_cpu_usage`
 > - `p95_mem_usage`
+> - `max_mem_usage`
 >
 > These are calculated at the **container level**, based on total usage (user + system).
book/src/libs/wasp/benchspy/real_world.md

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
+# BenchSpy - Real world example
+
+Now that we have seen all the possible usages, you might wonder how to write a test that compares performance between different
+releases of your application.
+
+Usually, the steps to follow look like this:
+1. Write the performance test.
+2. At the end of the test, fetch a report, store it, and commit it to Git.
+3. Modify that step so that it both loads the latest stored report and creates a new one.
+4. Write your assertions for the metrics.
+
+## Writing the performance test
+We will use a simple mock for the application under test. All it does is wait for `50 ms` before
+returning a 200 response code.
+
+```go
+generator, err := wasp.NewGenerator(&wasp.Config{
+	T:           t,
+	GenName:     "vu",
+	CallTimeout: 100 * time.Millisecond,
+	LoadType:    wasp.VU,
+	Schedule:    wasp.Plain(10, 15*time.Second),
+	VU: wasp.NewMockVU(&wasp.MockVirtualUserConfig{
+		// simulated application latency
+		CallSleep: 50 * time.Millisecond,
+	}),
+})
+require.NoError(t, err)
+
+generator.Run(true)
+```
+
+## Generating the first report
+Here we generate a new performance report for `v1.0.0`. We will use the `Direct` query executor and save the report to a custom directory
+called `test_reports`. We will use this report later to compare the performance of newer versions.
+
+```go
+fetchCtx, cancelFn := context.WithTimeout(context.Background(), 60*time.Second)
+defer cancelFn()
+
+baseLineReport, err := benchspy.NewStandardReport(
+	"v1.0.0",
+	benchspy.WithStandardQueries(benchspy.StandardQueryExecutor_Direct),
+	benchspy.WithReportDirectory("test_reports"),
+	benchspy.WithGenerators(generator),
+)
+require.NoError(t, err, "failed to create baseline report")
+
+fetchErr := baseLineReport.FetchData(fetchCtx)
+require.NoError(t, fetchErr, "failed to fetch data for baseline report")
+
+path, storeErr := baseLineReport.Store()
+require.NoError(t, storeErr, "failed to store baseline report", path)
+```
+
+## Modifying report generation
+Now that we have a baseline report stored for `v1.0.0`, let's modify the test so that we can use it with future releases of our application.
+That means the code from the previous step changes to:
+```go
+fetchCtx, cancelFn := context.WithTimeout(context.Background(), 60*time.Second)
+defer cancelFn()
+
+currentReport, previousReport, err := benchspy.FetchNewStandardReportAndLoadLatestPrevious(
+	fetchCtx,
+	"v1.1.0",
+	benchspy.WithStandardQueries(benchspy.StandardQueryExecutor_Direct),
+	benchspy.WithReportDirectory("test_reports"),
+	benchspy.WithGenerators(generator),
+)
+require.NoError(t, err, "failed to fetch current report or load the previous one")
+```
+
+As you may remember, this function loads the latest report from the `test_reports` directory and fetches a current one, in this case for `v1.1.0`.
+
+## Adding assertions
+Let's assume we don't want any of the performance metrics to get more than **1% worse** between releases, and use a convenience function
+for the `Direct` query executor:
+```go
+hasErrors, errors := benchspy.CompareDirectWithThresholds(
+	1.0, // max 1% worse median latency
+	1.0, // max 1% worse p95 latency
+	1.0, // max 1% worse maximum latency
+	0.0, // no change in error rate allowed
+	currentReport, previousReport)
+require.False(t, hasErrors, fmt.Sprintf("errors found: %v", errors))
+```
+
+Done! You're ready to use `BenchSpy` to make sure that the performance of your application doesn't degrade beyond your chosen thresholds!
+
+> [!NOTE]
+> You can find a test example, in which the performance has degraded significantly, [here](https://github.com/smartcontractkit/chainlink-testing-framework/tree/main/wasp/examples/benchspy/direct_query_executor/direct_query_real_case.go).
+>
+> That test passes because we expect the performance to be worse. This is, of course, the opposite of what you should do for a real application :-)

book/src/libs/wasp/benchspy/simplest_metrics.md

Lines changed: 18 additions & 39 deletions
@@ -6,54 +6,33 @@ For example, if your query returns a time series, you could:
 - Compare each data point in the time series individually.
 - Compare aggregates like averages, medians, or min/max values of the time series.
 
-Each of these approaches has its pros and cons, and `BenchSpy` doesn't make any judgments here. In this example, we'll use a very simplified approach, which **should not be treated** as a gold standard. In our case, the `QueryExecutor` returns a single data point for each metric, eliminating the complexity. However, with `Loki` and `Prometheus`, things can get more complicated.
-
 ## Working with Built-in `QueryExecutors`
-
-Since each built-in `QueryExecutor` returns a different data type, and we use the `interface{}` type to reflect this, convenience functions help cast these results into more usable types:
+Each built-in `QueryExecutor` returns a different data type, and we use the `interface{}` type to reflect this. Since the `Direct` executor always returns `float64`, we have added a convenience function
+that checks whether any of the standard metrics has **degraded** by more than the given threshold. If the performance has improved, no error is returned.
 
 ```go
-currentAsFloat64 := benchspy.MustAllDirectResults(currentReport)
-previousAsFloat64 := benchspy.MustAllDirectResults(previousReport)
+hasErrors, errors := benchspy.CompareDirectWithThresholds(
+	// maximum allowed difference in percentages for:
+	1.0, // median latency
+	1.0, // p95 latency
+	1.0, // max latency
+	1.0, // error rate
+	currentReport,
+	previousReport,
+)
+require.False(t, hasErrors, fmt.Sprintf("errors found: %v", errors))
 ```
 
 > [!NOTE]
-> All standard metrics for the `DirectQueryExecutor` have the `float64` type.
-
-## Defining a Comparison Function
-
-Next, let's define a simple function to compare two floats and ensure the difference between them is smaller than 1%:
-
-```go
-var compareValues = func(
-	metricName string,
-	maxDiffPercentage float64,
-) {
-	require.NotNil(t, currentAsFloat64[metricName], "%s results were missing from current report", metricName)
-	require.NotNil(t, previousAsFloat64[metricName], "%s results were missing from previous report", metricName)
-
-	currentMetric := currentAsFloat64[metricName]
-	previousMetric := previousAsFloat64[metricName]
-
-	var diffPercentage float64
-	if previousMetric != 0.0 && currentMetric != 0.0 {
-		diffPercentage = (currentMetric - previousMetric) / previousMetric * 100
-	} else if previousMetric == 0.0 && currentMetric == 0.0 {
-		diffPercentage = 0.0
-	} else {
-		diffPercentage = 100.0
-	}
-	assert.LessOrEqual(t, math.Abs(diffPercentage), maxDiffPercentage, "%s medians are more than 1% different", metricName, fmt.Sprintf("%.4f", diffPercentage))
-}
-
-compareValues(string(benchspy.MedianLatency), 1.0)
-compareValues(string(benchspy.Percentile95Latency), 1.0)
-compareValues(string(benchspy.ErrorRate), 1.0)
-```
+> Both `Direct` and `Loki` query executors support the following standard performance metrics out of the box:
+> - `median_latency`
+> - `p95_latency`
+> - `max_latency`
+> - `error_rate`
 
 ## Wrapping Up
 
-And that's it! You've written your first test that uses `WASP` to generate load and `BenchSpy` to ensure that the median latency, 95th percentile latency, and error rate haven't changed significantly between runs. You accomplished this without even needing a Loki instance. But what if you wanted to leverage the power of `LogQL`? We'll explore that in the [next chapter](./loki_std.md).
+And that's it! You've written your first test that uses `WASP` to generate load and `BenchSpy` to ensure that the median latency, 95th percentile latency, max latency, and error rate haven't changed significantly between runs. You accomplished this without even needing a Loki instance. But what if you wanted to leverage the power of `LogQL`? We'll explore that in the [next chapter](./loki_std.md).
 
 > [!NOTE]
 > You can find the full example [here](https://github.com/smartcontractkit/chainlink-testing-framework/tree/main/wasp/examples/benchspy/direct_query_executor/direct_query_executor_test.go).

wasp/benchspy/direct.go

Lines changed: 13 additions & 1 deletion
@@ -156,7 +156,7 @@ func (g *DirectQueryExecutor) TimeRange(_, _ time.Time) {
 func (g *DirectQueryExecutor) generateStandardQueries() (map[string]DirectQueryFn, error) {
 	standardQueries := make(map[string]DirectQueryFn)
 
-	for _, metric := range standardLoadMetrics {
+	for _, metric := range StandardLoadMetrics {
 		query, err := g.standardQuery(metric)
 		if err != nil {
 			return nil, err
@@ -193,6 +193,18 @@ func (g *DirectQueryExecutor) standardQuery(standardMetric StandardLoadMetric) (
 			return stats.Percentile(asMiliDuration, 95)
 		}
 		return p95Fn, nil
+	case MaxLatency:
+		maxFn := func(responses *wasp.SliceBuffer[wasp.Response]) (float64, error) {
+			var asMiliDuration []float64
+			for _, response := range responses.Data {
+				// get the duration as nanoseconds and convert it to milliseconds in order not to lose precision;
+				// otherwise, the duration would be rounded to the nearest millisecond
+				asMiliDuration = append(asMiliDuration, float64(response.Duration.Nanoseconds())/1_000_000)
+			}
+
+			return stats.Max(asMiliDuration)
+		}
+		return maxFn, nil
 	case ErrorRate:
 		errorRateFn := func(responses *wasp.SliceBuffer[wasp.Response]) (float64, error) {
 			if len(responses.Data) == 0 {

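As an aside on the conversion in `maxFn` above: Go's `time.Duration.Milliseconds()` returns an `int64` and therefore truncates sub-millisecond detail, which is why the code divides raw nanoseconds by `1_000_000` as a `float64` instead. A tiny standalone sketch (not part of the commit) showing the difference:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	d := 50*time.Millisecond + 750*time.Microsecond // 50.75 ms

	// The built-in accessor truncates to whole milliseconds.
	fmt.Println(d.Milliseconds()) // 50

	// Dividing nanoseconds as float64 keeps sub-millisecond precision,
	// as done in DirectQueryExecutor's latency calculations.
	fmt.Println(float64(d.Nanoseconds()) / 1_000_000) // 50.75
}
```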