
Commit 9cd0fd7

cr changes, more tests, improved docs, real world example
1 parent 22a0417 commit 9cd0fd7

25 files changed: +2990 −147 lines

book/src/SUMMARY.md

Lines changed: 1 addition & 0 deletions
@@ -78,6 +78,7 @@
 - [Standard Prometheus metrics](./libs/wasp/benchspy/prometheus_std.md)
 - [Custom Prometheus metrics](./libs/wasp/benchspy/prometheus_custom.md)
 - [To Loki or not to Loki?](./libs/wasp/benchspy/loki_dillema.md)
+- [Real world example](./libs/wasp/benchspy/real_world.md)
 - [Reports](./libs/wasp/benchspy/reports/overview.md)
 - [Standard Report](./libs/wasp/benchspy/reports/standard_report.md)
 - [Adding new QueryExecutor](./libs/wasp/benchspy/reports/new_executor.md)

book/src/libs/wasp/benchspy/first_test.md

Lines changed: 1 addition & 0 deletions
@@ -67,6 +67,7 @@ require.NoError(t, storeErr, "failed to store baseline report", path)
 > For now, it's enough to know that the standard metrics provided by `StandardQueryExecutor_Direct` include:
 > - Median latency
 > - P95 latency (95th percentile)
+> - Max latency
 > - Error rate
 
 ### Step 3: Run the Test Again and Compare Reports

book/src/libs/wasp/benchspy/loki_dillema.md

Lines changed: 8 additions & 4 deletions
@@ -4,13 +4,11 @@ You might be wondering whether to use the `Loki` or `Direct` query executor if a
 
 ## Rule of Thumb
 
-If all you need is a single number, such as the median latency or error rate, and you're not interested in:
+You should opt for the `Direct` query executor if all you need is a single number, such as the median latency or error rate, and you're not interested in:
 - Comparing time series directly,
-- Examining minimum or maximum values, or
+- Examining minimum or maximum values over time, or
 - Performing advanced calculations on raw data,
 
-then you should opt for the `Direct` query executor.
-
 ## Why Choose `Direct`?
 
 The `Direct` executor returns a single value for each standard metric using the same raw data that Loki would use. It accesses data stored in the `WASP` generator, which is later pushed to Loki.
@@ -31,5 +29,11 @@ By using `Direct`, you save resources and simplify the process when advanced ana
 > - In the **`Direct` QueryExecutor**, the p95 is calculated across all raw data points, capturing the true variability of the dataset, including any extreme values or spikes.
 > - In the **`Loki` QueryExecutor**, the p95 is calculated over aggregated data (i.e. using the 10-second window). As a result, the raw values within each window are smoothed into a single representative value, potentially lowering or altering the calculated p95. For example, an outlier that would significantly affect the p95 in the `Direct` calculation might be averaged out in the `Loki` window, leading to a slightly lower percentile value.
 
+> #### Direct caveats:
+> - **buffer limitations:** `WASP` generators use a [StringBuffer](https://github.com/smartcontractkit/chainlink-testing-framework/blob/main/wasp/buffer.go) of fixed size to store the responses. Once full capacity is reached,
+> the oldest entries are replaced with incoming ones. The size of the buffer can be set in the generator's config. By default, it is limited to 50k entries to reduce resource consumption and avoid potential OOMs.
+>
+> - **sampling:** `WASP` generators support optional sampling of successful responses. It is disabled by default, but if you enable it, the calculations will no longer be done over the full dataset.
+
 > #### Key Takeaway:
 > The difference arises because `Direct` prioritizes precision by using raw data, while `Loki` prioritizes efficiency and scalability by using aggregated data. When interpreting results, it’s essential to consider how the smoothing effect of `Loki` might impact the representation of variability or extremes in the dataset. This is especially important for metrics like percentiles, where such details can significantly influence the outcome.

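To make the smoothing effect described above concrete, here is a small, self-contained Go sketch (not part of the commit; the latency values and window size are invented for illustration) that compares a p95 computed over raw data points with a p95 computed over per-window averages, mirroring the `Direct` vs `Loki` difference:

```go
package main

import (
	"fmt"
	"sort"
)

// p95 returns the 95th-percentile value of xs using the nearest-rank method.
func p95(xs []float64) float64 {
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	idx := int(0.95 * float64(len(s)))
	if idx >= len(s) {
		idx = len(s) - 1
	}
	return s[idx]
}

func main() {
	// Synthetic latencies in ms: ten "10-second windows", each with nine
	// ~50 ms responses and one 500 ms spike.
	var raw []float64
	var windowAvgs []float64
	for w := 0; w < 10; w++ {
		sum := 0.0
		for i := 0; i < 10; i++ {
			v := 50.0
			if i == 9 {
				v = 500.0 // one outlier per window
			}
			raw = append(raw, v)
			sum += v
		}
		// Loki-style aggregation: one representative value per window.
		windowAvgs = append(windowAvgs, sum/10.0)
	}

	fmt.Printf("p95 over raw data points:  %.1f ms\n", p95(raw))        // 500.0 (spikes preserved)
	fmt.Printf("p95 over window averages:  %.1f ms\n", p95(windowAvgs)) // 95.0 (spikes smoothed away)
}
```

Running it prints a raw p95 of 500 ms but a windowed p95 of 95 ms: the per-window averaging hides the spikes entirely.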
book/src/libs/wasp/benchspy/loki_std.md

Lines changed: 21 additions & 16 deletions
@@ -62,7 +62,6 @@ require.NoError(t, storeErr, "failed to store baseline report", path)
 ```
 
 ## Step 3: Skip to Metrics Comparison
-
 Since the next steps are very similar to those in the first test, we’ll skip them and go straight to metrics comparison.
 
 By default, the `LokiQueryExecutor` returns results as the `[]string` data type. Let’s use dedicated convenience functions to cast them from `interface{}` to string slices:
@@ -74,40 +73,46 @@ previousAsStringSlice := benchspy.MustAllLokiResults(previousReport)
 
 ## Step 4: Compare Metrics
 
-Now, let’s compare metrics. Since we have `[]string`, we’ll first convert it to `[]float64`, calculate the median, and ensure the difference between the medians is less than 1%. Again, this is just an example—you should decide the best way to validate your metrics.
+Now, let’s compare metrics. Since we have `[]string`, we’ll first convert each series to `[]float64`, aggregate it to a single number, and ensure the difference between those numbers is less than 1%. Again, this is just an example—you should decide the best way to validate your metrics. Here we explicitly aggregate each metric using an average to get a single-number representation, but for your case a median, a percentile, or some other aggregate might be more appropriate.
 
 ```go
-var compareMedian = func(metricName string) {
-	require.NotEmpty(t, currentAsStringSlice[metricName], "%s results were missing from current report", metricName)
-	require.NotEmpty(t, previousAsStringSlice[metricName], "%s results were missing from previous report", metricName)
+var compareAverages = func(t *testing.T, metricName string, currentAsStringSlice, previousAsStringSlice map[string][]string) {
+	require.NotEmpty(t, currentAsStringSlice[metricName], "%s results were missing from current report", metricName)
+	require.NotEmpty(t, previousAsStringSlice[metricName], "%s results were missing from previous report", metricName)
 
 	currentFloatSlice, err := benchspy.StringSliceToFloat64Slice(currentAsStringSlice[metricName])
 	require.NoError(t, err, "failed to convert %s results to float64 slice", metricName)
-	currentMedian := benchspy.CalculatePercentile(currentFloatSlice, 0.5)
+	currentAverage, err := stats.Mean(currentFloatSlice)
+	require.NoError(t, err, "failed to calculate average for %s results", metricName)
 
 	previousFloatSlice, err := benchspy.StringSliceToFloat64Slice(previousAsStringSlice[metricName])
 	require.NoError(t, err, "failed to convert %s results to float64 slice", metricName)
-	previousMedian := benchspy.CalculatePercentile(previousFloatSlice, 0.5)
+	previousAverage, err := stats.Mean(previousFloatSlice)
+	require.NoError(t, err, "failed to calculate average for %s results", metricName)
 
 	var diffPercentage float64
-	if previousMedian != 0.0 && currentMedian != 0.0 {
-		diffPercentage = (currentMedian - previousMedian) / previousMedian * 100
-	} else if previousMedian == 0.0 && currentMedian == 0.0 {
+	if previousAverage != 0.0 && currentAverage != 0.0 {
+		diffPercentage = (currentAverage - previousAverage) / previousAverage * 100
+	} else if previousAverage == 0.0 && currentAverage == 0.0 {
 		diffPercentage = 0.0
 	} else {
 		diffPercentage = 100.0
 	}
-	assert.LessOrEqual(t, math.Abs(diffPercentage), 1.0, "%s medians are more than 1% different", metricName, fmt.Sprintf("%.4f", diffPercentage))
+	assert.LessOrEqual(t, math.Abs(diffPercentage), 1.0, "%s averages differ by more than 1%% (%.4f%%)", metricName, diffPercentage)
 }
 
-compareMedian(string(benchspy.MedianLatency))
-compareMedian(string(benchspy.Percentile95Latency))
-compareMedian(string(benchspy.ErrorRate))
+compareAverages(t, string(benchspy.MedianLatency), currentAsStringSlice, previousAsStringSlice)
+compareAverages(t, string(benchspy.Percentile95Latency), currentAsStringSlice, previousAsStringSlice)
+compareAverages(t, string(benchspy.MaxLatency), currentAsStringSlice, previousAsStringSlice)
+compareAverages(t, string(benchspy.ErrorRate), currentAsStringSlice, previousAsStringSlice)
 ```
 
 > [!WARNING]
 > Standard Loki metrics are all calculated using a 10-second moving window, which results in smoothing of values due to aggregation.
 > To learn what that means in detail, please refer to the [To Loki or Not to Loki](./loki_dillema.md) chapter.
+>
+> Also, due to the HTTP API endpoint used, namely `query_range`, all query results **are always returned as a slice**. Execution of **instant queries**
+> that return a single data point is currently **not supported**.
 
 ## What’s Next?
 

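A side note on the `stats` package that appears in the snippet above: it is assumed here to be [montanaflynn/stats](https://github.com/montanaflynn/stats), the same statistics library the `Direct` executor relies on (see `stats.Percentile` and `stats.Max` in the `wasp/benchspy/direct.go` diff below). Because `query_range` always returns a series of strings, any single-number check needs a parse-and-aggregate step first; a minimal standalone sketch of that step, with made-up sample values:

```go
package main

import (
	"fmt"
	"strconv"

	"github.com/montanaflynn/stats"
)

func main() {
	// Loki's query_range returns each metric as a series of string values,
	// e.g. median_latency samples in milliseconds.
	raw := []string{"51.2", "49.8", "50.5", "52.1"}

	// Parse the strings into floats.
	floats := make([]float64, 0, len(raw))
	for _, s := range raw {
		f, err := strconv.ParseFloat(s, 64)
		if err != nil {
			panic(err)
		}
		floats = append(floats, f)
	}

	// Aggregate the series into a single comparable number.
	avg, err := stats.Mean(floats)
	if err != nil {
		panic(err)
	}
	fmt.Printf("aggregated metric: %.2f ms\n", avg)
}
```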
book/src/libs/wasp/benchspy/overview.md

Lines changed: 6 additions & 2 deletions
@@ -10,6 +10,10 @@ BenchSpy (short for Benchmark Spy) is a [WASP](../overview.md)-coupled tool desi
 - **Standard/pre-defined metrics** for each data source.
 - **Ease of extensibility** with custom metrics.
 - **Ability to load the latest performance report** based on Git history.
-- **88% unit test coverage**.
 
 BenchSpy does not include any built-in comparison logic beyond ensuring that performance reports are comparable (e.g., they measure the same metrics in the same way), offering complete freedom to the user for interpretation and analysis.
+
+## Why might you need it?
+`BenchSpy` was created with two main goals in mind:
+* **measuring application performance programmatically**, and
+* **finding performance-related changes or regressions between different commits or releases**.

book/src/libs/wasp/benchspy/prometheus_std.md

Lines changed: 3 additions & 1 deletion
@@ -16,7 +16,7 @@ This constructor loads the URL from the environment variable `PROMETHEUS_URL` an
 
 > [!WARNING]
 > This example assumes that you have both the observability stack and basic node set running.
-> If you have the `CTF CLI`, you can start it by running: `ctf b ns`.
+> If you have the [CTF CLI](../../../framework/getting_started.md), you can start it by running: `ctf b ns`.
 
 > [!NOTE]
 > Matching containers **by name** should work for most k8s and Docker setups using the `CTFv2` observability stack.
@@ -53,8 +53,10 @@ require.NoError(t, storeErr, "failed to store baseline report", path)
 > Standard Prometheus metrics include:
 > - `median_cpu_usage`
 > - `median_mem_usage`
+> - `max_cpu_usage`
 > - `p95_cpu_usage`
 > - `p95_mem_usage`
+> - `max_mem_usage`
 >
 > These are calculated at the **container level**, based on total usage (user + system).
book/src/libs/wasp/benchspy/real_world.md

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
+# BenchSpy - Real world example
+
+Now that we have seen all the possible usages, you might wonder how to write a test that compares performance between different
+releases of your application.
+
+Usually, the steps to follow look like this:
+1. Write the performance test.
+2. At the end of the test, fetch a report, store it, and commit it to Git.
+3. Modify that step so that it both loads the latest stored report and creates a new one.
+4. Write your assertions for the metrics.
+
+## Writing the performance test
+We will use a simple mock for the application under test. All it does is wait for `50 ms` before
+returning a 200 response code.
+
+```go
+generator, err := wasp.NewGenerator(&wasp.Config{
+	T:           t,
+	GenName:     "vu",
+	CallTimeout: 100 * time.Millisecond,
+	LoadType:    wasp.VU,
+	Schedule:    wasp.Plain(10, 15*time.Second),
+	VU: wasp.NewMockVU(&wasp.MockVirtualUserConfig{
+		// simulated application latency
+		CallSleep: 50 * time.Millisecond,
+	}),
+})
+require.NoError(t, err)
+
+generator.Run(true)
+```
+
+## Generating the first report
+Here we generate a new performance report for `v1.0.0`. We will use the `Direct` query executor and save the report to a custom directory
+called `test_reports`. We will use this report later to compare the performance of newer versions.
+
+```go
+fetchCtx, cancelFn := context.WithTimeout(context.Background(), 60*time.Second)
+defer cancelFn()
+
+baseLineReport, err := benchspy.NewStandardReport(
+	"v1.0.0",
+	benchspy.WithStandardQueries(benchspy.StandardQueryExecutor_Direct),
+	benchspy.WithReportDirectory("test_reports"),
+	benchspy.WithGenerators(generator),
+)
+require.NoError(t, err, "failed to create baseline report")
+
+fetchErr := baseLineReport.FetchData(fetchCtx)
+require.NoError(t, fetchErr, "failed to fetch data for baseline report")
+
+path, storeErr := baseLineReport.Store()
+require.NoError(t, storeErr, "failed to store baseline report", path)
+```
+
+## Modifying report generation
+Now that we have a baseline report stored for `v1.0.0`, let's modify the test so that we can use it with future releases of our application.
+That means the code from the previous step changes to:
+```go
+fetchCtx, cancelFn := context.WithTimeout(context.Background(), 60*time.Second)
+defer cancelFn()
+
+currentReport, previousReport, err := benchspy.FetchNewStandardReportAndLoadLatestPrevious(
+	fetchCtx,
+	"v1.1.0",
+	benchspy.WithStandardQueries(benchspy.StandardQueryExecutor_Direct),
+	benchspy.WithReportDirectory("test_reports"),
+	benchspy.WithGenerators(generator),
+)
+require.NoError(t, err, "failed to fetch current report or load the previous one")
+```
+
+As you may remember, this function loads the latest report from the `test_reports` directory and fetches a current one, in this case for `v1.1.0`.
+
+## Adding assertions
+Let's assume we don't want any of the performance metrics to get more than **1% worse** between releases, and use a convenience function
+for the `Direct` query executor:
+```go
+hasErrors, errors := benchspy.CompareDirectWithThresholds(
+	1.0, // max 1% worse median latency
+	1.0, // max 1% worse p95 latency
+	1.0, // max 1% worse maximum latency
+	0.0, // no change in error rate allowed
+	currentReport, previousReport)
+require.False(t, hasErrors, fmt.Sprintf("errors found: %v", errors))
+```
+
+Done! You're ready to use `BenchSpy` to make sure that the performance of your application doesn't degrade beyond your chosen thresholds!
+
+> [!NOTE]
+> You can find a test example, in which the performance has degraded significantly, [here](https://github.com/smartcontractkit/chainlink-testing-framework/tree/main/wasp/examples/benchspy/direct_query_executor/direct_query_real_case.go).
+>
+> That test passes because we expect the performance to be worse. This is, of course, the opposite of what you should do for a real application :-)

book/src/libs/wasp/benchspy/simplest_metrics.md

Lines changed: 18 additions & 39 deletions
@@ -6,54 +6,33 @@ For example, if your query returns a time series, you could:
 - Compare each data point in the time series individually.
 - Compare aggregates like averages, medians, or min/max values of the time series.
 
-Each of these approaches has its pros and cons, and `BenchSpy` doesn't make any judgments here. In this example, we'll use a very simplified approach, which **should not be treated** as a gold standard. In our case, the `QueryExecutor` returns a single data point for each metric, eliminating the complexity. However, with `Loki` and `Prometheus`, things can get more complicated.
-
 ## Working with Built-in `QueryExecutors`
-
-Since each built-in `QueryExecutor` returns a different data type, and we use the `interface{}` type to reflect this, convenience functions help cast these results into more usable types:
+Each built-in `QueryExecutor` returns a different data type, and we use the `interface{}` type to reflect this. Since the `Direct` executor always returns `float64`, we have added a convenience function
+that checks whether any of the standard metrics has **degraded** by more than the given threshold. If the performance has improved, no error is returned.
 
 ```go
-currentAsFloat64 := benchspy.MustAllDirectResults(currentReport)
-previousAsFloat64 := benchspy.MustAllDirectResults(previousReport)
+hasErrors, errors := benchspy.CompareDirectWithThresholds(
+	// maximum allowed difference in percentages for:
+	1.0, // median latency
+	1.0, // p95 latency
+	1.0, // max latency
+	1.0, // error rate
+	currentReport,
+	previousReport,
+)
+require.False(t, hasErrors, fmt.Sprintf("errors found: %v", errors))
 ```
 
 > [!NOTE]
-> All standard metrics for the `DirectQueryExecutor` have the `float64` type.
-
-## Defining a Comparison Function
-
-Next, let's define a simple function to compare two floats and ensure the difference between them is smaller than 1%:
-
-```go
-var compareValues = func(
-	metricName string,
-	maxDiffPercentage float64,
-) {
-	require.NotNil(t, currentAsFloat64[metricName], "%s results were missing from current report", metricName)
-	require.NotNil(t, previousAsFloat64[metricName], "%s results were missing from previous report", metricName)
-
-	currentMetric := currentAsFloat64[metricName]
-	previousMetric := previousAsFloat64[metricName]
-
-	var diffPercentage float64
-	if previousMetric != 0.0 && currentMetric != 0.0 {
-		diffPercentage = (currentMetric - previousMetric) / previousMetric * 100
-	} else if previousMetric == 0.0 && currentMetric == 0.0 {
-		diffPercentage = 0.0
-	} else {
-		diffPercentage = 100.0
-	}
-	assert.LessOrEqual(t, math.Abs(diffPercentage), maxDiffPercentage, "%s medians are more than 1% different", metricName, fmt.Sprintf("%.4f", diffPercentage))
-}
-
-compareValues(string(benchspy.MedianLatency), 1.0)
-compareValues(string(benchspy.Percentile95Latency), 1.0)
-compareValues(string(benchspy.ErrorRate), 1.0)
-```
+> Both `Direct` and `Loki` query executors support the following standard performance metrics out of the box:
+> - `median_latency`
+> - `p95_latency`
+> - `max_latency`
+> - `error_rate`
 
 ## Wrapping Up
 
-And that's it! You've written your first test that uses `WASP` to generate load and `BenchSpy` to ensure that the median latency, 95th percentile latency, and error rate haven't changed significantly between runs. You accomplished this without even needing a Loki instance. But what if you wanted to leverage the power of `LogQL`? We'll explore that in the [next chapter](./loki_std.md).
+And that's it! You've written your first test that uses `WASP` to generate load and `BenchSpy` to ensure that the median latency, 95th percentile latency, max latency, and error rate haven't changed significantly between runs. You accomplished this without even needing a Loki instance. But what if you wanted to leverage the power of `LogQL`? We'll explore that in the [next chapter](./loki_std.md).
 
 > [!NOTE]
 > You can find the full example [here](https://github.com/smartcontractkit/chainlink-testing-framework/tree/main/wasp/examples/benchspy/direct_query_executor/direct_query_executor_test.go).

wasp/benchspy/direct.go

Lines changed: 13 additions & 1 deletion
@@ -156,7 +156,7 @@ func (g *DirectQueryExecutor) TimeRange(_, _ time.Time) {
 func (g *DirectQueryExecutor) generateStandardQueries() (map[string]DirectQueryFn, error) {
 	standardQueries := make(map[string]DirectQueryFn)
 
-	for _, metric := range standardLoadMetrics {
+	for _, metric := range StandardLoadMetrics {
 		query, err := g.standardQuery(metric)
 		if err != nil {
 			return nil, err
@@ -193,6 +193,18 @@ func (g *DirectQueryExecutor) standardQuery(standardMetric StandardLoadMetric) (
 			return stats.Percentile(asMiliDuration, 95)
 		}
 		return p95Fn, nil
+	case MaxLatency:
+		maxFn := func(responses *wasp.SliceBuffer[wasp.Response]) (float64, error) {
+			var asMiliDuration []float64
+			for _, response := range responses.Data {
+				// get the duration as nanoseconds and convert it to milliseconds in order not to lose precision;
+				// otherwise, the duration would be rounded to the nearest millisecond
+				asMiliDuration = append(asMiliDuration, float64(response.Duration.Nanoseconds())/1_000_000)
+			}
+
+			return stats.Max(asMiliDuration)
+		}
+		return maxFn, nil
 	case ErrorRate:
 		errorRateFn := func(responses *wasp.SliceBuffer[wasp.Response]) (float64, error) {
 			if len(responses.Data) == 0 {

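As an aside on the conversion in `maxFn` above: Go's `time.Duration.Milliseconds()` returns an `int64` and therefore truncates sub-millisecond detail, which is why the code divides raw nanoseconds by `1_000_000` as a `float64` instead. A tiny standalone sketch (not part of the commit) showing the difference:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	d := 50*time.Millisecond + 750*time.Microsecond // 50.75 ms

	// The built-in accessor truncates to whole milliseconds.
	fmt.Println(d.Milliseconds()) // 50

	// Dividing nanoseconds as float64 keeps sub-millisecond precision,
	// as done in DirectQueryExecutor's latency calculations.
	fmt.Println(float64(d.Nanoseconds()) / 1_000_000) // 50.75
}
```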