bencheeorg · PragTob · Oct 19, 2025 · Jul 6, 2025
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -8,12 +8,24 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## Unreleased
 
 ## Features (User Facing)
-* Introduce `max_sample_size` which guides how many samples will be gathered at most for a given scenario. This avoids a variety of issues when scenarios gather too many samples (memory consumption etc). Defaults to `1_000_000`, setting it to `nil` gathers unlimited samples again (behavior before this version).
+* Introduce `max_sample_size`, which limits how many samples are gathered at most for a given scenario.
+This avoids a variety of issues when scenarios gather too many samples (memory consumption, statistics taking long to calculate, formatters hanging/not working).
+Defaults to `1_000_000`; setting it to `nil` gathers unlimited samples again (behavior before this version).
+* Introduce the `exclude_outliers` option which, when set to `true`, automatically excludes outliers from the gathered samples when calculating statistics.
+This is especially useful for run time measurements, where it removes samples skewed by garbage collection or other external factors.
+Defaults to `false`.
+Shout out to [@NickNeck](https://github.com/NickNeck) who implemented this long-wished-for feature over in `Statistex`.
 
 ### Bugfixes (User Facing)
 * fixed a bug where if times were supplied as `0` instead of `0.0` we'd sometimes gather a single measurement
 * elixir `1.19` compilation warnings have been removed
 
+### Features (Plugins)
+* The `%Benchee.Statistics{}` struct now comes with values to accompany the outlier exclusion feature:
+  * `outliers` - if outlier exclusion was enabled, the samples that were identified as outliers, an empty list otherwise
+  * `lower_outlier_bound` - value below which samples are considered outliers
+  * `upper_outlier_bound` - value above which samples are considered outliers
+
 ## 1.4.0 (2025-04-14)
 
 Some nice features (`pre_check: :all_same` is cool) along with adding support for some new stuff (`tprof`) and fixing some bugs.
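
Review note on the new `%Benchee.Statistics{}` fields: a minimal sketch of how a plugin or script might read them from a finished suite. `OutlierReport` and `run_time_outliers/1` are hypothetical names made up for illustration and are not part of Benchee. The sketch assumes the statistics live under `scenario.run_time_data.statistics`, and only reads the three fields named in the changelog entry above.

```elixir
defmodule OutlierReport do
  # Hypothetical helper, not part of Benchee.
  # Collects the outlier-related statistics fields for the run time of every scenario.
  def run_time_outliers(suite) do
    Enum.map(suite.scenarios, fn scenario ->
      stats = scenario.run_time_data.statistics

      %{
        name: scenario.name,
        outliers: stats.outliers,
        lower_outlier_bound: stats.lower_outlier_bound,
        upper_outlier_bound: stats.upper_outlier_bound
      }
    end)
  end
end

# Possible usage, assuming a Benchee version that includes `exclude_outliers`:
#
#   suite = Benchee.run(%{"sort" => fn -> Enum.sort([3, 1, 2]) end}, exclude_outliers: true)
#   IO.inspect(OutlierReport.run_time_outliers(suite))
```

Because the helper only uses field access, it works on any map with the same shape, which keeps it easy to try out without running a benchmark.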

diff --git a/README.md b/README.md
@@ -26,9 +26,9 @@ Produces the following output on the console:
 Operating System: Linux
 CPU Information: AMD Ryzen 9 5900X 12-Core Processor
 Number of Available Cores: 24
-Available memory: 31.25 GB
-Elixir 1.16.0-rc.1
-Erlang 26.1.2
+Available memory: 31.26 GB
+Elixir 1.19.0
+Erlang 28.1
 JIT enabled: true
 
 Benchmark suite executing with the following configuration:
@@ -39,25 +39,26 @@ reduction time: 0 ns
 parallel: 1
 inputs: none specified
 Estimated total run time: 28 s
+Excluding outliers: false
 
 Benchmarking flat_map ...
 Benchmarking map.flatten ...
 Calculating statistics...
 Formatting results...
 
 Name                  ips        average  deviation         median         99th %
-flat_map           3.79 K      263.87 μs    ±15.49%      259.47 μs      329.29 μs
-map.flatten        1.96 K      509.19 μs    ±51.36%      395.23 μs     1262.27 μs
+flat_map           3.96 K      252.74 μs    ±15.64%      247.61 μs      321.85 μs
+map.flatten        1.84 K      543.57 μs    ±44.18%      414.16 μs     1223.92 μs
 
 Comparison:
-flat_map           3.79 K
-map.flatten        1.96 K - 1.93x slower +245.32 μs
+flat_map           3.96 K
+map.flatten        1.84 K - 2.15x slower +290.83 μs
 
 Memory usage statistics:
 
 Name           Memory usage
-flat_map             625 KB
-map.flatten       781.25 KB - 1.25x memory usage +156.25 KB
+flat_map          624.97 KB
+map.flatten       781.25 KB - 1.25x memory usage +156.28 KB
 
 **All measurements for memory usage were the same**
 ```
@@ -83,6 +84,7 @@ The aforementioned [plugins](#plugins) like [benchee_html](https://github.com/be
   - [Formatters](#formatters)
     - [Console Formatter options](#console-formatter-options)
   - [Profiling after a run](#profiling-after-a-run)
+  - [Remove Outliers](#remove-outliers)
   - [Saving, loading and comparing previous runs](#saving-loading-and-comparing-previous-runs)
   - [Hooks (Setup, Teardown etc.)](#hooks-setup-teardown-etc)
     - [Suite hooks](#suite-hooks)
@@ -115,6 +117,7 @@ The aforementioned [plugins](#plugins) like [benchee_html](https://github.com/be
 * as precise as it can get, measure with up to nanosecond precision (Operating System dependent)
 * nicely formatted console output with units scaled to appropriately (nanoseconds to minutes)
 * (optionally) measures the overhead of function calls so that the measured/reported times really are the execution time of _your_code_ without that overhead.
+* (optionally) [removes outliers](#remove-outliers)
 * [hooks](#hooks-setup-teardown-etc) to execute something before/after a benchmarking invocation, without it impacting the measured time
 * execute benchmark jobs in parallel to gather more results in the same time, or simulate a system under load
 * well documented & well tested
@@ -136,6 +139,8 @@ In addition, you can optionally output an extended set of statistics:
 * **sample size** - the number of measurements taken
 * **mode**        - the measured values that occur the most. Often one value, but can be multiple values if they occur exactly as often. If no value occurs at least twice, this value will be `nil`.
 
+Benchee can also [remove outliers](#remove-outliers).
+
 ## Installation
 
 Add `:benchee` to your list of dependencies in `mix.exs`:
@@ -263,6 +268,11 @@ The available options are the following (also documented in [hexdocs](https://he
   This is used to limit memory consumption and unnecessary processing - 1 Million samples is plenty.
   This limit also applies to number of iterations done during warmup.
   You can set your own number or set it to `nil` if you don't want any limit.
+* `exclude_outliers` - whether or not statistical outliers should be excluded when calculating statistics.
+Defaults to `false`.
+When enabled, values that are far outside the usual range (as determined by the percentiles/quantiles) are
+left out of the calculated statistics. You might want to enable this if you
+don't want things like garbage collection runs to influence your results as much.
 
 ### Metrics to measure
 
@@ -303,6 +313,7 @@ So, what happens if a function executes too fast for Benchee to measure? If Benc
 * essentially every single measurement is now an average across 10 runs making lots of statistics less meaningful
 
 Benchee will print a big warning when this happens.
+
 #### Measuring Memory Consumption
 
 Starting with version 0.13, users can now get measurements of how much memory their benchmarked scenarios use. The measurement is **limited to the process that Benchee executes your provided code in** - i.e. other processes (like worker pools)/the whole BEAM isn't taken into account.
@@ -542,6 +553,27 @@ Enum."-map/2-lists^map/1-0-"/2                  10001 26.38 2282    0.23
 
 **Note about after_each hooks:** `after_each` hooks currently don't work when profiling a function, as they are not passed the return value of the function after the profiling run. It's already fixed on the elixir side and is waiting for release, likely in 1.14. It should then just work.
 
+### Remove Outliers
+
+Benchee can remove outliers from the gathered samples while calculating statistics.
+Outliers are determined by percentiles/quantiles (we follow [this approach](https://en.wikipedia.org/wiki/Interquartile_range#Outliers)).
+
+You might consider excluding outliers for extreme micro/nano-benchmarks where individual results can be skewed a lot by garbage collection.
+
+Pass `exclude_outliers: true` to Benchee to enable the removal of outliers.
+
+```elixir
+Benchee.run(jobs, exclude_outliers: true)
+```
+
+The outliers themselves (i.e. the samples that have been determined to be outliers)
+as well as the lower/upper bounds beyond which samples are considered outliers are accessible
+in the `Benchee.Statistics` struct.
+
+The samples themselves still include the outliers; they are only removed for calculating statistics.
+
+Benchee doesn't print the outliers yet, but you can inspect the resulting data structures if you're interested (or send a PR :) ).
+
 ### Saving, loading and comparing previous runs
 
 Benchee can store the results of previous runs in a file and then load them again to compare them. For example this is useful to compare what was recorded on the main branch against a branch with performance improvements. You may also use this to benchmark across different exlixir/erlang versions.
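
Review note on the "Remove Outliers" section: the linked interquartile range approach can be sketched in a few lines of Elixir. This is an illustration only, not the Statistex implementation. `OutlierSketch` and its functions are hypothetical names, the `1.5 * IQR` fence factor comes from the linked Wikipedia section, and the percentile interpolation used here is an assumption that may differ from what Statistex does.

```elixir
defmodule OutlierSketch do
  # Hypothetical illustration of the IQR rule, not part of Benchee or Statistex.

  # Returns {lower_bound, upper_bound}; samples outside this range count as outliers.
  def bounds(samples) when samples != [] do
    sorted = Enum.sort(samples)
    q1 = percentile(sorted, 25)
    q3 = percentile(sorted, 75)
    iqr = q3 - q1

    {q1 - 1.5 * iqr, q3 + 1.5 * iqr}
  end

  # Splits samples into those kept for statistics and those flagged as outliers.
  def split(samples) do
    {lower, upper} = bounds(samples)
    {outliers, kept} = Enum.split_with(samples, fn s -> s < lower or s > upper end)

    %{kept: kept, outliers: outliers, lower_outlier_bound: lower, upper_outlier_bound: upper}
  end

  # Percentile via linear interpolation between the two closest ranks.
  # Assumption: other interpolation schemes exist and give slightly different bounds.
  defp percentile(sorted, p) do
    last_index = length(sorted) - 1
    rank = p / 100 * last_index
    lower_index = trunc(rank)
    fraction = rank - lower_index
    lower_value = Enum.at(sorted, lower_index)
    upper_value = Enum.at(sorted, min(lower_index + 1, last_index))

    lower_value + fraction * (upper_value - lower_value)
  end
end

result = OutlierSketch.split([10, 11, 12, 13, 14, 15, 16, 17, 100])
IO.inspect(result.outliers)
```

With these nine samples the quartiles are 12 and 16, so the bounds are 6.0 and 22.0 and only the sample `100` is flagged. The original list is left untouched, which mirrors the README statement that samples still include the outliers.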

diff --git a/lib/benchee/benchmark/runner.ex b/lib/benchee/benchmark/runner.ex
@@ -6,8 +6,10 @@ defmodule Benchee.Benchmark.Runner do
   # This module actually runs our benchmark scenarios, adding information about
   # run time and memory usage to each scenario.
 
+  alias Benchee.Benchmark
   alias Benchee.Benchmark.BenchmarkConfig
-  alias Benchee.{Benchmark, Scenario, Utility.Parallel}
+  alias Benchee.Scenario
+  alias Benchee.Utility.Parallel
 
   alias Benchmark.{
     Collect,

diff --git a/lib/benchee/configuration.ex b/lib/benchee/configuration.ex
@@ -48,7 +48,8 @@ defmodule Benchee.Configuration do
             # It also generates less than 1GB in data (some of which is garbage collected/
             # not necessarily all in RAM at the same time) - which seems reasonable enough.
             # see `samples/statistics_performance.exs` and also maybe run it yourself.
-            max_sample_size: 1_000_000
+            max_sample_size: 1_000_000,
+            exclude_outliers: false
 
   @typedoc """
   The configuration supplied by the user as either a map or a keyword list
@@ -152,6 +153,11 @@ defmodule Benchee.Configuration do
     This is used to limit memory consumption and unnecessary processing - 1 Million samples is plenty.
     This limit also applies to number of iterations done during warmup.
     You can set your own number or set it to `nil` if you don't want any limit.
+    * `exclude_outliers` - whether or not statistical outliers should be excluded when calculating statistics.
+    Defaults to `false`.
+    When enabled, values that are far outside the usual range (as determined by the percentiles/quantiles) are
+    left out of the calculated statistics. You might want to enable this if you
+    don't want things like garbage collection runs to influence your results as much.
   """
   @type user_configuration :: map | keyword
 
@@ -183,7 +189,8 @@ defmodule Benchee.Configuration do
           measure_function_call_overhead: boolean,
           title: String.t() | nil,
           profile_after: boolean | atom | {atom, keyword},
-          max_sample_size: pos_integer()
+          max_sample_size: pos_integer(),
+          exclude_outliers: boolean()
         }
 
   @time_keys [:time, :warmup, :memory_time, :reduction_time]

diff --git a/lib/benchee/formatters/console.ex b/lib/benchee/formatters/console.ex
@@ -33,8 +33,8 @@ defmodule Benchee.Formatters.Console do
 
   @behaviour Benchee.Formatter
 
-  alias Benchee.Suite
   alias Benchee.Formatters.Console.{Memory, Reductions, RunTime}
+  alias Benchee.Suite
 
   @doc """
   Formats the benchmark statistics to a report suitable for output on the CLI.

diff --git a/lib/benchee/output/benchmark_printer.ex b/lib/benchee/output/benchmark_printer.ex
@@ -69,7 +69,8 @@ defmodule Benchee.Output.BenchmarkPrinter do
          warmup: warmup,
          inputs: inputs,
          memory_time: memory_time,
-         reduction_time: reduction_time
+         reduction_time: reduction_time,
+         exclude_outliers: exclude_outliers
        }) do
     scenario_count = length(scenarios)
     exec_time = warmup + time + memory_time + reduction_time
@@ -84,6 +85,7 @@ defmodule Benchee.Output.BenchmarkPrinter do
     parallel: #{parallel}
     inputs: #{inputs_out(inputs)}
     Estimated total run time: #{Duration.format_human(total_time)}
+    Excluding outliers: #{exclude_outliers}
     """)
   end