Merged
9 changes: 9 additions & 0 deletions experiments/regression/README.md
@@ -0,0 +1,9 @@
# Regression Experiments

This folder defines all the regression detector experiments.

The regression detector is a tool provided by Single Machine Performance (SMP) through the `smp` CLI.

It allows us to performance-test lading under different scenarios and loads.

IMPORTANT: Any local benchmark should be supplemented with an SMP regression detector experiment that exercises the same workload.
21 changes: 21 additions & 0 deletions experiments/regression/cases/http_ascii_1000mib/README.md
@@ -0,0 +1,21 @@
# HTTP ASCII 1000 MiB/s

Resource usage of lading under a steady 1000 MiB/s HTTP load with ASCII payloads.

## What

Runs lading as a target with an HTTP generator sending ASCII payloads at 1000 MiB/s to its own HTTP blackhole. This exercises the HTTP generator, ASCII payload construction, and HTTP blackhole within a single lading instance at maximum throughput.

## Why

Establishes a baseline for lading's resource consumption under an extreme HTTP workload. Regressions here indicate overhead in the HTTP generator or payload path at high throughput.

## Paired Benchmark

This experiment is paired with the local benchmark in `lading_payload/benches/ascii.rs`.
If throughput sizes change in either place, update the other to match.

## Enforcements

Memory usage is enforced by bounding `total_pss_bytes`.
CPU usage is enforced by bounding `avg(total_cpu_usage_millicores)`.
27 changes: 27 additions & 0 deletions experiments/regression/cases/http_ascii_1000mib/experiment.yaml
@@ -0,0 +1,27 @@
optimization_goal: memory
erratic: false

target:
name: lading-target
# Setting the experiment duration to infinite means lading will not go through the shutdown process.
# This is acceptable since shutdown performance is not a concern for this experiment.
command: "/usr/bin/lading --no-target --experiment-duration-infinite --config-path /etc/lading-target/lading.yaml"
cpu_allotment: 2
memory_allotment: 1250 MiB
Contributor:
I'm surprised this doesn't OOM given the 5GiB prebuild cache - am I missing something?

Contributor Author:

It's the issue that codex spotted above.

lading is limiting the cache size to 1G

imo, this feels like a bug in lading that I'd like to address


environment:
DD_SERVICE: lading-target
RUST_LOG: info
RUST_BACKTRACE: 1

checks:
- name: memory_usage
description: "Memory usage quality gate. Bounds total memory usage."
bounds:
series: total_pss_bytes
upper_bound: "1130 MiB"
- name: cpu_usage
description: "CPU usage quality gate. Bounds total average millicore usage."
bounds:
series: avg(total_cpu_usage_millicores)
upper_bound: 400
@@ -0,0 +1,18 @@
generator:
- http:
seed: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32]
headers: {}
target_uri: "http://127.0.0.1:8080/"
bytes_per_second: "1000 MiB"
parallel_connections: 1

P2: Guard this case against silently under-driving the 1 GiB/s load

With the default maximum_block_size of 1 MiB (lading_payload::block::default_maximum_block_size), this config has to complete about 1000 HTTP request/response cycles per second to really deliver 1000 MiB/s. Http::spin only allows one in-flight request per configured connection, so parallel_connections: 1 makes the highest-throughput case depend on a single loopback connection keeping up. When that does not hold on a given runner, lading simply falls behind the target rate and experiment.yaml still passes because it only bounds CPU/PSS, not achieved bytes, so the "1000 MiB/s" regression gate ends up measuring an unknown lower load.

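The arithmetic behind this concern can be sketched as follows; `required_rps` is an illustrative helper, not lading code, and the figures come from the comment above (1 MiB default block size, one in-flight request per connection):

```rust
const MIB: u64 = 1024 * 1024;

// Request/response cycles per second needed to sustain a byte rate when
// each request carries at most one block of `block_size` bytes.
fn required_rps(bytes_per_second: u64, block_size: u64) -> u64 {
    bytes_per_second / block_size
}

fn main() {
    // 1000 MiB/s over 1 MiB blocks: ~1000 cycles per second on a single
    // loopback connection, i.e. roughly a 1 ms budget per round trip.
    let rps = required_rps(1000 * MIB, MIB);
    assert_eq!(rps, 1000);
    println!("required requests/s: {rps}");
}
```

If any one round trip regularly exceeds that budget, the generator falls behind the advertised rate without any check noticing.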

Contributor Author:

imo, this feels like a bug that isn't being caught at runtime.

I'd prefer for lading to fail in some way rather than silently "fail".

method:
post:
maximum_prebuild_cache_size_bytes: "5 GiB"

P1: Keep prebuild cache size within u32 limits

Setting maximum_prebuild_cache_size_bytes to "5 GiB" here silently overflows when the HTTP generator builds its cache: Http::new converts this byte value with as u32 (lading/src/generator/http.rs), so values above u32::MAX wrap instead of erroring. In this case, the intended 5 GiB cache becomes 1 GiB at runtime, which materially changes memory behavior and makes the 1000 MiB/s regression gate measure a different workload than configured.

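A minimal sketch of the wrap-around described above — `cache_size_unchecked` is a hypothetical stand-in for the `as u32` conversion the comment points at, not lading's actual function:

```rust
/// Hypothetical stand-in for the unchecked conversion described above.
fn cache_size_unchecked(bytes: u64) -> u32 {
    bytes as u32 // silent modulo-2^32 wrap for values above u32::MAX
}

fn main() {
    let configured: u64 = 5 * 1024 * 1024 * 1024; // "5 GiB"
    // The intended 5 GiB becomes exactly 1 GiB at runtime:
    assert_eq!(u64::from(cache_size_unchecked(configured)), 1024 * 1024 * 1024);
    // A checked conversion surfaces the error instead of wrapping:
    assert!(u32::try_from(configured).is_err());
}
```

5 GiB is 5368709120 bytes; modulo 2^32 that is 1073741824, i.e. 1 GiB, which matches the behavior the reviewers observed.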

variant: "ascii"

blackhole:
- http:
binding_addr: "0.0.0.0:8080"

telemetry:
addr: "0.0.0.0:9000"
@@ -0,0 +1,9 @@
generator: []

blackhole:
- http:
binding_addr: "127.0.0.1:9091"

target_metrics:
- prometheus:
uri: "http://127.0.0.1:9000/metrics"
21 changes: 21 additions & 0 deletions experiments/regression/cases/http_ascii_100mib/README.md
@@ -0,0 +1,21 @@
# HTTP ASCII 100 MiB/s

Resource usage of lading under a steady 100 MiB/s HTTP load with ASCII payloads.

## What

Runs lading as a target with an HTTP generator sending ASCII payloads at 100 MiB/s to its own HTTP blackhole. This exercises the HTTP generator, ASCII payload construction, and HTTP blackhole within a single lading instance.

## Why

Establishes a baseline for lading's resource consumption under a high HTTP workload. Regressions here indicate overhead in the HTTP generator or payload path independent of payload complexity.

## Paired Benchmark

This experiment is paired with the local benchmark in `lading_payload/benches/ascii.rs`.
If throughput sizes change in either place, update the other to match.

## Enforcements

Memory usage is enforced by bounding `total_pss_bytes`.
CPU usage is enforced by bounding `avg(total_cpu_usage_millicores)`.
27 changes: 27 additions & 0 deletions experiments/regression/cases/http_ascii_100mib/experiment.yaml
@@ -0,0 +1,27 @@
optimization_goal: memory
erratic: false

target:
name: lading-target
# Setting the experiment duration to infinite means lading will not go through the shutdown process.
# This is acceptable since shutdown performance is not a concern for this experiment.
command: "/usr/bin/lading --no-target --experiment-duration-infinite --config-path /etc/lading-target/lading.yaml"
cpu_allotment: 2
memory_allotment: 630 MiB

environment:
DD_SERVICE: lading-target
RUST_LOG: info
RUST_BACKTRACE: 1

checks:
- name: memory_usage
description: "Memory usage quality gate. Bounds total memory usage."
bounds:
series: total_pss_bytes
upper_bound: "575 MiB"
- name: cpu_usage
description: "CPU usage quality gate. Bounds total average millicore usage."
bounds:
series: avg(total_cpu_usage_millicores)
upper_bound: 155
@@ -0,0 +1,18 @@
generator:
- http:
seed: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32]
headers: {}
target_uri: "http://127.0.0.1:8080/"
bytes_per_second: "100 MiB"
parallel_connections: 1
method:
post:
maximum_prebuild_cache_size_bytes: "500 MiB"
Comment on lines +6 to +10

P2: Match maximum_block_size to the paired ASCII benchmark

These HTTP ASCII cases are described as being paired with lading_payload/benches/ascii.rs, but leaving maximum_block_size unset makes the HTTP generator fall back to 1 MiB blocks (lading/src/generator/http.rs, lading_payload::block::default_maximum_block_size). The criterion bench's 10/100/1000 MiB entries serialize single buffers of those exact sizes, whereas this config only exercises repeated <=1 MiB serializations to reach 100 MiB/s. That means regressions in large-buffer ASCII generation will not be caught by the SMP gate even though the README says the two should stay in sync.


Comment on lines +8 to +10

P2: Set maximum_block_size to the benchmarked payload size

For the 10/100/1000 MiB/s cases, ascii_throughput in lading_payload/benches/ascii.rs benchmarks serializing one buffer of that exact size, but these configs never set maximum_block_size, so lading::generator::http::Config falls back to lading_payload::block::default_maximum_block_size() = 1 MiB. In practice these SMP cases send a stream of 1 MiB requests instead of exercising 10/100/1000 MiB ASCII serialization, so regressions that only show up on large contiguous buffers will be missed even though the README/comment says the experiments are paired with that benchmark.

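The mismatch can be illustrated with a little arithmetic; `blocks_needed` is an illustrative helper (not lading's API), and the block sizes are the ones named in the comment above:

```rust
const MIB: u64 = 1024 * 1024;

// Number of blocks needed to cover `total_bytes` when each block is at
// most `max_block` bytes (ceiling division).
fn blocks_needed(total_bytes: u64, max_block: u64) -> u64 {
    (total_bytes + max_block - 1) / max_block
}

fn main() {
    // The criterion bench path: one contiguous 100 MiB buffer.
    assert_eq!(blocks_needed(100 * MIB, 100 * MIB), 1);
    // The SMP case with the 1 MiB default: 100 separate <=1 MiB blocks,
    // so the large-contiguous-buffer code path is never exercised.
    assert_eq!(blocks_needed(100 * MIB, MIB), 100);
}
```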

variant: "ascii"

blackhole:
- http:
binding_addr: "0.0.0.0:8080"

telemetry:
addr: "0.0.0.0:9000"
@@ -0,0 +1,9 @@
generator: []

blackhole:
- http:
binding_addr: "127.0.0.1:9091"

target_metrics:
- prometheus:
uri: "http://127.0.0.1:9000/metrics"
21 changes: 21 additions & 0 deletions experiments/regression/cases/http_ascii_10mib/README.md
@@ -0,0 +1,21 @@
# HTTP ASCII 10 MiB/s

Resource usage of lading under a steady 10 MiB/s HTTP load with ASCII payloads.

## What

Runs lading as a target with an HTTP generator sending ASCII payloads at 10 MiB/s to its own HTTP blackhole. This exercises the HTTP generator, ASCII payload construction, and HTTP blackhole within a single lading instance.

## Why

Establishes a baseline for lading's resource consumption under a moderate HTTP workload. Regressions here indicate overhead in the HTTP generator or payload path independent of payload complexity.

## Paired Benchmark

This experiment is paired with the local benchmark in `lading_payload/benches/ascii.rs`.
If throughput sizes change in either place, update the other to match.

## Enforcements

Memory usage is enforced by bounding `total_pss_bytes`.
CPU usage is enforced by bounding `avg(total_cpu_usage_millicores)`.
27 changes: 27 additions & 0 deletions experiments/regression/cases/http_ascii_10mib/experiment.yaml
@@ -0,0 +1,27 @@
optimization_goal: memory
erratic: false

target:
name: lading-target
# Setting the experiment duration to infinite means lading will not go through the shutdown process.
# This is acceptable since shutdown performance is not a concern for this experiment.
command: "/usr/bin/lading --no-target --experiment-duration-infinite --config-path /etc/lading-target/lading.yaml"
cpu_allotment: 2
memory_allotment: 150 MiB

environment:
DD_SERVICE: lading-target
RUST_LOG: info
RUST_BACKTRACE: 1

checks:
- name: memory_usage
description: "Memory usage quality gate. Bounds total memory usage."
bounds:
series: total_pss_bytes
upper_bound: "132 MiB"
- name: cpu_usage
description: "CPU usage quality gate. Bounds total average millicore usage."
bounds:
series: avg(total_cpu_usage_millicores)
upper_bound: 14
@@ -0,0 +1,18 @@
generator:
- http:
seed: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32]
headers: {}
target_uri: "http://127.0.0.1:8080/"
bytes_per_second: "10 MiB"
parallel_connections: 1
method:
post:
maximum_prebuild_cache_size_bytes: "100 MiB"
variant: "ascii"

blackhole:
- http:
binding_addr: "0.0.0.0:8080"

telemetry:
addr: "0.0.0.0:9000"
@@ -0,0 +1,9 @@
generator: []

blackhole:
- http:
binding_addr: "127.0.0.1:9091"

target_metrics:
- prometheus:
uri: "http://127.0.0.1:9000/metrics"
21 changes: 21 additions & 0 deletions experiments/regression/cases/http_ascii_1mib/README.md
@@ -0,0 +1,21 @@
# HTTP ASCII 1 MiB/s

Resource usage of lading under a steady 1 MiB/s HTTP load with ASCII payloads.

## What

Runs lading as a target with an HTTP generator sending ASCII payloads at 1 MiB/s to its own HTTP blackhole. This exercises the HTTP generator, ASCII payload construction, and HTTP blackhole within a single lading instance.

## Why

Establishes a baseline for lading's resource consumption under a simple, sustained HTTP workload. Regressions here indicate overhead in the HTTP generator or payload path independent of payload complexity.

## Paired Benchmark

This experiment is paired with the local benchmark in `lading_payload/benches/ascii.rs`.
If throughput sizes change in either place, update the other to match.

## Enforcements

Memory usage is enforced by bounding `total_pss_bytes`.
CPU usage is enforced by bounding `avg(total_cpu_usage_millicores)`.
27 changes: 27 additions & 0 deletions experiments/regression/cases/http_ascii_1mib/experiment.yaml
@@ -0,0 +1,27 @@
optimization_goal: memory
erratic: false

target:
name: lading-target
# Setting the experiment duration to infinite means lading will not go through the shutdown process.
# This is acceptable since shutdown performance is not a concern for this experiment.
command: "/usr/bin/lading --no-target --experiment-duration-infinite --config-path /etc/lading-target/lading.yaml"
cpu_allotment: 2
memory_allotment: 40 MiB

environment:
DD_SERVICE: lading-target
RUST_LOG: info
RUST_BACKTRACE: 1

checks:
- name: memory_usage
description: "Memory usage quality gate. Bounds total memory usage."
bounds:
series: total_pss_bytes
upper_bound: "34 MiB"
- name: cpu_usage
description: "CPU usage quality gate. Bounds total average millicore usage."
bounds:
series: avg(total_cpu_usage_millicores)
upper_bound: 2.4
Comment on lines +23 to +27

P2: Add a throughput check to the smaller ASCII SMP cases

This is the last check in the file, so the 1/10/100 MiB/s experiments introduced here only bound PSS and CPU. Because the controller is already scraping the target's Prometheus telemetry via target_metrics, a generator regression that under-drives the advertised rate would make these cases cheaper and still pass green, meaning the regression detector can stop exercising the intended load without any signal. (The 1000 MiB/s case already has a separate comment; the same missing lower-bound exists for the smaller throughput cases too.)


@@ -0,0 +1,18 @@
generator:
- http:
seed: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32]
headers: {}
target_uri: "http://127.0.0.1:8080/"
bytes_per_second: "1 MiB"
parallel_connections: 1
method:
post:
maximum_prebuild_cache_size_bytes: "10 MiB"
variant: "ascii"

blackhole:
- http:
binding_addr: "0.0.0.0:8080"

telemetry:
addr: "0.0.0.0:9000"
@@ -0,0 +1,9 @@
generator: []

blackhole:
- http:
binding_addr: "127.0.0.1:9091"

target_metrics:
- prometheus:
uri: "http://127.0.0.1:9000/metrics"
Comment on lines +7 to +9

P2: Remove self-scraping from the 1 MiB/s regression case

At this load the target is only generating about one 1 MiB block per second because HTTP defaults maximum_block_size to 1 MiB (lading_payload/src/block.rs), but target_metrics.prometheus makes the observer issue another HTTP request against /metrics every sample period (1 Hz by default in lading/src/config.rs, via target_metrics::prometheus::Prometheus::run). In practice that means the exporter/scraper path becomes comparable to the actual workload, so this gate no longer isolates the 1 MiB/s generator+blackhole path and can miss or misattribute low-throughput regressions.

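A back-of-envelope check of the claim, using figures taken from the comment above (1 MiB default block size, 1 Hz scrape); `generator_rps` is an illustrative helper, not lading's API:

```rust
const MIB: u64 = 1024 * 1024;

// Approximate generator requests per second at a given byte rate, assuming
// one block per request.
fn generator_rps(bytes_per_second: u64, block_size: u64) -> u64 {
    bytes_per_second / block_size
}

fn main() {
    // At 1 MiB/s with 1 MiB blocks, the workload is ~1 request per second...
    let workload_rps = generator_rps(MIB, MIB);
    assert_eq!(workload_rps, 1);
    // ...while a 1 Hz Prometheus scrape adds another HTTP request per second,
    // so measurement traffic is on par with the workload being measured.
    let scrape_rps = 1;
    assert_eq!(scrape_rps, workload_rps);
}
```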

16 changes: 16 additions & 0 deletions experiments/regression/cases/idle/README.md
@@ -0,0 +1,16 @@
# Idle

Baseline resource usage of lading with no active workload.

## What

Runs lading as a target with an empty generator and a single HTTP blackhole; this represents the most basic way to run lading.

## Why

Establishes a floor for lading's memory footprint. Regressions here indicate overhead introduced independent of any workload.

## Enforcements

Memory usage is enforced by bounding `total_pss_bytes`.
CPU usage is enforced by bounding `avg(total_cpu_usage_millicores)`.
27 changes: 27 additions & 0 deletions experiments/regression/cases/idle/experiment.yaml
@@ -0,0 +1,27 @@
optimization_goal: memory
erratic: false

target:
name: lading-target
# Setting the experiment duration to infinite means lading will not go through the shutdown process.
# This is acceptable since shutdown performance is not a concern for this experiment.
command: "/usr/bin/lading --no-target --experiment-duration-infinite --config-path /etc/lading-target/lading.yaml"
cpu_allotment: 2
memory_allotment: 17 MiB

environment:
DD_SERVICE: lading-target
RUST_LOG: info
RUST_BACKTRACE: 1

checks:
- name: memory_usage
description: "Memory usage quality gate. Bounds total memory usage."
bounds:
series: total_pss_bytes
upper_bound: "14 MiB"
Comment on lines +21 to +22

P1: Stop binding these SMP cases to observer-only metrics

Checked the new cases/idle and cases/http_ascii_* configs: the only metric source wired into case/lading/lading.yaml is target_metrics.prometheus, but the target command launches lading with --no-target. In inner_main, --no-target makes config.target = None, so the observer is skipped entirely (lading/src/bin/lading.rs:603-615), and the total_pss_bytes / total_cpu_usage_millicores series used by these new quality gates are only emitted by the observer (lading/src/observer/linux/procfs.rs:161-164, lading/src/observer/linux/cgroup/v2/cpu.rs:115-123). That means these experiments never export the memory/CPU series they are checking here, so the new gates either fail on missing data or silently stop enforcing the intended limits.


- name: cpu_usage
description: "CPU usage quality gate. Bounds total average millicore usage."
bounds:
series: avg(total_cpu_usage_millicores)
upper_bound: 0.6
8 changes: 8 additions & 0 deletions experiments/regression/cases/idle/lading-target/lading.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
generator: []

blackhole:
- http:
binding_addr: "0.0.0.0:8080"

telemetry:
addr: "0.0.0.0:9000"
9 changes: 9 additions & 0 deletions experiments/regression/cases/idle/lading/lading.yaml
@@ -0,0 +1,9 @@
generator: []

blackhole:
- http:
binding_addr: "127.0.0.1:9091"

target_metrics:
- prometheus:
uri: "http://127.0.0.1:9000/metrics"
Comment on lines +7 to +9

P2: Remove Prometheus self-scraping from the idle baseline

The idle gate in cases/idle/experiment.yaml only checks total_pss_bytes and avg(total_cpu_usage_millicores), both of which come from the observer, but this extra target_metrics entry makes the controller issue a Prometheus scrape against the target every sample period (target_metrics::prometheus::Prometheus::run). In the idle case that periodic /metrics traffic becomes one of the main CPU sources, so the baseline is no longer measuring a quiescent lading process; it is measuring the exporter-plus-scraper path as well.


Contributor Author:

Personally, I want this as is. But maybe it's worthwhile to update the README to reflect this? I don't feel too strongly; I'll wait and see what the reviewer says.

6 changes: 6 additions & 0 deletions experiments/regression/config.yaml
@@ -0,0 +1,6 @@
lading:
version: 0.31.2

target:
ddprof_replicas: 0
Contributor Author:
Note: I was running into some strange profiling errors, so I turned this off temporarily.

Easy enough to re-add and re-test.

Contributor:
I'd be interested in seeing those errors if you have a link handy.

Contributor Author:

Ugh, too old. It's one of the many many runs I did last week and have totally lost track of it.

It's going to be easy enough to reproduce though.

internal_profiling_replicas: 0