Commit 4db0e5b

nuke-web3, kevinnassery, mothran, and zeroecco authored
DR-370: BM-238: Add Initial Bento performance tuning guide (github#62)
Co-authored-by: Kevin Nassery
Co-authored-by: Parker Thompson
Co-authored-by: Richard Howard
1 parent 9cc8cbf commit 4db0e5b

3 files changed: +291 -0 lines changed

docs/src/SUMMARY.md

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@

 - [Bento](./prover-manual/bento/README.md)
 - [Running Bento](./prover-manual/bento/running_bento.md)
+- [Performance](./prover-manual/bento/performance.md)
 - [Broker](./prover-manual/broker/README.md)

 - [🔗 Reference](./reference.md)

docs/src/images/nvtop-1.png

73.4 KB

docs/src/prover-manual/bento/performance.md

Lines changed: 290 additions & 0 deletions
@@ -0,0 +1,290 @@
# Bento Performance Tuning

This guide offers a practical set of steps for optimizing your initial Bento deployment. Performance tuning ultimately depends on your specific environment, equipment, and requirements, but these baseline recommendations will assist your initial deployment.

## Recommended tools

Before starting, we recommend the following tools to monitor performance and resource use:

- [nvtop](https://github.com/Syllo/nvtop) - A tool to monitor GPU utilization.
- [htop](https://htop.dev/) - A tool to monitor CPU utilization, system memory, and process status.

Both of these tools warrant a reasonably large terminal window on your desktop so that you can monitor performance during experiments.

## Isolating tests

Operating competing workloads on the same system as your Bento deployment can lead to unpredictable results. We recommend isolating your test system from other workloads to ensure that the performance tuning results are consistent and reliable. This includes stopping configured [Broker][page-broker] services:

```bash
# List running containers to find the Broker container ID
docker ps
# Stop the Broker container
docker stop <BROKER_CONTAINER_ID>
```

Alternatively, you can modify your `scripts/boundless_service.sh` to remove `--profile broker`.

## Defining a test harness

It is preferable to benchmark using an example of your actual workload. Using a representative workload will provide more accurate turnaround times and validate the ELF and input files for the proofs you plan to generate. To try a realistic example:

```bash
RUST_LOG=info cargo run --bin bento_cli -- -f /path/to/elf -i /path/to/input
```

If you intend to operate across a variety of different workloads (such as those that may be fed by [Broker][page-broker]), you can also use the following command to generate a synthetic workload:

```bash
RUST_LOG=info cargo run --bin bento_cli -- -c <ITERATION_COUNT>
```

Here, the iteration count is the number of times the synthetic guest is executed. A value of 4096 is a good starting point; however, on smaller or less performant systems you may want to reduce this to 2048 or 1024 while performing some of your experiments. For functional testing, 32 is sufficient.

The typical test process will be:

1. Start `nvtop` and `htop`.
1. Execute the test harness above and copy the job id.
1. Upon completion of the job, use [`scripts/job_status.sh`](https://github.com/boundless-xyz/boundless/blob/main/scripts/job_status.sh) to view the results.

Example test run of 1024 iterations:

```bash
RUST_LOG=info cargo run --bin bento_cli -- -c 1024
```

```txt
2024-10-17T15:27:34.469227Z INFO bento_cli: image_id: a0dfc25e54ebde808e4fd8c34b6549bbb91b4928edeea90ceb7d1d8e7e9096c7 | input_id: 3740ebbd-3bef-475f-b23d-6c2bf96c6551
2024-10-17T15:27:34.479904Z INFO bento_cli: STARK job_id: 895a996b-b0fa-4fc8-ae7a-ba92eeb6b0b1
2024-10-17T15:27:34.480919Z INFO bento_cli: STARK Job running....
...
2024-10-17T15:27:56.509275Z INFO bento_cli: STARK Job running....
2024-10-17T15:27:58.513718Z INFO bento_cli: Job done!
```
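
If you would rather not copy the job id by hand, the log line that carries it can be parsed. The following is a minimal sketch, assuming the `STARK job_id: <uuid>` line format shown in the log output above; the helper name `extract_job_id` is hypothetical and is not part of `bento_cli`.

```python
import re
from typing import Optional

# Assumption: bento_cli prints the job id as "STARK job_id: <uuid>",
# as in the example log output above.
JOB_ID_RE = re.compile(
    r"STARK job_id: ([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
)


def extract_job_id(log_text: str) -> Optional[str]:
    """Return the first STARK job id found in the log text, or None."""
    match = JOB_ID_RE.search(log_text)
    return match.group(1) if match else None


sample_log = (
    "2024-10-17T15:27:34.479904Z INFO bento_cli: STARK job_id: "
    "895a996b-b0fa-4fc8-ae7a-ba92eeb6b0b1\n"
    "2024-10-17T15:27:34.480919Z INFO bento_cli: STARK Job running....\n"
)
print(extract_job_id(sample_log))
```

The extracted id can then be piped into `scripts/job_status.sh` in the same way as a manually copied one.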
61+
62+
```bash
63+
echo 895a996b-b0fa-4fc8-ae7a-ba92eeb6b0b1 | bash scripts/job_status.sh
64+
```
65+
66+
```txt
67+
jobs_count
68+
------------
69+
19
70+
(1 row)
71+
72+
remaining_jobs
73+
----------------
74+
0
75+
(1 row)
76+
77+
task times:
78+
task_id | task_type | state | wall_time | started_at
79+
---------+-----------+-------+-----------+----------------------------
80+
init | Executor | done | 0.530216 | 2024-10-17 15:27:35.00974
81+
0 | Prove | done | 3.299771 | 2024-10-17 15:27:35.661319
82+
1 | Prove | done | 3.129968 | 2024-10-17 15:27:35.818467
83+
3 | Prove | done | 2.998964 | 2024-10-17 15:27:38.963914
84+
2 | Join | done | 1.123467 | 2024-10-17 15:27:38.9684
85+
4 | Prove | done | 2.901972 | 2024-10-17 15:27:40.105599
86+
7 | Prove | done | 3.001664 | 2024-10-17 15:27:41.977001
87+
5 | Join | done | 1.237363 | 2024-10-17 15:27:43.022033
88+
6 | Join | done | 1.154148 | 2024-10-17 15:27:44.273276
89+
8 | Prove | done | 3.096732 | 2024-10-17 15:27:44.992537
90+
(10 rows)
91+
92+
Effective Hz:
93+
hz | total_cycles | elapsed_sec
94+
---------------------+--------------+-------------
95+
399385.599822715909 | 8650752 | 21.660150
96+
(1 row)
97+
```
98+
99+
<div class="warning">
100+
101+
Note that in the `job_status.sh` output above, the Hz is only accurate if the job has completed with no error conditions. Failed and in-progress jobs will have an inflated Hz value.
102+
103+
</div>
104+
105+
In the final table, the effective Hz is the primary metric for consideration. This represents the (number of cycles) / (elapsed wallclock time). In the example above, the effective Hz is roughly 400kHz.
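
As a quick sanity check, the effective Hz can be recomputed from the other two columns. This is a minimal sketch using the `total_cycles` and `elapsed_sec` values from the example output above; the function name `effective_hz` is illustrative only.

```python
def effective_hz(total_cycles: int, elapsed_sec: float) -> float:
    """Effective Hz = total cycles / elapsed wall-clock seconds."""
    if elapsed_sec <= 0:
        raise ValueError("elapsed_sec must be positive")
    return total_cycles / elapsed_sec


# Values taken from the "Effective Hz" table in the example run.
hz = effective_hz(8_650_752, 21.660150)
print(f"{hz / 1000:.1f} kHz")  # roughly 400 kHz
```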

## Finding a maximum `SEGMENT_SIZE` for GPU VRAM

In most scenarios it makes sense to start by optimizing the GPU workers. This is because the bulk of the RISC Zero workload is executed by the `gpu-agent`, and GPU resources are most often the performance bottleneck.

A critical concept to understand as we begin this testing is [RISC Zero's continuations][r0-term-continuations]. [Continuations](https://dev.risczero.com/api/recursion) are the key mechanism that allows RISC Zero to scale to arbitrarily large proofs.

The CPU first runs the workload in a pre-flight stage that does no proving. While doing so, it divides the program trace into a series of [segments][r0-term-segment]. In Bento, these segments are then dispatched to various workers for proving and are combined in the final stage to produce the proof.

The key tuning parameter of continuations is `SEGMENT_SIZE` in the `.env-compose` file. A proof is divided into segments of `2^SEGMENT_SIZE` cycles. The default value is 20, which means that a proof is split into as many segments of approximately 1M cycles (`2^20 = 1048576`) as the workload requires.
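
To get a feel for how `SEGMENT_SIZE` affects the number of segments, the arithmetic can be sketched as follows. This is only an estimate under a simplifying assumption (every segment is filled to exactly `2^SEGMENT_SIZE` cycles); the executor decides the real segment boundaries, and the cycle count used here is purely illustrative.

```python
def segment_cycles(segment_size: int) -> int:
    """Cycles per segment for a given SEGMENT_SIZE (a power-of-two exponent)."""
    return 1 << segment_size


def estimated_segments(total_cycles: int, segment_size: int) -> int:
    """Ceiling of total_cycles / 2^segment_size, using integer arithmetic."""
    per_segment = segment_cycles(segment_size)
    return -(-total_cycles // per_segment)


print(segment_cycles(20))  # 1048576

# Illustrative workload of 8,650,752 cycles at different SEGMENT_SIZE values.
for po2 in (19, 20, 21):
    print(po2, estimated_segments(8_650_752, po2))
```

Larger values produce fewer, bigger segments, which is why VRAM capacity becomes the limiting factor.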

`SEGMENT_SIZE` has some practical implications related to GPU VRAM capacity. Below is a set of guidelines for setting `SEGMENT_SIZE` maximums:

| VRAM | `SEGMENT_SIZE` Max |
| ---- | ------------------ |
| 8GB  | 19                 |
| 16GB | 20                 |
| 20GB | 21                 |

Once you have selected a maximum `SEGMENT_SIZE`, you should verify that the GPU does in fact have enough memory to complete a job.

In the following test, an RTX 4060 with 16GB VRAM attempts to run with a `SEGMENT_SIZE` of 21, which is too large for the GPU to handle. In this test, it is necessary to monitor the `gpu-agent` docker logs to determine the cause of the failure. Note that Boundless should be restarted after changing the `SEGMENT_SIZE` value, and you should verify that [Broker][page-broker] is not running.

```bash
RUST_LOG=info cargo run --bin bento_cli -- -c 4096
```

```txt
2024-10-17T15:58:15.205138Z INFO bento_cli: image_id: a0dfc25e54ebde808e4fd8c34b6549bbb91b4928edeea90ceb7d1d8e7e9096c7 | input_id: fe7f4251-25f4-436f-b782-f134d4c80538
2024-10-17T15:58:15.210646Z INFO bento_cli: STARK job_id: bbf442eb-40db-44fb-8df4-f13a8ce10bf2
2024-10-17T15:58:15.211686Z INFO bento_cli: STARK Job running....
....
```

We then examine the `gpu-agent` logs and see a series of out-of-memory errors:
`docker logs bento-gpu_agent0`

```txt
2024-10-17T15:57:43.667484Z INFO workflow::tasks::prove: Starting proof of idx: 6f95e238-d0be-4e94-9e81-fefdc0b7d8c4 - 1
thread 'main' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/risc0-zkp-1.1.1/src/hal/cuda.rs:206:61:
called `Result::unwrap()` on an `Err` value: OutOfMemory
stack backtrace:
   0: rust_begin_unwind
             at ./rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:652:5
   1: core::panicking::panic_fmt
             at ./rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:72:14
   2: core::result::unwrap_failed
             at ./rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/result.rs:1654:5
```

This indicates that the GPU is out of memory. In this case, the `SEGMENT_SIZE` should be reduced.

Note that if you have a multi-GPU system, `SEGMENT_SIZE` should be set to a value that the GPU with the least VRAM can handle, so benchmarking should be performed on that card by setting `device_ids` in `compose.yml`.
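
The guideline table and the multi-GPU rule can be combined into a small lookup. This is a rough sketch, not a Bento feature: the function names are hypothetical, and it assumes that a card whose VRAM falls between two table rows should use the lower row. The result is only a starting point and still needs to be verified with a test run as described above.

```python
from typing import List, Optional

# (minimum VRAM in GB, maximum SEGMENT_SIZE), from the guideline table above.
GUIDELINES = [(20, 21), (16, 20), (8, 19)]


def max_segment_size(vram_gb: float) -> Optional[int]:
    """Guideline SEGMENT_SIZE maximum for one GPU, or None if below the table."""
    for min_vram, segment_size in GUIDELINES:
        if vram_gb >= min_vram:
            return segment_size
    return None


def system_segment_size(gpu_vram_gb: List[float]) -> Optional[int]:
    """Starting SEGMENT_SIZE for a system: limited by the smallest GPU."""
    limits = [max_segment_size(vram) for vram in gpu_vram_gb]
    if not limits or any(limit is None for limit in limits):
        return None
    return min(limits)


print(system_segment_size([16, 8]))   # limited by the 8GB card
print(system_segment_size([16, 24]))  # limited by the 16GB card
```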

**NOTE:** If a job fails to complete due to OOM, it may be resumed after Bento has been restarted. Make sure that no resumed jobs are in progress while the test harness is executing.

## Benchmark single GPU's `SEGMENT_SIZE`

Configure a single GPU instance:

_**compose.yml**_:

```yml
  gpu_agent0: &gpu
    image: agent
    runtime: nvidia
    pull_policy: never
    restart: always
    depends_on:
      - postgres
      - redis
      - minio

    mem_limit: 4G
    cpu_count: 4
```

Verify that no other GPU definitions are present in the compose file.

Confirm the `SEGMENT_SIZE` is set to the maximum as determined above in the `.env-compose` file.

Execute the test harness:

```bash
RUST_LOG=info cargo run --bin bento_cli -- -c 4096
```

Confirm single GPU utilization using `nvtop`:

<figure>
<img src="../../images/nvtop-1.png"/>
<cap>Monitoring Bento with <a target="_blank" href="https://github.com/Syllo/nvtop">nvtop</a> </cap>
</figure>

Review the effective Hz:

```bash
echo <JOB_ID> | bash scripts/job_status.sh
```

Example results:

```txt
...

Effective Hz:
         hz          | total_cycles | elapsed_sec
---------------------+--------------+-------------
 264892.074666431500 |     34603008 |  130.630590
(1 row)
```

Here we see that our single `gpu-agent` at max `SEGMENT_SIZE` is able to achieve an effective 264kHz.

## Multiple agents and GPUs

We can incorporate multiple GPUs into a configuration. In this example, we have two 16GB GPUs, as that proved to be optimal above:

`compose.yml`

```yml
...
  gpu_agent0: &gpu
    image: agent
    runtime: nvidia
    pull_policy: never
    restart: always
    depends_on:
      - postgres
      - redis
      - minio
...
  gpu_agent1:
    <<: *gpu

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']  # this agent uses GPU 0
              capabilities: [gpu]

  gpu_agent2:
    <<: *gpu

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['1']  # this agent uses GPU 1
              capabilities: [gpu]

  gpu_agent3:
    <<: *gpu

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['1']  # this agent also uses GPU 1
              capabilities: [gpu]
...
```

Here are the effective results on our example system:

```txt
431375.207834042771 | 35127296 | 81.430957
```

In this case, we see that the effective Hz has increased to 431kHz, which is a significant improvement over the single GPU configuration. However, if the system were purely GPU limited, we would have expected 264kHz * 2 = 528kHz.

This means that our example system is bound by some other factor such as bus bandwidth, memory, etc.
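
One way to quantify the gap is to compare the measured result with ideal linear scaling. The sketch below uses the effective Hz values from the two example runs; the function name `scaling_efficiency` and the idea of a single efficiency ratio are illustrative conventions, not something reported by `job_status.sh`.

```python
def scaling_efficiency(single_gpu_hz: float, multi_gpu_hz: float, gpu_count: int) -> float:
    """Measured throughput as a fraction of ideal linear scaling."""
    if gpu_count < 1 or single_gpu_hz <= 0:
        raise ValueError("gpu_count must be >= 1 and single_gpu_hz positive")
    ideal_hz = single_gpu_hz * gpu_count
    return multi_gpu_hz / ideal_hz


# Effective Hz from the single-GPU run and from the multi-agent run above.
efficiency = scaling_efficiency(264_892.07, 431_375.21, gpu_count=2)
print(f"{efficiency:.1%}")  # a little over 80% of the ideal 2x
```

A ratio well below 100% suggests that something other than raw GPU throughput is limiting the system.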

In this case we suggest:

- Reconfigure the system back to a higher `SEGMENT_SIZE` (PO2) with a single agent per GPU, and establish a new baseline performance level to compare against.
- In lower `SEGMENT_SIZE` configurations, experiment with `cpu_count` and `mem_limit` (removing, increasing, or decreasing them) to see if performance can be improved. In cases where bus contention is the limiting factor, running fewer agents at a higher maximum `SEGMENT_SIZE` may be optimal. Systems in this configuration should avoid GPU expansion and instead expand into remote workers.

[r0-term-continuations]: https://dev.risczero.com/terminology#continuations
[r0-term-segment]: https://dev.risczero.com/terminology#segment
[page-broker]: ../broker/README.md
