Commit efbfa36
Dealing with data too large for a single buffer (gfx-rs#6138)

* init files, dir structure
* wip: it's working, need to suss out the readme and some of the consts...
* ok that's probably good enough for a first pass...
* chore: spelling
* chore: readme tweaks
* chore: clippy and fmt
* chore: add self and changes to CHANGELOG.md
* fix: typo and remove env_logger via cfg flag for wasm builds (which this doesn't support anyway)
* refactor: bring in line with newer wgpu
* chore: work on the readme a bit...
* refactor: remove a bunch of everything, be simple
* wip: get a test going
* wip: remove unrequired pub(s)...
* chore: cleanups, typos, simplifying
* chore: reconcile changelog diffs
* fix: re-add our change to the changelog
* wip: finesse the docs a bit per request...
* chore: trying to get the wording right...
* fix: typos
* fix: spelling
* Update mod.rs: swap all loops over to `for` by request; Flume's sender is already Send/Sync; chunks will already split for us; `.unwrap()`s everywhere!
* Update CHANGELOG.md (Co-authored-by: Jim Blandy <[email protected]>)
* 1GB as the example says we'll do; update readme for Windows users.
* bring up to date with trunk; make more of Jim's changes
* some of the consts have changed name.
* small tweaks
* what is the flag called now?
* Update shader.wgsl: naming things betterer
* Update README.md: reword readme
* Update README.md: simplify readme
* Update mod.rs: remove unused
* well at least it compiles again
* BUG: ... it seems to run forever and never complete.
* nicer shader module creation
* ... add logging to track down infinite hangtime...
* use 2 buffers in the test
* test and example pass (now they do the same number of buffers)
* that's better...
* fix: remove duplicate entries
* fix: whitespace
* move changelog entry to #unreleased per request
* fix: target_arch != wasm to satiate the pipeline
* pipeline wants us to allow allows...
* savage hacks to make the wasm build ignore our test
* fix: allow the allowing of allows that allow the dead_code.
* fix: no tests on wasm

Co-authored-by: Jim Blandy <[email protected]>
1 parent 1ef9940 commit efbfa36

8 files changed: +373 −0 lines changed

CHANGELOG.md

Lines changed: 4 additions & 0 deletions

```diff
@@ -40,6 +40,9 @@ Bottom level categories:
 
 ## Unreleased
 
+- Added an example that shows how to handle datasets too large to fit in a single `GPUBuffer` by distributing the data across many buffers, and then having the shader receive them as a `binding_array` of storage buffers. By @alphastrata in [#6138](https://github.com/gfx-rs/wgpu/pull/6138)
+
+
 ### Major Features
 
 #### Hashmaps Removed from APIs
@@ -186,6 +189,7 @@ By @wumpf in [#7144](https://github.com/gfx-rs/wgpu/pull/7144)
 
 - Support getting vertices of the hit triangle when raytracing. By @Vecvec in [#7183](https://github.com/gfx-rs/wgpu/pull/7183).
+
 #### Naga
 
 - Add support for unsigned types when calling textureLoad with the level parameter. By @ygdrasil-io in [#7058](https://github.com/gfx-rs/wgpu/pull/7058).
```

examples/README.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -51,6 +51,7 @@ These examples use a common framework to handle wgpu init, window creation, and
 - `hello_workgroups` - Teaches the user about the basics of compute workgroups; what they are and what they can do.
 - `hello_synchronization` - Teaches the user about synchronization in WGSL, the ability to force all invocations in a workgroup to synchronize with each other before continuing via a sort of barrier.
 - `storage_texture` - Demonstrates the use of storage textures as outputs to compute shaders. The example on the outside seems very similar to `render_to_texture` in that it outputs an image either to the file system or the web page, except displaying a grayscale render of the Mandelbrot Set. However, inside, the example dispatches a grid of compute workgroups, one for each pixel, which calculates the pixel value and stores it to the corresponding pixel of the output storage texture. This example either outputs an image file of your naming (pass command line arguments after specifying a `--` like `cargo run --bin wgpu-examples -- storage_texture "test.png"`) or adds an `img` element containing the image to the page in WASM.
+- `big_compute_buffers` - Demonstrates how you can split _large_ datasets across multiple buffers, using `binding_array` in your `wgsl` [NOTE: native only, no WASM support].
 
 #### Combined
```

examples/features/src/big_compute_buffers/README.md (new file)

Lines changed: 41 additions & 0 deletions

# big-compute-buffers

*NOTE: `binding_array` is Vulkan only.*

This example assumes you're familiar with the other GP-GPU compute examples in this repository; if you're not, go look at those first.

It also assumes you've specifically come here looking to do this, because you want at least the following:

1. To work on the data in your shader as if it were one contiguous range, rather than batching it.
2. To work on data that does **not** fit within a single buffer on your device. See the [hello](https://github.com/gfx-rs/wgpu/tree/trunk/examples/src/hello) example for how to print information about your particular device, including its maximum supported buffer size (a sketch of the relevant limits query follows this list).
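Before committing to a split, it can help to check what your adapter actually supports. The following is a minimal, hypothetical pre-flight sketch (not part of the example itself, though it uses the same `wgpu` calls the example's `mod.rs` does):

```rust
// Hypothetical pre-flight check: query the adapter's limits before deciding
// how many buffers your data must be split across. Error handling elided.
fn main() {
    let instance = wgpu::Instance::default();
    let adapter = pollster::block_on(
        instance.request_adapter(&wgpu::RequestAdapterOptions::default()),
    )
    .expect("no suitable adapter found");

    let limits = adapter.limits();
    println!("max_buffer_size: {}", limits.max_buffer_size);
    println!("max_storage_buffer_binding_size: {}", limits.max_storage_buffer_binding_size);
}
```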
Demonstrates how to split larger datasets (things too big to fit into a single buffer) across multiple buffers. Concretely, the example:

- Creates a set of buffers totalling `1GB`, full of `0.0f32`.
- Moves those buffers to the DEVICE.
- Increments each element in each buffer by `1.0`, on the DEVICE.
- Returns those modified buffers, now full of `1.0` values, back to the HOST (the buffer-count arithmetic is sketched below).
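For intuition, here is that arithmetic as a standalone sketch, using the same constants the example uses (`MAX_BUFFER_SIZE = 1 << 27` bytes per buffer, a 1 GiB dataset of `f32`s); the division mirrors what the example's `calculate_chunks` does with `slice::chunks`:

```rust
// Back-of-the-envelope sketch: how many buffers does 1 GiB of f32s need when
// each buffer is capped at 1 << 27 bytes (134_217_728, i.e. 128 MiB)?
const MAX_BUFFER_SIZE: u64 = 1 << 27;
const BYTES_PER_GB: usize = 1024 * 1024 * 1024;

fn main() {
    let elements = BYTES_PER_GB / std::mem::size_of::<f32>(); // 268_435_456 f32s
    let per_chunk = MAX_BUFFER_SIZE as usize / std::mem::size_of::<f32>(); // 33_554_432

    // `slice::chunks(per_chunk)` yields exactly this many slices, and each
    // slice maps 1:1 onto one storage buffer (plus one staging buffer).
    let buffers_needed = elements.div_ceil(per_chunk);

    assert_eq!(buffers_needed, 8);
    println!("{elements} elements -> {buffers_needed} buffers");
}
```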
## Caution

- Large buffers can fail to allocate due to fragmentation. You will **always** need not only the appropriate amount of space for your buffer(s); that space must also be contiguous within GPU/Device memory for this strategy to work.

You can read more about fragmentation [here](https://developer.nvidia.com/docs/drive/drive-os/archives/6.0.4/linux/sdk/common/topics/graphics_content/avoiding_memory_fragmentation.html).

## To Run

```sh
# linux/mac
RUST_LOG=wgpu_examples::big_compute_buffers=info cargo run -r --bin wgpu-examples -- big_compute_buffers

# windows (PowerShell)
$env:WGPU_BACKEND="Vulkan"; $env:RUST_LOG="wgpu_examples::big_compute_buffers=info"; cargo run -r --bin wgpu-examples -- big_compute_buffers
```

## Example Output

```txt
[2024-09-29T11:47:55Z INFO wgpu_examples::big_compute_buffers] All 0.0s
[2024-09-29T11:47:58Z INFO wgpu_examples::big_compute_buffers] GPU RUNTIME: 3228ms
[2024-09-29T11:47:58Z INFO wgpu_examples::big_compute_buffers] All 1.0s
```
examples/features/src/big_compute_buffers/mod.rs (new file)

Lines changed: 251 additions & 0 deletions

```rust
//! This example shows you a potential course of action for when your data is
//! too large for a single buffer.
//!
//! A lot of things aren't explained here via comments. See hello-compute and
//! repeated-compute for code that is more thoroughly commented.

use std::num::NonZeroU32;
use wgpu::{util::DeviceExt, Features};

// These are set by the minimum required defaults for WebGPU.
const MAX_BUFFER_SIZE: u64 = 1 << 27; // 134_217_728 bytes (128 MiB)
const MAX_DISPATCH_SIZE: u32 = (1 << 16) - 1;

pub async fn execute_gpu(numbers: &[f32]) -> Vec<f32> {
    let instance = wgpu::Instance::default();

    let adapter = instance
        .request_adapter(&wgpu::RequestAdapterOptions::default())
        .await
        .unwrap();

    let (device, queue) = adapter
        .request_device(&wgpu::DeviceDescriptor {
            label: None,
            // These features are required to use `binding_array` in your wgsl.
            // Without them your shader may fail to compile.
            required_features: Features::BUFFER_BINDING_ARRAY
                | Features::STORAGE_RESOURCE_BINDING_ARRAY
                | Features::SAMPLED_TEXTURE_AND_STORAGE_BUFFER_ARRAY_NON_UNIFORM_INDEXING,
            memory_hints: wgpu::MemoryHints::Performance,
            required_limits: wgpu::Limits {
                max_buffer_size: MAX_BUFFER_SIZE,
                max_binding_array_elements_per_shader_stage: 8,
                ..Default::default()
            },
            ..Default::default()
        })
        .await
        .unwrap();

    execute_gpu_inner(&device, &queue, numbers).await
}

pub async fn execute_gpu_inner(
    device: &wgpu::Device,
    queue: &wgpu::Queue,
    numbers: &[f32],
) -> Vec<f32> {
    let (staging_buffers, storage_buffers, bind_group, compute_pipeline) = setup(device, numbers);

    let mut encoder =
        device.create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });
    {
        let mut cpass = encoder.begin_compute_pass(&wgpu::ComputePassDescriptor {
            label: Some("compute pass descriptor"),
            timestamp_writes: None,
        });
        cpass.set_pipeline(&compute_pipeline);
        cpass.set_bind_group(0, Some(&bind_group), &[]);

        cpass.dispatch_workgroups(MAX_DISPATCH_SIZE.min(numbers.len() as u32), 1, 1);
    }

    // Copy each storage buffer's results into its matching staging buffer so
    // they can be mapped and read back on the host.
    for (storage_buffer, staging_buffer) in storage_buffers.iter().zip(staging_buffers.iter()) {
        let stg_size = staging_buffer.size();

        encoder.copy_buffer_to_buffer(
            storage_buffer, // Source buffer
            0,
            staging_buffer, // Destination buffer
            0,
            stg_size,
        );
    }

    queue.submit(Some(encoder.finish()));

    for staging_buffer in &staging_buffers {
        let slice = staging_buffer.slice(..);
        slice.map_async(wgpu::MapMode::Read, |_| {});
    }

    device.poll(wgpu::PollType::Wait).unwrap();

    // Stitch the chunks back together into one contiguous Vec<f32>.
    let mut data = Vec::new();
    for staging_buffer in &staging_buffers {
        let slice = staging_buffer.slice(..);
        let mapped = slice.get_mapped_range();
        data.extend_from_slice(bytemuck::cast_slice(&mapped));
        drop(mapped);
        staging_buffer.unmap();
    }

    data
}

fn setup(
    device: &wgpu::Device,
    numbers: &[f32],
) -> (
    Vec<wgpu::Buffer>,
    Vec<wgpu::Buffer>,
    wgpu::BindGroup,
    wgpu::ComputePipeline,
) {
    let cs_module = device.create_shader_module(wgpu::include_wgsl!("shader.wgsl"));

    let staging_buffers = create_staging_buffers(device, numbers);
    let storage_buffers = create_storage_buffers(device, numbers);

    let (bind_group_layout, bind_group) = setup_binds(&storage_buffers, device);

    let compute_pipeline = setup_pipeline(device, bind_group_layout, cs_module);
    (
        staging_buffers,
        storage_buffers,
        bind_group,
        compute_pipeline,
    )
}

fn setup_pipeline(
    device: &wgpu::Device,
    bind_group_layout: wgpu::BindGroupLayout,
    cs_module: wgpu::ShaderModule,
) -> wgpu::ComputePipeline {
    let pipeline_layout = device.create_pipeline_layout(&wgpu::PipelineLayoutDescriptor {
        label: Some("Compute Pipeline Layout"),
        bind_group_layouts: &[&bind_group_layout],
        push_constant_ranges: &[],
    });

    device.create_compute_pipeline(&wgpu::ComputePipelineDescriptor {
        label: Some("Compute Pipeline"),
        layout: Some(&pipeline_layout),
        module: &cs_module,
        entry_point: Some("main"),
        compilation_options: Default::default(),
        cache: None,
    })
}

fn setup_binds(
    storage_buffers: &[wgpu::Buffer],
    device: &wgpu::Device,
) -> (wgpu::BindGroupLayout, wgpu::BindGroup) {
    let bind_group_entries: Vec<wgpu::BindGroupEntry> = storage_buffers
        .iter()
        .enumerate()
        .map(|(bind_idx, buffer)| wgpu::BindGroupEntry {
            binding: bind_idx as u32,
            resource: buffer.as_entire_binding(),
        })
        .collect();

    let bind_group_layout_entries: Vec<wgpu::BindGroupLayoutEntry> = (0..storage_buffers.len())
        .map(|bind_idx| wgpu::BindGroupLayoutEntry {
            binding: bind_idx as u32,
            visibility: wgpu::ShaderStages::COMPUTE,
            ty: wgpu::BindingType::Buffer {
                ty: wgpu::BufferBindingType::Storage { read_only: false },
                has_dynamic_offset: false,
                min_binding_size: None,
            },
            count: Some(NonZeroU32::new(1).unwrap()),
        })
        .collect();

    let bind_group_layout = device.create_bind_group_layout(&wgpu::BindGroupLayoutDescriptor {
        label: Some("Custom Storage Bind Group Layout"),
        entries: &bind_group_layout_entries,
    });

    let bind_group = device.create_bind_group(&wgpu::BindGroupDescriptor {
        label: Some("Combined Storage Bind Group"),
        layout: &bind_group_layout,
        entries: &bind_group_entries,
    });

    (bind_group_layout, bind_group)
}

fn calculate_chunks(numbers: &[f32], max_buffer_size: u64) -> Vec<&[f32]> {
    let max_elements_per_chunk = max_buffer_size as usize / std::mem::size_of::<f32>();
    numbers.chunks(max_elements_per_chunk).collect()
}

fn create_storage_buffers(device: &wgpu::Device, numbers: &[f32]) -> Vec<wgpu::Buffer> {
    let chunks = calculate_chunks(numbers, MAX_BUFFER_SIZE);

    chunks
        .iter()
        .enumerate()
        .map(|(e, seg)| {
            device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
                label: Some(&format!("Storage Buffer-{}", e)),
                contents: bytemuck::cast_slice(seg),
                usage: wgpu::BufferUsages::STORAGE
                    | wgpu::BufferUsages::COPY_DST
                    | wgpu::BufferUsages::COPY_SRC,
            })
        })
        .collect()
}

fn create_staging_buffers(device: &wgpu::Device, numbers: &[f32]) -> Vec<wgpu::Buffer> {
    let chunks = calculate_chunks(numbers, MAX_BUFFER_SIZE);

    (0..chunks.len())
        .map(|e| {
            let size = std::mem::size_of_val(chunks[e]) as u64;

            device.create_buffer(&wgpu::BufferDescriptor {
                label: Some(&format!("staging buffer-{}", e)),
                size,
                usage: wgpu::BufferUsages::MAP_READ | wgpu::BufferUsages::COPY_DST,
                mapped_at_creation: false,
            })
        })
        .collect()
}

#[cfg_attr(target_arch = "wasm32", allow(clippy::allow_attributes, dead_code))]
async fn run() {
    let numbers = {
        const BYTES_PER_GB: usize = 1024 * 1024 * 1024;
        // 4 bytes per f32
        let elements = (BYTES_PER_GB as f32 / 4.0) as usize;
        vec![0.0; elements]
    };
    assert!(numbers.iter().all(|n| *n == 0.0));
    log::info!("All 0.0s");
    let t1 = std::time::Instant::now();
    let results = execute_gpu(&numbers).await;
    log::info!("GPU RUNTIME: {}ms", t1.elapsed().as_millis());
    assert_eq!(numbers.len(), results.len());
    assert!(results.iter().all(|n| *n == 1.0));
    log::info!("All 1.0s");
}

pub fn main() {
    #[cfg(not(target_arch = "wasm32"))]
    {
        env_logger::init();
        pollster::block_on(run());
    }
}

#[cfg(test)]
#[cfg(not(target_arch = "wasm32"))]
mod tests;
```
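A quick sanity check of the dispatch arithmetic above, as a standalone sketch (constants copied from `MAX_DISPATCH_SIZE` here and from `OFFSET`/`@workgroup_size` in `shader.wgsl`): 65,535 workgroups of 256 threads, each thread striding over 256 elements, comfortably covers all 268,435,456 `f32`s:

```rust
// Standalone sketch: confirm the single dispatch in execute_gpu_inner can
// touch every element. Constants mirror mod.rs and shader.wgsl above.
fn main() {
    let workgroups: u64 = (1 << 16) - 1; // MAX_DISPATCH_SIZE = 65_535
    let threads_per_workgroup: u64 = 256; // @workgroup_size(256, 1, 1)
    let elements_per_thread: u64 = 256; // OFFSET in shader.wgsl

    let covered = workgroups * threads_per_workgroup * elements_per_thread;
    let total_elements: u64 = 1 << 28; // 268_435_456 f32s in 1 GiB

    assert!(covered >= total_elements, "dispatch would leave elements untouched");
    println!("can cover {covered} elements; need {total_elements}");
}
```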
examples/features/src/big_compute_buffers/shader.wgsl (new file)

Lines changed: 34 additions & 0 deletions

```wgsl
const OFFSET: u32 = 1u << 8u; // 256: elements handled per invocation
const BUFFER_MAX_ELEMENTS: u32 = 1u << 25u; // Think `buffer.len()`
const NUM_BUFFERS: u32 = 8u;
const TOTAL_SIZE: u32 = BUFFER_MAX_ELEMENTS * NUM_BUFFERS;


// `binding_array` requires a custom struct
struct ContiguousArray {
    inner: array<f32>
}

@group(0) @binding(0)
var<storage, read_write> storage_array: binding_array<ContiguousArray, NUM_BUFFERS>;


@compute @workgroup_size(256, 1, 1)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let base_index = global_id.x * OFFSET;

    for (var i = 0u; i < OFFSET; i++) {
        let index = base_index + i;

        if index < TOTAL_SIZE {
            let buffer_index = index / BUFFER_MAX_ELEMENTS;
            let inner_index = index % BUFFER_MAX_ELEMENTS;

            storage_array[buffer_index].inner[inner_index] = add_one(storage_array[buffer_index].inner[inner_index]);
        }
    }
}

fn add_one(n: f32) -> f32 {
    return n + 1.0;
}
```
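For intuition about the shader's `index / BUFFER_MAX_ELEMENTS` and `index % BUFFER_MAX_ELEMENTS` split, here is the same mapping mirrored on the host as a standalone sketch (`locate` is a hypothetical helper for illustration, not part of the example):

```rust
// Hypothetical host-side mirror of the shader's flat-index -> (buffer, offset) split.
const BUFFER_MAX_ELEMENTS: u32 = 1 << 25; // 33_554_432 f32s per buffer

fn locate(index: u32) -> (u32, u32) {
    (index / BUFFER_MAX_ELEMENTS, index % BUFFER_MAX_ELEMENTS)
}

fn main() {
    assert_eq!(locate(0), (0, 0)); // the very first element lives in buffer 0
    assert_eq!(locate(33_554_432), (1, 0)); // first element of buffer 1
    assert_eq!(locate(100_000_000), (2, 32_891_136)); // 100_000_000 - 2 * 33_554_432
}
```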
examples/features/src/big_compute_buffers/tests.rs (new file)

Lines changed: 35 additions & 0 deletions

```rust
use super::*;
use wgpu_test::{gpu_test, GpuTestConfiguration, TestParameters};

#[gpu_test]
static TWO_BUFFERS: GpuTestConfiguration = GpuTestConfiguration::new()
    .parameters(
        TestParameters::default()
            .features(
                Features::BUFFER_BINDING_ARRAY
                    | Features::STORAGE_RESOURCE_BINDING_ARRAY
                    | Features::SAMPLED_TEXTURE_AND_STORAGE_BUFFER_ARRAY_NON_UNIFORM_INDEXING,
            )
            .downlevel_flags(wgpu::DownlevelFlags::COMPUTE_SHADERS)
            .limits(wgpu::Limits {
                max_buffer_size: MAX_BUFFER_SIZE,
                max_binding_array_elements_per_shader_stage: 8,
                ..Default::default()
            }),
    )
    .run_async(|ctx| {
        // The test environment's GPU reports 134MB as the max storage buffer size:
        // https://github.com/gfx-rs/wgpu/actions/runs/11001397782/job/30546188996#step:12:1096
        const SIZE: usize = (1 << 27) / std::mem::size_of::<f32>() * 8;
        // Eight buffers' worth of 0.0s.
        let input = &[0.0; SIZE];

        async move { assert_execute_gpu(&ctx.device, &ctx.queue, input).await }
    });

async fn assert_execute_gpu(device: &wgpu::Device, queue: &wgpu::Queue, input: &[f32]) {
    let expected_len = input.len();
    let produced = execute_gpu_inner(device, queue, input).await;

    assert_eq!(produced.len(), expected_len);
    assert!(produced.into_iter().all(|v| v == 1.0));
}
```
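A standalone sketch double-checking the test's `SIZE` constant (the `* 8` gives eight full buffers' worth of `f32`s, matching the example's 1 GiB):

```rust
// Sanity-check the test's SIZE constant: (1 << 27) bytes is one buffer's
// capacity; dividing by 4 bytes per f32 and multiplying by 8 buffers
// should land on exactly 1 GiB of data.
fn main() {
    const SIZE: usize = (1 << 27) / std::mem::size_of::<f32>() * 8;
    assert_eq!(SIZE, 268_435_456); // total f32 elements
    assert_eq!(SIZE * std::mem::size_of::<f32>(), 1 << 30); // exactly 1 GiB
}
```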

examples/features/src/lib.rs

Lines changed: 1 addition & 0 deletions

```diff
@@ -4,6 +4,7 @@
 pub mod framework;
 pub mod utils;
 
+pub mod big_compute_buffers;
 pub mod boids;
 pub mod bunnymark;
 pub mod conservative_raster;
```
