Commit efbfa36
Dealing with data too large for a single buffer (gfx-rs#6138)

* init files, dir structure
* wip: it's working, need to suss out the readme and some of the consts...
* ok that's probably good enough for a first pass...
* chore: spelling
* chore: readme tweaks
* chore: clippy and fmt
* chore: add self and changes to CHANGELOG.md
* fix: typo and remove env_logger via cfg flag for wasm builds (which this doesn't support anyway)
* refactor: bring in line with newer wgpu
* chore: work on the readme a bit...
* refactor: remove a bunch of everything, be simple
* wip: get a test going
* wip: remove unrequired pub(s)...
* chore: cleanups, typos, simplifying
* chore: reconcile changelog diffs
* fix: re-add our change to the changelog
* wip: finesse the docs a bit per request...
* chore: trying to get the wording right...
* fix: typos
* fix: spelling
* Update mod.rs: swap all loops over to `for` by request; Flume's sender is already Send/Sync; chunks will already split for us; `.unwrap()`s everywhere!
* Update CHANGELOG.md (Co-authored-by: Jim Blandy <[email protected]>)
* 1GB as the example says we'll do; update readme for Windows users.
* bring up to date with trunk; make more of Jim's changes
* some of the consts have changed name.
* small tweaks
* what is the flag called now?
* Update shader.wgsl: naming things betterer
* Update README.md: reword readme
* Update README.md: simplify readme
* Update mod.rs: remove unused
* well at least it compiles again
* BUG: ... it seems to run forever and never complete.
* nicer shader module creation
* ... add logging to track down infinite hangtime...
* use 2 buffers in the test
* test and example pass (now they do the same number of buffers)
* that's better...
* fix: remove duplicate entries
* fix: whitespace
* move changelog entry to #unreleased per request
* fix: target_arch != wasm to satiate the pipeline
* pipeline wants us to allow allows...
* savage hacks to make the wasm build ignore our test
* fix: allow the allowing of allows that allow the dead_code.
* fix: no tests on wasm

Co-authored-by: Jim Blandy <[email protected]>
1 parent 1ef9940 commit efbfa36

8 files changed: +373 −0 lines changed

CHANGELOG.md

Lines changed: 4 additions & 0 deletions

```diff
@@ -40,6 +40,9 @@ Bottom level categories:
 
 ## Unreleased
 
+- Added an example that shows how to handle datasets too large to fit in a single `GPUBuffer` by distributing the data across many buffers, and then having the shader receive them as a `binding_array` of storage buffers. By @alphastrata in [#6138](https://github.com/gfx-rs/wgpu/pull/6138)
+
+
 ### Major Features
 
 #### Hashmaps Removed from APIs
@@ -186,6 +189,7 @@ By @wumpf in [#7144](https://github.com/gfx-rs/wgpu/pull/7144)
 
 - Support getting vertices of the hit triangle when raytracing. By @Vecvec in [#7183](https://github.com/gfx-rs/wgpu/pull/7183).
+
 #### Naga
 
 - Add support for unsigned types when calling textureLoad with the level parameter. By @ygdrasil-io in [#7058](https://github.com/gfx-rs/wgpu/pull/7058).
```

examples/README.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -51,6 +51,7 @@ These examples use a common framework to handle wgpu init, window creation, and
 - `hello_workgroups` - Teaches the user about the basics of compute workgroups; what they are and what they can do.
 - `hello_synchronization` - Teaches the user about synchronization in WGSL, the ability to force all invocations in a workgroup to synchronize with each other before continuing via a sort of barrier.
 - `storage_texture` - Demonstrates the use of storage textures as outputs to compute shaders. The example on the outside seems very similar to `render_to_texture` in that it outputs an image either to the file system or the web page, except displaying a grayscale render of the Mandelbrot Set. However, inside, the example dispatches a grid of compute workgroups, one for each pixel, which calculates the pixel value and stores it to the corresponding pixel of the output storage texture. This example either outputs an image file of your naming (pass command line arguments after specifying a `--` like `cargo run --bin wgpu-examples -- storage_texture "test.png"`) or adds an `img` element containing the image to the page in WASM.
+- `big_compute_buffers` - Demonstrates how you can split _large_ datasets across multiple buffers, using `binding_array` in your `wgsl` [NOTE: native only, no WASM support].
 
 #### Combined
```

examples/features/src/big_compute_buffers/README.md (new file)

Lines changed: 41 additions & 0 deletions

# big-compute-buffers

*NOTE: `binding_array` is Vulkan only.*

This example assumes you're familiar with the other GP-GPU compute examples in this repository; if you're not, go look at those first.

It also assumes you've specifically come here looking to do this, because you want at least the following:

1. To work on the data in your shader as if it were one contiguous range, rather than batching it.
2. To work on data that does **not** fit within a single buffer on your device. See the [hello](https://github.com/gfx-rs/wgpu/tree/trunk/examples/src/hello) example for how to print information about your particular device, including its maximum supported buffer size (a sketch of the relevant limits query follows this list).
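Before committing to a split, it can help to check what your adapter actually supports. The following is a minimal, hypothetical pre-flight sketch (not part of the example itself, though it uses the same `wgpu` calls the example's `mod.rs` does):

```rust
// Hypothetical pre-flight check: query the adapter's limits before deciding
// how many buffers your data must be split across. Error handling elided.
fn main() {
    let instance = wgpu::Instance::default();
    let adapter = pollster::block_on(
        instance.request_adapter(&wgpu::RequestAdapterOptions::default()),
    )
    .expect("no suitable adapter found");

    let limits = adapter.limits();
    println!("max_buffer_size: {}", limits.max_buffer_size);
    println!("max_storage_buffer_binding_size: {}", limits.max_storage_buffer_binding_size);
}
```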
Demonstrates how to split larger datasets (things too big to fit into a single buffer) across multiple buffers. Concretely, the example:

- Creates a set of buffers totalling `1GB`, full of `0.0f32`.
- Moves those buffers to the DEVICE.
- Increments each element in each buffer by `1.0`, on the DEVICE.
- Returns those modified buffers, now full of `1.0` values, back to the HOST (the buffer-count arithmetic is sketched below).
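For intuition, here is that arithmetic as a standalone sketch, using the same constants the example uses (`MAX_BUFFER_SIZE = 1 << 27` bytes per buffer, a 1 GiB dataset of `f32`s); the division mirrors what the example's `calculate_chunks` does with `slice::chunks`:

```rust
// Back-of-the-envelope sketch: how many buffers does 1 GiB of f32s need when
// each buffer is capped at 1 << 27 bytes (134_217_728, i.e. 128 MiB)?
const MAX_BUFFER_SIZE: u64 = 1 << 27;
const BYTES_PER_GB: usize = 1024 * 1024 * 1024;

fn main() {
    let elements = BYTES_PER_GB / std::mem::size_of::<f32>(); // 268_435_456 f32s
    let per_chunk = MAX_BUFFER_SIZE as usize / std::mem::size_of::<f32>(); // 33_554_432

    // `slice::chunks(per_chunk)` yields exactly this many slices, and each
    // slice maps 1:1 onto one storage buffer (plus one staging buffer).
    let buffers_needed = elements.div_ceil(per_chunk);

    assert_eq!(buffers_needed, 8);
    println!("{elements} elements -> {buffers_needed} buffers");
}
```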
## Caution

- Large buffers can fail to allocate due to fragmentation. You will **always** need not only the appropriate amount of space for your buffer(s); that space must also be contiguous within GPU/Device memory for this strategy to work.

You can read more about fragmentation [here](https://developer.nvidia.com/docs/drive/drive-os/archives/6.0.4/linux/sdk/common/topics/graphics_content/avoiding_memory_fragmentation.html).

## To Run

```sh
# linux/mac
RUST_LOG=wgpu_examples::big_compute_buffers=info cargo run -r --bin wgpu-examples -- big_compute_buffers

# windows (PowerShell)
$env:WGPU_BACKEND="Vulkan"; $env:RUST_LOG="wgpu_examples::big_compute_buffers=info"; cargo run -r --bin wgpu-examples -- big_compute_buffers
```

## Example Output

```txt
[2024-09-29T11:47:55Z INFO wgpu_examples::big_compute_buffers] All 0.0s
[2024-09-29T11:47:58Z INFO wgpu_examples::big_compute_buffers] GPU RUNTIME: 3228ms
[2024-09-29T11:47:58Z INFO wgpu_examples::big_compute_buffers] All 1.0s
```
examples/features/src/big_compute_buffers/mod.rs (new file)

Lines changed: 251 additions & 0 deletions

```rust
//! This example shows you a potential course of action for when your data is
//! too large for a single buffer.
//!
//! A lot of things aren't explained here via comments. See hello-compute and
//! repeated-compute for code that is more thoroughly commented.

use std::num::NonZeroU32;
use wgpu::{util::DeviceExt, Features};

// These are set by the minimum required defaults for WebGPU.
const MAX_BUFFER_SIZE: u64 = 1 << 27; // 134_217_728 bytes (128 MiB)
const MAX_DISPATCH_SIZE: u32 = (1 << 16) - 1;

pub async fn execute_gpu(numbers: &[f32]) -> Vec<f32> {
    let instance = wgpu::Instance::default();

    let adapter = instance
        .request_adapter(&wgpu::RequestAdapterOptions::default())
        .await
        .unwrap();

    let (device, queue) = adapter
        .request_device(&wgpu::DeviceDescriptor {
            label: None,
            // These features are required to use `binding_array` in your wgsl.
            // Without them your shader may fail to compile.
            required_features: Features::BUFFER_BINDING_ARRAY
                | Features::STORAGE_RESOURCE_BINDING_ARRAY
                | Features::SAMPLED_TEXTURE_AND_STORAGE_BUFFER_ARRAY_NON_UNIFORM_INDEXING,
            memory_hints: wgpu::MemoryHints::Performance,
            required_limits: wgpu::Limits {
                max_buffer_size: MAX_BUFFER_SIZE,
                max_binding_array_elements_per_shader_stage: 8,
                ..Default::default()
            },
            ..Default::default()
        })
        .await
        .unwrap();

    execute_gpu_inner(&device, &queue, numbers).await
}

pub async fn execute_gpu_inner(
    device: &wgpu::Device,
    queue: &wgpu::Queue,
    numbers: &[f32],
) -> Vec<f32> {
    let (staging_buffers, storage_buffers, bind_group, compute_pipeline) = setup(device, numbers);

    let mut encoder =
        device.create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });
    {
        let mut cpass = encoder.begin_compute_pass(&wgpu::ComputePassDescriptor {
            label: Some("compute pass descriptor"),
            timestamp_writes: None,
        });
        cpass.set_pipeline(&compute_pipeline);
        cpass.set_bind_group(0, Some(&bind_group), &[]);

        cpass.dispatch_workgroups(MAX_DISPATCH_SIZE.min(numbers.len() as u32), 1, 1);
    }

    // Copy each storage buffer's results into its matching staging buffer so
    // they can be mapped and read back on the host.
    for (storage_buffer, staging_buffer) in storage_buffers.iter().zip(staging_buffers.iter()) {
        let stg_size = staging_buffer.size();

        encoder.copy_buffer_to_buffer(
            storage_buffer, // Source buffer
            0,
            staging_buffer, // Destination buffer
            0,
            stg_size,
        );
    }

    queue.submit(Some(encoder.finish()));

    for staging_buffer in &staging_buffers {
        let slice = staging_buffer.slice(..);
        slice.map_async(wgpu::MapMode::Read, |_| {});
    }

    device.poll(wgpu::PollType::Wait).unwrap();

    // Stitch the chunks back together into one contiguous Vec<f32>.
    let mut data = Vec::new();
    for staging_buffer in &staging_buffers {
        let slice = staging_buffer.slice(..);
        let mapped = slice.get_mapped_range();
        data.extend_from_slice(bytemuck::cast_slice(&mapped));
        drop(mapped);
        staging_buffer.unmap();
    }

    data
}

fn setup(
    device: &wgpu::Device,
    numbers: &[f32],
) -> (
    Vec<wgpu::Buffer>,
    Vec<wgpu::Buffer>,
    wgpu::BindGroup,
    wgpu::ComputePipeline,
) {
    let cs_module = device.create_shader_module(wgpu::include_wgsl!("shader.wgsl"));

    let staging_buffers = create_staging_buffers(device, numbers);
    let storage_buffers = create_storage_buffers(device, numbers);

    let (bind_group_layout, bind_group) = setup_binds(&storage_buffers, device);

    let compute_pipeline = setup_pipeline(device, bind_group_layout, cs_module);
    (
        staging_buffers,
        storage_buffers,
        bind_group,
        compute_pipeline,
    )
}

fn setup_pipeline(
    device: &wgpu::Device,
    bind_group_layout: wgpu::BindGroupLayout,
    cs_module: wgpu::ShaderModule,
) -> wgpu::ComputePipeline {
    let pipeline_layout = device.create_pipeline_layout(&wgpu::PipelineLayoutDescriptor {
        label: Some("Compute Pipeline Layout"),
        bind_group_layouts: &[&bind_group_layout],
        push_constant_ranges: &[],
    });

    device.create_compute_pipeline(&wgpu::ComputePipelineDescriptor {
        label: Some("Compute Pipeline"),
        layout: Some(&pipeline_layout),
        module: &cs_module,
        entry_point: Some("main"),
        compilation_options: Default::default(),
        cache: None,
    })
}

fn setup_binds(
    storage_buffers: &[wgpu::Buffer],
    device: &wgpu::Device,
) -> (wgpu::BindGroupLayout, wgpu::BindGroup) {
    let bind_group_entries: Vec<wgpu::BindGroupEntry> = storage_buffers
        .iter()
        .enumerate()
        .map(|(bind_idx, buffer)| wgpu::BindGroupEntry {
            binding: bind_idx as u32,
            resource: buffer.as_entire_binding(),
        })
        .collect();

    let bind_group_layout_entries: Vec<wgpu::BindGroupLayoutEntry> = (0..storage_buffers.len())
        .map(|bind_idx| wgpu::BindGroupLayoutEntry {
            binding: bind_idx as u32,
            visibility: wgpu::ShaderStages::COMPUTE,
            ty: wgpu::BindingType::Buffer {
                ty: wgpu::BufferBindingType::Storage { read_only: false },
                has_dynamic_offset: false,
                min_binding_size: None,
            },
            count: Some(NonZeroU32::new(1).unwrap()),
        })
        .collect();

    let bind_group_layout = device.create_bind_group_layout(&wgpu::BindGroupLayoutDescriptor {
        label: Some("Custom Storage Bind Group Layout"),
        entries: &bind_group_layout_entries,
    });

    let bind_group = device.create_bind_group(&wgpu::BindGroupDescriptor {
        label: Some("Combined Storage Bind Group"),
        layout: &bind_group_layout,
        entries: &bind_group_entries,
    });

    (bind_group_layout, bind_group)
}

fn calculate_chunks(numbers: &[f32], max_buffer_size: u64) -> Vec<&[f32]> {
    let max_elements_per_chunk = max_buffer_size as usize / std::mem::size_of::<f32>();
    numbers.chunks(max_elements_per_chunk).collect()
}

fn create_storage_buffers(device: &wgpu::Device, numbers: &[f32]) -> Vec<wgpu::Buffer> {
    let chunks = calculate_chunks(numbers, MAX_BUFFER_SIZE);

    chunks
        .iter()
        .enumerate()
        .map(|(e, seg)| {
            device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
                label: Some(&format!("Storage Buffer-{}", e)),
                contents: bytemuck::cast_slice(seg),
                usage: wgpu::BufferUsages::STORAGE
                    | wgpu::BufferUsages::COPY_DST
                    | wgpu::BufferUsages::COPY_SRC,
            })
        })
        .collect()
}

fn create_staging_buffers(device: &wgpu::Device, numbers: &[f32]) -> Vec<wgpu::Buffer> {
    let chunks = calculate_chunks(numbers, MAX_BUFFER_SIZE);

    (0..chunks.len())
        .map(|e| {
            let size = std::mem::size_of_val(chunks[e]) as u64;

            device.create_buffer(&wgpu::BufferDescriptor {
                label: Some(&format!("staging buffer-{}", e)),
                size,
                usage: wgpu::BufferUsages::MAP_READ | wgpu::BufferUsages::COPY_DST,
                mapped_at_creation: false,
            })
        })
        .collect()
}

#[cfg_attr(target_arch = "wasm32", allow(clippy::allow_attributes, dead_code))]
async fn run() {
    let numbers = {
        const BYTES_PER_GB: usize = 1024 * 1024 * 1024;
        // 4 bytes per f32
        let elements = (BYTES_PER_GB as f32 / 4.0) as usize;
        vec![0.0; elements]
    };
    assert!(numbers.iter().all(|n| *n == 0.0));
    log::info!("All 0.0s");
    let t1 = std::time::Instant::now();
    let results = execute_gpu(&numbers).await;
    log::info!("GPU RUNTIME: {}ms", t1.elapsed().as_millis());
    assert_eq!(numbers.len(), results.len());
    assert!(results.iter().all(|n| *n == 1.0));
    log::info!("All 1.0s");
}

pub fn main() {
    #[cfg(not(target_arch = "wasm32"))]
    {
        env_logger::init();
        pollster::block_on(run());
    }
}

#[cfg(test)]
#[cfg(not(target_arch = "wasm32"))]
mod tests;
```
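A quick sanity check of the dispatch arithmetic above, as a standalone sketch (constants copied from `MAX_DISPATCH_SIZE` here and from `OFFSET`/`@workgroup_size` in `shader.wgsl`): 65,535 workgroups of 256 threads, each thread striding over 256 elements, comfortably covers all 268,435,456 `f32`s:

```rust
// Standalone sketch: confirm the single dispatch in execute_gpu_inner can
// touch every element. Constants mirror mod.rs and shader.wgsl above.
fn main() {
    let workgroups: u64 = (1 << 16) - 1; // MAX_DISPATCH_SIZE = 65_535
    let threads_per_workgroup: u64 = 256; // @workgroup_size(256, 1, 1)
    let elements_per_thread: u64 = 256; // OFFSET in shader.wgsl

    let covered = workgroups * threads_per_workgroup * elements_per_thread;
    let total_elements: u64 = 1 << 28; // 268_435_456 f32s in 1 GiB

    assert!(covered >= total_elements, "dispatch would leave elements untouched");
    println!("can cover {covered} elements; need {total_elements}");
}
```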
examples/features/src/big_compute_buffers/shader.wgsl (new file)

Lines changed: 34 additions & 0 deletions

```wgsl
const OFFSET: u32 = 1u << 8u; // 256: elements handled per invocation
const BUFFER_MAX_ELEMENTS: u32 = 1u << 25u; // Think `buffer.len()`
const NUM_BUFFERS: u32 = 8u;
const TOTAL_SIZE: u32 = BUFFER_MAX_ELEMENTS * NUM_BUFFERS;


// `binding_array` requires a custom struct
struct ContiguousArray {
    inner: array<f32>
}

@group(0) @binding(0)
var<storage, read_write> storage_array: binding_array<ContiguousArray, NUM_BUFFERS>;


@compute @workgroup_size(256, 1, 1)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let base_index = global_id.x * OFFSET;

    for (var i = 0u; i < OFFSET; i++) {
        let index = base_index + i;

        if index < TOTAL_SIZE {
            let buffer_index = index / BUFFER_MAX_ELEMENTS;
            let inner_index = index % BUFFER_MAX_ELEMENTS;

            storage_array[buffer_index].inner[inner_index] = add_one(storage_array[buffer_index].inner[inner_index]);
        }
    }
}

fn add_one(n: f32) -> f32 {
    return n + 1.0;
}
```
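For intuition about the shader's `index / BUFFER_MAX_ELEMENTS` and `index % BUFFER_MAX_ELEMENTS` split, here is the same mapping mirrored on the host as a standalone sketch (`locate` is a hypothetical helper for illustration, not part of the example):

```rust
// Hypothetical host-side mirror of the shader's flat-index -> (buffer, offset) split.
const BUFFER_MAX_ELEMENTS: u32 = 1 << 25; // 33_554_432 f32s per buffer

fn locate(index: u32) -> (u32, u32) {
    (index / BUFFER_MAX_ELEMENTS, index % BUFFER_MAX_ELEMENTS)
}

fn main() {
    assert_eq!(locate(0), (0, 0)); // the very first element lives in buffer 0
    assert_eq!(locate(33_554_432), (1, 0)); // first element of buffer 1
    assert_eq!(locate(100_000_000), (2, 32_891_136)); // 100_000_000 - 2 * 33_554_432
}
```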
examples/features/src/big_compute_buffers/tests.rs (new file)

Lines changed: 35 additions & 0 deletions

```rust
use super::*;
use wgpu_test::{gpu_test, GpuTestConfiguration, TestParameters};

#[gpu_test]
static TWO_BUFFERS: GpuTestConfiguration = GpuTestConfiguration::new()
    .parameters(
        TestParameters::default()
            .features(
                Features::BUFFER_BINDING_ARRAY
                    | Features::STORAGE_RESOURCE_BINDING_ARRAY
                    | Features::SAMPLED_TEXTURE_AND_STORAGE_BUFFER_ARRAY_NON_UNIFORM_INDEXING,
            )
            .downlevel_flags(wgpu::DownlevelFlags::COMPUTE_SHADERS)
            .limits(wgpu::Limits {
                max_buffer_size: MAX_BUFFER_SIZE,
                max_binding_array_elements_per_shader_stage: 8,
                ..Default::default()
            }),
    )
    .run_async(|ctx| {
        // The test environment's GPU reports 134MB as the max storage buffer size:
        // https://github.com/gfx-rs/wgpu/actions/runs/11001397782/job/30546188996#step:12:1096
        const SIZE: usize = (1 << 27) / std::mem::size_of::<f32>() * 8;
        // Eight buffers' worth of 0.0s.
        let input = &[0.0; SIZE];

        async move { assert_execute_gpu(&ctx.device, &ctx.queue, input).await }
    });

async fn assert_execute_gpu(device: &wgpu::Device, queue: &wgpu::Queue, input: &[f32]) {
    let expected_len = input.len();
    let produced = execute_gpu_inner(device, queue, input).await;

    assert_eq!(produced.len(), expected_len);
    assert!(produced.into_iter().all(|v| v == 1.0));
}
```
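A standalone sketch double-checking the test's `SIZE` constant (the `* 8` gives eight full buffers' worth of `f32`s, matching the example's 1 GiB):

```rust
// Sanity-check the test's SIZE constant: (1 << 27) bytes is one buffer's
// capacity; dividing by 4 bytes per f32 and multiplying by 8 buffers
// should land on exactly 1 GiB of data.
fn main() {
    const SIZE: usize = (1 << 27) / std::mem::size_of::<f32>() * 8;
    assert_eq!(SIZE, 268_435_456); // total f32 elements
    assert_eq!(SIZE * std::mem::size_of::<f32>(), 1 << 30); // exactly 1 GiB
}
```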

examples/features/src/lib.rs

Lines changed: 1 addition & 0 deletions

```diff
@@ -4,6 +4,7 @@
 pub mod framework;
 pub mod utils;
 
+pub mod big_compute_buffers;
 pub mod boids;
 pub mod bunnymark;
 pub mod conservative_raster;
```
