-
Hi, I have two simple questions about memory management in Mitsuba in 'cuda_ad_rgb' mode:

Question 1: When calling …

Question 2: I have examples where the GPU memory usage increases at every iteration (similar to a leak), and it sometimes runs out of memory after many iterations. How can I avoid this? Thanks
-
Question 1: Are you using the …

Question 2: You will need to schedule the sampler state after every call to render and force the evaluation of the resulting image.
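To see why forcing evaluation matters, here is a toy, stdlib-only model of lazy evaluation: each un-evaluated result keeps its whole computation graph alive, so retained memory grows every iteration unless the result is materialized. This only mimics the behaviour described above; the `Lazy` class is illustrative and is not the real Dr.Jit API (in practice one would call `dr.schedule()`/`dr.eval()` on the rendered image).

```python
class Lazy:
    """A deferred value that references its inputs until evaluated."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = list(parents)   # retained graph edges

    def eval(self):
        """Materialize the value and drop references to the graph."""
        self.parents.clear()
        return self.value

def graph_size(node):
    """Count nodes reachable from `node` (a proxy for retained memory)."""
    seen, stack = set(), [node]
    while stack:
        n = stack.pop()
        if id(n) not in seen:
            seen.add(id(n))
            stack.extend(n.parents)
    return len(seen)

def iterate(n_iters, force_eval):
    state = Lazy(0.0)
    for _ in range(n_iters):
        # One "render" step that depends on the previous state
        state = Lazy(state.value + 1.0, parents=[state])
        if force_eval:
            # Analogous to evaluating the image after each render call
            state.eval()
    return graph_size(state)
```

Without per-iteration evaluation the retained graph grows linearly with the number of iterations; with it, the retained graph stays constant in size.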
-
Q1: Yes, but even for forward rendering, the minimum possible unit of computation is 'samples_per_pass=1', and that might not be small enough at high resolution and high spp. Is there any other way to make the computation units smaller?

Q2: Aren't these already handled in mi.render()? Here: mitsuba3/src/render/integrator.cpp Line 663 in 2d89762
-
A key limit is that OptiX cannot trace more than 1 billion rays in a single kernel. Dr.Jit can deal with at most 4 billion samples at a time. In the future, it might be possible to tweak Dr.Jit's OptiX backend so that it can launch up to 4 kernels in sequence without re-tracing to at least reach 2^32 rays.

However, at that point the end is truly reached. You will need multiple passes if you want more samples. If you are rendering more than 2^32 pixels at a time, you will have to devise your own method of rendering multiple image sub-regions and piecing the results back together. However, I think you will have difficulties opening the resulting image with standard tools.

We will not be adapting the spiral class to the GPU; it is meant for interactive CPU rendering (which we in any case don't even support in Mitsuba at the moment).
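The multi-pass approach suggested above can be sketched as follows: split the total sample budget into passes that each fit in one kernel launch, render each pass with a distinct seed, and average the per-pass images weighted by their sample counts. `render_pass` is a hypothetical stand-in for a call like `mi.render(scene, spp=..., seed=...)`; the splitting and averaging logic is the point.

```python
def split_spp(total_spp, max_spp_per_pass):
    """Break `total_spp` into pass sizes that each fit in one kernel."""
    passes = []
    remaining = total_spp
    while remaining > 0:
        n = min(remaining, max_spp_per_pass)
        passes.append(n)
        remaining -= n
    return passes

def render_in_passes(render_pass, total_spp, max_spp_per_pass):
    """Average per-pass images, weighted by their sample counts.

    Each image is treated as a flat list of pixel values; in practice
    one would also evaluate/detach each pass result before accumulating
    so its computation graph does not stay alive across passes.
    """
    passes = split_spp(total_spp, max_spp_per_pass)
    accum = None
    for seed, spp in enumerate(passes):
        img = render_pass(spp=spp, seed=seed)
        contrib = [p * spp for p in img]
        accum = contrib if accum is None else [a + c for a, c in zip(accum, contrib)]
    return [a / total_spp for a in accum]
```

With a per-pass cap corresponding to the kernel limit, this reaches arbitrarily large effective sample counts while each launch stays below the limit.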
-
By the way, as an idea, I still think a Spiral-based rendering would be helpful on the GPU backends; basically, it could be orthogonal to any backend. It is true that with samples_per_pass=1 you can fit your rendering in GPU memory in most cases, even with high spp and high resolution using path tracing, but once you have additional components in PyTorch (for example BRDF parameters etc.) the computation would be too expensive even at samples_per_pass=1. I wonder what would be the best way to break the computation into even smaller pieces, other than Spiral...
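The tile-based idea above can be sketched backend-agnostically: render rectangular crop windows one at a time and stitch them back together. The `render_tile` callable is a hypothetical stand-in for a renderer invoked with a crop offset and size (e.g. via a film's crop window); only the tiling and reassembly logic is shown.

```python
def make_tiles(width, height, tile):
    """Yield (x, y, w, h) crop windows covering the full image."""
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            yield x, y, min(tile, width - x), min(tile, height - y)

def render_tiled(render_tile, width, height, tile):
    """Assemble a full image from independently rendered tiles.

    The image is a list of rows; each tile is rendered separately
    (bounding peak memory by the tile size) and copied into place.
    """
    image = [[0.0] * width for _ in range(height)]
    for x, y, w, h in make_tiles(width, height, tile):
        patch = render_tile(x, y, w, h)   # h rows of w pixels
        for j in range(h):
            image[y + j][x:x + w] = patch[j]
    return image
```

Since each tile is an independent render, this composes with the per-pass splitting discussed earlier: peak memory is bounded by one tile at one pass's sample count.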