Extracting gradients per intersection #975

lesphere · 2023-11-13T14:30:26Z

Hi,

I'd like to get information per ray per depth, i.e. per intersection.

I have two needs.

For each intersection, I want to get the gradient of radiance w.r.t. some scene parameters, e.g. reflectance of a bsdf.
Besides, I want to get the gradient of bsdf w.r.t. the same parameters.

For the first question, I expect to get gradients per intersection.

def sample(...):
    for i in range(self.max_depth):
        dr.backward_from(Lo)     # Lo depends on param and on the ray; dr.shape(Lo) = [3, n_ray]
        grad_L = dr.grad(param)  # dr.shape(param) = [3, 1]; goal: grad_L with shape [3, n_ray]

For the second question, I want to calculate the second gradient with the same computation graph.

def sample(...):
    for i in range(self.max_depth):
        dr.backward_from(bsdf_val, flags=dr.ADFlag.ClearVertices)
        grad_bsdf = dr.grad(param)
        dr.set_grad(opt[key], 0)
        dr.backward_from(Lo)
        grad_L = dr.grad(param)

But after sample(), dr.grad(param) returns 0 regardless of the values of grad_bsdf and grad_L. Why?

I also have some questions about how the JIT works and how threads synchronize and share data.

How are gradients of different rays accumulated into a single gradient?
Moreover, I already know that Dr.Jit records the operations into a computation graph, compiles it into a kernel, and then runs it in parallel. But how do different threads communicate and share their data? How is the Python code executed in detail? In which cases will the code be compiled into more than one kernel? What happens under the hood?

My questions may be ambiguous. Let me know if you need more explanation.

Thank you in advance!


njroussel · 2023-11-14T07:49:17Z

njroussel
Nov 14, 2023
Collaborator

Hi @lesphere

Gradients are accumulated/summed. The gradient always has the same shape & type as the original primal value. If you want to know the gradient per thread/ray, you can introduce a dummy variable that is simply a repetition of your original parameter by the width of your kernel (number of threads/rays). Something like this:

dummy_params = dr.repeat(my_param, dr.width(rays))
output = f(dummy_params) # some differentiable computation
dr.set_grad(output, 1)
dr.backward_to(dummy_params)
grad = dr.grad(dummy_params)
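
To make the difference between the summed gradient and the per-ray gradient concrete, here is a minimal plain-Python sketch. It does not use Dr.Jit: the function `render`, the per-ray factors in `coeffs` and the finite-difference helper `grad_of_sum` are all hypothetical stand-ins, and seeding every output with a gradient of 1 is modelled as differentiating the sum of the outputs.

```python
def render(params, coeffs):
    """One output per ray; ray i reads params[i % len(params)] (hypothetical model)."""
    return [params[i % len(params)] * c for i, c in enumerate(coeffs)]


def grad_of_sum(params, coeffs, eps=1e-6):
    """Finite-difference gradient of sum(render(...)) w.r.t. each entry of params."""
    base = sum(render(params, coeffs))
    grads = []
    for k in range(len(params)):
        bumped = list(params)
        bumped[k] += eps
        grads.append((sum(render(bumped, coeffs)) - base) / eps)
    return grads


coeffs = [0.5, 2.0, 3.0]  # hypothetical per-ray factors, one per ray

# Width-1 parameter: every ray reads the same entry, so its gradient
# is the sum of the contributions of all rays.
shared_grad = grad_of_sum([0.8], coeffs)

# Repeated parameter: one copy per ray, so each copy only receives
# the contribution of its own ray.
per_ray_grad = grad_of_sum([0.8] * len(coeffs), coeffs)
```

In this model the single entry of `shared_grad` equals the sum of the entries of `per_ray_grad`, which is the accumulation described above.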

For your second question, there is no implicit mechanism that will clear gradients. When are you checking dr.grad(params)? The gradient is reset after the optimizer step, as expected. It can also be zeroed out by some other differentiation operation if the correct ADFlag values aren't passed, but it looks like you're well aware of this. These are the only two operations that come to mind which would clear a gradient value.

Finally, regarding these questions:

Moreover, I already know that Dr.Jit records the operations into a computation graph, compiles it into a kernel, and then runs it in parallel. But how do different threads communicate and share their data? How is the Python code executed in detail? In which cases will the code be compiled into more than one kernel? What happens under the hood?

If you haven't already, I'd recommend reading through this gentle introduction to Dr.Jit. You'll learn how to get log messages on every kernel launch, so that you know whether your code is running multiple kernels or not. For even more details, you can have a look at the paper or video.
Threads don't really communicate with each other; they rarely do so directly. As you might have noticed, with Dr.Jit you write SIMD-like code: you basically write what operations a single thread should do. If they do communicate, it's typically through some global memory.
What this means in practice is that we have a single copy of the scene in memory, and all threads/rays read from that same memory region and do their operations independently. Finally, when they have computed some radiance value they atomically add it to some "shared" (still global memory) output buffer.
I hope this helps.
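
The execution model described above can be imitated with a small plain-Python sketch. This is only an analogy and not how Dr.Jit is implemented: the `scene` dictionary, the `trace` function and the two-pixel `output` buffer are hypothetical, and a `threading.Lock` stands in for a hardware atomic add.

```python
import threading

scene = {"albedo": 0.5}   # single shared copy of the "scene", only ever read
output = [0.0, 0.0]       # shared output buffer with two "pixels"
lock = threading.Lock()   # stands in for an atomic add


def trace(ray_id):
    # Each "thread" does its work independently and only reads the scene.
    radiance = scene["albedo"] * (ray_id + 1)
    pixel = ray_id % len(output)
    # The accumulation into the shared buffer is the only interaction
    # between threads.
    with lock:
        output[pixel] += radiance


threads = [threading.Thread(target=trace, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Once all threads have joined, `output` holds the per-pixel sums, and the order in which the threads ran does not matter for this example.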

9 replies

lesphere Nov 14, 2023
Author

For the second question, I check dr.grad(param) right after dr.backward(...).

I wonder what will happen if I call dr.printf_async(dr.grad(param)) inside sample(). Will each thread print its own gradient (one per ray in total) or will there be just one gradient? If it's the former, is there a way to compute per-ray gradients from them?

njroussel Nov 14, 2023
Collaborator

Every thread will print the accumulated sum of the per ray gradients. As I said in my previous response, the width of the gradient matches exactly the width of the original type. If your parameter is a single BSDF color, then dr.grad(color_param) will always exactly be 3 x 1. The only way to compute some per-ray metric is to have a per-ray parameter. This can quite easily be achieved by repeating your parameter as I explained before. However, if at any point in the AD graph all the threads traverse some shared vertex of width 1, then the result you'll be computing is most likely not what you'd want because the intermediary result of that shared vertex would again be the accumulation of all threads.
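
The caveat about a shared vertex of width 1 can be sketched in plain Python. This is not Dr.Jit code: `grad_of_sum` is a hypothetical finite-difference helper, and the mean over the repeated parameter is just one assumed way in which a width-1 intermediate could appear in the graph.

```python
def grad_of_sum(f, params, eps=1e-6):
    """Finite-difference gradient of sum(f(params)) w.r.t. each entry of params."""
    base = sum(f(params))
    grads = []
    for k in range(len(params)):
        bumped = list(params)
        bumped[k] += eps
        grads.append((sum(f(bumped)) - base) / eps)
    return grads


coeffs = [0.5, 2.0, 3.0]  # hypothetical per-ray factors


def independent(params):
    # Every ray only touches its own copy of the parameter.
    return [p * c for p, c in zip(params, coeffs)]


def through_shared_vertex(params):
    # Assumed width-1 intermediate that all rays depend on.
    shared = sum(params) / len(params)
    return [shared * c for c in coeffs]


params = [0.8, 0.8, 0.8]  # parameter repeated once per ray
grad_independent = grad_of_sum(independent, params)
grad_mixed = grad_of_sum(through_shared_vertex, params)
```

In this model `grad_independent` recovers one distinct value per ray, while every entry of `grad_mixed` is the same, because the contributions of all rays were accumulated at the shared intermediate before reaching the repeated parameter.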

lesphere Nov 16, 2023
Author

I tried to implement like this:

import drjit as dr
import mitsuba as mi

mi.set_variant('cuda_ad_rgb')

scene = mi.load_file('scenes/cbox.xml')
params = mi.traverse(scene)

key = 'white.reflectance.value'

param_r = dr.repeat(params[key], 2)
params[key] = param_r
params.update()

o = mi.Point3f(0, 0, 4)
d = mi.Vector3f([[0, 0.25], 0, -1])

ray = mi.Ray3f(o, d)
si = scene.ray_intersect(ray)
bsdf = si.bsdf(ray)

wo = mi.Vector3f(1, 0, 1)
wo = si.to_local(wo)

bsdf_ctx = mi.BSDFContext()
bsdf_val = bsdf.eval(bsdf_ctx, si, wo)

But I got an error below:

Traceback (most recent call last):
File "...", line 48, in
bsdf_val = bsdf.eval(bsdf_ctx, si, wo)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: jit_var_vcall(): the virtual function call associated with instance 4 accesses an evaluated variable r2234 of type float32 and size 2. However, only scalar (size == 1) evaluated variables can be accessed while recording virtual function calls

Is there a way to work around this? Can I have parameter arrays that behave like other CUDA arrays (e.g. ray.d)?

njroussel Nov 17, 2023
Collaborator

I don't think there is an easy workaround. The code is failing here because the plugin expects a 1-sized parameter, and not something wider. Maybe disabling symbolic vcalls (dr.set_flag(dr.JitFlag.VCallRecord, False)) will help, but even with that I would assume that some other parts of existing code would break because of this unexpected change in parameter width.

I don't know what your final goal is, but I don't think there is a way in which you could make this per-ray gradient tracking work in a conventional Mitsuba setup (scene, plugins, etc.). You're better off writing whatever you need from scratch with Mitsuba "primitives". This might seem like a lot, but depending on what exactly you want to render, it can be quite straightforward. Take a look at this tutorial: we don't use any of the conventional Mitsuba pipelines, we only use certain types and the ray intersection routine.

However, if at any point in the AD graph all the threads traverse some shared vertex of width 1, then the result you'll be computing is most likely not what you'd want because the intermediary result of that shared vertex would again be the accumulation of all threads.

This assumption absolutely does not hold in Mitsuba plugins in general, as you've seen with the diffuse BSDF.

Answer selected by lesphere

lesphere Nov 17, 2023
Author

Thank you for your patience!

So, my custom BSDF plugin cannot inherit from mi.BSDF, and I need to implement my own base BSDF class, right?

njroussel Nov 17, 2023
Collaborator

I think it can still inherit from mi.BSDF.

Thinking about it a bit more, I just think this would be very awkward to write.
Normally (where you have 1-wide parameters) you would have something like this:

def eval(self, si):
    return self.reflectance * some_coefficient

Now (where you have N-wide parameters) you would need to write:

def eval(self, si):
    idx = dr.arange(mi.UInt32, dr.width(si)) # Assumes that dr.width(si) == number of rays
    reflectance = dr.gather(mi.Color3f, self.repeated_reflectance, idx)
    return reflectance * some_coefficient

lesphere Nov 20, 2023
Author

I tried an eval() like yours, but it doesn't work as expected. See the code below:

import drjit as dr
import mitsuba as mi

mi.set_variant('cuda_ad_rgb')

scene = mi.load_file('scenes/cbox.xml')

params = mi.traverse(scene)

key = 'white.reflectance.value'

param_r = dr.repeat(params[key], 2)
params[key] = param_r
dr.enable_grad(params[key])
params.update()

o = mi.Point3f(0, 0, 4)
o = dr.repeat(o, 2)
d = mi.Vector3f(0, 0, -1)
d = dr.repeat(d, 2)

ray = mi.Ray3f(o, d)

si = scene.ray_intersect(ray)

bsdf = si.bsdf(ray)

wo = mi.Vector3f(0, 0, 1)
wo = si.to_local(wo)

bsdf_ctx = mi.BSDFContext()
bsdf_val = bsdf.eval(bsdf_ctx, si, wo)

dr.backward(bsdf_val)

dr.eval(bsdf_val)

print(f'dr.grad(params[key]) = {dr.grad(params[key])}')

Here is the code of eval() of BSDF:

def eval(self, ctx, si, wo, active):
    if not ctx.is_enabled(mi.BSDFFlags.DiffuseReflection):
        return mi.Spectrum(0)

    cos_theta_i = mi.Frame3f.cos_theta(si.wi)
    cos_theta_o = mi.Frame3f.cos_theta(wo)

    active &= (cos_theta_i > 0) & (cos_theta_o > 0)

    idx = dr.arange(mi.UInt32, dr.width(si))
    reflectance_r = self.m_reflectance.eval(si, active)
    reflectance = dr.gather(mi.Color3f, reflectance_r, idx)
    value = reflectance * dr.inv_pi * cos_theta_o

    return value & active

There's no RuntimeError from jit_var_vcall() anymore. But all the gradients are accumulated into the first parameter entry, and the gradient of the second entry remains 0. Moreover, inside eval(), dr.width(si) is always 1, while dr.width(si) and dr.width(ray) are both 2 outside. I think this is because bsdf and si are in one-to-one correspondence, so the first bsdf will only eval() the first wi, and so on. As a result, idx is always 0, and the second entry is never used.
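
The symptom can be reproduced with a small plain-Python model. It only imitates the observed behaviour and makes no claim about how Dr.Jit dispatches the call: `backward_through_gather` is a hypothetical helper in which every ray sends a unit gradient back to the parameter entry it read, and a width-1 index is assumed to be broadcast to all rays.

```python
def backward_through_gather(n_entries, index, n_rays):
    """Accumulate a unit gradient from every ray into the entry that ray read."""
    grad = [0.0] * n_entries
    for ray in range(n_rays):
        # Assumption: an index of width 1 is broadcast to all rays.
        entry = index[ray] if len(index) == n_rays else index[0]
        grad[entry] += 1.0
    return grad


n_rays = 2
idx_intended = list(range(n_rays))  # index built from the number of rays: [0, 1]
idx_observed = list(range(1))       # index built from a width of 1: [0]

grad_intended = backward_through_gather(2, idx_intended, n_rays)
grad_observed = backward_through_gather(2, idx_observed, n_rays)
```

With the intended index each entry receives the gradient of its own ray, whereas with the width-1 index everything lands in entry 0 and entry 1 stays at zero, matching what I see.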

lesphere Nov 22, 2023
Author

Hi,

is it possible to get the per-ray gradients before atomically accumulating into global shared memory?
