-
I want to implement a 64x64 matrix-vector product in the megakernel of drjit. Here is how I do it:

```python
import drjit as dr
import mitsuba as mi
from typing import List

class Linear:
    def __init__(self, input_dims, output_dims):
        self.weight = dr.ones(mi.TensorXf, shape=(output_dims, input_dims))

    def __call__(self, x: List[mi.Float]) -> List[mi.Float]:
        res = []
        for i in range(self.weight.shape[0]):
            row_vec = dr.ravel(self.weight[i])  # flatten the i-th row
            v = mi.Float(0.)
            for j in range(len(row_vec)):
                # row_vec[j] reads a single entry back to Python, so it
                # gets traced as a literal constant in the generated kernel
                v = dr.fma(row_vec[j], x[j], v)
            res.append(v)
        return res
```

However, it seems that each time the weights change, a new kernel has to be compiled. My question is: how can I implement the matrix-vector product so that it achieves the fastest performance, while producing kernel code that can be reused across different weight values?
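For context on the recompilation issue, here is an illustrative sketch (not part of the original snippet): Dr.Jit traces plain Python scalars as literal constants baked into the generated kernel, whereas `dr.opaque` creates a variable that is bound as a kernel parameter, so the same kernel can be reused when its value changes:

```python
import drjit as dr
import mitsuba as mi

mi.set_variant('cuda_ad_rgb')  # or 'llvm_ad_rgb'

x = dr.arange(mi.Float, 4)

y1 = x * 2.0                  # 2.0 is baked into the kernel as a literal;
                              # switching to 3.0 compiles a second kernel

w = dr.opaque(mi.Float, 2.0)  # opaque scalar, passed as a kernel argument
y2 = x * w                    # the compiled kernel is reused when w changes
```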
-
Hi @gerwang

I'm confused by what you're currently trying to achieve. Could you share an example snippet using this `Linear` class?

One issue I see is that the tensor support in Dr.Jit is limited. I typically recommend directly accessing its underlying (flat) buffer with `.array`.
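For illustration, a minimal sketch of what that looks like, assuming an initialized CUDA or LLVM variant (the 64x64 shape is just an example):

```python
import drjit as dr
import mitsuba as mi

mi.set_variant('cuda_ad_rgb')  # or 'llvm_ad_rgb'

w = dr.ones(mi.TensorXf, shape=(64, 64))
buf = w.array  # underlying flat mi.Float buffer of length 64 * 64

# Read entry (i, j) via a gather on the flat buffer instead of tensor
# indexing; the gather is executed inside the kernel at runtime
i, j = 3, 5
w_ij = dr.gather(mi.Float, buf, i * w.shape[1] + j)
```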
-
I want to perform inference on a small (tiny-cuda-nn scale) MLP network within the megakernel of drjit. The specific invocation of the network happens during rendering: from an external perspective, the entire rendering operation is initiated by calling `mi.render()`. Some considerations for my case:

Could you please comment on whether this goal is feasible in mitsuba3, and share some tips to improve my current implementation? Thank you!
Aha, I see!
This is definitely feasible.
I've modified the snippet you sent:
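A minimal sketch of one way such a modification could look, assuming the flat-buffer approach from above (weights in a flat `mi.Float` buffer, made opaque, fetched with `dr.gather` inside the loop); this is an illustration rather than the exact modified snippet:

```python
import drjit as dr
import mitsuba as mi
from typing import List

mi.set_variant('cuda_ad_rgb')  # or 'llvm_ad_rgb'

class Linear:
    def __init__(self, input_dims: int, output_dims: int):
        self.input_dims = input_dims
        self.output_dims = output_dims
        # Row-major flat weight buffer instead of a TensorXf
        self.weight = dr.ones(mi.Float, output_dims * input_dims)
        # Opaque: the values are bound as a kernel argument rather than
        # baked in as literals, so the compiled kernel stays reusable
        dr.make_opaque(self.weight)

    def __call__(self, x: List[mi.Float]) -> List[mi.Float]:
        res = []
        for i in range(self.output_dims):
            v = mi.Float(0.)
            for j in range(self.input_dims):
                w_ij = dr.gather(mi.Float, self.weight,
                                 i * self.input_dims + j)
                v = dr.fma(w_ij, x[j], v)
            res.append(v)
        return res

# Example usage: a 64-dimensional input broadcast over the wavefront
layer = Linear(64, 64)
y = layer([mi.Float(1.0)] * 64)
```

Since the gathered values live in a buffer rather than in the generated code, re-running with new weights of the same shape should produce an identical trace and hit Dr.Jit's kernel cache.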