Reducing JIT compilation times #565

keithrse · 2023-02-28T15:18:01Z

keithrse
Feb 28, 2023

Hi,

We're trying to use Mitsuba to simulate photon propagation, and we need to be able to put on the order of 10^6 photons into each render (all with different positions, directions, wavelengths, etc.). At the moment we're doing this in an inefficient manner, which involves XML files with 10^6 defined emitters. These scenes take hours to compile with Dr.Jit (for the cuda/llvm variants). We've been led to understand that we can replace 10^6 emitters with one emitter that has 10^6 positions passed to it in vector format. I've checked through the documentation and it's not entirely clear how to use this vectorised format.

So we currently have something like this:

<emitter type="spot">
<transform name="to_world">
<lookat origin="-0.086816, 1028.0, 102.75" target="-0.038295, 1028.99515, 102.835613" up="0,1,0" />
</transform>
<float name="cutoff_angle" value="0.3" />
</emitter>

but repeated 10^6 times in a single file.

How would we go about making this emitter emit from two different positions?

Like this, perhaps?

<emitter type="spot">
<transform name="to_world">
<lookat origin=[-0.086816, 1028.0, 102.75;-0.086816, 1028.0, 102.75] target=[-0.038295, 1028.99515, 102.835613;-0.038295, 1028.99515, 102.835613] up="0,1,0" />
</transform>
<float name="cutoff_angle" value="0.3" />
</emitter>

Thank You

njroussel · 2023-02-28T15:49:52Z

njroussel
Feb 28, 2023
Collaborator

Hi @keithrse

Before we dive into a way to circumvent the XML changes, is using a scalar_ variant not an option for you? Is the actual rendering just too slow?

There are a couple of ways in which you could circumvent this huge XML generation. Have you measured the parsing and scene loading? Is this also worth improving or is the JIT-compilation the main pain point for your use case?

0 replies

keithrse · 2023-02-28T16:01:48Z

keithrse
Feb 28, 2023
Author

Hi @njroussel

Yes, we do need to be able to use other variants. Specifically, we want to run the cuda variants, but we're also testing with llvm, for two reasons: 1) we run on heterogeneous clusters, and 2) we will likely need to run these large renders millions of times. So in essence, yes, the render is too slow using a scalar variant.

Now, as for timing, our results look like this:

2023-02-16 10:57:46 INFO  main  [Scene] Embree ready. (took 395ms)
2023-02-16 10:57:47 INFO  main  [xml.cpp:1451] Done loading XML file "/hepgpu6-data1/sbarre/mphys/timing/scaling/geometry_1000000.xml" (took 1.1859m).
2023-02-16 10:57:50 INFO  main  [AdjointIntegrator] Starting render job (1024x1024, 16 samples)
2023-02-17 00:26:32 INFO  main  [AdjointIntegrator] Code generation finished. (took 13.474h)
2023-02-17 00:43:53 INFO  main  [AdjointIntegrator] Rendering finished. (took 16.817m)

This was with 10^6 spot-type emitters; the XML file is about 200 MB. We were running with the llvm_mono variant on a single thread (as a test), and as you can see, it's the Dr.Jit compilation that's taking the time.

Thanks

Keith

6 replies

njroussel Mar 1, 2023
Collaborator

This ☝️ is the in-depth answer as to what is currently happening in your setup.

Depending on what you're doing, writing everything yourself from scratch by using some Mitsuba "primitives" like Vector3f, Ray3f, ray_intersect() will remove a huge amount of complexity from the system (some of which creates longer JIT compilations).
This tutorial is a great example of how you can pick and choose exactly what you want/need from the entire system. Taking a scripting approach is also usually easier to debug than writing a custom plugin or using existing plugins for whatever you're trying to achieve.
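
As a rough illustration of that scripting approach, the sketch below builds one wavefront of rays from per-photon origins and targets and intersects it with a scene in a single call, instead of declaring one emitter plugin per photon. It only finds the first intersection of each ray; it is not a replacement for a full integrator. The function names `columns` and `trace_photons`, the default variant `llvm_ad_rgb`, and the placeholder scene in the comment are assumptions made for illustration and do not come from this thread.

```python
def columns(points):
    """Split [(x, y, z), ...] into three per-axis lists."""
    xs, ys, zs = zip(*points)
    return list(xs), list(ys), list(zs)


def trace_photons(scene_dict, origins, targets, variant="llvm_ad_rgb"):
    """Intersect one ray per photon with the scene (first hit only).

    Assumption: `variant` names a JIT variant available in the local build.
    """
    # Imported lazily so the helper above can be used without Mitsuba installed.
    import drjit as dr
    import mitsuba as mi

    mi.set_variant(variant)
    scene = mi.load_dict(scene_dict)

    # One Dr.Jit array per coordinate; each entry corresponds to one photon.
    ox, oy, oz = (mi.Float(c) for c in columns(origins))
    tx, ty, tz = (mi.Float(c) for c in columns(targets))

    o = mi.Point3f(ox, oy, oz)
    d = dr.normalize(mi.Vector3f(tx - ox, ty - oy, tz - oz))

    # A single vectorized call traces every photon at once.
    si = scene.ray_intersect(mi.Ray3f(o, d))
    return si.is_valid(), si.t


# Example call (requires Mitsuba 3; the one-sphere scene is a placeholder):
# hit, t = trace_photons({"type": "scene", "target": {"type": "sphere"}},
#                        [(0.0, 0.0, -5.0)], [(0.0, 0.0, 0.0)])
```

Because positions and directions are ordinary array data here, adding more photons changes the array length rather than the number of plugin instances in the scene.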

njroussel Mar 3, 2023
Collaborator

We've changed most emitters & sensors to use opaque variables: #571. This should significantly reduce the JIT compilation times in your examples. You will need to rebuild master to see this change.

keithrse Mar 3, 2023
Author

Thanks Nicolas. Yes, this modification has drastically reduced the Dr.Jit compilation time and seems to have reduced the XML parsing time as well. We're still in the minutes for compilation and seconds for rendering; however, this is a significant step forward.

I'll include the output so you can see the progress!

-bash-4.2$ mitsuba -m llvm_mono geometry_1000000.xml -t 1
2023-03-03 13:37:12 INFO  main  [mitsuba.cpp:334] Mitsuba version 3.2.1 (master[c864e08f], Linux, 64bit, 1 thread, 16-wide SIMD)
2023-03-03 13:37:12 INFO  main  [mitsuba.cpp:335] Copyright 2022, Realistic Graphics Lab, EPFL
2023-03-03 13:37:12 INFO  main  [mitsuba.cpp:336] Enabled processor features: cuda llvm avx512 avx2 avx fma f16c sse4.2 x86_64
2023-03-03 13:37:12 INFO  main  [xml.cpp:1433] Loading XML file "geometry_1000000.xml" with variant "llvm_mono"..
2023-03-03 13:40:18 WARN  main  [HDRFilm] Monochrome mode enabled, setting film output pixel format to 'luminance' (was rgb).
2023-03-03 13:40:25 INFO  main  [Scene] Embree ready. (took 559ms)
2023-03-03 13:40:26 INFO  main  [xml.cpp:1451] Done loading XML file "geometry_1000000.xml" (took 3.2288m).
2023-03-03 13:40:29 INFO  main  [AdjointIntegrator] Starting render job (1024x1024, 16 samples)
2023-03-03 13:49:14 INFO  main  [AdjointIntegrator] Code generation finished. (took 8.7414m)
2023-03-03 13:49:58 INFO  main  [AdjointIntegrator] Rendering finished. (took 44.603s)
2023-03-03 13:49:58 INFO  main  [HDRFilm] ✔  Developing "geometry_1000000.exr" ..
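
For a rough sense of the improvement, the "took" durations of the two integrator stages can be compared between the earlier llvm_mono log and the one above. The sketch below is illustrative only: the helper `took_seconds` and its regular expression are hypothetical and not part of Mitsuba, and the log lines are abbreviated copies of the ones posted in this thread.

```python
import re

# Matches the "(took 13.474h)" style suffix printed in the logs.
TOOK = re.compile(r"\(took ([0-9.]+)(ms|s|m|h)\)")
SECONDS_PER_UNIT = {"ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}


def took_seconds(line):
    """Return the duration reported in a log line in seconds, or None."""
    match = TOOK.search(line)
    if match is None:
        return None
    value, unit = match.groups()
    return float(value) * SECONDS_PER_UNIT[unit]


# Before the change (log from 2023-02-16/17).
before = {
    "code generation": "[AdjointIntegrator] Code generation finished. (took 13.474h)",
    "rendering": "[AdjointIntegrator] Rendering finished. (took 16.817m)",
}
# After rebuilding master (log from 2023-03-03).
after = {
    "code generation": "[AdjointIntegrator] Code generation finished. (took 8.7414m)",
    "rendering": "[AdjointIntegrator] Rendering finished. (took 44.603s)",
}

speedups = {
    stage: took_seconds(before[stage]) / took_seconds(after[stage])
    for stage in before
}
for stage, factor in speedups.items():
    print(f"{stage}: {factor:.1f}x faster")
```

Both runs used a single thread (`-t 1`), so the ratios compare like with like.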

njroussel Mar 3, 2023
Collaborator

Fantastic!

Is there a specific reason why you are rendering with just one thread? (-t 1 in the command)
Scene parsing/loading is done in parallel by default; the thread limit is just blocking that feature.

A compilation time of 8 minutes is still quite a lot. Could you tell me more about your scene's contents? Other than their position, do the spot lights differ in any other way? Are there other plugins which are used more than once?

keithrse Mar 3, 2023
Author

Hi Nicolas,

I was just running on one thread to get results comparable to previous tests; obviously, for production runs we'd use the maximum number of threads possible.

As far as I know, the spots only differ in position and direction. In reality they would also differ in wavelength; however, we're currently using mono variants.

Thanks

Keith

keithrse · 2023-03-07T15:08:07Z

keithrse
Mar 7, 2023
Author

Hi Nicolas,

I've been using the most recent updates you've applied and, as I said, they now allow us to run 10^6 emitter renders, at least using llvm, which is great. However, I'm struggling to run the same renders using cuda variants, and I'm trying to understand why.

I think the issue is due to the large amount of memory the kernel seems to require during compilation. I've attached a 10^6 emitter output from an llvm variant. The JIT compiler reports a table size of 27 GB; if this were the same for a cuda variant and held on the device, it would easily exceed our GPU's global memory. I've profiled a 10^6 emitter cuda variant render (which did not complete), and it certainly saturates the GPU's memory but does not crash. I think the JIT compiler is (maybe) overflowing into system memory and trying to transfer data from host to device and back, which would explain the slowness, but that's just a guess.

If I'm vaguely correct about the cause of the issue, how would I go about reducing the memory usage?

1 reply

njroussel Mar 7, 2023
Collaborator

Hi @keithrse

The 27GiB is the JIT's memory, which is used to track variables and their types. This is host memory.

When running the CUDA variant, do you not get any logs at all? If it were to overflow as you describe, I believe you should see this warning message.

Also, if you run your setup through a debugger, could you tell me where exactly it is "hanging"? There are a few different steps in the JIT-compilation process.

I would recommend that you implement your own custom Emitter which is an aggregate of however many emitters you need. You could feed it your own binary representation which would reduce the load on the XML parser and scene loading. In addition, from my understanding, some of the spot class members are not relevant for you and some that are useful to you are actually identical across all your emitters. Given the scale of your scene, removing/reducing these allocations might be what you need if you are truly memory bound.
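
One possible shape for such a binary representation is sketched below. The file layout (a uint32 count followed by six little-endian float32 values per photon: origin, then target), the function names, and the file name are all assumptions made for illustration. Mitsuba does not define this format, and a custom emitter would still need to be written to consume it. Note also that float32 keeps only about seven significant digits of each coordinate.

```python
import os
import struct
import tempfile

# Hypothetical record: ox, oy, oz, tx, ty, tz as little-endian float32.
RECORD = struct.Struct("<6f")


def write_photons(path, photons):
    """Write [(origin, target), ...] as a count header plus fixed-size records."""
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(photons)))
        for origin, target in photons:
            f.write(RECORD.pack(*origin, *target))


def read_photons(path):
    """Read the file back into a list of (origin, target) tuples."""
    with open(path, "rb") as f:
        (count,) = struct.unpack("<I", f.read(4))
        data = f.read(count * RECORD.size)
    return [(rec[:3], rec[3:]) for rec in RECORD.iter_unpack(data)]


photons = [
    # Origin and target taken from the spot emitter posted in this thread.
    ((-0.086816, 1028.0, 102.75), (-0.038295, 1028.99515, 102.835613)),
    # Made-up second photon, for illustration only.
    ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0)),
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "photons.bin")
    write_photons(path, photons)
    file_size = os.path.getsize(path)
    roundtrip = read_photons(path)

print(file_size, "bytes for", len(roundtrip), "photons")
```

With this layout each photon costs 24 bytes, so 10^6 photons would take about 24 MB, compared with the roughly 200 MB XML file reported earlier in the thread.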
