Skip to content

Conversation

@tgrant-nv
Copy link
Contributor

Description

This PR is a continuation of #1829, updated to include the recently added triangle mesh support. It enables full path tracing support for the OptiX backend in testrender. We have tried to share code between the CPU and OptiX backends where practical. There is more sharing in this PR than there was in #1829, which should reduce the maintenance burden a bit.

ID-based dispatch

Virtual function calls aren't well supported in OptiX, so rather than using regular C++ polymorphism to invoke the sample(), eval(), and get_albedo() functions for each of the BSDF sub-types, we manually invoke the correct function based on the closure ID (which we have added as a member of the BSDF class).

#define BSDF_CAST(BSDF_TYPE, bsdf) reinterpret_cast<const BSDF_TYPE*>(bsdf)

OSL_HOSTDEVICE Color3
CompositeBSDF::get_albedo(const BSDF* bsdf, const Vec3& wo) const
{
    Color3 albedo(0);
    switch (bsdf->id) {
    case DIFFUSE_ID:
        albedo = BSDF_CAST(Diffuse<0>, bsdf)->get_albedo(wo);
        break;
    case TRANSPARENT_ID:
    case MX_TRANSPARENT_ID:
        albedo = BSDF_CAST(Transparent, bsdf)->get_albedo(wo);
        break;

Iterative closure evaluation

Another key change is the non-recursive closure evaluation. We apply the same style of iterative tree traversal used in the previous OptiX version of process_closure() to the shared implementations of process_closure(), evaluate_layer_opacity(), process_medium_closure(), and process_background_closure().

Background sampling

We've included support for background closures. This includes an OptiX implementation of the Background::prepare() function. We've broken that function into three phases, where phases 1 and 3 are parallelized across a warp and phase 2 is executed on a single thread. This offers a decent speedup over a single-threaded implementation without the complexity of a more sophisticated implementation.

    // from background.h
    
    template<typename F>
    OSL_HOSTDEVICE void prepare_cuda(int stride, int idx, F cb)
    {
        prepare_cuda_01(stride, idx, cb);
        if (idx == 0)
            prepare_cuda_02();
        prepare_cuda_03(stride, idx);
    }

Tests

I have enabled the render-* tests for OptiX mode. I've added alternative reference images, since the GPU output exceeds the difference threshold on many of the tests. But in most cases the difference between the CPU and GPU output is very small.

Checklist:

  • I have read the contribution guidelines.
  • I have updated the documentation, if applicable.
  • I have ensured that the change is tested somewhere in the testsuite (adding new test cases if necessary).
  • My code follows the prevailing code style of this project. If I haven't
    already run clang-format v17 before submitting, I definitely will look at
    the CI test that runs clang-format and fix anything that it highlights as
    being nonconforming.

…error with the EnergyCompensatedOrenNayar BSDF.

Signed-off-by: Tim Grant <[email protected]>
…test_microfacet test case since it's no longer necessary.

Signed-off-by: Tim Grant <[email protected]>
…. Tests that make texture calls are optimize-only because the OptiX osl_texture function requires handles.

Signed-off-by: Tim Grant <[email protected]>
Signed-off-by: Tim Grant <[email protected]>
… intializers, since they aren't standard until C++20."

Signed-off-by: Tim Grant <[email protected]>
Signed-off-by: Tim Grant <[email protected]>
OSL_HOSTDEVICE void prepare_cuda(int stride, int idx, F cb)
{
prepare_cuda_01(stride, idx, cb);
if (idx == 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe leave a comment here as well that this is running on a single warp? At first it wasn't clear to me how you can get away with no synchronization -- but it makes sense if there's only a single warp here.

trace_ray(handle, payload, V3_TO_F3(r.origin), V3_TO_F3(r.direction),
tmin);
if (payload.hit_id == skipID1) {
tmin = payload.hit_t + 2.0f * epsilon;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could nudge by bumping the integer representation instead. This would let you use a smaller epsilon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a trick I was not aware of, and it appears to work. Nifty.

#include <OSL/oslconfig.h>

#include <Imath/half.h>
#if defined(__has_include) && __has_include(<Imath/half.h>)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we dropped support for Imath 2.x in main, so this part shouldn't be necessary (probably just a leftover from a previous merge?)

@@ -0,0 +1,97 @@
// Copyright Contributors to the Open Shading Language project.
// SPDX-License-Identifier: BSD-3-Clause
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought Imath now supported cuda out of the box. Is this still needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the casting macros are needed in the current iteration. So I'll move them to where they are used and remove this file.

{
if (getBackgroundShaderID() >= 0) {
const int bg_res = std::max<int>(32, getBackgroundResolution());
CUDA_CHECK(cudaMalloc(reinterpret_cast<void**>(&d_bg_values),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would there be a way to wrap the cudaMalloc calls so that user code doesn't need to have as many reinterpret_casts everywhere ? It could also take care of calling m_ptrs_to_free.push_back() at the same time to avoid mistakes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, there is already an unused function that does almost exactly that. I'll look at adapting it to streamline these operations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be a nice improvement, but maybe a little out-of-scope for this already big change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just took a stab at this and it makes the code a bit easier on the eyes. How do you feel about this?

#define DEVICE_ALLOC(size) reinterpret_cast<CUdeviceptr>(device_alloc(size))
#define COPY_TO_DEVICE(dst_device, src_host, size) \
    copy_to_device(reinterpret_cast<void*>(dst_device), src_host, size)

It's a wrapper for the wrapper that takes care of some of the casting. I don't necessarily want to completely obscure the fact that we're dealing the CUdeviceptr and not void*, although they are for the most part interchangeable.

Sampler sampler(x, y, si);
// jitter pixel coordinate [0,1)^2
Vec3 j = sampler.get();
Vec3 j = no_jitter ? Vec3(0, 0, 0) : sampler.get();
Copy link
Contributor

@fpsunflower fpsunflower Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want Vec3(0.5f, 0.5f, 0.0f) in the no_jitter case ? Otherwise you always get -1 from the warp below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good catch.

#include "../sampling.h"

// clang-format off
// These files must be included in this specific order
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make shading.cpp include shading.h and only include the .cpp here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is another leftover from an earlier iteration.

#include "../raytracer.h"


#define RAYTRACER_HIT_QUAD 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover from pre-triangle version?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it must have snuck back in during my manual rebase.

Copy link
Contributor

@fpsunflower fpsunflower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor notes aside, this looks good to me.

Will let @lgritz take it for a spin on a machine that can run the new code since we don't have that covered by CI yet.

@lgritz
Copy link
Collaborator

lgritz commented Nov 10, 2024

Does this fully replace #1829? Should we close that other one to avoid confusion?

@lgritz lgritz requested a review from chellmuth November 10, 2024 18:30
@lgritz lgritz requested a review from aconty November 10, 2024 18:30
@lgritz
Copy link
Collaborator

lgritz commented Nov 10, 2024

@chellmuth and @aconty does this look reasonable to you? On an absolute scale, but also, using a set of idioms that make it a decent proxy for what we care about in a real renderer?

@tgrant-nv tgrant-nv mentioned this pull request Nov 11, 2024
4 tasks
@lgritz
Copy link
Collaborator

lgritz commented Nov 12, 2024

@tgrant-nv This LGTM, I ran tests on my machine and came up with all sorts of failures (not your fault). The vector2/color2 tests are unrelated, I will look into that separately. But there were lots of optix tests that failed because of relatively small number of differences in the sampling noise. I see you added reference images, but even those didn't match quite right for me -- maybe different version of optix, or driver? Anyway, loosening up the thresholds did the trick. (I also changed the names of your ref images to the usual convention, a very nit-picky thing.)

So I went to push these updates on top of your branch, and it wouldn't let me, despite this very page saying "Maintainers are allowed to edit this pull request" -- I get an error "Authentication required: You must have push access to verify locks". I can do this to PRs on OIIO, but not on OSL, for reasons I don't understand.

So, could I trouble you to please take the optix-testrender-overhaul-take2 branch from my "lgritz" account (it's public) and then push that to yours, to amend this PR?

@tgrant-nv tgrant-nv force-pushed the optix-testrender-overhaul-take2 branch from 71b7791 to a5f418b Compare November 13, 2024 00:02
Copy link
Collaborator

@lgritz lgritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM and I was able to make it run the tests correctly at work on a real GPU (modulo that I needed to push some new reference images, my results didn't quite match Tim's, but they were an obviously close enough visual match).

I'm going to merge it as it is now. We can always continue to revise it if @aconty or @chellmuth or others have further suggestions down the road, but at least that will get things unstuck -- I know that at the very least, @fpsunflower is waiting for this to go in to finalize his displacement work.

@lgritz lgritz merged commit bda7495 into AcademySoftwareFoundation:main Nov 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants