I am writing a WebGPU ray tracer. Is anyone interested in this topic?

is there anyone willing to support the development of an open source WebGPU ray tracer?

(a preview online demo (no BVH yet) is available here (link), by subscription for only one Euro, for free) :smiley:

WGSL GPGPU LBVH building & traversal is underway
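For anyone curious where the LBVH part starts: below is a CPU-side JavaScript sketch of the standard 30-bit Morton encoding that LBVH builders use to quantize primitive centroids before sorting. This is a hypothetical illustration, not code from this project; a WGSL version would use the same bit tricks.

```javascript
// Expand a 10-bit integer so its bits land in every 3rd position
// (classic bit-twiddling for 3D Morton / Z-order codes).
function expandBits(v) {
    v = (v * 0x00010001) & 0xFF0000FF;
    v = (v * 0x00000101) & 0x0F00F00F;
    v = (v * 0x00000011) & 0xC30C30C3;
    v = (v * 0x00000005) & 0x49249249;
    return v >>> 0;
}

// 30-bit Morton code for a point with coordinates in [0, 1).
function morton3D(x, y, z) {
    const scale = 1024; // 2^10 buckets per axis
    const xi = Math.min(1023, Math.max(0, Math.floor(x * scale)));
    const yi = Math.min(1023, Math.max(0, Math.floor(y * scale)));
    const zi = Math.min(1023, Math.max(0, Math.floor(z * scale)));
    return ((expandBits(xi) << 2) | (expandBits(yi) << 1) | expandBits(zi)) >>> 0;
}
```

sorting primitives by this 30-bit key is exactly what the radix sort below is needed for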

P.S.:
(Babylon.js is used to access WebGPU)

7 Likes

Cool project! Could you share the GitHub repo, if available?

(The BVH hasn’t been ported yet; I’m porting it from my previously unpublished OpenCL RTRT application, but that also requires writing a WGSL version of radix sort, because the Boost.Compute library was used for this earlier)
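For context, here is a minimal CPU reference of the LSD radix sort that such a GPU port parallelizes: 2 bits (4 buckets) per pass, with an exclusive prefix sum over the bucket counts and a stable scatter. This is only an illustrative sketch, not the OpenCL or WGSL code in question.

```javascript
// Minimal CPU reference of LSD radix sort on unsigned 32-bit keys,
// processing 2 bits (4 buckets) per pass -- the digit width a GPU
// scan + scatter implementation typically uses.
function radixSortLSD(keys, bits = 32) {
    let src = Uint32Array.from(keys);
    let dst = new Uint32Array(src.length);
    for (let shift = 0; shift < bits; shift += 2) {
        const counts = [0, 0, 0, 0];
        for (const k of src) counts[(k >>> shift) & 3]++;
        // Exclusive prefix sum over the 4 buckets (the "scan" step on GPU).
        const offsets = [0, counts[0], counts[0] + counts[1], counts[0] + counts[1] + counts[2]];
        for (const k of src) dst[offsets[(k >>> shift) & 3]++] = k; // stable scatter
        [src, dst] = [dst, src]; // ping-pong buffers between passes
    }
    return Array.from(src);
}
```

on the GPU, each pass becomes a histogram kernel, a prefix-sum (scan) kernel, and a scatter kernel over the whole buffer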

4 Likes

You can find them in this WebGPU issue,

specifically the dawn-ray-tracing project:

firstly, it requires a custom Chrome build

secondly, it requires ray tracing support in the video card drivers

P.S.:

well, I wanted to implement hardware-agnostic, software GPGPU ray tracing that requires only WGSL compute shader support :smiley:

1 Like

that’s great! But the larger the height and width of the webpage, the worse the performance. When I maximize the webpage, I get 18 FPS.

I think that’s expected, as more width and height = more pixels to go over

have you seen WebRTX for WebGPU? :grin: (released 4 months ago)

an example from the description

P.S.:

it also works on compute shaders:

WebRTX is not hardware ray tracing and is a pure compute shader implementation. This means WebRTX works as long as your browser supports WebGPU (only tested on Chrome so far).

but, again, apparently without GPGPU BVH building:

The building of BVH happens on host which is then flattened into a buffer for stackless traversal on GPU.

the implementation is tricky, of course :smiley: you could call it a professional approach (in that the tracing can be written in GLSL shaders, even if ray tracing is not supported by the drivers), but given the CPU-side BVH build it is still fairly useless for real-time work …
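The "flattened buffer + stackless traversal" idea can be sketched like this in plain JavaScript. The node layout and field names are hypothetical, not WebRTX's actual format: each node stores a "miss" index saying where to jump when the ray misses it (or after a leaf is handled), with inner nodes laid out so the first child immediately follows its parent.

```javascript
// Illustrative stackless traversal over a BVH flattened into an array.
// node = { min, max, miss, first, count }; leaves have count > 0.
// Inner nodes descend to index i + 1 (depth-first layout); a miss of -1 ends traversal.
function traverseStackless(nodes, hitsAABB, onLeaf) {
    let i = 0;
    while (i !== -1) {
        const n = nodes[i];
        if (!hitsAABB(n)) { i = n.miss; continue; } // skip whole subtree
        if (n.count > 0) { onLeaf(n); i = n.miss; } // leaf: visit, then escape
        else i = i + 1;                             // inner: descend to first child
    }
}
```

on the GPU this loop needs no per-thread stack, which is why the BVH is serialized into exactly this shape before upload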

### Code structure
  • /bvh - Rust code for building the BVH and serializing it to a format suitable for stackless traversal on the GPU.
  • /glsl - Rust code for parsing and manipulating user-provided shaders.
  • /naga - WASM binding for naga, based on wasm-naga.
  • /src - All other TypeScript library code.

if anyone is interested, here is an implementation of radix sort for WebGPU (native):

(link to github)

good performance, but after porting it to the Babylon.js WebGPU layer, performance decreases by 5-7 times

therefore, you will probably have to implement the ray tracer against the native WebGPU API, without Babylon.js - at least until the latter exposes the low-level WebGPU API (it seems there is no such option right now?)

P.S.:

radix sort is part of the LBVH implementation

Would you have a link to your port? I don’t really see why the port to Babylon would be slower because the layer is very thin, as we basically dispatch the compute shader calls to the browser…

2 Likes

I know I’m late to the thread, but paging @erichlof !

1 Like

WebGPU via Babylon.js

settings below (in the `test()` function):

```javascript
let count = 64*64*64 *4*2; // default (64*64*64)
let max_value = 1073741824-1; // default (10000)
let bits = 30;  // default (14)
```

with these settings:
(WebGPU via Babylon.js): 5-7 ms (my code above)
(native WebGPU API): 1 ms (Fei Yang’s code)

I think there’s a mistake here:

bGroup1_t and bGroup2_t don’t exist and are never assigned, so you end up creating two buffers every time Update_PipelineRadixScan2 is called.

What’s more, the original code creates all bind groups in advance and simply uses them in the main loop. In Babylon.js, when you update a compute shader input, the bind group must be recreated. If you want better performance, you should create as many compute shaders as you have variations of the inputs. That way, the behavior will be closer to the original code, where everything is created once and you only have to dispatch the compute shaders (and update the uniform buffers).
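The "one compute shader per input variation" idea boils down to caching by input combination. Here is a plain-JS sketch; `createShaderFor` is a hypothetical stand-in for constructing a `BABYLON.ComputeShader` and binding its buffers once, not a Babylon.js API:

```javascript
// Build each shader variant once, keyed by its bound inputs,
// then only dispatch in the hot loop.
function makeShaderCache(createShaderFor) {
    const cache = new Map();
    return function getShader(...inputNames) {
        const key = inputNames.join('|');
        let shader = cache.get(key);
        if (!shader) {                       // first use: create and bind once
            shader = createShaderFor(inputNames);
            cache.set(key, shader);
        }
        return shader;                       // later uses: no rebinding, no new bind group
    };
}
```

in the main loop, `getShader('keys_a', 'hist')` then returns the same prebuilt object every iteration, so nothing is recreated at dispatch time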

fixed, and now the time is 4 ms on average :smiley: still a lot …

so, creating a lot of `new BABYLON.ComputeShader` instances, one for each variation of the inputs, right?

that is not particularly convenient; manipulating only the bind groups is much easier and faster

Yes.

We can’t expose the bind groups, it’s too low level. But a ComputeShader is fairly light, so creating a number of them should not be a problem.

fynv’s native WebGPU API “radix sort” (I optimized it slightly and added a benchmark) - 0.8 ms on average

my optimized “radix sort” code for Babylon.js - 2 ms on average

my optimized2 (close to the original fynv) “radix sort” code for Babylon.js - 2 ms on average

as you can see, the difference in performance is 2.5x (radix_sort_native vs radix_sort_opt)

additionally, there are sometimes performance drops (15 ms or more) when running the radix sort with Babylon.js

Any other ideas for optimization? :smiley: (for Babylon.js)

Bind groups and other GPU resources are created when ComputeShader.dispatch is called. So, in your tests with Babylon.js, you incur this recreation on each test, whereas the creation of the GPU resources is not included in the timing of the native test.

To improve things in Babylon.js, you should either first run the whole loop, to make sure the GPU resources are created before timing the real test, or reuse the same compute shaders for all tests.
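That warm-up advice is the usual benchmark shape: run the workload once so lazy resource creation happens outside the measurement, then time the steady state. A generic plain-JS sketch, where `work` stands in for the dispatch loop (this measures CPU-side submission time only, not actual GPU execution, which needs its own synchronization):

```javascript
// Warm up, then time: one untimed call triggers one-time setup,
// then the average over the timed iterations reflects steady state.
function benchmark(work, iterations = 10) {
    work();                                   // warm-up: lazy creation happens here
    const t0 = performance.now();
    for (let i = 0; i < iterations; i++) work();
    return (performance.now() - t0) / iterations; // average ms per iteration
}
```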

I also created a PR which adds a fastMode property to the ComputeShader class:

When true, isReady is no longer called by dispatch, and it no longer checks whether the underlying GPU resources should be recreated (because of changes in the inputs). So, in your case, you could pre-initialize by running the loop this way:

const promises = [];

for (let i = 0; i < bits; i++) {
    let j = i % 2;

    promises.push(pipeline_radix_scan1_buf[j].dispatchWhenReady(0, 0, 0).then(() => pipeline_radix_scan1_buf[j].fastMode = true));

    for (let k = 1; k < buffers_scan1.length; k++) {
        promises.push(pipeline_radix_scan2_buf[k - 1].dispatchWhenReady(0, 0, 0).then(() => pipeline_radix_scan2_buf[k - 1].fastMode = true));
    }

    for (let k = buffers_scan1.length - 2; k >= 0; k--) {
        promises.push(pipeline_radix_scan3_buf[k].dispatchWhenReady(0, 0, 0).then(() => pipeline_radix_scan3_buf[k].fastMode = true));
    }

    promises.push(pipeline_radix_scatter_buf[j].dispatchWhenReady(0, 0, 0).then(() => pipeline_radix_scatter_buf[j].fastMode = true));
}

await Promise.all(promises);

The PR also lets you pass (0,0,0) to dispatch for the workgroup counts. That way, all GPU resources are created but the compute shader is not executed.

Here’s a PG with these changes (will work as expected only when the PR is merged):

3 Likes

I updated the benchmarks on GitHub, and now the code reuses the same compute shaders for all tests, but there’s no performance change:

radix_sort_native - 0.8 ms [webgpu native API]
radix_sort (no bind groups optimization) - 4 ms [Babylon.js used]
radix_sort_opt - 2 ms [Babylon.js used]
radix_sort_opt2 - 2 ms [Babylon.js used]

What about the PG I linked above?

Even without the PR in, it should already improve things.

no difference on the PG (3 ms avg in all cases):

PG [run first the whole loop with Promises] - 3 ms avg

PG [reuse the same compute shaders for all tests] - 3 ms avg

PG (default) - 3 ms avg