(GPU BVH building hasn’t been ported yet; I’m porting it from my previously unpublished OpenCL RTRT application, but that also requires writing a WGSL version of radix sort, since it previously relied on the Boost.Compute library)
WebRTX is not hardware ray tracing and is a pure compute shader implementation. This means WebRTX works as long as your browser supports WebGPU (only tested on Chrome so far).
but again, apparently without GPGPU BVH building:
The BVH is built on the host and then flattened into a buffer for stackless traversal on the GPU.
The implementation is tricky, of course. You could call it a professional approach (in that the tracing can be written in GLSL shaders even when the drivers don’t support ray tracing), but given the CPU-side BVH build it is still of limited practical use…
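To make the “flattened into a buffer for stackless traversal” part concrete, here is a minimal CPU sketch of the idea. The node layout (an AABB, a `skip` index pointing past the subtree, and a `prim` index that is -1 for inner nodes) and all names are illustrative assumptions, not WebRTX’s actual format:

```javascript
// Sketch of stackless traversal over a flattened BVH node array.
// Nodes are stored in depth-first order, so on a box hit we simply
// advance to the next slot; on a miss we jump to `skip`, which
// points past the node's entire subtree. Illustrative layout only.

function intersectAABB(ray, min, max) {
  let tmin = -Infinity, tmax = Infinity;
  for (let a = 0; a < 3; a++) {
    const inv = 1 / ray.dir[a];
    let t0 = (min[a] - ray.orig[a]) * inv;
    let t1 = (max[a] - ray.orig[a]) * inv;
    if (inv < 0) [t0, t1] = [t1, t0];
    tmin = Math.max(tmin, t0);
    tmax = Math.min(tmax, t1);
  }
  return tmax >= Math.max(tmin, 0);
}

// Returns the primitive indices of all leaves whose boxes the ray hits.
function traverse(nodes, ray) {
  const hits = [];
  let i = 0;
  while (i < nodes.length) {
    const n = nodes[i];
    if (intersectAABB(ray, n.min, n.max)) {
      if (n.prim >= 0) hits.push(n.prim); // leaf node
      i++;          // hit: descend to the next node in DFS order
    } else {
      i = n.skip;   // miss: skip the whole subtree
    }
  }
  return hits;
}
```

Since no stack is needed, the same loop translates directly to a compute shader over a storage buffer of nodes.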
### Code structure
/bvh - Rust code for building BVH and serializing it to a format suitable for stackless traversal on GPU.
/glsl - Rust code for parsing and manipulating user provided shaders.
If anyone is interested, here is an implementation of radix sort for WebGPU (native):
(link to github)
Performance is good, but after porting it to the Babylon.js WebGPU layer, performance drops by 5-7x.
So the ray tracer will probably have to be implemented in native WebGPU, without Babylon.js - at least until the latter exposes the low-level WebGPU API (it seems there is no such option right now?).
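For anyone following the radix-sort discussion below: the GPU version works in per-bit-group passes of count, prefix scan, and scatter, ping-ponging between two buffers. A CPU reference sketch of that same structure (illustrative only, not the linked implementation):

```javascript
// CPU reference for LSD radix sort over unsigned 32-bit keys,
// 4 bits per pass, ping-ponging between two buffers the way the
// GPU version ping-pongs its storage buffers. Illustrative sketch.

function radixSort(input) {
  const BITS = 4, BUCKETS = 1 << BITS;
  let src = input.slice(), dst = new Array(input.length);
  for (let shift = 0; shift < 32; shift += BITS) {
    // count pass: histogram of the current 4-bit digit
    const count = new Array(BUCKETS).fill(0);
    for (const v of src) count[(v >>> shift) & (BUCKETS - 1)]++;
    // scan pass: exclusive prefix sum over the histogram
    let sum = 0;
    for (let b = 0; b < BUCKETS; b++) {
      const c = count[b]; count[b] = sum; sum += c;
    }
    // scatter pass: place each element at its slot for this digit
    for (const v of src) dst[count[(v >>> shift) & (BUCKETS - 1)]++] = v;
    [src, dst] = [dst, src];
  }
  return src; // 8 passes (even), so src holds the sorted result
}
```

On the GPU, the scan pass is itself split into several dispatches (block-local scan, scan of block sums, then a fix-up pass), which is why the code below has `scan1`/`scan2`/`scan3` pipelines.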
Would you have a link to your port? I don’t really see why the port to Babylon would be slower because the layer is very thin, as we basically dispatch the compute shader calls to the browser…
`bGroup1_t` and `bGroup2_t` don’t exist and are never assigned, so you end up creating two new buffers every time `Update_PipelineRadixScan2` is called.
What’s more, the original code creates all bind groups in advance and simply uses them in the main loop. In Babylon.js, when you update a compute shader input, the bind group must be recreated. For better performance, you should create as many `ComputeShader` instances as there are variations of the inputs. That way the behavior is closer to the original code, where everything is created once and you only dispatch the compute shaders (and update the uniform buffers).
Bind groups and other GPU resources are created when ComputeShader.dispatch is called. So, in your tests with Babylon, you will incur this recreation on each test, whereas the creation of the GPU resources is not included in the timing for the native test.
To improve things in Babylon.js, you should either run the whole loop once to make sure the GPU resources are created before timing the real test, or reuse the same compute shaders for all tests.
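The “one compute shader per input variation” advice above can be sketched as a keyed cache, so each variant’s bind group is built once and then only dispatched. `createShader` and the returned objects are hypothetical stand-ins here, not the Babylon.js API:

```javascript
// Sketch of caching one shader instance per input combination.
// `createShader` is a hypothetical factory (e.g. something that
// builds a ComputeShader and binds the named buffers); the cache
// guarantees it runs once per distinct combination of inputs.

function makeShaderCache(createShader) {
  const cache = new Map();
  return function getShaderFor(...bufferIds) {
    const key = bufferIds.join("|");
    if (!cache.has(key)) cache.set(key, createShader(bufferIds));
    return cache.get(key);
  };
}
```

In a ping-ponging radix-sort loop you would, for example, request `getShaderFor("keysA", "keysB")` on even passes and `getShaderFor("keysB", "keysA")` on odd passes; only two variants ever get created, and every later dispatch reuses them.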
I also created a PR which adds a fastMode property to the ComputeShader class:
When true, isReady is not called by dispatch anymore, and it does not check either if the underlying GPU resources should be recreated (because of changes in the inputs). So, in your case, you could pre-init by running the loop this way:
```js
const promises = [];

for (let i = 0; i < bits; i++) {
  const j = i % 2;

  // scan 1: ping-pongs between the two key buffers
  promises.push(pipeline_radix_scan1_buf[j].dispatchWhenReady(0, 0, 0)
    .then(() => pipeline_radix_scan1_buf[j].fastMode = true));

  // scan 2: one dispatch per level of block sums
  for (let k = 1; k < buffers_scan1.length; k++) {
    promises.push(pipeline_radix_scan2_buf[k - 1].dispatchWhenReady(0, 0, 0)
      .then(() => pipeline_radix_scan2_buf[k - 1].fastMode = true));
  }

  // scan 3: fix-up passes in reverse level order
  for (let k = buffers_scan1.length - 2; k >= 0; k--) {
    promises.push(pipeline_radix_scan3_buf[k].dispatchWhenReady(0, 0, 0)
      .then(() => pipeline_radix_scan3_buf[k].fastMode = true));
  }

  // scatter for this pass
  promises.push(pipeline_radix_scatter_buf[j].dispatchWhenReady(0, 0, 0)
    .then(() => pipeline_radix_scatter_buf[j].fastMode = true));
}

await Promise.all(promises);
```
The PR also lets you pass (0,0,0) to dispatch for the workgroup counts. That way, all GPU resources are created but the compute shader is not executed.
Here’s a PG with these changes (will work as expected only when the PR is merged):
I updated the benchmarks on GitHub; the code now reuses the same compute shaders for all tests, but performance didn’t change:
radix_sort_native - 0.8 ms [webgpu native API]
radix_sort (no bind groups optimization) - 4 ms [Babylon.js used]
radix_sort_opt - 2 ms [Babylon.js used]
radix_sort_opt2 - 2 ms [Babylon.js used]