For some reason, the code that checks the shader output skews the measurements (even though it’s not between the start/end timing marks!). In my testing, the timings alternate between a small and a large duration:
If I comment out the code that checks the array, I don’t experience this problem:
Note that the resolution of performance.now() seems to be 0.1 ms, so timing things that take < 1 ms won’t be very precise.
Also, you sometimes get a measurement or two that is much larger than the others, and it will bias the average. It’s probably due to the browser and/or whatever else is happening on the computer at that time. For a better average, remove the X largest/smallest values before computing the final mean, as in the sketch below.
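For instance, a minimal sketch of such a trimmed mean (the trimCount parameter is my assumption; pick it based on how many outliers you see):

// Trimmed mean: drop the `trimCount` smallest and largest measurements
// before averaging, so a couple of outliers don't bias the result.
function trimmedMean(durations, trimCount) {
    const sorted = [...durations].sort((a, b) => a - b);
    const kept = sorted.slice(trimCount, sorted.length - trimCount);
    return kept.reduce((sum, d) => sum + d, 0) / kept.length;
}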
Note that the PR is now merged, and is available on the Playground.
I tested it on my computer (I implemented the average calculation as explained above and commented out the code that checks the array):
with the native WebGPU API there are no performance drops (4/4 passes); I still think you need to compute the average over all values, otherwise it’s cheating
on the PG (webgpu#6E7FJ1#5) the difference seems to be 2.5 ms vs 1.3 ms (with 3/4 passes), but when running locally (on my PC) there is suspiciously no difference (1.7-2 ms, with 4/4 passes)
with the “promises” prefetch code (fastMode=true, webgpu#4EI0PY#13), the radix sort results are not correct (see log “count_unmatch: 2097151”)
because you commented out the result log, but the problem remains
P.S.:
if you remove the code below, the results are OK (log “count_unmatch: 0”),
but if it is left in, the results do not match the presorted reference (see log “count_unmatch: 2097151”),
so your promises introduce an error
const promises = [];
for (let i = 0; i < bits; i++) {
    let j = i % 2;
    // Warm up the scan1 pipeline for this pass, then switch it to fastMode:
    promises.push(pipeline_radix_scan1_buf[j].dispatchWhenReady(0, 0, 0).then(() => pipeline_radix_scan1_buf[j].fastMode = true));
    // Same for each level of the scan2 pipelines...
    for (let k = 1; k < buffers_scan1.length; k++) {
        promises.push(pipeline_radix_scan2_buf[k - 1].dispatchWhenReady(0, 0, 0).then(() => pipeline_radix_scan2_buf[k - 1].fastMode = true));
    }
    // ...and the scan3 pipelines, in reverse order:
    for (let k = (buffers_scan1.length - 1) - 1; k >= 0; k--) {
        promises.push(pipeline_radix_scan3_buf[k].dispatchWhenReady(0, 0, 0).then(() => pipeline_radix_scan3_buf[k].fastMode = true));
    }
    // Finally the scatter pipeline for this pass:
    promises.push(pipeline_radix_scatter_buf[j].dispatchWhenReady(0, 0, 0).then(() => pipeline_radix_scatter_buf[j].fastMode = true));
}
await Promise.all(promises);
I do calculate it! I simply remove the extreme values, because they just happen (for some reason beyond our control (GC?), or at least I don’t see what we can do about them).
For example: all the samples are around 0.5, so clearly the average time should be around 0.5, but due to one high value of 5.1, the computed average comes out at 0.98.
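To make the arithmetic concrete (illustrative numbers, not the actual measured values):

// Nine samples near 0.5 ms plus a single 5.1 ms outlier: the plain mean
// lands near 1 ms even though a typical run costs ~0.5 ms.
const samples = [0.5, 0.52, 0.48, 0.5, 0.51, 0.49, 0.5, 0.5, 0.5, 5.1];
const mean = samples.reduce((s, v) => s + v, 0) / samples.length; // ≈ 0.96
// With trimmedMean(samples, 1) from above, the result stays ≈ 0.5.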
Do you have the PR merged in your local Babylon sources?
My bad, I forgot the Update_PipelineRadixScan1_buf(j, i); call in the pre-fetch loop:
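Presumably the fix is to make that call right before the scan1 warm-up dispatch (a sketch; the exact placement comes from the updated PG):

for (let i = 0; i < bits; i++) {
    let j = i % 2;
    Update_PipelineRadixScan1_buf(j, i); // the call that was missing
    promises.push(pipeline_radix_scan1_buf[j].dispatchWhenReady(0, 0, 0).then(() => pipeline_radix_scan1_buf[j].fastMode = true));
    // ...the scan2/scan3/scatter dispatches are unchanged...
}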
Note: I’m not sure what you mean by “3/4 passes” or “4/4 passes”. What I meant on my side by “3/4 passes” was that I ran the PG 3 or 4 times (in fact, I think it was more like 5 or 6) and averaged the averages displayed at the end of the test.
this is cheating, because FPS is determined by the average frame time over all frames, not just the frames outside of the extreme values
sad
what do I need to do for this?
your link to the PG states:
const fmode = false;
as I understand it, this is not the fast promises prefetch mode, and if I set it to “true” then the radix sort result is not correct (see log “count_unmatch: 2097151”)
this does not affect the result in any way
I mean that the code used your timer, without the extreme values
I mean that the code used my timer, with all the duration values
P.S.:
and so the PG benchmark time with the promises prefetch and “fmode = true;” is fast (0.8 ms average, the same as with the native WebGPU API), but the results are incorrect (they do not pass the check; see log “count_unmatch: 2097151”)
You will be able to call engine.flushFramebuffer() to submit the current command buffers and reset the UBO GPU buffers, which will fix the problem.
See:
I call engine.flushFramebuffer() at the same point where the native code submits the command buffers.
Note that you are still able to use native code in Babylon if you wish, as you can access the device through engine._device (in the same way we have engine._gl for people who want to issue direct WebGL commands).
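For illustration, a minimal sketch of both escape hatches (assuming a WebGPU engine instance named engine; the encoder part is my own example, not code from the PG):

// Submit the pending command buffers mid-frame, at the same point where a
// native version would call device.queue.submit():
engine.flushFramebuffer();

// Escape hatch for issuing raw WebGPU commands through Babylon's device:
const device = engine._device; // the underlying GPUDevice
const encoder = device.createCommandEncoder();
// ...record native WebGPU commands here...
device.queue.submit([encoder.finish()]);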
The first one is faster than the second one, but the flushFramebuffer call should also be done in the first case, otherwise the comparison is not fair: currently, device.queue.submit(...) is never called during the tests in the first PG; it is only called after the tests are finished, because it is done in the engine.endFrame method.
I now start a test each frame instead of running a loop of X tests during the same frame. I think there are some strange interactions with the browser when we do all the tests in a single requestAnimationFrame…
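Roughly this pattern (a sketch, not the actual PG code; runOneIteration, numTests and reportAverage are stand-ins I made up):

// One measurement per animation frame instead of a tight loop of X tests
// inside a single requestAnimationFrame callback.
const durations = [];
function runOneTest() {
    const start = performance.now();
    runOneIteration(); // stand-in: one iteration of the benchmark
    durations.push(performance.now() - start);
    if (durations.length < numTests) {
        requestAnimationFrame(runOneTest);
    } else {
        reportAverage(durations); // stand-in: average and display the results
    }
}
requestAnimationFrame(runOneTest);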
However, there’s a way to make it work without calling flushFramebuffer: create as many compute shaders as necessary to avoid having to update the uniform buffers during the loop.
It means you must create “bits” (=30) compute shaders instead of 2 for pipeline_radix_scan1_buf. It’s not really a problem, and on my computer the “promise” version is on par with / slightly faster than the native one (though, if you made the same changes in the native version, it would be faster too)!
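For instance, something along these lines (a sketch; createRadixScan1Shader and setRadixScan1Uniforms are stand-ins for whatever the PG actually does):

// One compute shader per bit, each with its own uniform buffer baked once
// up front, so nothing needs to be updated (and nothing flushed) in the loop.
const pipeline_radix_scan1_buf = [];
for (let i = 0; i < bits; i++) {
    const cs = createRadixScan1Shader(); // stand-in: builds the scan1 compute shader
    setRadixScan1Uniforms(cs, i); // stand-in: write the per-bit uniforms once
    pipeline_radix_scan1_buf.push(cs);
}
// In the sort loop, index by bit instead of by i % 2:
// pipeline_radix_scan1_buf[i].dispatch(...);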
You don’t need the rollback above to have it work in the current Playground:
I have already set up the timers, and here we are: it builds 25 to 250 thousand triangles in 1.3 ms on average (excluding heavy initialization operations, like the initial creation of buffers)
even if we count the initial creation of the temporary buffers (everything except creating the model’s triangle array and copying it to VRAM), it is 2-3 ms on average
(but clearly there is no need to recreate the buffers every frame)
P.S.2:
and yes, if someone wants to dig into it, keep in mind that all operations like await device.queue.onSubmittedWorkDone() need to be removed (I don’t remember the details now, but it’s a really slow thing, and it isn’t necessary); everything works without it (and is orders of magnitude faster)
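For anyone reproducing this, a minimal sketch of reading results back without onSubmittedWorkDone (standard WebGPU, not the PG’s exact code): mapAsync() already resolves once the submitted copy has completed, so the extra await is redundant.

// Copy the GPU result into a staging buffer and map it; mapAsync() waits
// for the submitted work touching the buffer, no onSubmittedWorkDone() needed.
async function readBack(device, srcBuffer, size) {
    const staging = device.createBuffer({
        size,
        usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
    });
    const encoder = device.createCommandEncoder();
    encoder.copyBufferToBuffer(srcBuffer, 0, staging, 0, size);
    device.queue.submit([encoder.finish()]);
    await staging.mapAsync(GPUMapMode.READ);
    const result = new Uint32Array(staging.getMappedRange().slice(0));
    staging.unmap();
    return result;
}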
and if anyone is interested: out of these 1.3 ms, the radix sort takes a whole 1 ms. There is an assumption that it could be sped up even further (for example, by caching operations (?) or by abandoning atomics), by about 2x (or more?), but clearly that is already excessive (at least for now)