When I started learning about compute shaders a few weeks ago, I looked at the Babylon examples in the docs and realized that I should first get a basic idea of WebGPU and WGSL to understand what is going on.
Now I have a project where a compute-shader (set up outside of the Babylon classes) calculates new positions of a large particle system each frame and then I let Babylon render the particle meshes with an SPS.
The bottleneck is the GPU → CPU data shoveling (mapping a buffer), which I knew was slow, but I didn’t know how crazy slow… Obviously, it feels wrong to pull data from the GPU just for Babylon to put it back on the GPU to render it.
My question is: if I used the compute-shader class within Babylon, could I avoid this bottleneck and render the SPS with data directly from the GPU buffer?
I tried to understand the “Boids” example again, but honestly, it’s the same mystery to me as it was before I knew anything about WebGPU! It looks like it renders white triangles directly from the shader?
Now I understand the Boids example. Three questions:
-1- What is the equivalent of getting the vertex buffer data into the mesh for an SPS? Currently I set the positions by defining the SPS.updateParticle function. Would it be something like SPS.mesh.setVerticesBuffer(myVertexBuffer, false)? In the Boids example, using the “magic” boidMesh.forcedInstanceCount, the buffer size info is somehow used to get things right. How one would do this for an SPS is not clear to me.
-2- Maybe an SPS is not even the optimal solution for this task? (The task is to calculate the interaction of a large number of objects with a compute shader and then update their positions.) I don’t have much experience with this yet, but using SPS.setParticles() for a standard IcoSphere with 4 subdivisions (if I didn’t miscalculate, this should be 1280 triangles), I hit the 16 ms zone (on an Apple M1 Max) at about 500 spheres, that is 640,000 triangles. Is this to be expected, or is it because the CPU is involved? …which brings us back to the original question…
-3- This is a more general question. I wonder what possibilities or restrictions there are to let BabylonJS’s buffers/shaders cooperate with a separate compute shader. If at all possible, I would like to avoid the extra interface layer of BJS creating a compute shader and rather use the WebGPU buffer/pipeline/commandEncoder code I’ve already written.
I’m able to hook into the BJS engine._device and create my own compute shader, and it works. (Please let me know if there is something I have to be careful about here. There is a TypeScript issue with GPUBindGroupLayout and the ‘__brand’ property, but it can be patched.)
The problem now is how to do something like in the Boids example
this.vertexBuffers = new BABYLON.VertexBuffer(engine, this.particleBuffer.getBuffer(), "a_particlePos",...)
if particleBuffer is not a BABYLON.StorageBuffer but a normal GPUBuffer (with usage: GPUBufferUsage.STORAGE of course) in my own compute shader.
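A sketch of how this could work, with the caveat that it relies on internals: BABYLON.VertexBuffer accepts a Babylon DataBuffer (that is what StorageBuffer.getBuffer() returns in the Boids example), and the WebGPU backend has a WebGPUDataBuffer class that wraps a native GPUBuffer. Whether and how BABYLON.WebGPUDataBuffer is exported, and its exact constructor arguments, depend on your Babylon version, so treat this as an assumption to verify. `device`, `numParticles` and `mesh` stand in for your own setup:

```typescript
// Assumption: BABYLON.WebGPUDataBuffer wraps a raw GPUBuffer so that Babylon's
// VertexBuffer can consume it (check the class in your Babylon version).
// Important: the buffer must also carry the VERTEX usage flag, since it will
// be bound as a vertex buffer at render time, not only as a storage buffer.
const myGpuBuffer = device.createBuffer({
    size: numParticles * 4 * 4, // e.g. one vec4<f32> per particle
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.VERTEX,
});

// Wrap the native buffer in a Babylon DataBuffer...
const dataBuffer = new BABYLON.WebGPUDataBuffer(myGpuBuffer);

// ...and expose it as an instanced vertex attribute, as in the Boids example.
const vertexBuffer = new BABYLON.VertexBuffer(
    engine,
    dataBuffer,
    "a_particlePos",
    false, // not updatable from the CPU side
    false, // postponeInternalCreation
    4,     // stride in floats
    true   // instanced
);
mesh.setVerticesBuffer(vertexBuffer, false); // false: mesh does not own/dispose it
```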
To my knowledge, SPS does not use instance rendering, it just uses a big vertex buffer. If you have 5 cubes in your SPS, you will have the positions of all the vertices of all the 5 cubes in the vertex buffer.
If the objects are all of the same type (say, a sphere), or you only have a few different types of objects (sphere, cube, …), maybe using thin instances would be better, because in this case you only have to update a matrix to move/rotate an object, whereas with SPS you will have to update all the vertices of the objects.
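To illustrate the difference: with thin instances there is one base mesh, and all per-instance transforms live in a single flat Float32Array of 16 floats (one 4x4 matrix) per instance. A minimal sketch, with `scene` and the counts as placeholders for your own setup:

```typescript
// One base sphere; every instance is just 16 floats in a shared buffer.
const sphere = BABYLON.MeshBuilder.CreateIcoSphere(
    "s", { subdivisions: 4 }, scene
);

const count = 20000;
const matrices = new Float32Array(count * 16);
const m = BABYLON.Matrix.Identity();
for (let i = 0; i < count; i++) {
    // For a pure translation, only the translation row changes.
    m.setTranslationFromFloats(
        Math.random() * 10, Math.random() * 10, Math.random() * 10
    );
    m.copyToArray(matrices, i * 16);
}
// Last argument false = dynamic buffer (you intend to update it each frame).
sphere.thinInstanceSetBuffer("matrix", matrices, 16, false);
```

Moving an object then means rewriting 16 floats instead of re-uploading all of its vertices, which is where the win over the SPS comes from.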
You’re a bit on your own if you want to mix Babylon with custom WebGPU code… I would advise you to port your custom WebGPU code to Babylon; there should normally be little to do, as the compute shader class in Babylon is a very thin wrapper around the WebGPU compute shaders.
maybe using thin instances would be better, because in this case you only have to update a matrix to move/rotate an object, whereas with SPS you will have to update all the vertices of the objects.
Makes sense. I’ll try it out. How to update the buffers in the shader to be used for the thin instances transformation matrices is not clear to me. Is there any documentation on the structure of those matrices? (For a pure translation I figured out that for a length 16 array representation of the 4x4 matrix, the x, y, z components are at index 12, 13, 14, respectively. But how are rotations and scalings represented?)
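The layout you found generalizes: in the 16-float (row-major, row-vector convention) array Babylon uses, the upper-left 3x3 block holds rotation combined with scaling, and indices 12/13/14 hold the translation. A small self-contained sketch (no Babylon needed, and `composeTRS` is just an illustrative helper, not a Babylon API) for a rotation about Z:

```typescript
// How a 4x4 TRS transform maps onto the flat 16-float array:
// rows 0..2 are the scaled, rotated basis vectors; row 3 is the translation.
function composeTRS(
    sx: number, sy: number, sz: number, // scaling
    angleZ: number,                     // rotation about Z, in radians
    tx: number, ty: number, tz: number  // translation
): Float32Array {
    const c = Math.cos(angleZ), s = Math.sin(angleZ);
    return new Float32Array([
        sx * c,  sx * s, 0,  0, // indices 0..3
        -sy * s, sy * c, 0,  0, // indices 4..7
        0,       0,      sz, 0, // indices 8..11
        tx,      ty,     tz, 1, // indices 12..15  <- x, y, z at 12/13/14
    ]);
}

// Pure translation: identity 3x3 block, x/y/z at indices 12/13/14.
const t = composeTRS(1, 1, 1, 0, 2, 3, 4);
// Pure scaling: sx, sy, sz on the diagonal (indices 0, 5, 10).
const sc = composeTRS(2, 3, 4, 0, 0, 0, 0);
```

For other rotation axes the pattern is the same: only the 3x3 block changes, so a compute shader writing this buffer only needs to fill 16 floats per instance.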
I would advise you to port your custom WebGPU code to Babylon
In my compute shader pipeline, I use commandEncoder.copyBufferToBuffer(...)
How would I do this in Babylon?
Regarding copyBufferToBuffer, what do you use it for? If it is to read data back to the CPU, we have a StorageBuffer.read method which does it under the hood. We try not to expose methods that are too low-level if it’s not necessary, but if there are use cases for them, we can think about it.
Regarding copyBufferToBuffer , what do you use it for?
No, reading data back to the CPU is what I want to avoid!
I use it to copy one GPUBuffer to another one on the GPU. For example, in a time stepping scheme there are parallel computations done in the compute shader to calculate new positions based on old positions. Only after this is done, the whole new position buffer is copied to the whole old position buffer.
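For what it’s worth, the Boids example avoids this copy entirely with double buffering (ping-pong): two position buffers and two bind groups that reference them in opposite read/write roles, alternated each frame, so "new" simply becomes "old" by swapping which bind group is used. The index logic is trivial; the GPU calls below are commented placeholders for your own pipeline:

```typescript
// Ping-pong pattern: alternate which buffer is read and which is written,
// instead of copying new positions back over the old ones each step.
class PingPong {
    private frame = 0;
    // Returns [readIndex, writeIndex] for this frame, then advances.
    next(): [number, number] {
        const read = this.frame % 2;
        this.frame++;
        return [read, 1 - read];
    }
}

// Usage sketch inside the render loop (bindGroups[0] reads buffer 0 and
// writes buffer 1; bindGroups[1] does the opposite):
const pingPong = new PingPong();
const [read, write] = pingPong.next();
// pass.setBindGroup(0, bindGroups[read]);
// pass.dispatchWorkgroups(Math.ceil(numParticles / 64));
```

The cost of the buffer-to-buffer copy disappears at the price of allocating the position buffer twice.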
All I need is to get the data of such a position buffer into the matrix buffer for the thin instances without getting the CPU involved. I will try to get this done… cannot be so hard…
The thin instances are really a completely different game compared to the SPS! I can easily render 20k interacting particles now! Thanks for the tip!
I wonder how the alpha channel of the colorBuffer behaves when I do mesh.thinInstanceSetBuffer('color', colorBuffer, 4). It seems the alpha channel only has an effect if material.alpha < 1. Are material.alpha and the alpha channel added together? (For alpha channel = 0 the rendering is strange.)
Is there a way to do PBR materials for thin instances? I get these WebGPU errors if I try
Back to the original topic:
Regarding the data transfer from GPU to CPU, I actually made a mistake in the performance measurement. The additional time needed for the transfer is actually small compared to the time from the submission of the command encoder to the device queue until the work is done.
Now I find that, independent of whether any work is actually done in the compute shader, it takes at least 3-4 ms from the submission of the command encoder until device.queue.onSubmittedWorkDone() resolves. Of course this will depend on the hardware, but I wonder if this is to be expected?
Then I’m at a loss as to how we should time GPU commands. AFAIK, await onSubmittedWorkDone and await mapAsync are the only ways to find out when the GPU has finished the work. How else would we query this?
I time these GPU commands in the animation loop where not much else is done except scene.render, and average over many frames. Is there another way? Or is the GPU and the data traffic to it a big black box?
You normally shouldn’t need to know when some GPU work finishes.
If you need some work to be finished before reusing its result, this will happen automatically, simply by the sequence of your calls and the fact that you are going to reuse a texture (or a buffer) as an input to a shader (for example).