Thanks for your suggestions! I actually have many more questions. I am not familiar with the low-level WebGPU/WebGL APIs and how these functions relate to GPU architecture and performance, hence my questions and confusion.
However, the read uses a promise, which means that you won’t get the result for the current frame but for the next one (or even later, depending on the time needed to read from the GPU). If this is not a problem in your design, then it might work.
This is a problem. I would need the count to be in sync every frame.
I actually found a different way to achieve my initial goal without needing to read the computed count on the CPU side. But I don’t know whether it actually improves performance. I will explain below. Could you help check if it makes sense?
Like I mentioned, I have a StorageBuffer with a fixed size N that looks like this:
[x0, y0, z0, orientation0, ....., xN-1, yN-1, zN-1, orientationN-1]. It is bound as a vertex buffer so the vertex shader and fragment shader can read the position (x, y, z) and orientation for the N mesh instances.
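To make sure I describe it correctly, here is a minimal WGSL sketch of how I picture the vertex shader consuming that data, assuming the buffer is read as a storage buffer indexed by instance_index (the name instanceData is a placeholder, and if it is bound as an instanced vertex buffer the declaration would be attributes instead):

```wgsl
// Hypothetical layout: one vec4 per instance = (x, y, z, orientation).
@group(0) @binding(0) var<storage, read> instanceData : array<vec4<f32>>;

@vertex
fn main(@builtin(instance_index) instanceIdx : u32,
        @location(0) position : vec3<f32>) -> @builtin(position) vec4<f32> {
    let inst = instanceData[instanceIdx];
    // Translate by the instance position; inst.w would carry the orientation.
    // View/projection transforms are omitted in this sketch.
    let worldPos = position + inst.xyz;
    return vec4<f32>(worldPos, 1.0);
}
```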
With culling, the StorageBuffer is not full. I tried to build the StorageBuffer data so that the positions and orientations of the visible meshes are packed at the beginning of the buffer. E.g. after applying culling there are 8 mesh instances that should be visible, so after index 31 the StorageBuffer is all 0s:
[x0, y0, z0, orientation0, ....., x7, y7, z7, orientation7, 0, 0, 0, 0......0, 0, 0, 0]
Because of culling, on the next frame there might be only 5 visible mesh instances, so the expected StorageBuffer data becomes:
[x0, y0, z0, orientation0, ....., x4, y4, z4, orientation4, 0, 0, 0, 0......0, 0, 0, 0]
I am able to achieve this by using a sync point like barrier() in the compute shader: before the sync point, each invocation resets the position and orientation data at its own index GlobalInvocationID.x to 0, so after the sync point the whole StorageBuffer is zeroed. The compute shader then checks whether the mesh instance should be visible; if so, it writes the position and orientation to index visibleMeshCount in the StorageBuffer and uses atomicAdd to increment visibleMeshCount.
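Here is a minimal WGSL sketch of that compute pass, just to make sure we are talking about the same thing. All binding names (sourceData, instanceData, visibleCount, params) and the visibility test are placeholders rather than my real code, it assumes visibleCount is reset to 0 before each dispatch, and it assumes everything runs in a single workgroup, since workgroupBarrier()/storageBarrier() only synchronize invocations within one workgroup:

```wgsl
struct Params {
    instanceCount : u32,
}

struct Counter {
    count : atomic<u32>,
}

@group(0) @binding(0) var<storage, read>       sourceData   : array<vec4<f32>>; // unculled (x, y, z, orientation)
@group(0) @binding(1) var<storage, read_write> instanceData : array<vec4<f32>>; // packed output read by the vertex shader
@group(0) @binding(2) var<storage, read_write> visibleCount : Counter;          // assumed reset to 0 every frame
@group(0) @binding(3) var<uniform>             params       : Params;

fn isVisible(inst : vec4<f32>) -> bool {
    // Placeholder for the real visibility / frustum test.
    return true;
}

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
    let i = gid.x;

    // Phase 1: every invocation zeroes its own slot.
    if (i < params.instanceCount) {
        instanceData[i] = vec4<f32>(0.0);
    }

    // Sync point. Barriers only synchronize one workgroup, so this layout
    // assumes a single-workgroup dispatch; otherwise the reset would need
    // to happen in a separate dispatch.
    storageBarrier();

    // Phase 2: append visible instances at the front of the buffer.
    if (i < params.instanceCount) {
        let inst = sourceData[i];
        if (isVisible(inst)) {
            let dst = atomicAdd(&visibleCount.count, 1u);
            instanceData[dst] = inst;
        }
    }
}
```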
Now my question is: do I actually gain any performance improvement by doing this? In the BJS inspector, I do see the absolute FPS improve quite a lot with culling compared to drawing all N mesh instances. But since the StorageBuffer has a fixed size, am I right to assume that no memory is saved, and that the improvement comes from reduced GPU computation time, as the FPS in the inspector suggests?
Other questions I would like to ask:
Do I lose some performance if I don’t set forcedInstanceCount?
Is there a performance difference if I don’t use the sync point or the atomically incremented index? This would result in a StorageBuffer where the visible mesh data is not packed contiguously at the front, e.g. there could be entries with 0 values in between the data for visible meshes (see the sketch after the example below):
[x0, y0, z0, orientation0, .., 0, 0, 0, ..., x4, y4, z4, orientation4, 0, 0, 0, 0......0, 0, 0, 0]
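For comparison, a sketch of the variant I mean, reusing the same hypothetical bindings as the compute sketch above: no barrier, no atomic counter, each invocation only touches its own slot, so culled instances just leave zeroed holes:

```wgsl
@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
    let i = gid.x;
    if (i >= params.instanceCount) {
        return;
    }
    let inst = sourceData[i];
    if (isVisible(inst)) {
        instanceData[i] = inst;           // keep the visible instance in place
    } else {
        instanceData[i] = vec4<f32>(0.0); // leave a zeroed hole
    }
}
```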