Dynamic size of instance count

I have a storage buffer, and I use a compute shader to generate the positions of a number of mesh instances.

The storage buffer has a fixed size, large enough to store the positions of all mesh instances. But the number of visible instances (<= the fixed instance capacity of the storage buffer) is determined dynamically in the compute shader (e.g. by culling).

I have another storage buffer, also updated by the compute shader, that stores the number of visible mesh instances. Is it possible to read this visible instance count from the StorageBuffer in JavaScript, and set mesh.forcedInstanceCount to this dynamically computed count on every rendered frame?

cc @Evgeni_Popov

Yes, this is theoretically possible: StorageBuffer has a read method.

However, the read uses a promise, which means that you won’t get the result for the current frame but for the next one (or even later, depending on the time needed to read from the GPU). If this is not a problem in your design, then it might work.
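To illustrate the latency point, here is a minimal sketch of a "use the last value that arrived" pattern. The `createCountTracker` helper and the `readCount` callback are hypothetical names, not Babylon.js API; `readCount` stands in for any function that returns a promise, such as a wrapper around the buffer read. The render loop never waits: it calls `poll()` and uses whatever count has already arrived.

```javascript
// Hypothetical helper: tracks the most recent value delivered by an async read.
// `readCount` is an assumed callback returning a Promise<number>.
function createCountTracker(readCount, initialCount) {
  const tracker = {
    value: initialCount, // last count that has actually arrived
    pending: false,      // true while a read is in flight
    poll() {
      if (tracker.pending) return; // never queue more than one read
      tracker.pending = true;
      readCount().then(
        (count) => {
          tracker.value = count;
          tracker.pending = false;
        },
        () => {
          tracker.pending = false; // keep the old value if the read fails
        }
      );
    },
  };
  return tracker;
}

// Simulated usage: the read resolves later, so the current frame still sees the old value.
const tracker = createCountTracker(() => Promise.resolve(5), 100);
tracker.poll();
const countThisFrame = tracker.value; // the promise has not resolved yet
```

Even with a promise that resolves immediately, the new value only becomes visible after the current synchronous code has finished, which is why the count is always at least one frame behind.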

In Babylon, there is a mode in WebGPU called “non-compatible” mode (engine.compatibilityMode = false) that uses a bundle for each call to draw a mesh. The special feature of this mode is that the number of instances (when a mesh is instantiated) is stored in a small GPU buffer, and is updated when the number of instances changes. You can update this buffer yourself by passing it to your compute shader code.

If you want to try it, here are some tips:

Hi @Evgeni_Popov

Thanks for your suggestions! I actually have many more questions. I am not familiar with the low-level WebGPU/WebGL APIs or with how these functions relate to GPU architecture and performance, hence my questions and confusion.

However, the read uses a promise, which means that you won’t get the result for the current frame but for the next one (or even later, depending on the time needed to read from the GPU). If this is not a problem in your design, then it might work.

This is a problem. I would need the count to be in sync every frame.

I actually found a different way to achieve my initial goal without needing to read the computed count on the CPU side. But I don’t know whether it actually improves performance. I explain it below. Could you check whether it makes sense?

As I mentioned, I have a StorageBuffer with a fixed capacity of N instances that looks like this: [x0, y0, z0, orientation0, ....., xN-1, yN-1, zN-1, orientationN-1]. It is bound as a vertex buffer so that the vertex and fragment shaders can read the position (x, y, z) and orientation of the N mesh instances.

With culling, the StorageBuffer is not full. I build the StorageBuffer data so that the positions and orientations of the visible meshes are packed at the beginning of the StorageBuffer. E.g. after culling, 8 mesh instances should be visible, so everything after index 31 in the StorageBuffer is 0.

[x0, y0, z0, orientation0, ....., x7, y7, z7, orientation7, 0, 0, 0, 0......0, 0, 0, 0]

Because of culling, at the next frame, there might be 5 visible mesh instances. So the expected StorageBuffer data becomes:

[x0, y0, z0, orientation0, ....., x4, y4, z4, orientation4, 0, 0, 0, 0......0, 0, 0, 0]

I am able to achieve this by using a sync point like barrier() in the compute shader and resetting the position and orientation data for the current index GlobalInvocationID.x to 0 before the sync point. This gives a StorageBuffer with every entry reset to 0 after the sync point. The compute shader code then checks whether the mesh instance should be visible; if so, it writes the position and orientation at index visibleMeshCount in the StorageBuffer and uses atomicAdd to increment visibleMeshCount.
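As a way to check the expected buffer contents, the compaction step described here can be sketched on the CPU. This is a single-threaded reference, not the shader itself: a plain increment stands in for atomicAdd, and the visibility test (x < 10) and all names are made up for illustration. On the GPU the order of the packed entries may differ, because invocations reserve their slots in an unspecified order.

```javascript
const FLOATS_PER_INSTANCE = 4; // x, y, z, orientation

// Packs visible instances at the front of `out` and returns how many were written.
function compactVisible(source, isVisible, out) {
  out.fill(0); // mirrors the "reset to 0" step done before the sync point
  let visibleCount = 0; // plays the role of the atomic counter
  const total = source.length / FLOATS_PER_INSTANCE;
  for (let i = 0; i < total; i++) {
    const src = i * FLOATS_PER_INSTANCE;
    if (!isVisible(source, src)) continue;
    const dst = visibleCount * FLOATS_PER_INSTANCE; // slot reserved by the increment
    out.set(source.subarray(src, src + FLOATS_PER_INSTANCE), dst);
    visibleCount++;
  }
  return visibleCount;
}

const source = new Float32Array([
  0, 0, 0, 0.1,  // instance 0: visible
  50, 0, 0, 0.2, // instance 1: culled
  2, 0, 0, 0.3,  // instance 2: visible
]);
const packed = new Float32Array(source.length);
// Hypothetical visibility test: keep instances whose x coordinate is below 10.
const visibleCount = compactVisible(source, (buf, offset) => buf[offset] < 10, packed);
```

Here the two visible instances end up in slots 0 and 1, and the remaining slot stays zeroed.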

Now my question is: do I actually gain any performance by doing this? In the BJS inspector, I do see the absolute FPS improve quite a lot with culling compared with drawing all N mesh instances. But since the StorageBuffer has a fixed size, am I right to assume that no memory is saved, and that the GPU computation time is what improved, judging by the absolute FPS in the inspector?

Other questions I would like to ask:

Do I lose some performance if I don’t set forcedInstanceCount?

Is there a performance difference if I don’t use a sync point or an atomically incremented index? This would result in a StorageBuffer whose visible mesh data is not packed contiguously at the front. E.g. there could be entries with 0 values in between the data for visible meshes:

[x0, y0, z0, orientation0, .., 0, 0, 0, ..., x4, y4, z4, orientation4, 0, 0, 0, 0......0, 0, 0, 0]

I’m not sure I understand this point: the absolute FPS in the inspector is based on the JavaScript time of the frame, so the GPU time shouldn’t have any impact on this counter (unless it’s > 16.6ms, but I don’t think you’re in that case?). Are you sure you didn’t have variability between runs because of the small time values involved?

You should look at the GPU frame time, to know how much GPU time your frame takes. However, this counter cannot be read by default in Chrome, you have to start it with a special flag: --disable-dawn-features=disallow_unsafe_apis.

Moreover, this counter is deliberately not very accurate, in order to limit fingerprinting… So the best solution is to use PIX to take a snapshot of a frame and see the precise timings with and without your changes. This might help to set up PIX: Using PIX with Chrome Canary · GitHub

Atomic operations are probably slower than non-atomic operations, so if you can get rid of them it’s better, but to what extent it’s hard to say… It’s very hard to answer questions about performance, it’s always better to test and compare (and even then your results may be different depending on the GPU, for example).

Note that the best performance will probably be achieved if you don’t clear the buffer and if you set forcedInstanceCount to the right value. Whether the difference from what you are doing now would be significant is impossible to say without testing…

In Babylon, there is a mode in WebGPU called “non-compatible” mode (engine.compatibilityMode = false) that uses a bundle for each call to draw a mesh. The special feature of this mode is that the number of instances (when a mesh is instantiated) is stored in a small GPU buffer, and is updated when the number of instances changes. You can update this buffer yourself by passing it to your compute shader code.

Hi @Evgeni_Popov,

After some performance measurements, it seems that compacting the data in the storage buffer alone does not improve performance. I think your suggestion is the right approach. However, I am having trouble accessing the indirectDrawBuffer.

I have set engine.compatibilityMode = false, and this is how I tried to retrieve the WebGPUDrawContext:

grassMesh.subMeshes[0]._drawWrappers[1].drawContext

  1. There are 2 items in _drawWrappers. The first draw wrapper has bindGroups and fastBundle undefined, so I tried to access _drawWrappers[1].
  2. indirectDrawBuffer doesn’t exist in drawContext, and useInstance is false. So I set grassMesh.subMeshes[0]._drawWrappers[1].drawContext.useInstance=true. After this, the indirectDrawBuffer property shows up in the log, but its value is undefined.

Do you know what I am missing that keeps indirectDrawBuffer from being instantiated?

My understanding is that if I can obtain this indirectDrawBuffer, I can pass it to my compute shader and update it from there. Is that right? I also have a question about this: should I use ComputeShader.setStorageBuffer or ComputeShader.setUniformBuffer to pass indirectDrawBuffer? My guess is setStorageBuffer.

Update: It seems setting useInstance=true works for _drawWrappers[0]. I can see that an indirectDrawBuffer is now instantiated in this DrawWrapper. I am going to test passing the indirectDrawBuffer to the compute shader, and will report back if I make any progress.

Update 2: I failed to bind indirectDrawBuffer to my compute shader. Because this indirectDrawBuffer is a GPUBuffer, it doesn’t implement the getBuffer() method the way StorageBuffer or UniformBuffer do. So I got an error at this line:

Update 3: I manually created a new storage buffer instance and replaced drawContext.indirectDrawBuffer with it. This approach seems to work. I read the instanceCount from this StorageBuffer by calling the async read() method, and I can see that the bytes at index 4 - 7 (the 2nd uint32 in the buffer) change from time to time. If I remove culling, this count matches the maximum number of mesh instances. I don’t have a Windows computer, so a friend will run PIX to confirm whether this approach actually improves the FPS.

A side question:

Why is the GPU buffer size for indirectDrawBuffer 40? If it is an unsigned int array of size 5 and each unsigned int is 4 bytes (32 bits), shouldn’t the buffer size for indirectDrawData be 5 * 4 = 20?

@Evgeni_Popov

It is still not working. Please see my last post above for more details on what I tried.

How I tested:

I removed the barrier() in the compute shader and removed the code that resets the storage buffer for mesh instance positions and orientations to 0. Now I can see leftover meshes whenever the previous frame had more instances.

The other strange thing is that although I can retrieve the correct instance count on the CPU side, setting forcedInstanceCount no longer takes effect. No matter what value I set forcedInstanceCount to (e.g. a very small number), the number of visible mesh instances does not go down. The following code, called in scene.onBeforeRenderObservable, successfully retrieves the instance count computed in the compute shader, but setting forcedInstanceCount no longer has any effect.

    // Bytes 4-7 of the buffer hold the instance count (2nd uint32, little-endian).
    const data = await indirectDrawStorageBuffer.read();
    grassMesh.forcedInstanceCount =
      data[4] * Math.pow(2, 0) +
      data[5] * Math.pow(2, 8) +
      data[6] * Math.pow(2, 16) +
      data[7] * Math.pow(2, 24);
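For reference, the same decoding can be written with a DataView instead of summing the bytes by hand. This is only a sketch: `readInstanceCount` is a hypothetical helper, and it assumes the read result is a byte view (such as a Uint8Array) holding little-endian data, which is what the manual sum also assumes.

```javascript
// Reads the uint32 at byte offset 4 (the second 32-bit word) as little-endian.
function readInstanceCount(bytes) {
  // Respect byteOffset in case `bytes` is a view into a larger ArrayBuffer.
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return view.getUint32(4, true);
}

// Sample data: first word is 0, second word is 0x0000012c (300) in little-endian order.
const sample = new Uint8Array([0, 0, 0, 0, 0x2c, 0x01, 0, 0]);
const decoded = readInstanceCount(sample);
```

The result is identical to the byte-by-byte sum; the DataView version just makes the offset and byte order explicit.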

I don’t have a PG, but hopefully you can spot whether I did anything wrong. What looks suspicious to me:

  • grassMesh.subMeshes[0]._drawWrappers is an array of two DrawWrapper instances. I am not sure if I set it correctly.
  • DrawContext doesn’t contain indirectDrawBuffer by default.
  • I can call the setter useInstance=true to initialize an indirectDrawBuffer. But it is a GPUBuffer, which cannot be used for a compute shader storage buffer binding.
  • If I create a StorageBuffer instance and set the GPUBuffer inside it as indirectDrawBuffer, everything seems to work. But setting forcedInstanceCount no longer takes effect, and mesh instances from the previous frame remain visible if the previous frame had more visible instances than the current one.

Here’s a proof of concept:

It will only work after this PR is merged:

You can’t create your own buffer, the buffer you modify must be the one created by the engine because it is embedded in a render bundle.

The size of 40 is indeed an error; it should be 20. This is corrected in the PR above.
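The arithmetic behind the 20-byte figure can be sketched as follows, assuming the buffer follows the argument layout that WebGPU defines for drawIndexedIndirect (five 32-bit values). That assumption is consistent with the thread (5 uints, instance count in bytes 4 - 7), but the constant names below are made up for illustration.

```javascript
// Field order of the arguments consumed by WebGPU's drawIndexedIndirect.
const INDIRECT_FIELDS = [
  "indexCount",
  "instanceCount",
  "firstIndex",
  "baseVertex",
  "firstInstance",
];

// One 32-bit slot per field.
const indirectArgs = new Uint32Array(INDIRECT_FIELDS.length);
const sizeInBytes = indirectArgs.byteLength; // 5 fields * 4 bytes each

// Byte offset a compute shader (or a CPU read) would use for the instance count.
const instanceCountByteOffset =
  INDIRECT_FIELDS.indexOf("instanceCount") * Uint32Array.BYTES_PER_ELEMENT;
```

With this layout the instance count is the second word, which matches the bytes 4 - 7 observed when reading the buffer back.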

If I were to follow these steps, could I test this out right now?

Yes, I think it should work.

Hi @Evgeni_Popov,

I tested with the PR build version. It works! Thanks a lot for helping. I really appreciate your help investigating the problem and creating a PoC.
