Mesh with thin instance slower than with full vertices?

Hello all,

I am trying to improve the speed of Gaussian Splatting, and I have found that the use of thin instances is a problem.

To simplify the problem, I tried rendering a certain number of quads with and without thin instances.

If I put the geometry info of these quads within a single mesh, the fps is 60.
Draw 2 million quads with single mesh

Using thin instances gives a much lower fps, around 15.
Draw 2 million quads with thin instance

I wonder if there is a problem with my usage? I thought thin instances should have better performance. Many thanks!

Welcome aboard!

Having X thin instances is faster than having X meshes (or even X instanced meshes), but it’s not necessarily faster than integrating all the data directly into vertex buffers. The GPU has extra work to do when you use instancing compared with not using it. It’s probably only a small amount of work per instance, but in the end it can add up (you have over 2.5 million instances!). What’s more, the base mesh is just 2 triangles. Perhaps with a larger geometry, the difference in performance would be smaller.
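To make "integrating all the data directly into vertex buffers" concrete, here is a minimal sketch of packing many quads into one set of position and index arrays. The helper name `buildMergedQuads` and the `spacing` parameter are hypothetical and only for illustration; this is not the code from the playgrounds above.

```typescript
// Hypothetical helper: packs `count` unit quads into a single pair of
// vertex/index arrays, so the whole set can be drawn as one mesh.
function buildMergedQuads(
    count: number,
    spacing: number = 1.5
): { positions: Float32Array; indices: Uint32Array } {
    // 4 vertices per quad, 3 floats (x, y, z) per vertex.
    const positions = new Float32Array(count * 4 * 3);
    // 2 triangles per quad, 3 indices per triangle.
    const indices = new Uint32Array(count * 6);
    // (x, y) of the four corners of a unit quad centered on the origin.
    const corners = [-0.5, -0.5, 0.5, -0.5, 0.5, 0.5, -0.5, 0.5];

    for (let q = 0; q < count; q++) {
        const offsetX = q * spacing; // lay the quads out along the x axis
        for (let c = 0; c < 4; c++) {
            const p = (q * 4 + c) * 3;
            positions[p] = corners[c * 2] + offsetX;
            positions[p + 1] = corners[c * 2 + 1];
            positions[p + 2] = 0;
        }
        const v = q * 4; // index of this quad's first vertex
        indices.set([v, v + 1, v + 2, v, v + 2, v + 3], q * 6);
    }
    return { positions, indices };
}
```

In Babylon.js, these arrays could then be assigned to the `positions` and `indices` of a `VertexData` object and applied to a mesh with `applyToMesh`. The trade-off is that every quad's vertices are stored explicitly, instead of one base quad plus per-instance data.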


Thank you @Evgeni_Popov~ If it is expected that a mesh with thin instances renders much slower than one without them, I will have to try some other approaches to render a large number of splats (quads).

Instancing and merging are both ways to optimize rendering performance.
The advantages of instances are a smaller memory footprint and more flexibility in operation.
In theory, merged nodes should render faster, but they take up more memory.

For example:
Babylon.js Playground (babylonjs.com)
In this example, if 10+ meshes need to be rendered, instancing performs better than node merging.

Millions of quads within one single mesh | Babylon.js Playground (babylonjs.com)
See, in this case instancing is more practical than merging.

Thank you for the nice explanation @xiehangyun. I get it: the advantage of instancing is fewer draw calls.
I was trying to render a large number of quads (to simulate Gaussian splatting) within a single batch, so I need something like geometry merging or instancing.

It is strange that a Three.js instanced mesh (which I thought was the equivalent of Babylon instancing) with the same number of quads (2.5 million) can still run at a full 60 fps.
Example here

Whereas Babylon instancing, as mentioned above, runs at 15 fps.

@xiehangyun @Evgeni_Popov

This seems to be a problem with Angle and DirectX…

If you change Angle’s backend to OpenGL (chrome://flags/ => angle), you’ll see that you get 60fps. Similarly, using WebGPU as an engine makes the PG run at 60fps.

The problem is that Angle reorganizes our buffers and de-interleaves them, causing cache misses to skyrocket! We’ll have to find out why, and probably open a ticket on the chromium tracker…

Using OpenGL as ANGLE backend or using WebGPU did fix this problem! Thank you Evgeni!
It would be great if Babylon instancing could have performance similar to Three.js on the default D3D11 backend.

@Evgeni_Popov is any workaround possible on our side to match the performance while waiting for the fix?

I found out that the performance gap is due to the STATIC_DRAW / DYNAMIC_DRAW usage flag passed when uploading buffer data. Three.js uses static by default while Babylon uses dynamic.

/// thinEngine.ts
public createVertexBuffer(data: DataArray, _updatable?: boolean, _label?: string): DataBuffer {
    return this._createVertexBuffer(data, this._gl.STATIC_DRAW);
}
public createDynamicVertexBuffer(data: DataArray, _label?: string): DataBuffer {
    return this._createVertexBuffer(data, this._gl.DYNAMIC_DRAW);
}
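For readers less familiar with raw WebGL, the sketch below shows where this usage hint is passed. `GLLike` is a hypothetical minimal interface (a structural subset of `WebGLRenderingContext`) and `uploadVertexData` is an illustrative helper; neither is part of Babylon.js or Three.js.

```typescript
// Hypothetical minimal subset of WebGLRenderingContext, declared here so the
// sketch is self-contained.
interface GLLike {
    readonly ARRAY_BUFFER: number;
    readonly STATIC_DRAW: number;
    readonly DYNAMIC_DRAW: number;
    createBuffer(): unknown;
    bindBuffer(target: number, buffer: unknown): void;
    bufferData(target: number, data: Float32Array, usage: number): void;
}

// Illustrative helper: uploads vertex data and picks the usage hint.
// STATIC_DRAW tells the driver the data is written once and drawn many times;
// DYNAMIC_DRAW tells it the data is expected to change often.
function uploadVertexData(gl: GLLike, data: Float32Array, isStatic: boolean): unknown {
    const buffer = gl.createBuffer();
    gl.bindBuffer(gl.ARRAY_BUFFER, buffer);
    gl.bufferData(gl.ARRAY_BUFFER, data, isStatic ? gl.STATIC_DRAW : gl.DYNAMIC_DRAW);
    return buffer;
}
```

The flag is only a hint, so how much it matters depends on the driver or translation layer underneath.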

After passing true as the staticBuffer param of thinInstanceSetBuffer(), the fps rose to 60. See PG

quad.thinInstanceSetBuffer("matrix", matricesData, 16, true);
quad.thinInstanceSetBuffer("color", colorData, 4, true);
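For context, here is a minimal sketch of how buffers like `matricesData` and `colorData` might be filled before the calls above. The helper name `buildInstanceBuffers`, the `spacing` parameter, and the layout along the x axis are assumptions for illustration, not the code from the PG.

```typescript
// Hypothetical helper: builds per-instance world matrices (16 floats each)
// and RGBA colors (4 floats each) for `count` thin instances.
function buildInstanceBuffers(
    count: number,
    spacing: number = 1.5
): { matrices: Float32Array; colors: Float32Array } {
    const matrices = new Float32Array(count * 16);
    const colors = new Float32Array(count * 4);

    for (let i = 0; i < count; i++) {
        const m = i * 16;
        // Identity matrix: ones on the diagonal (typed arrays start zero-filled).
        matrices[m] = 1;
        matrices[m + 5] = 1;
        matrices[m + 10] = 1;
        matrices[m + 15] = 1;
        // Babylon.js matrices keep the translation in elements 12, 13 and 14.
        matrices[m + 12] = i * spacing;
        // Opaque white for every instance.
        colors.set([1, 1, 1, 1], i * 4);
    }
    return { matrices, colors };
}
```

The resulting `matrices` and `colors` arrays would then be passed as `matricesData` and `colorData`, with `true` as the last argument when the data is not going to be updated after creation.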

@Evgeni_Popov I wonder if this could have an impact in a lot of other places?

Maybe we can default to dynamic=false and mark it a breaking change? It would be a small one and an easy fix for people that don’t pass a value and expect dynamic=true, but at least the fastest mode would be enabled by default?

I am just so surprised by this level of impact, knowing it is only a hint.

Is there a way to validate/repro on all platforms?

And I agree: if it is as described, we should default to the fastest :slight_smile:

The level of impact would depend on the size of the data and on how Angle reorganizes the buffers. The problem in the PG is that the data is quite big, and because of Angle reorganization, we experience cache misses for each instance. With smaller data, the effect may be less dramatic.

I can’t reproduce this on my iPhone SE / Samsung Galaxy A23: the 3 variations (Babylon dynamic/static and the Three.js fiddle) have the same performance.

This is an awesome thread. I’m glad someone is deep diving into this.
