I could only run ~2000 instances before the FPS dropped below 60 in the PG below. The PG is taken from the Babylon baked texture animation demo:
I found this tweet about baked animation in WebGL, which linked to their playable demo: https://exp-abduction.lusion.co/. I checked the index.js file of their code; the following are the instance counts used in the scene.
Is there a way to profile the GPU performance? I tried to profile the PG. As I increase the number of thin instances, I don't see any JS function consuming significantly more time. In the inspector, I see both GPU frame time and inter-frame time going up with more instances.
So this is probably due to the shader complexity.
Also, to compare with Abduction, we have to be sure they are using the same number of everything (joints, bones, animations, etc…).
Thanks for asking. It is on my side to do more investigation; I don't have questions right now. I am a bit overwhelmed by two kids during the summer, and my wife is happier seeing me practicing LC&SD instead of showing her 3D warrior models running all over the place.
I am able to run at 60 FPS with this PG on my computer, but if I comment out the line below, FPS drops significantly, down to 50. It also remains at 60 FPS if I increase the instance number to 20k with the following line. If the following line is commented out, FPS for 20k instances drops to ~28.
meshes[1].computeBonesUsingShaders = false;
So I guess this performance issue has something to do with bone computation using shaders? Do you have any suggestions on what to investigate next?
When computeBonesUsingShaders = true, the code is doing 4 texture reads PER vertex PER instance PER frame.
When computeBonesUsingShaders = false, there are no texture reads anymore; the vertices are transformed according to their bones on the CPU, and the final vertex positions are uploaded to the GPU once per frame.
As 2 is independent of the number of instances and has a fixed cost (depending on the number of vertices), there's a point where 2 will be faster than 1, and the balance tips further in favor of 2 as the number of instances increases.
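A rough back-of-the-envelope model of the two paths can make the crossover concrete. The constants below are made up purely for illustration (they are not measured costs); only the scaling behavior matches the explanation above:

```javascript
// Illustrative cost model (constants are assumptions, not measurements):
// the GPU skinning path pays a cost per vertex PER instance, while the CPU
// path pays a fixed per-vertex cost that is independent of instance count.
const GPU_COST_PER_VERTEX = 4;   // e.g. texture reads per vertex per instance
const CPU_COST_PER_VERTEX = 50;  // arbitrary units for CPU skinning + upload

function gpuSkinningCost(vertices, instances) {
  return vertices * instances * GPU_COST_PER_VERTEX;
}

function cpuSkinningCost(vertices) {
  return vertices * CPU_COST_PER_VERTEX; // independent of instance count
}

// With these made-up constants, the crossover sits at
// CPU_COST_PER_VERTEX / GPU_COST_PER_VERTEX = 12.5 instances:
console.log(gpuSkinningCost(1000, 10) < cpuSkinningCost(1000));  // → true
console.log(gpuSkinningCost(1000, 100) < cpuSkinningCost(1000)); // → false
```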
As your mesh has a very low number of vertices, 2 is very fast and quickly outperforms 1 (also depending on the GPU - I'm still at 60fps with 20k instances and bones computed in shader).
In this configuration, bones will be applied on the GPU but using a static bone array; there's no texture read involved. It will likely be faster than 1, but probably not faster than 2, given that your mesh has a very low vertex count.
Note that this latest configuration sets a limit on the total number of bones a skeleton can have (which depends on the number of uniforms a vertex shader can take).
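To see where that limit comes from, here is a small sketch. The 128-vector figure is the WebGL 1 guaranteed minimum for MAX_VERTEX_UNIFORM_VECTORS; the number of vectors reserved for other uniforms is an assumption for illustration:

```javascript
// Each 4x4 bone matrix occupies 4 vec4 uniform slots, and a vertex shader
// only has MAX_VERTEX_UNIFORM_VECTORS slots in total, some of which are
// needed for other uniforms (world/view/projection matrices, etc.).
function maxBonesForUniformArray(maxVertexUniformVectors, reservedVectors) {
  const availableVectors = maxVertexUniformVectors - reservedVectors;
  return Math.floor(availableVectors / 4); // 4 vec4 slots per bone matrix
}

// WebGL 1 guarantees at least 128 vertex uniform vectors; reserving 32 for
// other uniforms (an assumed figure) leaves room for 24 bones:
console.log(maxBonesForUniformArray(128, 32)); // → 24
```

A bone texture sidesteps this entirely, which is why it is the default despite the extra texture reads.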
In this configuration, bones will be applied on the GPU but using a static bone array; there's no texture read involved.
I thought about this as well. For the mesh.skeleton.useTextureToStoreBoneMatrices = false configuration, do you mean it will use the mBones array in the else branch of BONETEXTURE? I also thought this else branch could be faster than texture2D() calls.
But when I tried to apply this parameter in my PG, I didn't see any difference in FPS:
Is there a way to verify that setting useTextureToStoreBoneMatrices = false; actually uses the bone array instead of texture reads?
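One self-contained way to check is to capture the compiled vertex shader source (for example with a frame-capture tool such as Spector.js) and look at which bone-reading identifiers it contains. The helper below only does string matching on a shader source you supply; the identifiers are the ones discussed in this thread:

```javascript
// Classify which bone path a captured vertex shader source was compiled
// with, based on the identifiers mentioned in this thread. This is a
// heuristic sketch, not a Babylon.js API.
function boneReadPath(vertexShaderSource) {
  if (vertexShaderSource.includes("readMatrixFromRawSampler")) {
    return "texture";        // bones fetched with texture2D() reads
  }
  if (vertexShaderSource.includes("mBones")) {
    return "uniform-array";  // bones read from the static mBones array
  }
  return "no-skinning";
}

// Minimal fake snippets standing in for real captures:
console.log(boneReadPath("uniform mat4 mBones[32];"));                       // → "uniform-array"
console.log(boneReadPath("mat4 m = readMatrixFromRawSampler(bones, i);"));   // → "texture"
```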
(also depending on the GPU - I'm still at 60fps with 20k instances and bones computed in shader)
That's the other thing I want to find out. I use Chrome on Mac with the following:
Is Windows generally better than Mac? What is your graphics card, given that you can run 20K instances at 60 FPS?
BTW, I just tested on Safari with meshes[1].computeBonesUsingShaders = true; (bones computed in shader). It was able to stay at 60 FPS for 10K instances and dropped to 36 FPS for 20K instances. My original post was tested on Chrome: 50 FPS for 10K instances, 27 FPS for 20K instances.
That's really the best tool you can use to see what's going on under the hood in your frame.
Windows is neither better nor worse than Mac; it's different.
My GPU is an NVIDIA 3080 Ti, so not a bad GPU.
I can see in your screenshot that you have two GPUs: are you sure you are using the Radeon in your testing and not the Intel integrated graphics? The Radeon is probably faster than the Intel.
When computeBonesUsingShaders = true, the vertex shader code is doing 4 texture reads PER vertex PER instance PER frame.
Is it right that it might be multiple "4 texture reads", depending on the number of bones that influence the vertex? So the total is 4 * NumberOfBoneInfluences texture reads.
I recorded the following screenshots for the two test PGs above. They seem to show that the two different bone matrix reading techniques are applied correctly for the two test PGs.
No, it is 4 texture reads if you have 4 bone influences or less, and 8 reads if you have more than 4 (and less than 8, as we don't support more than 8 bone influences per vertex).
Yes, the first screenshot corresponds to bones being passed through a uniform buffer and not doing texture reads, and the second screenshot to bones being passed through a texture.
That means you are really bound by the vertex shader (the number of instructions in the vertex shader) and not by texture throughput. So the solution for you is to use CPU bone calculation if you want the best performance.
Hmmm, I think each bone influence requires 4 texture2D calls. Because a texture2D call returns a vec4, a bone matrix requires 4 texture2D calls to build up a 4x4 transformation matrix. You can see this in the readMatrixFromRawSampler implementation.
For example, the last screenshot that calls readMatrixFromRawSampler has 4 influences. It calls readMatrixFromRawSampler 4 times, so in total texture2D has been called 4 * 4 = 16 times.
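Following this reasoning, the per-vertex read count can be written down as a tiny formula. The 4-reads-per-matrix figure is the one claimed above (it may differ between Babylon versions), and the cap of 8 influences comes from earlier in the thread:

```javascript
// Total texture2D() calls per skinned vertex, per the reasoning above:
// one readMatrixFromRawSampler call per bone influence, and (per the post)
// 4 texture2D() reads per call to assemble a 4x4 matrix.
const READS_PER_MATRIX = 4; // assumed figure from the discussion above

function texture2DCallsPerVertex(boneInfluences) {
  // At most 8 influences per vertex are supported (stated earlier in the thread)
  const influences = Math.min(boneInfluences, 8);
  return influences * READS_PER_MATRIX;
}

console.log(texture2DCallsPerVertex(4)); // → 16, matching the 4 * 4 count above
```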