Baked animation performance

Hi babylon team,

Happy Friday!

I could only run ~2000 instances before FPS drops below 60 in the PG below. This PG is taken from babylon baked texture animation:

I found this tweet about baked animation in WebGL, which linked to their playable demo: https://exp-abduction.lusion.co/. I checked the index.js file of their code; the following are the numbers of instances used in the scene.

Lowest: 8192,
Low: 16384,
Medium: 32768,
High: 65536,
Highest: 131072,
Extreme: 262144

Can we also support this level of instances?

Probably :slight_smile: we will need first to investigate with the profiler to see where the perf hit is

Wanna check the f12 profiler and see where we are spending most of the time? We have a lot of options to optimize rendering

Hi @Deltakosh

Is there a way to profile the GPU performance? I tried to profile the PG. As I increase the number of thin instances, I don’t see any JS function consume significantly more time. In the inspector, I see GPU frame time and inter frame time both going up with more instances.

So this is probably due to the shader complexity
also to compare with Abduction we have to be sure they are using the same number of everything (joints, bones, animations, etc…)

I can prepare a PG by taking their lowpoly model and baking some simple mixamo animations. If this helps, I would make one when I have time.

yes and then we can have a look at the perf of the shader to see if we can do better (for instance I’m pretty sure they are not using a PBR shader)

Hello @slin just checking in if you’d like any more help with this?

Hi @carolhmj

Thanks for asking. It is on my side to do more investigation. I don’t have any questions right now. I am a bit overwhelmed by two kids during the summer and my wife is happier seeing me practicing LC&SD instead of showing her 3D warrior models running all over the place. :smiling_face_with_tear:

Good luck! xD We’ll be here if you need anything :stuck_out_tongue:

Hi @Deltakosh, Hi @Evgeni_Popov,

Can you have a look at this PG: https://playground.babylonjs.com/#9QCNHK#110

I can run this PG at 60 FPS on my computer. But if I comment out the line below, FPS drops to 50. It also stays at 60 FPS if I increase the instance count to 20k with the line below in place. If the line is commented out, FPS for 20k instances drops to ~28.

meshes[1].computeBonesUsingShaders = false;

So I guess this performance issue has something to do with bone computation using shaders? Do you have any suggestions what to investigate next?

It’s kind of expected.

  1. When computeBonesUsingShaders = true, the code is doing 4 texture reads PER vertex PER instance per frame.

  2. When computeBonesUsingShaders = false, there are no texture reads anymore, but the vertices are transformed according to their bones on the CPU and the final vertex positions are uploaded to the GPU once per frame.

As 2 is independent from the number of instances and has a fixed cost (depending on the number of vertices), there’s a point where 2 will be faster than 1, and it will be more and more in favor of 2 with increasing number of instances.

As your mesh has a very low number of vertices, 2 is very fast and quickly outperforms 1 (also depending on the GPU - I’m still at 60fps with 20k instances and bones computed in shader).
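To make the crossover argument concrete, here is a rough counting sketch. It only counts operations (texture reads for 1, CPU-skinned vertices for 2) and does not predict frame times, since a texture read and a CPU vertex transform do not cost the same. The function names and the 500-vertex mesh are made up for illustration; the default of 4 reads per vertex is the figure quoted above.

```javascript
// Rough operation counts for the two skinning modes discussed above.
// These are counts, not timings.

// Mode 1 (computeBonesUsingShaders = true): reads happen per vertex,
// per instance, per frame. readsPerVertex defaults to the figure of 4
// quoted in this thread.
function gpuTextureReadsPerFrame(vertexCount, instanceCount, readsPerVertex = 4) {
    return vertexCount * instanceCount * readsPerVertex;
}

// Mode 2 (computeBonesUsingShaders = false): vertices are skinned once
// per frame on the CPU, independent of the instance count.
function cpuSkinnedVerticesPerFrame(vertexCount) {
    return vertexCount;
}

// Hypothetical low-poly mesh with 500 vertices (not taken from the PG).
const vertexCount = 500;
for (const instances of [2000, 10000, 20000]) {
    console.log(
        "instances:", instances,
        "GPU texture reads:", gpuTextureReadsPerFrame(vertexCount, instances),
        "CPU skinned vertices:", cpuSkinnedVerticesPerFrame(vertexCount)
    );
}
```

The GPU-side count grows linearly with the instance count while the CPU-side count stays flat, which is why 2 wins more and more as instances increase.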

You can try another option:

  • mesh.computeBonesUsingShaders = true
  • mesh.skeleton.useTextureToStoreBoneMatrices = false

In this configuration, bones will be applied on the GPU but using a static bone array, there’s no texture read involved. It will likely be faster than 1 but probably not than 2 given that your mesh has a very low vertex count.
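For quick A/B testing in a PG, the three configurations can be wrapped in a small helper. This is only a sketch: the mode names ("gpuTexture", "gpuUniforms", "cpu") and the function name are made up here, and it assumes that mode 1 corresponds to useTextureToStoreBoneMatrices = true. The two properties themselves are the ones named in this thread.

```javascript
// Applies one of the three bone modes discussed in this thread to a mesh.
// Mode names are hypothetical labels for this sketch.
function applyBoneMode(mesh, mode) {
    switch (mode) {
        case "gpuTexture":
            // Mode 1: skinning in the vertex shader, matrices read from a texture.
            mesh.computeBonesUsingShaders = true;
            mesh.skeleton.useTextureToStoreBoneMatrices = true;
            break;
        case "gpuUniforms":
            // Alternative: skinning in the vertex shader, matrices in a bone array.
            mesh.computeBonesUsingShaders = true;
            mesh.skeleton.useTextureToStoreBoneMatrices = false;
            break;
        case "cpu":
            // Mode 2: skinning on the CPU, positions uploaded once per frame.
            mesh.computeBonesUsingShaders = false;
            break;
        default:
            throw new Error("Unknown bone mode: " + mode);
    }
    return mesh;
}

// Stand-in object so the sketch runs outside the Playground.
const fakeMesh = { computeBonesUsingShaders: true, skeleton: { useTextureToStoreBoneMatrices: true } };
applyBoneMode(fakeMesh, "gpuUniforms");
console.log(fakeMesh.computeBonesUsingShaders, fakeMesh.skeleton.useTextureToStoreBoneMatrices);
```

In a PG the call would look like applyBoneMode(meshes[1], "gpuUniforms"), assuming meshes[1] is the skinned mesh.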

Note that this last configuration will set a limit to the total number of bones a skeleton can have (which depends on the number of uniforms a vertex shader can take).
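A back-of-the-envelope sketch of that limit, assuming plain (non-buffer) uniforms: a mat4 takes four vec4 uniform slots, and WebGL reports the vertex shader budget in vec4 units through gl.getParameter(gl.MAX_VERTEX_UNIFORM_VECTORS). The reservedVectors parameter is a hypothetical allowance for the shader's other uniforms; the actual figure Babylon.js uses is not stated in this thread, so treat the result as an upper-bound estimate only.

```javascript
// Estimates how many bone matrices fit in a vertex shader's uniform budget.
// maxVertexUniformVectors: value of gl.MAX_VERTEX_UNIFORM_VECTORS (vec4 units).
// reservedVectors: hypothetical number of vec4 slots used by other uniforms.
function estimateMaxBones(maxVertexUniformVectors, reservedVectors = 0) {
    const VEC4_PER_MAT4 = 4; // one 4x4 matrix = four vec4 slots
    const available = Math.max(0, maxVertexUniformVectors - reservedVectors);
    return Math.floor(available / VEC4_PER_MAT4);
}

// Example budgets; 32 reserved slots is an arbitrary illustrative value.
console.log(estimateMaxBones(256, 32));  // 56
console.log(estimateMaxBones(1024, 32)); // 248
```

In a browser, the first argument could be obtained with gl.getParameter(gl.MAX_VERTEX_UNIFORM_VECTORS) on the canvas's WebGL context.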

Hi @Evgeni_Popov

Thanks a lot for your inputs.

In this configuration, bones will be applied on the GPU but using a static bone array, there’s no texture read involved.

I thought about this as well. For the mesh.skeleton.useTextureToStoreBoneMatrices = false configuration, do you mean it will use the mBones array in the else branch of BONETEXTURE? I also thought this else branch could be faster than texture2D() calls.

But when I applied this setting in my PG, I didn’t see any difference in FPS:

    meshes[1].computeBonesUsingShaders = true;
    meshes[1].skeleton.useTextureToStoreBoneMatrices = false;

Is there a way to verify if setting useTextureToStoreBoneMatrices = false; does use the bone array instead of texture read?

(also depending on the GPU - I’m still at 60fps with 20k instances and bones computed in shader)

That’s the other thing I want to find out. :rofl: I use Chrome on Mac with the following:

[Screenshot: Screen Shot 2022-11-05 at 8.15.38 AM]

Is Windows generally better than Mac? What graphics card do you have for running 20K instances at 60 FPS?

BTW, I just tested on Safari with meshes[1].computeBonesUsingShaders = true; (bones computed in shader). It stayed at 60 FPS for 10K instances and dropped to 36 FPS for 20K instances. My original post was tested on Chrome: 50 FPS for 10K instances, 27 FPS for 20K instances.

Yes indeed.

Use Spector.js => https://spector.babylonjs.com/

That’s really the best tool you can use to see what’s going on under the hood in your frame.

Windows is neither better nor worse than Mac, it’s different :slight_smile:

My GPU is a NVidia 3080Ti, so not a bad GPU.

I can see in your screenshot that you have two GPUs: are you sure you are using the Radeon in your testing and not the Intel integrated graphics, as the Radeon is probably faster than the Intel?

@Evgeni_Popov

Thanks a lot for helping. I will check with spector.js. Hopefully accessing the bone matrix array will give better performance.

are you sure you are using the Radeon in your testing and not the Intel integrated graphics, as the Radeon is probably faster than the Intel?

Yes. Babylon.js inspector shows Chrome is using Radeon.

My GPU is a NVidia 3080Ti, so not a bad GPU.

Your GPU seems to win with a big margin. :grinning:

This is such good information that I think we should add it to the docs, expanding this section here: Bones and Skeletons | Babylon.js Documentation :smiley:

Done:

@Evgeni_Popov

In your documentation PR:

  1. When computeBonesUsingShaders = true, the vertex shader code is doing 4 texture reads PER vertex PER instance PER frame.

Is it right that it might be multiple “4 texture reads”, depending on the number of bones that have influences on the vertex? So the total is 4 * NumberOfBoneInfluences texture reads.

Hi @Evgeni_Popov

I tested the following 2 PGs with spector.js Chrome extension. The performance gain is not very promising. :broken_heart:

mBones array FPS: 51.5 - 52

texture read FPS: 50.5 - 51
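To put the two FPS ranges on a common scale, here is a small sketch that converts them to frame times. It takes the midpoint of each range, which is my own simplification; the helper names are made up.

```javascript
// Converts the FPS ranges reported above into frame times (ms) so the
// difference between the two modes can be compared directly.
function frameTimeMs(fps) {
    return 1000 / fps;
}

function midpoint(low, high) {
    return (low + high) / 2;
}

const arrayFps = midpoint(51.5, 52);   // mBones array
const textureFps = midpoint(50.5, 51); // texture read

const savedMs = frameTimeMs(textureFps) - frameTimeMs(arrayFps);
const relativeGain = savedMs / frameTimeMs(textureFps);

console.log("mBones array frame time (ms):", frameTimeMs(arrayFps).toFixed(2));
console.log("texture read frame time (ms):", frameTimeMs(textureFps).toFixed(2));
console.log("saved per frame (ms):", savedMs.toFixed(2));
console.log("relative gain (%):", (relativeGain * 100).toFixed(1));
```

With these midpoints the saving is roughly 0.4 ms per frame, about 2%.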

I recorded the following screenshots for the two test PGs above. They seem to show that the two different bone matrix reading techniques are applied correctly in the two test PGs.

read from mBones array:

texture read, which is implemented in readMatrixFromRawSampler

No, it is 4 texture reads if you have 4 bone influences or less and 8 reads if you have more than 4 (up to 8, as we don’t support more than 8 bone influences per vertex).

Yes, the first screenshot corresponds to bones being passed through a uniform buffer and not doing texture reads, and the second screenshot to bones being passed through a texture.

That means you are really bound by the vertex shader (the number of instructions in the vertex shader) and not the texture throughput. So, the solution for you is to use CPU bones calculation if you want the best performance.

Hmmm, I think each bone influence requires 4 texture2D calls. Because a texture2D call returns a vec4, a bone matrix requires 4 texture2D calls to build up a 4 * 4 transformation matrix. You can see this in the readMatrixFromRawSampler implementation.

For example, the last screenshot that calls readMatrixFromRawSampler has 4 influences. It calls readMatrixFromRawSampler 4 times. So in total texture2D has been called 4 * 4 = 16 times.
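Here is a CPU-side sketch of that counting argument. It is not the Babylon.js shader code, only a JavaScript analogue: it assumes one RGBA float texel holds 4 floats and that bone i occupies 4 consecutive texels, and all names (makeTexelReader, readMatrix) are made up for illustration.

```javascript
// CPU-side analogue of reading a bone matrix from a texture.
// Assumed layout (for this sketch only): one RGBA float texel = 4 floats,
// and bone i occupies texels 4*i .. 4*i+3.
const TEXELS_PER_MATRIX = 4;

function makeTexelReader(data) {
    let fetchCount = 0;
    return {
        // Stand-in for one texture2D() call: returns one vec4 as an array.
        fetch(texelIndex) {
            fetchCount++;
            const o = texelIndex * 4;
            return [data[o], data[o + 1], data[o + 2], data[o + 3]];
        },
        get fetchCount() {
            return fetchCount;
        },
    };
}

// Builds one 4x4 matrix (16 floats) from 4 texel fetches.
function readMatrix(reader, boneIndex) {
    const base = boneIndex * TEXELS_PER_MATRIX;
    const m = [];
    for (let i = 0; i < TEXELS_PER_MATRIX; i++) {
        m.push(...reader.fetch(base + i));
    }
    return m;
}

// Two bones' worth of dummy data: 2 matrices * 16 floats.
const data = Float32Array.from({ length: 32 }, (_, i) => i);
const reader = makeTexelReader(data);

// One vertex with 4 bone influences (bone indices are arbitrary).
const influences = [0, 1, 1, 0];
const matrices = influences.map((bone) => readMatrix(reader, bone));

console.log(reader.fetchCount); // 16 = 4 influences * 4 fetches per matrix
```

Under these assumptions the fetch count scales as 4 * numberOfInfluences per vertex, matching the 4 * 4 = 16 figure above.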