I could only run ~2000 instances before the FPS dropped below 60 in the PG below. The PG is taken from the Babylon baked texture animation demo:
I found this tweet about baked animation in WebGL, which linked to their playable demo: https://exp-abduction.lusion.co/. I checked the index.js file of their code; the following are the instance counts used in the scene.
Is there a way to profile the GPU performance? I tried to profile the PG. As I increase the number of thin instances, I don't see any JS function consuming significantly more time. In the inspector, I see both GPU frame time and inter-frame time going up with more instances.
So this is probably due to the shader complexity.
Also, to compare with Abduction, we have to be sure they are using the same number of everything (joints, bones, animations, etc…).
Thanks for asking. It is on my side to do more investigation; I don't have questions right now. I am a bit overwhelmed by two kids during the summer, and my wife is happier seeing me practicing LC&SD instead of showing her 3D warrior models running all over the place.
I am able to run at 60 FPS with this PG on my computer, but if I comment out the line below, FPS drops significantly, down to 50. It also remains at 60 FPS if I increase the instance number to 20k with the following line. If the following line is commented out, FPS for 20k instances drops to ~28.
meshes[1].computeBonesUsingShaders = false;
So I guess this performance issue has something to do with bone computation using shaders? Do you have any suggestions on what to investigate next?
When computeBonesUsingShaders = true, the code is doing 4 texture reads PER vertex PER instance PER frame.
When computeBonesUsingShaders = false, there are no texture reads anymore; the vertices are transformed according to their bones on the CPU, and the final vertex positions are uploaded to the GPU once per frame.
As 2 is independent of the number of instances and has a fixed cost (depending on the number of vertices), there's a point where 2 will be faster than 1, and the balance tips further in favor of 2 as the number of instances increases.
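A rough back-of-the-envelope model of the two paths can make the crossover concrete. The constants below are made up purely for illustration (they are not measured costs); only the scaling behavior matches the explanation above:

```javascript
// Illustrative cost model (constants are assumptions, not measurements):
// the GPU skinning path pays a cost per vertex PER instance, while the CPU
// path pays a fixed per-vertex cost that is independent of instance count.
const GPU_COST_PER_VERTEX = 4;   // e.g. texture reads per vertex per instance
const CPU_COST_PER_VERTEX = 50;  // arbitrary units for CPU skinning + upload

function gpuSkinningCost(vertices, instances) {
  return vertices * instances * GPU_COST_PER_VERTEX;
}

function cpuSkinningCost(vertices) {
  return vertices * CPU_COST_PER_VERTEX; // independent of instance count
}

// With these made-up constants, the crossover sits at
// CPU_COST_PER_VERTEX / GPU_COST_PER_VERTEX = 12.5 instances:
console.log(gpuSkinningCost(1000, 10) < cpuSkinningCost(1000));  // → true
console.log(gpuSkinningCost(1000, 100) < cpuSkinningCost(1000)); // → false
```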
As your mesh has a very low number of vertices, 2 is very fast and quickly outperforms 1 (also depending on the GPU - I'm still at 60fps with 20k instances and bones computed in shader).
In this configuration, bones will be applied on the GPU but using a static bone array; there's no texture read involved. It will likely be faster than 1, but probably not faster than 2, given that your mesh has a very low vertex count.
Note that this latest configuration sets a limit on the total number of bones a skeleton can have (which depends on the number of uniforms a vertex shader can take).
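To see where that limit comes from, here is a small sketch. The 128-vector figure is the WebGL 1 guaranteed minimum for MAX_VERTEX_UNIFORM_VECTORS; the number of vectors reserved for other uniforms is an assumption for illustration:

```javascript
// Each 4x4 bone matrix occupies 4 vec4 uniform slots, and a vertex shader
// only has MAX_VERTEX_UNIFORM_VECTORS slots in total, some of which are
// needed for other uniforms (world/view/projection matrices, etc.).
function maxBonesForUniformArray(maxVertexUniformVectors, reservedVectors) {
  const availableVectors = maxVertexUniformVectors - reservedVectors;
  return Math.floor(availableVectors / 4); // 4 vec4 slots per bone matrix
}

// WebGL 1 guarantees at least 128 vertex uniform vectors; reserving 32 for
// other uniforms (an assumed figure) leaves room for 24 bones:
console.log(maxBonesForUniformArray(128, 32)); // → 24
```

A bone texture sidesteps this entirely, which is why it is the default despite the extra texture reads.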
In this configuration, bones will be applied on the GPU but using a static bone array; there's no texture read involved.
I thought about this as well. For the mesh.skeleton.useTextureToStoreBoneMatrices = false configuration, do you mean it will use the mBones array in the else branch of BONETEXTURE? I also thought this else branch could be faster than texture2D() calls.
But when I tried to apply this parameter in my PG, I didn't see any difference in FPS:
Is there a way to verify that setting useTextureToStoreBoneMatrices = false; actually uses the bone array instead of texture reads?
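One self-contained way to check is to capture the compiled vertex shader source (for example with a frame-capture tool such as Spector.js) and look at which bone-reading identifiers it contains. The helper below only does string matching on a shader source you supply; the identifiers are the ones discussed in this thread:

```javascript
// Classify which bone path a captured vertex shader source was compiled
// with, based on the identifiers mentioned in this thread. This is a
// heuristic sketch, not a Babylon.js API.
function boneReadPath(vertexShaderSource) {
  if (vertexShaderSource.includes("readMatrixFromRawSampler")) {
    return "texture";        // bones fetched with texture2D() reads
  }
  if (vertexShaderSource.includes("mBones")) {
    return "uniform-array";  // bones read from the static mBones array
  }
  return "no-skinning";
}

// Minimal fake snippets standing in for real captures:
console.log(boneReadPath("uniform mat4 mBones[32];"));                       // → "uniform-array"
console.log(boneReadPath("mat4 m = readMatrixFromRawSampler(bones, i);"));   // → "texture"
```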
(also depending on the GPU - I'm still at 60fps with 20k instances and bones computed in shader)
That's the other thing I want to find out. I use Chrome on Mac with the following:
Is Windows generally better than Mac? What is your graphics card, given that you can run 20K instances at 60 FPS?
BTW, I just tested on Safari with meshes[1].computeBonesUsingShaders = true; (bones computed in shader). It was able to stay at 60 FPS for 10K instances and dropped to 36 FPS for 20K instances. My original post was tested on Chrome: 50 FPS for 10K instances, 27 FPS for 20K instances.
That's really the best tool you can use to see what's going on under the hood in your frame.
Windows is neither better nor worse than Mac; it's different.
My GPU is an NVIDIA 3080 Ti, so not a bad GPU.
I can see in your screenshot that you have two GPUs: are you sure you are using the Radeon in your testing and not the Intel integrated graphics? The Radeon is probably faster than the Intel.
When computeBonesUsingShaders = true, the vertex shader code is doing 4 texture reads PER vertex PER instance PER frame.
Is it right that it might be multiple "4 texture reads", depending on the number of bones that influence the vertex? So the total is 4 * NumberOfBoneInfluences texture reads.
I recorded the following screenshots for the two test PGs above. They seem to show that the two different bone matrix reading techniques are applied correctly for the two test PGs.
No, it is 4 texture reads if you have 4 bone influences or less, and 8 reads if you have more than 4 (and less than 8, as we don't support more than 8 bone influences per vertex).
Yes, the first screenshot corresponds to bones being passed through a uniform buffer and not doing texture reads, and the second screenshot to bones being passed through a texture.
That means you are really bound by the vertex shader (the number of instructions in the vertex shader) and not by texture throughput. So the solution for you is to use CPU bone calculation if you want the best performance.
Hmmm, I think each bone influence requires 4 texture2D calls. Because a texture2D call returns a vec4, a bone matrix requires 4 texture2D calls to build up a 4x4 transformation matrix. You can see this in the readMatrixFromRawSampler implementation.
For example, the last screenshot that calls readMatrixFromRawSampler has 4 influences. It calls readMatrixFromRawSampler 4 times, so in total texture2D has been called 4 * 4 = 16 times.
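Following this reasoning, the per-vertex read count can be written down as a tiny formula. The 4-reads-per-matrix figure is the one claimed above (it may differ between Babylon versions), and the cap of 8 influences comes from earlier in the thread:

```javascript
// Total texture2D() calls per skinned vertex, per the reasoning above:
// one readMatrixFromRawSampler call per bone influence, and (per the post)
// 4 texture2D() reads per call to assemble a 4x4 matrix.
const READS_PER_MATRIX = 4; // assumed figure from the discussion above

function texture2DCallsPerVertex(boneInfluences) {
  // At most 8 influences per vertex are supported (stated earlier in the thread)
  const influences = Math.min(boneInfluences, 8);
  return influences * READS_PER_MATRIX;
}

console.log(texture2DCallsPerVertex(4)); // → 16, matching the 4 * 4 count above
```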