Mission to improve render performance

Hi all,

I’m almost ready to go beta with my game tuggowar.io and finally have time to work on the rendering performance (the goal is to get at least a 4x boost).

For this post I’m looking for ideas on where to look next.

How I’m measuring performance
I boot into a performance test which prints out some stats:
#/s | ms
updates 30 0.81 (capped at 30; this is where meshes get created, removed and positioned)
renders 52 2.93 (just calls to Babylon's render(), so that is 52 renders/s (fps) and 2.93 ms per render call)
meshes 111
render time / mesh: 0.026334
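
For anyone who wants to compare their own numbers, the last line is just the render time divided by the mesh count. A minimal sketch of that arithmetic follows; the names (PerfSample, renderTimePerMesh, renderLoad) are made up for this post and this is not the actual test harness.

```typescript
// Hypothetical shape for one run of the performance test.
interface PerfSample {
  rendersPerSecond: number; // e.g. 52
  msPerRender: number;      // e.g. 2.93
  meshCount: number;        // meshes plus instances
}

// Average render time attributed to a single mesh, in milliseconds.
function renderTimePerMesh(sample: PerfSample): number {
  return sample.meshCount > 0 ? sample.msPerRender / sample.meshCount : 0;
}

// Fraction of each second spent inside render(), between 0 and 1.
function renderLoad(sample: PerfSample): number {
  return (sample.rendersPerSecond * sample.msPerRender) / 1000;
}

// Rounded values from the stats above; the test itself uses unrounded timings,
// so the result only matches the printed figure approximately.
const baseline: PerfSample = { rendersPerSecond: 52, msPerRender: 2.93, meshCount: 111 };
console.log(renderTimePerMesh(baseline).toFixed(4));
console.log(renderLoad(baseline).toFixed(3));
```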

Of course, in the full game there can be 4x as many meshes in the scene, and on my Samsung S7 I currently get 35 renders/s on average, which is the biggest issue.

Info & what I’ve done so far

  • The camera is static and in orthographic mode; meshes move around on 3 axes but are flat, and all polygons always face the camera
  • There’s only ambient lighting
  • There is a lot of text, so a lot of meshes with opacity textures
  • Almost everything you see is its own mesh (a card, for example, is already around 10 meshes)
  • Most rounded things are actual round meshes, because opacity textures on square meshes perform worse
  • I’m caching a lot of generated Meshes and Textures
  • mesh settings:
    mesh.convertToUnIndexedMesh();
    mesh.freezeNormals();
    mesh.cullingStrategy = AbstractMesh.CULLINGSTRATEGY_BOUNDINGSPHERE_ONLY;
    mesh.doNotSyncBoundingInfo = true;
    mesh.ignoreNonUniformScaling = true;

Instances
My first thought was, for cached meshes, to use .createInstance() instead of .clone(), expecting the rendering to become faster.
Unfortunately this has the opposite effect of what I expected:
updates 30 0.8
renders 52 3.15
meshes 75
instances 36
render time / mesh (incl instances): 0.028363

Does anyone understand why this could be?

Throttling
If I set 4x CPU throttling in Chrome the stats become
updates 24 2.59
renders 21 15.75
meshes 71
instances 34
render time / mesh: 0.14944

In my mind, 4x CPU throttling should not have that much effect on render time, which should be handled by the GPU, and it definitely should not cause as much as a 5x slowdown. Surely I’m doing something very inefficiently (GPU is enabled for sure).
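
To make the comparison explicit, here is the arithmetic behind the 5x figure as a small sketch, using the numbers from the instances run and the throttled run above. The function name is just for illustration.

```typescript
// Ratio between a throttled timing and its unthrottled baseline.
function slowdown(baselineMs: number, throttledMs: number): number {
  return throttledMs / baselineMs;
}

// ms per render call: 3.15 unthrottled, 15.75 with 4x CPU throttling.
const renderSlowdown = slowdown(3.15, 15.75);

// ms per update (pure JS on my side): 0.8 unthrottled, 2.59 throttled.
const updateSlowdown = slowdown(0.8, 2.59);

console.log(renderSlowdown.toFixed(2), updateSlowdown.toFixed(2));
```

If render() slows down at least as much as my own update code does, that at least hints that the time inside render() is mostly CPU work rather than time spent waiting on the GPU.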

To reiterate, I am looking for any tips to improve render performance. I know WebGL is slow, but it seems to me this is a pretty basic scene that should be able to render much faster, especially considering the static camera and the 2D nature of it, but I have no idea how to exploit that.

I can show code later of course. It’s pretty basic, there’s only a couple lines of code that interact with Babylon.

Thanks for your help,
Mise

I’m now looking into _evaluateActiveMeshes.
Every mesh in my scene that isVisible should always be active and vice versa, but when I freezeEvaluateActiveMeshes a lot of the meshes are not rendered anymore.
Can I simply call setActive somewhere to control it myself?

This has tons of tips you might want to look at: "Optimize your scene" in the Babylon.js documentation.

In your case you could rely on the alwaysSelectAsActiveMesh to bypass the evaluate for meshes which are always on screen and probably freeze materials and/or static meshes.

Instances should definitely help as it reduces the draw calls so it is utterly strange it does not.

Regarding instances not being faster, there is a small penalty on the CPU side to using instances because some buffers have to be refilled each frame. With only a few instances, the penalty on the CPU side may be greater than the boost on the GPU side, all the more if the GPU is not the bottleneck.

If it is possible for you, you should try thin instances: there’s no penalty for using them, even if there are only a few. But if the GPU is not your bottleneck, I think you should focus more on the JS side (freeze all you can as @sebavan suggested, merge meshes if possible, etc).
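
To give an idea of what thin instances need from you: instead of one mesh object per copy, you provide a flat buffer with one 4x4 matrix (16 floats) per copy. Here is a minimal sketch that builds such a buffer for pure translations; the helper name and the Position type are made up for the example, and it assumes the translation sits at offsets 12 to 14 of each matrix.

```typescript
// Position of one copy in world space (hypothetical helper type).
interface Position {
  x: number;
  y: number;
  z: number;
}

// Packs one 4x4 translation matrix per position into a single Float32Array.
// Assumed layout: identity on the diagonal, translation at offsets 12, 13, 14.
function buildTranslationMatrices(positions: Position[]): Float32Array {
  const data = new Float32Array(positions.length * 16);
  positions.forEach((p, i) => {
    const o = i * 16;
    data[o] = 1;
    data[o + 5] = 1;
    data[o + 10] = 1;
    data[o + 15] = 1;
    data[o + 12] = p.x;
    data[o + 13] = p.y;
    data[o + 14] = p.z;
  });
  return data;
}

const matrices = buildTranslationMatrices([
  { x: 0, y: 0, z: 0 },
  { x: 2, y: 1, z: 0 },
]);

// With Babylon loaded, the buffer would then be handed to the source mesh:
// mesh.thinInstanceSetBuffer("matrix", matrices, 16);
```

The trade-off is that the copies are no longer individual objects, so anything that differs per copy has to go through such buffers.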

Yes, I’ve gone through the tips a couple of times already.
I agree that the problem seems to be the CPU.

I don’t think alwaysSelectAsActiveMesh is relevant for me, as I don’t have frustum clipping enabled, and it has no effect either way. When I use freezeEvaluateActiveMeshes, new meshes I add to the scene are not rendered until suddenly they all are (I haven’t figured out yet what causes this change). In principle what I want is very simple: isVisible means render it.

You could potentially also freezeEvaluateActiveMeshes and unfreeze on demand, only when you add/remove meshes, getting the best of both?
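
Something along these lines, as a rough sketch. ActiveMeshFreezer and its methods are hypothetical names; the only scene calls it relies on are freezeActiveMeshes() and unfreezeActiveMeshes(), and it assumes that freezing re-evaluates the active list once before locking it, which you should double check against your Babylon version.

```typescript
// Minimal structural type: only the two scene methods this sketch uses.
interface FreezableScene {
  freezeActiveMeshes(): unknown;
  unfreezeActiveMeshes(): unknown;
}

// Hypothetical helper: keeps the active mesh list frozen and refreshes it
// only when the set of visible meshes changed since the last refresh.
class ActiveMeshFreezer {
  private scene: FreezableScene;
  private dirty = true; // start dirty so the first frame freezes the list

  constructor(scene: FreezableScene) {
    this.scene = scene;
  }

  // Call after adding or removing a mesh, or after toggling isVisible.
  markDirty(): void {
    this.dirty = true;
  }

  // Call once per frame before rendering; returns true if it refreshed.
  refreshIfNeeded(): boolean {
    if (!this.dirty) {
      return false;
    }
    this.scene.unfreezeActiveMeshes();
    this.scene.freezeActiveMeshes(); // assumed to re-evaluate, then freeze
    this.dirty = false;
    return true;
  }
}
```

You would call markDirty() wherever you create or dispose meshes and refreshIfNeeded() right before scene.render(), so static frames skip the evaluation entirely.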

I don’t see why it should ever need to evaluate it, because I can just tell the engine which meshes are active.

Ok, let’s make this a little simpler :)
Here’s profiling of a static scene (so no meshes added/removed, no animation) with 6x CPU throttling.
I have frozen the evaluation.
On paper the CPU shouldn’t have to do much here, yet we’re dropping 65% of frames.

I can try thin instances, I’ll have to look into that more.
I can also try merging meshes if you think it will give me a lot of benefit.
I am also going to try sharing more Materials.

By the way: if someone knows how to get rid of the Update Layer Tree step, that would be great, because it seems like a waste of time to me in a DOM with only a canvas.

Evaluate actually does more than build a list: it also takes care of dispatching each mesh to the proper rendering step (opaque, alpha test, or alpha blend, plus sorting), which explains the heavy CPU usage on large scene graphs.
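
Very roughly, and this is a simplified model rather than the actual engine code, the dispatch part looks like the sketch below. The type and function names are invented for the illustration.

```typescript
type Bucket = "opaque" | "alphaTest" | "alphaBlend";

// Simplified stand-in for the per-mesh information the dispatch looks at.
interface SubMeshInfo {
  name: string;
  needsAlphaBlending: boolean;
  needsAlphaTesting: boolean;
}

// Puts every item into exactly one bucket; blending wins over alpha testing.
function dispatchToBuckets(items: SubMeshInfo[]): Record<Bucket, string[]> {
  const buckets: Record<Bucket, string[]> = { opaque: [], alphaTest: [], alphaBlend: [] };
  for (const item of items) {
    if (item.needsAlphaBlending) {
      buckets.alphaBlend.push(item.name);
    } else if (item.needsAlphaTesting) {
      buckets.alphaTest.push(item.name);
    } else {
      buckets.opaque.push(item.name);
    }
  }
  return buckets;
}
```

The alpha blend bucket is the expensive one, because its content then has to be sorted before drawing, and this whole pass runs for every active mesh on every frame unless it is frozen.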

And yup, in your case material sharing will help.

I’m not setting any parent meshes, so every mesh is on its own. Could that be something that matters a lot for performance?

This should not impact much

Actually there is more work needed to compute the world matrix of a mesh with a parent. Heavy nesting of parenting is something I have had to undo in the past.

In reference to my last CPU profile picture:
this is taken on a static scene, so afaik I’m not running any code other than render(). Can someone explain why a function like setTexture is called every frame? Looking at the Babylon source, this seems to be making a call to the GPU, and I don’t understand why.

setTexture is used to set the texture associated with a shader program. It only sets the current texture index to the actual texture pointer, which has a really tiny cost but is necessary as soon as some of the active textures change between draw calls.

This call does not transfer the texture itself, so the bandwidth used is small.

Thank you for sticking with me :)
It’s good to know that it is not a costly action, but forgive me if I inquire some more.

As I’m not changing anything between draw calls, I hope to learn what the CPU is doing exactly and why in order to know how I can do less work, or more work on the GPU. (If needed I can work more directly against WebGL)

I made sure to call freezeMaterials(), freezeActiveMeshes() and freezeWorldMatrix().
Then we’re left with the following functions being called each frame:

  • gl.uniformMatrix4fv
  • gl.bindBuffer
  • gl.bindTexture
  • gl.bufferSubData

Note that drawElements is not taking a lot of time, and the GPU is not busy at all, yet the FPS is still only 47.

So the main question is: why are these calls needed?

I have learned a lot already in the past days, but faced with tough decisions I’m still missing some insight.

They are required because WebGL is a state machine. You set the states, do the rendering, set the new states, do the rendering, etc.

So all the gl.xxx functions are about setting the state for a given mesh.

You can reduce these calls by reducing the number of draw calls.
To do so you can:

  • use instances or thin instances
  • merge meshes
  • share materials
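
As a toy illustration of why sharing helps (this only models the bookkeeping, it is not what the engine does internally, and the names are invented): if a bind is only issued when the texture differs from the one currently bound, the number of binds depends on how many distinct textures you have and on the order of the draws.

```typescript
// One draw call in the toy model: which mesh, and which texture it samples.
interface Draw {
  mesh: string;
  texture: string;
}

// Counts texture binds, assuming a bind is skipped when the texture
// is already the currently bound one.
function countTextureBinds(draws: Draw[]): number {
  let bound: string | null = null;
  let binds = 0;
  for (const draw of draws) {
    if (draw.texture !== bound) {
      bound = draw.texture;
      binds++;
    }
  }
  return binds;
}

// Two cards, each with its own background and label draw.
const interleaved: Draw[] = [
  { mesh: "cardA.background", texture: "card" },
  { mesh: "cardA.label", texture: "font" },
  { mesh: "cardB.background", texture: "card" },
  { mesh: "cardB.label", texture: "font" },
];

// Same draws, grouped by texture.
const grouped: Draw[] = [...interleaved].sort((a, b) => a.texture.localeCompare(b.texture));

// Same draws again, but with everything packed into one shared atlas.
const atlased: Draw[] = interleaved.map((d) => ({ mesh: d.mesh, texture: "atlas" }));

console.log(countTextureBinds(interleaved), countTextureBinds(grouped), countTextureBinds(atlased));
```

Keep in mind that alpha blended meshes cannot be freely regrouped because their draw order matters, which is one more reason a shared material or atlas tends to pay off more than reordering.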

Also, RenderGroup.renderSorted takes quite a bit of time: do you use a lot of transparent objects?

You should expand this item in the profiling report above to see what’s really going on.

Yes, I am using a lot of alpha blending (text/icons), and there are also just a lot of different texts/textures, which makes using instances harder as well. I’m looking into using less alpha blending and merging meshes.

I’m also seeing some opportunities in the function you mentioned:

[image: profiler screenshot]

  • I don’t know why the getter on alphaIndex takes so long
  • the distance to camera for me is always exactly analogous to alphaIndex, so this can be skipped in my case

It seems like both these could in principle be 0 ms if the getter on alpha index is fast.
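
To show what I mean, here is a sketch of the two orderings as plain comparators. The first is my understanding of the generic order (alphaIndex first, then back to front by distance), not a copy of the engine code; the second is the shortcut that should be enough when alphaIndex already encodes depth. All names are hypothetical.

```typescript
// The two values the transparent sort looks at, in simplified form.
interface Sortable {
  name: string;
  alphaIndex: number;
  distanceToCamera: number;
}

// Generic order: lower alphaIndex first, ties broken back to front.
function compareAlphaThenDistance(a: Sortable, b: Sortable): number {
  return a.alphaIndex - b.alphaIndex || b.distanceToCamera - a.distanceToCamera;
}

// Special case: alphaIndex alone decides, no distance needed.
function compareAlphaOnly(a: Sortable, b: Sortable): number {
  return a.alphaIndex - b.alphaIndex;
}

// In my scene a higher alphaIndex always means closer to the camera.
const layers: Sortable[] = [
  { name: "label", alphaIndex: 2, distanceToCamera: 8 },
  { name: "board", alphaIndex: 0, distanceToCamera: 10 },
  { name: "card", alphaIndex: 1, distanceToCamera: 9 },
];

const genericOrder = [...layers].sort(compareAlphaThenDistance).map((l) => l.name);
const shortcutOrder = [...layers].sort(compareAlphaOnly).map((l) => l.name);
console.log(genericOrder, shortcutOrder);
```

As far as I can tell scene.setRenderingOrder accepts a custom compare function for transparent meshes, so something like the second comparator might be pluggable without a fork, but I have not tried that yet.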

I need to think about whether I should optimize my scene for Babylon or go with a fork, or whether Babylon is just not optimized for this special case and I should completely rethink my rendering strategy. It could also be that the team wants to support more feature flags that would benefit this kind of orthographic, static-camera scene. Let me know!

In either case I super appreciate your help because I already know so much more.