Instanced Meshes and Performance

A little background on our application and what we’re trying to enhance. We have a scene made up of thousands and thousands of meshes. These objects are compressed into fewer meshes via the well-known MergeMeshes call. On top of this, we’ve taken the lovely Solid Particle System and used it to manage the operations (such as changing materials) we need on the particles. We’re also careful not to load all the scene content at once, prioritizing content closer to the camera. This implementation works very well for us.
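
For context, a minimal sketch of the merging step (assuming the standard MergeMeshes API; `sourceMeshes` is whatever batch your pipeline has collected):

```javascript
// Collapse a batch of source meshes into a single mesh to cut draw calls.
// Second argument (true) disposes the original source meshes;
// third argument allows 32-bit indices for merges beyond 65K vertices.
const merged = BABYLON.Mesh.MergeMeshes(sourceMeshes, true, true);
```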

We’ve been interested in improving the performance of our implementation, and in doing so we’ve pinpointed three areas where gains could be made:

  1. The time taken to produce the meshes. Note that we do the mesh creation on the fly in web workers and then push the merged mesh content back to the main thread.
  2. FPS performance.
  3. Memory consumption of the scene content.

Since our scene is typically made up of quite a few duplicate shapes - for example, let’s just consider that we have many simple cylinders - we’ve been interested in investigating the use of instanced meshes. We’re aware of the restrictions of instanced meshes, first and foremost, and I suggest anyone learning about them start here: Use Instances - Babylon.js Documentation

Our hope is that a scene which typically has 10,000 or even 50,000 cylinders can benefit greatly from instances. Note that our solid particle system implementation already uses several tricks to reduce the actual burden of the scene content; for example, it caps the number of vertices allowed to be present in the scene.

The hope is that instances will be able to improve all three areas of performance noted above AND will allow us to load even more of the actual scene. Below, I’d like to detail a few of my observations and log a few questions.

  1. Time taken to produce the meshes
    Certainly, creation of instances is super fast AND we obviously don’t merge any geometry. Check off number one: instance creation is much faster than even cloning meshes.
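
For illustration, a minimal sketch of the pattern (the counts and grid layout are ours, not from any specific playground):

```javascript
// One master mesh holds the geometry; each instance only adds a transform.
const master = BABYLON.MeshBuilder.CreateCylinder("master", { height: 2, diameter: 1 }, scene);
const instances = [];
for (let i = 0; i < 50000; i++) {
    const inst = master.createInstance("cyl" + i);
    inst.position.set((i % 250) * 3, 0, Math.floor(i / 250) * 3);
    instances.push(inst);
}
```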

  2. FPS performance
    I’ve noticed that FPS performance with many instances (50K) is actually very poor. I believe this is happening for two reasons:

The first is the frustum check. Under the hood, the engine checks to see which meshes (instances included) are in the view frustum. This is not all that time consuming for a few hundred or even a thousand meshes, but the check gets bogged down for thousands and thousands of meshes… even instances. Sure, we can disable the check or relax its precision. Since our application does its own frustum logic in a separate web worker to determine what content gets loaded into the scene, this is not too big of a deal for us. Let’s proceed with the assumption that frustum inclusion/exclusion can be disabled in our solution to get past this bottleneck.
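
For completeness, a minimal sketch of how we’d opt out of the engine’s selection pass, using the standard alwaysSelectAsActiveMesh flag (freezeActiveMeshes() is an alternative when the active list is static):

```javascript
// Mark each instance as always active so the engine skips its per-frame
// frustum test; our worker-side culling decides what exists in the scene.
instances.forEach((inst) => {
    inst.alwaysSelectAsActiveMesh = true;
});

// Alternatively, if the set of active meshes never changes,
// freeze the whole selection pass once:
// scene.freezeActiveMeshes();
```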

The second FPS hit is the computation of the world matrix for each instance. Anything beyond a few tens of thousands of instanced meshes (hardware dependent, of course) will start to seriously bog down. Throw in the fact that this computation occurs every frame, and it becomes apparent that many instances are not practical. The reader will note that this calculation can be disabled for every frame; see freezeWorldMatrix().

The idea that followed was that we could load in the 50K instances very fast, freeze their world matrices, and do some magic when the camera is moving. For example, while the camera moves we could keep a small number of cylinders “active”, and only on camera idle start to calculate the world matrices for dirty cylinders. This might be feasible, but I’m noticing that even loading 50K instances into the scene and computing their world matrices once will choke.
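
Here is a rough sketch of that idea, assuming the standard freezeWorldMatrix()/unfreezeWorldMatrix() calls and the camera’s onViewMatrixChangedObservable; the dirtyInstances bookkeeping and the idle threshold are hypothetical:

```javascript
// Freeze every instance once it is placed, so the engine stops
// recomputing its world matrix on every frame.
instances.forEach((inst) => inst.freezeWorldMatrix());

const dirtyInstances = []; // hypothetical: instances whose transforms changed
let idleTimer = null;

camera.onViewMatrixChangedObservable.add(() => {
    // The camera is moving: postpone any world-matrix work.
    if (idleTimer) clearTimeout(idleTimer);
    idleTimer = setTimeout(() => {
        // The camera has been idle long enough: flush the dirty instances.
        dirtyInstances.forEach((inst) => {
            inst.unfreezeWorldMatrix();
            inst.computeWorldMatrix(true); // force a one-shot recompute
            inst.freezeWorldMatrix();
        });
        dirtyInstances.length = 0;
    }, 250); // hypothetical idle threshold in ms
});
```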

  3. Memory consumption

This is a very appealing case for instances. Since we would no longer be cloning geometry thousands of times, instances appear far superior to solid particle systems in this regard. Sure, we have some extra baggage for every instance with respect to matrices, a burden that merged meshes do not have… but it seems that this pales in comparison to the duplicated geometry that comes with cloning meshes.

So, now here’s the question. Is there a clever way to overcome the FPS bottlenecks that come up when loading many instances into the scene? It seems instances are a bit of a catch-22… you want them for many identical (or scalably identical) meshes, but you can’t have too much of a good thing, or you’ll bog down the engine anyway. Please, if I’ve misunderstood any of my findings, by all means let me know. The above reflects my observations from performance profiling tools and test code.

AFAIK

  • Instanced meshes have their world matrices, bounding boxes, bounding spheres and frustum tests computed each frame.
    The final performance, CPU side, is related mostly to the global number of instances.
  • Solid particles have their lighter, dedicated world matrices (and, optionally, their BBoxes and BSpheres) computed only on the call to setParticles(). The frustum test isn’t enabled by default.
    The final performance, CPU side, is related mostly to the global number of vertices.

So, numerous high-poly instanced meshes should be better than the SPS. On the other hand, numerous low-poly (<50 vertices) solid particles should be better than instanced meshes. Cylinders being high-poly meshes, instances seem to be the right choice.

That said, dealing with a high number of high-poly objects quickly becomes painful for FPS performance.
Knowing that a screen has a limited area and that the user can’t see all the meshes at once, I would suggest two tricks:

  • Maybe reduce the global number of managed meshes. Who can tell the difference by sight between 50K and, say, 30K meshes, especially if they aren’t all visible at once? It’s a big difference for the CPU.
  • Recycle meshes: a logical data map could describe where and how all the meshes are at the current time, and only the visible ones (so just a part of them are ever built) are rendered using a pool of pre-built meshes; a sketch follows this list.
    That’s the approach used in this example: 3,000 solid particles, recycled according to the camera position, to render more than 70K of them: Test Babylon SP Terrain
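
A rough sketch of that pooling idea (the logical map and the visibleEntries() query are hypothetical placeholders for your own spatial structure):

```javascript
// Pre-build a fixed pool of meshes once; the logical data map describes
// all 70K objects, but only poolSize of them are ever real meshes.
const poolSize = 3000;
const pool = [];
for (let i = 0; i < poolSize; i++) {
    pool.push(BABYLON.MeshBuilder.CreateCylinder("pool" + i, {}, scene));
}

scene.onBeforeRenderObservable.add(() => {
    // visibleEntries(camera) is a placeholder for your own spatial query
    // against the logical map (octants, pages, etc.).
    const entries = visibleEntries(camera).slice(0, poolSize);
    pool.forEach((mesh, i) => {
        const entry = entries[i];
        mesh.setEnabled(!!entry); // park unused pool meshes
        if (entry) {
            mesh.position.copyFrom(entry.position);
            mesh.scaling.copyFrom(entry.scaling);
        }
    });
});
```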

Your first suggestion is in line with what I was thinking. Since we have our scene content organized into pages of data, spatially assigned to octants, we already use this to limit the pressure on the scene… I can extend this to limit the total number of instances. I could even load the rest of the instances once the camera is no longer moving.

I think I’ve overlooked an important piece of the puzzle here! I am also applying a scale to each instance. You’ll notice that 50,000 instances (hardware dependent, of course) is actually not so bad here:

50,000 Instances
https://playground.babylonjs.com/#SEG8Z6

But now apply a scale and things start to slow down at instance creation. In fact, if you increase the number of instances to 50,000, the application bogs down entirely.

10,000 Scaled Instances
https://playground.babylonjs.com/#3S4XTC

I suppose my question should have been geared towards scaling instances. Are there clever tricks to speed up the creation/matrix computation of SCALED instanced meshes? How does this scaling work under the hood?
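
For reference, my (hedged) understanding of what happens under the hood: each instance carries its own world matrix, composed from its scale, rotation and translation, roughly like this sketch (not the actual engine code; `instance` is any InstancedMesh):

```javascript
// Roughly what a per-instance world-matrix update amounts to:
// compose scale * rotation * translation into a single matrix.
const rotation = instance.rotationQuaternion
    ? instance.rotationQuaternion
    : BABYLON.Quaternion.FromEulerVector(instance.rotation);
const world = BABYLON.Matrix.Compose(instance.scaling, rotation, instance.position);
```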


This thread has a similar question.

Just my two cents :)

You might need to wait on WebGPU to achieve the kind of mesh volumes you are talking about.

I think the state of WebGL right now is fine for fewer than 2,000 meshes (including InstancedMeshes)

But tens of thousands of meshes (even instances) are way too heavy right now… WebGPU is the future, though, and will be closer to the metal

Check it out: Web GPU - Babylon.js Documentation


I’ve noticed that if the property ignoreNonUniformScaling is set to true, 50,000 scaled instances are created no problem. I’ll need to dig further to see what this property is used for, but as far as scaling the actual instance goes, it seems that even with this property set, the instance is still resized correctly.

50,000 Scaled Instances
https://playground.babylonjs.com/#SE22AV
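
For reference, a minimal sketch of the flag in use (the grid layout and scale values are illustrative, not from the playground):

```javascript
const master = BABYLON.MeshBuilder.CreateCylinder("master", { height: 2, diameter: 1 }, scene);
for (let i = 0; i < 50000; i++) {
    const inst = master.createInstance("cyl" + i);
    // Opt this instance out of the engine's non-uniform scaling handling.
    inst.ignoreNonUniformScaling = true;
    inst.scaling.set(1, 1 + Math.random() * 2, 1); // the scale still applies visually
    inst.position.set((i % 250) * 3, 0, Math.floor(i / 250) * 3);
}
```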

Twice as fast: https://playground.babylonjs.com/#SE22AV#1


and if nothing moves, an immutable SPS can still do the job with decent performance: https://playground.babylonjs.com/#SE22AV#7

and no, I didn’t cheat, there are really 50K cylinders at 60 FPS: https://playground.babylonjs.com/#SE22AV#9

[EDIT] 100K cylinders: https://playground.babylonjs.com/#SE22AV#10
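
For anyone following along, a minimal sketch of an immutable SPS (using the documented updatable: false option; the counts and placement are illustrative):

```javascript
// updatable: false bakes all particles into one static mesh at build time,
// so there is no per-frame particle update cost at all.
const sps = new BABYLON.SolidParticleSystem("sps", scene, { updatable: false });
const model = BABYLON.MeshBuilder.CreateCylinder("model", { height: 2, diameter: 1 }, scene);

// positionFunction places each particle once, at build time.
sps.addShape(model, 50000, {
    positionFunction: (particle, i) => {
        particle.position.set((i % 250) * 3, 0, Math.floor(i / 250) * 3);
    }
});
sps.buildMesh();
model.dispose(); // the model geometry now lives inside the SPS mesh
```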


Harder: animation of 50K with the SPS and decent FPS performance
https://playground.babylonjs.com/#SE22AV#14

Each frame, only the offset cylinders are rotated (the rotation is computed but not rendered).
Every frameRate frames, the whole SPS is updated for rendering.

The rest of the scene (camera, etc.) is still updated each frame.
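
A sketch of that throttling pattern (frameRate and updateRotations() stand in for the playground’s specifics):

```javascript
const frameRate = 10; // hypothetical: push updates to the GPU every 10 frames
let frame = 0;

scene.onBeforeRenderObservable.add(() => {
    frame++;
    // Cheap CPU-side bookkeeping runs every frame (computed, not rendered)...
    updateRotations(); // placeholder for the per-frame rotation pass

    // ...but the SPS mesh is rebuilt from the particles only periodically.
    if (frame % frameRate === 0) {
        sps.setParticles();
    }
});
```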


Jerome,

One of our requirements is the ability to change the materials of particles in the SPS on the fly. Can your single-shape implementation of the SPS be extended to colorize particles differently within the same SPS (red, blue, yellow, for example)?

I presume not, since we’d need to copy color buffers for each of the 50K cylinders, and we’d probably lose performance?

We would probably need one SPS per material (red, blue, yellow). When changing a cylinder from red to blue, we’d need to rebuild both the red and blue SPS with modified parameters?

My plan with instances was to create instances on the fly from differently colored master meshes.

https://doc.babylonjs.com/how_to/solid_particle_system#colors-and-uvs

Full doc: Use the Solid Particle System - Babylon.js Documentation
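
For what it’s worth, the docs above show per-particle color without needing one SPS per material; a hedged sketch (the palette and index are illustrative, and the SPS must be updatable):

```javascript
// Assign an initial color per particle once...
const palette = [
    new BABYLON.Color4(1, 0, 0, 1), // red
    new BABYLON.Color4(0, 0, 1, 1), // blue
    new BABYLON.Color4(1, 1, 0, 1), // yellow
];
sps.initParticles = () => {
    for (let i = 0; i < sps.nbParticles; i++) {
        sps.particles[i].color = palette[i % palette.length];
    }
};
sps.initParticles();
sps.setParticles();

// ...then recolor a single cylinder later by mutating its particle:
sps.particles[42].color = palette[1]; // red -> blue
sps.setParticles(); // pushes the updated color buffer to the GPU
```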


Some final observations.

The SPS solution, in the end, duplicates all the geometry from the master cylinder. As far as I can tell, the mesh builder inside the solid particle system builds internal arrays for the geometry of each particle, along with the color. Because of this, the memory footprint is fairly hefty; for the 100K cylinder example, we see about 600 MB in use. Granted, the load time and FPS performance are very good, and the particles allow changing color (which, in the end, modifies the innards of the color buffer). This solution is very much like merging all the cylinders into a single mesh and indexing the geometry and color data structures for quick modification.

The instance solution proposed by myself and DK is also surprisingly hefty on memory usage, despite my earlier comments. Using profiler tools, even instances (with or without scale) still consume the same amount of memory as the proposed SPS solution! I wonder why this is the case. It seems that the instances do not duplicate the geometry, but nevertheless the backbone of each instance is not cheap. The profiler suggests each one is about 8 KB :(… far more than a few doubles for a unique matrix.

This is because instances are real meshes (they do not contain geometry; the geometry lives only on the GPU) with all the properties associated with a real mesh.

8 KB still seems like a lot. How did you end up with that number?

I took a peek at the heap snapshot for the 50,000 cylinders using instances and noted that the heap went to around 500 MB. Poking around the snapshot, I noticed tens of thousands of objects with a retained size of 8 KB.

Also, I noted that for 10,000 instanced cylinders the heap is 100 MB, so about 10 KB per instance in that case. I’m a journeyman with heap profiling, but the number of instances seems fairly linear with respect to memory usage: every 10,000 adds about 100 MB.

Just another guess from the profile: it seems each instance retains its own copies of a lot of script (one per function). That, above all, might be the memory cost of an instance.

Which PG are you using? It seems that 10 KB per instance is far too much.

Adding @sebavan to get an additional brain

It is as if we were duplicating the geometry data, which sounds strange

Digging into it. I feel like something is definitely wrong. Will keep you guys posted!

OK, so I’m seeing instances at around 400 bytes, which seems more correct to me:


1,920,096 bytes for 5,001 instances => ~400 bytes per instance

So all good for me

I agree and I should have been much more careful with my words.

It seems that the instances themselves are nice and small, but something is swelling the memory footprint. I might be misunderstanding something here, not sure.

10,000 instances (Chrome profiler, but Edge gives similar results)

(The “e” line in the snapshot is from the playground code editor and should be ignored.)