What I gathered from feedback here over time is that the best way to improve render performance in my project is to use more instances and fewer draw calls, materials and textures.
I finally managed to figure out how to combine instances, sprite maps and alpha blending in such a way that I could control the render order (spoiler alert: it involves overwriting the renderSorted function).
I was super excited to try this and managed to get the number of draw calls in a control scene down from 202 to 82!! The number of materials went from 266 to 157 and textures from 162 to 132.
I am also super excited about the possible performance gains of WebGPU.
But the results are extremely disappointing so far, because the performance (see absolute fps) went down with the instances, and WebGPU takes it down further (my hopes are still up though, so please help me out).
(left: before, middle: after, right: after + webgpu)
this is based on exactly a 10 second run
What is bottlenecking me?
some initial thoughts:
- I need to work on getting evaluateActiveMeshes down, but this will only save a bit of time
- I have more inactive meshes now because of the instance-prototypes, is this impacting performance?
- GPU frame time is much higher… no idea why
- I use instance buffers for uv coordinates and color, but am only setting these on instance creation. Could this be what makes it slow?
- I know the fps is still pretty high, but on mobile it’s below 60 unfortunately
for final reference, the scene (I thought 82 draw calls was quite nice considering each card still consists of lots of independent parts):
I’m happy to supply more info, thank you so much for helping!
About WebGPU, you should look into the "WebGPU Optimizations" page of the Babylon.js documentation.
It will be a bit slower without any optims due to all the caching necessary to handle all the various weirdness possible in WebGL. @Evgeni_Popov did an amazing job of bringing in a lot of toys to make it faster.
Absolute FPS is only the CPU side of the equation, and yes, renderSorted will be slow as it will sort an array of meshes on every frame, which is why it is not done by default.
That said, on mobile are you bound on CPU or GPU? If it is CPU, optimizing for might have a counterintuitive result.
What I do not understand is why your GPU time is going up. Are you using more complex shaders or a lot of transparency? Overdraw will have a negative impact on some mobile GPUs.
I do think CPU is the bottleneck, but I’m not exactly sure how to confirm this.
I do use a lot of transparency (drop shadows, the textbox background, all the text, some icons, the black overlays, shading on top of the cards, etc…) but this has not changed (only before they were all separate materials/draw calls).
I think I use one custom shader on the green background (wasn’t there before).
I also use this trick I scouted somewhere here to share material amongst the big numbers for example:
Your browser dev tools should help to confirm and find the culprit.
Regarding the GPU frame time, unfortunately I noticed (when doing some WebGL / WebGPU comparisons) it can largely vary from one run to another for the same PG…
You should run your tests several times (for exactly the same scene - which is not the case in your screenshot as the number of faces/vertices/active meshes is not the same) and see if you still experience the same gaps.
Also, the durations are very low (2ms - 2.5ms for the frame time) and even a 0.1ms difference will make quite a difference on the absolute fps: 2.13ms corresponds to 470 absolute fps whereas 2.03ms is 492 fps. So, you should be sure to make some averages before drawing conclusions.
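To make the arithmetic above concrete, here is a minimal sketch of the frame time to fps conversion. The function name `frameTimeToFps` is just an illustrative helper, not something from Babylon.js.

```typescript
// Convert a frame time in milliseconds into frames per second.
// Hypothetical helper, only meant to illustrate the arithmetic.
function frameTimeToFps(frameTimeMs: number): number {
  if (!(frameTimeMs > 0)) {
    throw new RangeError("frame time must be a positive number of milliseconds");
  }
  return 1000 / frameTimeMs;
}

// A 0.1ms gap at this scale is worth roughly 23 fps.
const slower = frameTimeToFps(2.13);
const faster = frameTimeToFps(2.03);
console.log(slower.toFixed(1), faster.toFixed(1), (faster - slower).toFixed(1));
```

Because fps is the inverse of the frame time, the same 0.1ms would be worth far fewer fps at 16ms per frame, so comparing frame times is usually less misleading than comparing fps.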
Using the performance tab of Chrome may provide better measurements for the CPU/GPU threads, as you can let the snapshot run for 5/10s (or more). But be careful about what is currently running on your computer, as it could kill your recording (e.g., if your anti-virus decides to kick in just at the wrong time…). You should also perform several recordings and either average them or throw out the fastest/slowest ones.
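As a rough sketch of the "throw out the fastest/slowest" idea, assuming you have noted one average frame time per recording: `trimmedMean` and its `dropEachSide` parameter are illustrative names, not part of any tool mentioned here.

```typescript
// Average a set of measurements after discarding the extremes.
// `dropEachSide` is how many of the smallest and largest samples to ignore.
function trimmedMean(samples: number[], dropEachSide = 1): number {
  if (samples.length <= 2 * dropEachSide) {
    throw new RangeError("not enough samples left after trimming");
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...samples].sort((a, b) => a - b);
  const kept = sorted.slice(dropEachSide, sorted.length - dropEachSide);
  return kept.reduce((sum, value) => sum + value, 0) / kept.length;
}

// Five recordings (made-up frame times in ms); one was disturbed by a background task.
const frameTimesMs = [2.0, 2.1, 2.2, 9.0, 1.0];
console.log(trimmedMean(frameTimesMs).toFixed(2));
```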
Clearly, the scene with the lower number of draw calls should run at least at the same speed, if not faster. But the absolute fps is based on the CPU time, so you should really see about the same perf in both cases (if you test the same scene, meaning active meshes / indices / faces is the same) as most of the time is spent in meshes selection. You normally should not see greater value for GPU frame time, but as explained above this counter is not always reliable (not Babylon.js fault).
I noticed something weird using this little tool: jrprice/webgpu-bandwidth on GitHub, a simple memory bandwidth benchmark implemented with WebGPU.
Every few runs, the GPU takes twice as long for me. Maybe Chrome is collecting telemetry or something, so perhaps benching against a metric like p75 would be more meaningful… idk
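For what it's worth, a p75 could be computed along these lines. This is only a sketch using the nearest-rank definition of a percentile (other definitions interpolate), and `percentile` is a hypothetical helper name.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new RangeError("need at least one sample");
  }
  if (!(p > 0 && p <= 100)) {
    throw new RangeError("p must be in (0, 100]");
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Made-up GPU times in ms where every few runs take twice as long.
const gpuTimesMs = [2, 2, 2, 4, 2, 2, 2, 4];
console.log(percentile(gpuTimesMs, 75), percentile(gpuTimesMs, 100));
```

With these made-up numbers the occasional doubled runs do not move the p75, while they would pull a plain average up.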
The sorting already happened before, except there was way less batching, so there was even more to be sorted.
I ran the tests multiple times over long times (10-60s) and performance clearly didn’t improve (I noticed it got a little worse while playing the game even).
But even if the test is 10% wrong, the expected result of having 60% fewer draw calls should still be improved performance as far as I understand (I would expect something like 30% less render time). This means there is something I don’t understand. Something is slowing it down more than it is speeding it up, I’m sure of it. Like I said, there are more meshes now, because of the inactive meshes that the instances are based on. But I don’t think this fully explains the slowdown.
I still have no idea how to see the GPU action with devtools, can someone post a screenshot of this?
I also included the performance test from before. Both are exactly 10s runs, so that should be helpful I think.
A repro would be really nice, as it is impossible to guess
a reproduction is useful for isolation or if you want to test something, but this concerns the whole scene, so isolation is not the point I suppose.
I can give you the before and after in production.
Or maybe you can tell me what you want to test and I can do it?
update: I got the evaluateActiveMeshes down to 0.00ms on both before and after. As expected it just ups the FPS for both, so the problem has nothing to do with that.
Yup this would be nice
You need to look into the bottom-up view of your perf profile and try to narrow down the culprit. You can also add more logs and so on to measure the timing of some specific functions.
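One possible way to measure the timing of a specific function is to wrap it and accumulate the elapsed time. This is a sketch under assumptions: `timed` and `TimingStats` are hypothetical names, and the clock defaults to `performance.now()`.

```typescript
// Accumulated timing information for one wrapped function.
interface TimingStats {
  calls: number;
  totalMs: number;
  maxMs: number;
}

// Wrap `fn` so every call adds its duration to `stats`.
// The clock is injectable, which makes the wrapper easy to test.
function timed<A extends unknown[], R>(
  fn: (...args: A) => R,
  stats: TimingStats,
  now: () => number = () => performance.now(),
): (...args: A) => R {
  return (...args: A): R => {
    const start = now();
    try {
      return fn(...args);
    } finally {
      // Runs even if `fn` throws, so failed calls are counted too.
      const elapsed = now() - start;
      stats.calls += 1;
      stats.totalMs += elapsed;
      stats.maxMs = Math.max(stats.maxMs, elapsed);
    }
  };
}

// Usage: wrap a suspect function, run for a while, then log the stats.
const sortStats: TimingStats = { calls: 0, totalMs: 0, maxMs: 0 };
const sortNumbers = timed((values: number[]) => [...values].sort((a, b) => a - b), sortStats);
sortNumbers([3, 1, 2]);
console.log(sortStats.calls, sortStats.totalMs, sortStats.maxMs);
```

Logging the totals every few seconds instead of on every call keeps the logging itself from distorting the measurement.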
you’re right… the FPS fluctuates a lot and I think the GPU frame time might also be an anomaly, I don’t see it anymore.
sorry for the spam
I posted the before and after of the bottom-up views (both 10s runs).
You already helped by saying you have no guesses!
I probably have made another change that is the culprit here, I will investigate this first.
Are you using any kind of bounding box renderer and so on ?
I am seeing bufferSubData popping up in the new version, which might explain it, but now we should see why you have it higher.
no, I don’t think so.
I annotated the perf image with what went up or was new.
Not sure why renderForCamera is up there, it doesn’t seem to do much in its ‘self’.
I also traced all my changes and it doesn’t look like anything impactful apart from the new instancing with the added color and uv instance buffers.
It just seems like all the gains from the reduced draw calls are lost to the added overhead of the buffers?
Yup, something is definitely changing some data dynamically per frame. Could your uv trick update the buffer every frame?
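One way to check whether a buffer really gets updated every frame would be to count the calls to a method such as `bufferSubData` on the WebGL context. This is only a sketch: `countCalls` is a hypothetical helper, and how you get hold of the context your engine uses is left as an assumption.

```typescript
type AnyFn = (...args: any[]) => any;

// Replace `target[method]` with a wrapper that counts calls and then
// forwards to the original implementation.
function countCalls(target: object, method: string) {
  const obj = target as Record<string, AnyFn>;
  const original = obj[method];
  if (typeof original !== "function") {
    throw new TypeError(`${method} is not a function on the target`);
  }
  let calls = 0;
  obj[method] = function (this: unknown, ...args: unknown[]) {
    calls += 1;
    return original.apply(this, args);
  };
  return {
    // Return the count since the last read and start over.
    read: (): number => {
      const n = calls;
      calls = 0;
      return n;
    },
    // Put the original method back.
    restore: (): void => {
      obj[method] = original;
    },
  };
}

// Demo with a stand-in object; with a real context this would be
// something like countCalls(gl, "bufferSubData").
const fakeGl = { bufferSubData: (_target: number, _offset: number) => undefined };
const counter = countCalls(fakeGl, "bufferSubData");
fakeGl.bufferSubData(0, 0);
fakeGl.bufferSubData(0, 16);
console.log(counter.read());
```

Calling `read()` once per frame (for example at the end of the render loop) would show how many uploads each frame triggers; a static scene that still reports a steady non-zero number is a hint that something is re-uploading data.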
I think I copied it from here:
I think I also saw it suggested in the forums by Evgeni_Popov but it could have been another main contributor.
Yup, it looks like it might involve quite some copies and uploads to the GPU.
if I use the uv trick I have 82 draw calls and get 540 FPS (like above)
if I disable the uv buffers I get 82 draw calls and 640 FPS (this looks bad of course, so just for testing)
if I don’t use the trick → less shared instances → 91 draw calls and 540 FPS
So yeah, it costs about the same as it gets me.
Does anyone have an idea on how to get the best of both worlds? (note that I don’t need to set the buffers after setting them once)
edit: upon further isolated testing the uv trick seems to be pretty performant, so I’m still not quite sure what is going on.
this is the isolated test: only 2 draw calls/materials and disabling the uv buffers is not much of a gain.
Any ideas on what or how to test more effectively is appreciated.