Memory Leak in Screenshots

Hi, this is a weird one, but here is the scenario we’re encountering. In BabylonJS 5.x (after the alpha releases) we are seeing a GPU memory leak when running CreateScreenshotUsingRenderTargetAsync repeatedly. For some reason, the issue appears most prominently when running on Linux. We do not see this issue when running any alpha version. I think the problem is with this commit here, which forces a render via CreateScreenshotUsingRenderTarget when both of the following are true:

  1. The scene’s active camera is not the camera used for the render.
  2. The original call is to CreateScreenshot

This is our scenario, except that we’re calling CreateScreenshotUsingRenderTargetAsync with a free camera that is not the scene’s active camera.
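Roughly, the call looks like this (the camera setup and size are placeholders for our actual values, and this runs inside an async function):

```js
// Placeholder sketch of our usage: a FreeCamera that is NOT scene.activeCamera
// is passed straight to the screenshot helper.
const screenshotCamera = new BABYLON.FreeCamera(
    "screenshotCam",                      // hypothetical name
    new BABYLON.Vector3(0, 5, -10),
    scene
);

const dataUrl = await BABYLON.Tools.CreateScreenshotUsingRenderTargetAsync(
    engine,
    screenshotCamera,                     // not the scene's active camera
    { width: 1920, height: 1080 }         // placeholder size
);
```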

Just looking at this code, it seems possible that calling CreateShotUsingRenderTargetAsync with a new camera, without swapping the scene’s active camera, causes the renderer to run twice, and that some object then never gets disposed on the GPU?

I’m happy to help set up a Playground next week but I wanted to see if any of this sounds plausible.

Actually, I think I lied there. It looks like that change only affects normal screenshots. We’re using this method here, which does not appear to force the render-target-texture path. We’re still trying to figure out why there’s a leak when we switch from 5.0.0-alpha.60 to 5.41.0.

Yes, a PG would definitely help.

Do you know which object is leaking?

No, but since it’s a GPU leak we tried a few things to isolate the issue. First, we just looked at the Scene graph in the Inspector and confirmed that we aren’t growing textures or materials; those counts are stable and nothing is being added there. To really confirm that theory, we skipped any Scene updates altogether and just ran the screenshot operation, and the leak was still there. If we swap that around and do all the updates but return a single pixel from the screenshot operation, there is no leak, so it seems to be somewhere in that screenshot code.

We then reverted to the Alpha version and saw the issue disappear. We tried this several times and can reproduce it every time. Switch to latest, GPU leak, switch back, no GPU leak.

This is using the latest Chromium/Puppeteer in a Linux environment, if that helps at all. I’m not sure we’d be able to see it just using a PG, but I’d guess the way to set it up is to run the async screenshot operation in a loop, one call right after the other, and see if your GPU memory climbs. I don’t see it happening on a Mac either, which just makes it even more of a mystery!
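Something like this sketch, run under Puppeteer while watching the GPU (the camera, size, and iteration count are arbitrary):

```js
// Sketch of a possible repro: take screenshots back to back with no scene
// updates in between, and watch GPU memory climb (or not) from the outside.
async function screenshotLoop(engine, camera) {
    for (let i = 0; i < 10000; i++) {
        await BABYLON.Tools.CreateScreenshotUsingRenderTargetAsync(
            engine,
            camera,
            { width: 1920, height: 1080 }
        );
        // Intentionally no scene updates here, to keep the test isolated
        // to the screenshot path.
    }
}
```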

One other note: to see the GPU leak we are watching GPU memory using watch -n1 nvidia-smi on the Linux machine, which has an NVIDIA card.

That does not ring any bells for me…

One thing that would probably help narrow down the search would be to test different 5.x versions and locate the one that introduces the problem: there are too many changes between 5.0.0-alpha.60 and 5.41.0 to do a code comparison and try to pinpoint the exact change that leads to the behavior you are experiencing.

We tried to narrow this down but ran into an issue where versions 5.0.0 through 5.12.0 would not load our models. Jumping to 5.21, we were able to see the leak. Strangely, the leak is there but not as pronounced in 5.21 as it is in 5.41: it still leaks, just more slowly.

Here is a Playground. We can reproduce this in 5.41.0.

Thanks, I’m going to have a look, but why are you calling engine.endFrame by hand? The user is not supposed to do that when the render loop is handled by the engine, so I’m not sure if there can be some side effects because of that…

Can you test by removing the 2x scene.render() and 2x engine.endFrame() calls to be sure the problem is not related to that?

It relates to this other issue we saw here:

(tl;dr) we had geometry that was missing from renders.

If you can advise us how to adjust the Playground, we’re still set up on Linux to test whether that still results in a memory leak.

I just realized that I didn’t note that in our current app, we freeze the render loop, which is why we run that manual frame advancement. So this PG is slightly more accurate.
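For context, the frame handling in the PG looks roughly like this (a sketch; stopRenderLoop stands in for however our app actually freezes the loop, and the doubled render/endFrame calls come from the missing-geometry workaround linked above):

```js
// Assumption: this is how we "freeze" the render loop in the sketch; the real
// app may do it differently. The engine no longer drives rendering itself.
engine.stopRenderLoop();

function advanceFrames(engine, scene) {
    // Doubled render/endFrame calls, per the workaround for geometry missing
    // from renders, before we take the screenshot.
    scene.render();
    engine.endFrame();
    scene.render();
    engine.endFrame();
}
```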

When you say that there is a GPU memory leak, I assume that the percentage of free memory is constantly decreasing and at the end you get an “out of GPU memory” message?

I tested your PG on my computer and used nvidia-smi (I have a 3080Ti). The %mem goes up but comes down regularly, so in the end I don’t see a memory leak on my side:

[screenshot: nvidia-smi output]

I guess at this point we can only try to roll back our changes to the screenshot method one by one (there are not that many between 5.0.0 and 5.21.0) so you can test them and see when the leak disappears…

Correct. Are you testing on Windows or Linux? For us, the problem does not exist on Mac OS, but on Linux (Amazon Optimized Linux, to be exact) GPU memory goes from 600 MB to 700, 800, 900 MB and keeps climbing until it uses up the full 24 GB. Nothing else is running on this machine.

I’m testing on Windows.

To try rolling back some of our changes, can you try this PG:


With this PG, the problem does NOT exist. The bottom process is our test after running for about 5 minutes. This seems like the likely culprit!


In the new Tools.DumpData function, we are creating a new texture each time the function is called but we are also disposing it, so I don’t understand why it would leak… Indeed, it seems it leaks only on your test machine but not on others.

Can you test this PG and see if it is leaking:

It’s basically what we are doing in the DumpData function…
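In outline, it does something like this (the exact texture type, size, and interval in the PG may differ; treat it as a sketch):

```js
// Sketch of the pattern being tested: create a texture, dispose it right
// away, repeat on an interval. Texture type, size, and delay are illustrative.
const size = 1024;
const pixels = new Uint8Array(size * size * 4);   // dummy RGBA data

setInterval(() => {
    const texture = BABYLON.RawTexture.CreateRGBATexture(pixels, size, size, scene);
    texture.dispose();   // disposal should release the GPU memory right here
}, 100);
```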

EDIT: maybe you can lower the setInterval delay


Yep, this is leaking. It started at around 75 MB and is now up to 123 MB.

That means any texture created is leaking; disposing of it does not reclaim the GPU memory…

It sounds like a bug of the driver to me, which does not reclaim GPU memory properly.

It is more apparent now with our new screenshot code because we are creating a new texture each time, but to me the problem is not with our code.

cc @sebavan in case it rings some bells for him.

Note that, if it helps, you can keep the overridden BABYLON.DumpTools.DumpData method I provided in the PG above, as long as the screenshots look ok to you with this method. Our new implementation takes care of the canvas pre-multiply setting that made data with an alpha channel not look right, but screenshots don’t have an alpha channel, so that should be fine.
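For illustration, an override along these lines draws the pixels onto a plain 2D canvas instead of creating a GPU texture (this is a simplified sketch; the parameter list follows the 5.x DumpData signature as I understand it, and the override in the PG handles more cases):

```js
// Simplified canvas-based override (assumes `data` is a typed-array view of
// RGBA bytes; ignores invertY/toArrayBuffer/quality handling).
BABYLON.DumpTools.DumpData = function (width, height, data, successCallback, mimeType = "image/png") {
    const canvas = document.createElement("canvas");
    canvas.width = width;
    canvas.height = height;

    const ctx = canvas.getContext("2d");
    const imageData = ctx.createImageData(width, height);
    imageData.data.set(new Uint8ClampedArray(data.buffer, data.byteOffset, width * height * 4));
    ctx.putImageData(imageData, 0, 0);

    // Screenshots come back as a data URL, same as before.
    if (successCallback) {
        successCallback(canvas.toDataURL(mimeType));
    }
};
```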

And thanks for your patience with all the testing!


That’s interesting. We’re working with AWS on finding an updated driver. I’ll post back when we get info there.
