Unable to create VAO on context lost with active frustum culling

Hello,

I’ve encountered a potential bug reproducible in the following playground:

Steps to reproduce:

  1. Open the playground.
  2. Resize the canvas/window so none of the Boombox objects on the left are visible.
  3. Rerun the playground.

Following these steps, I can reliably reproduce an Uncaught Error: Unable to create VAO.

Now the weird part: if I resize the window so that at least one of the Boombox objects is visible on startup, or set scene.skipFrustumClipping = true, no error occurs on context loss.

This means there is enough time to build the VAOs. But with frustum culling enabled, the engine tries to build VAOs for objects that have been culled for the complete lifetime of the application (and should still be culled), just after the context lost event has been fired.

Implications for my real-world use case:
Either I disable frustum culling (big performance loss), or I set alwaysSelectAsActiveMesh = true on all objects and disable it again after at least one render loop pass (which gets complicated when loading objects into the scene dynamically). Otherwise the application crashes on a context loss and cannot restore. I’d rather have the engine not unnecessarily try to build VAOs after the context has been lost.

Scanning through the babylon source, I have not been able to find the reason for this so far.
I’d be glad for any help! :slight_smile:

gosh I tried for several minutes to repro but I finally managed to get the error.

I truly believe this is a bug in the WebGL implementation (I know that initially the manually forced context loss was only recommended for debugging).

Do you mind opening an issue for the chromium team to check?

I filed a bug report at chromium:
https://issues.chromium.org/issues/331092193

Please feel free to add additional information there, in case you have any idea about what might be causing this. TBH my bug report feels like a complete shot in the dark to me.

One additional thing: I could not reproduce the problem on Firefox, neither with the playground nor with our actual application, which reliably throws the error on Chromium. So it really seems to be Chromium-related.

Thanks a lot! To be fair, I really think this is deeply internal and related to the lost context, so we cannot really do a lot about it.

Update:
The Chromium team thinks it’s an issue internal to BabylonJS and closed the bug report.

haha lol
Seriously? So it works on Firefox but this is Babylon’s fault…

I’ll dig into it

To give a bit more context, the GL error returned by the VAO creation is 37442: “CONTEXT_LOST_WEBGL”.

We do handle context losses, but only after our “webglcontextlost” listener has been called by the browser. The problem here is that this event has not been raised yet when the VAO creation fails, so we are not aware that the context has been lost.
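
As a small aid for reading such error codes, here is a sketch of a lookup from numeric WebGL error values to their names. The table and the glErrorName helper are hypothetical names for illustration and are not part of Babylon.js; the numeric values are the standard WebGL error constants, with 0x9242 being the 37442 mentioned above.

```typescript
// Standard WebGL error constants, keyed by numeric value.
const GL_ERROR_NAMES: { [code: number]: string } = {
    0x0000: "NO_ERROR",
    0x0500: "INVALID_ENUM",
    0x0501: "INVALID_VALUE",
    0x0502: "INVALID_OPERATION",
    0x0505: "OUT_OF_MEMORY",
    0x0506: "INVALID_FRAMEBUFFER_OPERATION",
    0x9242: "CONTEXT_LOST_WEBGL",
};

// Hypothetical helper: turns the value returned by gl.getError() into a name.
function glErrorName(code: number): string {
    const name = GL_ERROR_NAMES[code];
    return name !== undefined ? name : `UNKNOWN (0x${code.toString(16)})`;
}
```

For example, glErrorName(37442) returns "CONTEXT_LOST_WEBGL".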

Gosh! I did not realize that and I was escalating with Google team :frowning:

The explanation coming from Ken from Google who kindly looked at the issue:

This indicates to me that sometimes Babylon is trying to use the WebGL context before the webglcontextrestored event is dispatched. It’s necessary to wait for this event to be dispatched after calling the restoreContext method of the WEBGL_lose_context extension as documented in WebGL WEBGL_lose_context Khronos Ratified Extension Specification.
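
To illustrate what "waiting for the event" could look like, here is a minimal sketch of an event-driven gate. ContextGate, CanvasLike and ContextEventLike are hypothetical names (structural stand-ins so the sketch also runs outside a browser), not Babylon.js code. The event names "webglcontextlost" and "webglcontextrestored" are the standard ones, and calling preventDefault() in the lost handler is what allows the browser to restore the context later.

```typescript
// Minimal stand-ins for the DOM types involved (hypothetical, for illustration).
interface ContextEventLike {
    preventDefault(): void;
}
interface CanvasLike {
    addEventListener(
        type: string,
        listener: (ev: ContextEventLike) => void,
        useCapture?: boolean
    ): void;
}

// Tracks whether the context may be used, based only on the two events.
class ContextGate {
    private lost = false;

    constructor(canvas: CanvasLike) {
        canvas.addEventListener("webglcontextlost", (ev) => {
            // Without preventDefault() the context is not eligible for restoration.
            ev.preventDefault();
            this.lost = true;
        }, false);
        canvas.addEventListener("webglcontextrestored", () => {
            this.lost = false;
        }, false);
    }

    // True until "webglcontextlost" fires, and again after "webglcontextrestored".
    canUseContext(): boolean {
        return !this.lost;
    }
}
```

Note that a gate like this only knows about a loss once the event has actually been delivered, which is exactly the window discussed in this thread.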

Thanks a lot for investigating!

Ken’s explanation makes sense, but it does not explain why Babylon is trying to create this VAO after context loss in the first place.

The error in the playground is reproducible even when commenting out the restoreContext() call. It’s happening immediately on context loss.

I’m still not convinced whether the problem is within chromium or Babylon, but I don’t see an error in the playground code:

Looking at Ken’s printfs, the first line "I think the context is lost" is printed in WebGL2RenderingContextBase::createVertexArray(), when the context has already been lost.
But "After context lost event dispatch: restore is allowed", printed within WebGLContextEvent::Create(), comes afterwards.

To me, this seems to be either a super subtle timing error (in which case Babylon tried to create a VAO after context loss), or createVertexArray() fails before the context lost event has been fully dispatched (which would be a Chromium-internal bug).

I have two questions:

  1. Why does Babylon even try to create the VAO in this scenario?
  2. Is it possible that the render loop is not stopped immediately on receiving the context lost event, but tries to finish its current run and thus accesses the lost context?

I’m posting here to ask for your opinions first, before reopening the Chromium issue and wasting more people’s time.

If you put a breakpoint in ThinEngine._onContextLost (which is our webglcontextlost handler, set by a canvas.addEventListener("webglcontextlost", this._onContextLost, false); call) and check “Pause on uncaught exceptions” in the debugger, you will see that the debugger pauses on the uncaught exception before it hits the breakpoint in the handler.

For me, that means it’s a problem on the browser side: as we have not been notified that the context has been lost, we run our regular code, which makes the VAO creation fail.

The strange thing is that the bug only appears when there are no meshes on the screen… So, maybe we are doing something we shouldn’t, that leads to this state of affairs…

Have you been able to test in other browsers, like Safari?

I did some tests on browserstack (Linux was tested locally on my Manjaro machine):

Not reproducible by me on:
Safari 16.5, 17.3 (Mac)
Firefox 124 (Win11, Linux, Android Samsung Galaxy S22)
Safari iPhone 14
Chrome iPhone 14

Reproducible on:
Edge 123 (Mac, Win11)
Chrome 123 (Mac, Win11, Linux, Android Samsung Galaxy S22)
Opera 109 (Mac, Win11)
Firefox 124 (Mac)

Strong correlation with the Blink engine, although Firefox on Mac confuses me a bit.

Did you ever find a solution or a workaround to this?

It’s one of the biggest errors that our players are facing at the moment :frowning:

Hey,
a workaround is to force the creation of all VAOs, VertexBuffers, etc. before a context loss happens. You can do this by disabling frustum culling and enabling it after at least one complete frame (e.g. in scene.onBeforeRenderObservable.addOnce()).
An alternative, which works better when dynamically loading objects, would be setting mesh.alwaysSelectAsActiveMesh = true and disabling it later.
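
Both variants could be sketched roughly as follows. MeshLike and SceneLike are hypothetical structural stand-ins for the Babylon.js Scene and AbstractMesh (only the members used here), and warmUpScene / warmUpMeshes are made-up helper names. I use onAfterRenderObservable.addOnce() so the reset happens after the first frame has been rendered; that choice, and the idea that one frame is enough, are assumptions (meshes whose materials or textures are not ready yet may need more frames).

```typescript
// Structural stand-ins for the parts of Babylon.js used below (hypothetical).
interface MeshLike {
    alwaysSelectAsActiveMesh: boolean;
}
interface SceneLike {
    meshes: MeshLike[];
    skipFrustumClipping: boolean;
    onAfterRenderObservable: { addOnce(callback: () => void): unknown };
}

// Variant 1: turn frustum culling off for one frame, then restore the old value.
function warmUpScene(scene: SceneLike): void {
    const previous = scene.skipFrustumClipping;
    scene.skipFrustumClipping = true;
    scene.onAfterRenderObservable.addOnce(() => {
        scene.skipFrustumClipping = previous;
    });
}

// Variant 2: force individual meshes to be active for one frame.
// Meshes that already had the flag set are left untouched.
function warmUpMeshes(scene: SceneLike, meshes: MeshLike[] = scene.meshes): void {
    const toReset = meshes.filter((m) => !m.alwaysSelectAsActiveMesh);
    for (const mesh of toReset) {
        mesh.alwaysSelectAsActiveMesh = true;
    }
    scene.onAfterRenderObservable.addOnce(() => {
        for (const mesh of toReset) {
            mesh.alwaysSelectAsActiveMesh = false;
        }
    });
}
```

For dynamically loaded content, warmUpMeshes(scene, newlyLoadedMeshes) would be called right after each load, where newlyLoadedMeshes is whatever list your loader gives you.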

I still have no clue whether the actual bug is in Babylon or on the browser side, though. I will investigate further when I have more time. It’s still an issue for us too.

My guess on what’s happening:
When the browser dispatches a webglcontextlost event, Babylon sets ThinEngine._contextWasLost. This flag is only checked at the beginning of each render loop iteration, though. So if the context is lost between this check and the WebGL calls in the render stage, Babylon still tries to access the lost context and fails.
Non-Blink browsers might differ in their implementation, so that the event is dispatched earlier or the context stays available just a little bit longer and the race condition is not hit.
What I absolutely don’t understand is why Babylon evaluates meshes as active that have never been active before. Maybe evaluateActiveMeshes() accesses some WebGL resource that is suddenly unavailable, and therefore selects meshes wrongly.
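
To make the suspected window concrete, here is a tiny model of it. This is only a sketch of my guess, not Babylon's actual render loop: runFrame, LossAwareGl and FrameOutcome are hypothetical names. The only real API assumed is gl.isContextLost(), which is a standard WebGL method; the contextWasLost parameter stands for a flag that is set solely by the event handler.

```typescript
// Only the one WebGL method this model needs (structural stand-in).
interface LossAwareGl {
    isContextLost(): boolean;
}

type FrameOutcome = "skipped" | "aborted" | "completed";

// contextWasLost: the flag maintained by the "webglcontextlost" handler.
// steps: the pieces of work of one frame that touch the GL context.
function runFrame(
    gl: LossAwareGl,
    contextWasLost: boolean,
    steps: Array<() => void>
): FrameOutcome {
    // Frame-start check: covers losses the handler already knows about.
    if (contextWasLost) {
        return "skipped";
    }
    for (const step of steps) {
        // Per-step check: covers a loss that happened after the frame
        // started and whose event has not been delivered yet.
        if (gl.isContextLost()) {
            return "aborted";
        }
        step();
    }
    return "completed";
}
```

With only the first check, a loss in the middle of the frame would let the remaining steps run against a dead context, which matches the symptom described above.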

Unfortunately, there are no workarounds once a context loss has happened if the context restored event has not been fired :frowning:

If no context restored event is dispatched by the browser, this would mean a crash in the browser, driver or smth like that. There’s nothing to work around there. :sweat_smile:

I dug a bit deeper into Babylon’s code and the WebGL spec. According to this article in the Khronos WebGL Wiki, you’re not supposed to check for null on the creation functions for WebGL objects like VAOs. Null is returned in case of a context loss, and functions consuming those objects can handle null and turn into no-ops.
If you look into recordVertexArrayObject(), there is a null check though. The same goes for similar functions like _createVertexBuffer(), createIndexBuffer(), _createShaderProgram() or _createTexture(). As I understand it, when the context is lost, they all throw but actually shouldn’t. The article states that it’s expected and fine to get null here.
Maybe it’s as easy as removing those null checks and letting the render loop finish its current iteration. On context restore, all WebGL resources need to be recreated anyway.
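
A middle ground between "always throw on null" and "never check" could look like the sketch below. This is not Babylon's code: VaoGl is a structural stand-in for the few WebGL2 methods used, and createVaoTolerant is a hypothetical helper. Keeping the throw for the case where the context is not lost is my own assumption, so that genuine failures still surface.

```typescript
// Only the WebGL2 methods this sketch needs (structural stand-in).
interface VaoGl {
    createVertexArray(): object | null;
    bindVertexArray(vao: object | null): void;
    isContextLost(): boolean;
}

// Returns null instead of throwing when the failure is due to a lost context.
function createVaoTolerant(gl: VaoGl): object | null {
    const vao = gl.createVertexArray();
    if (vao === null && !gl.isContextLost()) {
        // Creation failed although the context is alive: still an error.
        throw new Error("Unable to create VAO");
    }
    // May be null on a lost context; bindVertexArray(null) is a valid call.
    return vao;
}
```

The caller would then pass the (possibly null) result straight to gl.bindVertexArray() and rely on the resources being rebuilt on context restore.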

Can someone of the Babylon team look into this, please? :slight_smile:

This won’t help, as a big part of the engine state might then have other issues, depending on how those objects are used internally. It would also force wrong typings in quite a few places within the code, which would not help maintenance.

The main thing is that after a context loss, we should, as much as possible, not render or attempt anything, in order to save resources.

Internal state is my biggest concern too. On the other hand, buffers, VAOs, textures, etc. are only used within their respective WebGL functions, and when these are no-ops, this shouldn’t affect internal state. If there are exceptions to this, there should be few.
@Evgeni_Popov seems to have built _restoreEngineAfterContextLost(), which handles rebuilding internal state. Maybe he can say more about this.

The alternative would be to stop the current render loop instead of just throwing and then try to restore from there. This sounds even more concerning about internal state to me.^^

Rendering should already be protected when a context loss happens.

So something else must be causing it. Can you share a repro?

Please do not tag team members, even more so because Mr. evgeni_popov is on a well-deserved vacation at the moment.