Multithreading might be possible!

Hi everyone,

I wanted to share some thoughts I’ve had since @Deltakosh released the OffscreenCanvas feature.

I just had an idea which will make multithreading possible (I think) thanks to that feature. Here it is:

  • We know we can use that feature to have a canvas rendered in a thread separate from the browser’s main thread
  • And @Deltakosh confirmed that for each new canvas and Babylon.js scene created using an offscreen canvas, there will be a new thread.
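
For reference, the handoff those two points rely on can be sketched as follows. This is only a minimal sketch, not Babylon.js code: `startLayer`, the `layer` message field and the two structural interfaces are names made up for illustration. In a browser, `HTMLCanvasElement.transferControlToOffscreen()` and `Worker.postMessage(message, transfer)` have these shapes, and the worker is assumed to create its own engine and scene on the canvas it receives.

```typescript
// Structural stand-ins so the sketch does not depend on DOM typings.
// In a browser, HTMLCanvasElement and Worker match these shapes.
interface CanvasLike<C> {
  transferControlToOffscreen(): C;
}
interface WorkerLike<C> {
  postMessage(message: unknown, transfer: C[]): void;
}

// Hypothetical helper: hand one canvas over to one worker.
// The worker is assumed to build its own engine and scene on that canvas.
function startLayer<C>(canvas: CanvasLike<C>, worker: WorkerLike<C>, layer: string): void {
  const offscreen = canvas.transferControlToOffscreen();
  // The offscreen canvas goes in the transfer list: it is moved, not copied.
  worker.postMessage({ layer, canvas: offscreen }, [offscreen]);
}
```

Calling this once per canvas and worker gives one rendering thread per layer, which is the situation described above.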

So the main idea is: what if we use overlapping canvases to render a single scene?
Setting the ambient color to black should then let each overlapping canvas show through!
Theoretically, we could call that multithreading, right?

I still see some challenges:

  • It can be really complex to manage indexes across the canvases
  • We have to make sure the cameras always have the same position and target in all the scenes
  • Depending on the complexity of the scene, the same assets might have to be loaded once per canvas
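
To make the camera point concrete, here is a rough sketch of one way to keep the layers aligned: the main thread owns the camera pose and sends the same plain data to every layer each frame. `CameraPose`, `PoseSink` and `broadcastPose` are hypothetical names for this example, not Babylon.js APIs. Each worker would copy the pose onto its own camera before rendering.

```typescript
// Plain, cloneable camera state (no engine objects cross threads).
interface CameraPose {
  position: [number, number, number];
  target: [number, number, number];
}

// Anything with postMessage, e.g. a Worker that owns one layer's scene.
interface PoseSink {
  postMessage(message: unknown): void;
}

// Hypothetical helper: send the same pose to every layer once per frame.
function broadcastPose(pose: CameraPose, layers: PoseSink[]): void {
  for (const layer of layers) {
    layer.postMessage({ type: "camera", pose });
  }
}
```

Since messages are delivered asynchronously, the layers can still end up a frame apart, so this does not remove the synchronization problem, it only shows where it sits.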

But we can also see use cases where it could be beneficial to do that:

  • Separating particle effects, which are often rendered above everything else
  • Same for the GUI
  • Separating the environment, which always stays behind
  • etc

So, what do you think about it?
I will do some tests to see how feasible this is and whether it works well or not. :wink:

Sounds fun to try, I’m just afraid about the composition performance of overlapping canvases in the browser :slight_smile:

What is that exactly, @sebavan?
So you are saying it could be counterproductive to overlap canvases?

Yup, at the browser level the compistion of overlapping transparent elements is somewhat expensive. It is worth a try, but keep a close eye on the perf in the Chrome dev tools to see what happens.

compistion means composition :slight_smile: but I leave it cause I think this should be a real word as it sounds quite nice.

GREAT idea.

Old guy from C/C++, here, who misses MULTI-THREADING.

It would be nice. : )

Clever to multi-thread canvas layers.

May have usefulness in what we work on, Cinematics.

Doing something similar for UI over 3D.

If this multi-threading is a performance gain - what else could be threaded?

EXPORT AUDIO to THREAD would be fantastic.

No time.

And something in W3C pending on that.

Asset sharing, we talked about a month ago.

Interested to hear more about this.

:eagle: : )

Sorry, but why exactly would multithreaded rendering be faster? Wouldn’t it be smarter to follow the usual C++ engine approach and have threads for different tasks, such as animation/physics/rendering? Unless I’m missing something about how WebGL works you won’t have any gains unless you’re CPU bound, and syncing the framebuffers in the end will be a major problem.

I had the same idea thinking it could be useful for complex multipass renders, since you can render N passes in parallel and compose them in postprocess at the end. I was looking into that, but even then you have to get all the textures together for the final composition, which only becomes interesting if N>2. I don’t even know if you could share textures between different threads – if you can’t, the performance penalty would be a killer. My little experience with webworkers tells me that exchanging data from a worker to the main thread is a bit of a pain and probably too slow.

Indeed @brunobg, the best way would be to do exactly as in C++, but we are on the web! :slight_smile:

I still made some test pages using 3 suns from ParticleHelper, just to see how it goes:
Test1 with 3 separate canvases/engines/scenes, rendering 3 suns in 3 offscreen canvases on separate threads.
Test2 with only one canvas/engine/scene rendering the 3 suns.

Obviously for test1, the browser performance test is much better.

In terms of CPU/GPU usage, when I go to test1, my GPU is fully used (100%) and my CPU is at 30% on average. And my computer has a hard time if I want to watch a YouTube video in another tab. :crazy_face:
And with test2, my GPU is at 80% and my CPU at 15%.

So with this specific use case (the 3 suns use case :grin:) it is better to have only one scene.
Maybe because of the composition at the browser level, as you said @sebavan, or the textures not being shared, as you suggest @brunobg. A bit of both? :thinking:

It would be interesting to try other use cases too!

Here are my two cents, based on my experience.

There should be very little gain in practice from splitting rendering across threads and compositing. I worked with distributed rendering and there were significant challenges; it was only worth composing the image from multiple renders with separate objects if the scene was too large to fit in memory or the render process was very slow, like with ray tracing. Downloading a texture to the CPU and sending it again is very slow compared to accessing internal textures in the GPU. Again, I don’t know if you can share contexts between web workers or a web worker and the main thread (a quick search returned nothing, and since offscreen rendering is new I would guess probably not).

I have used overlapping transparent canvases without problems before, but there was no z-buffer involved. Browsers didn’t seem to have a performance problem with that, but with BJS you’d have to sort objects and split them yourself, which is quite a problem. It could be useful for a HUD, but then it’s probably better to do it with pure HTML or have a webworker handling the HUD in another canvas.

Applications using C/C++/native engines usually have very fast frame render times. They have to be quick, since they often have several passes (check this article to see what Doom does), and when the hardware is too slow to keep up with the game they reduce the effects. Adding more render threads wouldn’t help, because you’re already GPU bound. And if you’re CPU bound, rendering things faster on the GPU won’t help either.
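
A deliberately simplified model of that argument, assuming the CPU and GPU work on consecutive frames in parallel so that the slower side sets the frame time. The function names and the numbers are illustrative only, not measurements.

```typescript
// Simplified model: the slower of CPU and GPU determines the frame time.
function frameTimeMs(cpuMs: number, gpuMs: number): number {
  return Math.max(cpuMs, gpuMs);
}

// Which side is the limiting one under this model.
function bottleneck(cpuMs: number, gpuMs: number): "cpu" | "gpu" {
  return cpuMs > gpuMs ? "cpu" : "gpu";
}

// Illustrative numbers only: a GPU-bound frame.
const before = frameTimeMs(6, 20);
// Spreading the CPU work over more threads shrinks only the CPU side.
const after = frameTimeMs(2, 20);
```

In this model `before` and `after` are the same, which is the point: extra threads only pay off when the CPU side is the larger number.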

I haven’t kept up to date on what hardware can do to improve GPU render times, but I know there are a few extensions like https://www.khronos.org/registry/OpenGL/extensions/OVR/OVR_multiview.txt or WebGL Deferred Shading - Mozilla Hacks - the Web developer blog. Also, simulation is not my domain, but I know about PhysX/Bullet and other attempts at performing physics on the GPU. Apparently this was already thought of, though I’m not sure if it was actually implemented in BJS: Running physics in a webworker

Anything that is CPU bound could potentially perform better with a webworker – particle animations, collision detection, things like that. But webworkers are still a pain to work with and passing data has a lot of restrictions. SharedArrayBuffer was deprecated and partially un-deprecated. Sending data by reference is somewhat muddy – possible, but could have some significant synchronization challenges and require a double or triple buffer. @aFalcon perhaps could link to the discussion he mentioned.
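
To illustrate the double-buffer idea mentioned above, here is a minimal single-threaded sketch. `DoubleBuffer` and its members are hypothetical names. Across real threads, the two arrays would have to live on a SharedArrayBuffer with the swap coordinated through Atomics, or be transferred back and forth, and that is where the synchronization challenges come in.

```typescript
// Two equally sized buffers: the producer fills the back buffer while the
// consumer reads the front one; publish() swaps their roles.
class DoubleBuffer {
  private a: Float32Array;
  private b: Float32Array;
  private frontIsA = true;

  constructor(length: number) {
    this.a = new Float32Array(length);
    this.b = new Float32Array(length);
  }

  // Buffer the producer may write to.
  get back(): Float32Array {
    return this.frontIsA ? this.b : this.a;
  }

  // Buffer the consumer may read from.
  read(): Float32Array {
    return this.frontIsA ? this.a : this.b;
  }

  // Make the freshly written data visible to the consumer.
  publish(): void {
    this.frontIsA = !this.frontIsA;
  }
}
```

A worker computing particle positions, for example, would write into `back`, call `publish()`, and the render side would only ever look at `read()`.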

From my tests the bottleneck in BJS seems to be animating some objects (which surprised me, since I imagined it’d always be done on the GPU these days, but apparently not in some cases) and _evaluateActiveMeshes, which I guess is slow from sending data to the GPU rather than actual CPU calculation (could any BJS gods confirm this?). I suppose animation could benefit from moving to a web worker, but it’s probably not trivial since BJS seems to need it to be done before the render phase. To me, the best advantage of using offscreen is releasing the main thread, so even if the rendering is slow your browser doesn’t freeze.

I think it boils down to: what are you trying to achieve and what is your bottleneck?

Premature Optimization - got it.

Yes SharedArrayBuffer was the spec.

I would like to see this used with AssetManager or ImportMesh.

To background load Assets. Including Audio.

Sounds like that POC is to be avoided, still.

Thx,

:eagle:

I think loading assets, if there is significant processing after the network transfer, would be a great case for this. There was a thread this week about how loading DDS textures was slow, which is a prime candidate.

My experience with glTF/GLB is that they are very fast to process. After all, it’s JSON, so it’s already processed by the browser, and a worker wouldn’t help much. It also shouldn’t help with audio at all. Why do you think it would? Download is async and the browser decodes it.

Good question. To clarify, not runtime. Download time optimization.

Audio webworker GOAL:

Dynamic audio on Demand. Audio downloaded when needed. And does not impact runtime.

Is this already easy, and I just missed it?

Seems like Audio has a Front-loading cost. I’d like to optimize that.

CONTEXT:

It is a movie. So best if audio is queued for each scene.

I have not looked into it yet.

If lucky, next week.

:eagle:

Downloads are async and shouldn’t hurt the main thread at all. I’d preload audio to ensure it’s ready when needed: Play Sounds and Music - Babylon.js Documentation

Also, perhaps this post will be useful to you: Good practice to load and access an asset (e.g. sound) from code?

Well, I’m no god but looking at the code, there’s no GPU related work done in _evaluateActiveMeshes.

This function checks all the meshes (and sub meshes) / particle systems in the scene and builds a list of meshes that are active (several in fact, as you have to separate opaque / transparent meshes), meaning meshes that need to be rendered. For example, it’s this function that checks if a mesh is in the frustum or not (among other things).
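
To give an idea of why this is pure CPU work, here is a generic sketch of a frustum check on bounding spheres. It is not the Babylon.js implementation; `Plane`, `BoundingSphere`, `isSphereInFrustum` and `selectActive` are names invented for the example, and the planes are assumed to have normals pointing towards the inside of the frustum.

```typescript
// A plane in the form normal · p + d = 0, with the normal pointing inside.
interface Plane {
  normal: [number, number, number];
  d: number;
}
interface BoundingSphere {
  center: [number, number, number];
  radius: number;
}

// True unless the sphere lies entirely behind at least one plane.
function isSphereInFrustum(sphere: BoundingSphere, planes: Plane[]): boolean {
  const c = sphere.center;
  for (const { normal, d } of planes) {
    const distance = normal[0] * c[0] + normal[1] * c[1] + normal[2] * c[2] + d;
    if (distance < -sphere.radius) {
      return false; // completely outside this plane
    }
  }
  return true;
}

// Build the list of "active" items: those whose bounds touch the frustum.
function selectActive<T extends { bounds: BoundingSphere }>(meshes: T[], planes: Plane[]): T[] {
  return meshes.filter((mesh) => isSphereInFrustum(mesh.bounds, planes));
}
```

The cost grows with the number of meshes times the number of planes, all on the CPU, before anything is sent to the GPU.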

So, it’s somewhat expected that it is the heaviest in CPU time, because what other function could it be? When rendering a frame, the two main processes are the mesh visibility evaluation (_evaluateActiveMeshes, exclusively (I think) done on the CPU) and the mesh rendering (the various render functions, which are normally heavier on the GPU than on the CPU). If you have a lot of materials, you could also spend a fairly high amount of time in the isReady function of the materials, because it runs quite a number of tests to check if the effect has to be recreated or not, especially for the StandardMaterial and the PBRMaterial.

Note that you can make _evaluateActiveMeshes take ~0 CPU time by calling freezeActiveMeshes() on the scene (of course, make sure your meshes and your camera don’t move, or else the list of active meshes captured when freezeActiveMeshes() was called could be wrong). You can also speed things up on the material side by calling freeze() on the materials.
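
A small sketch of how those two calls could be combined. The interfaces below are a structural subset of the Babylon.js scene and material objects, written out so the snippet stands alone; `freezeStaticScene` and `unfreezeScene` are hypothetical helper names, while `freezeActiveMeshes()`, `unfreezeActiveMeshes()` and `freeze()` are the methods on the real objects.

```typescript
// Structural subset of the Babylon.js objects involved.
interface FreezableMaterial {
  freeze(): void;
}
interface FreezableScene {
  materials: FreezableMaterial[];
  freezeActiveMeshes(): unknown;
  unfreezeActiveMeshes(): unknown;
}

// Hypothetical helper: call once the static scene is fully set up and
// neither the meshes nor the camera will move any more.
function freezeStaticScene(scene: FreezableScene): void {
  for (const material of scene.materials) {
    material.freeze(); // skip the per-frame material readiness checks
  }
  scene.freezeActiveMeshes(); // reuse the current active mesh list
}

// Hypothetical helper: call before moving meshes or the camera again.
function unfreezeScene(scene: FreezableScene): void {
  scene.unfreezeActiveMeshes();
}
```

Whether this is usable depends entirely on the scene being static, as noted above.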

Regarding animations, they are preferably done on the GPU, unless your hardware can’t handle it or you choose to do them on the CPU (mesh.computeBonesUsingShaders = false).

Correct