What Are the Most CPU-Intensive Tasks? (Worker Threads + WASM Discussion)

Hi everyone, glad to be back now that support for threading is in its infancy.

My intro question is in the title itself, but hopefully the follow-up explanation will clarify my intentions. Here goes…

1. WebWorkers are Multiple Single Threads

The strategy of using workers to offload heavy lifting from the main thread is the preferred option from an efficiency and quality of end-user experience perspective. But, it kicks the can down the road in terms of each worker being its own separate single-threaded process. Limitations or problems on the main thread can also be encountered on each worker thread.

2. WASM Modules are Multi-Core Instances

The restriction on using WASM in the context of deploying BabylonJS is that the core of the engine would need to be ported to a native language that compiles to that target, and the resource expense of implementing WASM modules would be so great in production that their apparent benefits would not outweigh the costs. It doesn’t help that deploying an app designed to use WASM minimally is an anti-pattern; there must be a sufficiently high workload for the WASM module(s) to justify their place in the overall app design.

3. WebWorker as WASM Loader

Instead of porting the entire engine over to a native language that compiles to WASM, why not just port computationally expensive non-rendering code instead? The stuff that most likely will not need to be changed moving forward, in other words.

For example, a WebWorker loads a WASM module as part of its instantiation. The WASM module could be used to calculate all transformations of vertices within world space, thereby allowing the interaction with each scene to be multi-core optimized in terms of geometries. The module would only need to be sent a minimal amount of information, such as an object describing the user’s position and orientation within world space.

The WASM module could then perform all necessary operations in memory and return the data that remains after performing a frustum clip, whereupon the WebWorker would render said data using its TypeScript render routines that implement WebGPU.

This way the worker(s) could perform the light workloads that are too small for WASM to bother with, while migrating all of the heavy workloads that aren’t a good fit for a single-threaded process over to WASM.
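To make the frustum-clip step above a bit more concrete, here is a minimal sketch in C of what such a WASM-side routine might look like. It assumes axis-aligned bounding boxes and six planes whose normals point into the frustum; the names `Plane`, `aabb_in_frustum` and `cull_boxes` are made up for illustration and are not Babylon APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { float x, y, z; } Vec3;

/* Plane as (normal.xyz, offset w). Assumption: a point p is on the inner
   side when dot(normal, p) + w >= 0. */
typedef struct { float x, y, z, w; } Plane;

/* Returns 1 if the box [min, max] is at least partly inside all six planes. */
static int aabb_in_frustum(const Plane planes[6], Vec3 min, Vec3 max)
{
    for (int i = 0; i < 6; ++i) {
        const Plane *p = &planes[i];
        /* Pick the box corner that lies furthest along the plane normal. */
        float px = p->x >= 0.0f ? max.x : min.x;
        float py = p->y >= 0.0f ? max.y : min.y;
        float pz = p->z >= 0.0f ? max.z : min.z;
        /* If even that corner is outside, the whole box is outside. */
        if (p->x * px + p->y * py + p->z * pz + p->w < 0.0f)
            return 0;
    }
    return 1;
}

/* Writes the indices of boxes that survive the test into out_visible
   (which must have room for `count` entries) and returns how many survived. */
size_t cull_boxes(const Plane planes[6], const Vec3 *mins, const Vec3 *maxs,
                  size_t count, uint32_t *out_visible)
{
    assert(count == 0 || (mins && maxs && out_visible));
    size_t visible = 0;
    for (size_t i = 0; i < count; ++i) {
        if (aabb_in_frustum(planes, mins[i], maxs[i]))
            out_visible[visible++] = (uint32_t)i;
    }
    return visible;
}
```

The worker would only need the compact list of surviving indices back, which fits the "send minimal information, return what remains" idea. Whether this beats doing the same test in JS would still have to be measured.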


I’m aware that WebGPU could accomplish everything in one place. However, will the initial release compensate for being single-threaded in all respects? Would it prove to be efficient to run everything through a single bottleneck like that? It reminds me of when games used to run exclusively on the CPU, which led to GPU cards being invented to offload most of the work.

I know that what I’m effectively proposing is - at best - some form of parsing exported files (from Blender, etc) into data sets that are processed in parallel, and - at worst - multiple files that collectively describe a given entity. But since WebGPU requires new file formats anyway and file conversion will be required regardless… maybe it’s time to consider creating such an option in the design of that type of utility?

I’m not an architect-level engineer or super expert in this field; it just seems to me that by combining GPU-optimized worker threads with in-memory multi-core code, app designers could get the best of both worlds and reach unparalleled performance.

All of this having been said, does this make sense? Would it be feasible?
If it is, I’d love to dive into building something like this.
Thanks for reading :robot:


Simd is the primary benefit of wasm imo. As I see it, there are 4 types of parallelism: processor pipelines, simd instructions, threads, and gpu compute. As devs, we can program against the last 3. Of those 3, end users’ devices can benefit from 0 to 3.

  1. processor pipelines
    Nothing you can do, except probably don’t use promises and async functions unless you have to because the built-in platform call returns one.

  2. simd
    Any device running Babylon should have the 128-bit (32x4) wide instructions, arm/riscv included, so basically all devices stand to benefit from simd, but not all browsers support it yet.

  3. threads
    Not all devices will be multicore.

  4. gpu compute
    Not all devices have a dedicated(ish) gpu. Babylon can’t really assume it is submitting code to a gpu, even when using shaders, because of ANGLE/Dawn.

So… are you making a useless workplace app in VR that literally no one is going to use, targeting some VR headset with a phone duct-taped onto it? If so, you are probably shit out of luck and there’s nothing Babylon can improve for you for several years. Thanks, HoloLens.

But let’s suppose you’re targeting the desktop with a gpu in electron/cef/webview/uwp, where you can manually turn things on to have simd, shared memory, and webgpu / wgsl. lets goooo!

Btw, wasm <-> js interop has pretty negligible overhead these days. There’s lots of old info saying it’s costly, but it’s not.

Now, the decision is based on the data size, and also on the desired tradeoff between latency and throughput.

Small data? Cpu / whatever.
Big similar data? Gpu.
Big different data? Cpu, maybe in a worker.
Big data parsing? Wasm, even without simd, can pull much further ahead of js as data size grows. I saw a parser bench where using wasm on small data (100kb) was 1.3x faster than js, but 30x faster for 30mb input.

For a gpu compute scenario, consider that most real-world apps will have lots of different-sized objects to do collision detection against, which is probably better on cpu. However, if you are benchmarking 10k same-size cubes or something, a gpu compute shader will be faster. Maybe there’s some use in background scenes like vehicle traffic/crowds or something. Point is, it depends, so multi dispatch / variants ftw.

Anything actionable now? Probably a wasm gltf loader, definitely recompile ammo with the simd flag turned on, add a physics plugin for rapier3d, create a mirrored api of all the matrix/vec math in simd wasm (the ultraviolet rust lib is maybe a good starting point), and some webgpu post processing (anti-aliasing, ray trace denoiser). Also, having to convert glsl to wgsl is enraging, but tooling to automate that at build time instead of runtime could be beneficial perf-wise.


This blog post from @Deltakosh should interest you:


Here is a nice article on wasm simd use. Scroll down a little for a wasm simd vs wasm non simd vs native comparison.

@Evgeni_Popov maybe you will find these interesting.

I should mention, contrary to both my above generalization and @Deltakosh’s blog, simd and wasm can be good for small data too.

This one shows the opposite effect: a 30x speedup for small (100kb) data, scaling quickly towards the same speed as js as data size increases.

The next link is a strong argument to put some time into simd tooling for Babylon, optimizing for cpu bottlenecks and handsets / headsets.

See slide #4 for a mind-blowing mat4x4 speedup on arm64, 17ms to 0.004ms. wtf, 4000x speedup??? I’ve seen similar results for vec dot product somewhere, in a js vs hand-written simd wasm module comparison on super small input, with something like a 1000x speedup.

The thing to measure is how this will behave in a constant back and forth between JS world and WASM.

The data is not shared between both worlds, so you will have to figure out how to copy it very quickly or how to share it with an acceptable tradeoff for the JS developers (we tried using SharedArrayBuffers but it was a complete disaster).

Also, we need to write the WASM code, right? Are you suggesting we write it in C++ and compile? Meaning that all JS developers will be excluded from contributing, right?
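On the copy-versus-share question above, one common pattern is for the WASM side to own a fixed buffer and hand its address to JS, so that JS can write into it through a typed-array view over the module's memory instead of copying on every call. Below is a rough C sketch of that idea, assuming an Emscripten build. `matrix_buffer`, `matrix_capacity`, `translate_all` and `MAX_MATRICES` are hypothetical names, and this does nothing to address the cost of crossing the boundary millions of times per frame.

```c
#include <assert.h>
#include <stddef.h>

#ifdef __EMSCRIPTEN__
#include <emscripten.h>
#define WASM_EXPORT EMSCRIPTEN_KEEPALIVE /* keep the symbol visible to JS */
#else
#define WASM_EXPORT /* plain native build: nothing to export */
#endif

#define MAX_MATRICES 1024 /* hypothetical capacity, chosen for illustration */

/* Lives in the module's linear memory; zero-initialised at startup. */
static float g_matrices[MAX_MATRICES * 16];

/* JS can wrap this address in a Float32Array view and fill it directly. */
WASM_EXPORT float *matrix_buffer(void) { return g_matrices; }

WASM_EXPORT size_t matrix_capacity(void) { return MAX_MATRICES; }

/* Example in-place kernel: adds a translation to the first `count` matrices.
   Assumption: row-major 4x4 layout with the translation in elements 12..14. */
WASM_EXPORT void translate_all(size_t count, float dx, float dy, float dz)
{
    assert(count <= MAX_MATRICES);
    for (size_t i = 0; i < count; ++i) {
        float *m = &g_matrices[i * 16];
        m[12] += dx;
        m[13] += dy;
        m[14] += dz;
    }
}
```

With this shape, one call processes a whole batch, so the per-call overhead is paid once per frame rather than once per matrix. Whether that restructuring is acceptable for a math library called from everywhere is exactly the tradeoff raised above.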

@Deltakosh I sense you are fed up with empty promises of fast wasm, but simd wasm really is a (literal) game changer. Simd just shipped in chrome 91, so it hasn’t been available very long. I think the first step would be to just recompile ammo using simd flags and see how that goes. Matrix math next, and that’s mostly it. Using ammo doesn’t stop contributors, and I doubt a simd version of gl-matrix will either.

Sidethought: it could be cool to create a simd tagged template library that jit compiles inline base64’d wasm modules, similar to how styled components does for css. Feasible? Maybe someone with a bigger brain than me could figure it out. clang can run in the browser, so that could provide the autovectorization.

Your point about atomics being impossible to use is so true, current tooling and dx suck ass. It’s sad that using rust is easier than javascript.

Also, js wasm interop isn’t necessarily required, because in many cases aren’t we just submitting stuff to the gpu?

For external stuff like ammo: no problem (and we are already doing it)

For math, this is simply not possible. We are not using gl-matrix but our own library (that I wrote from scratch) with a ton of optimizations, and above all the math functions are called millions of times during each frame.

What do you think about recompiling a variant of ammo with simd flag enabled and see how it goes?

totally game for it :slight_smile:

Wanna give it a try?

I’ll give it a go, but some collaboration may be in order

from the docs Porting SIMD code targeting WebAssembly — Emscripten 2.0.27 (dev) documentation
“…pass the -msimd128 flag at compile time. This will also turn on LLVM’s autovectorization passes, so no source modifications are necessary to benefit from SIMD.” w00t
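As a rough illustration of what "no source modifications" could mean in practice, here is plain C with no intrinsics, written so the compiler is free to vectorize the loops if it decides to. The function names and the row-major layout are assumptions for this sketch. Something like `emcc -O3 -msimd128 mat4.c` would be the build step, with `mat4.c` being a hypothetical file name.

```c
#include <assert.h>
#include <stddef.h>

/* out = a * b for row-major 4x4 matrices. `restrict` tells the compiler the
   buffers do not overlap, which removes one obstacle to autovectorization. */
void mat4_mul(float *restrict out, const float *restrict a,
              const float *restrict b)
{
    assert(out != a && out != b);
    for (int r = 0; r < 4; ++r) {
        for (int c = 0; c < 4; ++c) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += a[r * 4 + k] * b[k * 4 + c];
            out[r * 4 + c] = sum;
        }
    }
}

/* Multiplies `count` pairs of matrices stored back to back. Batching like
   this keeps the number of calls from JS low. */
void mat4_mul_batch(float *restrict out, const float *restrict a,
                    const float *restrict b, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        mat4_mul(out + i * 16, a + i * 16, b + i * 16);
}
```

Whether LLVM actually emits SIMD instructions for these loops is not guaranteed, so it would be worth inspecting the generated wasm or benchmarking both builds before drawing conclusions.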

I’m assuming you guys want to use the setup here? ammo.js/CMakeLists.txt at babylonjs · BabylonJS/ammo.js · GitHub

The build will need an additional output target though, a simd version and a non-simd version now. For importing, I think we can just do feature detection in the browser then dynamic import the appropriate file from the babylon cdn?

notes for the cmake overlords:

some other potentially relevant args:
--disable-asm
--extra-cflags="-s USE_PTHREADS=1"

https://emscripten.org/docs/optimizing/Optimizing-WebGL.html
suggests using -s MAX_WEBGL_VERSION=2
seems ok to do since chrome91 is required?

I think i remember seeing a flag ENABLE_WEBGL2=true , no mention of it there though.

Looks like MakeyK24 did the setup there, some input from him or her would be cool, I don’t really use cmake.

alsooo… I saw this the other day.

could we just use this stuff to radically simplify things? I don’t know shit about dotnet scripts, but seems like you could just do a nuget install and run a cpx file? Is that right? (never really used either, just know they exist)

Let me summon our expert @syntheticmagus

Looks like something to try. Like Deltakosh said, give it a shot!

One thing to note: I believe WASM threading is built on SharedArrayBuffer, which has incomplete support in general and no support on Apple platforms, so taking hard dependencies on that brings some pretty stringent limitations. A number of WASM features are like that, actually, which isn’t necessarily a problem. It’s just something to be aware of as you’re figuring out what you want to target.

But yeah, give it a shot and let us know how it goes! In particular, if you have a scenario you care about that you couldn’t do before but you can do afterward, that’ll be super awesome to hear about. Best of luck!

I can build it locally for my own project, no problem. The idea was to nab a low-hanging fruit and experiment with integrating a simd wasm module into Babylon and, like you said, figure out how to target stuff and incorporate it into your build system, because, sadly, that’s the hard part.

Javascript has SIMD. I tried using it in '17, I think. Big let down.

It doesn’t :/ It was an intel proposal, idk if any browsers actually shipped anything even as a trial? You may have used a non-functional shim: GitHub - ljharb/simd: ES7 (proposed) SIMD numeric type shim/polyfill. Alsooo, writing simd manually is kind of dumb when a compiler can autovectorize, which just means compiling regular array methods / for loops / iterators into a simd equivalent. Tbh, there’s no reason v8 can’t do it on regular js, and my guess is it probably will some years in the future.

This may be an option GitHub - google/highway: Performance-portable, length-agnostic SIMD with runtime dispatch

@Evgeni_Popov Hey, I saw your comment on the webgpu discussions and thought you might find this useful? TLDR: they explained how they removed CPU limits on 3k+ draw calls with 10k+ geometries in zea engine.

https://github.com/ZeaInc/zea-engine

All the other talks in that vid are really interesting too, but just thought about your comment and wanted to mention the zea one specifically.

Also related to this thread’s purpose of identifying “what are the most cpu intensive tasks?”

Adding @sebavan FYI as well



A quick benchmark on a scene with ~2000 draws shows that _evaluateActiveMeshes and computeWorldMatrix take a considerable amount of CPU time, which the docs also suggest optimizing.
It could be possible to copy part of the mesh data to wasm memory, offload the matrix and vector calculations to wasm, and copy the results back, as an alternative to running _evaluateActiveMeshes in JS on every draw.

Incomplete list of data needed for _evaluateActiveMeshes
#include <stdint.h>
#include <stddef.h>

typedef struct Vec3
{
    float x;
    float y;
    float z;
} Vec3;

typedef struct Vec4
{
    float x;
    float y;
    float z;
    float w;
} Vec4;

typedef Vec4 FrustumPlanes[6];

typedef float Mat4[16];

#define MESH_IS_ROOT 1U
#define MESH_SKIP_PROCESS 2U
#define MESH_ALWAYS_ACTIVE 4U
#define MESH_COMPUTE_WORLD_MATRIX 8U
#define MESH_INVISIBLE 16U
#define MESH_INFINITE_DISTANCE 32U
#define MESH_IGNORE_NON_UNIFORM_SCALING 64U
#define MESH_USE_PIVOT 128U
#define MESH_POST_MULTIPLY_PIVOT_MATRIX 256U
#define MESH_DIRTY 512U
#define MESH_COMPUTE_WORLD_BOUNDING 1024U
#define MESH_USE_ROTATION_QUATERNION 2048U
#define MESH_RE_INTEGRATE_ROTATION_INTO_ROTATION_QUATERNION 4096U
#define MESH_USE_TRANSFORM_TO_BONE_REFERAL 8192U

#define _In_
#define _Out_

typedef uintptr_t mesh_id_t;

typedef struct Mesh
{
    _In_ const uint8_t type;
    _In_ const uint8_t billboard_mode;
    _In_ const uint16_t flags;
    _In_ const uint32_t layer_mask;
    _In_ const float scaling_determinant;
    _In_ const mesh_id_t parent_id;
    _In_ const mesh_id_t transform_to_bone_referal;
    _In_ const Vec3 bounding_minimum;
    _In_ const Vec3 bounding_maximum;
    _Out_ Vec3 bounding_minimum_world;
    _Out_ Vec3 bounding_maximum_world;
    _In_ const Vec3 scaling;
    _In_ const Vec3 position;
    _In_ _Out_ Vec3 rotation;
    _In_ _Out_ Vec4 rotation_quaternion;
    _In_ const Mat4 pivot_matrix;
    _In_ const Mat4 pivot_matrix_inverse;
    _In_ _Out_ Mat4 local_matrix;
    _Out_ Mat4 world_matrix;
} Mesh;

typedef struct Scene
{
    // note that this is a performance trade-off for wasm
    _In_ const uint32_t freeze_active_meshes;
    _In_ const uint32_t skip_world_transform;
    _In_ const uint32_t billboard_use_parent_orientation;
    _In_ const uint32_t culling_strategy;
    _In_ const uint32_t layer_mask;
    _In_ const float epsilon;
    _In_ const Vec3 camera_global_position;
    _In_ const Mat4 camera_view_matrix;
    _In_ const FrustumPlanes frustum_planes;
} Scene;
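As one example of the work that could run over these structs, the sketch below fills world-space bounds (the `bounding_minimum_world` / `bounding_maximum_world` outputs) from the local bounds and a world matrix. It redeclares the two types it needs so it stands alone. The function names are hypothetical, and it assumes a row-major matrix with the translation in elements 12..14, applied to row vectors.

```c
#include <assert.h>
#include <float.h>

typedef struct Vec3 { float x, y, z; } Vec3;
typedef float Mat4[16];

/* Transforms a point by a row-major matrix (translation in m[12..14]). */
static Vec3 transform_point(const Mat4 m, Vec3 p)
{
    Vec3 r;
    r.x = p.x * m[0] + p.y * m[4] + p.z * m[8]  + m[12];
    r.y = p.x * m[1] + p.y * m[5] + p.z * m[9]  + m[13];
    r.z = p.x * m[2] + p.y * m[6] + p.z * m[10] + m[14];
    return r;
}

/* Transforms the 8 corners of the local box and takes their min/max, which
   yields an axis-aligned box in world space enclosing the rotated box. */
void compute_world_bounds(const Mat4 world, Vec3 local_min, Vec3 local_max,
                          Vec3 *out_min, Vec3 *out_max)
{
    assert(out_min && out_max);
    Vec3 lo = {  FLT_MAX,  FLT_MAX,  FLT_MAX };
    Vec3 hi = { -FLT_MAX, -FLT_MAX, -FLT_MAX };
    for (int i = 0; i < 8; ++i) {
        /* Bits 0..2 of i select min or max on each axis. */
        Vec3 corner = {
            (i & 1) ? local_max.x : local_min.x,
            (i & 2) ? local_max.y : local_min.y,
            (i & 4) ? local_max.z : local_min.z
        };
        Vec3 w = transform_point(world, corner);
        if (w.x < lo.x) lo.x = w.x;
        if (w.y < lo.y) lo.y = w.y;
        if (w.z < lo.z) lo.z = w.z;
        if (w.x > hi.x) hi.x = w.x;
        if (w.y > hi.y) hi.y = w.y;
        if (w.z > hi.z) hi.z = w.z;
    }
    *out_min = lo;
    *out_max = hi;
}
```

Run over a contiguous array of meshes, a loop like this touches only data already sitting in wasm memory, so the JS side would just read back the two output vectors per mesh.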