10k+ Instances with MSDF Text Nameplates and inline WASM compilation


Howdy all. This is a follow-up to my original demo: JavaScript/TypeScript object mapping in GPU Shader structs for complex binding - #2 by knervous

This work is a sort of "add-on" or plugin framework that I intend to publish as an npm package. It provides a few things:

  • GPU object struct mapping: define well-known types on TS/JS objects, modify those properties, and have the changes reflected on the GPU side with simple semantics
  • Inline WASM generation via AssemblyScript and Binaryen for operating on AoS (arrays of structs): large data sets with matrix/vector math that can leverage SIMD. I know WASM has come up before here in BJS, and I think assuming developers will want to write (or maintain) code that compiles to WASM is a slippery slope. Here it is all managed dynamically: the framework generates AssemblyScript templates and compiles them at runtime. The benefit of keeping translation on the CPU side is instant query time for getting and setting these values, so vertex movement on the GPU isn't lost in the render path.
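To make the struct-mapping idea concrete, here is a minimal sketch of the concept (all names here are illustrative, not the actual package API): back a plain JS object's typed properties with a contiguous Float32Array, so ordinary property writes land directly in a buffer that is ready for GPU upload.

```typescript
// Hypothetical sketch: a Proxy routes property reads/writes into a
// contiguous Float32Array laid out like the GPU-side struct.
type FieldDef = { offset: number; size: number };

function makeStructView(
  buffer: Float32Array,
  fields: Record<string, FieldDef>
): Record<string, number[]> {
  return new Proxy({} as Record<string, number[]>, {
    get(_t, key) {
      const f = fields[key as string];
      if (!f) return undefined;
      // read the field's slice out of the backing buffer
      return Array.from(buffer.subarray(f.offset, f.offset + f.size));
    },
    set(_t, key, value: number[]) {
      const f = fields[key as string];
      if (!f) return false;
      // write lands directly in the buffer, ready for upload
      buffer.set(value.slice(0, f.size), f.offset);
      return true;
    },
  });
}

// One instance: vec4 translation at float offset 0, vec4 color at 4.
const backing = new Float32Array(8);
const actor = makeStructView(backing, {
  translation: { offset: 0, size: 4 },
  color: { offset: 4, size: 4 },
});

actor.translation = [1, 2, 3, 1];
actor.color = [0.5, 0.25, 0, 1];
console.log(backing); // [1, 2, 3, 1, 0.5, 0.25, 0, 1]
```

The real framework adds type definitions, WASM-side access, and GPU synchronization on top, but the core semantic is the same: plain property access, contiguous memory underneath.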

I wanted to get to a point where this demonstrates something useful, so here is an example setup that leverages existing code for MSDF font assets and generates N templates, scaling out to over 10k, all with one draw call, while being able to follow the translation of an "owner". This also demonstrates custom properties on "thin instances": a thin instance can have a unique color, scale, name, etc.

It will take a bit to hoist all this into an npm package, but hopefully it will be pretty plug-and-play, with good documentation and examples.

Here is a PG where I’ve been able to scale up to 20-30k instances while maintaining 200-300 FPS:

13 Likes

This is super cool! Love the usage!

1 Like

Holy Cr… This is huge !!!

1 Like

I really want this. Great work.

1 Like

Incredible!! :exploding_head:

@knervous Could you please share which browser you’re using for uncapped FPS? Do you use --args --disable-frame-rate-limit? Last time I tried, I wasn’t able to uncap FPS. Thank you!

1 Like

The video shows 60 FPS; maybe the "200-300 FPS" here means "absolute FPS".

1 Like

Depends on his monitor; you can have 200+ FPS in the browser.

I run at like 220.

1 Like

@regna Sorry, I was referring to the absolute FPS, like @kzhsw mentioned. The monitor I was rendering on was defaulting to 60 real FPS. I know that's always hardware dependent, and there are different ways to configure it from the OS/browser.
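Since "absolute" vs. "real" FPS came up a few times, here is a minimal sketch of the distinction, assuming the usual definitions (this is illustrative arithmetic, not a Babylon.js API):

```typescript
// "Absolute" FPS extrapolates from how long the frame's work actually took,
// while "real" FPS is capped by the display: requestAnimationFrame fires at
// the monitor's refresh rate, so presentation can't exceed it.
function absoluteFps(frameWorkMs: number): number {
  return 1000 / frameWorkMs;
}

function realFps(frameWorkMs: number, refreshHz: number): number {
  // the browser won't present frames faster than the refresh rate
  return Math.min(absoluteFps(frameWorkMs), refreshHz);
}

console.log(absoluteFps(4)); // 250: the engine could do 250 frames/sec
console.log(realFps(4, 60)); // 60: vsync-capped on a 60 Hz monitor
```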

Thanks all for the replies and the reception. There are a few more things I want to prove out, and then I'll start migrating this into a package with some helper extensions.

Here's an update baking a VAT (vertex animation texture) for animations, with a thin GPU implementation that works on the instance's uniform buffer via this framework rather than vertex buffers:

3 Likes

Adding 100 at a time works OK, but every time I add 1000, Firefox crashes out. Both playgrounds. :wink:

Outside of that, it runs great! Good job!

1 Like

Very very cool and advanced!! :heart_eyes:

I registered the first FPS drop, from 120 to 110, when reaching 20k instances. MacBook Pro M4 Max.

I'm curious about the specs of your rig! :flexed_biceps:

1 Like

Hey, thanks! The FPS I was monitoring was from the first PG, FWIW, without the additional VAT work. The bottleneck seems to end up being GPU-bound, with all the texel fetches. I'm going to work on variable byte-length segments depending on struct properties, the trade-off being padding for byte alignment; currently everything is RGBA, so four floats per fetch. I'm also going to get the WebGPU side working, which will necessitate writing the shader code in WGSL, but I think the StorageBuffer will be much faster.
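A rough sketch of that padding-for-alignment trade-off (illustrative only, not the package's actual layout code): pack each field at the next offset aligned to its own size, versus padding every field out to a full 16-byte (RGBA) texel.

```typescript
// Compare total struct size when every field occupies its own texel vs.
// when fields are packed to their natural alignment.
type Field = { name: string; bytes: number };

function layoutSize(fields: Field[], texelAligned: boolean): number {
  let offset = 0;
  for (const f of fields) {
    const align = texelAligned ? 16 : f.bytes;
    offset = Math.ceil(offset / align) * align; // pad up to alignment
    offset += f.bytes;
  }
  // round the total up to a whole texel so fetches stay uniform
  return Math.ceil(offset / 16) * 16;
}

const fields: Field[] = [
  { name: "translation", bytes: 16 }, // vec4
  { name: "scale", bytes: 4 },        // f32
  { name: "colorIndex", bytes: 4 },   // i32
];

console.log(layoutSize(fields, true));  // 48: one texel per field
console.log(layoutSize(fields, false)); // 32: packed, fewer fetches needed
```

Tighter packing means fewer texel fetches per instance, at the cost of more address math in the shader to slice fields out of a shared texel.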

I'm running on an M2 Mac, so it's interesting that we're seeing different results. I'd have to dig more into the work being done per frame. I was running on an external monitor capping the real FPS at 60; I wonder if that makes a difference.

1 Like

Wow, this is outstanding, and super useful :star_struck:. Count me as an interested user once this is released :slight_smile:

3 Likes

Danke! Happy to hear people have interest in using this. Bit of an update: I'm working on getting the WebGPU side dialed in and ramping up on everything related. When I'm finished, I'll update with another PG proof of concept and start package development :slight_smile:

1 Like

Good news: I was able to finish up the WGSL emission and StorageBuffer backing for WebGPU, so this works in both engines now. WebGPU doesn't have a strong lead, but it scales up a bit better at 15k+ instances.

Big things to point out: all the instance translation (changing color, position, etc.) in WASM stays under 1ms per frame up to around 20k entities, so mutating big arrays in WASM as a proof of concept is :white_check_mark: . Thinking about general applications, there could be types of motion functions whose behavior is defined by input parameters; you'd just let them fly, interpolating every frame, and keep it all CPU-side.
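The parameterized motion-function idea might look something like this sketch (hypothetical API, CPU-side TypeScript standing in for the WASM kernel): define a motion function once with its parameters, then run it each frame over the whole AoS position buffer.

```typescript
// A motion function maps (time, instance index) to a position; applyMotion
// sweeps it over every instance's xyzw slot in the packed buffer.
type MotionFn = (t: number, i: number) => [number, number, number];

function applyMotion(
  positions: Float32Array, // xyzw per instance, AoS layout
  count: number,
  t: number,
  fn: MotionFn
): void {
  for (let i = 0; i < count; i++) {
    const [x, y, z] = fn(t, i);
    positions[i * 4 + 0] = x;
    positions[i * 4 + 1] = y;
    positions[i * 4 + 2] = z;
    // .w (scale) is left untouched
  }
}

// Example parameterization: a circular orbit, phase-offset per instance.
const orbit = (radius: number, speed: number): MotionFn =>
  (t, i) => [
    radius * Math.cos(speed * t + i),
    0,
    radius * Math.sin(speed * t + i),
  ];

const positions = new Float32Array(4 * 2); // two instances
applyMotion(positions, 2, 0, orbit(5, 1));
```

The WASM version would do the same sweep over the shared linear memory, so the GPU-facing buffer is always current without per-instance JS calls.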

5 draw calls for n instances + unique nameplates. This was really the goal here: to be able to instance structs with any custom data and limit rendering to 1 draw call.

Will be working on the npm package from this point on.

Here is WebGPU:

And the latest WebGL2:

6 Likes

Been focusing on getting this package in good shape for an initial release and have been proving out other useful concepts with the WASM kernels.

I was curious what custom culling on instances would look like when scaling up. Here is a quick snapshot of 70k instances in memory, using SIMD in WASM to do a frustum-culling pass each frame, which takes less than 1ms of CPU time. This is effectively instance culling, since it controls the instance draw count; there is some index indirection in the code to point back to the true contiguous list of visible instances.

This is using the npm release of shader-object, which is close to finished. I will throw the source up on GitHub soon and create a "how to" video.

Here is the PG with the above sample.

Here is the relevant AssemblyScript WASM SIMD pass for frustum culling… geeking out over how performant this is; pretty cool stuff.


@inline
function loadPlane(ptr: usize): v128 { return v128.load(ptr); }

@inline
function planeNormal0(p: v128): v128 {
  // zero W so mul ignores instance .w (scale)
  return f32x4.replace_lane(p, 3, 0.0);
}

@inline
function planeD(p: v128): f32 { return f32x4.extract_lane(p, 3); }

// Reference per-instance test (the hot loop in frustumMarkAoS inlines this logic).
// Returns 1 if inside, 0 if outside, with sphere radius = baseRadius * scale
@inline
function inside6(pos: v128, baseRadius: f32, pn0s: StaticArray<v128>, ds: StaticArray<f32>): i32 {
  const scale = f32x4.extract_lane(pos, 3);
  const radius = baseRadius * scale;

  // For each plane: dot(n, xyz) + d >= -radius
  for (let k = 0; k < 6; k++) {
    const n0 = unchecked(pn0s[k]);
    const d  = unchecked(ds[k]);
    const mul = f32x4.mul(pos, n0);
    // horizontal sum of xyz lanes
    const dot = f32x4.extract_lane(mul, 0)
              + f32x4.extract_lane(mul, 1)
              + f32x4.extract_lane(mul, 2);
    const signed = dot + d;
    if (signed < -radius) return 0;
  }
  return 1;
}

export function frustumMarkAoS(
  base: usize,
  planesPtr: usize,
  baseRadius: f32
): void {
  const h = changetype<InstancePoolHeader>(base);
  const count = <i32>h.instancesCount;
  if (count <= 0) {
    h.visibleCount = 0;
    return;
  }

  // Preload planes, derive (n, d) split once
  const p0 = loadPlane(planesPtr +  0 * 16);
  const p1 = loadPlane(planesPtr +  1 * 16);
  const p2 = loadPlane(planesPtr +  2 * 16);
  const p3 = loadPlane(planesPtr +  3 * 16);
  const p4 = loadPlane(planesPtr +  4 * 16);
  const p5 = loadPlane(planesPtr +  5 * 16);

  const pn0s = StaticArray.fromArray<v128>([
    planeNormal0(p0), planeNormal0(p1), planeNormal0(p2),
    planeNormal0(p3), planeNormal0(p4), planeNormal0(p5)
  ]);
  const ds = StaticArray.fromArray<f32>([
    planeD(p0), planeD(p1), planeD(p2),
    planeD(p3), planeD(p4), planeD(p5)
  ]);

  
  // read pointer walks all instances
  let readPtr = h.instancesPtr;
  // write-head points at the *packed* area’s next visibleIndex slot (at array head)
  let writeHead = h.instancesPtr + <usize>OFFSET_ActorInstance_visibleIndex;

  let visCount = 0;

  for (let i = 0; i < count; i++) {
    store<i32>(readPtr + <usize>OFFSET_ActorInstance_visibleIndex, -1);
    const pos = v128.load(readPtr + <usize>OFFSET_ActorInstance_translation);
    // sphere radius scales with the instance's .w lane; constant per instance,
    // so hoist it out of the plane loop
    const radius = baseRadius * f32x4.extract_lane(pos, 3);
    let inside = 1;
    for (let k = 0; k < 6; k++) {
      const n0 = unchecked(pn0s[k]);
      const d  = unchecked(ds[k]);
      const mul = f32x4.mul(pos, n0);
      // horizontal sum of xyz lanes
      const dot = f32x4.extract_lane(mul, 0)
               + f32x4.extract_lane(mul, 1)
               + f32x4.extract_lane(mul, 2);
      if (dot + d < -radius) { inside = 0; break; }
    }

    if (inside) {
      store<i32>(writeHead, i);
      writeHead += <usize>SIZEOF_ActorInstanceHeader;
      visCount++;
    }
    readPtr += <usize>SIZEOF_ActorInstanceHeader;
  }
  h.visibleCount = visCount;
}

UPDATE: Added a distance-check culling portion to the cull function for easy cull paths. I wanted to stress-test the memory limits of the ShaderObject. Here is 150k instances running at 60 FPS.
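For reference, the distance check itself is just a squared-distance comparison; here is a minimal CPU-side sketch of the idea (illustrative, not the actual WASM kernel), which avoids a sqrt per instance by comparing against the squared max range:

```typescript
// Cull any instance whose squared distance from the camera exceeds the
// squared max range; compare squares so no sqrt is needed per instance.
function withinRange(
  pos: [number, number, number],
  cam: [number, number, number],
  maxDist: number
): boolean {
  const dx = pos[0] - cam[0];
  const dy = pos[1] - cam[1];
  const dz = pos[2] - cam[2];
  return dx * dx + dy * dy + dz * dz <= maxDist * maxDist;
}

console.log(withinRange([0, 0, 10], [0, 0, 0], 15)); // true: kept
console.log(withinRange([0, 0, 20], [0, 0, 0], 15)); // false: culled
```

In the SIMD pass this slots in next to the plane tests, marking the instance invisible before the frustum check even runs.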

2 Likes