Batched or Instanced BoundingBoxRenderer

Version: git master (73ab4d6)

Status

Merged and released in 7.37.2. Latest example here: https://playground.babylonjs.com/#NRNVQA#2

Background

Currently, BoundingBoxRenderer renders in a loop, calling engine.drawElementsType for each bounding box in renderList, so every bounding box costs 1 or 2 draw calls.
Since it renders on the web, it suffers from the same performance problem as meshes when the number of draw calls increases.

Proposal

Since only the worldMatrix differs between bounding boxes, rendering should benefit from instancing, as thin instances do for meshes.

  1. Allocate buffers for instancing.
  2. Loop over renderList and fill the buffers.
  3. Copy the buffers to the GPU.
  4. Render instanced.

With instancing, the draw calls needed for each BoundingBoxRenderer could be reduced to 1, or 2 when this.showBackLines is enabled.
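
Steps 1 and 2 above can be sketched as follows. This is a minimal, self-contained illustration, not the renderer's actual code: `BoxInput`, `fillInstanceMatrices`, and the flat `Float32Array` layout are hypothetical names, and it assumes the shared geometry is a unit cube centered at the origin and that matrices use a row-major layout with translation in elements 12 to 14.

```typescript
// Hypothetical input: local-space extents plus a 16-float world matrix
// (row-major, translation in elements 12..14).
interface BoxInput {
    min: [number, number, number];
    max: [number, number, number];
    world: Float32Array;
}

// Writes one 4x4 matrix per box into `out`, allocating a larger buffer when
// `out` is too small (step 1), and returns the buffer to upload (step 3).
function fillInstanceMatrices(boxes: BoxInput[], out: Float32Array): Float32Array {
    const needed = boxes.length * 16;
    const buffer: Float32Array = out.length >= needed ? out : new Float32Array(needed);
    boxes.forEach((box, i) => {
        const o = i * 16;
        for (let col = 0; col < 4; col++) {
            let translated = box.world[12 + col];
            for (let axis = 0; axis < 3; axis++) {
                const size = box.max[axis] - box.min[axis];
                const center = box.min[axis] + size * 0.5;
                // Rows 0..2: unit cube scaled to the box size.
                buffer[o + axis * 4 + col] = size * box.world[axis * 4 + col];
                // Row 3: box center carried through the world matrix.
                translated += center * box.world[axis * 4 + col];
            }
            buffer[o + 12 + col] = translated;
        }
    });
    return buffer;
}
```

The returned buffer would then be bound as a per-instance attribute and drawn with one instanced call, or two when back lines are shown.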

Example:

Alternatives

Since very few[1] vertices are needed, and BoundingBox already has vectorsWorld, which is updated every time the world matrix changes, the matrix computation can be skipped: the vertex buffer can be rebuilt from vectorsWorld every time it renders. In this case, _indexBuffer needs to be reconstructed whenever the number of bounding boxes changes.

[1]: 24 points when using CreateBoxVertexData, or 8 points if constructed manually, since UVs and normals are not needed for rendering lines.
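
A rough sketch of the index side of this alternative, assuming 8 corners per box. The corner numbering is chosen here purely for illustration (bit 0 selects max x, bit 1 max y, bit 2 max z) and is not necessarily the order used by vectorsWorld; `buildLineIndices` and `IndexCache` are hypothetical names. It shows that the indices depend only on the box count, so they can be cached until that count changes.

```typescript
// Returns 24 line-list indices (12 edges) per box, with 8 corners per box.
function buildLineIndices(boxCount: number): Uint32Array {
    const indices = new Uint32Array(boxCount * 24);
    let write = 0;
    for (let box = 0; box < boxCount; box++) {
        const base = box * 8;
        for (let corner = 0; corner < 8; corner++) {
            for (let bit = 1; bit <= 4; bit <<= 1) {
                // Emit each edge once, starting from the corner with the bit cleared.
                if ((corner & bit) === 0) {
                    indices[write++] = base + corner;
                    indices[write++] = base + (corner | bit);
                }
            }
        }
    }
    return indices;
}

// Rebuilds the indices only when the number of boxes changes.
class IndexCache {
    private count = -1;
    private indices: Uint32Array = new Uint32Array(0);

    get(boxCount: number): Uint32Array {
        if (boxCount !== this.count) {
            this.indices = buildLineIndices(boxCount);
            this.count = boxCount;
        }
        return this.indices;
    }
}
```

The vertex buffer, by contrast, would be refilled every frame with 8 world-space corners per box.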

It looks like a good idea! Do you want to make a PR for it?

However, you should implement it with a flag (like useInstancing, default: false), so that it's not a breaking change: currently, onBeforeBoxRenderingObservable and onAfterBoxRenderingObservable are notified for each bounding box, and it should still be the default behavior.

What about this optimization of drawing boxes with 4 lines of 4 vertices each instead of 12 lines of 2 vertices each?

I don't think it changes anything at the GPU level, as the current bounding box renderer already issues a single draw call with 24 indices (2 indices per line), which can't be optimized any further:

(screenshot from Spector)

Ah. Is that because drawElements doesn't have a multiline primitive? It looks like the multidraw extension to WebGL would help, but it's not supported everywhere (Firefox lacks support).

drawElements already draws multiple lines: you give it a list of indices and it draws lines between indices 0 and 1, 2 and 3, and so on. Maybe I didn't understand your question?

Agreed, drawElements can draw multiple single-segment lines (lines defined by two points each). By "multilines" I meant lines made up of multiple segments each. In my case, each multiline is three segments defined by four points: points 0, 1, 2, 3 result in a multiline made of line segments 0->1, 1->2, and 2->3.

The more "efficient" box definition uses four such multilines and needs only 16 vertices in total instead of 24. But if the end result is a drawElements() call, which can only draw single-segment lines (defined by exactly two points each), then there is no saving in space or time.
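
To make the counting concrete, here is a small stand-alone sketch that expands polylines into the index pairs a line-list draw consumes. `stripToSegments` is a hypothetical helper and the vertex numbering is arbitrary; it only illustrates that four 4-point polylines still expand to 12 segments, that is, 24 indices.

```typescript
// Expands one polyline (a list of vertex indices) into line-list pairs.
function stripToSegments(strip: number[]): number[] {
    const pairs: number[] = [];
    for (let i = 0; i + 1 < strip.length; i++) {
        pairs.push(strip[i], strip[i + 1]);
    }
    return pairs;
}

// Four 4-point polylines (16 vertices in total), numbered consecutively.
const strips: number[][] = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
];

let lineList: number[] = [];
for (const strip of strips) {
    lineList = lineList.concat(stripToSegments(strip));
}

console.log(lineList.length); // 24
```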

Since draws are batched, when should the events trigger? Like this:

  1. loop renderList, make matrices and trigger onBeforeBoxRenderingObservable
  2. make draws
  3. loop renderList, make matrices and trigger onAfterBoxRenderingObservable

In this case, if one uses the two observables to change rendering parameters for each box, it might not work as expected.

We would trigger each event only a single time (passing a dummy/undefined box), not for each box. That would be the "breaking change" part (as well as the display being potentially different, because drawing the black/white part of each box one after the other can be different from drawing the black part of all boxes and then the white part).
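
The difference in notification counts can be illustrated with a stand-alone sketch. `Notify`, `renderPerBox`, and `renderInstanced` are hypothetical stand-ins rather than Babylon.js APIs, and `undefined` plays the role of the dummy box.

```typescript
type Notify<T> = (box: T | undefined) => void;

// Per-box mode: both callbacks fire for every box, around its own draw.
function renderPerBox<T>(
    boxes: T[],
    before: Notify<T>,
    after: Notify<T>,
    draw: (box: T) => void
): void {
    for (const box of boxes) {
        before(box);
        draw(box);
        after(box);
    }
}

// Instanced mode: each callback fires once, around a single batched draw.
function renderInstanced<T>(
    boxes: T[],
    before: Notify<T>,
    after: Notify<T>,
    drawAll: (boxes: T[]) => void
): void {
    before(undefined);
    drawAll(boxes);
    after(undefined);
}
```

With N boxes, the first function notifies each callback N times and the second only once, so a callback that tweaks state per box has no effect on individual boxes in the batched path.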

Updated to use a dummy bounding box, but what is your opinion on whether or not to keep a renderList in DummyBoundingBox?

Comparing performance with and without SIMD:

Without SIMD (avg 3.694 ms on my local Chrome):

With SIMD (avg 3.277 ms on my local Chrome, ~11% faster):

source
#include <stddef.h>
#include <stdint.h>
#include <cglm/cglm.h>

// Defined by the WebAssembly linker; marks the start of the heap.
extern unsigned char __heap_base;

uintptr_t get_heap_base() {
    // Round up to a 64-byte boundary
    return (((uintptr_t) (&__heap_base)) + 63) & ~(uintptr_t) 63;
}

// minmax: `count` pairs of (min, max), each stored as 4 floats.
// mat: `count` 4x4 world matrices, updated in place.
unsigned bbox_compose(float * minmax, vec4 * mat, size_t count) {
    CGLM_ALIGN_MAT mat4 tmp_mat;
    CGLM_ALIGN_MAT vec4 diff, median;
    glm_mat4_identity(tmp_mat);
    float * m = (float *) tmp_mat;
    for (size_t i = 0; i < count; i++) {
        float * min = minmax;
        minmax += 4;
        float * max = minmax;
        minmax += 4;
        glm_vec4_sub(max, min, diff);        // Box size
        glm_vec4_scale(diff, 0.5f, median);
        glm_vec4_add(min, median, median);   // Box center
        // cglm matrices are column-major: scale sits on the diagonal and
        // translation in the last column (elements 12..14).
        m[0] = diff[0];     // Scale X
        m[12] = median[0];  // Translate X

        m[5] = diff[1];     // Scale Y
        m[13] = median[1];  // Translate Y

        m[10] = diff[2];    // Scale Z
        m[14] = median[2];  // Translate Z
        glm_mat4_mul(mat, tmp_mat, mat);
        mat += 4;
    }
    return (unsigned) count;
}

I don't think we need a renderList (I'm not sure what the user would do with it). We could simply document that in instanced mode the passed bounding box has no meaning and should be ignored (it would be better not to pass a bounding box at all, but that would be a breaking change).

OK, I'll remove the renderList from the events. Also, since SIMD does not give the expected performance boost, I'd prefer not to use it.

And the WebGPU/WGSL port:

PR here:

And playground targeting this PR:
