Unintended branch optimization

I have a conditional WebGPU WGSL vertex shader set up like so:

@vertex
fn vs_main(input: VertexInput) -> VertexOutput {
    if (cond A && cond B) {
        // some logic dependent on instance index
        ...
        return VertexOutput( ... );
    } 
    // some other logic dependent on instance index
    ...
    return VertexOutput( ... );
}

Here cond A is driven by a uniform, uRenderParam.renderPass, and the condition is uRenderParam.renderPass == 1u.

I then use two different render passes back to back like so:

methodToSetTheUniformToZero();

renderPass1 = commandEncoder.beginRenderPass();
...
renderPass1.drawIndirect(indirectDrawBuffer1);
renderPass1.end();

methodToSetTheUniformToOne();

renderPass2 = commandEncoder.beginRenderPass();
...
renderPass2.drawIndirect(indirectDrawBuffer2);
renderPass2.end();

Cond B is set to true via another uniform on the mouse down event, and set back to false on mouse up event.

But what I’m observing is that while cond B is true, only the branch guarded by cond A && cond B appears to run, yet with the number of indirect draws I would expect from indirectDrawBuffer1, which makes no sense.

Initially, I thought that maybe the driver “optimizes” by hoisting or pruning away one code path. I realize this wouldn’t fully explain the observed behavior, but it was my initial suspicion that it was something of this nature causing the problem.

So I refactored the shader like so:

@vertex
fn vs_main(input: VertexInput) -> VertexOutput {

  var output: VertexOutput;

  let RED   = vec4<f32>(1.0, 0.0, 0.0, 1.0);
  let BLUE = vec4<f32>(0.0, 0.0, 1.0, 1.0);
  
  if (cond A && cond B) {
      // some logic dependent on instance index
      ...
      output.vColor = RED;
  } else {
      // some other logic dependent on instance index
      ...
      output.vColor = BLUE;
  }

  return output;
}

And still, while cond B is true (during the mouse down event), I only see RED, where I would expect to see both present due to both render passes running per frame back to back.

In both iterations, if I’m not holding the mouse down, I only see the output from the latter branch, setting vColor to BLUE, which is expected. It’s when I am holding the mouse down that I would expect to see a mixture of both.

If anyone has had similar experiences, or understands what is happening in this scenario I would really appreciate hearing from you.

OS: macOS
Hardware: Apple silicon
Browser: Chrome

My updated understanding:

  1. One GPUBuffer for both passes.
    Under the hood, queue.writeBuffer (or whatever staging copy you use) only schedules a write operation; the GPU doesn’t necessarily “sample” the contents at the moment you bind the group, but rather when it actually executes that draw command.
  2. Second write overwrote the first.
    By the time the GPU gets around to executing the first pass’s draw call, my second uniform update has already overwritten the contents of the underlying GPU buffer. Both passes end up seeing the last-written data.
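
To make the two points above concrete, here is a toy model of the ordering. It is not the WebGPU API: ToyQueue, ToyEncoder, and runFrame are hypothetical names made up for illustration, and the model only assumes that writes apply when called while recorded draws run at submit time.

```javascript
// Toy model of the ordering problem (NOT the real WebGPU API; all names are hypothetical).
class ToyQueue {
  constructor() {
    this.log = []; // what each executed draw observed, in execution order
  }
  // Stands in for queue.writeBuffer: applied ahead of any command buffer
  // that has not been submitted yet.
  writeBuffer(buffer, value) {
    buffer.value = value;
  }
  // Stands in for queue.submit: only now do the recorded draws actually run.
  submit(commands) {
    for (const cmd of commands) cmd(this.log);
  }
}

class ToyEncoder {
  constructor() {
    this.commands = [];
  }
  // Recording a draw keeps a reference to the buffer, not a snapshot of its contents.
  draw(label, uniformBuffer) {
    this.commands.push((log) => log.push(`${label} sees renderPass=${uniformBuffer.value}`));
  }
  finish() {
    return this.commands;
  }
}

function runFrame() {
  const queue = new ToyQueue();
  const encoder = new ToyEncoder();
  const sharedUniform = { value: null }; // one buffer shared by both passes

  queue.writeBuffer(sharedUniform, 0); // intended for pass 1
  encoder.draw("pass1", sharedUniform);
  queue.writeBuffer(sharedUniform, 1); // intended for pass 2, but clobbers pass 1's value
  encoder.draw("pass2", sharedUniform);

  queue.submit(encoder.finish()); // both draws execute here
  return queue.log;
}

console.log(runFrame());
// → [ 'pass1 sees renderPass=1', 'pass2 sees renderPass=1' ]
```

Both draws report renderPass=1, which matches what I saw on screen: only the RED branch while the mouse is held down.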

Fix:
Use two separate GPUBuffers, each with its own bind group, to back the same uniform binding in the vertex shader used by both draw calls; bind bindgroup_pass_1 in the first render pass and bindgroup_pass_2 in the second.
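
A minimal sketch of that fix in raw WebGPU calls, under some assumptions: the uniform sits at group 0, binding 0, it is a single u32 padded to 16 bytes, and the pipeline uses an auto-generated layout. createPerPassUniforms, encodeFrame, and the shape of the passes array are hypothetical helpers for illustration, not my actual code.

```javascript
// Sketch only: `device`, `pipeline`, the render pass descriptors, and the indirect
// buffers are assumed to exist. Group 0 / binding 0 and the 16-byte size are placeholders.
function createPerPassUniforms(device, pipeline, passCount) {
  const layout = pipeline.getBindGroupLayout(0);
  const perPass = [];
  for (let i = 0; i < passCount; i++) {
    // One uniform buffer per render pass instead of one shared buffer.
    const buffer = device.createBuffer({
      size: 16, // one u32 (renderPass) padded to 16 bytes
      usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
    });
    const bindGroup = device.createBindGroup({
      layout,
      entries: [{ binding: 0, resource: { buffer } }],
    });
    perPass.push({ buffer, bindGroup });
  }
  return perPass;
}

// `passes` is an array of { descriptor, indirectBuffer }, one entry per render pass.
function encodeFrame(device, pipeline, perPass, passes) {
  const encoder = device.createCommandEncoder();
  passes.forEach(({ descriptor, indirectBuffer }, i) => {
    // Each pass writes to its own buffer, so a later write cannot clobber an earlier one.
    device.queue.writeBuffer(perPass[i].buffer, 0, new Uint32Array([i, 0, 0, 0]));
    const pass = encoder.beginRenderPass(descriptor);
    pass.setPipeline(pipeline);
    pass.setBindGroup(0, perPass[i].bindGroup);
    pass.drawIndirect(indirectBuffer, 0);
    pass.end();
  });
  // Both passes still go out in a single submit.
  device.queue.submit([encoder.finish()]);
}
```

The bind groups only need to be created once; per frame, only the writeBuffer calls and the pass encoding repeat.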


Indeed, queue.writeBuffer takes effect at the time you call it, whereas draw calls are only recorded into a command buffer and are executed when that command buffer is passed to queue.submit (which is done at the end of the frame in Babylon). So, if you do writeBuffer1, draw1, writeBuffer2, draw2, both writeBuffer1 and writeBuffer2 will happen before draw1 and draw2 are executed.

You can call engine.flushFramebuffer to execute the commands currently in the queue, but this is not recommended as it may adversely affect performance.
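
For comparison, here is roughly what forcing the order looks like in raw WebGPU calls: submit each pass before the next uniform write, so the single shared buffer still holds the right value when each pass executes. This is only a sketch; whether engine.flushFramebuffer does exactly this internally is an assumption, and encodeFrameWithSubmitPerPass and the encodePass callback are hypothetical names.

```javascript
// Sketch only: one command encoder and one submit per pass. Queue operations run in
// the order they are issued, so each write lands before the submit that follows it.
// `device`, `uniformBuffer`, and `passes` are assumed inputs.
function encodeFrameWithSubmitPerPass(device, uniformBuffer, passes) {
  let submits = 0;
  passes.forEach(({ encodePass }, i) => {
    // Safe to reuse the same buffer: the previous pass has already been submitted.
    device.queue.writeBuffer(uniformBuffer, 0, new Uint32Array([i, 0, 0, 0]));
    const encoder = device.createCommandEncoder();
    encodePass(encoder); // hypothetical callback: beginRenderPass, draw, end
    device.queue.submit([encoder.finish()]);
    submits++;
  });
  return submits; // one submit per pass, which is the extra cost of this approach
}
```

The extra submits are where the performance cost comes from, which is why separate buffers and bind groups within a single submit are usually the better choice.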


You can call engine.flushFramebuffer to execute the commands currently in the queue, but this is not recommended as it may adversely affect performance.

Yeah, I was reading about the performance impact of using multiple command encoders, each with its own render pass and draw call, versus multiple render passes on the same command encoder, and the latter is far more performant (from what I can tell).