WebGPU compute shader: Babylon.js vs plain JS speed comparison

Hi guys!

There is quite a difference in the time it takes to read the result from the storage buffer with Babylon.js versus without it. Does anybody know what adds the overhead in Babylon.js compared to the ‘native’ approach? Could it be the render loop running in Babylon.js?

Another question: if I stop the render loop, I can’t read the buffer. Is there a way to run a compute shader and read its output without the render loop running?

Babylon.js: 6.1830078125 ms
without Babylon.js: 3.323974609375 ms

Running on a MacBook M1 Max.

This Babylon.js code:

  // ComputeShader and StorageBuffer come from @babylonjs/core;
  // `engine` is a WebGPUEngine and `computeShaderSource` holds the WGSL below.
  const valueCount = 20000;

  // Fill the input with 0..valueCount-1.
  const inputDataArray = [];
  for (let i = 0; i < valueCount; i++) {
    inputDataArray.push(i);
  }
  const inputData = Float32Array.from(inputDataArray);

  const cs = new ComputeShader(
    "doubleValuesCompute",
    engine,
    { computeSource: computeShaderSource },
    {
      bindingsMapping: {
        values: { group: 0, binding: 0 },
      },
    }
  );

  // Storage buffer that holds both the input and the output.
  const bufferValues = new StorageBuffer(engine, inputData.byteLength);
  bufferValues.update(inputData);

  cs.setStorageBuffer("values", bufferValues);

  await cs.dispatchWhenReady(inputData.length);
  cs.dispatch(inputData.length);

  console.time("read the buffer");
  const bufferValuesView = await bufferValues.read();
  const results = new Float32Array(bufferValuesView.buffer);
  console.timeEnd("read the buffer");

Compute shader:

// One invocation (one workgroup of size 1) per array element,
// so the dispatch count must equal the element count.
@binding(0) @group(0) var<storage, read_write> values : array<f32>;

@compute @workgroup_size(1)
fn main(@builtin(global_invocation_id) gId : vec3<u32>) {
    let index : u32 = gId.x;
    values[index] = values[index] * 2.0;
}

Reads the output in:
read the buffer: 5.5830078125 ms
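
Side note: if I read the API right, dispatchWhenReady already performs a dispatch once the shader is ready, so the extra dispatch call above runs the kernel a second time and each value ends up doubled twice. A quick sanity check:

console.log(results[1]); // expected 4 (1 * 2 * 2) if the kernel ran twice, not 2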


This code without Babylon.js:

  const adapter = await navigator.gpu?.requestAdapter();
  const device = await adapter?.requestDevice();
  if (!device) {
    throw new Error('need a browser that supports WebGPU');
  }

  const module = device.createShaderModule({
    label: 'doubling compute module',
    code: `
      @group(0) @binding(0) var<storage, read_write> data: array<f32>;

      @compute @workgroup_size(1) fn computeSomething(
        @builtin(global_invocation_id) id: vec3<u32>
      ) {
        let i = id.x;
        data[i] = data[i] * 2.0;
      }
    `,
  });

  const pipeline = device.createComputePipeline({
    label: 'doubling compute pipeline',
    layout: 'auto',
    compute: {
      module,
      entryPoint: 'computeSomething',
    },
  });

  const valueCount = 20000;

  const inputDataArray = [];
  for (let i = 0; i < valueCount; i++) {
    inputDataArray.push(i);
  }
  const input = Float32Array.from(inputDataArray);

  // create a buffer on the GPU to hold our computation
  // input and output
  const workBuffer = device.createBuffer({
    label: 'work buffer',
    size: input.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST,
  });
  // Copy our input data to that buffer
  device.queue.writeBuffer(workBuffer, 0, input);

  // create a buffer on the GPU to get a copy of the results
  const resultBuffer = device.createBuffer({
    label: 'result buffer',
    size: input.byteLength,
    usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
  });

  // Setup a bindGroup to tell the shader which
  // buffer to use for the computation
  const bindGroup = device.createBindGroup({
    label: 'bindGroup for work buffer',
    layout: pipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: { buffer: workBuffer } },
    ],
  });

  // Encode commands to do the computation
  const encoder = device.createCommandEncoder({
    label: 'doubling encoder',
  });
  const pass = encoder.beginComputePass({
    label: 'doubling compute pass',
  });
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(input.length);
  pass.end();

  // Encode a command to copy the results to a mappable buffer.
  encoder.copyBufferToBuffer(workBuffer, 0, resultBuffer, 0, resultBuffer.size);

  // Finish encoding and submit the commands
  const commandBuffer = encoder.finish();
  device.queue.submit([commandBuffer]);

  // Read the results
  console.time('read the buffer')
  await resultBuffer.mapAsync(GPUMapMode.READ);
  const result = new Float32Array(resultBuffer.getMappedRange().slice());
  resultBuffer.unmap();
  console.timeEnd('read the buffer')

reads the buffer nearly twice as fast:
read the buffer: 3.323974609375 ms
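
As an aside, a single console.time around one readback mostly measures GPU-to-CPU synchronization latency, which is quite noisy; averaging several submissions gives a steadier number. A sketch, reusing device, pipeline, bindGroup, workBuffer, resultBuffer and input from the snippet above:

// Average the readback over several runs to reduce noise.
// (Each run doubles the values again, which is fine for timing.)
const runs = 20;
let totalMs = 0;
for (let i = 0; i < runs; i++) {
  const enc = device.createCommandEncoder();
  const p = enc.beginComputePass();
  p.setPipeline(pipeline);
  p.setBindGroup(0, bindGroup);
  p.dispatchWorkgroups(input.length);
  p.end();
  enc.copyBufferToBuffer(workBuffer, 0, resultBuffer, 0, resultBuffer.size);
  device.queue.submit([enc.finish()]);

  const t0 = performance.now();
  await resultBuffer.mapAsync(GPUMapMode.READ);
  resultBuffer.getMappedRange(); // touch the mapped data
  resultBuffer.unmap();
  totalMs += performance.now() - t0;
}
console.log(`average read: ${(totalMs / runs).toFixed(3)} ms over ${runs} runs`);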

Thank you!

You don’t measure the same thing, because StorageBuffer.read postpones the read to the end of the frame, so that it happens after the command buffer has been submitted.
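
In practice that means the promise returned by read only settles once a frame has been processed, so the usual pattern is (a minimal sketch, assuming the standard engine/scene setup):

engine.runRenderLoop(() => {
  scene.render();
});

// The deferred read resolves at the end of a frame, so the render loop
// (or at least one rendered frame) must be running.
const view = await bufferValues.read();
console.log(new Float32Array(view.buffer)[1]);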


…and that’s why it doesn’t work when the render loop is not running. Now it makes sense.
Thank you!

Note that you can pass true for the noDelay parameter of StorageBuffer.read if you want to read the result right away without a render loop running.
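
For example (a sketch assuming noDelay is the fourth parameter of read, after offset, size and an optional target buffer; check the signature for your Babylon.js version):

// Read back immediately, without waiting for the end of the current frame.
const view = await bufferValues.read(undefined, undefined, undefined, true);
const results = new Float32Array(view.buffer);
console.log(results[1]);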
