WebGPU compute shader: Babylon.js vs plain JS speed comparison

Hi guys!

There is quite a difference in the time it takes to read the result from the storage buffer with Babylon.js versus without it. Does anybody know what adds the overhead in Babylon.js compared to the ‘native’ approach? Could it be the render loop running in Babylon.js?

Another question: if I stop the render loop, I can’t read the buffer. Is there a way to run a compute shader and read its output without the render loop running?

Babylon.js: 6.1830078125 ms
without Babylon.js: 3.323974609375 ms

Running on a MacBook M1 Max.

This Babylon.js code:

  // ComputeShader and StorageBuffer come from @babylonjs/core;
  // `engine` is a WebGPUEngine and `computeShaderSource` holds the WGSL below.
  const valueCount = 20000;

  // Fill the input with 0..valueCount-1.
  const inputDataArray = [];
  for (let i = 0; i < valueCount; i++) {
    inputDataArray.push(i);
  }
  const inputData = Float32Array.from(inputDataArray);

  const cs = new ComputeShader(
    "doubleValuesCompute",
    engine,
    { computeSource: computeShaderSource },
    {
      bindingsMapping: {
        values: { group: 0, binding: 0 },
      },
    }
  );

  // Storage buffer that holds both the input and the output.
  const bufferValues = new StorageBuffer(engine, inputData.byteLength);
  bufferValues.update(inputData);

  cs.setStorageBuffer("values", bufferValues);

  await cs.dispatchWhenReady(inputData.length);
  cs.dispatch(inputData.length);

  console.time("read the buffer");
  const bufferValuesView = await bufferValues.read();
  const results = new Float32Array(bufferValuesView.buffer);
  console.timeEnd("read the buffer");

Compute shader:

// One invocation (one workgroup of size 1) per array element,
// so the dispatch count must equal the element count.
@binding(0) @group(0) var<storage, read_write> values : array<f32>;

@compute @workgroup_size(1)
fn main(@builtin(global_invocation_id) gId : vec3<u32>) {
    let index : u32 = gId.x;
    values[index] = values[index] * 2.0;
}

Reads the output in:
read the buffer: 5.5830078125 ms
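
Side note: if I read the API right, dispatchWhenReady already performs a dispatch once the shader is ready, so the extra dispatch call above runs the kernel a second time and each value ends up doubled twice. A quick sanity check:

console.log(results[1]); // expected 4 (1 * 2 * 2) if the kernel ran twice, not 2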


This code without Babylon.js:

  const adapter = await navigator.gpu?.requestAdapter();
  const device = await adapter?.requestDevice();
  if (!device) {
    throw new Error('need a browser that supports WebGPU');
  }

  const module = device.createShaderModule({
    label: 'doubling compute module',
    code: `
      @group(0) @binding(0) var<storage, read_write> data: array<f32>;

      @compute @workgroup_size(1) fn computeSomething(
        @builtin(global_invocation_id) id: vec3<u32>
      ) {
        let i = id.x;
        data[i] = data[i] * 2.0;
      }
    `,
  });

  const pipeline = device.createComputePipeline({
    label: 'doubling compute pipeline',
    layout: 'auto',
    compute: {
      module,
      entryPoint: 'computeSomething',
    },
  });

  const valueCount = 20000;

  const inputDataArray = [];
  for (let i = 0; i < valueCount; i++) {
    inputDataArray.push(i);
  }
  const input = Float32Array.from(inputDataArray);

  // create a buffer on the GPU to hold our computation
  // input and output
  const workBuffer = device.createBuffer({
    label: 'work buffer',
    size: input.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST,
  });
  // Copy our input data to that buffer
  device.queue.writeBuffer(workBuffer, 0, input);

  // create a buffer on the GPU to get a copy of the results
  const resultBuffer = device.createBuffer({
    label: 'result buffer',
    size: input.byteLength,
    usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
  });

  // Setup a bindGroup to tell the shader which
  // buffer to use for the computation
  const bindGroup = device.createBindGroup({
    label: 'bindGroup for work buffer',
    layout: pipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: { buffer: workBuffer } },
    ],
  });

  // Encode commands to do the computation
  const encoder = device.createCommandEncoder({
    label: 'doubling encoder',
  });
  const pass = encoder.beginComputePass({
    label: 'doubling compute pass',
  });
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(input.length);
  pass.end();

  // Encode a command to copy the results to a mappable buffer.
  encoder.copyBufferToBuffer(workBuffer, 0, resultBuffer, 0, resultBuffer.size);

  // Finish encoding and submit the commands
  const commandBuffer = encoder.finish();
  device.queue.submit([commandBuffer]);

  // Read the results
  console.time('read the buffer')
  await resultBuffer.mapAsync(GPUMapMode.READ);
  const result = new Float32Array(resultBuffer.getMappedRange().slice());
  resultBuffer.unmap();
  console.timeEnd('read the buffer')

reads the buffer nearly twice as fast:
read the buffer: 3.323974609375 ms
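
As an aside, a single console.time around one readback mostly measures GPU-to-CPU synchronization latency, which is quite noisy; averaging several submissions gives a steadier number. A sketch, reusing device, pipeline, bindGroup, workBuffer, resultBuffer and input from the snippet above:

// Average the readback over several runs to reduce noise.
// (Each run doubles the values again, which is fine for timing.)
const runs = 20;
let totalMs = 0;
for (let i = 0; i < runs; i++) {
  const enc = device.createCommandEncoder();
  const p = enc.beginComputePass();
  p.setPipeline(pipeline);
  p.setBindGroup(0, bindGroup);
  p.dispatchWorkgroups(input.length);
  p.end();
  enc.copyBufferToBuffer(workBuffer, 0, resultBuffer, 0, resultBuffer.size);
  device.queue.submit([enc.finish()]);

  const t0 = performance.now();
  await resultBuffer.mapAsync(GPUMapMode.READ);
  resultBuffer.getMappedRange(); // touch the mapped data
  resultBuffer.unmap();
  totalMs += performance.now() - t0;
}
console.log(`average read: ${(totalMs / runs).toFixed(3)} ms over ${runs} runs`);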

Thank you!

You don’t measure the same thing, because StorageBuffer.read postpones the read to the end of the frame, so that it happens after the command buffer has been submitted.
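
In practice that means the promise returned by read only settles once a frame has been processed, so the usual pattern is (a minimal sketch, assuming the standard engine/scene setup):

engine.runRenderLoop(() => {
  scene.render();
});

// The deferred read resolves at the end of a frame, so the render loop
// (or at least one rendered frame) must be running.
const view = await bufferValues.read();
console.log(new Float32Array(view.buffer)[1]);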


…and that’s why it doesn’t work when the render loop is not running. Now it makes sense.
Thank you!

Note that you can pass true for the noDelay parameter of StorageBuffer.read if you want to read the result right away without a render loop running.
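
For example (a sketch assuming noDelay is the fourth parameter of read, after offset, size and an optional target buffer; check the signature for your Babylon.js version):

// Read back immediately, without waiting for the end of the current frame.
const view = await bufferValues.read(undefined, undefined, undefined, true);
const results = new Float32Array(view.buffer);
console.log(results[1]);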
