WebGPU writeBuffer overwrites with zeros in special cases

Has anyone experienced this bug in real code?

I’ve been looking through code related to partial updates to VertexBuffer.

In particular, I’ve read the MDN docs and the WebGPU specification and found that the MDN docs for GPUQueue.writeBuffer are wrong with respect to alignment requirements.

This post is just a reminder for me to delve deeper when I get some time later.

Lines around 85 of webgpuBufferManager.ts

The code guards against an unaligned byteLength, and copies the buffer twice if byteLength is not a multiple of 4.

The first copy appears to be a mistake: it assigns a variable named tempView from a TypedArray slice(). slice() does not create a view; it copies the data into a new array. What was probably intended was

const tempView = new Uint8Array(src.buffer, chunkStart, chunkEnd - chunkStart); // third argument is a length, not an end offset
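To illustrate the slice-versus-view distinction with a standalone sketch (not the Babylon.js code):

```typescript
// Standalone demo: slice() copies, a constructed Uint8Array view aliases
const base = new Uint8Array([10, 20, 30, 40]);
const copied = base.slice(1, 3);                  // independent copy of bytes 1..2
const viewed = new Uint8Array(base.buffer, 1, 2); // view over the same memory
base[1] = 99;
console.log(copied[0]); // 20 — the copy is unaffected by the write
console.log(viewed[0]); // 99 — the view sees the write
```

So slice() costs an extra allocation and copy per chunk, while the constructed view is free.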

The code checks for alignment of the data length and copies extra zeros if needed to pad the copy to a 4-byte boundary.

If you have a copy of the GPUBuffer in memory and want to update 2 bytes, the current code will zero out two additional bytes in the GPUBuffer. This seems bad. Also if you try to update the second pair of bytes in the buffer, I expect the current code to fail because, according to the WebGPU specification, the dstOffset must also be aligned.

Better, I think, would be to extend both the beginning and the end of the range to copy additional data from the source to the destination. If the source is just a full copy of the GPUBuffer, then the result would at least not overwrite the GPUBuffer with zeros.

I haven’t worked out the full replacement code, but the last bit does something like this:

// we may copy more than requested to make sure the write is aligned
const startPre = dstByteOffset & 3; // use 3 for 4-byte alignment, 7 for 8-byte
srcByteOffset -= startPre;
dstByteOffset -= startPre;
byteLength = (byteLength + startPre + 3) & ~3;
this._device.queue.writeBuffer(buffer, dstByteOffset, src, srcByteOffset, byteLength); // the "offset" argument was removed because it's 0

This avoids multiple CPU-side copies of the data.
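To make the arithmetic concrete, here is the same adjustment wrapped in a hypothetical helper (the function name is mine, not from the code under discussion):

```typescript
// Hypothetical helper showing the alignment arithmetic above
function alignWrite(dstByteOffset: number, byteLength: number): { dstByteOffset: number; byteLength: number } {
    const startPre = dstByteOffset & 3; // bytes to back up to reach a 4-byte boundary
    return {
        dstByteOffset: dstByteOffset - startPre,
        byteLength: (byteLength + startPre + 3) & ~3,
    };
}

console.log(alignWrite(6, 2)); // { dstByteOffset: 4, byteLength: 4 }
console.log(alignWrite(0, 5)); // { dstByteOffset: 0, byteLength: 8 }
```

A 2-byte write at offset 6 becomes a 4-byte write at offset 4 — both ends of the range are extended, never zero-filled.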

If writeBuffer is limited to 1024 * 1024 * 15 bytes, then the loop needs to be modified as well.
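The split itself can be sketched independently of the GPU calls (a hypothetical helper; the 15 MiB default mirrors the value used in the existing code):

```typescript
// Sketch: split a large write into [offset, length] chunks no larger than maxChunk bytes
function chunkRanges(byteLength: number, maxChunk = 1024 * 1024 * 15): Array<[number, number]> {
    const ranges: Array<[number, number]> = [];
    let offset = 0;
    while (byteLength > maxChunk) {
        ranges.push([offset, maxChunk]);
        offset += maxChunk;
        byteLength -= maxChunk;
    }
    ranges.push([offset, byteLength]); // final (possibly short) chunk
    return ranges;
}

console.log(chunkRanges(10, 4)); // [[0, 4], [4, 4], [8, 2]]
```

Note that if the chunk size is a multiple of 4 and the aligned range is passed in, every chunk except possibly the last stays aligned automatically.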

https://www.w3.org/TR/webgpu/#dom-gpuqueue-writebuffer


cc @sebavan regarding the code, and most notably the comment // After Migration to Canary. I wonder if copying with writeBuffer by chunks is related to this comment?

Also, I don’t think we can correct the dstByteOffset property. For example:

  • the user calls setData with dstByteOffset=2 and srcData=[1,2], so the number of bytes to copy is 2
  • we correct dstByteOffset to be 0 and set the byte length to copy to 4
  • we do the copy: we write [1,2] at the start of the GPU buffer and undefined values at offsets 2 and 3, which is wrong
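A CPU-side simulation of that sequence (stand-ins for the GPU write, with the zeros the current padding code would produce as the "undefined values"):

```typescript
// Simulate "correcting" dstByteOffset=2 to 0 and padding the length to 4
const gpuBuffer = new Uint8Array([9, 9, 9, 9]); // stand-in for the GPU buffer contents
const srcData = new Uint8Array([1, 2]);         // the user wants these at offset 2

const padded = new Uint8Array(4); // zero-padded staging copy
padded.set(srcData);
gpuBuffer.set(padded, 0);         // the "corrected" write at offset 0

console.log(Array.from(gpuBuffer)); // [1, 2, 0, 0] — offsets 2..3 were clobbered, not updated
```

The bytes the user wanted at offsets 2..3 never arrive there, and offsets 0..1 are destroyed.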

I think we should simply document the fact that dstByteOffset should be a multiple of 4 and let it crash at runtime if it isn’t.

I agree about slice, though: we should definitely use new Uint8Array(src.buffer, chunkStart, chunkEnd - chunkStart); to avoid a copy!


If sourceOffset+dataBytes > the size of source, then writeBuffer throws an OperationError.
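That validation rule, modeled as a standalone check (a plain Error stands in for the spec's OperationError; the function name is mine):

```typescript
// Per the spec, the requested write must fit inside the source data
function validateWriteRange(sourceByteLength: number, sourceOffset: number, dataBytes: number): void {
    if (sourceOffset + dataBytes > sourceByteLength) {
        throw new Error("OperationError: write range exceeds source size");
    }
}

validateWriteRange(8, 4, 4); // ok: exactly fits
// validateWriteRange(8, 6, 4); // would throw
```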

It makes sense to me to either

  1. Handle alignment issues in both dstOffset and dataBytes, or
  2. Let the user deal with all alignment.

And not

  1. Only handle dataBytes alignment correction and ignore dstOffset alignment.

Given the current code tries to handle alignment of dataBytes, my thought was to extend that to dstOffset as well.

In the case where srcData is a buffer of 2 bytes, writeBuffer() will always fail. The current code would copy the buffer to a 4 byte buffer and write zeros to the next two bytes in the GPUBuffer. This seems detrimental to the destination buffer and, perhaps, not what anyone would want.

My suggestion applies to the case where a full copy of the buffer is kept in an ArrayBuffer or TypedArray and only a small number of bytes are updated and then sent; only in this case does my suggestion work. It is not relevant to Float32Array buffers, as those are already 4-byte aligned. My workaround comes into play when updating a Uint8 or Uint16 buffer one element at a time (per frame, for example), or when the dstOffset is not 4-byte aligned.

If you were trying to update from a small data buffer and the dstOffset were not aligned, then to get the bytes you wanted to the location you wanted, the front byte(s) of the source would have to be padded. In summary, forcing alignment with zero padding doesn’t work when the source buffer is not a full copy of the GPUBuffer.

Instead, I pivoted to “let’s make it work for a different use case.” Namely, when buffers are a full copy of each other.

If we get rid of all alignment correction within this function, then adding alignment correction with a supporting function is also workable, and not too burdensome (an extra function call containing my alignment code above) and we save code cycles in the nominal cases when alignment is inherently correct (such as float32 buffers).

After a brief internet search, I also don’t see where the Canary/Chrome maximum writeBuffer size comes from.

If you wanted to handle it, wouldn’t you do so by writing zeros before the dstOffset bytes, just as is done with the bytes after the designated destination bytes? The buffer copy could create those bytes just as it does for the dataBytes alignment. Honestly, I don’t see a need for writing zeros in either case: it destroys data outside the designated bytes. That’s why I prefer to handle both or neither.

My thought is that writing zeros is wrong as well, aside from the fact that it would actually throw an error instead of writing “undefined values.”

The issue is coming from before WebGPU was available in Chrome Canary… I am amazed this code is still here tbh 🙂

Here’s my untested attempt. I hope it’s useful for discussion.

  • setSubDataRaw - the lowest level, implementing chunking and calling GPUQueue.writeBuffer
  • setSubDataAlign - wrapper around setSubDataRaw() that aligns dstByteOffset and byteLength and copies from src.buffer outside the specified range. Suitable if src is a full copy of dst.
  • setSubDataPadLength - wrapper around setSubDataRaw() duplicating the original setSubData()

My preference is to never use setSubDataPadLength(), but I include it because I don’t know if I’ve persuaded you. To insert this code as a non-breaking change, rename setSubDataPadLength to setSubData (and remove the original setSubData).

To insert this code as a breaking change, rename setSubDataRaw to setSubData and change the calls in setSubDataAlign and setSubDataPadLength to use the new setSubData. My preference would be to replace the existing setSubData with the (renamed) setSubDataRaw, update setSubDataAlign to call the new setSubData, and not bother adding setSubDataPadLength.

// This calls GPUQueue.writeBuffer() with no alignment corrections.
// dstByteOffset and byteLength must both be aligned to 4 bytes,
// and the bytes moved must be within the src and dst arrays.
// Note that the 4th and 5th arguments are in BYTES, not elements,
// unlike GPUQueue.writeBuffer() when data is a TypedArray.
// Set maxChunk = Infinity for no chunking; chunking is ON by default.
public setSubDataRaw(dataBuffer: WebGPUDataBuffer, dstByteOffset: number, src: ArrayBufferView, srcByteOffset = 0, byteLength = 0, maxChunk = 1024 * 1024 * 15): void {
    const buffer = dataBuffer.underlyingResource as GPUBuffer;

    // this I think fixes an error in the original setSubData() by accounting for srcByteOffset
    byteLength = byteLength || src.byteLength - srcByteOffset;

    // If src is an ArrayBuffer rather than a view, src.byteLength is available but byteOffset is not;
    // "src.byteOffset ?? 0" and "src.buffer ?? src" handle both cases.

    // make srcByteOffset relative to src.buffer
    srcByteOffset += src.byteOffset ?? 0;
    // Since we pass the underlying ArrayBuffer, data beyond the src TypedArray WILL be copied from src.buffer.
    // We could guard against this with: if (srcByteOffset + byteLength > srcBuffer.byteLength) throw ...
    const srcBuffer = src.buffer ?? src; // use the ArrayBuffer so writeBuffer's offset/size arguments are in bytes

    // Alternatively (a little cleaner):
    // if (ArrayBuffer.isView(src)) srcByteOffset += src.byteOffset;

    while (byteLength > maxChunk) {
        this._device.queue.writeBuffer(buffer, dstByteOffset, srcBuffer, srcByteOffset, maxChunk);
        dstByteOffset += maxChunk;
        srcByteOffset += maxChunk;
        byteLength -= maxChunk;
    }

    this._device.queue.writeBuffer(buffer, dstByteOffset, srcBuffer, srcByteOffset, byteLength);
}

// Assume src is a full copy of dataBuffer, so handle alignment by copying
// data outside [srcByteOffset, srcByteOffset + byteLength); the extra data comes from src.buffer.
// Chunking is OFF by default.
// If dstByteOffset is not a multiple of 4, additional bytes are copied from src.buffer
// before src[srcByteOffset], writing to the buffer before buffer[dstByteOffset].
// byteLength (adjusted for the dstByteOffset alignment) is padded the same way,
// by copying data after src[srcByteOffset + byteLength].
public setSubDataAlign(dataBuffer: WebGPUDataBuffer, dstByteOffset: number, src: ArrayBufferView, srcByteOffset = 0, byteLength = 0, maxChunk = Infinity): void {
    byteLength = byteLength || src.byteLength - srcByteOffset; // resolve the default before aligning

    // we may copy more than requested to make sure the write is aligned
    const startPre = dstByteOffset & 3; // use 3 for 4-byte alignment, 7 for 8-byte
    srcByteOffset -= startPre;
    dstByteOffset -= startPre;
    byteLength = (byteLength + startPre + 3) & ~3;

    this.setSubDataRaw(dataBuffer, dstByteOffset, src, srcByteOffset, byteLength, maxChunk);
}
// Pad byteLength to a 4-byte boundary with zeros by copying src into a temporary
// TypedArray whose length is aligned to 4 bytes. This matches the original setSubData.
public setSubDataPadLength(dataBuffer: WebGPUDataBuffer, dstByteOffset: number, src: ArrayBufferView, srcByteOffset = 0, byteLength = 0, maxChunk = 1024 * 1024 * 15): void {
    byteLength = byteLength || src.byteLength - srcByteOffset;

    // if alignment is needed, pad with 0 by copying to a new, larger TypedArray
    const newByteLength = (byteLength + 3) & ~3;
    if (newByteLength !== byteLength) {
        const tempView = new Uint8Array(src.buffer ?? src, (src.byteOffset ?? 0) + srcByteOffset, byteLength);
        const padded = new Uint8Array(newByteLength); // trailing bytes are zero
        padded.set(tempView);
        src = padded;
        srcByteOffset = 0;
        byteLength = newByteLength;
    }
    this.setSubDataRaw(dataBuffer, dstByteOffset, src, srcByteOffset, byteLength, maxChunk);
}

Here’s a demonstration that the current WebGPU updateDirectly is broken for transfer sizes less than 4. And a new setSubData that fixes it for one use case.

(this comes from another thread on modifying PointsCloudSystem to do a partial update on VertexBuffer.)

On lines 142 and 143, you can change testNewGPU and chunkSize.

  • testNewGPU - false uses standard VertexBuffer.updateDirectly. true uses my setSubDataAlign function (when playground is set for gpu).
  • chunkSize = use 1, 2, 4, or 12. splits the 12-byte update into chunks of this number of bytes.

Result:

WebGL2 updateDirectly: works with all chunk sizes (1, 2, 4, 12)
WebGPU updateDirectly: works with chunk sizes 4 and 12. fails to update GPUBuffer with sizes 1 or 2.
WebGPU setSubDataAlign: works with all chunk sizes (1, 2, 4, 12)

When testing setSubDataAlign, the numbers flickering on the lower left are the number of bytes transferred that are within the ArrayBuffer but outside the TypedArray overlaying it.

Note that setSubDataAlign is still likely to fail if dstOffset and/or byteLength are not aligned and there’s not enough data in the ArrayBuffer behind the TypedArray. There’s a way to fix this, but it would pad with zeros.
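That failure condition can be written as a checkable predicate (the function name is mine; the arithmetic matches setSubDataAlign):

```typescript
// True if the aligned write still fits inside the source ArrayBuffer
function alignedWriteFitsSource(srcBufferByteLength: number, srcByteOffset: number, dstByteOffset: number, byteLength: number): boolean {
    const startPre = dstByteOffset & 3;
    const alignedStart = srcByteOffset - startPre;                // extend the front
    const alignedLength = (byteLength + startPre + 3) & ~3;       // extend the back
    return alignedStart >= 0 && alignedStart + alignedLength <= srcBufferByteLength;
}

console.log(alignedWriteFitsSource(8, 2, 2, 2)); // true  — the ArrayBuffer has room on both sides
console.log(alignedWriteFitsSource(4, 3, 3, 2)); // false — the aligned range runs past the buffer
```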

Edit: setSubDataAlign can replace setSubData on bufferManager (when the engine is WebGPU):

engine._bufferManager.setSubData = setSubDataAlign

There’s also a PointsCloudSystem problem on WebGPU where the size parameter is not respected. I’ll post more on that when I get time to debug in a week or so.


Thanks for the detailed analysis!

My opinion is that we should use the version that correctly aligns the data at both ends, and document what happens if src is not a complete copy of the GPU buffer when the data is not aligned.

This would be a breaking change, but I don’t think copying unaligned data happens that often, and, as you said, it’s buggy anyway because the current implementation overwrites some GPU buffer data with zero. If this really is a problem (people complain about it), we can always add a static property to the class to allow the user to choose between the old and the new behavior.

That way, we wouldn’t have setSubDataPadLength and your setSubDataAlign method would be renamed to setSubData.

As for your implementation of setSubDataRaw, you can simplify it because src is an ArrayBufferView, so byteOffset always exists (in addition to byteLength). Also, I disagree with this:

byteLength = byteLength || src.byteLength-srcByteOffset

When the function is called, the user transmits the number of bytes to be copied, regardless of the offset in the source buffer. This offset only exists to allow the user to indicate the starting point of the data in src.buffer, but the number of bytes to be copied is always byteLength.

What do you think @sebavan and @Deltakosh? If it’s ok, I think @HiGreg (if you agree!) can make a PR for it?

[EDIT] I also think we’ll be able to remove the implementation with chunks, but I’ve asked the Dawn developers in the Matrix channel to be sure. I’m waiting for their reply.

Maybe it’s clearer if I include parenthesis:

byteLength = byteLength || (src.byteLength-srcByteOffset)

srcByteOffset is only subtracted if byteLength is 0 so the data from offset to the end is pushed (maybe I was wrong about order of ops; the parens fix that).
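For what it's worth, a quick standalone check of the order of operations (a JavaScript precedence fact, not from the code under discussion): `-` binds tighter than `||`, so both forms are equivalent and the parentheses only add clarity.

```typescript
// `-` has higher precedence than `||`, so these evaluate identically
const byteLength = 0;
const srcByteLength = 10;
const srcByteOffset = 4;

const withoutParens = byteLength || srcByteLength - srcByteOffset;
const withParens = byteLength || (srcByteLength - srcByteOffset);
console.log(withoutParens, withParens); // 6 6
```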

Agreed.

Only problem is I don’t have a full dev system and haven’t set up for a PR. Maybe next week I can get a PR set up.

My bad, I read it wrong!

No problem, I can do the PR if you prefer.

Feel free to!

Be sure to take the code from the Playground as the code there may have been updated.

Here’s the PR, tell me what you think:

Edit: I think your PR looks good. I wouldn’t replace setRawData as detailed below. Maybe be prepared to add setDataArrayBuffer below, but only if anyone needs it.

I question the utility of the existing function setRawData:

public setRawData(buffer: GPUBuffer, dstByteOffset: number, src: ArrayBufferView, srcByteOffset: number, byteLength: number): void {
    this._device.queue.writeBuffer(buffer, dstByteOffset, src.buffer, srcByteOffset, byteLength);
}

This doesn’t take into account the actual range of the TypedArray within its ArrayBuffer. For setRawData, under what conditions would you want srcByteOffset and the default byteLength to be relative to the underlying ArrayBuffer instead of relative to the TypedArray? setRawData is misleading at best. Only when the TypedArray overlaps the full ArrayBuffer is this useful.
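Concretely, for a view over part of a buffer (an illustration, not Babylon.js code), buffer-relative and view-relative offsets differ:

```typescript
// A TypedArray that views only part of its ArrayBuffer
const ab = new ArrayBuffer(16);
const halfView = new Uint16Array(ab, 4, 4); // 4 elements over bytes 4..11

console.log(halfView.byteOffset); // 4 — a view-relative offset of 0 is byte 4 of the buffer
console.log(halfView.byteLength); // 8 — not the 16 bytes of the underlying buffer
```

Passing halfView.buffer with srcByteOffset = 0 therefore reads from byte 0 of the buffer, two elements before the view begins.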

I think this leads me to the conclusion that my setSubDataRaw should replace setRawData. If we want an ultra-lean writeBuffer, then maybe it’s worth adding a function that takes an ArrayBuffer directly. Then it takes up zero extra time: there are no additional calculations. Something like this:

public setDataArrayBuffer(buffer: GPUBuffer, dstByteOffset: number, src: ArrayBuffer | SharedArrayBuffer, srcByteOffset: number, byteLength: number): void {
    this._device.queue.writeBuffer(buffer, dstByteOffset, src, srcByteOffset, byteLength);
}

If the original functionality of setRawData is desired, then call setDataArrayBuffer with src.buffer.

I think this is fixed by the green line?

If users don’t want to start copying at offset 0 with a typed array, they must pass a suitable offset through srcByteOffset, by multiplying the number of elements to skip with the size of an element (same for byteLength).

Note: if possible, it would be better to comment directly on the PR.

In fact, we have a problem with non-aligned byteLength…

For example, when loading a glb file, an index buffer Int16Array is created with a length of 13347. The byte length is 26694, which is not 4-byte aligned. The GPU buffer is correctly created with a size of 26696, and when calling setSubData, byteLength is also correctly set to 26696, but the copy fails because the source buffer is only 26694 bytes long. I can’t see any other way of solving the problem than doing what we did before, i.e. creating a temporary buffer of the correct size…
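The arithmetic behind those numbers:

```typescript
// The numbers from the glb example above
const elementCount = 13347;
const indexByteLength = elementCount * Int16Array.BYTES_PER_ELEMENT; // 26694
const alignedByteLength = (indexByteLength + 3) & ~3;                // 26696

console.log(indexByteLength, alignedByteLength); // 26694 26696
```

The 2-byte gap between the two is exactly the region the source buffer cannot supply.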

What do you think?

Yes, the green line fixes the srcOffset issue, hence my edit.

The GPUBuffer (destination) must be 2 bytes larger (as you indicated). The restriction is on dstOffset and byteLength. And I don’t see how to write to those last (original) two bytes otherwise.

Either the TypedArray should mirror that size (with the last two bytes padded with zero), or call setSubDataAlign twice. Perhaps the code creating the ArrayBuffer could create it aligned to 4 bytes; the TypedArray can overlay it and be the “correct” two bytes shorter. If the underlying buffer is made larger and the TypedArray is not, then other code that looks at TypedArray.byteLength still functions as expected, and the new setSubData (with alignment) will work as well.

This might reveal other functions that assume the TypedArray covers the full ArrayBuffer (which in my opinion should be fixed). I’m assuming the Int16 buffer is created before the GPUBuffer, so special reference to the engine (edit: to know what, if any, alignment is needed) when creating the Int16 buffer might be awkward.

If we must keep the ArrayBuffer unchanged, then I think the choices are copying the entire array on every setSubDataAlign, or copying to a small buffer and calling setSubDataAlign twice. In the two-call case, the second call uses a small four-byte buffer containing the last two data bytes, zero padded (plus the extra call to writeBuffer). That might perform better; not sure, but it seems better than copying the entire array.

  • call setSubDataAlign using byteLength-2,
  • create a new, zero-initialized Uint8Array(4)
  • newBuffer[0] = origBuffer[byteLength-2]
  • newBuffer[1] = origBuffer[byteLength-1]
  • call setSubDataAlign with src=newBuffer and dstOffset=byteLength-2

Or add a conditional within setSubDataAlign to handle that case. The conditional tests for srcOffset + byteLength (using the adjusted/aligned values) > src.buffer.byteLength
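A hedged sketch of that two-call workaround, with the GPU write replaced by a plain array copy so it can be checked on the CPU (writeAligned stands in for setSubDataAlign; all names are mine):

```typescript
// Write the last unaligned 2 bytes via a small zero-padded 4-byte staging buffer
function writeTailWithPadding(
    writeAligned: (dstOffset: number, src: Uint8Array, srcOffset: number, length: number) => void,
    src: Uint8Array,
    byteLength: number
): void {
    writeAligned(0, src, 0, byteLength - 2);  // aligned body first (byteLength - 2 is a multiple of 4 here)
    const tail = new Uint8Array(4);           // 4-byte staging buffer, zero-initialized
    tail[0] = src[byteLength - 2];
    tail[1] = src[byteLength - 1];
    writeAligned(byteLength - 2, tail, 0, 4); // final 2 data bytes plus 2 zero pad bytes
}

// CPU-side stand-in for the GPU buffer (sized with the 2 pad bytes)
const dest = new Uint8Array(8);
const copyInto = (o: number, s: Uint8Array, so: number, l: number) => {
    for (let i = 0; i < l; i++) dest[o + i] = s[so + i];
};
writeTailWithPadding(copyInto, new Uint8Array([1, 2, 3, 4, 5, 6]), 6);
console.log(Array.from(dest)); // [1, 2, 3, 4, 5, 6, 0, 0]
```

The zeros land only in the GPU buffer's own padding bytes, so no real data is destroyed.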

What do you think?

It’s not possible to require that the source buffer size always be aligned to 4 bytes, that would be a breaking change.

I’m in favor of not complicating things in the low-level implementation and letting the user handle borderline cases, knowing that even if the data isn’t aligned, if the source buffer is a mirror of the GPU buffer (which probably happens most of the time), everything will work as expected.

cc @sebavan and @Deltakosh for comments.

agreed! 10000%