Generating Slope maps and Noise maps

I’m interested in generating slope maps from height maps of 4096 x 4096 resolution as well as generating perlin noise maps for textures of the same resolution.

Doing so with a simple javascript solution takes 1.5 seconds which isn’t terrible, but also not great.

Here’s the breakdown in timings:
normalizeHeightMapInPlace took: 243 ms
createNoiseMapForGrid took: 804.9000000059605 ms
buildSlopeMap took: 403.59999999403954 ms

My goal is to continue doing these calculations on the fly rather than storing the results.

I attempted to make a compute shader that does this work for me, but attempting to dispatch 256 x 256 x 256 compute jobs crashes my 2017 Macbook Pro OS. If I try to dispatch 4096 x 4096, it instead only completes 256 x 256 of the total 4096 x 4096 jobs. ( Using @compute @workgroup_size(16, 16) )

I’m thinking I should spread the work out over multiple wasm modules that each execute on their own thread - but I just thought I’d ask the community here for any advice or tips first before doing this…

Thanks in advance!

You could try using only the first parameter for workgroup size and let the compiler/hardware figure it out. I think all cards are different for the actual hardware clusters of cores you get. I just double checked, nvidia uses 32 cores (nvidia term is warp) and amd uses either 32 or 64 (amd term = waves). Seems both will have 1024 limit for shared memory access between the warps or waves. Which explains why webgpu limits you to 1024. Tldr is, i think just use a single parameter for workgroup size of up to 1024 to make sure you arent idling cores. Maybe @Evgeni_Popov has some suggestions for best practice?

For cpu, Here is a way to stitch together multiple patches in workers. The gh pages are broken because shared array buffer, but it works. Its a pathtracer but the stitches logic is good for your needs i think? .https://github.com/01alchemist/as-smallpt/blob/69dfe1a3777fb339abff6616442c4f87c7254617/js/index.js#L71

Maybe u find this interesting also
GitHub - greggman/dekapng: Make giant PNG files in the browser , little helper to make even bigger images, lol

2 Likes

Tldr is, i think just use a single parameter for workgroup size of up to 1024 to make sure you arent idling cores.

Omg :sweat_smile:

This turned out to be the culprit. I was already suspicious that when dispatching 4096 x 4096, exactly (4096 / 16) x (4096 / 16) were getting a value filled out in the target storage buffer used to store the results. Since 16 is my work group size via @workgroup_size(16, 16).

I’m still reading through your wasm link and that giant PNG files in the browser link. Funny enough I actually have a use-case that was blocked by the size limit of PNG files the browser can create with the standard canvas to png approach. Seriously, thanks for this!

1 Like

As @jeremy-coleman said, the number of threads running simultaneously is usually 32 for Nvidia and 64 for Amd. That’s why you’ll often see workgroup sizes like (8,8,1), which should work fine in both cases. With this workgroup size, you should run a (512,512,1) dispatch. But (16,16,1) should also have worked (with a dispatch of (256,256,1)), so I’m not sure why it didn’t work for you…

Try looking at the limits for workgroup sizes in engine.supportedLimits :

Look at the values maxComputeWorkgroupSizeX/Y/Z (which are the maximum workgroup size limits) and maxComputeInvocationsPerWorkgroup, which is the maximum possible value for workgroup_size.x * workgroup_size.y * workgroup_size.z (the values you pass to @compute @workgroup_size(x, y, z)).

MaxComputeWorkgroupPerDimension is the maximum value for any of the x/y/z parameters you pass to the dispatch call.

Make sure you don’t pass any values outside the limits of your GPU (but I suppose you’ll get a browser error if you do…).

2 Likes

Oh that’s pretty cool, I was unaware of engine.supportedLimits and maxComputeWorkgroupSizeX/Y/Z that’s super useful.

As @jeremy-coleman mentioned, I was in fact not making correct use of the individual jobs being run on the cores after dispatch. I wasn’t idling them, but a weird bug I encountered when setting int values on a storage buffer and attempting to read them as u32 values from the compute shader was causing my program to error in a way that would look the same as idling cores. The fix was to read them in as f32 values from the compute shader.

One thing im unclear on is what the driver does when workgroup size exceeds warp size. Like,1024 works, but is obviously nonsensical because workgroup size is meant to be local to the warp/wave. But you can use it as hack kind of for consistent behavior. So, what happens? Does the gpu driver truncate the value to max local size (meaning we maybe need to increase dispatch) or does it get spread out and size x dispatch respected. (seems more likely). What happens on m2/arm/chiplet unified arch? Also, Idk what the point of using 3 or even 2 dimensions if you’re goal is 100% utilization (ie: (8,8,1) vs (64)). Same thing right? (No memory layout implications?) The only thing i could think of for this functionality would be purposefully spreading out compute at lower capacity in order to increase power efficiency. Like a (2,2,2) with max dispatch on embedded devices that run some continuous compute like jetson nanos doing vision stuff. Idk if thats the case, just only use case i could imagine. Anyway, it seems to me like we could focus on dispatch, which gives us control to match memory layout to job logic, and just always max out workgroup size.

One thing im unclear on is what the driver does when workgroup size exceeds warp size. Like,1024

Here’s what happens on my 2017 mac:

Lol 1600 x 16. 25600 cores in a single warp!! Workgroup size is supposed to be only a small portion of your hardware, like the little actual physical groups on the gpu chip (so max 32 or 64). Thats weird though, i thought everyone had 1024 available if only using 1 dimension since opengl defines that as minimum. (Still nonsense value though)

Lol, sorry I’m a bit jet lagged. Were you asking what would happen with a less ridiculous value?

If so, here’s two more examples:
@compute @workgroup_size(257)

@compute @workgroup_size(257, 1, 1)
Same exact error.

If you exceed the maximum in any one workgroup dimension, that mac aint haven’t it.

In my understanding:

  • having 3 dimensions in a compute shader is mostly for user convenience. It may be easier, depending on the task at hand, to have one, two or three values to identify a compute job. For eg, if you want to run a compute shader for each texel of a texture, using a two dimensional workgroup size may simplify your code a little (and/or make it more readable).
  • you can have a workgroup size bigger than a warp / wavefront (wavefront==warp in AMD land, but it is generally 64 threads instead of 32). In that case, multiple blocks will be created, each with “warp” threads. I think there’s a constraint for the underlying hardware that in this case the blocks will all be handled by the same streaming processor: It allows the blocks to use the same shared memory.
1 Like

Not related to anything mentioned yet, but i think we should clarify for anyone in the future that there is possibly some confusion related the shader code and api calls

defined in the shader
wlsl: @workgroup_size(x,y,z)
glsl: layout (local_size_x , local_size_y , local_size_z) in;

api invocation:
webgpu: dispatchWorkgroups(groups_x, groups_y, groups_z)
opengl: glDispatchCompute(groups_x, groups_y, groups_z)

^they seem to maybe reference the same thing, but they aren’t at all - probably unclear to most people at first glance, especially if webgpu is first intro to compute shaders. defining the proper workgroup size in the shader requires knowledge about the physical gpu hardware. the dispatch calls are what is relevant to your actual compute logic - BUT, if you exceed your hardware limits in the workgroup size definition, this implicitly can bleed over into your dispatch logic (as @arcman7 demonstrated above).

i also was googling just now and found this article, it actually kind of sums up some stuff pretty well.
.WebGPU Compute Shader Basics.

why not just always use `@workgroup_size(1, 1, 1)?
For one, threads in a workgroup often run in lockstep so running 16 of them is just as fast as running 1.

(so, this is how i understand it works as well, that if you dont use all the cores on the warp, its just pure waste from a performance standpoint, you cant use the idle cores for some other tasks - thats why i said above the only thing i can think of to not max out would be power efficiency)

Unfortunately, the perfect size is GPU dependent and WebGPU can not provide that info. The general advice for WebGPU is to choose a workgroup size of 64 unless you have some specific reason to choose another size. Apparently most GPUs can efficiently run 64 things in lockstep. If you choose a higher number and the GPU can’t do it as a fast path it will chose a slower path. If on the other hand you chose a number below what the GPU can do then you may not get the maximum performance.

(This is kind of my point that, why do we need workgroup size at all? we just want to max out the gpu caps. Does workgroup_size actually offer anything besides this?? this touches on the (8,8,1) vs 64 above, but doesn’t answer the question)

@Evgeni_popov

  1. Is there a difference between (8,8,1) vs 64? Is it STRICTLY user convenience? Is there any scenario in which its not? Can we come up with a scenario where (8,8,1) is not equal to (64)? Maybe suppose we’re working with a 3d texture. Would (8,8,1) allow us to operate on 8 layers in parallel, while 64 would be limited to 1 layer?

  2. Is there actually some memory implications for workgroup sizes larger than 64? I thought defining how shared memory is handled should be done in the dispatch call? (granted any overflow you have in the workgroup size is actually adding to your defined dispatch calls - ie: defining workgroup size of 128 would actually just 2x or 4x your dispatch calls and bump down the workgroup size to 32 or 64) . For shared memory, the warp/wave has shared local memory within the warp/wave, but i think multiple warps/waves also potentially get shared memory up to 1024 threads (32 warps or 16 waves) when using webgpu. Can anyone confirm this or am i making shit up? (i think using native compute or different compute model like cuda, a 3000 series nvidia can have 2048 threads with shared memory - just contextualizing physical hardware caps vs webgpu arbitrary limits).

so many words, im sorry

Please take what I say with a grain of salt, as I am not an expert in compute shaders!

Yes, as far as I’m concerned, it’s exactly the same (with the exception of the values of certain built-in variables that you can use in your shader). In your example, you’re mixing up the dimension of a 3D texture with the dimension of a workgroup: they are not related all.

The dimension of a workgroup defines only the number of threads that will be started for that workgroup (the workgroup is called a “thread block” in some documentations, which is clearer in my opinion). If your workgroup is (16,16,1), then the corresponding thread block will have 16*16*1=256 threads. Depending on the numbers you pass to the Dispatch call, a certain number of thread blocks will be created (number of created thread blocks = dispatch_size.x*dispatch_size.y*dispatch_size.z). And depending on this number of thread blocks, some will be started immediately and run in parallel, and others will be delayed because the GPU may not have enough resources (or for other reasons) to run everything in parallel by the time you call Dispatch.

What is guaranteed is that when a thread block is launched by the GPU, all the threads in that block run in parallel: there cannot be threads that are launched and others that are not. Of course, during the execution of a thread, a number of things can happen (reading a texture / some shared memory, executing different branches, etc.) that will throw the threads out of sync: they won’t necessarily always execute the same instruction in your code at the same time. This is why you have synchronization objects, which you can use to ensure that all threads in the thread block have reached the same point in their execution before continuing.

Let’s return to your 3D texture example. In a compute shader, you can retrieve/calculate an index that lets you know which thread within the thread block is currently executing your code and what the index of that thread block is (in case the dispatch call has created multiple thread blocks). From this index, you can calculate a texture coordinate (x,y,z) to access your 3D texture.

For example, suppose your 3D texture has the dimensions (16,8,4). Let’s also assume that you’ve defined a workgroup size of (32,1,1), i.e. 32*1*1 = 32 threads.

You want to run as many threads as the number of texels in your texture. You have 16*8*4=512 texels, so you need 512/32=16 thread blocks. This means you’ll call Dispatch with parameters (16,1,1).

In your computer shader, you can retrieve the thread index by doing global_invocation_id.x: you’ll get a value in the range [0, 512[ (512=workgroup_size.x (32) * dispatch_size.x (16)) - see WebGPU Shading Language for an explanation of global_invocation_id.

From this value, it’s easy to calculate the (x,y,z) coordinates:

let textureDim = vec3(16u, 8u, 4u);
var coord : vec3<uint>;

coord.z = global_invocation_id.x / (textureDim.x * textureDim.y);
coord.y = (global_invocation_id.x - coord.z * textureDim.x * textureDim.y) / textureDim.x;
coord.x = (global_invocation_id.x - coord.z * textureDim.x * textureDim.y - coord.y * textureDim.x);

(untested code, so probably wrong! but you get the idea)

I used a one-dimensional workgroup size because it’s easier to explain that way, but you could use a two-dimensional or three-dimensional workgroup size, and the explanations wouldn’t change, except for the calculation of the texture coordinates.

In fact, for the 3D texture example, it would be much easier to use a workgroup_size of (16,8,4) (and a Dispatch of size (1,1,1)): this way, global_invocation_id would be your texture coordinates directly!

Shared memory is something you define at workgroup level, not at warp/wavefront level. In WebGPU, it’s declared in your shader by something like var<workgroup> sharedMem : array<vec3<f32>, 128>;. It is not defined at dispatch time.

All threads in a workgroup (=thread block) will have access to this memory. If you have several thread blocks, each will have its own shared memory block, not visible to other thread blocks!

For this to work, all threads in a thread block must run on the same GPU “core” (I think these cores are called SMs (Streaming Multiprocessors), but I’m not sure). This is where certain limitations come into play:

  • (understandably) an SM can’t run an unlimited number of threads in parallel. You’re limited by your GPU’s maxComputeInvocationsPerWorkgroup limit (mine is 1024). So, workgroup_size.x*workgroup_size.y*workgroup_size.z cannot exceed this limit
  • the total amount of shareable memory is per SM and is given by your GPU’s maxComputeWorkgroupStorageSize limit (mine is 32k). Note that this is per SM and not per thread block (=workgroup)! This means that if your SM is running 8 thread blocks in parallel, for maximum efficiency, your compute shader shouldn’t use more than 32k/8=4k of shared memory per workgroup!

I hope things are clear and that I haven’t said something obviously wrong!

2 Likes

[gonna read this tomorrow and edit probably, bedtime now]

btw this post is my wip to gather some best practices, not a response to to anything or anyone specific, probably half of the content is directed to my future self or any interested onlookers and half to popov

not mixing up! I used a 3d texture as an example because we could theoretically split a 3d texture into layers based on z coordinate and operate independently on each xy partition. If the declaration of workgroup_size could have any benefit, i just thought this could be a scenario to answer the question “Is there a difference between (8,8,1) vs 64?” Maybe someone else can think of a scenario that should (theoretically, not necessarily in reality) have potential to benefit from being able to declare a 2d or 3d workgroup size?

see, it is a good example :slight_smile: You made it better though. We have several scenarios now.

1 - workgroup size of (32,1,1), Dispatch with parameters (16,1,1) - your original scenario
2 - workgroup size of (8,8,1), Dispatch with parameters (8,1,1) - same, account for amd
3 -workgroup size of (64,1,1), Dispatch with parameters (8,1,1) - same, but 8,8,1 vs 64
4 - workgroup size of (4,1,1), Dispatch with parameters (4,4,4) - poorly chosen workgroup size
5a - workgroup size of (16,8,4), Dispatch with parameters (1,1,1) - your final suggested scenario
5b - workgroup size of (16,8,4), Dispatch with parameters (1,1,1) - your final suggested scenario but remove any shared memory in shader
5c - workgroup size of (64,1,1), Dispatch with parameters (2,1,1) - your final suggested scenario, with shared memory as in 5a, but better sized workgroup considering shared memory. amd specific
5d - workgroup size of (32,1,1), Dispatch with parameters (2,1,1) - your final suggested scenario, with shared memory as in 5a, but better sized workgroup considering shared memory. nvidia specific
6 - workgroup size of (1024), Dispatch with parameters (1,1,1) - easy optimization test, no shared memory.

my expectations from those scenarios based on everything discussed so far:

1 - as fast as 2 and 3 on nvidia, much slower than 2 and 3 on amd.
2a - faster than 1 on amd. (proving amd with workgroup_size(64) is faster than workgroup_size(32)) - not actually sure if amd does or doesnt do well with workgroup_size(32) here, we should test a certainly bad workgroup size too.
2b - same perf as 3, proving (proving 8,8,1 vs 64 doesnt matter)
3 - same perf as 2 (proving 8,8,1 vs 64 doesnt matter)
4 - much worse performance than all other scenarios
5 - 5a will be slower than 5b, 5c, and 5d who all perform the same
6 - performs same as 2 and 3

Sorry, I was unclear. I equate workgroup with warp in my mind, but I should have said “For shared memory, the warp/wave workgroup has shared local memory (atomics) within the warp/wave workgroup, but i think multiple warp/wave workgroups also potentially get shared memory up to 1024 threads (32 warps or 16 waves) when using webgpu (aka maxComputeInvocationsPerWorkgroup).”

Anyway, I see what you mean. But also, id say the scope of shared memory is determined at dispatch (compiling a shader doesn’t allocate). Also, dispatch count is variable since its based on the end-user’s hardware and i guess i think of that as the top level context parameter.

i think SM is just the CUDA term for maxComputeInvocationsPerWorkgroup or like, how many compute units use the same memory stick. Can we restate your second bullet as "a compute shader shouldn’t use more memory than:
maxComputeWorkgroupStorageSize / ( maxComputeInvocationsPerWorkgroup / (32 or 64).

Finally, i think we should think about the problem in the context of actual hardware layout. these diagrams that i copy pasta’d from google images are a nice visual aid. I probably dont understand more than 1% of what these pictures imply, but still helpful to see different chip layouts and the physical access the compute units have to the same memory stick. this is my understanding of maxComputeInvocationsPerWorkgroup in picture form.

nvidia data center gpus


nokia phone

amd

apple

key takeaways (pending review!)

<insert notable scenario outcome>

(highly questionable hot take by me here)
If no shared memory besides atomics local to workgroup invocation, choose lesser of maxComputeInvocationsPerWorkgroup and maxComputeWorkgroupSizeX as 1d max workgroup_size and dispatch 1, the compiler should be able to optimize this.

(restating @Evgeni_Popov’s last bullet)
Compute shader shouldn’t use more memory than maxComputeWorkgroupStorageSize / ( maxComputeInvocationsPerWorkgroup / (32 or 64).

(webgpufundamentals).
Choose a workgroup size of 64 unless you have some specific reason to choose another size

Uniformity analysis explanation:

@arcman7 regarding texture vs storage, it seems that textures are always filtered, so image or storage is faster. However, there is a vulkan extension to read textures without sampling.

—-
Quick notes

Amd dev advocate said 32 wavefront is native and 64 causes 32x2 in lockstep on rdna2. If same for rdna3 , reevaluate assumption 64 is best, bc amd has 2 paths and webgpu driver could choose 64 and be perf loss

@Evgeni_Popov around 23 mins she says 64x1 vs 8v8 can be significantly different because it allows different access patterns which effects cache layout. At 48 mins she answers a question about use case for 3rd dim workgroup size. Lol she answered the question and says it can impact performance and he goes “hehe… why” .

2 Likes

This is an amazing summary. Seriously, once again, thank you @jeremy-coleman and @Evgeni_Popov for dropping the knowledge!

What do you mean as “benefit”? In my understanding, there’s no benefit at the hardware level, in both cases you will have a thread block of size 64. The only (small) benefit you could have is when computing a value to index some resources. Depending on your choices, this calculation could be a little easier in some cases (as demonstrated in my example, where in the second case global_invocation_id can be used directly).

Regarding you examples, here’s my thoughts regarding the workgroup size:

  1. the best possible performances are achieved if you choose a workgroup size the size of the warp/wavefront of your GPU. As you don’t/can’t know this value, the best value to use is 64 (being (64,1,1), (8,8,1), (4,4,4), etc., it does not matter)
  2. you may not be able to always this value, depending on the algorithm you have to implement.
  3. in any case, you should always use a multiple of this value, else you will have some unoccupied threads

Regarding 1., the loss in performance is because if a SM has only one warp available and your workgroup is 2*warp, another SM will have to be choosen, with at least 2 idle warps. This means the first SM may not be fully occupied (if there’s not a pending compute that only needs a single warp).

So, given what I said:

1 is faster (in theory) than 2 and 3 on Nvidia. On Amd, 1 is indeed much slower than 2 and 3 because half of the threads of a wavefront won’t be used.

Same thing for this:

Assuming 1024 threads is the maximum number of threads that a SM can launch, it means that the system can only select a SM that is idle to start your compute: if a single warp is currently running on a SM, it can’t be chosen. Whereas in the 2 and 3 cases, the workgroup size is 64, so it’s more likely that a SM is available to run the compute.

If you mean that multiple workgroups can share some memory as long as the sum of threads for these workgroups are less than maxComputeInvocationsPerWorkgroup, then I think it’s wrong. Shared memory can only be shared by the threads of a single workgroup. And the maximum size of this workgroup (=the maximum number of threads that can share this memory) is given by maxComputeInvocationsPerWorkgroup.

I think the right formula is:
maxComputeWorkgroupStorageSize / (maxComputeInvocationsPerWorkgroup / workgroupsize)

where workgroupsize = workgroup_size.x * workgroup_size.y * workgroup_size.z, ie. the number of threads in the workgroup.

See the answer from Plazmatic for additional complications regarding access to shared memory (at least on Nvidia):

https://www.reddit.com/r/vulkan/comments/lh9cu1/do_compute_shaders_only_parallelize_up_to_local/

I’m not sure to understand that one… If you use a single dimension for the size of the workgroup, use 64 for maximum performance, as you stated below. What’s the optimization you think the compiler is doing?

To clarify, the complier makes no optimization modifications to the code as far as GPU hardware processes are concerned, right? That is to say, the complier has no knowledge about the specifics of the hardware other than it’s hard limits.

The “high level” compilers (like DXC or FXC) don’t have such knowledge I think, but the drivers that translate to actual GPU code do.

1 Like

@arcman7
when i say compiler i could mean driver or webgpu runtime. some drivers compile textual shaders like opengl for glsl, whereas vulkan requires the runtime to compile it to spirv first. mostly i’ve meant wegpu runtime compiling the wgsl.

not related to anything, but i was watching a talk from a guy from utah graphics and he made the point that gpu runtimes should prefer textual shaders because they need to compile it for the exact hardware its running on. he made no mention or reference to spirv at all, but it made me wonder if the concept of spirv is fundamentally flawed.

@Evgeni_Popov
my computer took a dump, supposed to be getting a new one delivered today. planning to pick this up as first project on my new one:) 4090 yay (laptop though). You made a really good point that i hadn’t considered about being able to start the next compute workgroup sooner.

optimize probably isnt the right word, im just referring to the process of dividing the workgroup size into smaller chunks to match hardware. you’re right though, there is a flaw in my logic (which i also realized) of using workgroup size, since we can just use 64 always. i think i was originally thinking we didnt need engine caps check, but we do. so instead, i should check maxComputeWorkgroupStorageSize and do it in the dispatches. anyway, nothing concrete yet only putting thoughts to words

1 Like

Not sure if I should be starting a new thread for this, but this post is sort of a fun journal into our various thoughts regarding best practices for GPU shader programing -

Is storage buffer read access faster than texture sampling? The internet doesn’t give any definitive answers. I did find this github benchmark repo:

But I don’t know enough c++ to gain an understanding of the test results just by reading the code.

I think think this is the perfect place! Unknown on the answer but i think its probably too nuanced for a cut and dry answer, and certainly varies between hardware models. One thing to consider i think , is that compute and render can run at the same time, so keeping hardware utilization high is something to consider for the big picture, even if a microbench is faster. There are some graphs in gpu perf tools like nvidia nsight where you can see if they overlap, its something easier to check for too, since u can just look at charts lol. But one thing, isnt a sampler more equivalent to a compute kernel than a storage buffer? Gpus also have limited texture cache, like just number not data size, whereas storage buffer is bound to data size (2gb for me) and i think im limited to 32 textures

1 Like