Generating Slope maps and Noise maps

[gonna read this tomorrow and edit probably, bedtime now]

btw, this post is my WIP to gather some best practices, not a response to anything or anyone specific; probably half of the content is directed at my future self or any interested onlookers, and half at Popov.

I'm not mixing them up! I used a 3D texture as an example because we could theoretically split a 3D texture into layers based on the z coordinate and operate independently on each xy slice. If declaring a multi-dimensional workgroup_size could have any benefit, I just thought this could be a scenario to answer the question "Is there a difference between (8,8,1) and 64?" Maybe someone else can think of a scenario that should (theoretically, not necessarily in reality) have the potential to benefit from being able to declare a 2D or 3D workgroup size.
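To make that concrete, here is a minimal sketch (my own illustration, not anything from the thread) of what the (8,8,1) case could look like in WGSL, with each workgroup covering an 8x8 tile of one z-layer of a 3D texture; `slice3dWGSL` and the bindings are placeholder names:

```ts
// Illustrative only: @workgroup_size(8, 8, 1), each workgroup handles an 8x8 tile of one z slice.
const slice3dWGSL = /* wgsl */ `
  @group(0) @binding(0) var src : texture_3d<f32>;
  @group(0) @binding(1) var<storage, read_write> out : array<f32>;

  @compute @workgroup_size(8, 8, 1)
  fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
    let dims = textureDimensions(src);
    if (gid.x >= dims.x || gid.y >= dims.y || gid.z >= dims.z) { return; }
    // textureLoad needs no sampler; each invocation reads one texel of its xy slice.
    let texel = textureLoad(src, vec3<i32>(gid), 0);
    out[gid.z * dims.x * dims.y + gid.y * dims.x + gid.x] = texel.r;
  }`;
// Dispatch would be dispatchWorkgroups(ceil(width / 8), ceil(height / 8), depth).
```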

see, it is a good example :slight_smile: You made it better, though. We have several scenarios now:

1 - workgroup size of (32,1,1), Dispatch with parameters (16,1,1) - your original scenario
2 - workgroup size of (8,8,1), Dispatch with parameters (8,1,1) - same, account for AMD
3 - workgroup size of (64,1,1), Dispatch with parameters (8,1,1) - same, but 8,8,1 vs 64
4 - workgroup size of (4,1,1), Dispatch with parameters (4,4,4) - poorly chosen workgroup size
5a - workgroup size of (16,8,4), Dispatch with parameters (1,1,1) - your final suggested scenario
5b - workgroup size of (16,8,4), Dispatch with parameters (1,1,1) - your final suggested scenario, but remove any shared memory in the shader
5c - workgroup size of (64,1,1), Dispatch with parameters (2,1,1) - your final suggested scenario, with shared memory as in 5a, but a better-sized workgroup considering shared memory. AMD specific
5d - workgroup size of (32,1,1), Dispatch with parameters (2,1,1) - your final suggested scenario, with shared memory as in 5a, but a better-sized workgroup considering shared memory. NVIDIA specific
6 - workgroup size of (1024,1,1), Dispatch with parameters (1,1,1) - easy optimization test, no shared memory (see the sketch right after this list for how these map to code).
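As referenced in scenario 6, here is roughly how one of these would be expressed against the raw WebGPU API (a sketch under my own assumptions, not Babylon code; all names are placeholders and the standard @webgpu/types typings are assumed). Scenario 2 is shown: @workgroup_size(8,8,1) in the shader, dispatchWorkgroups(8,1,1) on the pass:

```ts
// Rough sketch, raw WebGPU (not Babylon), placeholder names. Scenario 2:
// @workgroup_size(8,8,1) x dispatchWorkgroups(8,1,1) = 512 invocations total.
const scenario2WGSL = /* wgsl */ `
  @group(0) @binding(0) var<storage, read_write> data : array<f32>;

  @compute @workgroup_size(8, 8, 1)
  fn main(@builtin(global_invocation_id) gid : vec3<u32>,
          @builtin(num_workgroups) nwg : vec3<u32>) {
    // Flatten the 2D invocation grid so scenarios 2 and 3 touch the same 512 elements.
    let i = gid.y * (8u * nwg.x) + gid.x;
    data[i] = data[i] * 2.0;
  }`;

function runScenario2(device: GPUDevice, layout: GPUBindGroupLayout, bindGroup: GPUBindGroup) {
  const pipeline = device.createComputePipeline({
    layout: device.createPipelineLayout({ bindGroupLayouts: [layout] }),
    compute: { module: device.createShaderModule({ code: scenario2WGSL }), entryPoint: "main" },
  });
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(8, 1, 1); // the "Dispatch with parameters" half of each scenario
  pass.end();
  device.queue.submit([encoder.finish()]);
}
```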

My expectations from those scenarios, based on everything discussed so far (a rough timing sketch to check them follows the list):

1 - as fast as 2 and 3 on NVIDIA, much slower than 2 and 3 on AMD.
2a - faster than 1 on AMD (proving AMD with workgroup_size(64) is faster than workgroup_size(32)) - not actually sure if AMD does or doesn't do well with workgroup_size(32) here; we should test a certainly bad workgroup size too.
2b - same perf as 3 (proving 8,8,1 vs 64 doesn't matter)
3 - same perf as 2 (proving 8,8,1 vs 64 doesn't matter)
4 - much worse performance than all other scenarios
5 - 5a will be slower than 5b, 5c, and 5d, which all perform the same
6 - performs the same as 2 and 3
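To actually measure these, something like GPU timestamp queries around the compute pass should work. This is only a sketch: it assumes the "timestamp-query" feature was requested at requestDevice, and every name is a placeholder:

```ts
// Sketch only: time one compute dispatch with timestamp queries.
// Requires requestDevice({ requiredFeatures: ["timestamp-query"] }).
async function timeDispatchMs(device: GPUDevice, pipeline: GPUComputePipeline,
                              bindGroup: GPUBindGroup,
                              dispatch: [number, number, number]): Promise<number> {
  const querySet = device.createQuerySet({ type: "timestamp", count: 2 });
  const resolveBuf = device.createBuffer({
    size: 16, usage: GPUBufferUsage.QUERY_RESOLVE | GPUBufferUsage.COPY_SRC,
  });
  const readBuf = device.createBuffer({
    size: 16, usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
  });

  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass({
    timestampWrites: { querySet, beginningOfPassWriteIndex: 0, endOfPassWriteIndex: 1 },
  });
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(...dispatch);
  pass.end();
  encoder.resolveQuerySet(querySet, 0, 2, resolveBuf, 0);
  encoder.copyBufferToBuffer(resolveBuf, 0, readBuf, 0, 16);
  device.queue.submit([encoder.finish()]);

  await readBuf.mapAsync(GPUMapMode.READ);
  const [start, end] = new BigUint64Array(readBuf.getMappedRange());
  readBuf.unmap();
  return Number(end - start) / 1e6; // timestamps are in nanoseconds
}
```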

Sorry, I was unclear. I equate workgroup with warp in my mind, but I should have said "For shared memory, the warp/wave workgroup has shared local memory (atomics) within the warp/wave workgroup, but I think multiple warp/wave workgroups also potentially get shared memory up to 1024 threads (32 warps or 16 waves) when using WebGPU (aka maxComputeInvocationsPerWorkgroup)."

Anyway, I see what you mean. But also, I'd say the scope of shared memory is determined at dispatch (compiling a shader doesn't allocate). Also, the dispatch count is variable since it's based on the end-user's hardware, and I guess I think of that as the top-level context parameter.
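For reference, here is a minimal WGSL illustration (my own, sizes and names made up) of what lives in that shared scope: a var<workgroup> array plus a workgroup-local atomic, visible to every invocation of one workgroup and to nothing outside it:

```ts
// Illustrative WGSL only (names and sizes made up): a per-workgroup shared array plus a
// workgroup-local atomic. Everything declared var<workgroup> is visible to all invocations
// of one workgroup (up to maxComputeInvocationsPerWorkgroup of them) and to nothing outside it.
const sharedMemWGSL = /* wgsl */ `
  var<workgroup> tile : array<f32, 64>; // counts against maxComputeWorkgroupStorageSize
  var<workgroup> hits : atomic<u32>;

  @group(0) @binding(0) var<storage, read_write> data : array<f32>;

  @compute @workgroup_size(64, 1, 1)
  fn main(@builtin(local_invocation_index) lid : u32,
          @builtin(global_invocation_id) gid : vec3<u32>) {
    tile[lid] = data[gid.x];
    atomicAdd(&hits, 1u);
    workgroupBarrier(); // sync before reading what neighboring invocations wrote
    data[gid.x] = tile[63u - lid];
  }`;
```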

I think SM is just the CUDA term for maxComputeInvocationsPerWorkgroup, or, like, how many compute units use the same memory stick. Can we restate your second bullet as "a compute shader shouldn't use more memory than
maxComputeWorkgroupStorageSize / (maxComputeInvocationsPerWorkgroup / (32 or 64))"?
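Plugging real device limits into that restated bullet could look like the sketch below; the 32-or-64 wave width is an assumption per vendor, since base WebGPU doesn't expose it:

```ts
// Sketch: shared-memory budget per the restated bullet. assumedWaveWidth is my guess
// (32 on NVIDIA, 64 on older AMD); base WebGPU only exposes the two limits used below.
function sharedMemoryBudgetBytes(device: GPUDevice, assumedWaveWidth: 32 | 64): number {
  const storage = device.limits.maxComputeWorkgroupStorageSize;        // default limit: 16384 bytes
  const invocations = device.limits.maxComputeInvocationsPerWorkgroup; // default limit: 256
  return storage / (invocations / assumedWaveWidth);
}
// With the default limits and a 32-wide wave: 16384 / (256 / 32) = 2048 bytes per shader.
```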

Finally, I think we should think about the problem in the context of the actual hardware layout. These diagrams that I copy-pasted from Google Images are a nice visual aid. I probably don't understand more than 1% of what these pictures imply, but it's still helpful to see different chip layouts and the physical access the compute units have to the same memory stick. This is my understanding of maxComputeInvocationsPerWorkgroup in picture form.

[image: nvidia data center gpus]

[image: nokia phone]

[image: amd]

[image: apple]

key takeaways (pending review!)

<insert notable scenario outcome>

(highly questionable hot take by me here)
If there is no shared memory besides atomics local to the workgroup, choose the lesser of maxComputeInvocationsPerWorkgroup and maxComputeWorkgroupSizeX as the 1D max workgroup_size and dispatch 1; the compiler should be able to optimize this.
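Stated as code, the hot take might look like this sketch (placeholder names; whether it actually wins is exactly what the scenarios above should test):

```ts
// Sketch of the hot take, not a proven best practice: largest legal 1D workgroup size,
// and a workgroup count that collapses to 1 whenever the whole job fits in one workgroup.
function pick1dDispatch(device: GPUDevice, totalInvocations: number) {
  const { maxComputeInvocationsPerWorkgroup, maxComputeWorkgroupSizeX } = device.limits;
  const workgroupSizeX = Math.min(maxComputeInvocationsPerWorkgroup, maxComputeWorkgroupSizeX);
  const workgroupCountX = Math.ceil(totalInvocations / workgroupSizeX);
  // workgroupSizeX still has to be baked into the WGSL, e.g. via an override-expression
  // in @workgroup_size, since dispatchWorkgroups only takes the workgroup *count*.
  return { workgroupSizeX, workgroupCountX };
}
```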

(restating @Evgeni_Popov’s last bullet)
Compute shader shouldn't use more memory than maxComputeWorkgroupStorageSize / (maxComputeInvocationsPerWorkgroup / (32 or 64)).

(webgpufundamentals)
Choose a workgroup size of 64 unless you have some specific reason to choose another size.

Uniformity analysis explanation:

@arcman7 Regarding texture vs storage: it seems that textures are always filtered, so image or storage access is faster. However, there is a Vulkan extension to read textures without sampling.

—-
Quick notes

An AMD dev advocate said a 32-wide wavefront is native on RDNA2 and 64 runs as 32x2 in lockstep. If the same holds for RDNA3, reevaluate the assumption that 64 is best, because AMD has two paths and a WebGPU driver could choose 64 and lose perf.

@Evgeni_Popov Around 23 mins she says 64x1 vs 8x8 can be significantly different because it allows different access patterns, which affects cache layout. At 48 mins she answers a question about the use case for a 3rd-dimension workgroup size. Lol, she answered the question and says it can impact performance, and he goes "hehe… why".
