I’m writing some tests to find out, but I guess it will be necessary to test on a variety of machines to get a useful understanding. Graphics is hard.
Btw, I’ve been focusing on native stuff the past couple weeks, but with this in mind. I want to set my foundation in native and translate it to web instead of the other way around. I’ve felt so limited with WebGL and WebGPU in regards to my progress with graphics and programming in general, and I’m working towards rectifying that.

Also, I remember you expressing interest in mesh/geometry shaders. I’ve been thinking that as part of the efforts here, we could work on a “ShaderGeometry” class. It’s 100% possible to emulate mesh shaders in compute shaders on WebGPU, but performance is going to be the limiting factor. I’m thinking that’s an actually useful thing we could do instead of some arbitrary tests.

Also, fwiw, we’re not limited to GLSL; we can use HLSL if we want and recompile the Tint WASM module to support both HLSL and GLSL for Babylon’s WGSL shader store. My reasoning is that WGSL only offers three shader stages (vertex, fragment, and compute), which doesn’t give us the opportunity to learn the native pipeline, whereas with HLSL or GLSL we can just do some conditional compilation and not arbitrarily limit ourselves. Using WGSL just seems illogical to me. If I were already a CG master who knew Vulkan 1.3 and DX12, and had probably worked at NVIDIA or AMD, sure, I’d use WGSL because it’s easier, but that’s not the case for me.
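To sketch the rough shape of what I mean by emulation (everything below is hypothetical, none of it is an existing Babylon API): a compute pass stands in for the mesh shader by expanding geometry into a storage buffer and bumping an indirect-draw argument buffer, and a normal render pass then consumes it via drawIndirect():

```ts
// Hypothetical sketch of mesh-shader emulation via compute + indirect draw.
const expandWGSL = /* wgsl */ `
struct DrawArgs {
  vertexCount: atomic<u32>, // bumped by the compute pass
  instanceCount: u32,       // initialized to 1 from JS
  firstVertex: u32,
  firstInstance: u32,
}
@group(0) @binding(0) var<storage, read_write> outPositions: array<vec4f>;
@group(0) @binding(1) var<storage, read_write> args: DrawArgs;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3u) {
  // The "mesh shader" work: each thread emits one triangle. Reserve three
  // output slots atomically, then write the expanded vertices.
  let slot = atomicAdd(&args.vertexCount, 3u);
  let x = f32(id.x);
  outPositions[slot + 0u] = vec4f(x, 0.0, 0.0, 1.0);
  outPositions[slot + 1u] = vec4f(x + 1.0, 0.0, 0.0, 1.0);
  outPositions[slot + 2u] = vec4f(x, 1.0, 0.0, 1.0);
}`;
// The render pass reads outPositions in its vertex shader and calls
// renderPass.drawIndirect(argsBuffer, 0) instead of a fixed-count draw.
```

The performance question is exactly whether that atomic/storage-buffer round trip can get anywhere near real mesh shaders.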
Compiling shaders is only a part of the problem. If you want to use geometry, tessellation, mesh, … shaders, you will also need some plumbing around the shaders (a bit like ComputePipeline for compute shaders), and only the browser could provide it through specific APIs. The alternative is to write all your code in native (C++, rust, …) and translate it to WASM for js consumption.
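For reference, here’s roughly what that plumbing already looks like for compute in Babylon (a minimal sketch; the shader body and binding names are made up):

```ts
// Minimal sketch of Babylon's existing compute plumbing.
const cs = new BABYLON.ComputeShader(
  "fill", engine,
  { computeSource: `
    @group(0) @binding(0) var dest: texture_storage_2d<rgba8unorm, write>;
    @compute @workgroup_size(8, 8)
    fn main(@builtin(global_invocation_id) id: vec3u) {
      textureStore(dest, vec2i(id.xy), vec4f(1.0, 0.0, 0.0, 1.0));
    }` },
  { bindingsMapping: { "dest": { group: 0, binding: 0 } } }
);
cs.setStorageTexture("dest", destTexture);
cs.dispatch(Math.ceil(width / 8), Math.ceil(height / 8), 1);
```

Geometry/tessellation/mesh stages would need an equivalent wrapper, which doesn’t exist because the underlying browser API doesn’t expose those stages.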
Yeah, I think that’s super obvious, but maybe not to others, idk. We will have to fully reimplement emulated features through a compute pipeline. But that is what I want to do. I want to write the native, non-browser-compatible feature first, without regard to a browser, then completely rewrite it as a compute pipeline with WebGPU in mind. Compiling to WASM isn’t an option: Emscripten doesn’t polyfill the native DX12 or Vulkan APIs to WebGPU; it just does basic stuff like shimming GLFW and SDL to window and Web Audio. I’m sure you know, but I guess a lot of people probably don’t.
To illustrate how hard it can be to optimize compute shaders, here’s a paper from Nvidia:
In the end, they achieve a 30x performance boost compared to the first straightforward implementation!
I haven’t completely digested this yet; are there any common or distinct patterns or practices that stood out to you?
Actually, each new version of the algorithm addresses a specific problem, which you probably can/will encounter in other compute shaders (or at least some of them):
- v2: addresses divergent branches
- v3: addresses shared memory bank conflicts
- v4: addresses idle threads
- etc
The hard part, I think, is identifying these problems in your own shaders and being able to fix them…
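For example, here’s the v1 divergence problem and the eventual fix, transposed to WGSL (the paper’s code is CUDA; this is my approximation, using the sequential addressing from v3, which fixes both the v1 divergence and the v2 bank conflicts):

```ts
// Workgroup reduction sketch in WGSL (approximation of the paper's CUDA).
const reduceWGSL = /* wgsl */ `
var<workgroup> sdata: array<f32, 256>;

@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> partialSums: array<f32>;

@compute @workgroup_size(256)
fn main(@builtin(local_invocation_id) lid: vec3u,
        @builtin(workgroup_id) wid: vec3u,
        @builtin(global_invocation_id) gid: vec3u) {
  sdata[lid.x] = input[gid.x];
  workgroupBarrier();

  // v1 (divergent): if (lid.x % (2u * s) == 0u) keeps a scattered subset
  // of each subgroup active at every step, so no subgroup ever retires.
  // v3 (below): halve a contiguous active range instead, so whole
  // subgroups go idle together and can be skipped by the hardware.
  for (var s = 128u; s > 0u; s = s >> 1u) {
    if (lid.x < s) {
      sdata[lid.x] = sdata[lid.x] + sdata[lid.x + s];
    }
    workgroupBarrier();
  }
  if (lid.x == 0u) { partialSums[wid.x] = sdata[0]; }
}`;
```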
> - v4: addresses idle threads
This one I find particularly hard to think about. On the one hand, surely it’s more efficient to squeeze more tasks into a single compute shader and do more with each kernel, since each job dispatch has some auxiliary cost:
- Uploading uniforms
- Time to set up the command buffer
- Dispatching the jobs
Also, if your hardware can’t support that many jobs executing simultaneously, then it feels natural to try and get more use out of each job.
BUT if that extra work involves conditionals, then it’s very likely I have threads idling. This feels like some sort of Tetris game where I’m trying to fit the various jobs’ execution together in the most compact way. I’m picturing idle threads as empty space in the Tetris pile where shapes don’t fit well together.
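One middle ground I’ve been looking at: instead of fusing everything into one kernel, record all the small dispatches into a single command buffer and submit once, so the fixed costs are paid once rather than per job (a sketch; the pipeline and bind groups are assumed to already exist):

```ts
// Sketch (pipeline/bind groups assumed): amortize per-dispatch overhead by
// recording every job into one command buffer and submitting once.
const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(pipeline);
for (const job of jobs) {
  pass.setBindGroup(0, job.bindGroup);       // uniforms uploaded up front
  pass.dispatchWorkgroups(job.workgroupCount);
}
pass.end();
device.queue.submit([encoder.finish()]);     // one submit for N jobs
```

That keeps each kernel simple (fewer conditionals, so fewer idle threads) while only paying the command-buffer setup once.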
Yes, those levels of optimization are complete madness.
@Evgeni_Popov btw, I noticed that in this latest Chrome update, the uniformity analysis requirement has been turned on. I found some places in my code where texture map sampling was being done conditionally, and moved all texture sampling outside of the conditionals. The warnings are gone and it runs as normal.
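Concretely, the change looked like this (a simplified sketch; the names are made up). textureSample needs implicit derivatives, so it has to sit in uniform control flow; sampling unconditionally and choosing the result afterwards satisfies the analysis:

```ts
// Before: sampling inside a branch that varies per fragment gets flagged.
const before = /* wgsl */ `
  var color = vec4f(0.0);
  if (vUv.x > 0.5) {                                   // non-uniform
    color = textureSample(detailTex, detailSamp, vUv); // warning here
  }
`;
// After: sample unconditionally, then pick the result with select().
const after = /* wgsl */ `
  let detail = textureSample(detailTex, detailSamp, vUv);
  let color = select(vec4f(0.0), detail, vUv.x > 0.5);
`;
```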
What I still can’t figure out is the root cause behind those bumps I mentioned here. It’s not a problem caused by non-uniform operations on the GPU, which means the same GPU operations and draw calls are happening on the M1 Mac, my pre-M1 Mac, and a Windows machine, but the way those operations get executed is somehow different on the M1’s hardware and drivers.
In fact, uniformity analysis was already enabled, but they added some more checks in the latest version / fixed some bugs where uniformity analysis errors weren’t reported properly in previous versions.
Regarding your bug, have you tested by forcing WebGL1 (if it’s currently running in WebGL2 mode)? Without a repro, though, it’s going to be hard to help more, unfortunately.
No, I haven’t tested WebGL1 yet; I’ve only tested on WebGL2 and WebGPU. And yeah, I haven’t been able to repro it yet, but I think I have some ideas now that might let me set up a playground that displays the issue.
@Evgeni_Popov I’ve been testing a number of things to see where the differences in M1 behavior start. One of those things was to see if texture sampling would look any different on my Windows machine vs the M1 Mac, but I got the same results:
The bottom portion of the data is the height map data as read from JavaScript, where I take a height map image and parse each pixel to get the height value encoded in the RGBA data. I read in the image using just the canvas API.
The top portion of the data is that same height map image, but read from a texture sampler in WGSL. You can see that they don’t agree about the edges of the height map, and even on the inner portions, the WGSL texture sampler always seems to be off by about 3 or 4. I’m not sure why. Is this something to do with the texture sampler using some sort of interpolation?
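For reference, the JavaScript side does something along these lines (a sketch, not my exact code; I’m assuming here that the height sits in the red channel):

```ts
// Sketch of the CPU-side read: draw the height map into a canvas, then
// decode a height from each pixel's RGBA bytes.
async function readHeights(url: string): Promise<Float32Array> {
  const img = await createImageBitmap(await (await fetch(url)).blob());
  const canvas = document.createElement("canvas");
  canvas.width = img.width;
  canvas.height = img.height;
  const ctx = canvas.getContext("2d")!;
  ctx.drawImage(img, 0, 0);
  const { data } = ctx.getImageData(0, 0, img.width, img.height);
  const heights = new Float32Array(img.width * img.height);
  for (let i = 0; i < heights.length; i++) {
    heights[i] = data[i * 4]; // red channel (encoding assumed)
  }
  return heights;
}
```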
Yes, it’s probably the filtering method. Try using BABYLON.Constants.TEXTURE_NEAREST_SAMPLINGMODE for the sampling mode of the texture.
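For example (the texture creation details below are just placeholders):

```ts
// Set nearest sampling at creation time...
const heightTex = new BABYLON.Texture(
  "textures/heightmap.png",    // placeholder path
  scene,
  true,                        // noMipmap
  false,                       // invertY
  BABYLON.Constants.TEXTURE_NEAREST_SAMPLINGMODE
);
// ...or switch an existing texture over:
heightTex.updateSamplingMode(BABYLON.Constants.TEXTURE_NEAREST_SAMPLINGMODE);
```

With nearest sampling, each lookup returns a single texel instead of a bilinear blend of the 4 surrounding texels, which is likely what was producing the small offsets.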
That did it! The numbers I see when running that same test are now identical. Thank you! I’m still putting together what I’ve learned here… I’ll tag you in the other thread I originally started about the M1 rendering bug if I find a way to repro it or even solve it.