Faster convolution or filter PostProcess?

I have worked out the equations for converting a convolution kernel to require fewer texture samples, taking advantage of (and requiring) bilinear sampling. For a linear kernel (single row or column), it reduces the texture sample count from kernel_length to (kernel_length + 1) / 2.
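For two adjacent same-signed taps, the merge works out like this (a minimal sketch; `mergeTaps` and `BilinearTap` are my own illustrative names, not proposed API):

```typescript
interface BilinearTap {
  offset: number; // sample position in texels, relative to the kernel center
  weight: number; // coefficient applied to the fetched value
}

// Combine two adjacent taps (weight w0 at texel x0, weight w1 at texel x0 + 1)
// into one bilinear sample. The hardware returns (1 - t) * texel0 + t * texel1
// for a sample at x0 + t, so choosing t = w1 / (w0 + w1) and scaling the
// result by (w0 + w1) reproduces both taps exactly.
function mergeTaps(x0: number, w0: number, w1: number): BilinearTap {
  const weight = w0 + w1;
  return { offset: x0 + w1 / weight, weight };
}
```

Pairing the taps of an odd-length kernel this way leaves one unpaired element, which is where the (kernel_length + 1) / 2 count comes from.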

I have also worked out the equations for a 3x3 convolution filter to require 5 texture samples instead of 9.

To minimize the number of repeated identical calculations within the shader, the raw input is a set of sample offsets (vec2 array) and a set of coefficients (float array). The offsets and coefficients would be calculated a single time, in TypeScript/JavaScript, during filter construction.
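For a 1D (single-row) kernel, that construction-time step might look roughly like this (a sketch under my own assumptions; `computeSamples` is a hypothetical name, and offsets are in texels, to be scaled by 1 / textureWidth before upload):

```typescript
// Precompute bilinear sample offsets and coefficients for a 1D kernel.
// Adjacent same-signed elements merge into one sample; oppositely-signed
// neighbors fall back to two samples; all-zero pairs are skipped entirely.
function computeSamples(kernel: number[]): { offsets: number[]; weights: number[] } {
  const offsets: number[] = [];
  const weights: number[] = [];
  const center = (kernel.length - 1) / 2;
  for (let i = 0; i < kernel.length; i += 2) {
    const w0 = kernel[i];
    const w1 = i + 1 < kernel.length ? kernel[i + 1] : 0;
    if (w0 === 0 && w1 === 0) continue; // nothing to fetch for this pair
    if (w0 === 0 || w1 === 0 || Math.sign(w0) === Math.sign(w1)) {
      // One bilinear sample placed between the two texels reproduces both.
      const w = w0 + w1;
      offsets.push(i - center + w1 / w);
      weights.push(w);
    } else {
      // Bilinear interpolation weights are non-negative, so oppositely-signed
      // neighbors need two separate texel-centered samples.
      offsets.push(i - center);
      weights.push(w0);
      offsets.push(i + 1 - center);
      weights.push(w1);
    }
  }
  return { offsets, weights };
}
```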

Each filter needing a different number of samples would be created with its own DEFINEs. I think a 5-sample 1x9 kernel could use the same DEFINEs as a 3x3 filter.
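As a sketch of the DEFINEs idea (the shader body and uniform names below are illustrative, not existing Babylon.js code), the sample count would be baked in as the loop bound, so any two kernels that reduce to the same count could share the compiled shader:

```typescript
// Build a fragment shader whose loop count is a compile-time constant.
const buildFragmentSource = (sampleCount: number): string => `
#define SAMPLE_COUNT ${sampleCount}
uniform sampler2D textureSampler;
uniform vec2 sampleOffsets[SAMPLE_COUNT];
uniform float sampleWeights[SAMPLE_COUNT];
varying vec2 vUV;
void main(void) {
  vec4 result = vec4(0.0);
  for (int i = 0; i < SAMPLE_COUNT; i++) {
    result += texture2D(textureSampler, vUV + sampleOffsets[i]) * sampleWeights[i];
  }
  gl_FragColor = result;
}
`;
```

A 5-sample 1x9 kernel and a 5-sample 3x3 kernel would both compile with `#define SAMPLE_COUNT 5`.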

I’m not sure how it would interact with trilinear sampling, mipmaps, lod, etc.

I think it would speed up any filter using convolution.

Is there any interest in testing a custom postprocessor (not yet written)?

IIRC there is already a reduced number of taps for blur in the engine. I’m sure @sebavan remembers.

This is what we use for our shadows for instance :slight_smile: Babylon.js/packages/dev/core/src/Shaders/ShadersInclude/shadowsFragmentFunctions.fx at master · BabylonJS/Babylon.js · GitHub

That’s awesome! I am not improving shadowsFragment. I saw the webpage referenced in the source code, but I am using a different technique. In the generalized case, kernel elements have no pre-determined relationship. Applying bilinear sampling in both directions relies on the 4 surrounding elements participating in each bilinear sample having proportional ratios: the left-to-right ratios are the same (or close enough) on both the top and bottom rows, and the top-to-bottom ratios are the same in the left and right columns. The blur algorithm also ignores the outermost kernel elements because they have minimal impact on the output. And because of the symmetry of the algorithm, it doesn’t need a specific singular center sample.
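That proportionality condition can be stated compactly: a 2x2 block of weights [[w00, w01], [w10, w11]] collapses to one bilinear sample only when w00·w11 = w01·w10 (and all four weights share a sign). A sketch of that check, with my own names:

```typescript
// True when a 2x2 block of same-signed kernel weights can be fetched with a
// single bilinear sample: equal left-to-right ratios on the top and bottom
// rows imply equal top-to-bottom ratios in both columns, i.e. the cross
// products w00*w11 and w01*w10 must match.
function canMergeQuad(w00: number, w01: number, w10: number, w11: number): boolean {
  return Math.abs(w00 * w11 - w01 * w10) < 1e-6; // "close enough", per above
}
```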

My algorithm is different in that:

  • It handles a general kernel with any coefficients.
  • It pre-calculates the samples required based on the specified kernel values.
  • It only groups elements pairwise (not in groups of 4).
  • It can reproduce the requested kernel to within numeric precision.
  • Because of limitations of bilinear sampling, adjacent oppositely-signed elements convert to 2 samples instead of 1.
  • Because the loop count must be a compile-time constant, changing a kernel may require a shader recompile.
  • Kernel element pairs that are both zero are not sampled at all.
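The opposite-sign limitation follows from the pairwise merge: the interpolation factor t = w1 / (w0 + w1) must lie in [0, 1] for the hardware to blend only the two intended texels, and that fails when the signs differ. A quick check (hypothetical helper name):

```typescript
// Two adjacent kernel elements can share one bilinear sample only if the
// interpolation factor w1 / (w0 + w1) stays within [0, 1], which holds
// exactly when w0 and w1 do not have opposite signs.
function canMergePair(w0: number, w1: number): boolean {
  const w = w0 + w1;
  if (w === 0) return false; // degenerate case, e.g. w0 === -w1
  const t = w1 / w;
  return t >= 0 && t <= 1;
}
```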

I should probably create and test a post processor implementation soon. I only discovered the “opposite sign” issue in the last week. I hope there aren’t more errors in my assumptions or calculations.