Reducing VRAM usage for instances and thin instances

Currently, instances and thin instances use 4 vec4 buffers for the instanced world matrices.
But since the matrices are composed from TRS, or multiplied with other TRS matrices, their last row should always be 0, 0, 0, 1, and this last row is also transferred to the GPU.
If this last row were skipped and reconstructed on the GPU side, ~25% of the VRAM used by instancing could be saved (12 floats per instance instead of 16).
Regarding performance: since OpenGL matrices are column-major, removing the last row would reduce the chance of auto-vectorization by JS engines, but I cannot tell the exact impact unless some benchmarks are made.
For public APIs like thinInstanceSetBuffer, more copying could be needed to strip every 4 vec4 down to 4 vec3 (repacked as 3 vec4 attributes; see the sketch after the shader snippet below).
I know this could break custom shaders with customized handling of instances, and I don't expect this to land right now; I'm just leaving it as an open discussion, so feel free to move it to the "correct" category if I missed something.
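
A rough sketch of the reconstruction in the vertex shader (three vec4 attributes instead of four):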

attribute vec4 world0;
attribute vec4 world1;
attribute vec4 world2;

void main(void) {
    // Rebuild the four columns of the world matrix from the 12 packed floats;
    // the constant 0, 0, 0, 1 components are re-added here.
    vec4 instance0 = vec4(world0.xyz, 0.0);
    vec4 instance1 = vec4(world0.w, world1.xy, 0.0);
    vec4 instance2 = vec4(world1.zw, world2.x, 0.0);
    vec4 instance3 = vec4(world2.yzw, 1.0);
    mat4 instanceWorld = mat4(instance0, instance1, instance2, instance3);
}

It is an interesting idea. I agree that VRAM would be saved, but I'm wondering at what cost from the rendering standpoint, as we need to reconstruct the matrix every frame for all vertices.

I'm not skilled at measuring performance on the GPU, but some AI says “6-10 more instructions per vertex”.

Yeah.. this is where I'm a bit defensive of our approach. Not sure it is worth the cost.

Are we suffering from limited VRAM? Maybe in your use cases?

Mostly on mobile, especially old iOS devices. I can't even measure VRAM usage when Safari crashes.