I love how clickbait my title is
Anyway! Yesterday, I fell headfirst into a glorious rabbit hole. I wanted to test the new matrix multiplication algorithm discovered by Google’s AlphaEvolve. This shiny new algorithm uses only 48 multiplications, compared to 49 in Strassen’s classic method—and 64 in Babylon.js’s good ol’ matrix code (which I wrote 11 years ago, so go easy on me, alright?).
On paper, I was thrilled. We use an absurd number of matrix multiplications everywhere, and shaving off even a few per operation could be a massive win long-term.
So I whipped up a simple test comparing our naive implementation with Strassen’s—figuring I’d plug in AlphaEvolve’s version later, once I dig it out of their repo:
Babylon.js Playground
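If you want to poke at this outside the Playground, a minimal harness could look like the sketch below. To be clear, this is not the Playground's code: `multiplyNaive` and `multiplyStrassen` are stand-ins for the two kernels sketched further down in this post, and the flat 16-element layout is my own assumption.

```js
// Minimal timing harness for two 4x4 kernels operating on flat 16-element arrays.
function bench(label, multiply) {
  const a = Float32Array.from({ length: 16 }, () => Math.random());
  const b = Float32Array.from({ length: 16 }, () => Math.random());
  const out = new Float32Array(16);

  // Warm up so the JIT compiles the hot path before we start timing.
  for (let i = 0; i < 1e4; i++) multiply(a, b, out);

  const start = performance.now();
  for (let i = 0; i < 1e6; i++) multiply(a, b, out);
  console.log(`${label}: ${(performance.now() - start).toFixed(1)} ms`);
}

bench("naïve", multiplyNaive);
bench("Strassen", multiplyStrassen);
```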
The twist? Babylon.js was significantly faster than Strassen.
Wait, what? How is that even possible?
Here’s why:
| algorithm | multiplies | adds/subs |
|---|---|---|
| naïve 4 × 4 | 64 | 48 |
| Strassen (one level) | 56 | ≈ 100–120 |
One level of Strassen only saves 8 multiplications (−12.5 %) but more than doubles the number of additions and temporary values. (Applying Strassen recursively to the 2 × 2 blocks as well is what reaches the 49 multiplies mentioned above, at the cost of even more bookkeeping.)
On a modern CPU (or in a JS JIT), an `x + y` is practically the same cost as an `x * y`, so the “saving” is a wash, and all the extra adds, temporaries and memory traffic become pure overhead.
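You can roughly sanity-check that claim yourself. JS micro-benchmarks are notoriously fragile (the JIT may transform or fold the loop), so treat this sketch as indicative only:

```js
// Crude sanity check: time a hot loop of adds against the same loop of multiplies.
function time(label, op) {
  let acc = 1.000001;
  const start = performance.now();
  for (let i = 0; i < 1e8; i++) acc = op(acc, 1.0000001);
  // Log acc so the loop body can't be eliminated as dead code.
  console.log(label, (performance.now() - start).toFixed(1), "ms", acc);
}

time("add", (x, y) => x + y);
time("mul", (x, y) => x * y);
```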
Also!
Constant-factor overhead dominates
Strassen needs (see the sketch right after this list):
- Ten vector sums/differences (S1…S10).
- Seven internal 2 × 2 products.
- Four more vector combinations to build C.
- ~200 scalar loads/stores versus 128 in the naïve path.
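To make that bookkeeping concrete, here is a minimal sketch of one Strassen level in plain JS. This is my own illustrative code, not the Playground's: it assumes a flat row-major 4 × 4 layout, stores each 2 × 2 block as a 4-element array, and uses the S/P naming from CLRS.

```js
// 2x2 block helpers: each block is a plain 4-element array.
const add2 = (x, y) => [x[0] + y[0], x[1] + y[1], x[2] + y[2], x[3] + y[3]];
const sub2 = (x, y) => [x[0] - y[0], x[1] - y[1], x[2] - y[2], x[3] - y[3]];
// Naive 2x2 product: 8 multiplies, so 7 block products = 56 scalar multiplies.
const mul2 = (x, y) => [
  x[0] * y[0] + x[1] * y[2], x[0] * y[1] + x[1] * y[3],
  x[2] * y[0] + x[3] * y[2], x[2] * y[1] + x[3] * y[3],
];

// One level of Strassen on a 4x4 treated as a 2x2 grid of 2x2 blocks.
function multiplyStrassen(a, b, out) {
  // Gather the four 2x2 blocks of each operand.
  const blk = (m, r, c) =>
    [m[r * 4 + c], m[r * 4 + c + 1], m[(r + 1) * 4 + c], m[(r + 1) * 4 + c + 1]];
  const A11 = blk(a, 0, 0), A12 = blk(a, 0, 2), A21 = blk(a, 2, 0), A22 = blk(a, 2, 2);
  const B11 = blk(b, 0, 0), B12 = blk(b, 0, 2), B21 = blk(b, 2, 0), B22 = blk(b, 2, 2);

  // The ten sums/differences (S1…S10).
  const S1 = sub2(B12, B22), S2 = add2(A11, A12), S3 = add2(A21, A22);
  const S4 = sub2(B21, B11), S5 = add2(A11, A22), S6 = add2(B11, B22);
  const S7 = sub2(A12, A22), S8 = add2(B21, B22), S9 = sub2(A11, A21);
  const S10 = add2(B11, B12);

  // The seven block products.
  const P1 = mul2(A11, S1), P2 = mul2(S2, B22), P3 = mul2(S3, B11);
  const P4 = mul2(A22, S4), P5 = mul2(S5, S6), P6 = mul2(S7, S8);
  const P7 = mul2(S9, S10);

  // Four more combinations to assemble the result blocks.
  const C11 = add2(sub2(add2(P5, P4), P2), P6);
  const C12 = add2(P1, P2);
  const C21 = add2(P3, P4);
  const C22 = sub2(sub2(add2(P5, P1), P3), P7);

  // Scatter the blocks back into the flat 4x4 output.
  const put = (v, r, c) => {
    out[r * 4 + c] = v[0]; out[r * 4 + c + 1] = v[1];
    out[(r + 1) * 4 + c] = v[2]; out[(r + 1) * 4 + c + 1] = v[3];
  };
  put(C11, 0, 0); put(C12, 0, 2); put(C21, 2, 0); put(C22, 2, 2);
}
```

Count the short-lived arrays: eight input blocks, ten S's, seven P's and four C blocks, i.e. 29 allocations per multiply, versus zero for the straight-line kernel.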
All that shuffling trashes the CPU’s L1 cache and the JavaScript engine’s register allocator. The naïve version, by contrast, is one straight-line block that the JIT can happily keep in registers and even auto-vectorise (SIMD / FMA).
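For contrast, here is what such a straight-line kernel looks like. This is my own row-major sketch in the spirit of the Babylon.js code, not a copy of it: 64 multiplies, 48 adds, no temporaries at all.

```js
// Straight-line naive 4x4 multiply on flat row-major 16-element arrays.
// out[r*4+c] = sum over k of a[r*4+k] * b[k*4+c], fully unrolled.
function multiplyNaive(a, b, out) {
  out[0]  = a[0]  * b[0] + a[1]  * b[4] + a[2]  * b[8]  + a[3]  * b[12];
  out[1]  = a[0]  * b[1] + a[1]  * b[5] + a[2]  * b[9]  + a[3]  * b[13];
  out[2]  = a[0]  * b[2] + a[1]  * b[6] + a[2]  * b[10] + a[3]  * b[14];
  out[3]  = a[0]  * b[3] + a[1]  * b[7] + a[2]  * b[11] + a[3]  * b[15];
  out[4]  = a[4]  * b[0] + a[5]  * b[4] + a[6]  * b[8]  + a[7]  * b[12];
  out[5]  = a[4]  * b[1] + a[5]  * b[5] + a[6]  * b[9]  + a[7]  * b[13];
  out[6]  = a[4]  * b[2] + a[5]  * b[6] + a[6]  * b[10] + a[7]  * b[14];
  out[7]  = a[4]  * b[3] + a[5]  * b[7] + a[6]  * b[11] + a[7]  * b[15];
  out[8]  = a[8]  * b[0] + a[9]  * b[4] + a[10] * b[8]  + a[11] * b[12];
  out[9]  = a[8]  * b[1] + a[9]  * b[5] + a[10] * b[9]  + a[11] * b[13];
  out[10] = a[8]  * b[2] + a[9]  * b[6] + a[10] * b[10] + a[11] * b[14];
  out[11] = a[8]  * b[3] + a[9]  * b[7] + a[10] * b[11] + a[11] * b[15];
  out[12] = a[12] * b[0] + a[13] * b[4] + a[14] * b[8]  + a[15] * b[12];
  out[13] = a[12] * b[1] + a[13] * b[5] + a[14] * b[9]  + a[15] * b[13];
  out[14] = a[12] * b[2] + a[13] * b[6] + a[14] * b[10] + a[15] * b[14];
  out[15] = a[12] * b[3] + a[13] * b[7] + a[14] * b[11] + a[15] * b[15];
}
```

Every statement is a chain of `c += a * b`-shaped work on values the engine already has at hand, which is exactly the shape the next point is about.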
JIT & SIMD friendliness
- Naïve: tight, predictable access pattern → easy for V8/SpiderMonkey to hoist constants, unroll, and emit SIMD (FMA on x86: `c += a * b` in a single instruction).
- Strassen: scattered temporaries, larger code size → less chance of SIMD, and it often falls back to plain scalar ops.
TL;DR
Strassen is an asymptotic win, not a micro-optimisation. On a 4 × 4 matrix it:
- Saves only eight multiplies.
- Adds ~70 more additions and loads/stores.
- Torpedoes SIMD/JIT optimisations.
So the plain old “schoolbook” kernel stomps it. Congratulations to my naive self!