I love how clickbait my title is
Anyway! Yesterday, I fell headfirst into a glorious rabbit hole. I wanted to test the new matrix multiplication algorithm discovered by Google’s AlphaEvolve. This shiny new algorithm uses only 48 multiplications, compared to 49 in Strassen’s classic method—and 64 in Babylon.js’s good ol’ matrix code (which I wrote 11 years ago, so go easy on me, alright?).
On paper, I was thrilled. We use an absurd number of matrix multiplications everywhere, and shaving off even a few per operation could be a massive win long-term.
So I whipped up a simple test comparing our naive implementation with Strassen’s—figuring I’d plug in AlphaEvolve’s version later, once I dig it out of their repo:
Babylon.js Playground
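If you want to poke at this outside the Playground, a minimal harness could look like the sketch below. To be clear, this is not the Playground's code: `multiplyNaive` and `multiplyStrassen` are stand-ins for the two kernels sketched further down in this post, and the flat 16-element layout is my own assumption.

```js
// Minimal timing harness for two 4x4 kernels operating on flat 16-element arrays.
function bench(label, multiply) {
  const a = Float32Array.from({ length: 16 }, () => Math.random());
  const b = Float32Array.from({ length: 16 }, () => Math.random());
  const out = new Float32Array(16);

  // Warm up so the JIT compiles the hot path before we start timing.
  for (let i = 0; i < 1e4; i++) multiply(a, b, out);

  const start = performance.now();
  for (let i = 0; i < 1e6; i++) multiply(a, b, out);
  console.log(`${label}: ${(performance.now() - start).toFixed(1)} ms`);
}

bench("naïve", multiplyNaive);
bench("Strassen", multiplyStrassen);
```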
The twist? Babylon.js was significantly faster than Strassen.
Wait, what? How is that even possible?
Here’s why:
| algorithm | multiplies | adds/subs |
|---|---|---|
| naïve 4 × 4 | 64 | 48 |
| Strassen (one level) | 56 | ≈ 100–120 |
One level of Strassen only saves 8 multiplications (−12.5 %) but more than doubles the number of additions and temporary values. (Applying Strassen recursively to the 2 × 2 blocks as well is what reaches the 49 multiplies mentioned above, at the cost of even more bookkeeping.)
On a modern CPU (or in a JS JIT), an `x + y` is practically the same cost as an `x * y`, so the “saving” is a wash, and all the extra adds, temporaries and memory traffic become pure overhead.
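You can roughly sanity-check that claim yourself. JS micro-benchmarks are notoriously fragile (the JIT may transform or fold the loop), so treat this sketch as indicative only:

```js
// Crude sanity check: time a hot loop of adds against the same loop of multiplies.
function time(label, op) {
  let acc = 1.000001;
  const start = performance.now();
  for (let i = 0; i < 1e8; i++) acc = op(acc, 1.0000001);
  // Log acc so the loop body can't be eliminated as dead code.
  console.log(label, (performance.now() - start).toFixed(1), "ms", acc);
}

time("add", (x, y) => x + y);
time("mul", (x, y) => x * y);
```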
Also!
Constant-factor overhead dominates
Strassen needs (see the sketch right after this list):
- Ten vector sums/differences (S1…S10).
- Seven internal 2 × 2 products.
- Four more vector combinations to build C.
- ~200 scalar loads/stores versus 128 in the naïve path.
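To make that bookkeeping concrete, here is a minimal sketch of one Strassen level in plain JS. This is my own illustrative code, not the Playground's: it assumes a flat row-major 4 × 4 layout, stores each 2 × 2 block as a 4-element array, and uses the S/P naming from CLRS.

```js
// 2x2 block helpers: each block is a plain 4-element array.
const add2 = (x, y) => [x[0] + y[0], x[1] + y[1], x[2] + y[2], x[3] + y[3]];
const sub2 = (x, y) => [x[0] - y[0], x[1] - y[1], x[2] - y[2], x[3] - y[3]];
// Naive 2x2 product: 8 multiplies, so 7 block products = 56 scalar multiplies.
const mul2 = (x, y) => [
  x[0] * y[0] + x[1] * y[2], x[0] * y[1] + x[1] * y[3],
  x[2] * y[0] + x[3] * y[2], x[2] * y[1] + x[3] * y[3],
];

// One level of Strassen on a 4x4 treated as a 2x2 grid of 2x2 blocks.
function multiplyStrassen(a, b, out) {
  // Gather the four 2x2 blocks of each operand.
  const blk = (m, r, c) =>
    [m[r * 4 + c], m[r * 4 + c + 1], m[(r + 1) * 4 + c], m[(r + 1) * 4 + c + 1]];
  const A11 = blk(a, 0, 0), A12 = blk(a, 0, 2), A21 = blk(a, 2, 0), A22 = blk(a, 2, 2);
  const B11 = blk(b, 0, 0), B12 = blk(b, 0, 2), B21 = blk(b, 2, 0), B22 = blk(b, 2, 2);

  // The ten sums/differences (S1…S10).
  const S1 = sub2(B12, B22), S2 = add2(A11, A12), S3 = add2(A21, A22);
  const S4 = sub2(B21, B11), S5 = add2(A11, A22), S6 = add2(B11, B22);
  const S7 = sub2(A12, A22), S8 = add2(B21, B22), S9 = sub2(A11, A21);
  const S10 = add2(B11, B12);

  // The seven block products.
  const P1 = mul2(A11, S1), P2 = mul2(S2, B22), P3 = mul2(S3, B11);
  const P4 = mul2(A22, S4), P5 = mul2(S5, S6), P6 = mul2(S7, S8);
  const P7 = mul2(S9, S10);

  // Four more combinations to assemble the result blocks.
  const C11 = add2(sub2(add2(P5, P4), P2), P6);
  const C12 = add2(P1, P2);
  const C21 = add2(P3, P4);
  const C22 = sub2(sub2(add2(P5, P1), P3), P7);

  // Scatter the blocks back into the flat 4x4 output.
  const put = (v, r, c) => {
    out[r * 4 + c] = v[0]; out[r * 4 + c + 1] = v[1];
    out[(r + 1) * 4 + c] = v[2]; out[(r + 1) * 4 + c + 1] = v[3];
  };
  put(C11, 0, 0); put(C12, 0, 2); put(C21, 2, 0); put(C22, 2, 2);
}
```

Count the short-lived arrays: eight input blocks, ten S's, seven P's and four C blocks, i.e. 29 allocations per multiply, versus zero for the straight-line kernel.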
All that shuffling trashes the CPU’s L1 cache and the JavaScript engine’s register allocator. The naïve version, by contrast, is one straight-line block that the JIT can happily keep in registers and even auto-vectorise (SIMD / FMA).
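For contrast, here is what such a straight-line kernel looks like. This is my own row-major sketch in the spirit of the Babylon.js code, not a copy of it: 64 multiplies, 48 adds, no temporaries at all.

```js
// Straight-line naive 4x4 multiply on flat row-major 16-element arrays.
// out[r*4+c] = sum over k of a[r*4+k] * b[k*4+c], fully unrolled.
function multiplyNaive(a, b, out) {
  out[0]  = a[0]  * b[0] + a[1]  * b[4] + a[2]  * b[8]  + a[3]  * b[12];
  out[1]  = a[0]  * b[1] + a[1]  * b[5] + a[2]  * b[9]  + a[3]  * b[13];
  out[2]  = a[0]  * b[2] + a[1]  * b[6] + a[2]  * b[10] + a[3]  * b[14];
  out[3]  = a[0]  * b[3] + a[1]  * b[7] + a[2]  * b[11] + a[3]  * b[15];
  out[4]  = a[4]  * b[0] + a[5]  * b[4] + a[6]  * b[8]  + a[7]  * b[12];
  out[5]  = a[4]  * b[1] + a[5]  * b[5] + a[6]  * b[9]  + a[7]  * b[13];
  out[6]  = a[4]  * b[2] + a[5]  * b[6] + a[6]  * b[10] + a[7]  * b[14];
  out[7]  = a[4]  * b[3] + a[5]  * b[7] + a[6]  * b[11] + a[7]  * b[15];
  out[8]  = a[8]  * b[0] + a[9]  * b[4] + a[10] * b[8]  + a[11] * b[12];
  out[9]  = a[8]  * b[1] + a[9]  * b[5] + a[10] * b[9]  + a[11] * b[13];
  out[10] = a[8]  * b[2] + a[9]  * b[6] + a[10] * b[10] + a[11] * b[14];
  out[11] = a[8]  * b[3] + a[9]  * b[7] + a[10] * b[11] + a[11] * b[15];
  out[12] = a[12] * b[0] + a[13] * b[4] + a[14] * b[8]  + a[15] * b[12];
  out[13] = a[12] * b[1] + a[13] * b[5] + a[14] * b[9]  + a[15] * b[13];
  out[14] = a[12] * b[2] + a[13] * b[6] + a[14] * b[10] + a[15] * b[14];
  out[15] = a[12] * b[3] + a[13] * b[7] + a[14] * b[11] + a[15] * b[15];
}
```

Every statement is a chain of `c += a * b`-shaped work on values the engine already has at hand, which is exactly the shape the next point is about.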
JIT & SIMD friendliness
- Naïve: tight, predictable access pattern → easy for V8/SpiderMonkey to hoist constants, unroll, and emit SIMD (FMA on x86: `c += a * b` in a single instruction).
- Strassen: scattered temporaries, larger code size → less chance of SIMD, and it often falls back to plain scalar ops.
TL;DR
Strassen is an asymptotic win, not a micro-optimisation. On a 4 × 4 matrix it:
- Saves only eight multiplies.
- Adds ~70 more additions and loads/stores.
- Torpedoes SIMD/JIT optimisations.
So the plain old “schoolbook” kernel stomps it. Congratulations to my naive self!