Summary
Reduce the CPU cost of SphericalPolynomial generation by reading from a lower mip level of the cubemap instead of the full-resolution mip 0. For the synchronous CPU-data path, apply a box-filter downsample before integration. The optimization is opt-in via a static flag on CubeMapToSphericalPolynomialTools, defaulting to false (current behavior preserved).
This proposal was drafted with AI assistance, but the underlying idea comes from a human.
Motivation
CubeMapToSphericalPolynomialTools.ConvertCubeMapTextureToSphericalPolynomial is one of the most expensive synchronous CPU operations triggered during texture load (EquiRectangularCubeTexture and HDRCubeTexture). It iterates every texel of all six cubemap faces, performing four atan2 calls (solid angle), three Math.pow calls (gamma correction), nine SH basis evaluations, and vector accumulation per texel. For a 2048x2048 cubemap this is ~25 million texels and ~100 million atan2 calls.
The GPU readback path compounds this: it calls flushFramebuffer() (pipeline stall) then readPixels on 6 full-resolution faces (384 MB for 2048 RGBA float), transferring data that will be reduced to 27 floats.
The output — 3rd-order spherical harmonics (9 RGB coefficients) — has an angular resolution of ~90 degrees. A 32x32 or 64x64 cubemap face contains more than enough information to produce near-identical SH3 coefficients; everything above ~16x16 per face is oversampling for this purpose.
Approximate impact (single-threaded)
| Face size | Texels (6 faces) | Time estimate |
|---|---|---|
| 2048 | 25M | ~300 ms |
| 1024 | 6.3M | ~80 ms |
| 64 | 24K | <1 ms |
| 32 | 6K | <0.3 ms |
Design
Option flag
```typescript
// cubemapToSphericalPolynomial.ts
export class CubeMapToSphericalPolynomialTools {
    /**
     * When true, the SH projection reads from a lower mip level (GPU path)
     * or downsamples the input data (CPU path) to reduce computation cost.
     * The target face size is controlled by SH_INTEGRATION_TARGET_SIZE.
     *
     * Default: false (preserves existing full-resolution behavior).
     */
    public static USE_DOWNSAMPLED_SH_PROJECTION = false;

    /**
     * Target face resolution (per side) when USE_DOWNSAMPLED_SH_PROJECTION is true.
     * Values of 32-64 are recommended. Must be a power of two.
     * Ignored when USE_DOWNSAMPLED_SH_PROJECTION is false.
     *
     * Default: 64.
     */
    public static SH_INTEGRATION_TARGET_SIZE = 64;
}
```
A static flag on CubeMapToSphericalPolynomialTools is chosen over a scene-level option because:
- The tool class already carries static config (`MAX_HDRI_VALUE`, `PRESERVE_CLAMPED_COLORS`)
- SH projection is a global math concern, not scene-specific
- Keeps the surface area small: one place to set, applies everywhere
Path 1: GPU readback (ConvertCubeMapTextureToSphericalPolynomial)
When USE_DOWNSAMPLED_SH_PROJECTION is true and the texture has mipmaps (generateMipMaps === true):
- Compute the mip level that brings each face closest to `SH_INTEGRATION_TARGET_SIZE`:
  ```typescript
  const level = Math.max(0, Math.round(Math.log2(size / targetSize)));
  const mipSize = Math.max(1, size >> level);
  ```
- Pass `level` to the existing `readPixels(faceIndex, level, ...)` calls.
- Use `mipSize` as `cubeInfo.size`.
When the flag is true but the texture has no mipmaps, fall back to full-resolution (current behavior) with no change.
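The mip-selection step can be sketched as a small helper (illustrative only; `selectShMipLevel` is not an existing Babylon.js function):

```typescript
// Pick the mip level whose face size is closest (in log2 space) to the
// target, clamped so textures at or below the target stay at mip 0.
function selectShMipLevel(faceSize: number, targetSize: number): { level: number; mipSize: number } {
    const level = Math.max(0, Math.round(Math.log2(faceSize / targetSize)));
    const mipSize = Math.max(1, faceSize >> level);
    return { level, mipSize };
}
```

For a 2048 face with the default target of 64 this selects mip 5; a 32 face stays at mip 0 with its native size.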
Path 2: CPU-data sync path (ConvertCubeMapToSphericalPolynomial)
When USE_DOWNSAMPLED_SH_PROJECTION is true and cubeInfo.size > targetSize:
- Downsample each face array using a simple box filter before entering the integration loop.
- A private static helper `_DownsampleFace(data, srcSize, dstSize, stride): Float32Array` performs the downsample: it averages non-overlapping `(ratio × ratio)` blocks. One pass, no allocations beyond the output array.
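A minimal sketch of that box-filter helper (the actual `_DownsampleFace` implementation may differ; this assumes `srcSize` is an integer multiple of `dstSize`):

```typescript
// Box-filter downsample: each destination texel is the mean of a
// non-overlapping (ratio x ratio) block of source texels, per channel.
function downsampleFace(data: Float32Array, srcSize: number, dstSize: number, stride: number): Float32Array {
    const ratio = srcSize / dstSize;
    const inv = 1 / (ratio * ratio);
    const out = new Float32Array(dstSize * dstSize * stride);
    for (let y = 0; y < dstSize; y++) {
        for (let x = 0; x < dstSize; x++) {
            for (let c = 0; c < stride; c++) {
                let sum = 0;
                for (let sy = y * ratio; sy < (y + 1) * ratio; sy++) {
                    for (let sx = x * ratio; sx < (x + 1) * ratio; sx++) {
                        sum += data[(sy * srcSize + sx) * stride + c];
                    }
                }
                out[(y * dstSize + x) * stride + c] = sum * inv;
            }
        }
    }
    return out;
}
```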
What does not change
- The integration loop itself (`addLight`, `_AreaElement`, basis functions) is untouched.
- Post-processing (correction factor, irradiance conversion, Lambert scaling) is untouched.
- `.env` file loading is unaffected (the polynomial is deserialized from the file header, never computed).
- The `generateHarmonics = false` skip path is unaffected.
- `MAX_HDRI_VALUE` / `PRESERVE_CLAMPED_COLORS` behavior is preserved.
- Default behavior (`USE_DOWNSAMPLED_SH_PROJECTION = false`) is bit-identical to current code.
Quality Analysis
SH band 2 (the highest band in 3rd-order) corresponds to angular features of ~90 degrees. The Nyquist criterion for band-2 SH requires roughly 6x6 samples per face. A 64x64 face therefore provides over 10x oversampling per axis (~113x in sample count) even after downsampling.
Expected quality by target size:
| Target | Samples per face | vs. Nyquist | Perceptual difference |
|---|---|---|---|
| 64 | 4096 | ~113x | Imperceptible |
| 32 | 1024 | ~28x | Imperceptible |
| 16 | 256 | ~7x | Negligible, may show on extreme directional HDRIs |
| 8 | 64 | ~1.8x | Noticeable on strong directional environments |
The default target of 64 is conservative. Aggressive users can lower it to 32 with no visible difference.
The box filter introduces slight averaging versus point-sampling each texel, but this is beneficial: it suppresses high-frequency noise (bright point lights in HDRIs) that SH3 cannot represent anyway and that the existing MAX_HDRI_VALUE clamp was designed to mitigate.
Scope of Code Changes
| Change | Location | Lines |
|---|---|---|
| Two static properties | `cubemapToSphericalPolynomial.ts` | ~12 (with JSDoc) |
| Mip-level selection + guard in GPU path | `ConvertCubeMapTextureToSphericalPolynomial` | ~8 |
| Box-filter downsample helper | `_DownsampleFace` private static | ~20 |
| Downsample call in CPU path | `ConvertCubeMapToSphericalPolynomial` | ~8 |
| Total | 1 file | ~48 |
No new files. No API signature changes. No shader changes. No new dependencies.
Migration / Backward Compatibility
- Default off: zero impact on existing users.
- Opt-in: set `CubeMapToSphericalPolynomialTools.USE_DOWNSAMPLED_SH_PROJECTION = true` at app startup.
- No breaking API changes. The two new static properties are additive.
- If a future major version wants to flip the default to `true`, that is a separate decision.
Alternatives Considered
| Alternative | Why not (for now) |
|---|---|
| GPU Accelerated SH projection | High impact but large change, engine-backend-specific (WebGL vs WebGPU). Good follow-up, orthogonal to this proposal. |
| Web Worker offload | Unblocks main thread but does not reduce total CPU cost. The readback stall remains. Complementary, not a replacement. |
| Always skip polynomial, use irradiance texture | Changes visual output for all users. Not backward compatible as a default. |
| Accumulate directly into SP basis (TODO at line 137) | Saves one conversion step but the loop cost is identical. Negligible speedup. |
Open Questions
- Default target size: 64 is conservative. Should we use 32 for even more savings?
- Future default flip: Should we plan to make this `true` by default in a future minor or major release once validated?
- Per-texture override: Is a per-texture option needed, or is the global static sufficient for all known use cases?
Appendix: Full Analysis of the Current Babylon.js Repository
What SphericalPolynomial Is
SphericalPolynomial is a compact representation of the low-frequency diffuse irradiance from an environment cubemap, using 3rd-degree spherical harmonics (9 coefficients, each an RGB Vector3 = 27 floats total). It captures the broad ambient lighting directions from an HDR environment so that PBR materials can compute diffuse IBL without sampling the full cubemap per pixel.
The pipeline:
Cubemap (6 × size² texels)
→ SphericalHarmonics (9 SH coefficients, accumulated per-texel)
→ cosine kernel convolution (incident radiance → irradiance)
→ Lambert 1/π scaling (irradiance → outgoing radiance)
→ SphericalPolynomial (9 polynomial coefficients for shader)
The shader then evaluates these 9 coefficients with the surface normal to get diffuse IBL contribution — essentially replacing a full cubemap lookup with a cheap polynomial evaluation.
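For intuition, that per-pixel evaluation amounts to a 9-term dot product of the coefficients against monomials of the normal. A hedged sketch (the real shader uses a packed uniform layout; the constant term is folded into the quadratic coefficients because nx² + ny² + nz² = 1 for a unit normal):

```typescript
type Vec3 = { x: number; y: number; z: number };
interface PolyCoefficients { x: Vec3; y: Vec3; z: Vec3; xx: Vec3; yy: Vec3; zz: Vec3; xy: Vec3; yz: Vec3; zx: Vec3; }

// Evaluate the 9 RGB polynomial coefficients against a unit normal,
// mirroring what the shader does per pixel.
function evalSphericalPolynomial(p: PolyCoefficients, n: Vec3): Vec3 {
    const w = [n.x, n.y, n.z, n.x * n.x, n.y * n.y, n.z * n.z, n.x * n.y, n.y * n.z, n.z * n.x];
    const terms = [p.x, p.y, p.z, p.xx, p.yy, p.zz, p.xy, p.yz, p.zx];
    const out = { x: 0, y: 0, z: 0 };
    for (let i = 0; i < 9; i++) {
        out.x += terms[i].x * w[i];
        out.y += terms[i].y * w[i];
        out.z += terms[i].z * w[i];
    }
    return out;
}
```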
Why It’s Expensive
There are two cost sources, and the first one is easily overlooked:
1. GPU Readback (the hidden killer)
In baseTexture.polynomial.ts:41, the lazy getter triggers ConvertCubeMapTextureToSphericalPolynomial, which at cubemapToSphericalPolynomial.ts:57-74:
```typescript
texture.getScene()?.getEngine().flushFramebuffer(); // GPU pipeline stall!
const rightPromise = texture.readPixels(0, ...);    // GPU→CPU readback
const leftPromise = texture.readPixels(1, ...);     // ×6 faces
// ...
```
This flushes the GPU pipeline and reads back 6 full-resolution faces. For a 2048x2048 RGBA float cubemap, that’s 6 × 2048 × 2048 × 16 bytes = 384 MB of GPU→CPU transfer. This alone can stall rendering for tens of milliseconds.
2. CPU Integration Loop
At cubemapToSphericalPolynomial.ts:132-212, the triple-nested loop:
- Iterates `6 × size²` texels
- Per texel: 4× `atan2` calls for solid angle (lines 149-153) — these are the most expensive math ops
- Per texel: `new Color3(r, g, b)` allocation (line 201) — GC pressure in the hot loop
- Per texel: `addLight()` does 9× SH basis evaluation + accumulation
- Gamma-space textures add `Math.pow()` per channel (lines 179-181)
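The structure of that per-texel accumulation can be sketched as follows, using the standard real SH basis constants for bands 0-2 (not Babylon's `addLight` verbatim, but the same shape of work):

```typescript
// The nine band-0..2 real SH basis functions evaluated at a unit direction.
const SH9 = (d: { x: number; y: number; z: number }): number[] => [
    0.282095,                           // Y0,0
    0.488603 * d.y,                     // Y1,-1
    0.488603 * d.z,                     // Y1,0
    0.488603 * d.x,                     // Y1,1
    1.092548 * d.x * d.y,               // Y2,-2
    1.092548 * d.y * d.z,               // Y2,-1
    0.315392 * (3 * d.z * d.z - 1),     // Y2,0
    1.092548 * d.x * d.z,               // Y2,1
    0.546274 * (d.x * d.x - d.y * d.y), // Y2,2
];

// Per texel: accumulate color * basis * solidAngle into 9 running RGB sums.
function addLight(coeffs: number[][], dir: { x: number; y: number; z: number }, color: number[], solidAngle: number): void {
    const basis = SH9(dir);
    for (let i = 0; i < 9; i++) {
        for (let c = 0; c < 3; c++) {
            coeffs[i][c] += color[c] * basis[i] * solidAngle;
        }
    }
}
```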
Cost by texture size:
| Size | Texels (6 faces) | atan2 calls | Rough time |
|---|---|---|---|
| 256 | 393K | 1.6M | ~5ms |
| 512 | 1.6M | 6.3M | ~20ms |
| 1024 | 6.3M | 25M | ~80ms |
| 2048 | 25M | 100M | ~300ms |
| 4096 | 100M | 402M | ~1200ms |
Is It Really Needed?
Yes, but only if you want accurate diffuse IBL. Its purpose is to tint surfaces with the correct ambient color from the environment. Without it:
- PBR materials fall back to `USESPHERICALFROMREFLECTIONMAP = false` (lines 1345-1346 of pbrBaseMaterial.ts), a shader fallback that skips the diffuse SH term entirely
- The visual result: objects lose their ambient environment tinting — a metallic object in a sunset scene would lose the warm orange fill light
However, the polynomial only captures bands 0-2 (very low frequency). It’s basically the average color + a directional gradient + a slight quadratic variation. For many scenes, you can approximate this cheaply.
What Happens with a Dummy/Zero Polynomial
If you set sphericalPolynomial to an empty new SphericalPolynomial() (all zeros):
- PBRMaterial `isReady()` at line 1129 passes immediately (polynomial exists, promise is null)
- Shader binding at line 307: `polynomials` is truthy, so the `USESPHERICALFROMREFLECTIONMAP` define stays on
- Rendering: all SH uniforms are zero, so the diffuse IBL contribution = black
This means objects lit only by environment would appear too dark — they’d still get specular reflections from the cubemap but no diffuse fill. For scenes with strong directional lights, this might be barely noticeable. For IBL-only scenes, it would look obviously wrong.
Better dummy approach: Set l00 (the DC/average term) to the average color of your environment. This single coefficient represents the uniform ambient term and captures ~80% of the visual contribution. The other 8 coefficients add directional nuance.
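As a sketch of that fallback, the average could be estimated from sparse samples of the raw face data (`averageColor` is a hypothetical helper, not existing API; the conversion of the result into Babylon's polynomial form is elided):

```typescript
// Estimate the environment's mean RGB by striding through a face's texel
// array, sampling every `step`-th texel. `stride` is floats per texel.
function averageColor(face: Float32Array, stride: number, step = 16): [number, number, number] {
    let r = 0, g = 0, b = 0, count = 0;
    for (let i = 0; i < face.length; i += stride * step) {
        r += face[i];
        g += face[i + 1];
        b += face[i + 2];
        count++;
    }
    return [r / count, g / count, b / count];
}
```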
What Happens with Downsampled Texture
This is the best optimization opportunity. SH band 2 (the highest band used) captures features at ~90-degree angular resolution. You need far fewer texels than the source cubemap:
| Downsampled size | Texels | Quality loss | Speedup |
|---|---|---|---|
| 64×64 | 24K | Imperceptible | ~250× vs 1024 |
| 32×32 | 6K | Negligible | ~1000× vs 1024 |
| 16×16 | 1.5K | Slight | ~4000× vs 1024 |
| 8×8 | 384 | Noticeable on strong directional envs | ~16000× vs 1024 |
A 32×32 or 64×64 downsample would be visually indistinguishable from the full-resolution result for SH3. The math is rigorous here — SH band 2 has an angular resolution of ~90°, so anything above ~16×16 per face is oversampling for this purpose.
Other Performance Advice
1. Use .env files (pre-baked polynomial)
The .env format serializes the polynomial coefficients in the file header. On load, UploadEnvSpherical() just reads 27 floats — zero computation. If you control the asset pipeline, always convert HDR/DDS to .env format offline.
2. Skip generation when not needed
```typescript
new HDRCubeTexture("env.hdr", scene, 256, false, /* generateHarmonics */ false);
```
Then manually set a pre-computed or approximated polynomial.
3. Fix the per-texel allocation
Line 201 creates new Color3(r, g, b) inside the hot loop. This could be hoisted to a reusable instance, eliminating millions of allocations. The codebase already uses TmpVectors for exactly this pattern in addLight().
4. Pre-compute solid angles
The _AreaElement calls (4× atan2 per texel) depend only on (u, v) grid positions, not on pixel color. For a given texture size, these could be computed once into a lookup table and reused across all 6 faces (all faces share the same (u,v) grid).
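A sketch of such a lookup table, using the standard cubemap area-element formula (`buildSolidAngleTable` is illustrative, not existing Babylon.js API):

```typescript
// Antiderivative of the cubemap solid-angle density at corner (x, y),
// with x, y in [-1, 1] on the face plane at distance 1.
function areaElement(x: number, y: number): number {
    return Math.atan2(x * y, Math.sqrt(x * x + y * y + 1));
}

// Precompute every texel's solid angle once per face size; all 6 faces
// share the same (u, v) grid, so one table serves the whole cubemap.
function buildSolidAngleTable(size: number): Float32Array {
    const table = new Float32Array(size * size);
    const invSize = 2 / size;
    for (let y = 0; y < size; y++) {
        const v0 = y * invSize - 1;
        const v1 = v0 + invSize;
        for (let x = 0; x < size; x++) {
            const u0 = x * invSize - 1;
            const u1 = u0 + invSize;
            // Texel solid angle = signed sum of the four corner area elements.
            table[y * size + x] =
                areaElement(u0, v0) - areaElement(u0, v1) - areaElement(u1, v0) + areaElement(u1, v1);
        }
    }
    return table;
}
```

Summing the table over one face recovers one sixth of the full sphere (2π/3 steradians), which is a useful sanity check.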
5. GPU-side computation
The entire SH projection is embarrassingly parallel — it’s a map-reduce over texels. A compute shader could do this in <1ms for any size. Several engines (Filament, three.js) have moved this to GPU. The TODO at line 137-139 hints at awareness of optimization opportunities here, though it suggests accumulating into SP basis directly rather than going to GPU.
6. Offload to a Web Worker
Since the heavy computation is purely CPU math on an ArrayBuffer, it could be posted to a Worker to avoid blocking the main thread. The async pattern is already in place (the promise-based lazy getter).
Summary
| Approach | Effort | Speedup | Visual Impact |
|---|---|---|---|
| Use `.env` files | Asset pipeline | Eliminates computation | None |
| Downsample to 32×32 | Moderate code change | ~1000× | Imperceptible |
| Hoist Color3 alloc | Trivial one-liner | ~10-20% | None |
| Pre-compute solid angles | Small code change | ~30-40% | None |
| GPU compute shader | Large | Full-size in <1ms | None |
| Web Worker offload | Moderate | Unblocks main thread | None |
| Dummy (avg color only) | Trivial | Eliminates computation | Minor ambient inaccuracy |
The highest-impact, lowest-effort win is downsampling the input to 32-64px per face before running the SH projection. It’s mathematically justified (SH3 can’t represent higher frequencies anyway) and would make the computation effectively free.