Downsampled SphericalHarmonics Creation

Summary

Reduce the CPU cost of SphericalPolynomial generation by reading from a lower mip level of the cubemap instead of the full-resolution mip 0. For the synchronous CPU-data path, apply a box-filter downsample before integration. The optimization is opt-in via a static flag on CubeMapToSphericalPolynomialTools, defaulting to false (current behavior preserved).
This proposal was drafted with the help of AI, but the idea comes from a human :person:.

Motivation

CubeMapToSphericalPolynomialTools.ConvertCubeMapTextureToSphericalPolynomial is one of the most expensive synchronous CPU operations triggered during texture load (EquiRectangular and HDR). It iterates every texel of all 6 cubemap faces, performing per-texel atan2 (x4), gamma correction (Math.pow x3), SH basis evaluation (x9), and vector accumulation. For a 2048x2048 cubemap this is ~25 million texels and ~100 million atan2 calls.

The GPU readback path compounds this: it calls flushFramebuffer() (pipeline stall) then readPixels on 6 full-resolution faces (384 MB for 2048 RGBA float), transferring data that will be reduced to 27 floats.

The output — 3rd-order spherical harmonics (9 RGB coefficients) — has an angular resolution of ~90 degrees. A 32x32 or 64x64 cubemap face contains more than enough information to produce SH3 coefficients that are effectively identical to the full-resolution result. Everything above ~16x16 per face is oversampling for this purpose.

Approximate impact (single-threaded)

| Face size | Texels (6 faces) | Time estimate |
| --- | --- | --- |
| 2048 | 25M | ~300 ms |
| 1024 | 6.3M | ~80 ms |
| 64 | 24K | <1 ms |
| 32 | 6K | <0.3 ms |

Design

Option flag

// cubemapToSphericalPolynomial.ts
export class CubeMapToSphericalPolynomialTools {
    /**
     * When true, the SH projection reads from a lower mip level (GPU path)
     * or downsamples the input data (CPU path) to reduce computation cost.
     * The target face size is controlled by SH_INTEGRATION_TARGET_SIZE.
     *
     * Default: false (preserves existing full-resolution behavior).
     */
    public static USE_DOWNSAMPLED_SH_PROJECTION = false;

    /**
     * Target face resolution (per side) when USE_DOWNSAMPLED_SH_PROJECTION is true.
     * Values of 32-64 are recommended. Must be a power of two.
     * Ignored when USE_DOWNSAMPLED_SH_PROJECTION is false.
     *
     * Default: 64.
     */
    public static SH_INTEGRATION_TARGET_SIZE = 64;
}

A static flag on CubeMapToSphericalPolynomialTools is chosen over a scene-level option because:

  • The tool class already carries static config (MAX_HDRI_VALUE, PRESERVE_CLAMPED_COLORS)
  • SH projection is a global math concern, not scene-specific
  • Keeps the surface area small — one place to set, applies everywhere

Path 1: GPU readback (ConvertCubeMapTextureToSphericalPolynomial)

When USE_DOWNSAMPLED_SH_PROJECTION is true and the texture has mipmaps (generateMipMaps === true):

  1. Compute the mip level that brings each face closest to SH_INTEGRATION_TARGET_SIZE:
    const level = Math.max(0, Math.round(Math.log2(size / targetSize)));
    const mipSize = Math.max(1, size >> level);
    
  2. Pass level to the existing readPixels(faceIndex, level, ...) calls.
  3. Use mipSize as cubeInfo.size.

When the flag is true but the texture has no mipmaps, fall back to full-resolution (current behavior) with no change.
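Putting the two steps above together, a minimal sketch of the mip selection (the helper name is illustrative, not the actual implementation):

```typescript
// Illustrative helper (not actual Babylon.js code): pick the mip level
// whose face size is closest to the target, clamped to mip 0.
function selectShMipLevel(faceSize: number, targetSize: number): { level: number; mipSize: number } {
    const level = Math.max(0, Math.round(Math.log2(faceSize / targetSize)));
    const mipSize = Math.max(1, faceSize >> level);
    return { level, mipSize };
}
```

For a 2048 face and a target of 64 this selects level 5 (2048 >> 5 = 64); when the face is already at or below the target, level 0 is chosen and the read matches current behavior.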

Path 2: CPU-data sync path (ConvertCubeMapToSphericalPolynomial)

When USE_DOWNSAMPLED_SH_PROJECTION is true and cubeInfo.size > targetSize:

  1. Downsample each face array using a simple box filter before entering the integration loop.
  2. A private static helper performs the downsample:
    _DownsampleFace(data, srcSize, dstSize, stride): Float32Array
    
    Averages non-overlapping (ratio x ratio) blocks. One pass, no allocations beyond the output array.
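A minimal sketch of such a helper (assumed shape — the real _DownsampleFace signature may differ; srcSize is assumed to be an exact multiple of dstSize):

```typescript
// Box-filter downsample: average non-overlapping (ratio x ratio) blocks.
// One pass, no allocations beyond the output array.
function downsampleFace(data: Float32Array, srcSize: number, dstSize: number, stride: number): Float32Array {
    const ratio = srcSize / dstSize; // assumed to be an integer
    const out = new Float32Array(dstSize * dstSize * stride);
    const inv = 1 / (ratio * ratio);
    for (let y = 0; y < dstSize; y++) {
        for (let x = 0; x < dstSize; x++) {
            for (let c = 0; c < stride; c++) {
                let sum = 0;
                for (let sy = 0; sy < ratio; sy++) {
                    for (let sx = 0; sx < ratio; sx++) {
                        sum += data[((y * ratio + sy) * srcSize + (x * ratio + sx)) * stride + c];
                    }
                }
                out[(y * dstSize + x) * stride + c] = sum * inv;
            }
        }
    }
    return out;
}
```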

What does not change

  • The integration loop itself (addLight, _AreaElement, basis functions) is untouched.
  • Post-processing (correction factor, irradiance conversion, Lambert scaling) is untouched.
  • .env file loading is unaffected (polynomial is deserialized from the file header, never computed).
  • The generateHarmonics = false skip path is unaffected.
  • MAX_HDRI_VALUE / PRESERVE_CLAMPED_COLORS behavior is preserved.
  • Default behavior (USE_DOWNSAMPLED_SH_PROJECTION = false) is bit-identical to current code.

Quality Analysis

SH band 2 (the highest band in 3rd-order) corresponds to angular features of ~90 degrees. The Nyquist criterion for band-2 SH requires roughly 6x6 samples per face. A 64x64 face therefore provides roughly 10x linear oversampling (over 100x in sample count) even after downsampling.

Expected quality by target size:

| Target | Samples per face | vs. Nyquist | Perceptual difference |
| --- | --- | --- | --- |
| 64 | 4096 | ~113x | Imperceptible |
| 32 | 1024 | ~28x | Imperceptible |
| 16 | 256 | ~7x | Negligible, may show on extreme directional HDRIs |
| 8 | 64 | ~1.8x | Noticeable on strong directional environments |

The default target of 64 is conservative. Aggressive users can lower it to 32 with no visible difference.

The box filter introduces slight averaging versus point-sampling each texel, but this is beneficial: it suppresses high-frequency noise (bright point lights in HDRIs) that SH3 cannot represent anyway and that the existing MAX_HDRI_VALUE clamp was designed to mitigate.

Scope of Code Changes

| Change | Location | Lines |
| --- | --- | --- |
| Two static properties | cubemapToSphericalPolynomial.ts | ~12 (with JSDoc) |
| Mip-level selection + guard in GPU path | ConvertCubeMapTextureToSphericalPolynomial | ~8 |
| Box-filter downsample helper | _DownsampleFace (private static) | ~20 |
| Downsample call in CPU path | ConvertCubeMapToSphericalPolynomial | ~8 |
| Total | 1 file | ~48 |

No new files. No API signature changes. No shader changes. No new dependencies.

Migration / Backward Compatibility

  • Default off: zero impact on existing users.
  • Opt-in: CubeMapToSphericalPolynomialTools.USE_DOWNSAMPLED_SH_PROJECTION = true at app startup.
  • No breaking API changes. The two new static properties are additive.
  • If a future major version wants to flip the default to true, that is a separate decision.

Alternatives Considered

| Alternative | Why not (for now) |
| --- | --- |
| GPU-accelerated SH projection | High impact but a large, engine-backend-specific change (WebGL vs WebGPU). Good follow-up, orthogonal to this proposal. |
| Web Worker offload | Unblocks the main thread but does not reduce total CPU cost; the readback stall remains. Complementary, not a replacement. |
| Always skip polynomial, use irradiance texture | Changes visual output for all users. Not backward compatible as a default. |
| Accumulate directly into SP basis (TODO at line 137) | Saves one conversion step but the loop cost is identical. Negligible speedup. |

Open Questions

  1. Default target size: 64 is conservative. Should we use 32 for even more savings?
  2. Future default flip: Should we plan to make this true by default in a future minor or major release once validated?
  3. Per-texture override: Is a per-texture option needed, or is the global static sufficient for all known use cases?
The full AI analysis against the current Babylon.js repo follows.

What SphericalPolynomial Is

SphericalPolynomial is a compact representation of the low-frequency diffuse irradiance from an environment cubemap, using 3rd-degree spherical harmonics (9 coefficients, each an RGB Vector3 = 27 floats total). It captures the broad ambient lighting directions from an HDR environment so that PBR materials can compute diffuse IBL without sampling the full cubemap per pixel.

The pipeline:

Cubemap (6 × size² texels)
  → SphericalHarmonics (9 SH coefficients, accumulated per-texel)
  → cosine kernel convolution (incident radiance → irradiance)
  → Lambert 1/π scaling (irradiance → outgoing radiance)
  → SphericalPolynomial (9 polynomial coefficients for shader)

The shader then evaluates these 9 coefficients with the surface normal to get diffuse IBL contribution — essentially replacing a full cubemap lookup with a cheap polynomial evaluation.
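To make that final step concrete, here is a sketch of the 9-term polynomial evaluated against a surface normal. The field names mirror Babylon's SphericalPolynomial coefficients, but this is an illustrative CPU-side rendition — the real evaluation happens in the shader and its exact form may differ:

```typescript
type Vec3 = [number, number, number];

// The nine RGB polynomial coefficients (names follow SphericalPolynomial).
interface PolyCoeffs {
    x: Vec3; y: Vec3; z: Vec3;
    xx: Vec3; yy: Vec3; zz: Vec3;
    xy: Vec3; yz: Vec3; zx: Vec3;
}

// Evaluate the polynomial for a normalized surface normal n. Note there is
// no explicit constant term: because nx² + ny² + nz² = 1 for a unit normal,
// the DC contribution is folded into the xx/yy/zz coefficients.
function evalSphericalPolynomial(p: PolyCoeffs, n: Vec3): Vec3 {
    const [nx, ny, nz] = n;
    const terms: Array<[Vec3, number]> = [
        [p.x, nx], [p.y, ny], [p.z, nz],
        [p.xx, nx * nx], [p.yy, ny * ny], [p.zz, nz * nz],
        [p.xy, nx * ny], [p.yz, ny * nz], [p.zx, nz * nx],
    ];
    const out: Vec3 = [0, 0, 0];
    for (const [c, w] of terms) {
        out[0] += c[0] * w;
        out[1] += c[1] * w;
        out[2] += c[2] * w;
    }
    return out;
}
```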

Why It’s Expensive

There are two cost sources, and the first one is easily overlooked:

1. GPU Readback (the hidden killer)

In baseTexture.polynomial.ts:41, the lazy getter triggers ConvertCubeMapTextureToSphericalPolynomial, which at cubemapToSphericalPolynomial.ts:57-74:

texture.getScene()?.getEngine().flushFramebuffer();  // GPU pipeline stall!
const rightPromise = texture.readPixels(0, ...);     // GPU→CPU readback
const leftPromise  = texture.readPixels(1, ...);     // ×6 faces
// ...

This flushes the GPU pipeline and reads back 6 full-resolution faces. For a 2048x2048 RGBA float cubemap, that’s 6 × 2048 × 2048 × 16 bytes = 384 MB of GPU→CPU transfer. This alone can stall rendering for tens of milliseconds.

2. CPU Integration Loop

At cubemapToSphericalPolynomial.ts:132-212, the triple-nested loop:

  • Iterates 6 × size² texels
  • Per texel: 4× atan2 calls for solid angle (line 149-153) — these are the most expensive math ops
  • Per texel: new Color3(r, g, b) allocation (line 201) — GC pressure in hot loop
  • Per texel: addLight() does 9× SH basis evaluation + accumulation
  • Gamma-space textures add Math.pow() per channel (line 179-181)

Cost by texture size:

| Size | Texels (6 faces) | atan2 calls | Rough time |
| --- | --- | --- | --- |
| 256 | 393K | 1.6M | ~5 ms |
| 512 | 1.6M | 6.3M | ~20 ms |
| 1024 | 6.3M | 25M | ~80 ms |
| 2048 | 25M | 100M | ~300 ms |
| 4096 | 100M | 402M | ~1200 ms |

Is It Really Needed?

Yes, but only if you want accurate diffuse IBL. Its purpose is to tint surfaces with the correct ambient color from the environment. Without it:

  • PBR materials fall back to USESPHERICALFROMREFLECTIONMAP = false (line 1345-1346 of pbrBaseMaterial.ts), which is a shader fallback that skips the diffuse SH term entirely
  • The visual result: objects lose their ambient environment tinting — a metallic object in a sunset scene would lose the warm orange fill light

However, the polynomial only captures bands 0-2 (very low frequency). It’s basically the average color + a directional gradient + a slight quadratic variation. For many scenes, you can approximate this cheaply.

What Happens with a Dummy/Zero Polynomial

If you set sphericalPolynomial to an empty new SphericalPolynomial() (all zeros):

  1. PBRMaterial isReady() at line 1129 — passes immediately (polynomial exists, promise is null)
  2. Shader binding at line 307 — polynomials is truthy, USESPHERICALFROMREFLECTIONMAP define stays on
  3. Rendering — all SH uniforms are zero, so diffuse IBL contribution = black

This means objects lit only by environment would appear too dark — they’d still get specular reflections from the cubemap but no diffuse fill. For scenes with strong directional lights, this might be barely noticeable. For IBL-only scenes, it would look obviously wrong.

Better dummy approach: Set l00 (the DC/average term) to the average color of your environment. This single coefficient represents the uniform ambient term and captures ~80% of the visual contribution. The other 8 coefficients add directional nuance.
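A hedged sketch of that estimate (the helper is illustrative, not a Babylon API; in Babylon you would place the result in SphericalHarmonics.l00 and convert with SphericalPolynomial.FromHarmonics, checking the scaling conventions against the actual code):

```typescript
// Illustrative helper: plain average color of one face's pixel data, usable
// as a rough DC / l00 estimate. A more faithful estimate would weight each
// texel by its solid angle; the plain mean is the cheap approximation.
function averageFaceColor(face: Float32Array, stride: number): [number, number, number] {
    let r = 0, g = 0, b = 0;
    const texelCount = face.length / stride;
    for (let i = 0; i < face.length; i += stride) {
        r += face[i];
        g += face[i + 1];
        b += face[i + 2];
    }
    return [r / texelCount, g / texelCount, b / texelCount];
}
```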

What Happens with Downsampled Texture

This is the best optimization opportunity. SH band 2 (the highest band used) captures features at ~90-degree angular resolution. You need far fewer texels than the source cubemap:

| Downsampled size | Texels | Quality loss | Speedup |
| --- | --- | --- | --- |
| 64×64 | 24K | Imperceptible | ~250× vs 1024 |
| 32×32 | 6K | Negligible | ~1000× vs 1024 |
| 16×16 | 1.5K | Slight | ~4000× vs 1024 |
| 8×8 | 384 | Noticeable on strong directional envs | ~16000× vs 1024 |

A 32×32 or 64×64 downsample would be visually indistinguishable from the full-resolution result for SH3. The math is rigorous here — SH band 2 has an angular resolution of ~90°, so anything above ~16×16 per face is oversampling for this purpose.

Other Performance Advice

1. Use .env files (pre-baked polynomial)

The .env format serializes the polynomial coefficients in the file header. On load, UploadEnvSpherical() just reads 27 floats — zero computation. If you control the asset pipeline, always convert HDR/DDS to .env format offline.

2. Skip generation when not needed

new HDRCubeTexture("env.hdr", scene, 256, false, /* generateHarmonics */ false);

Then manually set a pre-computed or approximated polynomial.

3. Fix the per-texel allocation

Line 201 creates new Color3(r, g, b) inside the hot loop. This could be hoisted to a reusable instance, eliminating millions of allocations. The codebase already uses TmpVectors for exactly this pattern in addLight().

4. Pre-compute solid angles

The _AreaElement calls (4× atan2 per texel) depend only on (u, v) grid positions, not on pixel color. For a given texture size, these could be computed once into a lookup table and reused across all 6 faces (all faces share the same (u,v) grid).
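A sketch of that lookup table, assuming the standard cubemap texel solid-angle formula that _AreaElement implements (names are illustrative):

```typescript
// Standard cubemap area-element term: the integral of the solid-angle
// density up to face coordinate (x, y) in [-1, 1].
function areaElement(x: number, y: number): number {
    return Math.atan2(x * y, Math.sqrt(x * x + y * y + 1.0));
}

// Precompute the solid angle of every texel of one face; since all six
// faces share the same (u, v) grid, the table is reused across faces.
function buildSolidAngleTable(size: number): Float32Array {
    const table = new Float32Array(size * size);
    const invSize = 1 / size;
    for (let y = 0; y < size; y++) {
        for (let x = 0; x < size; x++) {
            // Texel corner coordinates in [-1, 1] on the cube face.
            const u0 = 2 * x * invSize - 1;
            const u1 = 2 * (x + 1) * invSize - 1;
            const v0 = 2 * y * invSize - 1;
            const v1 = 2 * (y + 1) * invSize - 1;
            table[y * size + x] =
                areaElement(u0, v0) - areaElement(u0, v1) - areaElement(u1, v0) + areaElement(u1, v1);
        }
    }
    return table;
}
```

As a sanity check, the per-texel solid angles of one face should sum to one sixth of the full sphere, 4π/6 ≈ 2.094 steradians.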

5. GPU-side computation

The entire SH projection is embarrassingly parallel — it’s a map-reduce over texels. A compute shader could do this in <1ms for any size. Several engines (Filament, three.js) have moved this to GPU. The TODO at line 137-139 hints at awareness of optimization opportunities here, though it suggests accumulating into SP basis directly rather than going to GPU.

6. Offload to a Web Worker

Since the heavy computation is purely CPU math on an ArrayBuffer, it could be posted to a Worker to avoid blocking the main thread. The async pattern is already in place (the promise-based lazy getter).

Summary

| Approach | Effort | Speedup | Visual impact |
| --- | --- | --- | --- |
| Use .env files | Asset pipeline | Eliminates computation | None |
| Downsample to 32×32 | Moderate code change | ~1000× | Imperceptible |
| Hoist Color3 alloc | Trivial one-liner | ~10-20% | None |
| Pre-compute solid angles | Small code change | ~30-40% | None |
| GPU compute shader | Large | Full-size in <1ms | None |
| Web Worker offload | Moderate | Unblocks main thread | None |
| Dummy (avg color only) | Trivial | Eliminates computation | Minor ambient inaccuracy |

The highest-impact, lowest-effort win is downsampling the input to 32-64px per face before running the SH projection. It’s mathematically justified (SH3 can’t represent higher frequencies anyway) and would make the computation effectively free.


cc @Evgeni_Popov

That looks good to me!

However, I would prefer this property to be defined per instance rather than statically, because even though I agree that we generally want this setting to apply to all cubes, if someone wants a different setting for certain cubes (for whatever reason), it feels odd to modify a static property between calls.

And, at some point, we’ll probably need to port the code to the GPU (but that will be the subject of another PR).

Does it mean as part of CubeTexture? SphericalPolynomial is created implicitly inside textures.

Yes, I think it can be a parameter at construction time?