Summary
Reduce the CPU cost of SphericalPolynomial generation by reading from a lower mip level of the cubemap instead of the full-resolution mip 0. For the synchronous CPU-data path, apply a box-filter downsample before integration. The optimization is opt-in via a static flag on CubeMapToSphericalPolynomialTools, defaulting to false (current behavior preserved).
This proposal was drafted with AI assistance, but the underlying idea comes from a human.
Motivation
CubeMapToSphericalPolynomialTools.ConvertCubeMapTextureToSphericalPolynomial is one of the most expensive synchronous CPU operations triggered during texture load (EquiRectangularCubeTexture and HDRCubeTexture). It iterates every texel of all six cubemap faces, performing four atan2 calls (solid angle), three Math.pow calls (gamma correction), nine SH basis evaluations, and vector accumulation per texel. For a 2048x2048 cubemap this is ~25 million texels and ~100 million atan2 calls.
The GPU readback path compounds this: it calls flushFramebuffer() (pipeline stall) then readPixels on 6 full-resolution faces (384 MB for 2048 RGBA float), transferring data that will be reduced to 27 floats.
The output — 3rd-order spherical harmonics (9 RGB coefficients) — has an angular resolution of ~90 degrees. A 32x32 or 64x64 cubemap face contains more than enough information to produce near-identical SH3 coefficients; everything above ~16x16 per face is oversampling for this purpose.
Approximate impact (single-threaded)
| Face size | Texels (6 faces) | Time estimate |
|---|---|---|
| 2048 | 25M | ~300 ms |
| 1024 | 6.3M | ~80 ms |
| 64 | 24K | <1 ms |
| 32 | 6K | <0.3 ms |
Design
Option flag
```typescript
// cubemapToSphericalPolynomial.ts
export class CubeMapToSphericalPolynomialTools {
    /**
     * When true, the SH projection reads from a lower mip level (GPU path)
     * or downsamples the input data (CPU path) to reduce computation cost.
     * The target face size is controlled by SH_INTEGRATION_TARGET_SIZE.
     *
     * Default: false (preserves existing full-resolution behavior).
     */
    public static USE_DOWNSAMPLED_SH_PROJECTION = false;

    /**
     * Target face resolution (per side) when USE_DOWNSAMPLED_SH_PROJECTION is true.
     * Values of 32-64 are recommended. Must be a power of two.
     * Ignored when USE_DOWNSAMPLED_SH_PROJECTION is false.
     *
     * Default: 64.
     */
    public static SH_INTEGRATION_TARGET_SIZE = 64;
}
```
A static flag on CubeMapToSphericalPolynomialTools is chosen over a scene-level option because:
- The tool class already carries static config (`MAX_HDRI_VALUE`, `PRESERVE_CLAMPED_COLORS`)
- SH projection is a global math concern, not scene-specific
- Keeps the surface area small: one place to set, applies everywhere
Path 1: GPU readback (ConvertCubeMapTextureToSphericalPolynomial)
When USE_DOWNSAMPLED_SH_PROJECTION is true and the texture has mipmaps (generateMipMaps === true):
- Compute the mip level that brings each face closest to `SH_INTEGRATION_TARGET_SIZE`:
  ```typescript
  const level = Math.max(0, Math.round(Math.log2(size / targetSize)));
  const mipSize = Math.max(1, size >> level);
  ```
- Pass `level` to the existing `readPixels(faceIndex, level, ...)` calls.
- Use `mipSize` as `cubeInfo.size`.
When the flag is true but the texture has no mipmaps, fall back to full-resolution (current behavior) with no change.
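The mip-selection step can be sketched as a small helper (illustrative only; `selectShMipLevel` is not an existing Babylon.js function):

```typescript
// Pick the mip level whose face size is closest (in log2 space) to the
// target, clamped so textures at or below the target stay at mip 0.
function selectShMipLevel(faceSize: number, targetSize: number): { level: number; mipSize: number } {
    const level = Math.max(0, Math.round(Math.log2(faceSize / targetSize)));
    const mipSize = Math.max(1, faceSize >> level);
    return { level, mipSize };
}
```

For a 2048 face with the default target of 64 this selects mip 5; a 32 face stays at mip 0 with its native size.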
Path 2: CPU-data sync path (ConvertCubeMapToSphericalPolynomial)
When USE_DOWNSAMPLED_SH_PROJECTION is true and cubeInfo.size > targetSize:
- Downsample each face array using a simple box filter before entering the integration loop.
- A private static helper `_DownsampleFace(data, srcSize, dstSize, stride): Float32Array` performs the downsample: it averages non-overlapping `(ratio × ratio)` blocks. One pass, no allocations beyond the output array.
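A minimal sketch of that box-filter helper (the actual `_DownsampleFace` implementation may differ; this assumes `srcSize` is an integer multiple of `dstSize`):

```typescript
// Box-filter downsample: each destination texel is the mean of a
// non-overlapping (ratio x ratio) block of source texels, per channel.
function downsampleFace(data: Float32Array, srcSize: number, dstSize: number, stride: number): Float32Array {
    const ratio = srcSize / dstSize;
    const inv = 1 / (ratio * ratio);
    const out = new Float32Array(dstSize * dstSize * stride);
    for (let y = 0; y < dstSize; y++) {
        for (let x = 0; x < dstSize; x++) {
            for (let c = 0; c < stride; c++) {
                let sum = 0;
                for (let sy = y * ratio; sy < (y + 1) * ratio; sy++) {
                    for (let sx = x * ratio; sx < (x + 1) * ratio; sx++) {
                        sum += data[(sy * srcSize + sx) * stride + c];
                    }
                }
                out[(y * dstSize + x) * stride + c] = sum * inv;
            }
        }
    }
    return out;
}
```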
What does not change
- The integration loop itself (`addLight`, `_AreaElement`, basis functions) is untouched.
- Post-processing (correction factor, irradiance conversion, Lambert scaling) is untouched.
- `.env` file loading is unaffected (the polynomial is deserialized from the file header, never computed).
- The `generateHarmonics = false` skip path is unaffected.
- `MAX_HDRI_VALUE` / `PRESERVE_CLAMPED_COLORS` behavior is preserved.
- Default behavior (`USE_DOWNSAMPLED_SH_PROJECTION = false`) is bit-identical to current code.
Quality Analysis
SH band 2 (the highest band in 3rd-order) corresponds to angular features of ~90 degrees. The Nyquist criterion for band-2 SH requires roughly 6x6 samples per face. A 64x64 face therefore provides over 10x oversampling per axis (~113x in sample count) even after downsampling.
Expected quality by target size:
| Target | Samples per face | vs. Nyquist | Perceptual difference |
|---|---|---|---|
| 64 | 4096 | ~113x | Imperceptible |
| 32 | 1024 | ~28x | Imperceptible |
| 16 | 256 | ~7x | Negligible, may show on extreme directional HDRIs |
| 8 | 64 | ~1.8x | Noticeable on strong directional environments |
The default target of 64 is conservative. Aggressive users can lower it to 32 with no visible difference.
The box filter introduces slight averaging versus point-sampling each texel, but this is beneficial: it suppresses high-frequency noise (bright point lights in HDRIs) that SH3 cannot represent anyway and that the existing MAX_HDRI_VALUE clamp was designed to mitigate.
Scope of Code Changes
| Change | Location | Lines |
|---|---|---|
| Two static properties | `cubemapToSphericalPolynomial.ts` | ~12 (with JSDoc) |
| Mip-level selection + guard in GPU path | `ConvertCubeMapTextureToSphericalPolynomial` | ~8 |
| Box-filter downsample helper | `_DownsampleFace` private static | ~20 |
| Downsample call in CPU path | `ConvertCubeMapToSphericalPolynomial` | ~8 |
| Total | 1 file | ~48 |
No new files. No API signature changes. No shader changes. No new dependencies.
Migration / Backward Compatibility
- Default off: zero impact on existing users.
- Opt-in: set `CubeMapToSphericalPolynomialTools.USE_DOWNSAMPLED_SH_PROJECTION = true` at app startup.
- No breaking API changes. The two new static properties are additive.
- If a future major version wants to flip the default to `true`, that is a separate decision.
Alternatives Considered
| Alternative | Why not (for now) |
|---|---|
| GPU Accelerated SH projection | High impact but large change, engine-backend-specific (WebGL vs WebGPU). Good follow-up, orthogonal to this proposal. |
| Web Worker offload | Unblocks main thread but does not reduce total CPU cost. The readback stall remains. Complementary, not a replacement. |
| Always skip polynomial, use irradiance texture | Changes visual output for all users. Not backward compatible as a default. |
| Accumulate directly into SP basis (TODO at line 137) | Saves one conversion step but the loop cost is identical. Negligible speedup. |
Open Questions
- Default target size: 64 is conservative. Should we use 32 for even more savings?
- Future default flip: Should we plan to make this `true` by default in a future minor or major release once validated?
- Per-texture override: Is a per-texture option needed, or is the global static sufficient for all known use cases?
Appendix: Full Analysis of the Current Babylon.js Repository
What SphericalPolynomial Is
SphericalPolynomial is a compact representation of the low-frequency diffuse irradiance from an environment cubemap, using 3rd-degree spherical harmonics (9 coefficients, each an RGB Vector3 = 27 floats total). It captures the broad ambient lighting directions from an HDR environment so that PBR materials can compute diffuse IBL without sampling the full cubemap per pixel.
The pipeline:
Cubemap (6 × size² texels)
→ SphericalHarmonics (9 SH coefficients, accumulated per-texel)
→ cosine kernel convolution (incident radiance → irradiance)
→ Lambert 1/π scaling (irradiance → outgoing radiance)
→ SphericalPolynomial (9 polynomial coefficients for shader)
The shader then evaluates these 9 coefficients with the surface normal to get diffuse IBL contribution — essentially replacing a full cubemap lookup with a cheap polynomial evaluation.
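For intuition, that per-pixel evaluation amounts to a 9-term dot product of the coefficients against monomials of the normal. A hedged sketch (the real shader uses a packed uniform layout; the constant term is folded into the quadratic coefficients because nx² + ny² + nz² = 1 for a unit normal):

```typescript
type Vec3 = { x: number; y: number; z: number };
interface PolyCoefficients { x: Vec3; y: Vec3; z: Vec3; xx: Vec3; yy: Vec3; zz: Vec3; xy: Vec3; yz: Vec3; zx: Vec3; }

// Evaluate the 9 RGB polynomial coefficients against a unit normal,
// mirroring what the shader does per pixel.
function evalSphericalPolynomial(p: PolyCoefficients, n: Vec3): Vec3 {
    const w = [n.x, n.y, n.z, n.x * n.x, n.y * n.y, n.z * n.z, n.x * n.y, n.y * n.z, n.z * n.x];
    const terms = [p.x, p.y, p.z, p.xx, p.yy, p.zz, p.xy, p.yz, p.zx];
    const out = { x: 0, y: 0, z: 0 };
    for (let i = 0; i < 9; i++) {
        out.x += terms[i].x * w[i];
        out.y += terms[i].y * w[i];
        out.z += terms[i].z * w[i];
    }
    return out;
}
```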
Why It’s Expensive
There are two cost sources, and the first one is easily overlooked:
1. GPU Readback (the hidden killer)
In baseTexture.polynomial.ts:41, the lazy getter triggers ConvertCubeMapTextureToSphericalPolynomial, which at cubemapToSphericalPolynomial.ts:57-74:
```typescript
texture.getScene()?.getEngine().flushFramebuffer(); // GPU pipeline stall!
const rightPromise = texture.readPixels(0, ...);    // GPU→CPU readback
const leftPromise = texture.readPixels(1, ...);     // ×6 faces
// ...
```
This flushes the GPU pipeline and reads back 6 full-resolution faces. For a 2048x2048 RGBA float cubemap, that’s 6 × 2048 × 2048 × 16 bytes = 384 MB of GPU→CPU transfer. This alone can stall rendering for tens of milliseconds.
2. CPU Integration Loop
At cubemapToSphericalPolynomial.ts:132-212, the triple-nested loop:
- Iterates `6 × size²` texels
- Per texel: 4× `atan2` calls for solid angle (lines 149-153) — these are the most expensive math ops
- Per texel: `new Color3(r, g, b)` allocation (line 201) — GC pressure in the hot loop
- Per texel: `addLight()` does 9× SH basis evaluation + accumulation
- Gamma-space textures add `Math.pow()` per channel (lines 179-181)
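The structure of that per-texel accumulation can be sketched as follows, using the standard real SH basis constants for bands 0-2 (not Babylon's `addLight` verbatim, but the same shape of work):

```typescript
// The nine band-0..2 real SH basis functions evaluated at a unit direction.
const SH9 = (d: { x: number; y: number; z: number }): number[] => [
    0.282095,                           // Y0,0
    0.488603 * d.y,                     // Y1,-1
    0.488603 * d.z,                     // Y1,0
    0.488603 * d.x,                     // Y1,1
    1.092548 * d.x * d.y,               // Y2,-2
    1.092548 * d.y * d.z,               // Y2,-1
    0.315392 * (3 * d.z * d.z - 1),     // Y2,0
    1.092548 * d.x * d.z,               // Y2,1
    0.546274 * (d.x * d.x - d.y * d.y), // Y2,2
];

// Per texel: accumulate color * basis * solidAngle into 9 running RGB sums.
function addLight(coeffs: number[][], dir: { x: number; y: number; z: number }, color: number[], solidAngle: number): void {
    const basis = SH9(dir);
    for (let i = 0; i < 9; i++) {
        for (let c = 0; c < 3; c++) {
            coeffs[i][c] += color[c] * basis[i] * solidAngle;
        }
    }
}
```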
Cost by texture size:
| Size | Texels (6 faces) | atan2 calls | Rough time |
|---|---|---|---|
| 256 | 393K | 1.6M | ~5ms |
| 512 | 1.6M | 6.3M | ~20ms |
| 1024 | 6.3M | 25M | ~80ms |
| 2048 | 25M | 100M | ~300ms |
| 4096 | 100M | 402M | ~1200ms |
Is It Really Needed?
Yes, but only if you want accurate diffuse IBL. Its purpose is to tint surfaces with the correct ambient color from the environment. Without it:
- PBR materials fall back to `USESPHERICALFROMREFLECTIONMAP = false` (lines 1345-1346 of pbrBaseMaterial.ts), a shader fallback that skips the diffuse SH term entirely
- The visual result: objects lose their ambient environment tinting — a metallic object in a sunset scene would lose the warm orange fill light
However, the polynomial only captures bands 0-2 (very low frequency). It’s basically the average color + a directional gradient + a slight quadratic variation. For many scenes, you can approximate this cheaply.
What Happens with a Dummy/Zero Polynomial
If you set sphericalPolynomial to an empty new SphericalPolynomial() (all zeros):
- PBRMaterial `isReady()` at line 1129 passes immediately (polynomial exists, promise is null)
- Shader binding at line 307: `polynomials` is truthy, so the `USESPHERICALFROMREFLECTIONMAP` define stays on
- Rendering: all SH uniforms are zero, so the diffuse IBL contribution = black
This means objects lit only by environment would appear too dark — they’d still get specular reflections from the cubemap but no diffuse fill. For scenes with strong directional lights, this might be barely noticeable. For IBL-only scenes, it would look obviously wrong.
Better dummy approach: Set l00 (the DC/average term) to the average color of your environment. This single coefficient represents the uniform ambient term and captures ~80% of the visual contribution. The other 8 coefficients add directional nuance.
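As a sketch of that fallback, the average could be estimated from sparse samples of the raw face data (`averageColor` is a hypothetical helper, not existing API; the conversion of the result into Babylon's polynomial form is elided):

```typescript
// Estimate the environment's mean RGB by striding through a face's texel
// array, sampling every `step`-th texel. `stride` is floats per texel.
function averageColor(face: Float32Array, stride: number, step = 16): [number, number, number] {
    let r = 0, g = 0, b = 0, count = 0;
    for (let i = 0; i < face.length; i += stride * step) {
        r += face[i];
        g += face[i + 1];
        b += face[i + 2];
        count++;
    }
    return [r / count, g / count, b / count];
}
```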
What Happens with Downsampled Texture
This is the best optimization opportunity. SH band 2 (the highest band used) captures features at ~90-degree angular resolution. You need far fewer texels than the source cubemap:
| Downsampled size | Texels | Quality loss | Speedup |
|---|---|---|---|
| 64×64 | 24K | Imperceptible | ~250× vs 1024 |
| 32×32 | 6K | Negligible | ~1000× vs 1024 |
| 16×16 | 1.5K | Slight | ~4000× vs 1024 |
| 8×8 | 384 | Noticeable on strong directional envs | ~16000× vs 1024 |
A 32×32 or 64×64 downsample would be visually indistinguishable from the full-resolution result for SH3. The math is rigorous here — SH band 2 has an angular resolution of ~90°, so anything above ~16×16 per face is oversampling for this purpose.
Other Performance Advice
1. Use .env files (pre-baked polynomial)
The .env format serializes the polynomial coefficients in the file header. On load, UploadEnvSpherical() just reads 27 floats — zero computation. If you control the asset pipeline, always convert HDR/DDS to .env format offline.
2. Skip generation when not needed
```typescript
new HDRCubeTexture("env.hdr", scene, 256, false, /* generateHarmonics */ false);
```
Then manually set a pre-computed or approximated polynomial.
3. Fix the per-texel allocation
Line 201 creates new Color3(r, g, b) inside the hot loop. This could be hoisted to a reusable instance, eliminating millions of allocations. The codebase already uses TmpVectors for exactly this pattern in addLight().
4. Pre-compute solid angles
The _AreaElement calls (4× atan2 per texel) depend only on (u, v) grid positions, not on pixel color. For a given texture size, these could be computed once into a lookup table and reused across all 6 faces (all faces share the same (u,v) grid).
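A sketch of such a lookup table, using the standard cubemap area-element formula (`buildSolidAngleTable` is illustrative, not existing Babylon.js API):

```typescript
// Antiderivative of the cubemap solid-angle density at corner (x, y),
// with x, y in [-1, 1] on the face plane at distance 1.
function areaElement(x: number, y: number): number {
    return Math.atan2(x * y, Math.sqrt(x * x + y * y + 1));
}

// Precompute every texel's solid angle once per face size; all 6 faces
// share the same (u, v) grid, so one table serves the whole cubemap.
function buildSolidAngleTable(size: number): Float32Array {
    const table = new Float32Array(size * size);
    const invSize = 2 / size;
    for (let y = 0; y < size; y++) {
        const v0 = y * invSize - 1;
        const v1 = v0 + invSize;
        for (let x = 0; x < size; x++) {
            const u0 = x * invSize - 1;
            const u1 = u0 + invSize;
            // Texel solid angle = signed sum of the four corner area elements.
            table[y * size + x] =
                areaElement(u0, v0) - areaElement(u0, v1) - areaElement(u1, v0) + areaElement(u1, v1);
        }
    }
    return table;
}
```

Summing the table over one face recovers one sixth of the full sphere (2π/3 steradians), which is a useful sanity check.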
5. GPU-side computation
The entire SH projection is embarrassingly parallel — it’s a map-reduce over texels. A compute shader could do this in <1ms for any size. Several engines (Filament, three.js) have moved this to GPU. The TODO at line 137-139 hints at awareness of optimization opportunities here, though it suggests accumulating into SP basis directly rather than going to GPU.
6. Offload to a Web Worker
Since the heavy computation is purely CPU math on an ArrayBuffer, it could be posted to a Worker to avoid blocking the main thread. The async pattern is already in place (the promise-based lazy getter).
Summary
| Approach | Effort | Speedup | Visual Impact |
|---|---|---|---|
| Use `.env` files | Asset pipeline | Eliminates computation | None |
| Downsample to 32×32 | Moderate code change | ~1000× | Imperceptible |
| Hoist Color3 alloc | Trivial one-liner | ~10-20% | None |
| Pre-compute solid angles | Small code change | ~30-40% | None |
| GPU compute shader | Large | Full-size in <1ms | None |
| Web Worker offload | Moderate | Unblocks main thread | None |
| Dummy (avg color only) | Trivial | Eliminates computation | Minor ambient inaccuracy |
The highest-impact, lowest-effort win is downsampling the input to 32-64px per face before running the SH projection. It’s mathematically justified (SH3 can’t represent higher frequencies anyway) and would make the computation effectively free.