Draft: Compact Animation System

Note

This is a very early draft. It may change in the future and might never be implemented.

Background

This is a redesign of the long-rejected proposal, reworked as a plugin.

Motivation

Currently, the animation system in babylon.js is very flexible, but it is not yet optimized for simple animations such as the GLTF ones.
GLTF animations are the main source of animations running in babylon.js, and gltf defines a much simpler animation model.
Sometimes a gltf model can contain more than 10k animation channels/samplers and millions of keyframes, which can hit the limits of heap memory and CPU performance.
So users could be given the choice to trade flexibility for performance, in both CPU and memory.

Goals

  • Compact: animations should be stored in binary format whenever possible, including not only keyframes but also samplers and runtime data.
  • WASM-first: the main computation should run in wasm, SIMD-accelerated whenever possible.
  • One call per frame: all animation sampling should happen in a single js-wasm call, no more.
  • JS objects only where required: everything that can live in the wasm heap should live there.
  • GLTF-compatible: it should support the animations declared in the core GLTF 2.0 spec, with KHR_animation_pointer left for the future.
  • Babylon.js-compatible: it can be run like an AnimationGroup in babylon.js (if advanced features are not used).
  • Optional: the user can choose to patch babylon.js so that it is enabled by default (for example in the gltf loader and in serialization), but only after an explicit call by the user.
  • Minimal runtime: the js runtime should be minimal (no emscripten), and the wasm ABI (and memory layout) should be stable. Also, if the wasm binary can be reduced to less than 4k, it can be created synchronously on chrome (the size limit was increased a few versions ago, but it cannot be assumed that all users run the latest browsers).
    • To get below 4k, libc’s math functions (acosf, sinf) cannot be used. Either a polynomial approximation is used, or slerp is dropped entirely in favor of onlerp. Both cost precision.
  • Self-contained: no external runtime dependency except for babylon.js.
  • Immutable: the animation system is immutable once constructed.
  • Zero heap growth after the animation system is constructed.
  • Serializable: the wasm heap can be serialized to base64 and loaded back from it, or optionally handled as raw data if the user needs to process the serialized data later.
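
To get a feel for the precision cost mentioned in the minimal-runtime goal, here is a minimal sketch comparing a reference slerp against plain nlerp (lerp, then normalize). Note that this is plain nlerp, not zeux's onlerp, which adds a polynomial correction on top. The `Quat` tuple type and all function names are assumptions made for illustration only.

```typescript
type Quat = [number, number, number, number]; // x, y, z, w

function dot(a: Quat, b: Quat): number {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

function normalize(q: Quat): Quat {
    const len = Math.sqrt(dot(q, q));
    return [q[0] / len, q[1] / len, q[2] / len, q[3] / len];
}

// Plain nlerp: no trigonometry, so no libc math functions are needed.
function nlerp(a: Quat, b: Quat, t: number): Quat {
    const wa = 1 - t;
    const wb = dot(a, b) < 0 ? -t : t; // take the shorter arc
    return normalize([
        a[0] * wa + b[0] * wb, a[1] * wa + b[1] * wb,
        a[2] * wa + b[2] * wb, a[3] * wa + b[3] * wb,
    ]);
}

// Reference slerp: needs acos and sin.
function slerp(a: Quat, b: Quat, t: number): Quat {
    const d = dot(a, b);
    const theta = Math.acos(Math.min(Math.abs(d), 1));
    if (theta < 1e-6) return nlerp(a, b, t); // nearly identical inputs
    const s = Math.sin(theta);
    const wa = Math.sin((1 - t) * theta) / s;
    const wb = (Math.sin(t * theta) / s) * (d < 0 ? -1 : 1);
    return [
        a[0] * wa + b[0] * wb, a[1] * wa + b[1] * wb,
        a[2] * wa + b[2] * wb, a[3] * wa + b[3] * wb,
    ];
}

// Rotation angle (radians) separating two unit quaternions.
function angleBetween(a: Quat, b: Quat): number {
    return 2 * Math.acos(Math.min(Math.abs(dot(a, b)), 1));
}
```

For a 90 degree rotation sampled at t = 0.25, plain nlerp ends up a bit under one degree away from slerp, while at t = 0.5 the two agree. That is the kind of error a polynomial correction is meant to shrink.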

Non-goals

It is not a goal to:

  • Replace the current animation system of babylon.js.
  • Be compatible with the animation curve editor.
  • Be able to change channels/samplers/keyframe data at runtime.
  • Support old browsers without wasm support.
  • Make a multithreaded runtime, which is too heavily restricted by browser vendors.
  • Optimize animation channels (resampling, duplicated frame cleaning, channel target merging, constant channel purging). These should all be done at the model level, via the gltfpack or gltf-transform tool, before the model is imported. (Constant samplers, if detected, could be evaluated at construction time and moved out of the per-frame update list, but they are kept in memory.)
  • Align keyframe data, since unaligned access is pretty fast for modern browsers.
  • Have per-channel loopMode or animation offset.
  • Support non-float inputs/outputs (these should be dequantized/denormalized during construction, if any).
  • Control the playback of individual channels; all channels must start / stop / reset at once.
  • Support all the advanced animation features (blending, weights, etc.).
  • Run animations on GPU like Baked Texture Animations
  • Support sparse or interleaved accessor, the runtime sampler will only contain tightly packed values (stride == element size)

Data structure

// for each animation group, there should be an animation system like this
// 12 bytes on wasm32
struct animation_system_header {
    uint32_t version;
    uint32_t byte_length;  // total byte length of the contiguous data block
    struct animation_system * animation_system;
};

// 44 bytes on wasm32
struct animation_system {
    uint32_t frame_data_length;
    float * sampler_frame_data;
    uint32_t sampler_count;  // total sampler count, = vec3_linear_count + quat_slerp_count + other_count
    struct animation_sampler *samplers;  // base pointer to all samplers (contiguous)
    uint32_t vec3_linear_count;
    struct animation_sampler *vec3_linear_samplers; // fast path for most-used samplers (branchless)
    uint32_t approximate_slerp; // zeux's onlerp (polynomial approximation of slerp via adjusted nlerp)
    uint32_t quat_slerp_count;
    struct animation_sampler *quat_slerp_samplers;
    uint32_t other_count; // fallback path (step, cubic spline, vec4, weights)
    struct animation_sampler *other_samplers;
};

// should mostly be cgltf compatible
typedef enum animation_interpolation_type {
    animation_interpolation_type_linear,
    animation_interpolation_type_step,
    animation_interpolation_type_cubic_spline,
    // cgltf_interpolation_type_max_enum, used to represent const sampler
    // const samplers are pre-evaluated at construction, curr_value is set,
    // and they are excluded from the three processing lists
    animation_interpolation_type_const
} animation_interpolation_type;

typedef enum animation_value_type {
    animation_value_type_vec3,
    animation_value_type_quaternion,
    animation_value_type_vec4,
    animation_value_type_weights
} animation_value_type;

// Per-sampler runtime state (hot-write, contiguous stream).
// Split from animation_sampler to separate hot-write state from readonly metadata,
// reducing false sharing and improving cache utilization (see cache-analysis.md).
// Allocator over-allocates curr_value to value_size floats (padded to 4 for SIMD).
struct animation_sampler_state {
    // current keyframe index hint for temporal-coherent linear scan
    // process_frame walks forward/backward from this index
    uint32_t curr_key_index;
    // set to 1 when the sampled value changes during the current process_frame()
    // call, explicitly reset to 0 when the sampled output is unchanged
    uint32_t value_changed;
    float curr_value[0];  // interpolated output (flexible array, size = value_size)
};

// 36 bytes on wasm32 (was 44 before split-state)
// Readonly after construction — all per-frame mutation goes through state pointer.
struct animation_sampler {
    animation_interpolation_type interpolation;
    animation_value_type value_type;
    uint32_t frame_count;
    // output element count per keyframe:
    //   linear/step: component count (3 for vec3, 4 for quat, N for weights)
    //   cubic_spline: 3 * component count (in-tangent + value + out-tangent per GLTF spec)
    uint32_t value_size;
    float min_frame;
    float max_frame;
    // pointer to runtime state (curr_key_index, value_changed, curr_value)
    struct animation_sampler_state *state;
    // input: keyframe timestamps (float seconds, sorted ascending)
    float *frames;
    // output: keyframe values (tightly packed, stride = value_size)
    // vec3 linear SIMD fast paths use 16-byte glmm_load(), so construction must
    // guarantee one extra float of safe overread at the end of the vec3 stream
    float *values;
};

Recommended vec3 overread-safe packing strategy

For a first TS implementation, the simplest safe rule is:

  1. Keep vec3 keyframes tightly packed as 3 floats per keyframe for ABI compatibility.
  2. When allocating the values block for a vec3 linear sampler, reserve one extra float after the final keyframe.
  3. Initialize that extra float to 0.
  4. Do this per vec3 sampler, not just once globally, so every sampler remains independently relocatable and serializable.

This preserves the current runtime ABI while making every 16-byte glmm_load() stay within allocated memory.
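
The four rules above can be sketched as follows. `packVec3Values` and `load4` are hypothetical helpers, not part of the draft API. `load4` only imitates a 16-byte load on the JS side, so the overread guarantee can be checked without wasm.

```typescript
// Hypothetical helper: packs the vec3 keyframes of ONE linear sampler.
function packVec3Values(
    keyframes: ReadonlyArray<readonly [number, number, number]>
): Float32Array {
    // 3 floats per keyframe (tight, ABI-compatible) + 1 float of overread padding.
    const packed = new Float32Array(keyframes.length * 3 + 1);
    keyframes.forEach((k, i) => packed.set(k, i * 3));
    // Float32Array is already zero-filled; the explicit write documents rule 3.
    packed[packed.length - 1] = 0;
    return packed;
}

// Imitates a 16-byte (4-float) load starting at keyframe `index`.
function load4(values: Float32Array, index: number): Float32Array {
    const start = index * 3;
    if (start + 4 > values.length) {
        throw new RangeError("16-byte load would read past the allocation");
    }
    return values.subarray(start, start + 4);
}
```

With two keyframes the block holds 7 floats, and loading the last keyframe yields its three components followed by the padding zero instead of reading foreign memory.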

Sampler array layout

The samplers array is contiguous and sorted by evaluation category:

samplers[0 .. vec3_linear_count-1]                              → vec3_linear
samplers[vec3_linear_count .. vec3_linear_count+quat_slerp_count-1] → quat_slerp
samplers[... .. ...+other_count-1]                              → other

The three sub-pointers point into this array:

  • vec3_linear_samplers = &samplers[0]
  • quat_slerp_samplers = &samplers[vec3_linear_count]
  • other_samplers = &samplers[vec3_linear_count + quat_slerp_count]
  • sampler_count = vec3_linear_count + quat_slerp_count + other_count

This contiguous-array invariant is required by relocate(): it iterates samplers[0..sampler_count) and assumes the three category pointers are subranges into that single array, not separately allocated sampler blocks.

Const samplers (animation_interpolation_type_const) are pre-evaluated at construction time: their curr_value is set once and they are excluded from all three processing lists. They remain in the samplers array for relocation but are never re-evaluated.
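
As an illustration of the sorting step, the sketch below orders sampler descriptions by evaluation category and produces the remap table a constructor would need to rewrite NodeTarget indices into global `samplers` indices. All names here (`SamplerDesc`, `sortSamplers`, `remap`) are assumptions. Const samplers are deliberately left out, since the draft does not pin down where they sit in the array.

```typescript
type Interp = "linear" | "step" | "cubic_spline";
type ValueType = "vec3" | "quaternion" | "vec4" | "weights";

interface SamplerDesc {
    interpolation: Interp;
    valueType: ValueType;
}

// 0 = vec3_linear, 1 = quat_slerp, 2 = other (step, cubic spline, vec4, weights)
function category(s: SamplerDesc): number {
    if (s.interpolation === "linear" && s.valueType === "vec3") return 0;
    if (s.interpolation === "linear" && s.valueType === "quaternion") return 1;
    return 2;
}

interface SortedLayout {
    order: number[]; // order[sortedIndex] = original index
    remap: number[]; // remap[originalIndex] = sorted (global) index
    vec3LinearCount: number;
    quatSlerpCount: number;
    otherCount: number;
}

function sortSamplers(samplers: SamplerDesc[]): SortedLayout {
    // Sort by category, keeping the original order inside each category.
    const order = samplers
        .map((_, i) => i)
        .sort((a, b) => category(samplers[a]) - category(samplers[b]) || a - b);
    const remap = new Array<number>(samplers.length);
    order.forEach((original, sorted) => { remap[original] = sorted; });
    const count = (c: number) => samplers.filter((s) => category(s) === c).length;
    return {
        order,
        remap,
        vec3LinearCount: count(0),
        quatSlerpCount: count(1),
        otherCount: count(2),
    };
}
```

The three sub-pointers then follow from the counts alone: quat_slerp starts at index `vec3LinearCount`, and other starts at `vec3LinearCount + quatSlerpCount`.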

Data layout (low-high)

Inside wasm linear memory, each animation system occupies a contiguous block:

[stack]         ← WASM stack (256 bytes, addresses [0, 256), grows downward)
[header]        ← animation_system_header (12 bytes)
[frame data]    ← shared keyframe timestamps (sampler_frame_data)
[state stream]  ← contiguous sampler states (16-byte aligned entries)
[system]        ← animation_system struct (44 bytes)
[samplers]      ← sampler array (sorted: vec3_linear, quat_slerp, other)
[keyframes]     ← per-sampler frames[] and values[] arrays

The WASM stack is configured to 256 bytes (-z stack-size=256), used only for float prev[4] arrays in vec3/quat evaluators (64-byte stack frame) when comparing old vs new values via pointer-taking vec3_equal/vec4_equal. The evaluate_other SIMD path avoids the stack entirely by caching old values in v128 locals (wasm operand stack → JIT registers).

All parts of a single animation system must be in a contiguous memory block. Multiple animation systems can share the same wasm memory (and instance) as long as their blocks don’t overlap. Data starts at address 256 (after the stack).
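
The address arithmetic behind this layout can be sketched as below. Several details are assumptions that the draft does not state: `frame_data_length` is taken to count floats, each state entry is taken to be an 8-byte header plus `curr_value` padded to a multiple of 4 floats and rounded up to 16 bytes, and no padding is inserted between the system struct, the sampler array, and the keyframes. `planLayout` is a hypothetical helper.

```typescript
const HEADER_SIZE = 12;  // animation_system_header on wasm32
const SYSTEM_SIZE = 44;  // animation_system on wasm32
const SAMPLER_SIZE = 36; // animation_sampler on wasm32

const align = (n: number, a: number): number => Math.ceil(n / a) * a;

interface BlockLayout {
    header: number;
    frameData: number;
    stateStream: number;
    system: number;
    samplers: number;
    keyframes: number;
}

// base: first address after the stack (256 for the first system).
// valueSizes: value_size of every sampler, in sorted sampler order.
function planLayout(base: number, frameDataFloats: number, valueSizes: number[]): BlockLayout {
    const header = base;
    const frameData = header + HEADER_SIZE;
    const stateStream = align(frameData + frameDataFloats * 4, 16);
    let cursor = stateStream;
    for (const valueSize of valueSizes) {
        // curr_key_index + value_changed (8 bytes) + padded curr_value
        cursor += align(8 + align(valueSize, 4) * 4, 16);
    }
    const system = cursor;
    const samplers = system + SYSTEM_SIZE;
    const keyframes = samplers + SAMPLER_SIZE * valueSizes.length;
    return { header, frameData, stateStream, system, samplers, keyframes };
}
```

For example, with 10 shared timestamps and one vec3 plus one quaternion sampler starting at address 256, this arithmetic places the state stream at 320, the system struct at 384, the samplers at 428, and the keyframes at 500.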

API design

C API

// Evaluate all samplers at curr_frame (in seconds).
// Returns number of samplers whose value_changed flag was set to 1 during this
// call; samplers whose sampled output is unchanged are explicitly reset to 0.
// Processes vec3_linear, quat_slerp, then other. Const samplers are skipped.
uint32_t process_frame(float curr_frame, struct animation_system *sys);

// Adjust all internal pointers by offset (new_base - old_base).
// Used after deserializing/copying a memory block to a different address.
// Fixes: animation_system ptr, sampler_frame_data, all sampler sub-arrays,
// and for each sampler: state, frames, values pointers.
// Requires sys->samplers to be the base of one contiguous sampler array that
// contains all vec3_linear/quat_slerp/other sampler subranges.
void relocate(struct animation_system_header *header, intptr_t offset);
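
To illustrate what process_frame does for one vec3 linear sampler, here is a minimal TypeScript sketch of the temporally coherent key search (walking from `curr_key_index`) followed by a lerp. The real kernel is planned in C with SIMD; the function names and the clamping behavior outside the keyframe range are assumptions.

```typescript
// Walks forward/backward from the previous key index until frames[i] <= t < frames[i + 1].
function findKeyIndex(frames: Float32Array, t: number, hint: number): number {
    if (frames.length < 2) return 0;
    let i = Math.min(Math.max(hint, 0), frames.length - 2);
    while (i < frames.length - 2 && t >= frames[i + 1]) i++; // playing forward
    while (i > 0 && t < frames[i]) i--;                      // seeking backward
    return i;
}

// Samples a tightly packed vec3 stream into `out` and returns the new index hint.
function sampleVec3Linear(
    frames: Float32Array,
    values: Float32Array,
    t: number,
    hint: number,
    out: Float32Array
): number {
    if (frames.length < 2) {
        out.set(values.subarray(0, 3)); // single keyframe: constant value
        return 0;
    }
    const i = findKeyIndex(frames, t, hint);
    const t0 = frames[i];
    const t1 = frames[i + 1];
    // Assumption: times outside the keyframe range hold the end values.
    const u = Math.min(Math.max((t - t0) / (t1 - t0), 0), 1);
    for (let c = 0; c < 3; c++) {
        const a = values[i * 3 + c];
        const b = values[(i + 1) * 3 + c];
        out[c] = a + (b - a) * u;
    }
    return i;
}
```

During normal playback the hint is usually already correct or off by one, so the search is close to constant time per sampler, without the branching of a binary search.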

After the call, JS should read the sampled values directly from the wasm heap.
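
A minimal sketch of such a heap read is shown below. The byte offsets are derived from the field order of the structs above on wasm32 (all fields are 4 bytes, so no padding is assumed): `state` sits at offset 24 of animation_sampler, `value_changed` at offset 4 and `curr_value` at offset 8 of animation_sampler_state. The function name is hypothetical.

```typescript
const SAMPLER_SIZE = 36;              // sizeof(struct animation_sampler) on wasm32
const SAMPLER_STATE_OFFSET = 24;      // offsetof(animation_sampler, state)
const STATE_VALUE_CHANGED_OFFSET = 4; // offsetof(animation_sampler_state, value_changed)
const STATE_CURR_VALUE_OFFSET = 8;    // offsetof(animation_sampler_state, curr_value)

// Returns a view of curr_value if the sampler changed during the last
// process_frame() call, or null if the target can be skipped.
function readChangedValue(
    u32: Uint32Array,
    f32: Float32Array,
    samplersPtr: number, // byte address of samplers[0]
    index: number,       // global sampler index
    valueSize: number    // number of floats to read
): Float32Array | null {
    const samplerPtr = samplersPtr + index * SAMPLER_SIZE;
    const statePtr = u32[(samplerPtr + SAMPLER_STATE_OFFSET) >>> 2];
    if (u32[(statePtr + STATE_VALUE_CHANGED_OFFSET) >>> 2] === 0) {
        return null;
    }
    const start = (statePtr + STATE_CURR_VALUE_OFFSET) >>> 2;
    // subarray() creates a view, not a copy, so no per-frame allocation of data.
    return f32.subarray(start, start + valueSize);
}
```

Both typed arrays are views over the same `WebAssembly.Memory` buffer. They must be recreated if the memory ever grows, which the zero-heap-growth goal is meant to avoid after construction.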

JS API

class CompactAnimationSystem {
    private _instance: WebAssembly.Instance;
    private _memory: WebAssembly.Memory;
    private _headerPtr: number;   // pointer to animation_system_header in wasm memory
    private _systemPtr: number;   // pointer to animation_system
    private _targets: NodeTarget[];

    // Shared typed array views over wasm memory (invalidated on memory.grow)
    private _u32: Uint32Array;
    private _f32: Float32Array;

    // Called by RuntimeAnimation.setValue via the animation's value setter
    set frame(value: number) {
        // calls wasm process_frame, then iterates targets
    }

    // Owned AnimationGroup created by createAnimationGroup().
    // Null before creation and after disposal.
    animationGroup: AnimationGroup | null;
}

// Not a class, to avoid per-instance overhead
// IMPORTANT: Create() mutates the target's sampler index fields (translation,
// rotation, scale, weights) to store sorted indices. Callers must pass
// transient/cloned target objects, not shared originals.
interface NodeTarget {
    node: Node;
    morph?: MorphTargetManager;
    // sampler index into the global `samplers` array, -1 if no channel.
    // These are NOT indices local to vec3_linear/quat_slerp/other subarrays.
    translation: number;
    rotation: number;
    scale: number;
    weights: number;
}

Ownership / disposal semantics

createAnimationGroup() establishes bidirectional lifetime coupling between the returned Babylon.js AnimationGroup and the CompactAnimationSystem:

  • Disposing the CompactAnimationSystem disposes its owned AnimationGroup
  • Disposing the returned AnimationGroup also disposes the CompactAnimationSystem
  • The coupling must be recursion-safe: CompactAnimationSystem.dispose() marks the system disposed before it calls AnimationGroup.dispose(), while the hooked group-dispose path only calls back into the system when the system is not already disposing
  • This is required for GLTF loader integration, because loader users normally only receive the AnimationGroup
  • This is also required for scene teardown, because Babylon.js scene disposal releases animation groups through AnimationGroup.dispose() and does not know about the compact-system WASM allocation directly

If createAnimationGroup() is called more than once on the same system, it returns the already-owned group instead of creating a second AnimationGroup. This preserves the one-system ↔ one-group ownership model.

As a consequence, releasing the returned AnimationGroup is sufficient to free the compact animation WASM block and target references.
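
The recursion-safe coupling can be sketched with two stand-in classes. `FakeAnimationGroup` and its `onDispose` hook are placeholders for however the real AnimationGroup dispose path ends up being hooked, and `freeCount` stands in for releasing the WASM block and target references. None of these names come from Babylon.js.

```typescript
// Stand-in for BABYLON.AnimationGroup with a hypothetical dispose hook.
class FakeAnimationGroup {
    disposed = false;
    onDispose: (() => void) | null = null;

    dispose(): void {
        if (this.disposed) return;
        this.disposed = true;
        this.onDispose?.();
    }
}

class CompactSystemSketch {
    disposed = false;
    freeCount = 0; // how many times the wasm block would have been released
    private _group: FakeAnimationGroup | null = null;

    createAnimationGroup(): FakeAnimationGroup {
        // One system <-> one group: repeated calls return the owned group.
        if (this._group) return this._group;
        const group = new FakeAnimationGroup();
        // Only call back into the system if it is not already disposing.
        group.onDispose = () => {
            if (!this.disposed) this.dispose();
        };
        this._group = group;
        return group;
    }

    dispose(): void {
        if (this.disposed) return;
        this.disposed = true; // mark first, so the group hook does not re-enter
        this.freeCount++;
        const group = this._group;
        this._group = null;
        group?.dispose();
    }
}
```

Whichever side is disposed first, both end up disposed and the release step runs exactly once.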

For glTF weights, the current loader implementation may create multiple weight-only targets that all reference the same compact sampler index — one per primitive mesh with a compatible morphTargetManager. This preserves Babylon.js glTF-loader behavior where one node-level weights track fans out to every compatible primitive under that node.

Serialization

AnimationSystem should be serializable: the used memory block is serialized together with its base pointer. On deserialization, the memory block is copied into a new area, and the relocate function is used to adjust the pointers inside the block by the difference between the new and the old base.

Deserialization is not available until the user explicitly calls the function that patches AnimationGroup.Parse.
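
The copy-then-relocate step can be mirrored in TypeScript to show which pointers have to move. This is only a sketch of what the C relocate() is described to do. The byte offsets are derived from the struct field order on wasm32, and `relocateSketch` and `moveBlock` are hypothetical names.

```typescript
// Pointer fields inside animation_system: sampler_frame_data, samplers,
// vec3_linear_samplers, quat_slerp_samplers, other_samplers.
const SYSTEM_POINTER_OFFSETS = [4, 12, 20, 32, 40];
// Pointer fields inside animation_sampler: state, frames, values.
const SAMPLER_POINTER_OFFSETS = [24, 28, 32];
const SAMPLER_SIZE = 36;

function relocateSketch(mem: DataView, headerPtr: number, offset: number): void {
    const bump = (addr: number): void =>
        mem.setUint32(addr, mem.getUint32(addr, true) + offset, true);

    bump(headerPtr + 8); // header->animation_system
    const sys = mem.getUint32(headerPtr + 8, true);
    for (const o of SYSTEM_POINTER_OFFSETS) bump(sys + o);

    // Relies on the contiguous-array invariant: one array, sampler_count entries.
    const samplerCount = mem.getUint32(sys + 8, true);
    const samplers = mem.getUint32(sys + 12, true);
    for (let i = 0; i < samplerCount; i++) {
        for (const o of SAMPLER_POINTER_OFFSETS) bump(samplers + i * SAMPLER_SIZE + o);
    }
}

// Copies a block to a new base address inside the same heap and fixes its pointers.
function moveBlock(heap: Uint8Array, oldBase: number, newBase: number): void {
    const view = new DataView(heap.buffer, heap.byteOffset, heap.byteLength);
    const byteLength = view.getUint32(oldBase + 4, true); // header->byte_length
    heap.copyWithin(newBase, oldBase, oldBase + byteLength);
    relocateSketch(view, newBase, newBase - oldBase);
}
```

Loading a serialized block works the same way: the stored base pointer plays the role of `oldBase`, and the address where the bytes were written is `newBase`.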

Animation process

scene._animate() →
animatable._animate() →
RuntimeAnimation.animate() →
animation._interpolate() (this makes a dummy animation whose frame is the babylon.js frame, and whose value is the gltf frame) →
RuntimeAnimation.setValue() →
AnimationSystem.set frame() (setter implicitly called by setValue) →
kernel.process_frame() →
iterate targets, fetch each sampler value, and set it on the babylon.js object

Concept Mapping

1 WebAssembly.Memory – 1 WebAssembly.Instance – 1 or many AnimationSystem

1 GLTF animation – 1 AnimationGroup – 1 TargetedAnimation { target: 1 AnimationSystem, animation: 1 Animation } – 1 RuntimeAnimation – 1 Animatable

Also note that if the gltf contains currently unsupported channels or samplers, the AnimationGroup may contain additional BABYLON.TargetedAnimation entries for them.

WASM Feature Flags

  • SIMD128: 128-bit SIMD for vectorized interpolation (vec3_lerp, quat_slerp, vecn_lerp, cubic_spline)
  • Bulk Memory Operations: memory.copy replaces byte-loop memcpy, used in value comparison and state updates
  • Non-trapping Float-to-Int Conversions: i32.trunc_sat_f32_s eliminates trap-check branches in float-to-int casts
  • Sign-extension Operators: i32.extend8_s, i32.extend16_s etc. for efficient in-register sign extension of narrow values

All four features are baseline and supported in all modern browsers (Chrome 91+, Firefox 89+, Safari 16.4+). SIMD128 is the most recent of the four and sets the minimum versions.
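
Since old browsers are out of scope, a loader may still want to fail early with a clear message. One possible check is to validate a tiny module that uses SIMD instructions. In the sketch below the bytes encode a single function returning a v128 built with i8x16.splat and i8x16.popcnt. The function name is an assumption, and the other three features would need their own probe modules.

```typescript
// Minimal wasm module: (func (result v128) i32.const 0 i8x16.splat i8x16.popcnt)
const SIMD_PROBE = new Uint8Array([
    0, 97, 115, 109, 1, 0, 0, 0,            // magic + version
    1, 5, 1, 96, 0, 1, 123,                 // type section: () -> v128
    3, 2, 1, 0,                             // function section: one function of type 0
    10, 10, 1, 8, 0, 65, 0, 253, 15, 253, 98, 11, // code section
]);

// True when the engine accepts SIMD128 instructions.
function supportsSimd128(): boolean {
    return typeof WebAssembly === "object" && WebAssembly.validate(SIMD_PROBE);
}
```

`WebAssembly.validate` only checks the bytes and does not compile or instantiate anything, so the probe is cheap enough to run once at startup.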

Benchmarking

Use the stress test model with minimal draw calls, and collect FPS and heap memory (Firefox heap memory can only be measured via devtools; Chrome can use the performance.memory API).

  • Could advanced animation features (blending, weights, etc.) still work?
  • Any clue how this will compare versus VATs (perf, memory)?

This could be a problem. In the current state, if you have a mixamo rig and like 50 animation groups (running, walking, jumping, etc.), you cannot load all these animations on startup. This is taking way too long. So you end up requesting a single animation on demand (which loads surprisingly fast). If I remember correctly I was at like 12sec per skeleton with maybe 50 animations.

Do I read the bullet point right in that lazy-loading animations is not possible anymore?

I don’t think so. Actually, I’ve never used these features in prod. But since the animation group is still a babylon.js AnimationGroup, it’s possible to combine the compact animation system and babylon.js animation channels that use these advanced animation features in one animation group.

This is planned to support all core animations in the glTF 2.0 spec, while Baked Texture Animations support only skeleton animations.
This is more like a fast path for most common animation cases.

Yeah, that makes sense, let’s change it to one animation system per animation group, with an append-only wasm heap. Users can choose to use one wasm heap and append animation groups to it (they cannot be deleted later, since the wasm spec does not allow freeing or discarding allocated memory), or use one wasm instance and heap for each animation group, which is less efficient but gives more control.

Very cool project, I like the idea of trying to compact and optimize everything in BJS hehe. I’m curious about your SIMD implementation in C - are you going to provide runtime checks for processor capabilities? Are you going to bootstrap from a library that abstracts intrinsics or target each processor type for SSE2, AVX2, AVX-512 etc?

I’ve been working with SIMD with a few projects recently. One big WIP i have going is called “ShaderObject” which has one underlying buffer/arena that projects into WASM through dynamic assemblyscript emission/compilation and structs in webgl2 and webgpu. Assemblyscript is very quick to work with, I’m surprised by its efficiency… Can run ops on 150k instances in a hot loop in under 1ms. I’m also working in Rust in another project and using the wide crate which sits on top and has more out of the box control with runtime checking and provides the intrinsics.

Here’s one part of this for reference, the parent class does ASC compilation on the fly here.

On the web things are simple: since simd128 is the only SIMD widely supported in browsers, there are few things that can be manually optimized.
The C implementation would likely use cglm, a library that has supported simd128 for years.