Rendering models based on MediaPipe Hand tracking model data

Hello Babylon.js community! We are struggling with an application using Babylon.js + MediaPipe, and I would appreciate any answers that help point us in the right direction!

What’s my goal?

  • I am trying to use Babylon.js paired with MediaPipe to render models on top of a video.
    My output should be: mobile camera video with models rendered on our fingers (nails).
  • MediaPipe renders the video in a <video> element; the <video> and <canvas> must be layered so the canvas is on top of the <video>. Or is there a way to render the live video directly in the canvas/WebGL (Babylon.js)?
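Stacking the two elements is plain CSS positioning. A minimal sketch, assuming absolute positioning and z-index layering (the helper name `overlayStyles` and the exact values are mine, not from MediaPipe):

```javascript
// Hypothetical helper: inline styles that stack Babylon's render canvas
// directly over the MediaPipe <video> element. Apply with:
//   Object.assign(videoEl.style, overlayStyles(640, 480).video);
//   Object.assign(canvasEl.style, overlayStyles(640, 480).canvas);
function overlayStyles(width, height) {
  const base = {
    position: "absolute",
    top: "0",
    left: "0",
    width: width + "px",
    height: height + "px",
  };
  return {
    video: { ...base, zIndex: "0" },
    canvas: { ...base, zIndex: "1" }, // canvas above the video
  };
}
```

For the video to show through, the Babylon scene needs a transparent clear color, e.g. `scene.clearColor = new BABYLON.Color4(0, 0, 0, 0)`. As an alternative to layering, Babylon.js also has `BABYLON.VideoTexture`, which can render a webcam feed onto geometry inside the scene itself.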

What’s the problem?

  • I cannot transform the coordinates that MediaPipe returns to display them at the right place in Babylon.js.
  • What is the best camera type for this solution?
  • Should we use WebXR instead?
  • How do we transform the normalized x, y, z coords, using the width/height of the <video> element, into the Babylon.js scene?

MediaPipe returns:

  • normalized coords X,Y,Z relative to our screen (for each joint)
  • World coords with the origin at the hand’s approximate geometric center.

MULTI_HAND_LANDMARKS

Collection of detected/tracked hands, where each hand is represented as a list of 21 hand landmarks and each landmark is composed of x, y and z. x and y are normalized to [0.0, 1.0] by the image width and height respectively. z represents the landmark depth with the depth at the wrist being the origin, and the smaller the value the closer the landmark is to the camera. The magnitude of z uses roughly the same scale as x.
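To make those normalized ranges concrete, here is a minimal sketch of the conversion to pixel space (`landmarkToPixels` and the `mirror` flag are my own names; the mirror flip is an assumption for selfie-mode feeds, not something MediaPipe mandates):

```javascript
// Sketch: convert one MediaPipe normalized landmark to pixel coordinates.
// `landmark` is one entry of results.multiHandLandmarks[hand].
function landmarkToPixels(landmark, videoWidth, videoHeight, mirror = true) {
  const px = landmark.x * videoWidth;
  return {
    x: mirror ? videoWidth - px : px, // flip X for a mirrored (selfie) feed
    y: landmark.y * videoHeight,      // MediaPipe y grows downward, like screen space
    z: landmark.z * videoWidth,       // z uses roughly the same scale as x
  };
}
```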

MULTI_HAND_WORLD_LANDMARKS

Collection of detected/tracked hands, where each hand is represented as a list of 21 hand landmarks in world coordinates. Each landmark is composed of x, y and z: real-world 3D coordinates in meters with the origin at the hand's approximate geometric center.
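If you drive the Babylon meshes from these metric world landmarks instead, note that the axis conventions differ: MediaPipe's y grows downward (image-style), while Babylon.js is y-up. A hedged sketch of the conversion (the sign flips and `scale` factor are assumptions to verify against your camera setup, not an official mapping):

```javascript
// Sketch: map a MediaPipe world landmark (meters, y-down) to a Babylon-style
// y-up coordinate. `scale` converts meters into your scene's units.
function worldLandmarkToBabylon(lm, scale = 1) {
  return {
    x: lm.x * scale,
    y: -lm.y * scale, // flip y: MediaPipe y-down -> Babylon y-up
    z: -lm.z * scale, // flip z if depth looks inverted in your scene
  };
}
```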

MULTI_HANDEDNESS

Collection of handedness of the detected/tracked hands (i.e. is it a left or right hand). Each hand is composed of label and score. label is a string of value either "Left" or "Right". score is the estimated probability of the predicted handedness and is always greater than or equal to 0.5 (and the opposite handedness has an estimated probability of 1 - score).
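In our case the handedness label decides which sphere pool to update. A small sketch (`classifyHand` and the `minScore` threshold are my own; also note that with a mirrored selfie feed, "Left"/"Right" are reported for the unmirrored image, so you may need to swap them):

```javascript
// Sketch: pick the sphere pool for a hand based on MediaPipe handedness.
function classifyHand(handedness, minScore = 0.8) {
  const { label, score } = handedness; // label: "Left" | "Right", score >= 0.5
  if (score < minScore) return null;   // skip low-confidence detections
  return label === "Left" ? "spheresLeft" : "spheresRight";
}
```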

Source: Hands - mediapipe

Tagging @RaananW - not sure if you are familiar with MediaPipe, but this seems to be XR-related

Hello Babylon.js community! I found a partial answer to my question but am still struggling with the depth (Z) coordinate.

I used normalized values taken from MediaPipe and multiplied them by videoWidth and videoHeight.

const coords = {
    // Selfie mode, so we subtract from videoWidth to mirror X
    x: video.videoWidth - result.multiHandLandmarks[hand][i].x * video.videoWidth,
    y: result.multiHandLandmarks[hand][i].y * video.videoHeight,
    z: result.multiHandLandmarks[hand][i].z * video.videoWidth / 0.4
};

// `viewport` is a reference to a FreeCamera originally at (0, 0, 0)
viewport.position.z = -100;

const vector = Vector3.Unproject(
    new Vector3(coords.x, coords.y, 1),
    video.videoWidth,
    video.videoHeight,
    Matrix.Identity(),
    viewport.getViewMatrix(),
    viewport.getProjectionMatrix());

spheresLeft[i].isVisible = true;
spheresLeft[i].position.x = vector.x / cameraHeight;
spheresLeft[i].position.y = vector.y / cameraHeight;

Now my objects are placed on the hand correctly; the problem is with the Z coordinate. We cannot find out how to set the Z coordinate correctly. MediaPipe returns a Z coordinate which is not in [0, 1] but:

Normalized Z where the z-origin is relative to the wrist. I.e. if Z is positive, the z-landmark coordinate is out of the page with respect to the wrist; if Z is negative, the z-landmark coordinate is into the page with respect to the wrist.

Z coord values are around 5.283884E-7 and -5.283884E-7 (they can be even larger in magnitude).

Is there a way to transform the Z coord into the Babylon.js world based on the camera?
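One pragmatic approach (an assumption, not an official mapping): since MediaPipe's z is roughly on the same scale as the normalized x, multiply it by videoWidth like the other axes and apply it as an offset from the wrist's depth in the scene, with an empirical tuning factor. A sketch (`landmarkDepth`, `wristZ` and `zScale` are hypothetical names):

```javascript
// Hedged sketch: derive a Babylon z for a landmark relative to the wrist.
// wristZ: the z your wrist sphere already has in the scene (e.g. 0).
// zScale: empirical factor; start at 1 and tune until depth looks right.
function landmarkDepth(landmark, wristZ, videoWidth, zScale = 1) {
  // MediaPipe z is negative toward the camera and shares roughly the same
  // scale as x, so videoWidth converts it to the same pixel-like units.
  return wristZ + landmark.z * videoWidth * zScale;
}
```

Whether a negative offset should move the landmark toward or away from your FreeCamera depends on where that camera sits (here at z = -100 looking toward +z), so you may need to flip the sign of the offset.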

source: Hand tracking landmarks - Z value range · Issue #742 · google/mediapipe · GitHub

Hello just checking in, are you still having issues? @Martin_Nascak