I ported a 3D pose detection demo into a Babylon Playground. The pre-trained detector comes from MediaPipe. It takes images from your webcam as input and returns an array of landmarks, each with x, y, z coordinates and a confidence score. I massaged these points a bit and drew a red or green box for each landmark to distinguish left from right. I then used the detector’s confidence score to set each box’s visibility. To fit your whole body in the video, you have to stand fairly far back.
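A minimal sketch of that per-landmark update, kept as a pure function so it is easy to follow. The field names (`name`, `x`, `y`, `z`, `score`), the normalized `[0, 1]` image coordinates, the 0.5 score threshold, and the video-plane dimensions are my assumptions, not something from the Playground itself:

```javascript
// Sketch: map a MediaPipe-style landmark to a box transform.
// Assumptions: landmarks look like { name, x, y, z, score } with
// x/y normalized to [0, 1], and names prefixed "left_"/"right_".
const SCORE_THRESHOLD = 0.5; // assumed visibility cutoff
const VIDEO_WIDTH = 2;       // assumed world-space size of the video plane
const VIDEO_HEIGHT = 1.5;

function landmarkToBox(lm) {
  return {
    // Center the plane on the origin and flip x so the view
    // behaves like a mirror.
    position: {
      x: (0.5 - lm.x) * VIDEO_WIDTH,
      y: (0.5 - lm.y) * VIDEO_HEIGHT,
      z: lm.z,
    },
    // Color boxes by body side: green for left, red for right.
    color: lm.name.startsWith('left_') ? 'green' : 'red',
    // Hide low-confidence landmarks instead of drawing jitter.
    isVisible: lm.score >= SCORE_THRESHOLD,
  };
}
```

In the Playground you would apply these values to `box.position`, a material color, and `box.isVisible` on each detector frame.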
I’m wondering if I can get some form of pose estimation from just the WebXR head and hand locations. For example: record the data streaming out of the webcam → pose detector while simultaneously recording head and controller positions from WebXR, then train a model that uses the controller data as input (x) and the pose detector’s output as the supervised target (y).
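One practical detail in that idea: the detector and the XR render loop run at different rates, so the (x, y) training pairs have to be matched up in time. A small sketch of one way to do that, under my own assumptions (samples shaped as `{ t, data }` with `t` in milliseconds, and a `maxGapMs` tolerance for discarding poorly synchronized pairs):

```javascript
// Sketch: build supervised (x, y) pairs from two asynchronous
// streams by nearest-timestamp matching. Assumptions: each sample
// is { t, data }; pose frames arrive more slowly than XR frames;
// pairs farther apart in time than maxGapMs are dropped.
function pairSamples(xrSamples, poseSamples, maxGapMs = 50) {
  const pairs = [];
  for (const pose of poseSamples) {
    // Find the XR sample closest in time to this pose frame.
    let best = null;
    for (const xr of xrSamples) {
      if (best === null ||
          Math.abs(xr.t - pose.t) < Math.abs(best.t - pose.t)) {
        best = xr;
      }
    }
    if (best && Math.abs(best.t - pose.t) <= maxGapMs) {
      pairs.push({ x: best.data, y: pose.data });
    }
  }
  return pairs;
}
```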
One immediate obstacle I see is “normalizing” these two data sets so that they fit on top of each other. The raw data live in completely different “spaces,” with different orientations and scales. Does anyone have any expertise in this area?
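One standard way to line up two point sets like this is a least-squares similarity transform. If we assume both frames are y-up (true for WebXR, and roughly true for the detector output), only a rotation about y is needed, and the fit has a closed form by treating (x, z) as complex numbers. This is my own sketch of that idea, not anything from either API:

```javascript
// Sketch: fit a similarity transform (yaw rotation + uniform
// scale + translation) that maps `src` points onto `dst` points
// by least squares, assuming both frames are y-up. Points are
// [x, y, z] triples; the point sets must correspond pairwise
// and must not be all coincident.
function fitYawSimilarity(src, dst) {
  const n = src.length;
  const mean = pts => pts.reduce(
    (m, p) => [m[0] + p[0] / n, m[1] + p[1] / n, m[2] + p[2] / n],
    [0, 0, 0]);
  const ms = mean(src), md = mean(dst);
  // Treat centered (x, z) as complex numbers and solve
  // dst ≈ (a + bi) · src in closed form:
  // (a + bi) = Σ d·conj(s) / Σ |s|².
  let a = 0, b = 0, denom = 0;
  for (let i = 0; i < n; i++) {
    const sx = src[i][0] - ms[0], sz = src[i][2] - ms[2];
    const dx = dst[i][0] - md[0], dz = dst[i][2] - md[2];
    a += dx * sx + dz * sz;
    b += dz * sx - dx * sz;
    denom += sx * sx + sz * sz;
  }
  a /= denom; b /= denom;
  const s = Math.hypot(a, b); // uniform scale recovered from the fit
  // Return a function mapping any src-frame point into dst-frame.
  return p => [
    md[0] + a * (p[0] - ms[0]) - b * (p[2] - ms[2]),
    md[1] + s * (p[1] - ms[1]),
    md[2] + b * (p[0] - ms[0]) + a * (p[2] - ms[2]),
  ];
}
```

With head + two hands you have three corresponding points per frame, which is enough to fit this; fitting over many frames would average out the noise.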
The hand / controller data is being transformed by the underlying system based on your current reference space (in XR). Babylon offers the viewer space (on the XR session manager), which you can use to get a “normalized” position for them (in world coordinates). Those values will not change even if the user teleports somewhere else. You will, however, need to move them yourself if you want to change the position in which they are rendered.