MediaPipe · Webcam · CodePen

MediaPipe Hands: When Your Webcam Becomes a 3D Skeleton

May 19, 2026



There is a particular magic in watching your own fingers move in real space while a browser—not a native app, not a GPU workstation in the basement—draws a skeleton that follows.

That is what this CodePen is for: MediaPipe Hands running entirely in the tab. Your webcam feeds the model. The main canvas shows the 2D overlay—bones and joints painted on the video. A small panel in the corner hosts a 3D landmark grid where the same joints float in world coordinates, wired together like a miniature rig you can orbit with your hands.

No install. No Python server. Just models pulled from a CDN, a control panel, and the quiet thrill of pose estimation at desk scale.

Try it live

Allow the camera when prompted (the embed must request camera access—your browser will ask). Wave one hand, then two. Toggle Selfie Mode so the mirror feels natural. Watch the FPS counter settle once the WASM graph warms up.

Camera blocked or blank embed? Open the demo in a new tab or try the CodePen fork.

This page hosts a blog-side copy in assets/demo/ with the same UI as the CodePen: collapsible control panel (hamburger), live FPS graph, selfie toggle, camera picker, hand-count and confidence sliders, loading spinner, 3D landmark grid, and MediaPipe branding. The CodePen is still the forkable source; the local copy is what runs in the iframe above.

Two views of the same hands

Most webcam toys stop at stickers on a flat image. This one deliberately splits the story:

| View | What you see | Coordinates |
| --- | --- | --- |
| Main canvas | Video frame + colored connectors and joint dots | Normalized image space (good for drawing on pixels) |
| Corner grid | Wireframe skeleton in a 3D box | World landmarks in meters (good for depth and spatial reasoning) |

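Your brain reads the video overlay as “it sees my hand.” The grid answers the next question: “what is my hand doing in 3D?” World landmarks are measured in meters from the hand’s own center, so the grid shows the hand’s shape and pose rather than its position in the room. That second channel is what makes the demo feel like a pipeline rather than a filter.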

Left and right hands get distinct colors on the 2D layer—green versus red—so when both appear you can still parse ownership. The 3D grid merges multiple hands into one landmark list and offsets bone indices so connections do not cross between hands.

What MediaPipe Hands is doing

Under the hood, the solution is a compact graph:

webcam frame → hand detection → 21 landmarks per hand → 2D + 3D outputs

For each detected hand you receive:

  • multiHandLandmarks: 21 points in normalized image coordinates (wrist, knuckles, fingertips)
  • multiHandWorldLandmarks: the same topology in real-world 3D coordinates, in meters, with the origin at the hand’s approximate geometric center
  • multiHandedness: left vs right classification (note: the model assumes a mirrored, selfie-style input, so the labels come out swapped when the image is not flipped)
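
To draw on pixels, normalized landmarks have to be scaled by the canvas size. Here is a minimal sketch, assuming each landmark is an object shaped like `{x, y, z}` with `x` and `y` as fractions of the image; `toPixels` is a hypothetical helper for illustration, not part of MediaPipe:

```javascript
// Hypothetical helper: convert one normalized landmark to canvas pixels.
// Assumes x and y are fractions of the image width and height.
function toPixels(landmark, width, height) {
  return {
    x: landmark.x * width,
    y: landmark.y * height,
  };
}

// Landmark 8 is the index fingertip in the 21-point hand topology.
const INDEX_FINGER_TIP = 8;

// A fake hand: 21 landmarks at the image center, except the fingertip.
const hand = Array.from({ length: 21 }, () => ({ x: 0.5, y: 0.5, z: 0 }));
hand[INDEX_FINGER_TIP] = { x: 0.25, y: 0.75, z: -0.05 };

const tip = toPixels(hand[INDEX_FINGER_TIP], 1280, 720);
console.log(tip); // { x: 320, y: 540 }
```

The drawing utilities do this scaling for you; the helper only matters once you want to hit-test a fingertip against your own UI.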

The drawing path on each frame is straightforward: clear the canvas, blit the camera image, loop hands, stroke HAND_CONNECTIONS, paint landmarks. Depth even modulates dot size—landmarks closer to the camera grow slightly larger via a lerp on the z component:

drawingUtils.drawLandmarks(canvasCtx, landmarks, {
  color: isRightHand ? '#00FF00' : '#FF0000',
  fillColor: isRightHand ? '#FF0000' : '#00FF00',
  radius: (data) => {
    return drawingUtils.lerp(data.from.z, -0.15, .1, 10, 1);
  }
});

That tiny detail matters: without an explicit depth cue, the overlay can feel like flat clipart. With it, the skeleton breathes as you lean toward the lens.
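
To make the mapping concrete, here is an illustrative re-implementation of that interpolation. It assumes drawingUtils.lerp maps linearly between the two endpoints and clamps outside them; `lerpClamped` and `radiusFor` are hypothetical names, and this is not the library’s source:

```javascript
// Illustrative re-implementation; not the library's source.
// Maps x from [x0, x1] onto [y0, y1] and clamps to the endpoints.
function lerpClamped(x, x0, x1, y0, y1) {
  const t = Math.min(1, Math.max(0, (x - x0) / (x1 - x0)));
  return y0 + t * (y1 - y0);
}

// Same endpoints as the demo: z = -0.15 (near) maps to radius 10,
// z = 0.1 (far) maps to radius 1.
const radiusFor = (z) => lerpClamped(z, -0.15, 0.1, 10, 1);

console.log(radiusFor(-0.15)); // 10
console.log(radiusFor(0.1));   // 1
console.log(radiusFor(-0.5));  // 10 (clamped at the near end)
```

Because the output range runs from 10 down to 1, a more negative z (closer to the lens) yields a bigger dot.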

The 3D grid: cooking two hands into one rig

LandmarkGrid from MediaPipe’s control utilities expects a single landmark array and a connection list. When two hands are active, the demo concatenates world landmarks and reindexes edges:

const landmarks = results.multiHandWorldLandmarks.reduce(
    (prev, current) => [...prev, ...current], []);
let connections = [];
for (let loop = 0; loop < results.multiHandWorldLandmarks.length; ++loop) {
  // Edge endpoints index into the concatenated landmark list, so hand N
  // needs an offset of N * 21 (landmarks per hand). HAND_CONNECTIONS also
  // has 21 entries, which is why its length works as the stride here.
  const offset = loop * mpHands.HAND_CONNECTIONS.length;
  const offsetConnections = mpHands.HAND_CONNECTIONS.map(
      (connection) =>
          [connection[0] + offset, connection[1] + offset]);
  connections = connections.concat(offsetConnections);
  // ... per-hand color metadata for left/right
}
grid.updateLandmarks(landmarks, connections, colors);

When no hands are present, the grid clears—no stale floating bones. The range is tight (0.2 meters) so motion stays readable in the inset; labels carry an m suffix to remind you these are world-space numbers, not arbitrary shader units.
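
The reindexing idea can be tried without MediaPipe loaded. The sketch below uses toy data and a hypothetical `mergeHands` helper; it takes the offset from the number of landmarks already emitted, so it does not rely on any particular landmark or connection count:

```javascript
// Hypothetical helper: flatten several hands into one landmark list and
// shift each hand's edge indices so they point into the flattened list.
function mergeHands(hands, connections) {
  const landmarks = [];
  const edges = [];
  for (const hand of hands) {
    const offset = landmarks.length; // index where this hand starts
    for (const [a, b] of connections) {
      edges.push([a + offset, b + offset]);
    }
    landmarks.push(...hand);
  }
  return { landmarks, edges };
}

// Toy data: two "hands" with 3 joints each and a 2-edge chain.
const toyConnections = [[0, 1], [1, 2]];
const handA = [{ x: 0 }, { x: 1 }, { x: 2 }];
const handB = [{ x: 10 }, { x: 11 }, { x: 12 }];

const merged = mergeHands([handA, handB], toyConnections);
console.log(merged.edges); // [ [ 0, 1 ], [ 1, 2 ], [ 3, 4 ], [ 4, 5 ] ]
```

No edge ever joins index 2 to index 3, which is the property that keeps bones from crossing between hands.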

The control panel (the knobs that matter)

The panel is not decoration—it is how you stress-test the tracker:

| Control | Effect |
| --- | --- |
| Selfie Mode | Horizontally flips the hidden video element so movement matches a mirror |
| Max Number of Hands | Up to four; useful for “clap then separate” moments and party tricks |
| Model Complexity | Lite vs Full: latency vs robustness on difficult poses |
| Min Detection Confidence | How sure the model must be before it claims a new hand |
| Min Tracking Confidence | How sure it must stay before it drops tracking |
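
Each of these controls ends up as a field on the options object given to the Hands solution. The sketch below is one way to sanitize raw panel values first. The field names follow the classic @mediapipe/hands options as best I recall them, and `buildHandsOptions` and `clamp` are hypothetical helpers, so check the names against the version you load:

```javascript
// Hypothetical helper: keep a number inside [lo, hi].
const clamp = (v, lo, hi) => Math.min(hi, Math.max(lo, v));

// Hypothetical helper: turn raw panel values into a safe options object.
function buildHandsOptions(panel) {
  return {
    selfieMode: Boolean(panel.selfieMode),
    maxNumHands: clamp(Math.round(panel.maxNumHands), 1, 4), // panel allows up to four
    modelComplexity: panel.modelComplexity === 0 ? 0 : 1,    // 0 = Lite, 1 = Full
    minDetectionConfidence: clamp(panel.minDetectionConfidence, 0, 1),
    minTrackingConfidence: clamp(panel.minTrackingConfidence, 0, 1),
  };
}

const options = buildHandsOptions({
  selfieMode: true,
  maxNumHands: 7,             // out of range on purpose
  modelComplexity: 1,
  minDetectionConfidence: 0.5,
  minTrackingConfidence: 1.2, // out of range on purpose
});
console.log(options.maxNumHands, options.minTrackingConfidence); // 4 1
// In the browser, this object would be passed to hands.setOptions(options).
```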

SourcePicker wires the webcam through hands.send({ image }) and resizes the output canvas to the viewport while preserving aspect ratio—landscape phones and ultrawide monitors both get a sane frame without stretching fingers into noodles.
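
The aspect-ratio step boils down to a “contain” fit. This is a generic sketch of that idea rather than the demo’s exact resize logic; `fitToViewport` is a hypothetical helper:

```javascript
// Hypothetical helper: largest size with the video's aspect ratio
// that still fits inside the viewport.
function fitToViewport(videoW, videoH, viewW, viewH) {
  const scale = Math.min(viewW / videoW, viewH / videoH);
  return {
    width: Math.round(videoW * scale),
    height: Math.round(videoH * scale),
  };
}

console.log(fitToViewport(1280, 720, 1920, 1080)); // { width: 1920, height: 1080 }
console.log(fitToViewport(1280, 720, 400, 800));   // { width: 400, height: 225 }
```

Taking the smaller of the two scale factors is what prevents stretching: the frame is limited by whichever viewport dimension runs out first.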

Models load from jsDelivr with a version pinned to the Hands package:

const config = {
  locateFile: (file) => {
    return `https://cdn.jsdelivr.net/npm/@mediapipe/hands@${mpHands.VERSION}/${file}`;
  }
};
const hands = new mpHands.Hands(config);
hands.onResults(onResults);

The first seconds show a spinner; when results arrive, document.body gains loaded and the spinner fades—small UX, but it sets expectations while WASM initializes.

Why this still belongs in a creative-coding notebook

Hand tracking is the hello world of spatial interfaces. Once landmarks stream at 30 fps in the browser, you can imagine:

  • Gesture shortcuts — pinch to confirm, palm to cancel, thumbs-up as a hotkey
  • Music and art — fingertips as oscillators, wrist height as a filter cutoff
  • Accessibility experiments — camera-as-input when keyboard or mouse is awkward
  • AR prototypes — world landmarks feed Three.js bones without leaving the tab
  • Sign-language sketches — not production-ready recognition, but an honest sandbox for motion capture

MediaPipe does the heavy lifting: robust hand boxes, consistent topology, reasonable world coordinates. Your job is the semantics on top—what a curl of the index finger means in your app.
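
As one example of semantics on top of landmarks, a finger’s curl can be read as the angle at its middle joint. This is a sketch under the assumption that the index finger uses landmarks 5 (MCP), 6 (PIP) and 7 (DIP) in the 21-point topology; `jointAngle` and `indexCurl` are hypothetical helpers:

```javascript
// Angle in degrees at joint b, formed by the segments b->a and b->c.
function jointAngle(a, b, c) {
  const u = [a.x - b.x, a.y - b.y, a.z - b.z];
  const v = [c.x - b.x, c.y - b.y, c.z - b.z];
  const dot = u[0] * v[0] + u[1] * v[1] + u[2] * v[2];
  const cos = dot / (Math.hypot(...u) * Math.hypot(...v));
  // Clamp to guard against rounding slightly outside [-1, 1].
  return (Math.acos(Math.min(1, Math.max(-1, cos))) * 180) / Math.PI;
}

// Index finger joints: MCP 5, PIP 6, DIP 7 (assumed indices).
function indexCurl(hand) {
  return jointAngle(hand[5], hand[6], hand[7]);
}

// Toy hands: only the three joints we read are filled in.
const straight = [];
straight[5] = { x: 0, y: 0, z: 0 };
straight[6] = { x: 0, y: 0.03, z: 0 };
straight[7] = { x: 0, y: 0.05, z: 0 };

const bent = [];
bent[5] = { x: 0, y: 0, z: 0 };
bent[6] = { x: 0, y: 0.03, z: 0 };
bent[7] = { x: 0.02, y: 0.03, z: 0 };

console.log(Math.round(indexCurl(straight))); // 180
console.log(Math.round(indexCurl(bent)));     // 90
```

An angle near 180 degrees means the finger is extended; what threshold counts as “curled” is an application decision.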

The pen stays Chrome-first (with a polite device-detector warning elsewhere) because that is where the graph was tested hardest when the demo was written. Fork it anyway—just know you are trading support for curiosity.

Hand tracking vs cousins

| Approach | Best for |
| --- | --- |
| MediaPipe Hands (this pen) | Fast 21-point skeleton, browser, multiple hands |
| MediaPipe Holistic / Pose | Full body + face mesh when hands are not enough |
| WebXR hand input | Headset-native hand tracking through platform APIs; no webcam |
| Depth cameras (RealSense, etc.) | Metric room-scale scenes; hardware cost |
| Training your own model | Custom gestures or domains MediaPipe never saw |

Reach for Hands when you want finger-level interaction without shipping a native binary. Reach for pose when the story is shoulders and gait. Reach for hardware depth when millimeters in a room matter.

What I would change today

If I refreshed the pen now:

  • Tasks Vision API — migrate from the classic @mediapipe/hands graph to the newer Hand Landmarker task API where maintenance is headed
  • npm / Vite module — same logic, but imports instead of globals on window
  • Privacy copy — one line that frames are processed locally and never uploaded (true for this pen; say it out loud)
  • HTTPS-only assets — drop legacy http:// logo links that browsers may block on secure embeds
  • Gesture layer — a tiny state machine on landmark angles (pinch distance threshold) to prove the skeleton is input, not just decoration
  • Three.js hand mesh — drive a rig from multiHandWorldLandmarks in the main viewport, not only the inset grid
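
The gesture layer from the list above could start as small as this. It is a sketch, assuming landmark indices 0 (wrist), 4 (thumb tip), 8 (index tip) and 9 (middle-finger MCP); the thresholds are illustrative guesses, not tuned values, and `createPinchDetector` is a hypothetical name:

```javascript
const WRIST = 0, THUMB_TIP = 4, INDEX_TIP = 8, MIDDLE_MCP = 9;

const dist = (a, b) => Math.hypot(a.x - b.x, a.y - b.y, a.z - b.z);

// Hypothetical detector with hysteresis: a pinch starts below `enter`
// and only ends above `exit`, so jitter near one threshold cannot flicker.
function createPinchDetector({ enter = 0.25, exit = 0.4 } = {}) {
  let pinching = false;
  return function update(hand) {
    // Normalize by palm length so the thresholds do not depend on hand size.
    const palm = dist(hand[WRIST], hand[MIDDLE_MCP]);
    const ratio = dist(hand[THUMB_TIP], hand[INDEX_TIP]) / palm;
    const before = pinching;
    if (!pinching && ratio < enter) pinching = true;
    else if (pinching && ratio > exit) pinching = false;
    // Report only transitions, so a pinch fires once rather than every frame.
    if (pinching && !before) return 'pinch-start';
    if (!pinching && before) return 'pinch-end';
    return null;
  };
}

// Toy frames: palm length fixed at 0.1, fingertip gap varies.
const frame = (gap) => {
  const hand = Array.from({ length: 21 }, () => ({ x: 0, y: 0, z: 0 }));
  hand[MIDDLE_MCP] = { x: 0, y: 0.1, z: 0 };
  hand[INDEX_TIP] = { x: gap, y: 0, z: 0 };
  return hand;
};

const detect = createPinchDetector();
const events = [0.08, 0.02, 0.03, 0.035, 0.06].map((gap) => detect(frame(gap)));
console.log(events); // [ null, 'pinch-start', null, null, 'pinch-end' ]
```

The two middle frames sit between the thresholds and produce no event, which is the hysteresis doing its job. Mapping 'pinch-start' to play/pause is then a one-line handler.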

None of that changes the core lesson: the browser can estimate 3D structure from a flat camera feed, and when you show both the video overlay and the world grid, people immediately understand what “landmarks in space” means.

The lesson I still keep

The best perception demos do not lecture about tensors. They let you wiggle your fingers and watch something faithful follow.

This pen is not a productized sign-language engine or a VR shell. It is a well-lit window into MediaPipe’s hand graph: confidence sliders, selfie flip, FPS tick, and that corner grid reminding you that every green dot on the video has a coordinate in meters somewhere off your wrist.

Fork it, log the landmarks to the console, map a pinch to play/pause on your favorite track. You will stop thinking of computer vision as a cloud API and start thinking of it as geometry arriving frame by frame—which is exactly what it is.


Demo: assets/demo/ · CodePen: codepen.io/maggiben/pen/MWvVYqy · MediaPipe: developers.google.com/mediapipe