The Handpose feature in ml5.js allows us to use a pre-trained model to detect a hand in an image (including static images, video, or a webcam feed). The key output is a set of 21 points representing the positions of the fingers (finger tips and knuckles) and the palm of the hand.
Handpose is a feature of ml5.js that provides access to a model which performs hand detection in images. In essence it lets us provide an image (via an image or video file, or through the webcam) and the model will locate a hand and label an impressive set of points on that hand (fingertips, every knuckle, and the base of the palm).
The best way to understand this feature is, unsurprisingly, to read its documentation.
If we go to the Handpose reference page we’ll find ml5.js’s standard presentation of its API. As with most API documentation, it has its own particular approach but follows general principles of providing specific descriptions of methods and properties as well as examples. Let’s look at the component pieces here…
The page begins with an image that shows us precisely what the feature offers. It can locate a hand in serious detail! We see all the little data points it can associate with and track on a hand! It is quite impressive!
Handpose is a machine-learning model that allows for palm detection and hand-skeleton finger tracking in the browser. It can detect a maximum of one hand at a time and provides 21 3D hand keypoints that describe important locations on the palm and fingers.
Sounds good! There are a couple of things here worth paying attention to: it detects a maximum of one hand at a time, and it provides 21 3D keypoints describing important locations on the palm and fingers.
The Quickstart section shows us a minimal amount of code required to load the Handpose model and begin recognizing hands. This is clearly useful because it shows us the usage of the model in situ, but can be awkward if we’re not used to the specific way it’s being achieved.
For example, we may be used to working in p5.js, but the example may present the use of the feature in a different context (such as based on the DOM). With time, we’ll be able to interpret this kind of “bare bones” example and translate it to our own way of working.
At the very least we can identify the key instructions and think about how we might incorporate them into our own framework.
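For instance, the key instructions might translate into p5.js along roughly these lines (a minimal sketch in the spirit of the fuller framework later in this section, not the Quickstart code itself):

// A minimal p5.js translation of the key Handpose instructions
// (see the fuller framework later in this section)

let video;
let handpose;
let predictions = [];

function setup() {
  createCanvas(640, 480);
  // Start the webcam and hide its HTML element
  video = createCapture(VIDEO);
  video.hide();
  // Load the Handpose model, telling it to watch the webcam
  handpose = ml5.handpose(video, function() {
    console.log(`Handpose loaded.`);
  });
  // Listen for predictions and remember the most recent set
  handpose.on(`predict`, function(results) {
    predictions = results;
  });
}

function draw() {
  image(video, 0, 0, width, height);
}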
In the Usage section we see the actual details of the API. This is especially useful as a place to refer back to when we start using Handpose and might want to know more about how the core methods work. Like many of the ml5.js features, it’s chiefly broken down into an initialization method to get it started and then prediction methods that allow us to request its hand predictions.
There are also a number of properties we could potentially access to check things like the current configuration of the model or whether the model is ready yet.
We should note that we can either request an individual prediction with .predict() or we can continuously request predictions with the event handler .on('predict', callback). Which one we use will depend particularly on whether we're doing something in realtime (in which case we'd use the event handler) or at specific moments (in which case we'd use the .predict() method).
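In practice (borrowing the handpose, video, and predictions variables from the framework later in this section) the two approaches might look something like this. Note that the exact arguments .predict() accepts are an assumption here; the Usage section has the precise signature.

// Continuous predictions: Handpose calls this every time it has new results
handpose.on(`predict`, function(results) {
  predictions = results;
});

// A one-off prediction at a specific moment (e.g. when the user clicks),
// assuming .predict() returns a promise of the predictions array
function mousePressed() {
  handpose.predict(video)
    .then(function(results) {
      console.log(results);
    });
}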
Importantly, we should always pay attention to the output of the prediction methods, which is described as:
Returns an array of objects describing each detected hand. You can see all of the supported annotation in the Tensorflow source code.
If we click through to the Tensorflow source code we can see the basic structure being offered, but we don’t really ever get a truly precise description of the data format that the model outputs. Once again, we’ll probably have to work it out ourselves by using the Handpose feature and outputting the data so we can read through it.
ml5.js is great about providing examples of using its features. Even better, the examples are provided in different frameworks, including p5.js and plain JavaScript. This makes our lives a lot easier as we can get a better sense of how to actually use any given feature.
We’ll return to the examples shortly as they’re a key way to get up and running with this feature. Let’s keep reading for now.
No demo for this feature as yet.
No tutorials for this feature as yet.
A wonderful thing about the ml5.js API is that it at least tries to provide a discussion of the model and data behind each of its features, which is useful and worth a look (not least to understand where a model comes from and what data it was trained on).
However, at the moment, Handpose doesn’t have its model and data provenance available yet. A shame, but nobody’s perfect.
People to include in your prayers of thankfulness.
If you really want to go there, you can actually look at how this feature is written! (It can be surprisingly “simple” or at least short because it’s generally “just” an intermediary talking to TensorFlow.js.)
Now that we’ve got our overview of what this feature is and roughly how it works, we can turn our attention to perhaps the key learning resource, which is the examples section.
In particular, looking at the versions of the examples written in the p5 web editor is a good option because they’re immediately available and runnable. In this case we have examples of detecting a hand in a static image and live via the webcam. So, we should look at them!
The great thing here is that we can both run the example and look at the code that makes it work at the same time. Let’s focus on the webcam version. Importantly this tells us a couple of things worth remembering, notably that the model is loaded and started in setup() and that it takes time to load before it can start making predictions.

In essence these examples can give us some basic structure to work with in these scenarios (static image or webcam). We can largely mimic this basic setup and then change what we do with the data in order to make our projects work with this feature.
With this kind of technology it’s really crucial to understand how the data is being provided to your program as the output of the model. This data is the stuff we can actually use to do interesting things in our projects.
There isn’t explicit documentation in the Handpose API that tells us with total precision the nature of the output, but there are a number of useful clues.
In the example code we see that they draw circles at each point on the hand identified by Handpose, and that they do this by going through a landmarks array. Each element in the landmarks array is another array, and they’re using the first (0) and second (1) elements as x and y coordinates to draw the points.
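The drawing code in the example looks roughly like this (a paraphrase rather than the exact code, using a predictions array like the one in the framework below):

// Go through each predicted hand (there will be at most one)
for (let i = 0; i < predictions.length; i++) {
  let prediction = predictions[i];
  // Go through each of the 21 landmark points and draw a circle there
  for (let j = 0; j < prediction.landmarks.length; j++) {
    let keypoint = prediction.landmarks[j];
    fill(0, 255, 0);
    noStroke();
    ellipse(keypoint[0], keypoint[1], 10, 10);
  }
}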
That’s useful but makes it kind of difficult to understand exactly which point is which within the landmarks array. Which one is the thumb? Which one is the index finger?
Here we can return to the documentation for .predict() and at least get a basic picture of the object representing a detected hand. This is pretty useful, but still doesn’t give us a precise map of the specific data being used to identify each point on the hand.
In the end, the best way to see the available data is often to look at it explicitly in a running program. In the ml5.js example we can use console.log() to actually display the results array that comes with a prediction. We can do this in the p5 web editor and then view the object in our JavaScript console.
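For example, in the prediction event handler (the same pattern the framework below uses) we might add a console.log():

handpose.on(`predict`, function(results) {
  // Print the full prediction data so we can inspect its structure in the console
  console.log(results);
  predictions = results;
});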
This shows us the entire contents of the hand data, including:

- annotations, which is an object containing properties for each finger and the palm
- boundingBox, which contains the top left and bottom right coordinates of the hand’s bounding box
- handInViewConfidence, which tells us how sure Handpose is that it can see a hand
- landmarks, which also contains the hand data as in annotations, but just as an array of 21 points without information about what they are

So the annotations object seems like the most easily accessible form of data output, but it’s still kind of obscure, because if we look at something like annotations.indexFinger, for instance, we’ll see it contains an array of four elements.
We can see each element contains coordinates for a point on the index finger presumably, but which point? At this stage we basically need to play around to figure out that in order in the annotations.indexFinger array we have points referring to the base of the finger (or thumb), the two knuckles, and then the tip of the finger. (One way to find this out would be to number the different points explicitly on the canvas while displaying the image feed.)
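That last suggestion might look something like this (a hypothetical helper, not part of the ml5.js example):

// Draws each landmark's index number at its position on the canvas
// so we can see which number corresponds to which part of the hand
function numberLandmarks(hand) {
  push();
  fill(255, 0, 0);
  textSize(14);
  for (let i = 0; i < hand.landmarks.length; i++) {
    let point = hand.landmarks[i];
    text(i, point[0], point[1]);
  }
  pop();
}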
So if you want to get the tip of the index finger it’s at the position specified in annotations.indexFinger[3]. Yikes. Nobody said data was always organized in an easy-to-understand way.
At least we know now.
Generally speaking the Handpose data is pretty straightforward in terms of confidence or bounding boxes or specific points on the hand.
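To make that concrete, here’s a small sketch (assuming the property names we saw in the console above) of pulling a few of those pieces out of the first prediction:

if (predictions.length > 0) {
  let hand = predictions[0];
  // How confident Handpose is that it can see a hand
  let confidence = hand.handInViewConfidence;
  // The corners of the hand's bounding box
  let topLeft = hand.boundingBox.topLeft;
  let bottomRight = hand.boundingBox.bottomRight;
  // The tip of the index finger (an array of coordinates)
  let indexTip = hand.annotations.indexFinger[3];
  console.log(confidence, topLeft, bottomRight, indexTip);
}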
Perhaps the key thing to keep in mind is the way that Handpose (and most ml5.js models) uses arrays to represent coordinates. So for a point in 3D space it will have a three element array where the first element (position 0) is the x coordinate, the second element (position 1) is the y coordinate, and the third element (position 2) is the z coordinate.
Why do they do this? Quite possibly a matter of efficiency, but it’s hard to say.
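If those numeric positions feel opaque, one option (not something the ml5.js examples do, just a readability suggestion) is to destructure each point into named variables:

// Pull the coordinates of the index fingertip out into named variables
let [x, y, z] = hand.annotations.indexFinger[3];
ellipse(x, y, 20);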
Now we’re ready to start a project with this feature in our own setup. So, we’d need to include the ml5.js library in our project (alongside p5.js), start the webcam, load the Handpose model, listen for its predictions, and then do something with the data it gives us.
Here’s a fairly detailed suggested framework for using Handpose and the webcam:
/**
Handpose Framework
Pippin Barr

A skeleton framework for using ml5.js's Handpose feature. Includes a
loading screen followed by a live webcam feed with a circle drawn at
the tip of the user's index finger.
*/

"use strict";

// Current state of program
let state = `loading`; // loading, running
// User's webcam
let video;
// The name of our model
let modelName = `Handpose`;
// Handpose object (using the name of the model for clarity)
let handpose;
// The current set of predictions made by Handpose once it's running
let predictions = [];

/**
Starts the webcam and the Handpose model
*/
function setup() {
  createCanvas(640, 480);

  // Start webcam and hide the resulting HTML element
  video = createCapture(VIDEO);
  video.hide();

  // Start the Handpose model and switch to our running state when it loads
  handpose = ml5.handpose(video, {
    flipHorizontal: true
  }, function() {
    // Switch to the running state
    state = `running`;
  });

  // Listen for prediction events from Handpose and store the results in our
  // predictions array when they occur
  handpose.on(`predict`, function(results) {
    predictions = results;
  });
}

/**
Handles the two states of the program: loading, running
*/
function draw() {
  if (state === `loading`) {
    loading();
  }
  else if (state === `running`) {
    running();
  }
}

/**
Displays a simple loading screen with the loading model's name
*/
function loading() {
  background(255);

  push();
  textSize(32);
  textStyle(BOLD);
  textAlign(CENTER, CENTER);
  text(`Loading ${modelName}...`, width / 2, height / 2);
  pop();
}

/**
Displays the webcam.
If there is a hand it highlights the tip of the index finger.
*/
function running() {
  // Display the webcam with a reversed image so it's a mirror
  let flippedVideo = ml5.flipImage(video);
  image(flippedVideo, 0, 0, width, height);

  // Check if there are currently predictions to display
  if (predictions.length > 0) {
    // Technically there will only be ONE because it only detects ONE hand
    // Get the hand predicted
    let hand = predictions[0];
    // Highlight it on the canvas
    highlightHand(hand);
  }
}

/**
Provided with a detected hand it highlights the tip of the index finger
*/
function highlightHand(hand) {
  // Display a circle at the tip of the index finger
  let index = hand.annotations.indexFinger[3];
  let indexX = index[0];
  let indexY = index[1];

  push();
  fill(255, 255, 0);
  noStroke();
  ellipse(indexX, indexY, 50);
  pop();
}
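As one small example of going further with the data, we could add a function like this to the framework to draw the hand’s bounding box and show the model’s confidence. This is a hedged sketch: it assumes the boundingBox and handInViewConfidence properties described earlier, and that topLeft and bottomRight are [x, y] arrays like the other points. We’d call it from running() alongside highlightHand().

/**
Provided with a detected hand it draws its bounding box and displays
Handpose's confidence that it can see a hand
*/
function highlightBoundingBox(hand) {
  let topLeft = hand.boundingBox.topLeft;
  let bottomRight = hand.boundingBox.bottomRight;

  push();
  // The box itself
  noFill();
  stroke(255, 255, 0);
  rect(topLeft[0], topLeft[1], bottomRight[0] - topLeft[0], bottomRight[1] - topLeft[1]);
  // The confidence value just above the box
  noStroke();
  fill(255, 255, 0);
  textSize(16);
  text(hand.handInViewConfidence.toFixed(2), topLeft[0], topLeft[1] - 10);
  pop();
}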
At this point we have the ability to start the model and access the data it provides. It’s now up to us to use that for our wildest machine learning dreams! What if you…
And so on. And it doesn’t even have to just be cats!
So that’s the Handpose feature of ml5.js. There’s a lot going on here and it is very, very powerful as a technology. It’s exciting to think we get to use this technology! On the web! There are a lot of skills we need to employ here, but the result is access to something genuinely incredible.