Imagine strapping on a virtual reality headset, then using your hands to pick up a sword and swing it around your head. Imagine a hazard team able to defuse a complicated bomb from a mile away, just by controlling a robot's hand as effortlessly as their own. Imagine painting a picture on your computer just by waving a brush in front of your screen. Or, if you prefer, imagine using a computer like in Minority Report, whisking away pages and files just by grabbing them with your hands.
Handpose, a new project from Microsoft Research, could make all that possible, giving computers the ability to track the precise movement of your hands through a Microsoft Kinect, right down to the wiggle of a finger. It's not the first project to make progress in this space (a Leap Motion 2 hooked up to an Oculus Rift can already do this), but Microsoft's software innovation promises to be faster, and it can work from across the room, on existing hardware or, eventually, on mobile phones.
Using the Handpose software, the first thing a user does is scan his or her hand by holding it up in front of the Kinect to create a 3-D model. In the lab, the process currently takes about a second, less time than it takes an iPhone's Touch ID sensor to read your fingerprint. Once the system has created a 3-D model of your hand, Handpose lets you control that model on screen in real time, at around 30 frames per second. From there, you can use the on-screen hand as if it were a doppelganger of your own.
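To picture that flow in code, here's a minimal Python sketch of the calibrate-then-track pipeline the article describes. Every name in it is a hypothetical stand-in of mine, not Handpose's actual API, and the stubs do none of the real work:

```python
import time

FRAME_RATE = 30  # the real-time target described above, in frames per second

class StubSensor:
    """Stand-in for a Kinect-style depth camera."""
    def capture_depth_frame(self):
        return [[0.0] * 4 for _ in range(4)]  # dummy 4x4 depth map

def calibrate(depth_frame):
    # One-time scan: build the personalized 3-D hand model
    # (about a second in the lab, per the article).
    return {"model": "personalized-hand"}

def fit_pose(hand_model, depth_frame):
    # Per-frame fitting: recover the hand's joint angles from the
    # depth image. The real system's heavy lifting happens here.
    return [0.0] * 30  # 30 joint parameters, all zeros in this stub

def run_session(sensor, n_frames=90):
    hand_model = calibrate(sensor.capture_depth_frame())
    for _ in range(n_frames):
        pose = fit_pose(hand_model, sensor.capture_depth_frame())
        # ...drive the on-screen hand with `pose` here...
        time.sleep(1 / FRAME_RATE)

run_session(StubSensor())
```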
The Microsoft Kinect was relatively good at detecting your body gestures from the start, says Andrew Fitzgibbon, principal researcher in the machine learning and perception group at Microsoft Research Cambridge. That includes the motion of your legs, your head, and your arms. But one area where the Kinect and other motion- and depth-sensing devices are rubbish is figuring out what you're doing with your hands.
"It can tell roughly where your palm and wrist is, but that's it," Fitzgibbon tells me. At best, it can tell if you're waving at it, but can't even do something as simple as detect if you're doing a thumbs up or thumbs down. "We believe that if you could accurately track the positions of a user's hands, right down to the angle of every knuckle and every digit, we believe motion-sensing technology would give rise to a whole new class of user interface." Fitzgibbon calls this new class of UI a Direct Physical Interface: one where users could interact with virtual objects just by reaching out and grabbing them as if it were physical.
The problem is extraordinarily complicated. Fitzgibbon says that for any motion-tracking system to identify what the hand is doing, it needs to detect 30 different points on the human hand. That doesn't sound like much, but the ways those 30 points can move together spawn trillions of possible combinations. Brute-forcing the calculation would take an "infinite" amount of computing power, says Fitzgibbon, and that's ignoring the fact that the Microsoft Kinect can't actually see all of your fingers, because many of them are hidden from the sensor during certain gestures (for example, crossing your fingers, or folding your hands). So even inaccurate hand gesture recognition is brutally slow.
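To get a feel for the scale, a back-of-the-envelope count helps. The three-positions-per-joint simplification below is mine, not Fitzgibbon's, and it's far coarser than what real tracking needs:

```python
# Discretize each of 30 joint parameters into just 3 coarse positions
# (say: straight, half bent, fully bent) and count the combinations.
positions_per_joint = 3
joints = 30
print(f"{positions_per_joint ** joints:,}")  # 205,891,132,094,649
```

Even at that absurdly low resolution, there are over 200 trillion distinct hand configurations to consider.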
What Handpose's algorithm does is speed up a computer's ability to accurately recognize hand gestures by as much as 10 times. It does this by using a technique called particle swarm optimization, an algorithm that winnows the Kinect's trillions of initial guesses about where your hand is down to a pool of 200 likely candidates. That pool is then refined, step by step, until it converges on a good enough match.
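Particle swarm optimization is a well-established optimization technique, not something unique to Handpose. Here's a generic Python sketch of the idea, assuming a 30-parameter pose vector; the energy function is a placeholder for the real system's comparison between the Kinect's depth image and a rendered hand model, a detail the article doesn't cover:

```python
import numpy as np

N_JOINTS = 30      # degrees of freedom in the hand model
N_PARTICLES = 200  # the pool of likely guesses, per the article
N_ITERATIONS = 50

def pose_energy(pose, target_pose):
    # Placeholder: squared error against a known target. The real
    # system would instead score how well a rendered 3-D hand model
    # in this pose matches the observed depth image.
    return np.sum((pose - target_pose) ** 2)

def particle_swarm(target_pose, seed=0):
    rng = np.random.default_rng(seed)
    # Start with random guesses scattered across the joint-angle space.
    positions = rng.uniform(-np.pi, np.pi, (N_PARTICLES, N_JOINTS))
    velocities = np.zeros_like(positions)

    personal_best = positions.copy()
    personal_best_energy = np.array(
        [pose_energy(p, target_pose) for p in positions])
    global_best = personal_best[personal_best_energy.argmin()].copy()

    inertia, cognitive, social = 0.7, 1.5, 1.5
    for _ in range(N_ITERATIONS):
        r1 = rng.random((N_PARTICLES, N_JOINTS))
        r2 = rng.random((N_PARTICLES, N_JOINTS))
        # Each particle drifts toward its own best guess so far and
        # toward the best guess found by the whole swarm.
        velocities = (inertia * velocities
                      + cognitive * r1 * (personal_best - positions)
                      + social * r2 * (global_best - positions))
        positions += velocities

        energies = np.array([pose_energy(p, target_pose) for p in positions])
        improved = energies < personal_best_energy
        personal_best[improved] = positions[improved]
        personal_best_energy[improved] = energies[improved]
        global_best = personal_best[personal_best_energy.argmin()].copy()

    return global_best

# Demo: recover a hidden target pose from random starting guesses.
target = np.random.default_rng(1).uniform(-np.pi, np.pi, N_JOINTS)
estimate = particle_swarm(target)
print("mean joint error (radians):", np.abs(estimate - target).mean())
```

In a tracker, each particle would be one full candidate hand pose, and the swarm's best guess after refinement becomes the tracked pose for that frame.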
Fitzgibbon reckons the difference between existing hand-recognition systems and what Handpose can do is the difference between using Graffiti on Palm OS back in the mid-'90s (essentially, a symbolic language of crude gestures that didn't actually mimic what it's like to write with a pen) and modern handwriting recognition systems, which can understand cursive, calligraphy, and more.
Fitzgibbon is careful to note that Handpose isn't ready for retail yet. He says Handpose will probably be good enough for totally accurate hand gesture recognition when it's twice as fast as it is now. When that happens, he says, expect it to change our interactions with everything from computers, video games, and virtual reality to television sets and robots.
And as for when that will be? "I think it was Bill Gates who once said that you overestimate what you can do in a year, and underestimate what you can do in 10," Fitzgibbon laughs. "So let's say somewhere in the middle, maybe five."