Picking up on body language is a skill people hone from a lifetime of experience. But researchers are trying to bring robots up to speed on reading human social cues – in no time at all.
A group at MIT’s Computer Science and Artificial Intelligence Laboratory has used hundreds of videos from past episodes of popular television shows – like The Office, The Big Bang Theory and Desperate Housewives – to train their algorithm to read human body language and predict what people are going to do next.
“Humans automatically learn to anticipate actions through experience, which is what made us interested in trying to imbue computers with the same sort of common sense,” Carl Vondrick, a Ph.D. candidate and first author on the research, said in a statement. “We wanted to show that just by watching large amounts of video, computers can gain enough knowledge to consistently make predictions about their surroundings.”
The researchers zeroed in on greetings. After the 600-hour marathon of video input, they tested whether the program could predict in advance how two people were about to greet each other: with a handshake, a hug, a kiss or a high-five.
At first, the folks at MIT tested how the algorithm would fare anticipating how people would greet each other one second in advance, and it was correct more than 43 percent of the time. For context, existing programs’ predictions have a 36-percent accuracy rate. Meanwhile, living, breathing humans were able to correctly predict the greeting 71 percent of the time, so there’s still plenty of room for improvement.
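To make the task concrete, here is a minimal toy sketch of that kind of prediction, not MIT’s actual system: map a feature description of the scene one second before a greeting to one of the four greeting classes. The two-number “features” (arm height, body distance) and their class centroids are invented for illustration; a simple nearest-centroid rule stands in for the deep network trained on 600 hours of video.

```python
# Hypothetical illustration only -- the features, centroids, and
# nearest-centroid rule are invented, not from the MIT research.

GREETINGS = ["handshake", "hug", "kiss", "high-five"]

# Made-up 2-D features observed one second before the greeting:
# (how high the arms are raised, how far apart the two people stand).
CENTROIDS = {
    "handshake": (0.5, 0.6),
    "hug":       (0.7, 0.2),
    "kiss":      (0.3, 0.1),
    "high-five": (0.9, 0.7),
}

def predict_greeting(features):
    """Return the greeting whose centroid is closest to the features."""
    def sq_dist(label):
        cx, cy = CENTROIDS[label]
        fx, fy = features
        return (fx - cx) ** 2 + (fy - cy) ** 2
    return min(GREETINGS, key=sq_dist)

# Arms partly raised, bodies close together -- nearest to "hug".
print(predict_greeting((0.68, 0.25)))  # prints "hug"
```

The real system learns its features and decision rule from video rather than from hand-picked centroids, but the shape of the problem is the same: observe the moment before, output one of four labels.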
“There’s a lot of subtlety to understanding and forecasting human interactions,” Vondrick said. “We hope to be able to work off of this example to be able to soon predict even more complex tasks.”
Right now, the algorithm still needs some work. Once it becomes more accurate, future iterations of the program could be used in robots to make them react and interact with the human world more appropriately. But first, the program has a lot more video-watching ahead of it.
“I’m excited to see how much better the algorithms get if we can feed them a lifetime’s worth of videos,” said Vondrick. “We might see some significant improvements that would get us closer to using predictive vision in real-world situations.”