When Roger Ebert of the Chicago Sun-Times lost his ability to speak due to cancer, he began a journey to find his own voice by way of computer-generated speech. In a 2010 interview on The Oprah Winfrey Show, the movie critic demonstrated how a Scotland-based company had created a computer program, using recordings of Ebert’s own voice, to speak like him. “You’ll know it’s a computer,” Ebert told the audience. “But it will sound like me.”
The Johns Hopkins Workshop on Human Language Technology, entering its 17th year, gathers experts on language and speech processing each summer to tackle vexing problems such as how to generate personalized-sounding speech. In July and August, roughly 50 international experts came to the Homewood campus to push science forward in several areas, including the generation of emotional speech.
“The summer workshops have led to a number of significant breakthroughs in our field,” says Sanjeev Khudanpur, an associate professor of electrical and computer engineering and a member of the Whiting School’s Center for Language and Speech Processing. The workshops—funded by federal agencies, companies such as Google, and academic institutions—draw renowned researchers, postdocs, graduate students, and carefully selected undergrads. The result: an intense scientific boot camp that produces a year’s worth of work in six weeks.
The quest to build a “human speaking box” dates to the 1700s. But it wasn’t until 1939 that an actual speech synthesizer (as opposed to a speech recording) was exhibited at the New York World’s Fair. By the 1990s, hand-held electronic devices could generate speech, but voices remained robotic and stilted. “If we want to have robots that can interact with us, we need speech synthesizers that speak with emotion, with personality,” says Alan W. Black, of Carnegie Mellon University, in his own lilting Scottish accent.
During the summer workshop, Black’s 10-member team was tasked with generating emotive computer-speak.
Khudanpur describes the need: “If I’m speeding along the highway and my monotone GPS says, ‘Slow down, please. The back door is open,’ I really want it to say, ‘SLOW DOWN! The BACK door is OPEN!’” Or: “If I have a robot out on a battlefield, maybe I want it to be able to say, ‘DROP YOUR GUN. NOW!’”
To date, scientists have focused on standard techniques based on the voice spectrum (sound vibrations) to generate speech. Black’s team turned instead to advances in articulatory features, modeling how words are formed with the lips, tongue, and mouth, to create more emotive speech. “We’ve come up with a new way to generate speech using these articulatory features,” Black says.
The advances, he says, could lead to computer software that can accomplish tasks such as taking a man’s voice, putting it into a synthesizer, and having it generate a female voice, or taking an individual’s voice and having the computer translate it into another language, such as German. “It might still have a slight English accent,” he says. “But it would sound like you.”
This, Black says, has huge potential for applications worldwide. For example, if a female American doctor were treating a female patient in, say, Afghanistan, it would be preferable for the speech synthesizer (with translation) to generate a female voice, equipped with emotion and intonation. “You may want compassion; otherwise, you’ll get a disconnect.”
How close are we? “We’re just scratching the surface,” Black says. He returns to the example of Ebert, and of others such as physicist Stephen Hawking, who also uses a speech synthesizer. Ongoing research should pave the way for computer scientists to build individual voice generators without having to rely on hours of a person’s own recorded voice.
If a speech synthesizer were to sum that up, it might exclaim: “WOW!”