PARTS OF SPEECH
You may never have heard of Sanjeev Khudanpur, but if you own a phone, you’ve definitely met some of the women in his line of work. They’re universally perky and helpful (or at least they promise to be), and go by a series of All-American Anglo names: Mary, Linda, and perhaps this region’s best-known example, Julie … as in “Hi, I’m Julie from Amtrak. Where would you like to go?”
This is the data-intensive world of speech and text recognition software that Khudanpur plies, and “Julie” is just the latest example. “Actually, the earliest was a toy, Radio Rex, from the 1960s. It was a plastic dog, it sat in its little dog house, and if you said ‘REX!’ loudly enough, it jumped out of the house,” laughs Khudanpur, an associate professor of electrical and computer engineering at the Whiting School. Rex was, in his own electronic way, a little hard of hearing. “It wasn’t really recognizing ‘Rex.’ You could say ‘hex’ or ‘sex’—anything that had the ‘eh’ sound in it—and it would respond. If you said ‘baby’ or ‘dog’ it wouldn’t do anything.”
By comparison to Rex’s crude Texas Instruments chip, Julie may seem a veritable chatterbox, but to Khudanpur’s mind, there’s plenty of room for technological improvement in that she works in “a limited domain. If you start asking Julie whether she likes Phantom of the Opera, she has no opinion on it, she just wants to figure out where it is and whether you want to go there.”
Khudanpur’s work at the Whiting School’s Center for Language and Speech Processing takes him in the opposite direction, expanding the boundaries of speech and text recognition to create programs that more accurately translate languages. That involves delving into how people really speak to each other—so-called “conversational speech”—versus the way most of us interact with Julie. “We talk to each other very differently than we talk to a computer,” says Khudanpur, who has researched the nuances. “People tend to be more cooperative when they’re talking to a computer, more measured in their speech.”
By contrast, Khudanpur’s current speech recognition investigation “is focused on [quantifying] conversational speech, pronunciation variations, dialectical variability, accents, and so on.” He notes that people have the ability to discern sloppy pronunciations of words during conversation, to realize that, in a chat about their beloved pet, they may say something that sounds a lot like “diskette” but it’s understood by all in the room to mean “this cat.” But how do you get a computer to make the right call?
The answer, says Khudanpur, is twofold. One aspect involves sampling thousands of sounds used in speech, looking for the common patterns that can be programmed into a model (think of a computer that could quickly recognize a thick Boston or Baltimore accent, and adjust its translations accordingly). The other involves creating mathematical formulas that draw on millions of complete sentences, from which a computer can deduce the most likely reading of a given word. In the example above, Khudanpur says the computer would know “diskette” really means “this cat” if it recognized the context of the adjacent words, as in “The black fur on (‘diskette’) really makes her green eyes stand out.”
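The second idea, letting the neighboring words settle an ambiguous sound, can be illustrated with a toy sketch. This is not Khudanpur's actual model. The six-sentence corpus, the function names, and the add-one smoothing are all assumptions made for illustration, and a real recognizer would be trained on millions of sentences.

```python
import math
from collections import Counter

# Hypothetical miniature corpus (an assumption for illustration only).
CORPUS = [
    "the black fur on this cat is soft",
    "this cat has green eyes",
    "the fur on this cat sheds everywhere",
    "save the file to a diskette",
    "the diskette holds the backup file",
    "insert the diskette into the drive",
]


def context_counts(phrase):
    """Count the words that share a corpus sentence with `phrase`."""
    counts = Counter()
    for sentence in CORPUS:
        if phrase in sentence:
            # Drop the phrase itself so only its neighbors are counted.
            counts.update(sentence.replace(phrase, " ").split())
    return counts


def score(phrase, context_words, vocab_size):
    """Sum of smoothed log-probabilities of the context words given `phrase`."""
    counts = context_counts(phrase)
    total = sum(counts.values())
    return sum(
        math.log((counts[word] + 1) / (total + vocab_size))  # add-one smoothing
        for word in context_words
    )


def pick(candidates, context):
    """Return the candidate phrase that best fits the surrounding words."""
    vocab = {word for sentence in CORPUS for word in sentence.split()}
    words = context.split()
    return max(candidates, key=lambda c: score(c, words, len(vocab)))


heard = ["diskette", "this cat"]
print(pick(heard, "the black fur on really makes her green eyes stand out"))
print(pick(heard, "save the backup file to"))
```

Given the pet-related context, the sketch prefers “this cat” because words such as “fur” and “eyes” appeared alongside that phrase in the toy corpus. Given the file-related context, it prefers “diskette.”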
Khudanpur says the applications of such work are numerous. On a national security front, software can monitor international calls at a scale far beyond what human listeners could manage, looking for certain words and phrases that could be tip-offs of planned attacks. On a far more benign front, Khudanpur imagines a chip that could search TV not by programming but by spoken content, referring viewers to news and talk shows that offer the most references to a given subject.
For all his accomplishments, Khudanpur says his most interesting work still lies ahead of him—the world of text-to-text translation. As anyone familiar with the phrase “lost in translation” has learned, using a computer to move from one language to another is problematic at best. Khudanpur may be on the way to solving that contextual gap. By pouring thousands of parallel sentences in multiple languages into a database (think of numerous UN translators simultaneously translating the same speech), Khudanpur is creating language matrices where a mathematical process of elimination narrows down a word in one language to its counterpart in another.
“It’s like having millions of necklaces, all with many different colored beads, and I asked you to pick out a subset of necklaces that had one bead of your favorite color,” he says. “If you gave me that subpile, I’d look for the color they all had in common … and that would be your favorite.”
In any language.
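The necklace analogy can be sketched in a few lines of code. This is a deliberately simplified illustration, not the method used at the Center for Language and Speech Processing. The English-French sentence pairs and the function name are hypothetical, and practical systems rely on probabilistic models rather than the strict set intersection shown here.

```python
# Hypothetical parallel sentences (English, French), assumed for illustration.
PARALLEL = [
    ("the cat sleeps", "le chat dort"),
    ("the cat eats", "le chat mange"),
    ("the dog sleeps", "le chien dort"),
    ("a cat runs", "un chat court"),
]


def translation_candidates(source_word, pairs):
    """Narrow down the counterpart of `source_word` by elimination."""
    # The "subpile": target sentences whose source side contains the word.
    with_word = [set(tgt.split()) for src, tgt in pairs
                 if source_word in src.split()]
    # Every other target sentence, used to rule out unrelated common words.
    without_word = [set(tgt.split()) for src, tgt in pairs
                    if source_word not in src.split()]
    if not with_word:
        return set()
    # Keep only the "beads" that every necklace in the subpile shares...
    common = set.intersection(*with_word)
    # ...then discard any that also turn up where the source word is absent.
    return common - set().union(*without_word)


print(translation_candidates("cat", PARALLEL))
print(translation_candidates("sleeps", PARALLEL))
```

With these four pairs, “cat” narrows to “chat” and “sleeps” to “dort.” A word such as “le” is eliminated for “sleeps” because it also appears in sentences that do not contain that word.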