Gained in Translation

Fall 2010

If you’ve ever used Google Translate, you know that you can instantly get a surprisingly serviceable translation between major languages such as English, Spanish, and French. That’s because the search engine giant has billions of dollars and a huge corpus of words to draw upon.

But for languages with fewer speakers—or merely less available written material—computer translations become dicey.

Ann Irvine, a third-year PhD student at the Center for Language and Speech Processing (CLSP), tackled the problem of these “low resource” languages using a combination of Wikipedia, computing power, Internet connectivity, and human brainpower. “The goal of the work was to put humans in the loop,” Irvine says.

One big problem that Irvine and her colleagues face is getting good machine translations for “low resource” languages. For instance, although 260 million people speak Bengali, and millions of those people also speak English, Bengali has relatively few resources available, especially electronic files of sentence-by-sentence translations that can be compared with the originals.

So Irvine and Alexandre Klementiev, a postdoctoral fellow at the CLSP, mined Wikipedia for articles in 42 foreign languages—everything from Maltese and Tamil to Albanian and Basque. Then they used a computer program to compare the foreign language articles to English Wikipedia articles on the same subject, and to draw up lists of word pairs that were likely good translations. Finally they sent the word pairs to Amazon’s Mechanical Turk (MTurk), an Internet “marketplace” that pays people to undertake tasks that can be accomplished over the Internet. Participants earned 10 cents for each group of 10 word pairs they checked. To ensure accuracy, three different workers checked each word pair.
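For readers curious how the human-checking step might look in practice, here is a minimal sketch in Python. The article says only that three workers checked each word pair and that workers earned 10 cents per group of 10 pairs; how the three judgments were combined is not stated, so the majority vote below is an assumption, and the function names and sample word pairs are hypothetical.

```python
from collections import Counter
from math import ceil


def majority_label(judgments):
    """Return the label chosen by a strict majority of workers, or None.

    Majority voting is an assumed aggregation rule, not the researchers'
    documented method.
    """
    label, count = Counter(judgments).most_common(1)[0]
    return label if count > len(judgments) / 2 else None


def aggregate(votes):
    """Map each (foreign_word, english_word) pair to its majority label."""
    return {pair: majority_label(j) for pair, j in votes.items()}


def cost_cents(num_pairs, pairs_per_group=10, cents_per_group=10,
               workers_per_pair=3):
    """Estimate worker payments in cents, using the rates quoted above."""
    groups = ceil(num_pairs / pairs_per_group)
    return groups * cents_per_group * workers_per_pair


# Hypothetical judgments: True means the worker accepted the translation.
votes = {
    ("foreign_word_a", "house"): [True, True, False],
    ("foreign_word_b", "river"): [False, False, True],
}

accepted = [pair for pair, ok in aggregate(votes).items() if ok]
print(accepted)
print(cost_cents(1000), "cents to check 1,000 candidate pairs")
```

Under these assumptions, a pair is kept only when at least two of its three checkers accept it, and the cost grows with the number of 10-pair groups multiplied by three workers.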

The method worked well for most of the languages. The computers drew up previously unknown (to computers) word pairs, which were confirmed as accurate by the human workers. However, some less common languages like Kazakh and Tigrinya found no human takers.

Eventually, relatively similar languages like French and English or Spanish and Italian will probably be translated by machines idiomatically and with near-perfect accuracy. But very good translations for extremely dissimilar languages, like Mandarin and English, might never happen, Irvine says.

“I think for dissimilar languages, the best we can hope for is to be able to translate basic language in a way that sounds like a human might have done it. For a random sentence out of a newspaper, I think we will be able to do that for any pair of languages eventually … [but] Chinese poetry to English? Forget it. That’s a hard problem,” she says.