“Recognize Speech vs. Wreck a Nice Beach.” That phrase, part of a lecture title, sums up the enormous challenge of automatic speech recognition. To decipher the subtleties of language, computers must sift through accents, grammatical errors, ambient noise, and language oddities. Pair or pear? Sex or sects? Ta-may-toe or toe-mah-toe? Even more of a digital puzzle is programming a search engine to find an image of a juicy, ripe Big Boy if the image isn’t labeled “tomato.”
For 10 years, the Whiting School of Engineering’s Center for Language and Speech Processing (CLSP) Summer Workshop on Language Engineering has been pairing senior researchers from academia and industry with top students to address just those kinds of challenges. Their mission is to advance the field in the areas of speech recognition, translingual information detection and extraction, machine translation, speech synthesis, information retrieval, topic detection and tracking, text summarization, and question answering.
For the 2004 workshop’s 37 participants, this 10th anniversary milestone most likely passed with little fanfare. Instead, it was business as usual as they worked together in Schaffer Hall’s computer labs—often late into the night—to find cutting-edge solutions.
The workshop “has become the program for which the CLSP is famous,” says Frederick R. Jelinek, the Center’s director and the Julian Sinclair Smith Professor in the Department of Electrical and Computer Engineering. When he joined the Center in 1993, he saw an opportunity to bring its holistic approach to a six-week, intensive summer research experience. “My idea was to develop teams working together on a few projects,” Jelinek explains.
Other workshops existed at the time, but none brought researchers together in such teams. CLSP inaugurated the summer workshop with funding from the Department of Defense; the National Science Foundation added its support in 1998.
“Think of the workshop as the first feasibility studies for the field,” Jelinek says. “There are various things developed here that will be used by the field forever.” Contributions include data sets available to industry and academia via the CLSP web site (www.clsp.jhu.edu), numerous journal articles, conference presentations, and technological advances. One example is Gazelle, software for machine translation of natural languages created during the 1999 workshop in a project led by Kevin Knight, senior research scientist at the Information Sciences Institute at the University of Southern California. “The main goal was to build a generic tool that the whole research community could use,” says Knight. “Today, all kinds of research groups are using Gazelle in their translation projects. It’s become a staple in the field.”
The research community also places high value on the two-week pre-workshop training session that brings undergraduates up to speed in language engineering. “It is considered so good that the Association for Computational Linguistics sends 10 students,” says Jelinek.
To develop its workshop projects, CLSP sponsors an annual proposal process each fall, culminating in a two-day presentation session at Hopkins. After two to four topics are chosen, each team leader selects approximately seven team members: senior researchers, three graduate students (at least one from Hopkins), and two undergraduates. The undergraduate component, added in 1998, draws students chosen through a national search.
This past summer’s workshop featured three projects. One team focused on developing a general framework to model phonetic, lexical, and pronunciation variability in dialectal Chinese automatic speech recognition. The project leader was Richard W. Sproat, professor of linguistics and of electrical and computer engineering at the University of Illinois at Urbana-Champaign (UIUC). Sproat’s team explored techniques for improving recognition of accented speech by using a standard Chinese recognizer as a baseline system to study Shanghainese, the street dialect in China’s largest city.
Team member David W. Kirsch, a double major in computer science and cognitive science at Lehigh University, was one of the six undergraduates last summer. “This workshop was the first time since I started college that I really felt academically at home,” says Kirsch. “I was with the people I was supposed to be with and doing the work I was supposed to be doing.”
Kirsch, who now intends to pursue doctoral studies in computational linguistics and plans to apply to Hopkins, is continuing his workshop research. “Some algorithms had occurred to me during the last two weeks of the workshop,” he says. “We never got time to test them, so I presented my idea of finding a way to classify accents on a sliding scale.” The workshop is funding his year-long research through the $100,000 it awards competitively each year, distributed among two to four undergraduate and graduate student participants.
The second workshop project last summer examined landmark-based speech recognition. By bringing together new ideas in linguistics, especially nonlinear phonology, with recent advances in artificial intelligence, the team explored how better to match human speech recognition performance. Mark Hasegawa-Johnson, assistant professor of electrical and computer engineering at UIUC, led the team. The technological goal was to create a model that would identify acoustic phonetic landmarks (the minimal elements that one needs to insert into a sound to make it intelligible) and piece them together to recognize words.
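The two-stage pipeline described above, detect acoustic phonetic landmarks first, then piece them together into words, can be illustrated with a toy sketch. The landmark labels and the tiny word lexicon below are invented for illustration; a real system would derive both from acoustic models rather than hand-labeled frames:

```python
# Toy illustration of a landmark-based recognition pipeline:
# Step 1 detects landmarks in a (pretend) frame sequence;
# Step 2 pieces the landmark sequence together into a word.

def detect_landmarks(frames):
    """Collapse runs of identical frame labels into one landmark each."""
    landmarks = []
    for label in frames:
        if not landmarks or landmarks[-1] != label:
            landmarks.append(label)
    return landmarks

# Hypothetical lexicon mapping landmark sequences to words.
LEXICON = {
    ("stop-release", "vowel-onset", "nasal-closure"): "ban",
    ("fricative-onset", "vowel-onset"): "see",
}

def recognize(frames):
    """Look up the detected landmark sequence in the lexicon."""
    key = tuple(detect_landmarks(frames))
    return LEXICON.get(key, "<unknown>")

frames = ["stop-release", "stop-release", "vowel-onset",
          "vowel-onset", "vowel-onset", "nasal-closure"]
print(recognize(frames))  # -> ban
```

In a real landmark-based recognizer the detection step would score candidate landmarks probabilistically and the lexical matching would tolerate insertions and deletions; the sketch only shows the shape of the two stages.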
The third project, joint visual-text models, took computer recognition of words and speech to the next step: recognizing objects and images. Giri Iyengar of the IBM T.J. Watson Research Center led the team. “We’re making inroads into a problem that has defied us for decades,” observes Sanjeev Khudanpur, assistant professor of electrical and computer engineering and a member of CLSP (the “Recognize Speech vs. Wreck a Nice Beach” title was from his workshop lecture in 2001). Current search engines can only look at how an image is named. This technology would enable image-driven web searches. “While none of the algorithms is perfect, retrieval is much better,” says Khudanpur. “This is very far into the future, but the tool also could have national security implications by searching on names and faces.
“All research is gambling,” adds Khudanpur, who first participated in the workshop as a graduate student in 1995 and now lends his expertise as a senior adviser. “You put ideas in and see what comes up and hope for the big pay-off. The sure bet of the workshop is to bring good people together who will continue to collaborate. It creates a lot of understanding.”
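The joint visual-text idea behind the third project can be sketched in miniature: rank images by combining evidence from surrounding text with evidence from visual features, so that an unlabeled image of a tomato can still surface in a search. Everything below (the images, the tags, the weights, and the query-to-concept mapping) is invented for illustration:

```python
# Toy sketch of joint visual-text retrieval: each image carries caption
# words and (pretend) detected visual concepts; a query is scored against
# both sources of evidence and images are ranked by the combined score.

IMAGES = [
    {"file": "img1.jpg", "caption": {"garden", "harvest"}, "visual": {"red", "round"}},
    {"file": "img2.jpg", "caption": {"tomato", "salad"},   "visual": {"green", "leafy"}},
    {"file": "img3.jpg", "caption": set(),                 "visual": {"red", "round"}},
]

# Visual concepts a hypothetical detector associates with a query term.
QUERY_VISUAL = {"tomato": {"red", "round"}}

def score(image, term, w_text=0.6, w_visual=0.4):
    """Weighted combination of a text match and visual-concept overlap."""
    text_hit = 1.0 if term in image["caption"] else 0.0
    expected = QUERY_VISUAL.get(term, set())
    overlap = len(expected & image["visual"]) / len(expected) if expected else 0.0
    return w_text * text_hit + w_visual * overlap

ranked = sorted(IMAGES, key=lambda im: score(im, "tomato"), reverse=True)
print([im["file"] for im in ranked])  # img2.jpg (caption match) ranks first
```

The point of the sketch is the fallback: img3.jpg has no caption at all, yet its visual evidence still gives it a nonzero score, which a purely filename- or caption-based search engine could never do.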