Whiting School of Engineering 1996 Annual Report

Center for Language and Speech Processing
Workshop Draws Internationally Known Experts
A Collector of Words
Center Facts

Workshop Draws Internationally Known Experts
For six weeks during July and August 1995, the Center for Language and Speech Processing (CLSP) was home to 24 scientists who attended LM95, a language modeling research workshop. A language model is the component of an automatic speech recognizer (transcriber) that, given what has been said (or hypothesized) so far, predicts what is likely to be said next. The workshop was the third in a series sponsored by the federal government and the first hosted by CLSP. The participants represented academia, industry, and the U.S. government and included scholars from France, Germany, and Spain. Frederick Jelinek, director of the Center, chaired the workshop. Eric Brill, assistant professor of computer science, and William Byrne, associate research scientist in CLSP, also represented the Whiting School as participants.
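To make the definition of a language model concrete, here is a minimal sketch of the simplest kind: a bigram model that predicts the next word from the single preceding word. It is an illustration only, not a system used at LM95. The function names and the toy training sentences are assumptions invented for this example.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count, for each word, which words follow it in the training text."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for previous, current in zip(words, words[1:]):
            followers[previous][current] += 1
    return followers


def predict_next(followers, previous_word, top_n=3):
    """Return up to top_n likely next words with relative-frequency estimates."""
    counts = followers.get(previous_word.lower())
    if not counts:
        return []  # the word was never seen as a context in training
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common(top_n)]


# Toy training text, invented for illustration.
TOY_CORPUS = [
    "please call me tomorrow",
    "please call the office",
    "call me later",
]

model = train_bigram_model(TOY_CORPUS)
print(predict_next(model, "call"))
```

In this toy corpus "me" follows "call" twice and "the" follows it once, so the sketch ranks "me" first. A recognizer would use scores like these to choose among acoustically similar candidate words.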

Participants at LM95 were divided into four teams, each with an assigned project leader and research goal:
Spanish: explore language modeling techniques for the recognition of unrestricted, conversational Spanish over telephone channels.
Language Modeling for Spontaneous Speech: systematically study the baseline system and the associated error analysis.
Fast Training and Portability: make better use of small data sets.
Phrase Structure Language Models: improve the language model by incorporating linguistic structure.

In addition, two special weeks were organized with dedicated topics and invited experts in the fields of linguistics and information theory. Lastly, eight guest participants gave presentations at various times during the workshop.

A Collector of Words
David Yarowsky, assistant professor of computer science and member of the Center for Language and Speech Processing (CLSP), needs words. Lots of them. Three to five billion, in fact. Working with Frederick Jelinek, CLSP director; Eric Brill, also an assistant professor of computer science; and Sanjeev Khudanpur, a CLSP research scientist, Yarowsky will use the tremendous word bank he is developing, together with sophisticated programming, to improve the performance of language modeling systems.

In the past, researchers have used limited models of simple word sequences to guide systems such as speech recognizers. The CLSP team is now developing methods that use far richer contextual clues, including long-distance word dependencies, syntactic structure, and topic analysis, to predict which words a person has spoken. This research is driven by the word association patterns observed in the huge text database. Yarowsky is also investigating ways in which the semantic classification of words in context can be learned automatically from these patterns.

The text database currently holds about 600 million words. Yarowsky isn’t terribly restrictive about the sources of information that pour into it, but he does seek a balance of styles. “Can you imagine trying to predict the flow of a natural conversation if all you were exposed to were scientific journal articles?” Yarowsky asks. The word bank contains newswire reports, conversational speech transcriptions, email messages, text from the Internet, and scientific papers. Yarowsky hopes to make an annotated version of the collection available to researchers and others via the World Wide Web.

Center Facts
Established 1992

Phone 410-516-4237

Email clsp@jhu.edu

WWW http://www.clsp.jhu.edu/

Affiliated Researchers
Frederick Jelinek, Director
Paul Smolensky, Assistant Director
Biomedical Engineering
John Heinz, Murray B. Sachs, Eric D. Young
Cognitive Science
Michael R. Brent, Luigi Burzio, Robert Frank, Paul Smolensky
Computer Science
Eric Brill, Steven L. Salzberg, David Yarowsky
Electrical and Computer Engineering
Andreas G. Andreou, William J. Byrne, Gert Cauwenberghs, Frederick Jelinek, Sanjeev Khudanpur
Mathematical Sciences
Lenore J. Cowen, Carey E. Priebe, Colin O. Wu
Psychology
Peter Jusczyk

Research Areas
Analog and Digital VLSI
Computational Foundations of Grammatical Theory
Corpus-Based Natural Language Processing
Information Theory
Language Acquisition and Computational Psycholinguistics
Language Modeling
Machine Learning of Natural Language
Neural Auditory Processing
Pronunciation Modeling
Semantic Analysis and Classification of Text
Speech Signal Processing
Statistical and Combinatorial Data Clustering