Statistical machine translation (SMT) systems rely on sentence-aligned parallel (bilingual) text to extract phrase-level translation pairs and to estimate the feature scores associated with those pairs. In this talk, I will present research on estimating the parameters of a phrase-based SMT system from monolingual corpora, which are available in larger quantities and in more languages than parallel text. We show that more than 80% of the performance drop that results from removing bilingually estimated phrase pair features can be recovered by using features estimated over monolingual corpora. I will also present results on identifying the phrase translation pairs themselves from monolingual text.
Program in Linguistics and Cognitive Science
Events are free and open to the public unless otherwise noted.