Learning a monolingual language model from a multilingual text database


Author(s) : Rosie Jones, Rayid Ghani
Publisher : N/A
Publication Date : 2000
ISSN : N/A
Abstract : Language models are important in speech recognition, document classification, and database selection algorithms. Traditionally, language models are learned from corpora acquired specifically for that purpose. Increasingly, however, there is interest in constructing language models for specific languages from heterogeneous sources such as the web. Query-based sampling has been shown to be effective for gauging the content of monolingual heterogeneous databases. We propose and evaluate an extension of this approach to the case of learning a monolingual language model from a multilingual database, together with extensions to the query-based sampling algorithm to handle this case. We test our approach on a corpus collected from the WWW and show that the proposed methods perform accurately and efficiently for learning a language model of Tagalog, even when Tagalog documents make up only 2.5% of the documents in the collection.