
Minimizing Manual Annotation Cost in Supervised Training from Corpora


Author(s) : Sean P. Engelson, Ido Dagan
Publisher : N/A
Publication Date : 1996
ISSN : N/A
Abstract : Corpus-based methods for natural language processing often use supervised training, requiring expensive manual annotation of training corpora. This paper investigates methods for reducing annotation cost by sample selection. In this approach, during training the learning program examines many unlabeled examples and selects for labeling (annotation) only those that are most informative at each stage. This avoids redundantly annotating examples that contribute little new information. This paper extends our previous work on committee-based sample selection for probabilistic classifiers. We describe a family of methods for committee-based sample selection, and report experimental results for the task of stochastic part-of-speech tagging. We find that all variants achieve a significant reduction in annotation cost, though their computational efficiency differs. In particular, the simplest method, which has no parameters to tune, gives excellent results. We also show that sample selection yields a significant reduction in the size of the model used by the tagger.