|
Abstract : |
The segmentation of a text into sentences is a necessary prerequisite for many natural language processing tasks, including part-of-speech tagging and sentence alignment. This is a non-trivial task, however, since end-of-sentence punctuation marks are ambiguous. A period, for example, can denote a decimal point, an abbreviation, the end of a sentence, or even an abbreviation at the end of a sentence. To disambiguate punctuation marks, most systems use brittle, special-purpose regular expression grammars and exception rules. Such approaches are usually limited to the text genre for which they were developed and cannot be easily adapted to new text types or to other natural languages. As an alternative, I present an efficient, trainable algorithm that can be easily adapted to new text genres and to a range of natural languages. The algorithm uses a lexicon with part-of-speech probabilities and a feedforward neural network for rapid training. The method described requires, |