Named entity extraction from speech: Approach and results using the TextPro system


Author(s) : David Martin
Publisher : N/A
Publication Date : 1999
ISSN : N/A
Abstract : This paper describes the application of the TextPro system to the task of recognizing named entities in speech. TextPro is a lightweight engine for interpreting cascaded finite-state transducers. Although TextPro was originally intended for processing text, the experience of this evaluation demonstrates that the system can easily be adapted to process transcripts generated by a speech recognizer as well.

1. THE TEXTPRO EXTRACTION SYSTEM

For its participation in the Hub4 named-entity identification task, SRI International employed a newly developed information extraction system called TextPro. TextPro is a lightweight interpreter of cascaded finite-state transducers that is based on the TIPSTER Document Manager architecture [Grishman et al., 1996] and the TIPSTER Common Pattern Specification Language (CPSL) [1]. TextPro finite-state transducers accept and produce sequences of annotations on the document conforming to the structure specified by the TIPSTER Document Manager architecture. The transducers themselves are expressed by finite-state rules written in CPSL. The grammars employed by the Hub4 name recognizer specified the creation of ENAMEX, NUMEX, and TIMEX annotations, as well as other annotations used internally by the system. After each of the cascaded transducers had been run over an input text, a postprocessor inserted SGML markup as required by the rules of the named-entity task.

TextPro was originally developed to process text documents, and to test alternative specifications for CPSL. The first author participated in the design committee for CPSL under the TIPSTER program. The program runs on PowerPC Macintosh computers, and is freely downloadable from the World Wide Web. [2] Although originally developed for limited objectives, experience led us to conclude that TextPro was a very useful

[1] Because of the premature end of the TIPSTER program, the specifications for the Common Pattern Specification Language were never finalized or published. Further information is obtainable from the authors.
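
The cascade described in Section 1 (transducers that each accept and produce annotations over the document, followed by a postprocessor that inserts SGML markup) can be illustrated with a minimal sketch. This is not TextPro and not CPSL, whose specification was never published. The stage functions, the `Annotation` class, and the toy lexicons below are all hypothetical, chosen only to show the general shape of such a pipeline on a lowercase, transcript-like input.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    start: int         # character offset, inclusive
    end: int           # character offset, exclusive
    type: str          # e.g. "token", "ENAMEX", "NUMEX"
    attrs: tuple = ()  # e.g. (("TYPE", "PERSON"),)

# Hypothetical toy lexicons; a real system would use far richer resources.
TITLES = {"mr", "mrs", "president", "senator"}
NUMBER_WORDS = {"one", "two", "three", "ten", "hundred", "thousand", "million"}
UNITS = {"dollars": "MONEY", "percent": "PERCENT"}

def _tokens(annotations):
    """Return the token annotations in document order."""
    return sorted((a for a in annotations if a.type == "token"),
                  key=lambda a: a.start)

def tokenize(text, annotations):
    """Stage 1: add one 'token' annotation per whitespace-delimited word."""
    found = [Annotation(m.start(), m.end(), "token")
             for m in re.finditer(r"\S+", text)]
    return annotations + found

def tag_names(text, annotations):
    """Stage 2: mark the token that follows a title word as a PERSON name."""
    tokens = _tokens(annotations)
    out = list(annotations)
    for prev, cur in zip(tokens, tokens[1:]):
        if text[prev.start:prev.end].lower() in TITLES:
            out.append(Annotation(cur.start, cur.end, "ENAMEX",
                                  (("TYPE", "PERSON"),)))
    return out

def tag_numbers(text, annotations):
    """Stage 3: mark a run of number words followed by a unit word."""
    tokens = _tokens(annotations)
    words = [text[t.start:t.end].lower() for t in tokens]
    out = list(annotations)
    i = 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and words[j] in NUMBER_WORDS:
            j += 1
        if i < j < len(tokens) and words[j] in UNITS:
            out.append(Annotation(tokens[i].start, tokens[j].end, "NUMEX",
                                  (("TYPE", UNITS[words[j]]),)))
            i = j + 1
        else:
            i = max(j, i + 1)
    return out

def to_sgml(text, annotations):
    """Postprocessor: insert SGML tags for entity annotations.

    Works right to left so earlier offsets stay valid; assumes the
    entity annotations do not overlap.
    """
    entities = sorted((a for a in annotations
                       if a.type in {"ENAMEX", "NUMEX", "TIMEX"}),
                      key=lambda a: a.start, reverse=True)
    for a in entities:
        attrs = "".join(f' {k}="{v}"' for k, v in a.attrs)
        text = (f"{text[:a.start]}<{a.type}{attrs}>"
                f"{text[a.start:a.end]}</{a.type}>{text[a.end:]}")
    return text

def run_cascade(text, stages=(tokenize, tag_names, tag_numbers)):
    """Run each stage over the annotations produced by the previous one."""
    annotations = []
    for stage in stages:
        annotations = stage(text, annotations)
    return to_sgml(text, annotations)

print(run_cascade("president clinton proposed two hundred million dollars"))
```

Each stage reads the annotations left by earlier stages and returns an extended list, so new stages can be added to the cascade without changing the others. The source text is never modified until the final markup step.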