|
Abstract: |
The automated categorization (or classification) of texts into pre-specified categories, although dating back to the early '60s, has attracted booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on the application of machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of previously classified documents, the characteristics of one or more categories. The advantages of this approach over the knowledge engineering approach (in which domain experts define a classifier manually) are very good effectiveness, considerable savings in terms of expert manpower, and straightforward portability to different domains. In this survey we look at the main approaches that have been taken towards automatic text categorization within the machine learning paradigm. We discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation. Categories and Subject Descriptors: H.3.1 [Information storage and retrieval]: Content analysis and indexing - Indexing methods; H.3.3 [Information storage and retrieval]: Information search and retrieval - Information filtering; H.3.4 [Information storage and retrieval]: Systems and software - Performance evaluation (efficiency and effectiveness); I.2.3 [Artificial, |