Classification of news stories using support vector machines


Author(s) : Robert Cooley
Publisher : N/A
Publication Date : 1999
ISSN : N/A
Abstract : Given a data set and a data mining task such as classification, there are two main reasons for performing feature space reduction. The first is to improve the accuracy of the algorithm. In a domain such as text mining, the common technique of parameterizing each document as a vector of words results in thousands of dimensions. The performance of many learning algorithms decreases as the dimensionality of the input space increases. Support Vector Machines (SVMs) [Vap95], which are based on Vapnik's statistical learning theory, can be used as a classification technique and have been shown by Joachims [Joa98] to be reasonably immune to the high dimensionality of text feature spaces. The second reason for performing feature space reduction is to decrease the overall size of the data set in order to conserve storage space and minimize the amount of time required to handle the data and run the mining algorithms. Even with SVMs, very large data sets may warrant feature space reduction because of this second class of problems. This paper describes the results of an experiment to train SVMs to classify print, television, and radio news sources. Tests were performed to compare full text versus feature space reduction using a natural language processing technique and reduction using information,