Comparative Study for Text Document Classification Using Different Machine Learning Algorithms
Keywords: Classification, Text Mining, Classification Methods, Enron Email Dataset.
Classification is a supervised learning method whose goal is to predict the labels of unknown objects. In practice, labeling unknown documents by hand requires a tedious amount of manual work. A system is therefore first trained on labeled documents using a supervised machine learning algorithm, and the trained model is then applied to predict the labels of unknown documents. The framework for text document classification consists of four stages: input text document, pre-processing, feature extraction, and classification. Four common classification methods are analyzed for text document classification: Naïve Bayes, Decision Tree, Support Vector Machine, and K-nearest neighbors. The main focus of this paper is to present a comparative study of different existing classification methods for text document classification. The experiments apply these classification methods to the Enron Email Dataset and measure classification accuracy, true positives, true negatives, false positives, and false negatives to compare their performance.
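As an illustration of the framework and the evaluation measures described above, the following minimal sketch runs pre-processing, bag-of-words feature extraction, and a multinomial Naïve Bayes classifier with add-one smoothing, then counts true/false positives and negatives. It is not the paper's implementation: the toy messages, the "spam"/"ham" labels, and all function names are hypothetical, only one of the four methods is shown, and the Enron Email Dataset is not used.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Pre-processing: lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(docs, labels):
    """Feature extraction (bag-of-words counts) and Naive Bayes training."""
    class_docs = Counter(labels)          # number of documents per class
    word_counts = defaultdict(Counter)    # per-class word frequencies
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][tok] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def confusion_counts(y_true, y_pred, positive):
    """Count TP, TN, FP, FN for one positive class and derive accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn,
            "accuracy": (tp + tn) / len(pairs)}


# Hypothetical toy corpus (not taken from the Enron Email Dataset).
train_docs = ["win cash prize now", "claim your free prize",
              "meeting agenda for monday", "project schedule and budget review"]
train_labels = ["spam", "spam", "ham", "ham"]

test_docs = ["free cash prize", "monday budget meeting",
             "win the budget review", "claim your free schedule"]
test_labels = ["spam", "ham", "ham", "ham"]

model = train_naive_bayes(train_docs, train_labels)
predictions = [predict(model, doc) for doc in test_docs]
result = confusion_counts(test_labels, predictions, positive="spam")
print(predictions)
print(result)
```

Accuracy here is (TP + TN) / (TP + TN + FP + FN). Comparing several classifiers in this way would mean repeating the prediction and counting steps for each method on the same test split.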