Comparative Study for Text Document Classification Using Different Machine Learning Algorithms

Yin Min Tun; Phyu Hnin Myint

Authors

Yin Min Tun, University of Computer Studies, Mandalay (UCSM), Mandalay 05013, Myanmar
Phyu Hnin Myint, University of Computer Studies, Mandalay (UCSM), Mandalay 05013, Myanmar

Keywords:

Classification, Text Mining, Classification Methods, Enron Email Dataset.

Abstract

Classification is a supervised learning method whose goal is to predict the label of a previously unseen object. In practice, labeling unknown documents by hand requires a tedious amount of manual work. A classification system is therefore first trained on labeled documents using a supervised machine learning algorithm, and the trained model is then applied to predict the labels of unknown documents. The framework for text document classification consists of four stages: text document input, pre-processing, feature extraction, and classification. This paper analyzes four common classification methods for text document classification: Naïve Bayes, Decision Tree, Support Vector Machine, and K-nearest neighbors. Its main focus is a comparative study of these existing classification methods. The experiments apply each method to the Enron Email Dataset and measure classification accuracy together with true positives, true negatives, false positives, and false negatives to compare their performance.
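The four stages and the evaluation measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library and a hand-written multinomial Naïve Bayes with add-one smoothing as one example of the four classifiers. The function names, the "spam"/"ham" labels, and the toy messages are all hypothetical and are not taken from the Enron Email Dataset.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Pre-processing: lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(docs, labels):
    """Feature extraction (bag-of-words counts) plus model fitting."""
    class_counts = Counter(labels)          # number of documents per class
    word_counts = defaultdict(Counter)      # word frequencies per class
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Classification: return the class with the highest log-posterior."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    tokens = [t for t in tokenize(text) if t in vocab]  # skip unseen words
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def confusion_counts(y_true, y_pred, positive):
    """Count TP, TN, FP, FN for one positive class and derive accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": (tp + tn) / len(pairs)}


# Hypothetical toy data; labels and messages are illustrative assumptions.
train_docs = [
    "win free money now",
    "free prize claim now",
    "meeting schedule for monday",
    "project report attached",
]
train_labels = ["spam", "spam", "ham", "ham"]
test_docs = ["claim free money", "monday project meeting"]
test_labels = ["spam", "ham"]

model = train_naive_bayes(train_docs, train_labels)
predictions = [predict(model, doc) for doc in test_docs]
print(confusion_counts(test_labels, predictions, positive="spam"))
```

Comparing several classifiers, as the paper does, would amount to swapping the training and prediction functions while keeping the same pre-processing and the same `confusion_counts` evaluation, so that every method is scored on identical inputs.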

