Comparative Study for Text Document Classification Using Different Machine Learning Algorithms

  • Yin Min Tun, University of Computer Studies, Mandalay (UCSM), Mandalay 05013, Myanmar
  • Phyu Hnin Myint, University of Computer Studies, Mandalay (UCSM), Mandalay 05013, Myanmar
Keywords: Classification, Text Mining, Classification Methods, Enron Email Dataset.

Abstract

Classification is a supervised learning task whose goal is to assign labels to unknown objects. In practice, labeling unknown documents by hand requires a large amount of tedious manual work. A classification system is instead first trained on labeled documents using a supervised machine learning algorithm, and the trained model is then applied to predict the labels of unknown documents. The text document classification framework consists of four stages: input text document, pre-processing, feature extraction, and classification. This paper analyzes four common classification methods for text document classification: Naïve Bayes, Decision Tree, Support Vector Machine, and K-Nearest Neighbors. Its main focus is a comparative study of these existing classification methods. The experiments apply the methods to the Enron Email Dataset and measure classification accuracy together with true positives, true negatives, false positives, and false negatives to compare their performance.
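As an illustration of the four-stage framework and the evaluation measures named in the abstract, the following is a minimal sketch in plain Python. It is not the authors' implementation: it covers only one of the four methods (a multinomial Naïve Bayes with add-one smoothing), it uses a tiny made-up corpus rather than the Enron Email Dataset, and the stop-word list, the "spam"/"ham" labels, and every function name (`preprocess`, `train_naive_bayes`, `predict`, `confusion_counts`) are hypothetical choices made for this example.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list (an assumption, not taken from the paper).
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "for", "at", "on"}


def preprocess(text):
    """Pre-processing: lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def train_naive_bayes(docs, labels):
    """Feature extraction (bag-of-words counts per class) plus training."""
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for text, label in zip(docs, labels):
        word_counts[label].update(preprocess(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    log_priors = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    return log_priors, word_counts, vocab


def predict(model, text):
    """Classification: pick the class with the highest log-posterior score."""
    log_priors, word_counts, vocab = model
    scores = {}
    for c, prior in log_priors.items():
        total = sum(word_counts[c].values())
        score = prior
        for tok in preprocess(text):
            if tok in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((word_counts[c][tok] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)


def confusion_counts(y_true, y_pred, positive):
    """Accuracy and TP/TN/FP/FN for a binary task with one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": (tp + tn) / len(pairs)}


# Toy labeled documents (made up for this sketch).
train_docs = [
    "Win a free prize now",
    "Free money offer, claim now",
    "Meeting agenda for Monday",
    "Please review the project report",
]
train_labels = ["spam", "spam", "ham", "ham"]
test_docs = ["Claim your free prize", "Project meeting on Monday"]
test_labels = ["spam", "ham"]

model = train_naive_bayes(train_docs, train_labels)
y_pred = [predict(model, doc) for doc in test_docs]
result = confusion_counts(test_labels, y_pred, positive="spam")
print(y_pred, result)
```

The same `confusion_counts` function could be reused unchanged to compare other classifiers, since it depends only on the true and predicted labels.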

Published
2019-04-19
How to Cite
Min Tun, Y., & Hnin Myint, P. (2019). Comparative Study for Text Document Classification Using Different Machine Learning Algorithms. International Journal of Computer (IJC), 33(1), 19-25. Retrieved from https://ijcjournal.org/index.php/InternationalJournalOfComputer/article/view/1388