Exploiting Class Label Frequencies for Text Classification
Document classification is a Natural Language Processing (NLP) task commonly addressed with Machine Learning (ML). Classifying text means assigning one or more classes or categories to a document, which makes it easier to manage and sort. In the vast majority of document classification techniques, a document is represented as a bag of words: the individual terms that make up the document, together with the number of times each term appears in it. These counts are known as local term frequencies, and it is very common to use them at the price of some added information in the classification model. In this work, we extend our previous work on medical article classification [1,2] by simplifying the weighting scheme used in the ranking process: class label frequencies are used to devise a simple weighting formula inspired by traditional information retrieval tasks. We also evaluate the proposed approach on additional experimental data. The proposed method, called CLF KNN, first uses a lexical approach to identify frequent terms in the document texts, and then combines this information with the class label information in the corpus to devise a weighted ranking scheme for the classification decision. Evaluation experiments on two collections, the Ohsumed collection of medical documents and the 20 Newsgroups message collection, show that the proposed method significantly outperforms traditional KNN classification.
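As a rough illustration of the bag-of-words representation and of the traditional KNN baseline mentioned above, the following minimal Python sketch may help. It is not the CLF KNN method: the abstract does not state the weighting formula, so none is reproduced here. The function names, the cosine similarity measure, the tokenizer, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter
import math
import re


def bag_of_words(text):
    """Map a document to its local term frequencies (term -> count)."""
    # Assumed tokenizer: lowercase runs of letters and digits.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query, training, k=3):
    """Majority vote over the k training documents most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(
        training,
        key=lambda pair: cosine(q, bag_of_words(pair[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy corpus, invented for illustration only.
training = [
    ("heart attack risk and blood pressure", "cardiology"),
    ("blood pressure medication lowers heart risk", "cardiology"),
    ("skin rash treated with topical cream", "dermatology"),
    ("topical cream for dry skin", "dermatology"),
]

tf = bag_of_words("blood pressure and heart pressure")
predicted = knn_classify("high blood pressure and heart disease", training, k=3)
```

Here `tf` holds the local term frequencies of a single document, and `predicted` is the label chosen by an unweighted majority vote among the nearest neighbours. A class-label-frequency scheme would replace that unweighted vote with a weighted ranking.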
K. Fragos, C. Skourlas. “Smoothing Class Frequencies for KNN Medical Article Classification,” in Proc of 20th Pan-Hellenic Conference on Informatics. PCI 16, 2016, Article No. 79.
K. Fragos, C. Skourlas. “Ranking tokens with class label frequencies for medical article classification,” in Proc of 19th Panhellenic Conference on Informatics, 2015 pp. 359-360.
A. Wulamu et al. “A Robust Text Classifier Based on Denoising Deep Neural Network,” Analysis of Big Data Scientific Programming Vol 33, 2017, Article ID 3610378, 10 pages: https://doi.org/10.1155/2017/3610378.
GH. A. Z. Mohammed and A. B. Can. “ROLEX-SP: Rules of lexical syntactic patterns for free text categorization,” journal of Knowledge-Based Systems, Elsevier vol 24, 58-65, 2011.
P. Semberecki and H. Maciejewski. “Deep learning methods for subject text classification of articles,” presented at Federated Conference on Computer Science and Information Systems (FedCSIS), Prague, Czech Republic, 3-6 Sept. 2017.
L. H. Lee, D. Isa, W. O. Choo, and W. Y. Chue, “High Relevance Keyword Extraction facility for Bayesian text classification on different domains of varying characteristic” Expert Systems with Applications, vol. 39, no. 1, pp. 1147–1155, 2012.
K. Fragos, C. Skourlas. “Towards Improving Classification of Real World Biomedical Articles.” Presented at 18th Int. Conf. of Panhellenic Conference Informatics, Harokopio University of Athens, 2nd - 4th October, 2014.
S. Jaffali and S. Jamoussi. “Principal Component Analysis neural network for textual document categorization and dimension Reduction,” presented at 6th International Conference on Sciences of Electronics, Technologies of Information and Telecom, IEEE Xplore, 2012.
D. W. Aha, D. Kibler and M. K. Albert. “Instance-Based Learning Algorithms,” Machine Learning, vol. 6, pp. 37-66, 1991.
Q. Zhang, et al. “Machine Learning Methods for Medical Text Categorization,” in Proc. Pacific-Asia Conference on Circuits, Communications and Systems, 2009, pp. 494 – 497.
H. Parvin, H. Alizadeh and B. Minaei-Bidgoli. “Modification on K-Nearest Neighbor Classifier,” Global Journal of Computer Science and Technology, vol.10, pp. 37-41, Nov. 2010.
H. Parvin, H. Alizadeh and B. Minaei-Bidgoli. “MKNN: Modified K-Nearest Neighbor,” presented at the World Congress on Engineering and Computer Science 2008 WCECS, October 22 - 24, San Francisco, USA, 2008.
S. Belongie, J. Malik, and J. Puzicha. “Shape matching and object recognition using shape contexts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24 (4), pp. 509–522, 2002.
P. Y. Simard, Y. LeCun, and J. Decker. “Efficient pattern recognition using a new transformation distance,” in S. Hanson, J. Cowan, and L. Giles, editors, Advances in Neural Information Processing Systems 6, pp. 50–58, San Mateo, CA, Morgan Kaufman, 1993.
R. O. Duda, P. E. Hart and D. G. Stork. “Pattern Classification,” John Wiley & Sons, 2000.
A. Lambda, D. Kumar. “Survey on KNN and Its Variants,” International Journal of Advanced Research in Computer and Communication Engineering, vol. 5, Issue 5, May 2016.
G. Salton and C. Buckley. “Term-weighting approaches in automatic text retrieval,” Information Processing & Management, vol. 24(5), pp. 513-523, 1988.
A. Berger, et al. “Bridging the Lexical Chasm: Statistical Approaches to Answer Finding,” in Proc. Int. Conf. Research and Development in Information Retrieval, 2000, pp. 192-199.
K. Sparck Jones. “Document retrieval systems, A statistical interpretation of term specificity and its application in retrieval,” Taylor Graham Publishing, pp. 132–142, London, UK, 1988.
W. Hersh, C. Buckley, T. Leone and D. Hickman. “OHSUMED: An interactive retrieval evaluation and new large text collection for research,” in Proc. 17th ACM International Conference Research and Development in Information Retrieval 1994, pp. 192–201.
A. McCallum and K. Nigam. “A comparison of event models for naive bayes text classification,” presented at AAAI-98 Workshop on Learning for Text Categorization, 1998.
S. Eyheramendy and D. Lewis. “On the Naive Bayes model for text categorization,” in Proc. Ninth International Workshop on Artificial Intelligence & Statistics, Key West, FL, 2002.
J. D. M. Rennie, et al. “Tackling the poor assumptions of naive Bayes text classifiers,” presented at Twentieth International Conference on Machine Learning, Washington, DC, 2003.
R. E. Madsen, et al. “Modeling word burstiness using the Dirichlet distribution,” presented at 22nd International Conference on Machine Learning, Bonn, Germany, ACM Press, 2005.
A. K. McCallum. “Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering,” presented at Carnegie Mellon University, 1996.
L. Baoli, et al. “An Improved k-Nearest Neighbor Algorithm for Text Categorization,” presented at the 20th International Conference on Computer Processing of Oriental Languages, Shenyang. China. 2003.