Comparison of feature selection methods in Kurdish text classification
Abstract
This study investigates the impact of feature selection (FS) on classifier performance for text classification (TC) in Kurdish. The high dimensionality of the feature space can degrade TC accuracy, so FS is used to reduce the feature space and improve accuracy. The study evaluates seven FS methods on two Kurdish datasets (KDC-4007 and KNDH): discriminative feature selection (DFSS), chi-squared (CHI2), discriminative power measure (DPM), Gini index, distinguishing feature selector (DFS), comprehensively measure feature selection (CMFS), and correlation coefficient (CC). Multinomial naive Bayes (MNB) and support vector machine (SVM) classifiers are used to measure the accuracy and F-measure obtained with each FS method. The experiments test nine feature-subset sizes (50, 100, 250, 500, 750, 1000, 2000, 3000, and 4000). The results show that CHI2 and DPM yield the best F-measure and accuracy scores for SVM, while CHI2 and CMFS are best for MNB. Most FS methods have been applied only to English texts, with little or no investigation of the Kurdish language. This study therefore fills a gap in the literature by evaluating the effectiveness of various FS methods for Kurdish TC.
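To make the selection step concrete, the following is a minimal sketch of chi-squared (CHI2) feature ranking followed by top-k selection. It is not the authors' implementation: the abstract does not give the scoring formula, so the sketch assumes the standard chi-squared statistic on the 2x2 term/class document-count table and assumes the per-term score is the maximum over classes. The function names `chi2_scores` and `select_top_k` are hypothetical, and the toy documents use placeholder English tokens rather than data from KDC-4007 or KNDH.

```python
from collections import Counter


def chi2_scores(docs, labels):
    """Score each term by its maximum chi-squared value over all classes.

    docs: list of token lists; labels: parallel list of class names.
    Assumption: document-level presence/absence counts, max-over-classes.
    """
    n = len(docs)
    class_count = Counter(labels)
    df = Counter()        # number of documents containing the term
    df_class = Counter()  # (term, class) -> documents of that class containing the term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_class[(term, label)] += 1

    scores = {}
    for term, n_t in df.items():
        best = 0.0
        for cls, n_c in class_count.items():
            a = df_class[(term, cls)]  # term present, document in cls
            b = n_t - a                # term present, document in another class
            c = n_c - a                # term absent, document in cls
            d = n - n_t - c            # term absent, document in another class
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy corpus with placeholder tokens (illustrative only).
toy_docs = [
    ["goal", "match", "team"],
    ["match", "team", "win"],
    ["vote", "party", "team"],
    ["vote", "election", "party"],
]
toy_labels = ["sport", "sport", "politics", "politics"]

toy_scores = chi2_scores(toy_docs, toy_labels)
top_terms = select_top_k(toy_scores, 3)
```

In an experiment like the one described, the selected terms would define the reduced vocabulary passed to the MNB or SVM classifier, and the selection would be repeated for each subset size k.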