Imbalanced datasets are frequently found in real-world applications, e.g., fraud detection and cancer diagnosis. For such datasets, improving the accuracy with which the minority class is identified is a critically important issue. Feature selection is one way to address it: an effective feature selection method chooses a subset of features that favors accurate identification of the minority class. A decision tree is a classifier that can be built with different splitting criteria, and it has the advantage that the feature used at each splitting node is easy to read off. It is therefore possible to use a decision tree splitting criterion as a feature selection method. This paper proposes an embedded feature selection method based on a weighted Gini index (WGI). The area under the receiver operating characteristic curve (ROC AUC) and the F-measure are used as evaluation criteria. Comparisons with the Chi2, F-statistic, and Gini index feature selection methods show that F-statistic and Chi2 reach the best performance when only a few features are selected. As the number of selected features increases, the proposed method has the highest probability of achieving the best performance. Experimental results on two datasets show that ROC AUC can be high even if only a few features are selected, and that it changes only slightly as more features are added. The F-measure, however, reaches excellent values only if 20% or more of the features are chosen. These results can help practitioners choose a suitable feature selection method for a practical problem.
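To illustrate how a decision tree splitting criterion can double as a feature selection score, the following minimal sketch ranks features by the largest decrease in the standard (unweighted) Gini index that a single threshold split on each feature can achieve. It is not the proposed WGI, whose definition is not given in this excerpt; the function names and the toy dataset (eight majority samples, two minority samples) are hypothetical and chosen only for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini index 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_gini_gain(values, labels):
    """Largest impurity decrease over all splits 'value <= t' on one feature."""
    n = len(labels)
    parent = gini_impurity(labels)
    best = 0.0
    # Every distinct value except the largest is a candidate threshold.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        child = (len(left) * gini_impurity(left)
                 + len(right) * gini_impurity(right)) / n
        best = max(best, parent - child)
    return best


def rank_features(rows, labels):
    """Return (feature indices sorted by decreasing gain, per-feature gains)."""
    n_features = len(rows[0])
    scores = [best_gini_gain([r[j] for r in rows], labels)
              for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Hypothetical imbalanced toy data: 8 majority (0) and 2 minority (1) samples.
# Feature 0 separates the classes; feature 1 carries no class information.
rows = [(0.1, 5), (0.2, 3), (0.3, 5), (0.4, 3), (0.5, 5),
        (0.6, 3), (0.7, 5), (0.8, 3), (0.9, 5), (1.0, 3)]
labels = [0] * 8 + [1] * 2

order, scores = rank_features(rows, labels)
print(order, scores)
```

Keeping the top-ranked indices in `order` gives a filter-style selection. An embedded method, as described in the abstract, would instead take the features actually chosen at the splitting nodes while the tree is being built.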