A feature selection method based on partial least-squares discriminant analysis (PLS-DA) and the F-score is proposed and applied to the analysis of proteomic mass spectrometry data. The method consists of three main steps: (1) preprocessing the raw data with the LIMPIC algorithm; (2) calculating the F-score of each variable and ranking all variables in descending order of F-score; (3) selecting the optimal variable subset by forward selection, using PLS-DA with cross-validation. The method was applied to a colon cancer data set, and 8 characteristic mass-to-charge ratios were finally selected as potential biomarkers from the original 16331 mass-to-charge ratio variables. When the selected features were used to classify the samples of an independent test set, the sensitivity and specificity reached 95.24% and 100%, respectively. The results show that the proposed method can be used for feature selection and sample classification of proteomic mass spectrometry data.
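Steps (2) and (3) above can be illustrated with a minimal sketch. The abstract does not give the F-score formula, so this sketch assumes the commonly used two-class definition: the squared deviations of the two class means from the overall mean, divided by the sum of the two within-class sample variances. The function names, the toy data, and the `cv_score` callable are hypothetical. In the method described, `cv_score` would be the cross-validated PLS-DA performance of a candidate subset; here it is left as a pluggable function, and the acceptance rule (keep a variable only if the score improves) is likewise an assumption.

```python
from statistics import mean, variance


def f_score(pos, neg):
    """F-score of one variable, given its values in the two classes (assumed definition)."""
    overall = mean(pos + neg)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1 denominator)
    if within == 0:
        return 0.0 if between == 0 else float("inf")
    return between / within


def rank_by_f_score(X, y):
    """Return the column indices of X sorted by descending F-score."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(f_score(pos, neg))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def forward_select(ranked, cv_score):
    """Walk the ranked list and keep a variable only if it improves cv_score."""
    selected, best = [], float("-inf")
    for j in ranked:
        score = cv_score(selected + [j])
        if score > best:
            selected, best = selected + [j], score
    return selected, best


# Toy data (hypothetical): 6 samples, 3 variables, label 1 = case, 0 = control.
X = [
    [1.0, 5.0, 0.2],
    [1.2, 5.1, 0.9],
    [0.9, 4.9, 0.4],
    [3.0, 5.0, 0.8],
    [3.1, 5.2, 0.3],
    [2.9, 4.8, 0.5],
]
y = [1, 1, 1, 0, 0, 0]

ranked = rank_by_f_score(X, y)
print(ranked)  # [0, 2, 1]
```

Only variable 0 separates the two classes in the toy data, so it ranks first. With real spectra, `X` would hold the preprocessed intensities at each mass-to-charge ratio, and `cv_score` would wrap a PLS-DA model evaluated by cross-validation on the training set.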