Cancer Gene Selection Algorithm Based on Support Vector Machine Recursive Feature Elimination and Feature Clustering

YE Xiaoquan, WU Yunfeng*

(Fujian Key Laboratory of Sensing and Computing for Smart City, School of Information Science and Engineering, Xiamen University, Xiamen 361005, China)

gene expression profile; feature selection; K-means clustering; support vector machine

DOI: 10.6043/j.issn.0438-0479.201803022

Cancer is usually caused by gene mutations, so effectively identifying a small number of cancer-causing genes among a large number of genes is of great importance. To address the high-dimensional, small-sample nature of gene expression profile data, we combine support vector machine recursive feature elimination (SVM-RFE) with a feature clustering algorithm and propose a new gene selection method, K-class SVM-RFE (K-SVM-RFE). The algorithm first removes a large number of irrelevant genes with a feature ranking algorithm, then groups similar genes into clusters with K-means, and finally applies SVM-RFE twice to select the cancer-causing genes. We apply K-SVM-RFE to several gene expression profile data sets and discuss the settings of its key parameters. The experimental results show that genes selected by K-SVM-RFE yield significantly higher classification accuracy than those selected by existing methods, and the improvement is most pronounced when only a few genes are selected.
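The three stages named in the abstract (ranking filter, K-means clustering of genes, two SVM-RFE passes) can be illustrated with a minimal scikit-learn sketch. This is not the authors' implementation: the abstract does not say which ranking criterion is used or how the two SVM-RFE passes are arranged, so the ANOVA F-score filter, the choice of one representative gene per cluster in the first pass, the function name `k_svm_rfe_sketch`, and all parameter defaults are assumptions made for illustration only. The data are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, f_classif
from sklearn.svm import SVC


def k_svm_rfe_sketch(X, y, n_prefilter=200, n_clusters=10, n_genes=5, seed=0):
    """Return column indices of X chosen by filter -> clustering -> two RFE passes."""
    # Stage 1 (assumed filter): rank genes by ANOVA F-score, keep the top ones.
    scores = np.nan_to_num(f_classif(X, y)[0])
    kept = np.argsort(scores)[::-1][:n_prefilter]

    # Stage 2: cluster genes (columns) after standardizing each gene's profile,
    # so that genes with similar expression patterns fall into the same cluster.
    profiles = X[:, kept].T
    profiles = profiles - profiles.mean(axis=1, keepdims=True)
    profiles = profiles / (profiles.std(axis=1, keepdims=True) + 1e-12)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(profiles)

    # Stage 3 (first SVM-RFE pass, assumed): keep one representative per cluster.
    representatives = []
    for c in range(n_clusters):
        members = kept[labels == c]
        if members.size == 0:
            continue
        if members.size > 1:
            rfe = RFE(SVC(kernel="linear"), n_features_to_select=1, step=1)
            members = members[rfe.fit(X[:, members], y).support_]
        representatives.append(members[0])
    representatives = np.array(representatives)

    # Stage 4 (second SVM-RFE pass, assumed): keep n_genes of the representatives.
    n_final = min(n_genes, representatives.size)
    rfe = RFE(SVC(kernel="linear"), n_features_to_select=n_final, step=1)
    return representatives[rfe.fit(X[:, representatives], y).support_]


# Synthetic stand-in for an expression matrix: 60 samples, 500 "genes".
X, y = make_classification(n_samples=60, n_features=500, n_informative=10,
                           n_redundant=20, random_state=0)
selected = k_svm_rfe_sketch(X, y)
print(sorted(selected.tolist()))
```

Clustering before the final elimination pass means the final gene set draws at most one gene from each group of similar genes, which is one plausible way to limit redundancy when only a few genes are kept.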