Research on Ontology-Aided Methods for Integrating Prior Knowledge into Bioinformatics Data Mining

Source: Shanghai Jiao Tong University | Cited by: 2
Technology now allows us to capture and store vast quantities of data, and within these masses of data lies hidden information of strategic importance. Data mining has therefore become a topic of wide interest. As knowledge accumulates, however, data mining can no longer be treated as an isolated task: it has to be integrated with prior knowledge, because clarifying what is already known is a prerequisite for discovering new knowledge. In addition, as information technology collaborates ever more extensively and deeply with traditional industries, data mining requires an understanding of domain-specific knowledge. The research area concerned with organizing preexisting knowledge in suitable data structures and integrating it into data mining is called "integrating prior knowledge into data mining".

Ontology provides the backbone for sharing domain knowledge among distributed users and applications, and hence can be a solid foundation for accumulating knowledge. Since some large knowledge bases, such as Gene Ontology, are now built on ontologies, knowledge in these knowledge bases has to be queried or retrieved with the assistance of the ontology. Moreover, sharable knowledge bases built on ontologies can provide prior knowledge automatically by reasoning over the ontology. Ontology representation languages and reasoning tools have also become mature in the Semantic Web field, so it is easier to build an ontology-based inference system than a traditional expert system. Ontology-aided methods are therefore promising for overcoming the disadvantages of traditional methods for integrating prior knowledge into data mining.

The most urgent requirements and the most practical applications of integrating prior knowledge into data mining are in bioinformatics, for two reasons. Firstly, genomic sequencing has made it clear that a large fraction of the genes specifying the core biological functions are shared by all eukaryotes. Knowledge of the biological role of such shared proteins in one organism can often be transferred to other organisms. Because knowledge of gene and protein roles in cells can be shared, it is possible to build a knowledge base for molecular biology in which knowledge accumulates, and data analysis in molecular biology needs this preexisting knowledge. For example, biologists can use knowledge learned from yeast to analyze human cells. Secondly, the availability of complete genome sequences provides the information needed to start analyzing the living cell as a whole. Systems biology is an emerging field that aims at a system-level understanding of biological systems. In system-level data analysis, domain knowledge plays an even more important role, and integrating domain-specific knowledge into data mining is a challenge.

As bioinformatics expert Jim Golden of the Curagen spin-off 454 Corporation in Branford, Connecticut, puts it, "The key to bioinformatics is integration, integration, integration." Previous research on integration in bioinformatics has focused mainly on the integration of data, web services, or knowledge. The key idea of this dissertation is to organize useful preexisting knowledge as prior knowledge in suitable data structures and to integrate that prior knowledge into data mining for more effective or more accurate prediction.

Our contributions can be summarized as follows.

To illustrate the ontology-aided method for integrating prior knowledge into data mining, Chapter 3 presents a simple demonstration that is relevant to research on association rule mining. Several conclusions can be drawn from this study. The ontology-aided method of integrating prior knowledge held in metadata into data mining has many advantages: the whole process can be completed automatically, and the method can be applied generally without depending on a specific database. Because the method is built on ontology, it also inherits many advantages of ontology technology. For example, the ontology converted from metadata can import other ontologies to extend the data model, and it is easy to build a reasoner with Semantic Web applications.

In the research on predicting protein subcellular location (SCL), we analyzed features extracted from protein sequences by Fourier transform, both to improve SCL prediction accuracy and to explore the biological mechanism of protein SCL. The results are meaningful from both a computational and a biological point of view. They can be used to reduce the dimension of the features extracted from sequences by Fourier transform, and they give some clues about the mechanism of protein SCL. They also show that frequency-domain analysis is a valuable tool for research on protein SCL prediction. However, this study also showed that prediction accuracy cannot be improved substantially by analyzing sequence information alone, so we turned to prior knowledge. In Section 4.3, we present a novel method that extracts features from Gene Ontology for SCL prediction by means of semantic similarity measurement. A demonstration on a publicly available dataset shows satisfactory results.

To predict gene functions from their expression patterns in microarray datasets, Chapter 5 presents a novel analysis method that incorporates Gene Ontology into the construction of classification models. The method can also be generalized to similar scenarios to construct ontology-aided data analysis models.

In the research on integrating metadata into association rule mining, prior knowledge was integrated into the output of data mining. In the research on integrating prior knowledge into the feature vector for protein SCL prediction, prior knowledge was integrated into the input of data mining. These two parts can be summed up as integrating prior knowledge into the process of data mining. In the research on microarray data analysis, prior knowledge was used to construct the data mining models. In summary, the dissertation gives a comprehensive description of the ontology-aided method for integrating prior knowledge into data mining.

From the perspective of bioinformatics data mining research, protein SCL prediction is a typical sequence analysis problem, and microarray data analysis belongs to systems biology. The methods presented in the dissertation can be generalized to similar problems, so the ontology-aided method of integrating prior knowledge into data mining is a widely applicable and valuable method for bioinformatics.

Bioinformatics is a knowledge-dense field, and ontology is currently the most powerful tool for knowledge management, so building ontologies in this field is valuable. But how can ontologies improve predictions? This dissertation presents an answer. Each of our methods provides a promising solution to the corresponding problem, and some of the results have been presented at prestigious international conferences and in journals.
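The abstract does not specify how semantic similarity over Gene Ontology is turned into features for SCL prediction, so the following is only a minimal sketch of the general idea under stated assumptions. The toy ontology, its term names, the Jaccard overlap of ancestor sets used as the similarity measure, and the function names `ancestors`, `term_similarity`, and `feature_vector` are all hypothetical illustrations, not the dissertation's actual method or real GO identifiers.

```python
from typing import Dict, List, Set

# Toy is-a hierarchy (child -> parents). Term names are hypothetical
# stand-ins for ontology terms, not real Gene Ontology identifiers.
TOY_ONTOLOGY: Dict[str, List[str]] = {
    "cellular_component": [],
    "intracellular": ["cellular_component"],
    "membrane": ["cellular_component"],
    "nucleus": ["intracellular"],
    "cytoplasm": ["intracellular"],
    "mitochondrion": ["cytoplasm"],
    "plasma_membrane": ["membrane"],
}


def ancestors(term: str, ontology: Dict[str, List[str]]) -> Set[str]:
    """Return the term itself plus every term reachable via is-a edges."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(ontology[current])
    return seen


def term_similarity(a: str, b: str, ontology: Dict[str, List[str]]) -> float:
    """Assumed similarity measure: Jaccard overlap of the two ancestor sets."""
    anc_a = ancestors(a, ontology)
    anc_b = ancestors(b, ontology)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


def feature_vector(
    annotations: List[str],
    reference_terms: List[str],
    ontology: Dict[str, List[str]],
) -> List[float]:
    """One feature per reference term: the best similarity between that
    term and any annotation of the protein (0.0 if it has no annotations)."""
    return [
        max((term_similarity(a, r, ontology) for a in annotations), default=0.0)
        for r in reference_terms
    ]


if __name__ == "__main__":
    refs = ["nucleus", "mitochondrion", "plasma_membrane"]
    print(feature_vector(["mitochondrion"], refs, TOY_ONTOLOGY))
```

In this sketch, a protein's ontology annotations become a fixed-length numeric vector that any standard classifier could consume, which is one way prior knowledge can enter the input of data mining. A real system would load the actual ontology and annotation files and would likely use an information-content-based similarity measure instead of this simple overlap.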