Technology now allows us to capture and store vast quantities of data, and hidden within these masses of data lies information of strategic importance. Data mining has therefore attracted wide interest. As knowledge accumulates, however, data mining can no longer be treated as an isolated task: it needs to be integrated with prior knowledge, and clarifying what is already known is an important step before discovering new knowledge. Moreover, as information technology is combined ever more extensively and deeply with traditional industries, data mining requires a prior understanding of domain-specific knowledge. Research on organizing preexisting knowledge in suitable data structures and integrating it into data mining is called “integrating prior knowledge into data mining”.

Ontology provides the backbone for sharing domain knowledge among distributed users and applications, and hence can be a solid foundation for accumulating knowledge. Since some large knowledge bases, such as the Gene Ontology, are now built on ontologies, knowledge in these knowledge bases has to be queried or retrieved with the assistance of the ontology. Furthermore, sharable knowledge bases built on ontologies can supply prior knowledge automatically through reasoning over the ontology. In addition, ontology representation languages and reasoning tools have matured in the Semantic Web field, so building an inference system is easier than building a traditional expert system. Ontology-aided methods are therefore promising for overcoming the disadvantages of traditional methods for integrating prior knowledge into data mining.

The most urgent requirements and the most practical applications of integrating prior knowledge into data mining are in bioinformatics, for two reasons. First, genomic sequencing has made it clear that a large fraction of the genes specifying the core biological functions are shared by all eukaryotes.
Knowledge of the biological role of such shared proteins in one organism can often be transferred to other organisms. Because knowledge of gene and protein roles in cells can be shared, it is possible to build a knowledge base for molecular biology in which knowledge accumulates. Data analysis in molecular biology needs this preexisting knowledge; for example, biologists can use knowledge learned from yeast to analyze human cells. Second, the availability of complete genome sequences provides the information needed to start analyzing the living cell as a whole. Systems biology is an emerging field that aims at a system-level understanding of biological systems. In system-level data analysis, domain knowledge plays an even more important role, and how to integrate domain-specific knowledge into data mining is a challenge.

Consequently, "The key to bioinformatics is integration, integration, integration," says bioinformatics expert Jim Golden at Curagen spin-off 454 Corporation in Branford, Connecticut. Previous research on integration in bioinformatics has focused mainly on the integration of data, web services, or knowledge. The key idea of this dissertation is how to organize useful preexisting knowledge as prior knowledge in suitable data structures and integrate it into data mining for more effective or more accurate prediction.

Our contributions can be summarized as follows.

To illustrate the ontology-aided method for integrating prior knowledge into data mining, Chapter 3 presents a simple demonstration, which is valuable for research on association rule mining. Several conclusions can be drawn from the study. First, the ontology-aided method of integrating prior knowledge contained in metadata into data mining has many advantages. The whole process can be completed automatically, and the method can be applied generally without depending on a specific database.
Since the method is built on ontology, many advantages of ontology technology carry over. For example, the ontology converted from metadata can import other ontologies to extend the data model, and a reasoner is easy to build with Semantic Web applications.

In the research on predicting protein subcellular location (SCL), to improve prediction accuracy and to explore the biological mechanism of protein SCL, we analyzed features extracted from protein sequences by the Fourier transform. The results are meaningful from both the computational and the biological point of view: they can be used to reduce the dimension of the Fourier-transform features extracted from sequences, and they give some clues to the mechanism of protein SCL. They also show that frequency-domain analysis is a valuable tool for research on protein SCL prediction. However, this study also showed that prediction accuracy cannot be improved substantially by analyzing sequence information alone, so we turn to prior knowledge in the next subsection. In Section 4.3, we present a novel method that extracts features from the Gene Ontology for SCL prediction by means of semantic similarity measurement. A demonstration on a publicly available dataset shows satisfactory results.

To predict gene functions from expression patterns in microarray datasets, Chapter 5 presents a novel analysis method that incorporates the Gene Ontology into the construction of classification models. The method can also be generalized to similar scenarios to construct ontology-aided data analysis models.

In the research on integrating metadata into association rule mining, prior knowledge was integrated into the output of data mining. In the research on integrating prior knowledge into feature vectors for protein SCL prediction, prior knowledge was integrated into the input of data mining. These two parts can be summed up as integrating prior knowledge into the process of data mining.
In the research on microarray data analysis, prior knowledge was used to construct data mining models. In summary, the dissertation gives a comprehensive description of the ontology-aided method for integrating prior knowledge into data mining.

From the perspective of bioinformatics data mining research, the prediction of protein SCL is a typical sequence analysis problem, and microarray data analysis belongs to systems biology. The methods presented in the dissertation can be generalized to similar problems, so the ontology-aided method of integrating prior knowledge into data mining is a widely applicable and valuable method for bioinformatics.

Bioinformatics is a knowledge-dense field, and ontology is currently the most powerful tool for knowledge management. Building ontologies in this field is valuable, but how can ontologies improve predictions? This dissertation presents an answer.

Each of our methods provides a promising solution to the corresponding problem. Some of the results have been presented at prestigious international conferences and in journals.
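
As a rough illustration of the idea of extracting features from the Gene Ontology by semantic similarity, mentioned above, the following Python sketch turns a protein's ontology annotations into a fixed-length feature vector. Everything concrete in it is an assumption made for illustration: the toy is-a hierarchy, the term names (which are not real GO identifiers), the choice of reference terms, and the Jaccard-style overlap of ancestor sets used as the similarity measure. The abstract does not state which measure the dissertation actually uses.

```python
# Toy is-a hierarchy (child -> parents). Term names are hypothetical
# placeholders, not real Gene Ontology identifiers.
PARENTS = {
    "cell_part": [],
    "membrane": ["cell_part"],
    "organelle": ["cell_part"],
    "plasma_membrane": ["membrane"],
    "nucleus": ["organelle"],
    "mitochondrion": ["organelle"],
    "mitochondrial_membrane": ["membrane", "mitochondrion"],
}


def ancestors(term):
    """Return the term itself plus every term reachable via parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def term_similarity(a, b):
    """Jaccard overlap of ancestor sets (an assumed, deliberately simple measure)."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


def feature_vector(annotations, reference_terms):
    """One feature per reference term: the best similarity between that
    reference term and any annotation of the protein (0.0 if unannotated)."""
    return [
        max((term_similarity(t, ref) for t in annotations), default=0.0)
        for ref in reference_terms
    ]


# Hypothetical reference terms standing in for candidate subcellular locations.
REFERENCE_TERMS = ["plasma_membrane", "nucleus", "mitochondrion"]

if __name__ == "__main__":
    # A hypothetical protein annotated with a single term.
    print(feature_vector(["mitochondrial_membrane"], REFERENCE_TERMS))
```

A vector built this way could be passed to any standard classifier as input, which corresponds to integrating prior knowledge into the input of data mining as described above. An unannotated protein simply receives an all-zero vector.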