Traditional IR systems based on syntactic search are unable to address basic issues such as synonymy and polysemy. The semantic web has been presented as a major tool for overcoming these issues, and it does so with excellent precision. However, a large part of the data remains unstructured, and manually annotating such large volumes of data takes great effort. Concept-based information retrieval is an approach that can handle unstructured data directly while addressing these issues. Our concept-based method is built on semantic relatedness. The major research and contributions are presented as follows:

(1) In order to have a single measure for computing the distance between queries, concepts and documents, we have implemented a new semantic similarity measure. Similarity measures are among the most important tools in information retrieval and natural language processing. Sentence similarity is of capital importance in online translation. Word-to-document similarity is the key factor in computing query relevance. Text similarity plays a big role in data mining. Many similarity measures have been used in different domains, but most of them are corpus dependent or language dependent. Measures that are good at word-to-document matching are often unable to compute document-to-document similarity, and vice versa. In this paper we present a method that can be used to compute any kind of semantic similarity. The method is neither corpus dependent nor language dependent, and provides a more accurate way to compare semantic relatedness.

(2) Traditional search engines, based on word-to-document matching, are known to have extremely low precision. They do not address key problems such as synonymy and polysemy. Semantic search has been proposed to deal with these issues. It certainly addresses them but has very low recall. It only handles structured data. Unstructured data must be annotated first, and annotating a huge volume of unstructured data is time consuming.
Concept-based information retrieval proposes to extend syntactic search using the semantic relationships between words. In this work, we present a method that uses an undirected graph of concepts extracted from the Wikipedia corpus to retrieve unstructured data. Its main feature is based on the semantic relationships between concepts.

(3) Concept-based search is a method that enhances information retrieval systems using semantic relationships. The recall of concept-based search is relatively low, because it is not easy to represent a concept completely. Since concept representation is always partial, query expansion is intended to fill this gap and improve recall. In this paper we present an expansion method for concept-based information retrieval. Our method uses semantic relatedness to extend the user query through an undirected graph of concepts. Concept-to-concept relatedness is the only source of expansion used in this work.

(4) Query expansion has a very high computation cost. This cost can be reduced by moving to indexing time what is usually done at search time. Clustering is a way to organize data according to a given similarity. Traditional clustering methods are not able to describe the clusters they generate. Conceptual clustering is an important and active research area that aims to efficiently cluster and explain the data. Previous conceptual clustering approaches provide descriptions that do not use human-comprehensible knowledge. In this work we present an algorithm that uses concepts to drive the clustering. The generated clusters overlap each other and serve as a basis for an information retrieval system. The method has been implemented in order to improve the performance of the system by reducing the computation cost.
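
To make the expansion step of contribution (3) concrete, the following is a minimal sketch of query expansion over an undirected, weighted concept graph. It is an illustration only, not the method implemented in this work: the function names `build_graph` and `expand_query`, the `threshold` and `max_new` parameters, and the relatedness values in the example are all assumptions, and the relatedness scores are taken as given instead of being computed from Wikipedia.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected concept graph from (concept_a, concept_b, relatedness) triples.

    Hypothetical helper: the relatedness weights are assumed to be precomputed.
    """
    graph = defaultdict(dict)
    for a, b, weight in edges:
        graph[a][b] = weight
        graph[b][a] = weight  # undirected: store the edge in both directions
    return graph


def expand_query(graph, query_concepts, threshold=0.5, max_new=5):
    """Return the query concepts followed by their most related neighbours.

    A neighbour is added only if its relatedness to some query concept
    reaches `threshold` (an assumed cut-off); at most `max_new` are added.
    """
    scores = {}
    for concept in query_concepts:
        for neighbour, weight in graph.get(concept, {}).items():
            if neighbour in query_concepts or weight < threshold:
                continue
            # Keep the strongest link between the neighbour and the query.
            scores[neighbour] = max(scores.get(neighbour, 0.0), weight)
    # Strongest relatedness first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda n: (-scores[n], n))
    return list(query_concepts) + ranked[:max_new]


# Illustrative data only; these weights are made up for the example.
edges = [
    ("car", "automobile", 0.9),
    ("car", "vehicle", 0.7),
    ("car", "road", 0.3),
    ("vehicle", "truck", 0.6),
]
graph = build_graph(edges)
expanded = expand_query(graph, ["car"])
```

In this sketch only direct neighbours are considered, so "truck" is not reached from "car". Running the same expansion once per concept at indexing time, instead of once per query, is one way to read the cost reduction described in contribution (4).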