Description

Title Applying concept networks for knowledge discovery
Abstract The amount of biomedical information continues to increase. This information is contained not only in scientific articles, whose number has already surpassed twenty million, but also in biomedical databases. It is simply impossible for researchers to process such amounts of information. Computers, however, have less difficulty processing large amounts of data, as long as the data are structured. There are multiple candidates for such a structure. One of the simplest, and therefore most powerful and flexible, structures is the so-called triple. A triple consists of two entities (in this thesis, concepts) and their mutual relationship. Metadata, such as the source of the triple and a date, is often added. Once a computer has access to this structured data, it can reason over it. The structured data is thought to be usable for finding new patterns and connections, which will lead to new knowledge and discoveries. The Erasmus Medical Centre (EMC) has developed a graph database capable of processing such structured data and reasoning over it. In this thesis its capability for knowledge discovery has been tested. As the database is based on the UMLS semantic network, with many of its relationships obtained through text mining, Swanson's classic ABC automated reasoning algorithm was tested first. Additionally, we investigated the ways in which this graph database differs from preceding attempts at automated reasoning with Swanson's algorithm, and what lessons have been learned from those attempts, with an emphasis on the ranking of associated concepts. Finally, the various output formats developed during the course of this project are discussed. Ultimately, we conclude that the EMC graph database differs strongly from preceding attempts at automated reasoning, making their ranking algorithms difficult or impossible to apply.
In many other respects the graph database is the next step in knowledge storage and representation, incorporating the various strengths of the preceding attempts. After testing with our dataset we conclude that the intuitive approach to knowledge discovery, simply identifying and selecting "correct" paths between the concepts, was unsuitable. Instead, we present an approach based on the network properties of the concepts, which has a superior discriminatory capability due to its quantitative emphasis. As the optimal format for presenting our results we suggest a list of compounds. This format offers the highest density of data and the greatest flexibility, as it allows users to rank and filter the candidates according to their own preferences.
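
The triple structure and the ABC model mentioned in the abstract can be illustrated with a minimal sketch. It is not the EMC graph database or the thesis implementation: the `Triple` class, the `abc_candidates` function, the predicate names, and the toy data (loosely modelled on Swanson's well-known fish oil and Raynaud example) are all assumptions made for illustration. The sketch treats relationships as undirected and returns every concept C that is linked to a start concept A only through intermediate concepts B.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """Two concepts and their relationship, plus optional metadata."""
    subject: str
    predicate: str
    obj: str
    source: str = ""  # metadata: where the triple came from (hypothetical field)
    date: str = ""    # metadata: when it was recorded (hypothetical field)


def abc_candidates(triples, a):
    """Return {C: set of linking B concepts} for concepts C that are
    reachable from A via some B but have no direct relationship with A."""
    neighbours = defaultdict(set)
    for t in triples:
        # Assumption: relationships are treated as undirected in this sketch.
        neighbours[t.subject].add(t.obj)
        neighbours[t.obj].add(t.subject)

    direct = neighbours.get(a, set())
    candidates = defaultdict(set)
    for b in direct:
        for c in neighbours.get(b, set()):
            if c != a and c not in direct:
                candidates[c].add(b)
    return dict(candidates)


# Illustrative toy data only; not taken from the thesis dataset.
EXAMPLE_TRIPLES = [
    Triple("fish oil", "reduces", "blood viscosity"),
    Triple("fish oil", "reduces", "platelet aggregation"),
    Triple("fish oil", "contains", "eicosapentaenoic acid"),
    Triple("blood viscosity", "associated_with", "Raynaud disease"),
    Triple("platelet aggregation", "associated_with", "Raynaud disease"),
]

if __name__ == "__main__":
    found = abc_candidates(EXAMPLE_TRIPLES, "fish oil")
    # Rank candidates by the number of distinct linking B concepts,
    # a deliberately simple stand-in for a real ranking scheme.
    for c, bs in sorted(found.items(), key=lambda kv: -len(kv[1])):
        print(c, sorted(bs))
```

Counting linking concepts is only one possible ranking; the abstract itself argues for an approach based on the network properties of the concepts, which this sketch does not attempt to reproduce.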