Text Mining on heterogeneous knowledge bases. An application to optimised discovery of disease relevant genetic variants (GOA).
Project information

The growing overload of textual information available to organizations and professionals hampers effective knowledge management and discovery: it increases the time needed to find relevant information and causes crucial information to be missed. This is a particularly vexing problem in the health sciences, where the huge and largely unexplored volume of published literature, combined with structured databases representing experimental data and background knowledge, could lead to new discoveries.

This project proposes the development of a methodology for combined text analysis and data mining (text mining) from such heterogeneous information sources and its application in molecular genetics/genomics and in knowledge management in general. The proposed approach relies on progress in fundamental research issues in text analysis and data mining. For text analysis, we will investigate semi-automatic adaptation of existing text analysis tools to biomedical language, and develop a limited but robust and accurate handling of negation, modality, and quantification in medical language. We will use this information for providing accurate relations automatically extracted from text and weighted according to their reliability. For data mining, we will modify and extend existing graph-based data mining algorithms, especially with regard to scalability and the dynamic nature of the graph that needs to be explored. We will also investigate principled ways for integrating the reliability measures of the output of text analysis with reliability measures for the structured information.
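As an illustration of what "relations weighted according to their reliability" could look like, the following is a minimal Python sketch, not the project's actual method. It assumes a toy keyword lookup for negation and hedge cues (the project itself targets more accurate handling of these), and it assumes a noisy-OR rule for combining reliability scores from independent sources. The cue lists, the `Relation` type, the function names, and the numeric defaults are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical cue lists, for illustration only; real cue detection
# and scope resolution are far more involved than a keyword lookup.
NEGATION_CUES = {"no", "not", "without", "absence"}
HEDGE_CUES = {"may", "might", "suggest", "suggests", "possibly"}


@dataclass(frozen=True)
class Relation:
    subject: str
    predicate: str
    obj: str
    reliability: float  # assumed to lie in [0, 1]
    negated: bool


def score_sentence_relation(subject, predicate, obj, sentence,
                            base=0.9, hedge_penalty=0.5):
    """Attach a reliability weight and a negation flag to a relation.

    Assumption: a hedge cue anywhere in the sentence scales the base
    reliability down, and a negation cue marks the relation as negated.
    """
    tokens = {t.strip(".,;:()").lower() for t in sentence.split()}
    negated = bool(tokens & NEGATION_CUES)
    hedged = bool(tokens & HEDGE_CUES)
    reliability = base * (hedge_penalty if hedged else 1.0)
    return Relation(subject, predicate, obj, reliability, negated)


def combine_noisy_or(reliabilities):
    """Combine independent reliability scores for the same relation.

    Assumption: sources are independent, so the relation fails to hold
    only if every source is wrong (noisy-OR).
    """
    remaining_doubt = 1.0
    for r in reliabilities:
        remaining_doubt *= (1.0 - r)
    return 1.0 - remaining_doubt
```

Under these assumptions, a hedged sentence yields a lower weight than a plain assertion, and a relation supported both by text and by a structured database entry ends up with a higher combined score than either source alone. Other combination rules (for example, taking the maximum or a learned weighting) would fit the same interface.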

These developments will lead to a new methodology for text mining with heterogeneous information sources that will be tested in two application areas: biomedical text mining and knowledge management. For biomedical text mining, the methodology will be used to assist researchers in ranking candidate disease-causing genes. A number of test cases of increasing complexity will be defined, some with known and some with unknown outcome. For the cases with known outcome, the results of the methodology will be compared to the literature; for the cases with unknown outcome, they will be validated experimentally. The application in knowledge management addresses the collection of information about persons (person profiling) from information on the World Wide Web. It will be of a smaller scale than the biomedical application, and is intended to show the general applicability of the developed text mining approach.
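One way to picture graph-based ranking of candidate genes is sketched below in Python. This is an assumption-laden illustration rather than the project's algorithm: it assumes the mined knowledge is stored as a directed graph whose edges carry reliability weights in (0, 1], and it scores each candidate gene by the most reliable path from the disease node, where a path's reliability is the product of its edge weights. The function names and the graph layout are hypothetical.

```python
import heapq


def best_path_reliability(graph, source):
    """Return the highest path reliability from source to each node.

    graph maps a node to a dict of {neighbor: reliability}, with each
    reliability assumed to lie in (0, 1]. Because multiplying by such
    a weight never increases a score, a Dijkstra-style search that
    always expands the highest-scoring node first is sufficient.
    """
    best = {source: 1.0}
    heap = [(-1.0, source)]  # max-heap via negated scores
    while heap:
        neg_score, node = heapq.heappop(heap)
        score = -neg_score
        if score < best.get(node, 0.0):
            continue  # stale heap entry
        for neighbor, reliability in graph.get(node, {}).items():
            candidate = score * reliability
            if candidate > best.get(neighbor, 0.0):
                best[neighbor] = candidate
                heapq.heappush(heap, (-candidate, neighbor))
    return best


def rank_candidate_genes(graph, disease, candidates):
    """Order candidates by their best path reliability to the disease.

    Unreachable candidates receive a score of 0 and sort last.
    """
    best = best_path_reliability(graph, disease)
    return sorted(candidates, key=lambda g: best.get(g, 0.0), reverse=True)
```

With a toy graph in which `geneB` is linked to the disease only indirectly through a strongly supported phenotype, while `geneA` has a weak direct link, this scoring would rank `geneB` first, which is the kind of non-obvious candidate the ranking is meant to surface.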

The project will provide improved text mining tools (adaptable and with deeper semantic analysis), new graph-based data mining methods, and progress both in non-trivial text mining using heterogeneous information sources and in reliability assessment of mined knowledge. Through the applications, we also hope to obtain new results in mining for previously unknown relations between genes and phenotypes, and improved gene prioritisation that catches non-obvious disease-causing genes.

Abstract (translated from Dutch):

The project introduces a methodology for text mining with heterogeneous information sources and its application in molecular genetics and knowledge management. Existing text analysis and graph-based data mining techniques will be extended to make this methodology possible. The methodology is applied in a biomedical application (ranking candidate disease-causing genes) and a knowledge management application (building profiles of persons on the basis of WWW information).

Project Leader(s): 
Walter Daelemans
01/01/2007 - 31/12/2010

Bijzonder Onderzoeksfonds, Universiteit Antwerpen (GOA BO UA)

Publications + Talks

Morante, R. (2010). Descriptive Analysis of Negation Cues in Biomedical Texts. Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10). 1429-1436.
Morante, R., Van Asch, V., & Daelemans, W. (2010). Extraction of biomedical events. Computational linguistics in the Netherlands: selected papers from the twentieth CLIN meeting. 91-106.
Morante, R., Van Asch, V., & Daelemans, W. (2010). Memory-based approaches to event extraction from biomedical texts. 20th Meeting of Computational Linguistics in the Netherlands (CLIN20), Utrecht, The Netherlands.
Morante, R., Van Asch, V., & Daelemans, W. (2010). Memory-Based resolution of in-sentence scopes of hedge cues. Fourteenth Conference on Computational Natural Language Learning: Shared Task.