Knowledge graph exploration for natural language understanding in web information retrieval

Schuhmacher, Michael

Full text:	thesis.pdf (published version, 6 MB)
URL:	https://madoc.bib.uni-mannheim.de/41485
URN:	urn:nbn:de:bsz:180-madoc-414859
Document type:	Doctoral dissertation
Year of publication:	2016
Place of publication:	Mannheim
University:	Universität Mannheim
Reviewer:	Ponzetto, Simone Paolo
Date of oral examination:	11 November 2016
Language of publication:	English
Institution:	Fakultät für Wirtschaftsinformatik und Wirtschaftsmathematik > Semantic Web (Junior Professorship) (Ponzetto 2013-2015); Fakultät für Wirtschaftsinformatik und Wirtschaftsmathematik > Practical Computer Science II: Artificial Intelligence (Stuckenschmidt 2009-)
Subject:	004 Computer science
Controlled keywords (SWD):	Information Retrieval, Semantic Web, Knowledge Representation
Free keywords (English):	Information Retrieval, Semantic Web, Knowledge Bases
Abstract:	In this thesis, we study methods to leverage information from fully structured knowledge bases (KBs), in particular the encyclopedic knowledge graph (KG) DBpedia, for different text-related tasks from the areas of information retrieval (IR) and natural language processing (NLP). The key idea is to apply entity linking (EL) methods that identify mentions of KB entities in text, and then to exploit the structured information within KGs. Developing entity-centric methods for text understanding using KG exploration is the focus of this work. We aim to show that structured background knowledge is a means of improving performance in different IR and NLP tasks that traditionally make use only of the unstructured text input itself. The KB entities mentioned in text thereby act as the connection between the unstructured text and the structured KG. We focus in particular on how best to leverage the knowledge contained in fully structured (RDF) KGs such as DBpedia, whose edges are labeled with predicates. This contrasts with the previous work on Wikipedia-based approaches that we build upon, which typically relies on unlabeled graphs only. The contributions of this thesis can be structured along its three parts. In Part I, we apply EL and semantify short text snippets with KB entities. Although we retrieve only types and categories from DBpedia for each entity, we are able to leverage this information to create semantically coherent clusters of text snippets. This pipeline of connecting text to background knowledge via the mentioned entities is reused in all following chapters. In Part II, we focus on semantic similarity and extend the idea of semantifying text with entities by proposing, in Chapter 5, a model that represents whole documents by their entities. In this model, comparing documents semantically is viewed as the task of comparing the semantic relatedness of their respective entities, which we address in Chapter 4.
We propose an unsupervised graph weighting schema and show that weighting the DBpedia KG leads to better results on an existing entity ranking dataset. The exploration of weighted KG paths also turns out to be useful when disambiguating the entities from an open information extraction (OIE) system in Chapter 6. With this weighting schema, the integration of KG information for computing semantic document similarity in Chapter 5 becomes the task of comparing two KG subgraphs with each other, which we address by approximate subgraph matching. On a well-established evaluation dataset for semantic document similarity, we show that our unsupervised method achieves performance competitive with other state-of-the-art methods. Our results from this part indicate that KGs can contain helpful background knowledge, in particular when exploring KG paths, but that selecting the relevant parts of the graph is an important yet difficult challenge. In Part III, we shift to the task of relevance ranking and first study, in Chapter 7, how best to retrieve KB entities for a given keyword query. Again combining text with KB information, we extract entities from the top-k retrieved, query-specific documents and then link the documents to two different KBs, namely Wikipedia and DBpedia. In a learning-to-rank setting, we study extensively which features from the text, the Wikipedia KB, and the DBpedia KG can be helpful for ranking entities with respect to the query. Experimental results on two datasets, which build upon existing TREC document retrieval collections, indicate that the document-based mention frequency of an entity and the Wikipedia-based query-to-entity similarity are both important features for ranking. The KG paths, in contrast, play only a minor role in this setting, even when integrated with a semantic kernel extension.
In Chapter 8, we further extend the integration of query-specific text documents and KG information by extracting not only entities but also relations from text. In this exploratory study, based on a self-created relevance dataset, we find that not all extracted relations are relevant with respect to the query, but that they often contain information that is not contained within the DBpedia KG. The main insight from the research presented in this part is that, in a query-specific setting, established IR methods for document retrieval provide an important source of information even for entity-centric tasks, and that a close integration of relevant text documents and background knowledge is promising. Finally, in the concluding chapter, we argue that future research should further address the integration of KG information with entities and relations extracted from (specific) text documents, as their potential does not yet seem to be fully explored. The same holds true for better KG exploration, which has gained some scientific interest in recent years. It seems to us that both aspects will remain interesting problems in the coming years, not least because of the growing importance of KGs for web search and knowledge modeling in industry and academia.