Stitching web tables for improving matching quality


Lehmberg, Oliver ; Bizer, Christian



DOI: https://doi.org/10.14778/3137628.3137657
URL: https://www.researchgate.net/publication/319599651...
Additional URL: http://www.vldb.org/pvldb/vol10/p1502-lehmberg.pdf
Document type: Journal article
Year of publication: 2017
Title of journal or series: Proceedings of the VLDB Endowment
Volume: 10
Issue: 11
Pages: 1502-1513
Place of publication: New York, NY [et al.]
Publisher: Assoc. of Computing Machinery
ISSN: 2150-8097
Language of publication: English
Institution: Fakultät für Wirtschaftsinformatik und Wirtschaftsmathematik > Information Systems V: Web-based Systems (Bizer 2012-)
Subject: 004 Computer science
Free keywords (English): Data Integration, Matching, Web Tables, Knowledge Bases, DBpedia
Abstract: HTML tables on web pages (“web tables”) cover a wide variety of topics. Data from web tables can thus be useful for tasks such as knowledge base completion or ad hoc table extension. Before table data can be used for these tasks, the tables must be matched to the respective knowledge base or base table. The challenges of web table matching are the high heterogeneity and the small size of the tables. Though it is known that the majority of web tables are very small, the gold standards that are used to compare web table matching systems mostly consist of larger tables. In this experimental paper, we evaluate T2K Match, a web table to knowledge base matching system, and COMA, a standard schema matching tool, using a sample of web tables that is more realistic than the gold standards that were previously used. We find that both systems fail to produce correct results for many of the very small tables in the sample. As a remedy, we propose to stitch (combine) the tables from each web site into larger ones and match these enlarged tables to the knowledge base or base table afterwards. For this stitching process, we evaluate different schema matching methods in combination with holistic correspondence refinement. Limiting the stitching procedure to web tables from the same web site decreases the heterogeneity and allows us to stitch tables with very high precision. Our experiments show that applying table stitching before running the actual matching method improves the matching results by 0.38 in F1-measure for T2K Match and by 0.14 for COMA. Also, stitching the tables allows us to reduce the number of tables in our corpus from 5 million original web tables to as few as 100,000 stitched tables.
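The stitching idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that two tables belong together when they come from the same web site and have the same set of column headers after lowercasing and whitespace normalization, whereas the paper evaluates several schema matching methods with holistic correspondence refinement. The function names (`normalize`, `stitch_tables`), the dictionary layout, and the sample data are hypothetical.

```python
from collections import defaultdict


def normalize(header):
    """Lowercase a column header and collapse whitespace (simplifying assumption)."""
    return " ".join(header.lower().split())


def stitch_tables(tables):
    """Union the rows of tables that share a web site and a normalized header set.

    Each table is a dict with the hypothetical keys 'site', 'columns', 'rows'.
    Exact header equality stands in for a real schema matcher here.
    """
    groups = defaultdict(list)
    for table in tables:
        schema = tuple(sorted(normalize(c) for c in table["columns"]))
        groups[(table["site"], schema)].append(table)

    stitched = []
    for (site, schema), members in groups.items():
        rows = []
        for table in members:
            headers = [normalize(c) for c in table["columns"]]
            for row in table["rows"]:
                record = dict(zip(headers, row))
                # Reorder the cells so every row follows the shared schema.
                rows.append([record[c] for c in schema])
        stitched.append({"site": site, "columns": list(schema), "rows": rows})
    return stitched


# Toy input: two small tables from one site, one table from another site.
tables = [
    {"site": "example.org", "columns": ["Name", "Population"],
     "rows": [["Berlin", "3.6M"]]},
    {"site": "example.org", "columns": ["population", "name"],
     "rows": [["1.8M", "Hamburg"]]},
    {"site": "other.example", "columns": ["Name", "Population"],
     "rows": [["Paris", "2.1M"]]},
]
result = stitch_tables(tables)
```

In this toy input the two `example.org` tables are merged into one two-row table, while the `other.example` table stays separate, mirroring the restriction of stitching to tables from the same web site.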




This entry is part of the university bibliography.



