Closing the gap: Sequence mining at scale


Beedkar, Kaustubh ; Berberich, Klaus ; Gemulla, Rainer ; Miliaraki, Iris


[img]
Vorschau
PDF
a8-beedkar.pdf - Veröffentlichte Version

Download (2MB)

DOI: https://doi.org/10.1145/2757217
URL: https://madoc.bib.uni-mannheim.de/40019
Weitere URL: http://dl.acm.org/citation.cfm?doid=2799368.275721...
URN: urn:nbn:de:bsz:180-madoc-400197
Dokumenttyp: Zeitschriftenartikel
Erscheinungsjahr: 2015
Titel einer Zeitschrift oder einer Reihe: ACM Transactions on Database Systems : TODS
Band/Volume: 40
Heft/Issue: 2
Seitenbereich: Art. 8, 1-44
Ort der Veröffentlichung: New York, NY
Verlag: ACM Press
ISSN: 0362-5915 , 1557-4644
Sprache der Veröffentlichung: Englisch
Einrichtung: Fakultät für Wirtschaftsinformatik und Wirtschaftsmathematik > Practical Computer Science I: Data Analytics (Gemulla 2014-)
Fachgebiet: 004 Informatik
Abstract: Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this article, we propose MG-FSM, a scalable algorithm for frequent sequence mining on MapReduce. MG-FSM can handle so-called “gap constraints”, which can be used to limit the output to a controlled set of frequent sequences. Both positional and temporal gap constraints, as well as appropriate maximality and closedness constraints, are supported. At its heart, MG-FSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of ω-equivalency, which is a generalization of the notion of a “projected database” used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our experimental study in the contexts of text mining and session analysis suggests that MG-FSM is significantly more efficient and scalable than alternative approaches.




Dieser Eintrag ist Teil der Universitätsbibliographie.

Das Dokument wird vom Publikationsserver der Universitätsbibliothek Mannheim bereitgestellt.




Metadaten-Export


Zitation


+ Suche Autoren in

+ Download-Statistik

Downloads im letzten Jahr

Detaillierte Angaben



Sie haben einen Fehler gefunden? Teilen Sie uns Ihren Korrekturwunsch bitte hier mit: E-Mail


Actions (login required)

Eintrag anzeigen Eintrag anzeigen