Conference proceedings contribution, 2010, ENG

Mining top-K patterns from binary datasets in presence of noise

Lucchese C.; Orlando S.; Perego R.

CNR-ISTI, Pisa, Italy; Dipartimento di Informatica, Università Ca' Foscari di Venezia, Venezia, Italy; CNR-ISTI, Pisa, Italy

The discovery of patterns in binary datasets has many applications, e.g. in electronic commerce, TCP/IP networking, and Web usage logging. Still, this is a very challenging task in many respects: overlapping vs. non-overlapping patterns, presence of noise, extraction of the most important patterns only. In this paper we formalize the problem of discovering the Top-K patterns from binary datasets in the presence of noise as the minimization of a novel cost function. According to the Minimum Description Length principle, the proposed cost function favors succinct pattern sets that may approximately describe the input data. We propose a greedy algorithm for the discovery of Patterns in Noisy Datasets, named PaNDa, and show that it outperforms related techniques on both synthetic and real-world data.

Tenth SIAM International Conference on Data Mining, pp. 165–176, Columbus, Ohio, US, April 29 - May 1 2010

Keywords

Database Management. Data mining, Pattern mining

CNR authors

Lucchese Claudio, Perego Raffaele

CNR institutes

ISTI – Istituto di scienza e tecnologie dell'informazione "Alessandro Faedo"

ID: 92091

Year: 2010

Type: Conference proceedings contribution

Last update: 2018-03-02 14:01:35.000

External IDs

CNR OAI-PMH: oai:it.cnr:prodotti:92091

Google Scholar: http://scholar.google.com/citations?view_op=view_citation&hl=en&user=bdoG6ScAAAAJ&sortby=pubdate&citation_for_view=bdoG6ScAAAAJ:4JMBOYKVnBMC