Contributo in atti di convegno, 2011, ENG, 10.1145/1989284.1989300
Orlandi, Alessio ; Venturini, Rossano
Dipartimento di Informatica, Università di Pisa ; CNR-ISTI, Pisa
We study the problem of estimating the number of occurrences of substrings in textual data: A text T on some alphabet Sigma of size sigma is preprocessed and an index I is built. The index is used in lieu of the text to answer queries of the form CountH(P), returning an approximated number of the occurrences of an arbitrary pattern P as a substring of T. The problem has its main application in selectivity estimation related to the LIKE predicate in textual databases. Our focus is on obtaining an algorithmic solution with guaranteed error rates and small footprint. To achieve that, we first enrich previous work in the area of compressed text-indexing providing an optimal data structure that requires (|T|log sigma/l) bits where l e 1 is the additive error on any answer. We also approach the issue of guaranteeing exact answers for sufficiently frequent patterns, providing a data structure whose size scales with the amount of such patterns. Our theoretical findings are sustained by experiments showing the practical impact of our data structures.
30th symposium on Principles of Database Systems of Data, PODS 2011, pp. 95–106, Athens, Greece, 12-16 June 2011
Full text Index, Text indexing
ISTI – Istituto di scienza e tecnologie dell'informazione "Alessandro Faedo"
ID: 199911
Year: 2011
Type: Contributo in atti di convegno
Creation: 2013-01-23 11:37:01.000
Last update: 2013-02-28 11:33:56.000
CNR authors
External links
OAI-PMH: Dublin Core
OAI-PMH: Mods
OAI-PMH: RDF
URL: http://portal.acm.org/citation.cfm?id=1989300&CFID=29642294&CFTOKEN=82582261
External IDs
CNR OAI-PMH: oai:it.cnr:prodotti:199911
DOI: 10.1145/1989284.1989300
Scopus: 2-s2.0-79960186706