2023, Articolo in rivista, ENG
Vadicamo L.; Amato G.; Gennaro C.
Permutation-based Indexing (PBI) approaches have been proven to be particularly effective for conducting large-scale approximate metric searching. These methods rely on the idea of transforming the original metric objects into permutation representations, which can be efficiently indexed using data structures such as inverted files. The standard conceptualization of permutation associated with a metric object involves only the use of object distances and their relative orders from a set of anchors called pivots. In this paper, we generalized this definition in order to enlarge the class of permutation representations that can be used by PBI approaches. In particular, we introduced the concept of permutation induced by a space transformation and a sorting function, and we investigated which properties these transformations should possess to produce permutations that are effective for metric search. Furthermore, as a practical outcome, we defined a new type of permutation representation that is calculated using distances from pairs of pivots. This proposed technique allowed us to produce longer permutations than traditional ones for the same number of object-pivot distance calculations. The advantage lies in the fact that when longer permutations are employed, the use of inverted files built on permutation prefixes leads to greater efficiency in the search phase.
2021, Contributo in atti di convegno, ENG
Vadicamo L.; Gennaro C.; Amato G.
In the domain of approximate metric search, the Permutation-based Indexing (PBI) approaches have been proved to be particularly suitable for dealing with large data collections. These methods employ a permutation-based representation of the data, which can be efficiently indexed using data structures such as inverted files. In the literature, the definition of the permutation of a metric object was derived by reordering the distances of the object to a set of pivots. In this paper, we aim at generalizing this definition in order to enlarge the class of permutations that can be used by PBI approaches. As a practical outcome, we defined a new type of permutation that is calculated using distances from pairs of pivots. The proposed technique permits us to produce longer permutations than traditional ones for the same number of object-pivot distance calculations. The advantage is that the use of inverted files built on permutation prefixes leads to greater efficiency in the search phase when longer permutations are used.
2019, Articolo in rivista, ENG
Connor R.; Vadicamo L.; Cardillo F.A.; Rabitti F.
Metric search is concerned with the efficient evaluation of queries in metric spaces. In general, a large space of objects is arranged in such a way that, when a further object is presented as a query, those objects most similar to the query can be efficiently found. Most mechanisms rely upon the triangle inequality property of the metric governing the space. The triangle inequality property is equivalent to a finite embedding property, which states that any three points of the space can be isometrically embedded in two-dimensional Euclidean space. In this paper, we examine a class of semimetric space which is finitely four-embeddable in three-dimensional Euclidean space. In mathematics this property has been extensively studied and is generally known as the four-point property. All spaces with the four-point property are metric spaces, but they also have some stronger geometric guarantees. We coin the term supermetric(1) space as, in terms of metric search, they are significantly more tractable. Supermetric spaces include all those governed by Euclidean, Cosine,(2) Jensen-Shannon and Triangular distances, and are thus commonly used within many domains. In previous work we have given a generic mathematical basis for the supermetric property and shown how it can improve indexing performance for a given exact search structure. Here we present a full investigation into its use within a variety of different hyperplane partition indexing structures, and go on to show some more of its flexibility by examining a search structure whose partition and exclusion conditions are tailored, at each node, to suit the individual reference points and data set present there. Among the results given, we show a new best performance for exact search using a well-known benchmark. (C) 2018 Elsevier Ltd. All rights reserved.
2017, Articolo in rivista, ENG
Connor R.; Cardillo F. A.; Vadicamo L.; Rabitti F.
Most research into similarity search in metric spaces relies on the triangle inequality property. This property allows the space to be arranged according to relative distances to avoid searching some subspaces. We show that many common metric spaces, notably including those using Euclidean and Jensen-Shannon distances, also have a stronger property, sometimes called the four-point property: In essence, these spaces allow an isometric embedding of any four points in three-dimensional Euclidean space, as well as any three points in two-dimensional Euclidean space. In fact, we show that any space that is isometrically embeddable in Hilbert space has the stronger property. This property gives stronger geometric guarantees, and one in particular, which we name the Hilbert Exclusion property, allows any indexing mechanism which uses hyperplane partitioning to perform better. One outcome of this observation is that a number of state-of-the-art indexing mechanisms over high-dimensional spaces can be easily refined to give a significant increase in performance; furthermore, the improvement given is greater in higher dimensions. This therefore leads to a significant improvement in the cost of metric search in these spaces.
2015, Articolo in rivista, ENG
Amato G.; Esuli A.; Falchi F.
Abstract Recently, permutation based indexes have attracted interest in the area of similarity search. The basic idea of permutation based indexes is that data objects are represented as appropriately generated permutations of a set of pivots (or reference objects). Similarity queries are executed by searching for data objects whose permutation representation is similar to that of the query, following the assumption that similar objects are represented by similar permutations of the pivots. In the context of permutation-based indexing, most authors propose to select pivots randomly from the data set, given that traditional pivot selection techniques do not reveal better performance. However, to the best of our knowledge, no rigorous comparison has been performed yet. In this paper we compare five pivot selection techniques on three permutation-based similarity access methods. Among those, we propose a novel technique specifically designed for permutations. Two significant observations emerge from our tests. First, random selection is always outperformed by at least one of the tested techniques. Second, there is no technique that is universally the best for all permutation-based access methods; rather different techniques are optimal for different methods. This indicates that the pivot selection technique should be considered as an integrating and relevant part of any permutation-based access method.
2013, Contributo in atti di convegno, ENG
Amato G., Esuli A., Falchi F.
Recently, permutation based indexes have attracted interest in the area of similarity search. The basic idea of permutation based indexes is that data objects are represented as appropriately generated permutations of a set of pivots (or reference objects). Similarity queries are executed by searching for data objects whose permutation representation is similar to that of the query. This, of course assumes that similar objects are represented by similar permutations of the pivots. In the context of permutation-based indexing, most authors propose to select pivots randomly from the data set, given that traditional pivot selection strategies do not reveal better performance. However, to the best of our knowledge, no rigorous comparison has been performed yet. In this paper we compare five pivots selection strategies on three permutation-based similarity access methods. Among those, we propose a novel strategy specifically designed for permutations. Two significant observations emerge from our tests. First, random selection is always outperformed by at least one of the tested strategies. Second, there is not a strategy that is universally the best for all permutation-based access methods; rather different strategies are optimal for different methods.
2012, Software, ENG
Amato G.
MI-File (Metric Inverted File) allows you to perform approximate similarity search on huge datasets. As an example give a look at the Image Similarity Search Engine, which allows you searching in a dataset of more than 100 millions images, that was built using the MI-File library. The technique is based on the use of a space transformation where data objects are represented by ordered sequences of reference objects. The sequence of reference objects that represent a data object is ordered according to the distance of the reference objects from the data object being represented. Distance between two data objects is measured by computing the spearmann footrule distance between the two sequence of reference objects that represent them. The closer the two data objects the most similar the two sequence of reference objects. The index is based on the use of inverted files.
2011, Rapporto di progetto (Project report), ENG
Falchi F.; Amato G.; Bolettieri P.; Gennaro C.
In this document, the developing of the software component for approximate image matching is reported. This document is an output of the Activity A4.3 "Indici efficienti per il matching approssimato e la classificazione delle immagini" of the VISITO Tuscany project. Planned from month 8 to month 17, the activity started and ended as planned. The output of Activity 4.3 is this document and the software component made available inside on the VISITO Tuscany project wiki. The software component is optimized for working on images related to cultural heritage objects.
2010, Rapporto di progetto (Project report), ITA
Bolettieri P.; Benedetti L.; La Torre F.; Loschiavo D.; Lucchese C.; Lungarotti F.; Salvadori S.; Scopigno R.; Venturini R.
Questo documento è il rapporto con la specifica dettagliata delle funzionalità svolta all'interno del Progetto VISITO Tuscany a partire dal mese 4 fino al mese 5 nell'ambito dell'Obbiettivo Operativo 2, Attività A2.1 "Rapporto con la specifica dettagliata delle funzionalità della piattaforma Visito Tuscany". Il rapporto descriverà dettagliatamente le funzionalità offerte dal sistema, ponendo particolare attenzione nell'identificare funzionalità , di notevole importanza per i potenziali utenti, che non sono tuttora fornite da altri sistemi e che potranno essere realizzate capitalizzando sulla sinergia dei membri del consorzio del progetto e sulle loro attività pregresse.
2010, Rapporto di progetto (Project report), ITA
Amato G.
Lo scopo di questo deliverable è quello di stabilire le procedure e metodologie da adottare durante l'esecuzione del progetto VISITO Tuscany. L'obbiettivo è quello di garantire un'efficace ed efficiente gestione del progetto, il raggiungimento degli obbiettivi , ed assicurare il consenso sulle attività da parte dei membri del progetto. Il PQ definisce un insieme di regole per la gestione del lavoro cooperativo nel progetto, includendo anche le procedure da adottare, i meccanismi di reportistica, l'organizzazione degli incontri, il controllo del flusso di informazione, l'affidabilità dei risultati, la preparazione della documentazione per la sottomissione alla Regione Toscana (RT).
2010, Rapporto di progetto (Project report), ITA
Amato G.; Falchi F.; Bolettieri P.; Lucchese C.; Scopigno R.; La Torre F.; Minelli S.; Tavanti F.; Scartoni R.; Salvadori S.; Zanetti N.; Loschiavo D.
Questo documento descrive i risultati ottenuti dal progetto VISITO-Tuscany nei primi otto mesi di lavoro. Dopo una breve panoramica degli obbiettivi generali del progetto, si evidenzieranno i risultati previsti alla fine dell'ottavo mese e si esporrà quale è stato il lavoro effettivamente sostenuto e i risultati raggiunti, in maniera dettagliata per le varie attività previste.
2010, Articolo in rivista, ENG
Batko M.; Falchi F.; Lucchese C.; Novak D.; Perego R.; Rabitti F.; Sedmidubsky J.; Zezula P.
As the number of digital images is growing fast and Content-based Image Retrieval (CBIR) is gaining in popularity, CBIR systems should leap towards Web- scale datasets. In this paper, we report on our experience in building an experimental similarity search system on a test collection of more than 50 million images. The first big challenge we have been facing was obtaining a collection of images of this scale with the corresponding descriptive features. We have tackled the non-trivial process of image crawling and extraction of several MPEG-7 descriptors. The result of this effort is a test collection, the first of such scale, opened to the research community for experiments and comparisons. The second challenge was to develop indexing and searching mechanisms able to scale to the target size and to answer similarity queries in real-time. We have achieved this goal by creating sophisticated centralized and distributed structures based purely on the metric space model of data. We have joined them together which has resulted in an extremely flexible and scalable solution. In this paper, we study in detail the performance of this technology and its evolvement as the data volume grows by three orders of magnitude. The results of the experiments are very encouraging and promising for future applications.
2009, Contributo in atti di convegno, ENG
Falchi F.; Lucchese C.; Orlando S.; Perego R.; Rabitti F.
In order to become an effective complement to traditional Web-scale text-based image retrieval solutions, content-based image retrieval must address scalability and effciency issues. In this paper we investigate the possibility of caching the answers to content-based image retrieval queries in metric space, with the aim of reducing the average cost of query processing, and boosting the overall system throughput. Our proposal exploits the similarity between the query object and the cache content, and allows the cache to return approximate answers with acceptable quality guarantee even if the query processed has never been encountered in the past. Moreover, since popular images that are likely to be used as query have several near-duplicate versions, we show that our caching algorithm is robust, and does not suffer of cache pollution problems due to near-duplicate query objects. We report on very promising results obtained with a collection of one million high-quality digital photos. We show that it is worth pursuing caching strategies also in similarity search systems, since the proposed caching techniques can have a signicant impact on performance, like caching on text queries has been proven effective for traditional Web search engines.
2008, Contributo in atti di convegno, ENG
Batko M.; Falchi F.; Lucchese C.; Novak D.; Perego R.; Rabitti F.; Sedmidubsky J.; Zezula P.
In this paper, we report on our experience in building an experimental similarity search system on a test collection of more than 50 million images, to show the possibility to scale Content-based Image Retrieval (CBIR) systems towards the Web size. First, we had to tackle the non-trivial process of image crawling and descriptive feature extraction, performed by using the European EGEE computer GRID, building a test collection, the first of such scale, that will be opened to the research community for experiments and comparisons. Then, we had to develop indexing and searching mechanisms which can scale up to these volumes and answer similarity queries in real-time. The results of our experiments are very encouraging for future applications.
2008, Articolo in rivista, ENG
Falchi F.; Gennaro C.; Savino P.; Stanchev P.
A new approach for video-stream filtering that makes use of the features representing video content and exploits the properties of metric spaces can help reduce the filtering receiver's computational load.
2008, Articolo in rivista, ENG
Falchi F.; Gennaro C.; Zezula P.
Most of the peer-to-peer search techniques proposed in the recent years have focused on the single-key retrieval. However, similarity search in metric spaces represents an important paradigm for content-based retrieval in many applications. In this paper we introduce an extension of the well-known Content-Addressable Network paradigm to support storage and retrieval of more generic metric space objects. In particular we address the problem of executing the nearest neighbors queries, and propose three different algorithms of query propagation. An extensive experimental study on real-life data sets explores the performance characteristics of the proposed algorithms by showing their advantages and disadvantages.
2004, Presentazione, ENG
Batko M.; Gennaro C.; Zezula P.
Similarity search in metric spaces represents an important paradigm for content-based retrieval of many applications. Existing centralized search structures can speed-up retrieval, but they do not scale up to large volume of data because the response time is linearly increasing with the size of the searched file. In this article, we study the problem of executing the nearest neighbor(s) queries in a distributed metric structure, which is based on the P2P communication paradigm and the generalized hyperplane partitioning. By exploiting parallelism in a dynamic network of computers, the query execution scales up very well considering both the number of distance computations and the hop count between the peers. Results are verified by experiments on real-life data sets.
2004, Rapporto tecnico, ENG
Falchi F.; Gennaro C.; Savino P.
The production of digital multimedia content is continuously increasing. With the advent of the digital cable, satellite and terrestrial televisions, thousands of channels are practically available for users. Moreover, multimedia is widely used in many professional applications, such as video program production, video surveillance, and e-learning. Consequently, the use of techniques for video retrieval as well as video filtering is becoming of crucial importance. The adoption of the MPEG-7 standard is a significant step forward in simplifying the video retrieval and filtering. However, the performance issue can be relevant if the retrieval must be accomplished in real time, as in some applications such as the video surveillance or video filtering in general. Moreover, the number of streams to search, the number of queries, and the computational complexity of the feature similarity measure can heavily affect the effectiveness of such real time filtering applications. In this paper we present the Pivoted Stream, a novel approach for efficient filtering of a video stream by using the MPEG-7 descriptors. Our proposal exploits the properties of the metric spaces, in order to reduce the computational load of the filtering receiver.
2003, Contributo in atti di convegno, ENG
Dohnal V.; Gennaro C.; Savino P.; Zezula P.
Similarity retrieval is an important paradigm for searching in environments where exact match has little meaning. Moreover, in order to enlarge the set of data types for which the similarity search can efficiently be performed, the notion of mathematical metric space provides a useful abstraction for similarity. In this paper we consider the problem of organizing and searching large data-sets from arbitrary metric spaces, and a novel access structure for similarity search in metric data, called D-Index, is discussed. D-Index combines a novel clustering technique and the pivot-based distance searching strategy to speed up execution of similarity range and nearest neighbor queries for large files with objects stored in disk memories. Moreover, we propose an extension of this access structure (eD-Index) which is able to deal with the problem of similarity self join. Though this approach is not able to eliminate the intrinsic quadratic complexity of similarity joins, significant performance improvements are confirmed by experiments.
2003, Presentazione, ENG
Batko M.; Gennaro C.; Zezula P.
In this paper we propose a new access structure, called GHT*, based on generalized hyperplane tree (GHT) and distributed dynamic hashing (DDH) techniques. GHT* is a distributed structure which allows to perform range search in a metric space according to a distance function d. The structure does not require a central directory and it is able to gracefully scale through splits of one bucket at a time.