CNR ExploRA

Presentation, 2015, ENG

A Comparison of Distributional Semantics Models for Polylingual Text Classification

Esuli A.; Moreo Fernández A.; Sebastiani F.

CNR-ISTI, Pisa, Italy; CNR-ISTI, Pisa, Italy; CNR-ISTI, Pisa, Italy

Polylingual Text Classification (PLTC) is a supervised learning task that consists of assigning class labels to documents written in different languages, assuming that a representative set of training documents is available for each language. This scenario is increasingly frequent, given the large number of multilingual platforms and communities emerging on the Internet. The task is also receiving increased attention in the text classification community because of the new challenge it poses, i.e., how to effectively leverage polylingual resources in order to infer a multilingual classifier and to improve the performance of a monolingual one. In response, the use of machine translation tools or multilingual dictionaries has been proposed. However, these resources are not always available, or not always free to use. In this work we analyse some important methods proposed in the literature that are machine translation-free and dictionary-free, including Random Indexing, a method that, to the best of our knowledge, no one had previously tested on PLTC. We offer an analysis in terms of space and time efficiency, and propose a particular configuration of the Random Indexing method, which we dub Lightweight Random Indexing, that outperforms all the other compared algorithms while also showing a significantly reduced computational cost.
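
For readers unfamiliar with Random Indexing, the following is a minimal generic sketch of the technique. It is not the authors' implementation and not the Lightweight Random Indexing configuration, whose details the abstract does not give. Each distinct term receives a sparse random index vector with a few +1/-1 entries, and a document is represented as the count-weighted sum of the index vectors of its terms. The function names, the dimensionality `dim`, the number of nonzero entries `nonzeros`, and the toy documents are all illustrative assumptions.

```python
import random
from collections import Counter


def make_index_vector(dim, nonzeros, rng):
    """Return a sparse ternary index vector as {position: +1 or -1}."""
    positions = rng.sample(range(dim), nonzeros)
    return {pos: rng.choice((-1, 1)) for pos in positions}


def random_indexing(docs, dim=16, nonzeros=2, seed=0):
    """Project whitespace-tokenised documents into a dim-dimensional space.

    Assumed setup: one random index vector per distinct term, regardless of
    the language the term comes from; a document vector is the sum of its
    terms' index vectors weighted by raw term counts.
    """
    rng = random.Random(seed)
    index_vectors = {}
    projected = []
    for doc in docs:
        vec = [0.0] * dim
        for term, count in Counter(doc.split()).items():
            if term not in index_vectors:
                index_vectors[term] = make_index_vector(dim, nonzeros, rng)
            for pos, sign in index_vectors[term].items():
                vec[pos] += sign * count
        projected.append(vec)
    return projected, index_vectors


# Toy polylingual collection (hypothetical data, English and Italian).
docs = ["the cat sleeps", "il gatto dorme", "the cat and il gatto"]
vectors, index_vectors = random_indexing(docs)
```

In this sketch, terms from every language are mapped into the same reduced space, so documents in different languages receive vectors of the same dimensionality that a single classifier could, in principle, be trained on.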

6th Italian Information Retrieval Workshop, Cagliari, Italy, 25-26/05/2015