Esuli A.; Moreo Fernández A.; Sebastiani F.
CNR-ISTI, Pisa, Italy; CNR-ISTI, Pisa, Italy; CNR-ISTI, Pisa, Italy
Polylingual Text Classification (PLTC) is a supervised learning task that consists of assigning class labels to documents written in different languages, assuming that a representative set of training documents is available for each language. This scenario is increasingly frequent, given the large number of multilingual platforms and communities emerging on the Internet. The task is also receiving increased attention in the text classification community because of the new challenge it poses, i.e., how to effectively leverage polylingual resources in order to infer a multilingual classifier and to improve the performance of a monolingual one. In response, the use of machine translation tools or multilingual dictionaries has been proposed. However, these resources are not always available, or not always free to use. In this work we analyse some important methods proposed in the literature that are machine translation-free and dictionary-free, including Random Indexing, a method that, to the best of our knowledge, had not previously been tested on PLTC. We analyse these methods in terms of space and time efficiency, and propose a particular configuration of Random Indexing (which we dub Lightweight Random Indexing) that outperforms all other compared algorithms while also showing a significantly reduced computational cost.
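As a rough illustration of the general Random Indexing idea mentioned in the abstract, the following minimal sketch maps each term to a sparse random vector with a few +1/-1 entries and represents a document as the count-weighted sum of the vectors of its terms, so that terms from any language land in one shared reduced space without translation or dictionaries. This is a generic sketch, not the configuration studied in this work: the abstract does not specify the parameters of Lightweight Random Indexing, and the names `index_vector` and `project`, the dimensionality `dim`, the number of non-zero entries `nonzeros`, and the seeding scheme are all assumptions made for the example.

```python
import random
from collections import Counter


def index_vector(term, dim, nonzeros, seed=0):
    """Sparse random index vector for `term`, as {position: +1 or -1}.

    Hypothetical helper: the generator is seeded with the term itself,
    so the same term always receives the same vector.
    """
    rng = random.Random(f"{seed}:{term}")
    positions = rng.sample(range(dim), nonzeros)
    return {pos: rng.choice((-1, 1)) for pos in positions}


def project(tokens, dim=16, nonzeros=2, seed=0):
    """Project a tokenised document into a `dim`-dimensional space.

    The document vector is the sum of the index vectors of its terms,
    each weighted by the raw term count (an assumed weighting).
    """
    vec = [0.0] * dim
    for term, count in Counter(tokens).items():
        for pos, sign in index_vector(term, dim, nonzeros, seed).items():
            vec[pos] += sign * count
    return vec


# Documents in different languages share the same reduced space.
doc_en = ["the", "cat", "sleeps"]
doc_it = ["il", "gatto", "dorme"]
vec_en = project(doc_en)
vec_it = project(doc_it)
```

The resulting fixed-length vectors could then be passed to any standard supervised learner; the choice of learner, term weighting, and dimensionality is left open here.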
6th Italian Information Retrieval Workshop, Cagliari, Italy, 25-26/05/2015
Polylingual text classification, Distributional semantic models, Random Indexing
Moreo Fernandez Alejandro, Esuli Andrea, Sebastiani Fabrizio
ISTI – Istituto di scienza e tecnologie dell'informazione "Alessandro Faedo"
ID: 344534
Year: 2015
Type: Presentation
Creation: 2016-01-14 11:52:32.000
Last update: 2021-02-12 18:59:40.000
External IDs
CNR OAI-PMH: oai:it.cnr:prodotti:344534