RESULTS FROM 1 TO 12 OF 12

2017, Journal article, ENG

Distributional correspondence indexing for cross-lingual and cross-domain sentiment classification

Moreo Fernandez A.; Esuli A.; Sebastiani F.

Researchers from ISTI-CNR, Pisa (in a joint effort with the Qatar Computing Research Institute), have developed a transfer learning method that allows cross-domain and cross-lingual sentiment classification to be performed accurately and efficiently. This means sentiment classification efforts can leverage training data originally developed for performing sentiment classification on other domains and/or in other languages.

ERCIM news 111, p. 48

2017, Journal article, ENG

Lightweight random indexing for polylingual text classification

Moreo Fernandez A.; Esuli A.; Sebastiani F.

Researchers from ISTI-CNR, Pisa (in a joint effort with the Qatar Computing Research Institute), have undertaken an effort aimed at producing more accurate and more efficient means of performing polylingual text classification, i.e., automatic text classification in which classifying text in one language can also leverage training data expressed in a different language.

ERCIM news 110, p. 41

2016, Technical report, ENG

Picture it in your mind: generating high level visual representations from textual descriptions

Carrara F.; Esuli A.; Fagni T.; Falchi F.; Moreo Fernandez A.

In this paper we tackle the problem of image search when the query is a short textual description of the image the user is looking for. We choose to implement the actual search process as a similarity search in a visual feature space, by learning to translate a textual query into a visual representation. Searching in the visual feature space has the advantage that any update to the translation model does not require reprocessing the (typically huge) image collection on which the search is performed. We propose Text2Vis, a neural network that generates a visual representation, in the visual feature space of the fc6-fc7 layers of ImageNet, from a short descriptive text. Text2Vis optimizes two loss functions, using a stochastic loss-selection method. A visual-focused loss is aimed at learning the actual text-to-visual feature mapping, while a text-focused loss is aimed at modeling the higher-level semantic concepts expressed in language, countering the visual loss's tendency to overfit on non-relevant visual components. We report preliminary results on the MS-COCO dataset.
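The stochastic loss-selection idea can be illustrated with a toy linear text-to-visual map. This is not the Text2Vis architecture: the mixing probability `p_visual`, the learning rate, the squared-error losses, and the equal dimensionality of the two targets are all simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_step(W, x_text, v_visual, t_semantic, p_visual=0.7, lr=0.01):
    """One stochastic loss-selection step for a linear text-to-visual map.
    With probability p_visual we descend on the visual loss ||Wx - v||^2
    (the text-to-visual mapping); otherwise on the text loss ||Wx - t||^2
    (a semantic regularizer). All constants here are illustrative."""
    pred = W @ x_text
    target = v_visual if rng.random() < p_visual else t_semantic
    grad = 2 * np.outer(pred - target, x_text)   # d/dW of ||Wx - target||^2
    return W - lr * grad

dim_txt, dim_vis = 8, 4
W = rng.normal(size=(dim_vis, dim_txt))   # the map being learned
x = rng.normal(size=dim_txt)              # toy text embedding
v = rng.normal(size=dim_vis)              # fc6/fc7-style visual target
t = rng.normal(size=dim_vis)              # toy semantic target
for _ in range(200):
    W = train_step(W, x, v, t)
```

Because each step descends on only one randomly chosen loss, the learned prediction settles near a probability-weighted compromise between the two targets rather than overfitting either one.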

2016, Conference proceedings contribution, ENG

Transductive Distributional Correspondence Indexing for cross-domain topic classification

Fernandez A.M.; Esuli A.; Sebastiani F.

Obtaining high-quality annotated data for training a classifier for a new domain is often costly. Domain Adaptation (DA) aims at leveraging the annotated data available from a different but related source domain in order to deploy a classification model for the target domain of interest, thus alleviating the aforementioned costs. To that aim, the learning model is typically given access to a set of unlabelled documents collected from the target domain. These documents might consist of a representative sample of the target distribution, and they could thus be used to infer a general classification model for the domain (inductive inference). Alternatively, these documents could be the entire set of documents to be classified; this happens when there is only one set of documents we are interested in classifying (transductive inference). Many of the DA methods proposed so far have focused on transductive classification by topic, i.e., the task of assigning class labels to a specific set of documents based on the topics they are about. In this work, we report on new experiments we have conducted in transductive classification by topic using the Distributional Correspondence Indexing (DCI) method, a DA method we have recently developed that delivered state-of-the-art results in inductive classification by sentiment. The results we have obtained on three popular datasets show DCI to be competitive with the state of the art also in this scenario, and to be superior to all compared methods in many cases.

7th Italian Information Retrieval Workshop, Venezia, Italy, 30-31 May 2016. CEUR workshop proceedings 1653, pp. 8–11

2016, Technical report, ENG

ProgettISTI 2016

Banterle F.; Barsocchi P.; Candela L.; Carlini E.; Carrara F.; Cassarà P.; Ciancia V.; Cintia P.; Dellepiane M.; Esuli A.; Gabrielli L.; Germanese D.; Girardi M.; Girolami M.; Kavalionak H.; Lonetti F.; Lulli A.; Moreo Fernandez A.; Moroni D.; Nardini F. M.; Monteiro De Lira V. C.; Palumbo F.; Pappalardo L.; Pascali M. A.; Reggianini M.; Righi M.; Rinzivillo S.; Russo D.; Siotto E.; Villa A.

The ProgettISTI research project grant is an award for members of the Institute of Information Science and Technologies (ISTI) to provide support for innovative, original and multidisciplinary projects of high quality and potential. The choice of theme and the design of the research are entirely up to the applicants, yet (i) the theme must fall under the ISTI research topics, (ii) the proposers of each project must come from different laboratories of the Institute and must contribute different expertise to the project idea, and (iii) project proposals should have a duration of 12 months. This report documents the procedure, the proposals and the results of the 2016 edition of the award. In this edition, ten project proposals were submitted and three of them were awarded.

2016, Journal article, ENG

Lightweight random indexing for polylingual text classification

Moreo Fernandez A.; Esuli A.; Sebastiani F.

Multilingual Text Classification (MLTC) is a text classification task in which documents are each written in one among a set L of natural languages, and in which all documents must be classified under the same classification scheme, irrespective of language. There are two main variants of MLTC, namely Cross-Lingual Text Classification (CLTC) and Polylingual Text Classification (PLTC). In PLTC, which is the focus of this paper, we assume (differently from CLTC) that for each language in L there is a representative set of training documents; PLTC consists of improving the accuracy of each of the |L| monolingual classifiers by also leveraging the training documents written in the other (|L| - 1) languages. The obvious solution, consisting of generating a single polylingual classifier from the juxtaposed monolingual vector spaces, is usually infeasible, since the dimensionality of the resulting vector space is roughly |L| times that of a monolingual one, and is thus often unmanageable. As a response, the use of machine translation tools or multilingual dictionaries has been proposed. However, these resources are not always available, or are not always free to use. One machine-translation-free and dictionary-free method that, to the best of our knowledge, has never been applied to PLTC before, is Random Indexing (RI). We analyse RI in terms of space and time efficiency, and propose a particular configuration of it (that we dub Lightweight Random Indexing - LRI). By running experiments on two well-known public benchmarks, Reuters RCV1/RCV2 (a comparable corpus) and JRC-Acquis (a parallel one), we show LRI to outperform (both in terms of effectiveness and efficiency) a number of previously proposed machine-translation-free and dictionary-free PLTC methods that we use as baselines.
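The Random Indexing idea underlying this work can be sketched in a few lines: each term receives a sparse random ternary index vector, and a document is the sum of its terms' index vectors, so documents from all languages land in one shared, fixed-dimensionality space. The dimensionality (500), the number of non-zero entries, and the function names below are illustrative assumptions, not the paper's exact LRI configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def index_vector(dim=500, nonzeros=2):
    """Sparse ternary random index vector: a few +1/-1 entries, rest zeros."""
    v = np.zeros(dim)
    pos = rng.choice(dim, size=nonzeros, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], size=nonzeros)
    return v

def embed_corpus(docs, dim=500, nonzeros=2):
    """Map each document (a list of terms, in any language) into a shared
    dim-dimensional space by summing the index vectors of its terms.
    Terms from all languages share one index-vector table, so the
    monolingual training sets end up in a single common vector space."""
    table = {}
    out = np.zeros((len(docs), dim))
    for i, doc in enumerate(docs):
        for term in doc:
            if term not in table:
                table[term] = index_vector(dim, nonzeros)
            out[i] += table[term]
    return out

docs = [["good", "movie"], ["buen", "filme"], ["good", "filme"]]
X = embed_corpus(docs)
print(X.shape)  # (3, 500)
```

The appeal for PLTC is that the space stays dim-dimensional no matter how many languages contribute vocabulary, avoiding the |L|-fold blow-up of juxtaposed monolingual spaces.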

The journal of artificial intelligence research (Print) 57, pp. 151–185

DOI: 10.1613/jair.5194

2016, Conference proceedings contribution, ENG

Distributional random oversampling for imbalanced text classification

Moreo Fernandez A.; Esuli A.; Sebastiani F.

The accuracy of many classification algorithms is known to suffer when the data are imbalanced (i.e., when the distribution of the examples across the classes is severely skewed). Many applications of binary text classification are of this type, with the positive examples of the class of interest far outnumbered by the negative examples. Oversampling (i.e., generating synthetic training examples of the minority class) is an often-used strategy to counter this problem. We present a new oversampling method specifically designed for classifying data (such as text) for which the distributional hypothesis holds, according to which the meaning of a feature is somehow determined by its distribution in large corpora of data. Our Distributional Random Oversampling method generates new random minority-class synthetic documents by exploiting the distributional properties of the terms in the collection. We discuss results we have obtained on the Reuters-21578, OHSUMED-S, and RCV1-v2 datasets.
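The idea of generating synthetic minority-class documents from corpus statistics can be sketched as follows. This is a generic distributional-oversampling illustration, not the authors' exact DRO algorithm; the function name, the `blend` parameter, and the toy data are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def distributional_oversample(X_min, n_new, blend=0.5):
    """Generate n_new synthetic minority-class term-count vectors.
    Each synthetic document resamples its terms from a multinomial whose
    parameters blend a seed document's own term distribution with the
    minority class's overall term distribution, so new examples vary
    along directions the class's corpus statistics support.
    (Generic sketch; the published DRO method differs in detail.)"""
    class_dist = X_min.sum(axis=0).astype(float)
    class_dist /= class_dist.sum()
    synthetic = np.zeros((n_new, X_min.shape[1]))
    for i in range(n_new):
        seed = X_min[rng.integers(len(X_min))].astype(float)
        length = int(seed.sum())            # keep the seed document's length
        doc_dist = seed / seed.sum()
        p = blend * doc_dist + (1 - blend) * class_dist
        synthetic[i] = rng.multinomial(length, p)
    return synthetic

X_min = np.array([[3, 1, 0, 2],             # toy minority-class
                  [1, 0, 2, 1]])            # term-count matrix
X_new = distributional_oversample(X_min, n_new=5)
print(X_new.shape)  # (5, 4)
```

Each synthetic row is a valid term-count vector of the same length as its seed, but with counts perturbed according to the class-level term distribution rather than by arbitrary noise.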

SIGIR 2016 - 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, Pisa, Italy, 17-21 July 2016

DOI: 10.1145/2911451.2914722

2016, Technical report, ENG

JaTeCS, a Java library focused on automatic text categorization

Esuli A.; Fagni T.; Moreo Fernández A.

JaTeCS is an open source Java library focused on automatic text categorization. It covers all the steps of an experimental activity, from reading the corpus to the evaluation of the results. JaTeCS focuses on text as the central input, and its code is optimized for this type of data. As with many other machine learning (ML) frameworks, JaTeCS provides data readers for many formats and well-known corpora, NLP tools, feature selection and weighting methods, the implementation of many ML algorithms, as well as wrappers for well-known external software (e.g., libSVM, SVMlight). JaTeCS also provides implementations of methods related to text classification that are rarely, if ever, provided by other ML frameworks (e.g., active learning, quantification, transfer learning).

2016, Software, ENG

Java Text Categorization System

Esuli A.; Fagni T.; Moreo Fernandez A.

JaTeCS is an open source Java library focused on Automatic Text Categorization (ATC). It covers all the steps of an experimental activity, from reading the corpus to the evaluation of the experimental results. JaTeCS focuses on text as the central input, and its code is optimized for this type of data. As with many other machine learning (ML) frameworks, it provides data readers for many formats and well-known corpora, NLP tools, feature selection and weighting methods, the implementation of many ML algorithms, as well as wrappers for well-known external software (e.g., libSVM, SVM_light). JaTeCS also provides implementations of methods related to ATC that are rarely, if ever, provided by other ML frameworks (e.g., active learning, quantification, transfer learning).

2016, Journal article, ENG

Distributional correspondence indexing for cross-lingual and cross-domain sentiment classification

Moreo Fernandez A.; Esuli A.; Sebastiani F.

Domain Adaptation (DA) techniques aim at enabling machine learning methods to learn effective classifiers for a "target" domain when the only available training data belong to a different "source" domain. In this paper we present the Distributional Correspondence Indexing (DCI) method for domain adaptation in sentiment classification. DCI derives term representations in a vector space common to both domains, where each dimension reflects the term's distributional correspondence to a pivot, i.e., to a highly predictive term that behaves similarly across domains. Term correspondence is quantified by means of a distributional correspondence function (DCF). We propose a number of efficient DCFs that are motivated by the distributional hypothesis, i.e., the hypothesis according to which terms with similar meaning tend to have similar distributions in text. Experiments show that DCI obtains better performance than current state-of-the-art techniques for cross-lingual and cross-domain sentiment classification. DCI also brings about a significantly reduced computational cost, and requires a smaller amount of human intervention. As a final contribution, we discuss a more challenging formulation of the domain adaptation problem, in which both the cross-domain and cross-lingual dimensions are tackled simultaneously.
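The pivot-based projection can be sketched as follows, using cosine similarity of document-occurrence patterns as a stand-in DCF (the paper proposes several DCFs; this is just one plausible choice). In the actual method the projection is computed separately in the source and target domains, with the pivots serving as shared anchors; the function name and toy data below are illustrative.

```python
import numpy as np

def dci_profile(X, pivot_idx):
    """Project every term into a pivot-indexed space: component j of a
    term's profile is a distributional correspondence function (here,
    cosine similarity of document-occurrence patterns) between the term
    and pivot j. X is a binary documents-by-terms occurrence matrix."""
    X = X.astype(float)
    norms = np.linalg.norm(X, axis=0) + 1e-12   # per-term occurrence norms
    P = X[:, pivot_idx]                          # occurrence vectors of pivots
    sims = (X.T @ P) / np.outer(norms, norms[pivot_idx])
    return sims                                  # shape: (n_terms, n_pivots)

# Toy corpus: 4 documents, 5 terms; pivots are terms 0 and 1.
X = np.array([[1, 0, 1, 0, 1],
              [1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0],
              [1, 0, 0, 0, 1]])
profiles = dci_profile(X, pivot_idx=[0, 1])
print(profiles.shape)  # (5, 2)
```

Because every term (in either domain, or either language) is described only by how it co-distributes with the same small set of pivots, source and target vocabularies become directly comparable in this low-dimensional common space.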

The journal of artificial intelligence research (Print) 55, pp. 131–163

DOI: 10.1613/jair.4762

2015, Presentation, ENG

A Comparison of Distributional Semantics Models for Polylingual Text Classification

Esuli A.; Moreo Fernández A.; Sebastiani F.

Polylingual Text Classification (PLTC) is a supervised learning task that consists of assigning class labels to documents belonging to different languages, assuming a representative set of training documents is available for each language. This scenario is increasingly frequent, given the large quantity of multilingual platforms and communities emerging on the Internet. This task is receiving increased attention in the text classification community also due to the new challenge it poses, i.e., how to effectively leverage polylingual resources in order to infer a multilingual classifier and to improve the performance of a monolingual one. As a response, the use of machine translation tools or multilingual dictionaries has been proposed. However, these resources are not always available, or not always free to use. In this work we analyse some important methods proposed in the literature that are machine-translation-free and dictionary-free, including Random Indexing, a method that, to the best of our knowledge, no one had previously tested on PLTC. We analyse these methods in terms of space and time efficiency, and propose a particular configuration of the Random Indexing method (that we dub Lightweight Random Indexing) that outperforms all other compared algorithms, while also showing a significantly reduced computational cost.

6th Italian Information Retrieval Workshop, Cagliari, Italy, 25-26/05/2015

2015, Conference proceedings contribution, ENG

Distributional correspondence indexing for cross-language text categorization

Esuli A.; Fernandez A.M.

Cross-Language Text Categorization (CLTC) aims at producing a classifier for a target language when the only available training examples belong to a different source language. Existing CLTC methods are usually affected by high computational costs, require external linguistic resources, or demand a considerable human annotation effort. This paper presents a simple, yet effective, CLTC method based on projecting features from both source and target languages into a common vector space, by using a computationally lightweight distributional correspondence profile with respect to a small set of pivot terms. Experiments on a popular sentiment classification dataset show that our method compares favorably with state-of-the-art methods, while requiring a significantly reduced computational cost and minimal human intervention.

ECIR 2015 - Advances in Information Retrieval. 37th European Conference on IR Research, Vienna, Austria, 29 March - 2 April 2015. Lecture notes in computer science 9022, pp. 104–109

DOI: 10.1007/978-3-319-16354-3_12
