2015, Book chapter, ENG
Grossi V.; Monreale A.; Nanni M.; Pedreschi D.; Turini F.
The problem of clustering a set of data is a textbook machine learning problem but also, at heart, a typical optimization problem: given an objective function, such as minimizing the intra-cluster distances or maximizing the inter-cluster distances, the task is to find an assignment of data points to clusters that achieves this objective. In this paper, we present a constraint programming model for centroid-based clustering and one for density-based clustering. As a key contribution, we show how the expressivity of the constraint programming formulation makes the standard problem easy to extend with further constraints that generate interesting variants of the problem. We illustrate this important aspect in two ways: first, we show how the constraint programming formulation of density-based clustering makes it very similar to the label propagation problem, and then we propose a variant of the standard label propagation approach.
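As a minimal sketch of the general idea (assuming Google OR-Tools CP-SAT; the paper's actual CP models are not reproduced here), clustering can be cast as a constraint optimization problem: assignment variables map points to k clusters and the objective minimizes the sum of intra-cluster pairwise distances. The data, the integer scaling and the symmetry-breaking constraint are illustrative choices.

```python
# Sketch only: clustering as constraint optimization with OR-Tools CP-SAT.
from itertools import combinations
from ortools.sat.python import cp_model

points = [(0, 0), (1, 1), (9, 9), (10, 10), (0, 10)]
k = 2
# Squared Euclidean distances: CP-SAT requires integer costs.
dist = {(i, j): (points[i][0] - points[j][0]) ** 2
               + (points[i][1] - points[j][1]) ** 2
        for i, j in combinations(range(len(points)), 2)}

model = cp_model.CpModel()
assign = [model.NewIntVar(0, k - 1, f"c{i}") for i in range(len(points))]
same = {}
for i, j in dist:
    same[i, j] = model.NewBoolVar(f"same_{i}_{j}")
    model.Add(assign[i] == assign[j]).OnlyEnforceIf(same[i, j])
    model.Add(assign[i] != assign[j]).OnlyEnforceIf(same[i, j].Not())
model.Add(assign[0] == 0)  # break cluster-relabeling symmetry
# Objective: total pairwise distance within clusters. Extra constraints
# (the "interesting variants" above) would simply be added to the model.
model.Minimize(sum(d * same[p] for p, d in dist.items()))

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print([solver.Value(a) for a in assign])  # cluster label for each point
```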
2013, Book chapter, ENG
Pedreschi D.; Ruggieri S.; Turini F.
Discrimination discovery from data consists in the extraction of discriminatory situations and practices hidden in a large amount of historical decision records. We discuss the challenging problems in discrimination discovery and present, in a unified form, a framework based on classification rule extraction and filtering on the basis of legally-grounded interestingness measures. The framework is implemented in the publicly available DCUBE tool. As a running example, we use a public dataset on credit scoring.
2013, Journal article, ENG
Romei A.; Ruggieri S.; Turini F.
Discovering contexts of unfair decisions in a dataset of historical decision records is a non-trivial problem. It requires the design of ad hoc methods and techniques of analysis, which have to comply with existing laws and with legal argumentation. While some data mining techniques have been adapted to the purpose, the state of the art of research still needs methodological refinements, the consolidation of a Knowledge Discovery in Databases (KDD) process, and, most of all, experimentation with real data. This paper contributes by presenting a case study on gender discrimination in a dataset of scientific research proposals, and by distilling from the case study a general discrimination discovery process. Gender bias in scientific research is a challenging problem that has been tackled in the social sciences literature by means of statistical regression. However, this approach is limited to testing a hypothesis of discrimination over the dataset as a whole. Our methodology couples data mining, for unveiling previously unknown contexts of possible discrimination, with statistical regression, for testing the significance of such contexts, thus obtaining the best of both worlds. (C) 2013 Elsevier Ltd. All rights reserved.
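As a rough sketch of this coupling (with hypothetical record fields, and a chi-square test standing in for the regression step used in the paper): rule mining proposes a candidate context, and a statistical test then checks whether the difference in treatment within that context is significant.

```python
# Sketch: test one mined context for a significant gender gap.
from scipy.stats import chi2_contingency

def test_context(records, context):
    """records: dicts with hypothetical 'gender'/'funded' fields;
    context: predicate selecting the subgroup found by rule mining."""
    sub = [r for r in records if context(r)]
    table = [[sum(1 for r in sub if r["gender"] == g and r["funded"] == f)
              for f in (True, False)]
             for g in ("F", "M")]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value  # small p-value: the gap is unlikely to be chance

# e.g. test_context(proposals, lambda r: r["area"] == "engineering")
```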
2012, Journal article, ENG
Furletti B.; Turini F.
Ontologies allow us to represent knowledge and data in implicit and explicit ways. Implicit knowledge can be derived by means of several deductive logic-based processes. This paper introduces a new way of extracting implicit knowledge from ontologies by means of a sort of link analysis of the T-box of the ontology, integrated with a data mining step on the A-box. The extracted implicit knowledge takes the form of "Influence Rules", i.e. rules structured as: if the property p1 of concept c1 has value v1, then the property p2 of concept c2 has value v2, with a given probability. The technique is completely general and applicable to any domain. The Influence Rules can be used to integrate existing knowledge or to support any other data mining process. A case study on an ontology describing intrusion detection illustrates the results of the method.
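A hypothetical encoding of such a rule as a plain data structure (the field names and the example values are illustrative, not the paper's):

```python
from dataclasses import dataclass

@dataclass
class InfluenceRule:
    # if <if_property> of <if_concept> has value <if_value> ...
    if_concept: str
    if_property: str
    if_value: str
    # ... then <then_property> of <then_concept> has value <then_value>
    then_concept: str
    then_property: str
    then_value: str
    probability: float

rule = InfluenceRule("Attack", "type", "dos", "Host", "service", "down", 0.8)
```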
2012, Journal article, ENG
Bellandi A.; Turini F.
Probabilistic reasoning is an essential feature when dealing with many application domains. Starting with the idea that ontologies are the right way to formalize domain knowledge and that Bayesian networks are the right tool for probabilistic reasoning, we propose an approach for extracting a Bayesian network from a populated ontology and for reasoning over it. The paper presents the theory behind the approach, its design, and examples of its use.
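A minimal sketch of the reasoning step, assuming the pgmpy library (this note's choice, not the paper's implementation): the network structure would come from the ontology's relations, and the probabilities would be estimated from its instances (the A-box). Concept names below are illustrative.

```python
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# One edge per (suitable) ontology relation; here a single toy relation.
net = BayesianNetwork([("Supplier", "Delay")])
net.add_cpds(
    TabularCPD("Supplier", 2, [[0.7], [0.3]]),        # prior from A-box counts
    TabularCPD("Delay", 2, [[0.9, 0.4], [0.1, 0.6]],  # conditional from A-box
               evidence=["Supplier"], evidence_card=[2]))
print(VariableElimination(net).query(["Delay"], evidence={"Supplier": 1}))
```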
2012, Journal article, ENG
Grossi V.; Turini F.
Mining data streams has become an important and challenging task for a wide range of applications. In these scenarios, data tend to arrive in multiple, rapid and time-varying streams, constraining data mining algorithms to look at each data item only once. Maintaining an accurate model, e.g. a classifier, while the stream goes by requires a smart way of keeping track of the data that have already passed. Such a synthetic structure has to serve two purposes: distilling as much information as possible out of past data, and allowing a fast reaction to concept drift, i.e. the change in the data trend that necessarily affects the model. The paper outlines novel data structures and algorithms to tackle the above problem when the model mined out of the data is a classifier. The introduced model and the overall ensemble architecture are presented in detail, including how the approach can be extended to handle numerical attributes. A large part of the paper discusses the experiments and comparisons with several existing systems, which show that the performance of our system in general, and its reaction to concept drift in particular, is at the top level.
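A generic sketch of a block-based stream ensemble (names and the weighting scheme are illustrative, not the paper's data structures): each block of arrivals trains a new member, and members that err on recent data lose weight, giving a fast reaction to drift.

```python
from collections import deque

class BlockEnsemble:
    def __init__(self, make_model, block_size=500, max_members=10):
        self.make_model, self.block_size = make_model, block_size
        self.members = deque(maxlen=max_members)  # (model, weight) pairs
        self.buffer = []

    def predict(self, x):
        votes = {}
        for model, weight in self.members:
            label = model.predict(x)
            votes[label] = votes.get(label, 0.0) + weight
        return max(votes, key=votes.get) if votes else None

    def learn(self, x, y):
        # Reward members that were right on the new item, penalize the rest.
        self.members = deque(((m, w * (1.1 if m.predict(x) == y else 0.9))
                              for m, w in self.members),
                             maxlen=self.members.maxlen)
        self.buffer.append((x, y))
        if len(self.buffer) >= self.block_size:  # train one member per block
            model = self.make_model()
            xs, ys = zip(*self.buffer)
            model.fit(list(xs), list(ys))
            self.members.append((model, 1.0))
            self.buffer.clear()
```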
2011, Conference paper, ENG
Luong B.T.; Ruggieri S.; Turini F.
With the support of the legally-grounded methodology of situation testing, we tackle the problems of discrimination discovery and prevention from a dataset of historical decisions by adopting a variant of k-NN classification. A tuple is labeled as discriminated if we can observe a significant difference of treatment between its neighbors belonging to a protected-by-law group and its neighbors not belonging to it. Discrimination discovery boils down to extracting a classification model from the labeled tuples. Discrimination prevention is tackled by changing the decision value for tuples labeled as discriminated before training a classifier. The approach of this paper overcomes legal weaknesses and technical limitations of existing proposals.
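A simplified sketch of the k-NN labeling step (the threshold, k and data layout are assumptions, not the paper's exact definitions):

```python
import numpy as np

def is_discriminated(X, decision, protected, idx, k=8, threshold=0.25):
    """Flag tuple idx if its k nearest protected neighbours received
    positive decisions markedly less often than its k nearest
    unprotected neighbours. decision/protected: 0-1 numpy arrays."""
    d = np.linalg.norm(X - X[idx], axis=1)
    others = np.arange(len(X)) != idx
    prot = np.where(protected.astype(bool) & others)[0]
    unprot = np.where(~protected.astype(bool) & others)[0]
    near_p = prot[np.argsort(d[prot])][:k]
    near_u = unprot[np.argsort(d[unprot])][:k]
    gap = decision[near_u].mean() - decision[near_p].mean()
    return gap >= threshold
```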
2011, Conference paper, ENG
Furletti B.; Turini F.
A method for extracting new implicit knowledge from an ontology schema by means of a combined inductive/deductive approach is presented. By giving a new interpretation to relationships that already exist in an ontology, we are able to return the extracted knowledge as weighted If-Then Rules among concepts. The technique, which combines data mining and link analysis, is completely general and applicable to any domain. Since the output is a set of "standard" If-Then Rules, it can be used to integrate existing knowledge or to support any other data mining process. An application of the method to an ontology representing companies and their activities is included.
2010, Journal article, ENG
Turini F.; Baglioni M.; Bellandi A.; Furletti B.; Pratesi C.
One of the main objectives of the European MUSING project is to design and test software tools to support the activities of small and medium-sized businesses. In this paper we examine financial risk management and, more specifically, the self-assessment of business plans. The role of intangible assets is discussed, and we report on how intangible assets can be collected, how they can be represented, taking into account their semantic relationships, and how they can be used to build an analytical tool for business plans. The basic technology embedded in the tool is the construction of classification trees, a well-known technique in inductive learning. We show how using knowledge of intangible assets can improve the construction of the classifier, as confirmed by the testing carried out so far.
2010, Journal article, ENG
Pedreschi D.; Ruggieri S.; Turini F.
In the context of civil rights law, discrimination refers to unfair or unequal treatment of people based on membership in a category or a minority, without regard to individual merit. Discrimination in credit, mortgage, insurance, labor market, and education has been investigated by researchers in economics and human sciences. With the advent of automatic decision support systems, such as credit scoring systems, the ease of data collection opens several challenges to data analysts for the fight against discrimination. In this article, we introduce the problem of discovering discrimination through data mining in a dataset of historical decision records, taken by humans or by automatic systems. We formalize the processes of direct and indirect discrimination discovery by modelling protected-by-law groups and contexts where discrimination occurs in a syntax based on classification rules. Basically, classification rules extracted from the dataset allow for unveiling contexts of unlawful discrimination, where the degree of burden over protected-by-law groups is formalized by an extension of the lift measure of a classification rule. In direct discrimination, the extracted rules can be mined directly in search of discriminatory contexts. In indirect discrimination, the mining process needs some background knowledge as a further input, for example census data, which, combined with the extracted rules, may allow for unveiling contexts of discriminatory decisions. A strategy for combining extracted classification rules with background knowledge is called an inference model. In this article, we propose two inference models and provide automatic procedures for their implementation. An empirical assessment of our results is provided on the German credit dataset and on the PKDD Discovery Challenge 1999 financial dataset.
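The extension of lift mentioned above is commonly stated, for a rule A, B -> C with A a protected group, B a context and C a negative decision, as elift = conf(A, B -> C) / conf(B -> C). A toy computation with made-up counts:

```python
def conf(n_body_and_head, n_body):
    return n_body_and_head / n_body

# Hypothetical counts in a context B (say, applicants for a given purpose):
n_B, n_B_deny = 100, 20    # all applicants in B, and denials among them
n_AB, n_AB_deny = 40, 14   # protected-group applicants in B, and denials
elift = conf(n_AB_deny, n_AB) / conf(n_B_deny, n_B)
print(elift)  # 0.35 / 0.20 = 1.75: the group is denied 1.75x as often
```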
2010, Conference paper, ENG
Pedreschi D.; Turini F.; Ruggieri S.
Discrimination discovery in databases consists in finding unfair practices against minorities that are hidden in a dataset of historical decisions. The DCUBE system implements the approach of [5], which is based on classification rule extraction and analysis, by centering the analysis phase around an Oracle database. The proposed demonstration guides the audience through the legal issues about discrimination hidden in data, and through several legally-grounded analyses to unveil discriminatory situations. The SIGMOD attendees will freely pose complex discrimination analysis queries over the database of extracted classification rules, once they are presented with the database relational schema, a few ad-hoc functions and procedures, and several snippets of SQL queries for discrimination discovery.
2010, Journal article, ENG
Pedreschi D.; Turini F.; Ruggieri S.
We present a reference model for finding (prima facie) evidence of discrimination in datasets of historical decision records in socially sensitive tasks, including access to credit, mortgage, insurance, labor market and other benefits. We formalize the process of direct and indirect discrimination discovery in a rule-based framework, by modelling protected-by-law groups, such as minorities or disadvantaged segments, and contexts where discrimination occurs. Classification rules, extracted from the historical records, allow for unveiling contexts of unlawful discrimination, where the degree of burden over protected-by-law groups is evaluated by formalizing existing norms and regulations in terms of quantitative measures. The measures are defined as functions of the contingency table of a classification rule, and their statistical significance is assessed by relying on a large body of statistical inference methods for proportions. Key legal concepts and lines of reasoning are then used to drive the analysis of the set of classification rules, with the aim of discovering patterns of discrimination, either direct or indirect. Analyses of affirmative action, favoritism and argumentation against discrimination allegations are also modelled in the proposed framework. Finally, we present an implementation, called LP2DD, of the overall reference model that integrates induction, through data mining classification rule extraction, and deduction, through a computational logic implementation of the analytical tools. The LP2DD system is put at work on the analysis of a dataset of credit decision records.
2008, Conference paper, ENG
Pedreschi D.; Ruggieri S.; Turini F.
In the context of civil rights law, discrimination refers to unfair or unequal treatment of people based on membership in a category or a minority, without regard to individual merit. Rules extracted from databases by data mining techniques, such as classification or association rules, when used for decision tasks such as benefit or credit approval, can be discriminatory in the above sense. In this paper, the notion of discriminatory classification rules is introduced and studied. Providing a guarantee of non-discrimination is shown to be a non-trivial task. A naive approach, such as removing all discriminatory attributes, is shown to be insufficient when other background knowledge is available. Our approach leads to a precise formulation of the redlining problem, along with a formal result relating discriminatory rules to apparently safe ones by means of background knowledge. An empirical assessment of the results on the German credit dataset is also provided.
2007, Journal article, ENG
Rinzivillo S.; Turini F.
We propose a general mechanism to represent spatial transactions in a way that allows the use of existing data mining methods. Our proposal allows the analyst to exploit the layered structure of geographical information systems in order to define the layers of interest and the relevant spatial relations among them. Given a reference object, it is possible to describe its neighborhood by considering the attributes of the object itself and of the objects related to it by the chosen relations. The resulting spatial transactions may either be treated like "traditional" transactions, by considering only the qualitative spatial relations, or their spatial extension can be exploited during the data mining process. We explore both cases. First, we tackle the problem of classifying a spatial dataset, taking into account the spatial component of the data to compute the statistical measure (i.e., the entropy) necessary to learn the model. Then, we consider the task of extracting spatial association rules, focusing on the qualitative representation of the spatial relations. The feasibility of the process has been tested by implementing the proposed method on top of a GIS tool and by analyzing real-world data. © Springer Science+Business Media, LLC 2007.
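A sketch of the first idea, classifying with a spatially-aware entropy: class frequencies are weighted by a spatial measure of each object (e.g. its area or length) rather than by plain counts. Names and the weighting are illustrative, not the paper's exact definition.

```python
from math import log2

def spatial_entropy(objects):
    """objects: iterable of (class_label, spatial_weight) pairs."""
    totals = {}
    for label, weight in objects:
        totals[label] = totals.get(label, 0.0) + weight
    total = sum(totals.values())
    return -sum((t / total) * log2(t / total)
                for t in totals.values() if t > 0)

print(spatial_entropy([("river", 12.5), ("road", 7.5), ("road", 5.0)]))
```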
2005, Conference paper, ENG
Miriam Baglioni; Barbara Furletti; Franco Turini
Classification is one of the most useful techniques for extracting meaningful knowledge from databases. Classifiers, e.g. decision trees, are usually extracted from a table of records, each of which represents an example. However, quite often in real applications there is other knowledge, e.g. held by experts of the field, that can be usefully combined with the knowledge hidden inside the examples. As a concrete example of this kind of knowledge we consider causal dependencies among the attributes of the data records. In this paper we discuss how to use such knowledge to improve the construction of classifiers. The causal dependencies are represented via Bayesian Causal Maps (BCMs), and our method is implemented as an adaptation of the well-known C4.5 algorithm. Copyright 2005 ACM.
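An illustrative sketch of the general idea (the combination rule is an assumption, not the paper's exact adaptation of C4.5): attribute selection is biased by causal strengths read off a Bayesian Causal Map.

```python
def best_attribute(attributes, info_gain, causal_strength, alpha=0.5):
    """info_gain: attribute -> C4.5 gain (ratio) from the examples;
    causal_strength: attribute -> prior relevance from the BCM."""
    def score(attr):
        # Convex combination of data-driven gain and causal knowledge.
        return ((1 - alpha) * info_gain(attr)
                + alpha * causal_strength.get(attr, 0.0))
    return max(attributes, key=score)
```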
2005, Journal article, ENG
Andrea Bracciali; Antonio Brogi; Franco Turini
Coding no longer represents the main issue in developing software applications. It is the design and verification of complex software systems that need to be addressed at the architectural level, following methodologies that permit us to clearly identify and design the components of a system, to understand precisely their interactions, and to formally verify the properties of the systems. Moreover, this process is made even more complicated by the advent of the "network-centric" model of computation, where open systems dynamically interact with each other in a highly volatile environment. Many of the techniques traditionally used for closed systems are inadequate in this context. We illustrate how the problem of modeling and verifying behavioural properties of open systems is addressed by different research fields, and how their results may contribute to a common solution. Building on this, we propose a methodology for modeling and verifying behavioural aspects of open systems. We introduce the IP-calculus, derived from the π-calculus process algebra, to describe behavioural features of open systems. We define a notion of partial correctness, acceptability, in order to deal with the intrinsic indeterminacy of open systems, and we provide an algorithmic procedure for its effective verification. © 2004 Elsevier Inc. All rights reserved.