In this section we propose a classification scheme and a set of evaluation criteria for existing privacy protection techniques. Both build upon, but do not strictly follow, the classification and evaluation criteria proposed in [9].

For the purpose of our study, we assume that a data set is a two-dimensional table, where each row (record) corresponds to an individual (case) and each column (attribute) corresponds to a property that describes individuals. Some attributes are confidential; the others are assumed to be public knowledge and thus possibly known to an intruder. Certain attributes, for example a name or a social security number, might uniquely, or almost uniquely, identify a record. We assume that such attributes are removed from the data set. However, this is usually insufficient to protect the privacy of individuals, as a combination of nonconfidential attributes may also identify an individual.
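The last point can be illustrated with a minimal sketch. The toy table, the attribute names (`zip`, `age`, `sex`, `diagnosis`) and the helper `unique_records` are hypothetical and are not taken from the chapter; the sketch only counts records that are singled out by their combination of nonconfidential values after direct identifiers have been removed.

```python
from collections import Counter

# Hypothetical toy table: direct identifiers (names, SSNs) are already removed.
# "diagnosis" plays the role of the confidential attribute.
records = [
    {"zip": "2300", "age": 34, "sex": "F", "diagnosis": "asthma"},
    {"zip": "2300", "age": 34, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "2300", "age": 51, "sex": "M", "diagnosis": "flu"},
    {"zip": "2308", "age": 34, "sex": "F", "diagnosis": "flu"},
]
public_attributes = ["zip", "age", "sex"]  # assumed known to an intruder


def unique_records(table, attributes):
    """Return indices of records whose value combination on `attributes` is unique."""
    def key(record):
        return tuple(record[a] for a in attributes)

    counts = Counter(key(r) for r in table)
    return [i for i, r in enumerate(table) if counts[key(r)] == 1]


# Records 2 and 3 are each the only record with their (zip, age, sex) combination,
# so an intruder who knows those public values learns the diagnosis.
exposed = unique_records(records, public_attributes)
```

In this toy table the first two records share the same public values and therefore hide each other, while the remaining two are exposed even though no name or identifier is present.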

The purpose of privacy-preserving data mining is to make the data set available for data mining tasks, such as classification, clustering and association rule mining, while at the same time preventing an intruder from re-identifying individual records and learning the values of confidential attributes. Moreover, very often the patterns that exist in the data set, for example association rules, are themselves considered sensitive and thus should also be protected from disclosure.

The data sets used for data mining can be either centralized or distributed; this does not refer to the physical location where data is stored, but rather to the availability and ownership of the data. Centralized data is owned by a single party, and it is either available at the computation site or can be sent to it. Distributed data is shared between two or more parties, which do not necessarily trust each other and/or do not wish to share their private data. A distributed data set can further be heterogeneous, or vertically partitioned, where each party owns the same set of records (rows in the two-dimensional table introduced above) but a different subset of attributes (columns). Alternatively, the data set can be homogeneous, or horizontally partitioned, where each party owns the same set of attributes but a different subset of records.
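The two kinds of partitioning can be sketched on a small table. The table contents, the record key `id` and both helper functions are assumptions made for illustration only; in particular, the chapter does not say how parties in a vertically partitioned setting link their records, so the shared key is a simplification.

```python
# Hypothetical table; "id" is an assumed record key shared by all parties.
table = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": 51, "income": 61000},
    {"id": 3, "age": 29, "income": 48000},
    {"id": 4, "age": 45, "income": 75000},
]


def vertical_partition(rows, key, attribute_groups):
    """Heterogeneous split: every party holds all records but only its own attributes."""
    return [
        [{a: r[a] for a in [key, *attrs]} for r in rows]
        for attrs in attribute_groups
    ]


def horizontal_partition(rows, n_parties):
    """Homogeneous split: every party holds all attributes but only some records."""
    return [rows[i::n_parties] for i in range(n_parties)]


party_a, party_b = vertical_partition(table, "id", [["age"], ["income"]])
site_1, site_2 = horizontal_partition(table, 2)
```

Here `party_a` and `party_b` both hold four records but different columns, whereas `site_1` and `site_2` hold two complete records each.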

11 Privacy-Preserving Data Mining 155

There are two classes of privacy-preserving techniques that can be applied in the context of data mining. The first class of techniques encrypts the data set, while still allowing data mining tasks. Such techniques are typically based on cryptographic protocols and are applied to distributed data sets where data mining tasks are to be performed on the union of all data sets, but the data owners are not prepared to share their own data sets with each other or any third party. These techniques are commonly referred to as secure multiparty computation (SMC).
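As a hedged illustration of the idea behind SMC, the sketch below simulates a secure sum based on additive secret sharing, a standard textbook building block that is not described in this chapter. All parties are simulated in a single process, the modulus is an assumed public parameter, inputs are assumed to be non-negative integers whose total is below the modulus, and the parties are assumed to follow the protocol honestly.

```python
import secrets

MODULUS = 2**61 - 1  # assumed public modulus, larger than any possible total


def make_shares(value, n_parties, modulus=MODULUS):
    """Split `value` into n random-looking shares that add up to it modulo `modulus`."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % modulus)
    return shares


def secure_sum(private_values, modulus=MODULUS):
    """Simulate a secure sum: no party ever sees another party's raw value."""
    n = len(private_values)
    # Party i splits its value into n shares and sends share j to party j.
    all_shares = [make_shares(v, n, modulus) for v in private_values]
    # Party j adds the shares it received and publishes only this partial sum.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % modulus for j in range(n)
    ]
    # The published partial sums reveal the total and nothing else.
    return sum(partial_sums) % modulus


# Three data owners jointly compute a count without revealing their own counts.
total = secure_sum([120, 75, 300])
```

Each individual share is uniformly random, so a single party learns nothing from the shares it receives; only the final total is disclosed. A data mining task on the union of the data sets would need many such primitives, which is one source of the communication cost discussed later in this section.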

The second class of privacy-preserving techniques modifies the data set before releasing it to users. The data can be modified in such a way as to protect either the privacy of individual records or the privacy of sensitive underlying patterns, or both. Data modification techniques include data swapping, noise addition, aggregation and suppression. Data swapping interchanges the attribute values among different records. Noise addition perturbs the attribute values by adding noise. Note that in statistical database terminology, data swapping is often seen as a special case of noise addition [1]. Aggregation refers both to combining a few attribute values into one and to grouping a few records together and replacing them with a group representative. Finally, suppression means replacing an attribute value by a symbol denoting a missing value. Note that missing values naturally occur in data sets when values are either inapplicable or unknown.
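The four modification techniques can be sketched in their simplest form. The function names, the Gaussian noise model, the fixed-width intervals and the `"?"` missing-value symbol are all assumptions chosen for illustration; practical techniques constrain each step to limit information loss, and aggregation of whole records into a group representative is not shown.

```python
import random


def swap_values(rows, attribute, rng):
    """Data swapping: randomly permute one attribute's values across records."""
    values = [r[attribute] for r in rows]
    rng.shuffle(values)
    return [{**r, attribute: v} for r, v in zip(rows, values)]


def add_noise(rows, attribute, sigma, rng):
    """Noise addition: add zero-mean Gaussian noise to a numerical attribute."""
    return [{**r, attribute: r[attribute] + rng.gauss(0, sigma)} for r in rows]


def aggregate_values(rows, attribute, width):
    """Aggregation: replace an integer value by the interval of `width` values containing it."""
    result = []
    for r in rows:
        low = (r[attribute] // width) * width
        result.append({**r, attribute: f"{low}-{low + width - 1}"})
    return result


def suppress(rows, attribute, should_suppress, missing="?"):
    """Suppression: replace selected values by a symbol denoting a missing value."""
    return [{**r, attribute: missing} if should_suppress(r) else dict(r) for r in rows]


rows = [
    {"age": 34, "income": 52000},
    {"age": 51, "income": 61000},
    {"age": 29, "income": 48000},
]
rng = random.Random(0)  # fixed seed only to make the sketch repeatable

swapped = swap_values(rows, "income", rng)
noisy = add_noise(rows, "income", 1000.0, rng)
binned = aggregate_values(rows, "age", 10)
masked = suppress(rows, "income", lambda r: r["age"] > 50)
```

Every function returns new records and leaves `rows` untouched, so the original and the modified data set can be compared afterwards, for instance to estimate information loss.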

In order to evaluate a privacy-preserving technique, the following properties should be considered.

1. Versatility refers to the ability of the technique to cater for various privacy requirements, types of data sets and data mining tasks. Versatility includes the following.

• Private: data versus patterns

Does the technique protect the privacy of data, underlying patterns, or both?

• Dataset: centralized or distributed (vertical or horizontal)

Is the technique suitable for centralized or distributed data sets, or both? If distributed, is it suitable for vertically or horizontally partitioned data?

• Attributes: numerical or categorical (Boolean)

Is the technique suitable for numerical or categorical attributes (or Boolean, as a special case of categorical)?

• Data mining task

For which data mining tasks is the technique suitable? For example, is it suitable for classification by decision trees, clustering or mining association rules?

2. Disclosure risk

Disclosure risk refers to the likelihood of sensitive information being inferred by a malicious data miner. It is inversely proportional to the level of security offered by the technique. Evaluating a disclosure risk is a very challenging task and is highly dependent on the nature of the technique. For example, in a technique that protects sensitive association rules, a disclosure risk may be measured by the percentage of the sensitive rules that can still be disclosed; in a technique that adds noise to protect individual records, a disclosure risk might be measured by the re-identification risk, a measure used in security of statistical databases (see Chap. 12, Sect. 12.5).

156 L. Brankovic, Md.Z. Islam, H. Giggins

3. Information loss

Modification of data by a privacy-preserving technique can lead to a loss of information. Information loss is highly dependent on the data mining task for which the data set is intended. For example, in mining association (classification) rules, information loss could be measured by the percentage of rules that have been destroyed or created by the technique, and/or by the reduction or increase in the support and confidence of all the rules; for clustering, information loss can be evaluated by the variance of the distances among the clustered items in the original database and the sanitized database [9].

4. Cost

Cost refers to both the computational cost and the communication cost between the collaborating parties [9]. The higher the cost, the lower the efficiency of the technique. Computational cost encompasses both preprocessing cost (e.g., initial perturbation of the values) and running cost (e.g., processing overheads). Communication costs become relevant when a privacy-preserving technique is applied to a distributed data set.

In Sect. 11.5 we illustrate these criteria by presenting a comparative study of a few privacy-preserving techniques. The techniques have been carefully selected to exemplify a broad range of methods.
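The rule-based measures mentioned under disclosure risk and information loss can be sketched as simple set comparisons. This is only one possible reading: the function names, the string encoding of rules and the choice of denominators are assumptions, and the rule sets are taken as given rather than mined from data.

```python
def rule_disclosure_risk(sensitive_rules, rules_after):
    """Percentage of sensitive rules that can still be mined after sanitization."""
    if not sensitive_rules:
        return 0.0
    return 100.0 * len(sensitive_rules & rules_after) / len(sensitive_rules)


def rule_information_loss(rules_before, rules_after, sensitive_rules):
    """Percentages of rules destroyed and created by the technique.

    Assumed denominators: destroyed rules are counted against the non-sensitive
    original rules, created rules against all original rules.
    """
    legitimate = rules_before - sensitive_rules
    destroyed = legitimate - rules_after      # useful rules that disappeared
    created = rules_after - rules_before      # rules that did not exist before
    pct_destroyed = 100.0 * len(destroyed) / len(legitimate) if legitimate else 0.0
    pct_created = 100.0 * len(created) / len(rules_before) if rules_before else 0.0
    return pct_destroyed, pct_created


# Hypothetical rule sets mined before and after sanitization.
rules_before = {"a->b", "a->c", "b->c", "c->d"}
sensitive = {"a->b", "c->d"}
rules_after = {"a->c", "c->d", "b->d"}

risk = rule_disclosure_risk(sensitive, rules_after)
destroyed_pct, created_pct = rule_information_loss(rules_before, rules_after, sensitive)
```

In this example one of the two sensitive rules survives (risk of 50%), one of the two legitimate rules is lost (50%), and one new rule appears relative to four original rules (25%). A fuller evaluation would also track changes in the support and confidence of the surviving rules.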