11.4.1 Data Swapping
Data swapping techniques were first devised in the context of secure statistical
11 Privacy-Preserving Data Mining 159
method involves replacing the original database with another, whereby values within sensitive attributes are exchanged between records. This exchange is done in such a way as to preserve the t-order statistics of the original data, where a t-order statistic is a statistic that can be specified and computed from the values of exactly t attributes. The main appeal of this method is that all of the original values are kept in the database, while at the same time re-identification of the records is made more difficult. Unfortunately, most real databases do not have a data swap, and when they do, the task of finding one is a difficult problem (probably intractable) [1]. More recently, data swapping has been proposed in privacy-preserving data mining, where the requirement for preserving t-order statistics has been relaxed [27]. Instead, classification rules are to be preserved. The class values are randomly swapped among the records belonging to the same heterogeneous leaf of a decision tree. This technique appears to preserve most classification rules, even if they are obtained by another classification method [27].
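The basic exchange can be illustrated with a minimal Python sketch. It is a deliberate simplification and not the method of [1] or [27]: it shuffles the values of one sensitive attribute across all records, which keeps every original value in the database and preserves the frequency counts of that attribute (a 1-order statistic), but makes no attempt to preserve higher-order statistics or classification rules. The function name `swap_attribute`, the field names, and the sample records are hypothetical.

```python
import random
from collections import Counter

def swap_attribute(records, attribute, rng):
    """Return copies of the records with the values of one attribute shuffled among them."""
    values = [r[attribute] for r in records]
    rng.shuffle(values)  # every original value survives, attached to a different record
    return [{**r, attribute: v} for r, v in zip(records, values)]

# Hypothetical microdata, for illustration only.
records = [
    {"age": 34, "zip": "2308", "diagnosis": "flu"},
    {"age": 51, "zip": "2300", "diagnosis": "diabetes"},
    {"age": 29, "zip": "2287", "diagnosis": "flu"},
    {"age": 62, "zip": "2305", "diagnosis": "asthma"},
]
swapped = swap_attribute(records, "diagnosis", random.Random(0))

# The frequency counts of the swapped attribute are unchanged.
counts_before = Counter(r["diagnosis"] for r in records)
counts_after = Counter(r["diagnosis"] for r in swapped)
```

Restricting the shuffle to the records that fall into one leaf of a decision tree, rather than the whole file, would move the sketch closer to the relaxed scheme described above.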
11.4.2 Noise Addition
The basic idea of noise addition (NA) is to add noise to the original numerical attribute values. This noise is typically drawn at random from a probability distribution having zero mean and small standard deviation. Generally, noise is added to the confidential attributes of a microdata file before the data is released. However, adding noise to both confidential and nonconfidential attributes can improve the level of privacy by making re-identification of the records more challenging. NA techniques can be used both to protect confidential values and the privacy of confidential patterns, such as association rules [28, 29].
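As a minimal sketch of this basic idea, the following Python fragment adds independent Gaussian noise with zero mean to a numerical attribute. The choice of a Gaussian distribution, the value of `sigma`, and the synthetic salary data are assumptions made for illustration; the techniques cited above may use other distributions.

```python
import random
import statistics

def add_noise(values, sigma, rng):
    """Add independent zero-mean Gaussian noise with standard deviation sigma to each value."""
    return [v + rng.gauss(0.0, sigma) for v in values]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
salaries = [rng.uniform(30_000, 90_000) for _ in range(10_000)]  # hypothetical data
noisy = add_noise(salaries, sigma=500.0, rng=rng)

# With zero-mean noise the sample mean moves very little,
# and with a small sigma the variance stays close to the original.
mean_shift = statistics.fmean(noisy) - statistics.fmean(salaries)
var_ratio = statistics.pvariance(noisy) / statistics.pvariance(salaries)
```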
It is desirable for any NA method to be unbiased, that is, for there to be no difference between the unperturbed statistic and its perturbed estimate. Early NA techniques were relatively unsophisticated and only protected against bias in estimating the mean of an attribute. NA techniques have gradually evolved to offer protection against bias in estimating variance and covariance between various attributes [30, 31]. A useful classification of bias types was presented by Muralidhar et al. [32]. However, noise addition techniques that were originally designed for statistical databases did not take into account the bias requirements specific to data mining applications. In 2002, Wilson and Rosen [33] investigated a classifier built from a data set perturbed with an existing statistical noise addition technique, and found that for a testing data set the classifier suffered from a lack of prediction accuracy. This suggests the existence of another type of bias, data mining bias, which is related to the change of patterns discovered/used in KDDM. Patterns of a data set include clusters, classification and association rule sets, and subpopulation correlations.
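A short derivation shows why simple additive noise protects the mean but not the variance or the correlation. It assumes, as an illustration rather than as a description of any cited method, that the noise terms are independent of the data and of each other; the symbols $X$, $Y$, $\varepsilon$ and $\sigma$ are introduced here and do not appear in the original text.

```latex
\begin{align*}
Y_i &= X_i + \varepsilon_i, \qquad
  \mathrm{E}[\varepsilon_i] = 0, \quad
  \mathrm{Var}(\varepsilon_i) = \sigma_i^2, \quad i = 1, 2 \\
\mathrm{E}[Y_i] &= \mathrm{E}[X_i]
  && \text{(mean is unbiased)} \\
\mathrm{Var}(Y_i) &= \mathrm{Var}(X_i) + \sigma_i^2
  && \text{(variance is inflated)} \\
\mathrm{Cov}(Y_1, Y_2) &= \mathrm{Cov}(X_1, X_2)
  && \text{(covariance is unchanged)} \\
\rho(Y_1, Y_2) &=
  \frac{\mathrm{Cov}(X_1, X_2)}
       {\sqrt{\bigl(\mathrm{Var}(X_1) + \sigma_1^2\bigr)\bigl(\mathrm{Var}(X_2) + \sigma_2^2\bigr)}}
  && \text{(correlation is attenuated)}
\end{align*}
```

Under these assumptions the perturbed mean is an unbiased estimate, whereas the variance and the correlation computed from the perturbed data are biased unless the noise is designed to compensate.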
One of the techniques specifically developed for privacy-preserving data mining controls the data mining bias by first building a decision tree from
160 L. Brankovic, Md.Z. Islam, H. Giggins
the original data set [34]. This is done in order to learn the existing patterns in the data set. The noise is then added in a controlled way so that the patterns of the data set remain unaffected. Finally, the perturbed data set is released to data miners, who now have full access to individual records and can build their own decision trees. It was experimentally shown that decision trees obtained from a perturbed data set are very similar to decision trees obtained from the original data set. Moreover, the prediction accuracies of the classifiers obtained from the original and perturbed data sets are comparable, so this technique does not suffer from data mining bias. This technique has been extended to incorporate categorical attributes [35], and can also incorporate existing statistical database perturbation methods, such as GADP or EGADP, in order to preserve statistical parameters along with data mining patterns [34]. Additionally, the perturbed data set can be used for other data mining tasks, such as clustering. This is possible due to the low amount of noise that has been added to the data.
Another NA technique [36] adds a large amount of random noise and significantly changes the distribution of the original data values. With this technique it is in general no longer possible to estimate the original values of individual records precisely; instead, a reconstruction procedure is used to regenerate the original distribution. A decision tree built on the reconstructed distributions has very good prediction accuracy, even for higher levels of noise [36]. Another advantage of this technique is that it is also applicable to distributed data sets, as every party is free to add random noise to their own data set before sharing it with other parties. This technique suffers from information loss in the reconstructed distribution, which can be minimized by a reconstruction algorithm called expectation maximization (EM) [37]. The EM algorithm works best for a data set having a large number of records.
Kargupta et al. questioned the usefulness of adding random noise for preserving privacy [38, 39]. They proposed a spectral filtering technique that is able to closely estimate the original data set from the perturbed one when there exists a correlation between data samples. Consequently, Kargupta et al. explored random multiplicative and colored noise as an alternative to independent white additive noise.
One of the challenging problems in this area is adding noise to categorical attributes. Categorical values lack a natural inherent ordering, which makes it difficult to control the amount of noise added to them. One possible solution is to cluster them in order to learn about similarity between different values [40]. Various categorical clustering techniques are available, such as CACTUS, ROCK, COOLCAT, CORE and DETECTIVE [35]. DETECTIVE obtains attribute specific and mutually exclusive clusters of records. Values of an attribute are considered to be similar if they appear in records belonging to the same cluster. A possible way to perturb categorical values is to change them into other similar categories with a given, relatively high probability and change them into dissimilar values with a low probability.
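The perturbation rule in the last sentence can be sketched as follows. The clusters of similar values are supplied by hand here; in practice they would come from a clustering technique such as those listed above, none of which is implemented in this sketch. The function name `perturb_category`, the occupation values, and the two probabilities are hypothetical.

```python
import random

def perturb_category(value, clusters, p_similar, p_dissimilar, rng):
    """Move value to a similar category with probability p_similar,
    to a dissimilar one with probability p_dissimilar, and keep it otherwise."""
    own = next(c for c in clusters if value in c)  # raises StopIteration for an unknown value
    peers = sorted(own - {value})                  # similar values: same cluster
    others = sorted(v for c in clusters if c is not own for v in c)  # dissimilar values
    u = rng.random()
    if u < p_similar:
        return rng.choice(peers) if peers else value
    if u < p_similar + p_dissimilar:
        return rng.choice(others) if others else value
    return value

# Hypothetical, hand-made clusters of similar occupations.
clusters = [
    {"nurse", "doctor", "surgeon"},
    {"teacher", "lecturer"},
    {"plumber", "electrician"},
]
rng = random.Random(7)
column = ["nurse", "teacher", "plumber", "doctor", "lecturer"]
perturbed = [perturb_category(v, clusters, 0.6, 0.05, rng) for v in column]
```

Sorting the candidate values before drawing from them keeps the result reproducible for a given seed, since the iteration order of a Python set of strings can differ between runs.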
11.4.3 Aggregation
In aggregation (also known as generalization or global recoding) a group of k records of a data set is replaced by a representative record. An attribute value of the representative record is generally the mean of the corresponding attribute values of the original k records. Generalization typically results in some information loss. Prior to generalization, the original records are often clustered (into mutually exclusive groups of k records) in order to reduce information loss. However, in the case of lower information loss, disclosure risk is higher because an intruder can usually make a better estimate of an attribute value. An appropriate balance of information loss and disclosure risk can be obtained by adjusting the cluster size, i.e., the number of records in each cluster [41].
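A minimal sketch of this idea for a single numerical attribute is given below. Sorting the values and cutting them into consecutive groups of k is a simple stand-in for the clustering step and is an assumption of this sketch, not the procedure of [41]; the function name `microaggregate` and the salary figures are hypothetical.

```python
import statistics
from collections import Counter

def microaggregate(values, k):
    """Replace each value by the mean of its group of at least k neighbouring values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()      # a short final group would expose its members,
        groups[-1].extend(tail)  # so it is merged into the previous group
    result = [0.0] * len(values)
    for group in groups:
        mean = statistics.fmean(values[i] for i in group)
        for i in group:
            result[i] = mean
    return result

salaries = [31000, 52000, 29500, 75000, 50500, 78000, 30500]  # hypothetical data
aggregated = microaggregate(salaries, k=3)
group_sizes = Counter(aggregated)  # how many records share each released value
```

A larger k makes each released value less informative about any single record, at the price of higher information loss, which is the trade-off described above.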
Aggregation also refers to a transformation which makes an attribute value less informative. For example, an exact date of birth may be replaced by the year of birth, and an exact salary may be rounded to the nearest thousand. Excessive application of generalization may make the released data useless, for example, replacing the exact date of birth by the century of birth [42].
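The two examples in this paragraph can be written as a small recoding function. This is only an illustrative sketch; the function name `generalize_record` and the field names are hypothetical.

```python
from datetime import date

def generalize_record(record):
    """Replace the exact date of birth by the year and round the salary to the nearest thousand."""
    return {
        "birth_year": record["date_of_birth"].year,
        # round() with a negative digit count rounds to thousands;
        # exact halves (e.g. 52500) go to the nearest even thousand.
        "salary": round(record["salary"], -3),
    }

original = {"date_of_birth": date(1984, 7, 19), "salary": 52499}  # hypothetical record
released = generalize_record(original)
```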
11.4.4 Suppression
Suppression deletes (suppresses) sensitive data values prior to the release of the microdata. An important issue in suppression is to minimize information loss by minimizing the number of suppressed attribute values. At the same time, suppression should be resistant to an intruder’s attempt to predict the suppressed attribute value with reasonable accuracy. Such a prediction can be attempted by building a classifier from the released microdata, where the attribute in question is considered to be the class [43]. For some applications, such as medical data, suppression is preferred over noise addition. Suppression has also been used for association and classification rule confusion [44, 45].
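The following sketch illustrates both sides of this trade-off: it suppresses a sensitive value in selected records and then plays the intruder by guessing each suppressed value from the released records. The guesser is a deliberately naive stand-in for a real classifier (it takes the most common released value among records that share one other attribute) and is not the approach of [43]; all function names, field names, and records are hypothetical.

```python
from collections import Counter, defaultdict

def suppress(records, attribute, should_suppress):
    """Return copies of the records with the attribute set to None where should_suppress is true."""
    return [{**r, attribute: None} if should_suppress(r) else dict(r) for r in records]

def guess_suppressed(released, attribute, predictor):
    """Guess each suppressed value as the most common released value
    among records with the same value of the predictor attribute."""
    votes = defaultdict(Counter)
    for r in released:
        if r[attribute] is not None:
            votes[r[predictor]][r[attribute]] += 1
    guesses = []
    for r in released:
        if r[attribute] is None:
            counts = votes.get(r[predictor])
            guesses.append(counts.most_common(1)[0][0] if counts else None)
    return guesses

# Hypothetical medical microdata; "flagged" marks records whose diagnosis is suppressed.
records = [
    {"ward": "oncology", "diagnosis": "cancer", "flagged": False},
    {"ward": "oncology", "diagnosis": "cancer", "flagged": True},
    {"ward": "oncology", "diagnosis": "cancer", "flagged": False},
    {"ward": "general", "diagnosis": "flu", "flagged": False},
    {"ward": "general", "diagnosis": "fracture", "flagged": True},
    {"ward": "general", "diagnosis": "flu", "flagged": False},
]
released = suppress(records, "diagnosis", lambda r: r["flagged"])
guesses = guess_suppressed(released, "diagnosis", "ward")
```

In this toy data the first suppressed diagnosis is recovered correctly from the ward alone, while the second is not, which shows why suppressing a value is insufficient when other released attributes predict it well.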