4.4 MARCO INSTITUCIONAL
4.4.3 Sistema Institucional de Evaluación de los Estudiantes de la IEDAR
Data perturbation, a common data-hiding approach with roots in statistical databases, comprises techniques that distort data element-wise; methods that project original data to a subspace or a smaller space, that is, reduce the dimensions or attributes of data; data microaggregation; data swapping; data transformation; and probability distribution. The idea behind perturbing data is to solve data mining problems without access to the actual data. Normally, original data distributions are reconstructed from known perturbing distributions to conduct mining. Data perturbation relies on the fact that users are not equally protective of all values in their records. Hence, while they may not mind giving true values of certain fields, they may agree to divulge some others only as modified values. Most data perturbation techniques have been applied to sensitive numerical attributes, although there are also examples of perturbing categorical and Boolean data in the literature.
Element-wise perturbation of numerical data, also known as random perturbation, distorts sensitive attributes by adding random noise to, or multiplying random noise with, each value of the sensitive attribute. It is used to distort the most frequently encountered numerical or quantitative data, such as salary, age and account balances. Additive perturbation (AP) was first proposed by Agrawal and Srikant (2000). The technique involves n independent and identically distributed sensitive random variables X_i, i = 1, 2, ..., n, each with the same distribution as the random variable X, and their n original data values x_1, x_2, x_3, ..., x_n. To hide these data values, n independent samples y_1, y_2, y_3, ..., y_n are drawn from n independent random variables Y_i, i = 1, 2, ..., n, each with the same distribution as the random variable Y, which has mean μ = 0 and standard deviation σ. The owner of the data shares the perturbed values x_1 + y_1, x_2 + y_2, x_3 + y_3, ..., x_n + y_n and the cumulative distribution function F_Y of Y with the public. Agrawal and Srikant (2000) show that it is possible to accurately estimate, or reconstruct, the distribution F_X of the original data X from the perturbed data. They demonstrate this by using the reconstructed distributions to build decision tree classifiers and showing that the accuracy of these classifiers is comparable to the accuracy of classifiers built with the original data.
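The additive scheme described above can be sketched in a few lines. This is only an illustration under stated assumptions: the noise is taken to be Gaussian (the text requires only mean 0 and standard deviation σ), the salary data are synthetic, and the function name additive_perturbation is hypothetical.

```python
import random
import statistics

def additive_perturbation(values, sigma, rng):
    """Return x_i + y_i, with each y_i drawn independently from N(0, sigma).

    Gaussian noise is an assumption of this sketch; the scheme only
    requires noise with mean 0 and standard deviation sigma.
    """
    return [x + rng.gauss(0.0, sigma) for x in values]

rng = random.Random(42)  # fixed seed, only so the sketch is repeatable
salaries = [rng.uniform(30_000, 90_000) for _ in range(10_000)]  # synthetic data
perturbed = additive_perturbation(salaries, sigma=10_000, rng=rng)

# Individual values are distorted, but because the noise has mean 0 the
# sample mean barely moves, and for independent noise the variance grows
# by roughly sigma squared: Var(X + Y) = Var(X) + Var(Y).
mean_shift = abs(statistics.fmean(perturbed) - statistics.fmean(salaries))
var_gap = statistics.pvariance(perturbed) - statistics.pvariance(salaries)
print(f"shift in mean: {mean_shift:.1f}, increase in variance: {var_gap:.3g}")
```

The data owner would release the perturbed list together with the noise distribution, here N(0, 10 000), and keep the original salaries private.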
It may be noted that the exact distribution of X is impossible to reconstruct. In fact, the accuracy with which a data distribution can be estimated depends on the reconstruction algorithm, and one of the criticisms against Agrawal and Srikant’s approach is that they ignore the convergence behavior of their proposed reconstruction algorithm. It is believed that a given reconstruction algorithm may not always converge, and even if it does, there is no guarantee that it provides a reasonable estimate of the original distribution. A reconstruction algorithm that not only converges, but also does so to the maximum likelihood estimate of the original distribution is proposed by Agrawal and Aggarwal (2001). It is known as the Expectation Maximization (EM) algorithm. For very large data sets, the EM algorithm reconstructs the distribution with little or almost no information loss.
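To make the idea of reconstruction concrete, the following is a minimal sketch of an iterative, EM-style update on a discretized grid. It is written in the spirit of the reconstruction procedures discussed above and is not the authors' exact algorithm. The Gaussian noise, the fixed grid of bin midpoints, the fixed number of iterations and the names normal_pdf and reconstruct are all assumptions of this sketch.

```python
import math
import random

def normal_pdf(t, sigma):
    """Density of N(0, sigma) at t (the publicly known noise distribution)."""
    return math.exp(-0.5 * (t / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def reconstruct(perturbed, midpoints, sigma, iterations=50):
    """Estimate the probability mass of the original data on each grid point.

    Each pass re-weights the current estimate by how well every grid point
    explains each perturbed value w = x + y (a Bayes update averaged over
    all released values).
    """
    k = len(midpoints)
    probs = [1.0 / k] * k  # start from a uniform guess
    # likelihood of each perturbed value given each candidate original value
    lik = [[normal_pdf(w - a, sigma) for a in midpoints] for w in perturbed]
    for _ in range(iterations):
        new = [0.0] * k
        for row in lik:
            denom = sum(l * p for l, p in zip(row, probs))
            if denom == 0.0:
                continue  # value not explainable by any grid point
            for j in range(k):
                new[j] += row[j] * probs[j] / denom
        total = sum(new)
        probs = [v / total for v in new]
    return probs

rng = random.Random(0)
sigma = 10.0
# synthetic bimodal originals: half in [10, 30], half in [70, 90]
original = [rng.uniform(10, 30) for _ in range(1000)] + \
           [rng.uniform(70, 90) for _ in range(1000)]
released = [x + rng.gauss(0.0, sigma) for x in original]

midpoints = [5.0 + 10.0 * i for i in range(10)]  # grid 5, 15, ..., 95
estimate = reconstruct(released, midpoints, sigma)
for m, p in zip(midpoints, estimate):
    print(f"{m:5.1f}  {p:.3f}")
```

The estimate is a distribution over grid points only. It says nothing about which released value belongs to which original value, and a fixed iteration count is used here instead of a convergence test.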
Another objection raised against the method suggested by Agrawal and Srikant is that it does not account for the fact that knowing the original distribution can cause
a breach of data privacy. Applying Agrawal and Srikant’s technique to categorical
data, Evfimevski, Srikant, Agrawal, and Gehrke (2002) simply replace each item in
a transaction by a new item using probability p. They show that while it is feasible to recover association rules and preserve privacy of individual transactions using
this approach, the discovered rules can be exploited to find whether an item was
present or absent in the original transaction, which also constitutes a breach of data privacy. They, therefore, propose a technique, which in addition to replacing some items also inserts “false” items into a transaction such that one is as likely to see a “false” itemset as a “true” one.
Kargupta et al. (2003a, 2003b) question the use of additive perturbation and point out that random additive noise can be filtered out in many cases, which is likely to compromise privacy. Therefore, any noise that is a function of the original values, such as noise that results from multiplication, is more likely to produce better results in terms of privacy protection. Multiplicative perturbation (MP) is performed either by multiplying each data element x_i by a random number r_i with mean 1 and small variance (Muralidhar, Batrah, & Kirs, 1995), or by first taking the log of the data elements, adding random noise, and then taking the antilog of the noise-added data (Kim & Winkler, 2003).
There are inherent differences between the additive and the multiplicative data perturbation approaches. While the results of AP are independent of the original data values, that is, the expected level of perturbation is the same regardless of whether the original data value is 10 or 100, MP results in distortion that is proportional to the original values, that is, the distortion is less if the original value is 10 and more if it is 100. In general, the higher the variance of the perturbing variable, the higher the distortion and privacy.
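The contrast between the two approaches can be checked numerically. The sketch below assumes Gaussian noise throughout and uses hypothetical helper names; it compares the average absolute distortion for the original values 10 and 100 under additive noise, multiplicative noise, and the log / add noise / antilog variant.

```python
import math
import random
import statistics

def additive(x, sigma, rng):
    """x + y with y ~ N(0, sigma)."""
    return x + rng.gauss(0.0, sigma)

def multiplicative(x, sigma, rng):
    """x * r with r ~ N(1, sigma), sigma small."""
    return x * rng.gauss(1.0, sigma)

def log_additive(x, sigma, rng):
    """antilog(log(x) + e) with e ~ N(0, sigma); requires x > 0."""
    return math.exp(math.log(x) + rng.gauss(0.0, sigma))

def mean_abs_distortion(perturb, x, sigma, rng, trials=5000):
    """Average |perturbed - original| over repeated perturbations of x."""
    return statistics.fmean([abs(perturb(x, sigma, rng) - x) for _ in range(trials)])

rng = random.Random(7)
ap_ratio = (mean_abs_distortion(additive, 100.0, 2.0, rng)
            / mean_abs_distortion(additive, 10.0, 2.0, rng))
mp_ratio = (mean_abs_distortion(multiplicative, 100.0, 0.05, rng)
            / mean_abs_distortion(multiplicative, 10.0, 0.05, rng))
log_ratio = (mean_abs_distortion(log_additive, 100.0, 0.05, rng)
             / mean_abs_distortion(log_additive, 10.0, 0.05, rng))

# AP distorts 10 and 100 by about the same amount (ratio near 1);
# both multiplicative variants distort 100 about ten times as much as 10.
print(f"AP: {ap_ratio:.2f}  MP: {mp_ratio:.2f}  log-based MP: {log_ratio:.2f}")
```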
While both additive and multiplicative perturbation techniques preserve data distributions, neither of them preserves distances between data points. This means that they cannot be used for simple yet efficient and widely used Euclidean distance-based mining algorithms such as k-means clustering1 and k-nearest neighbor classification2. Distance measures are commonly used for computing the dissimilarity of objects described by variables, that is, objects are clustered based on the distance between them. Adding or multiplying random noise to attributes or variables can remove clusters and neighbors (also defined by distance from the given object) where they do initially exist. Oliveira and Zaïane (2003a) and Chen and Liu (2005), therefore, discuss the use of random rotation for privacy preserving clustering and classification. A multiplicative perturbation technique that preserves distance on expectation and is also ideal for large-scale data mining is discussed in Liu et al. (2006a, 2006b). Element-wise random perturbation also does not fare well for association rule mining of numerical data, because association rules depend on individual data values, and if these data values are perturbed element-wise, the associations and correlations between them are distorted. This is also pointed out by Wilson and Rosen (2003), who show empirically that perturbation, regardless of whether it is additive or multiplicative, alters the relationships between confidential and non-confidential attributes.
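A small two-dimensional example illustrates why rotation is attractive for distance-based mining. This is a sketch under simplifying assumptions: the cited work uses rotations in higher dimensions, whereas here a single planar rotation by a randomly chosen angle is applied, and the helper name rotate is hypothetical.

```python
import math
import random

def rotate(point, theta):
    """Rotate a 2-D point about the origin by angle theta (radians)."""
    x, y = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)

rng = random.Random(1)
theta = rng.uniform(0.0, 2.0 * math.pi)  # rotation angle kept secret by the owner
a, b = (1.0, 2.0), (4.0, 6.0)            # two records, Euclidean distance 5

d_original = math.dist(a, b)
d_rotated = math.dist(rotate(a, theta), rotate(b, theta))

# element-wise additive noise, for comparison
noisy = [tuple(v + rng.gauss(0.0, 1.0) for v in p) for p in (a, b)]
d_noisy = math.dist(noisy[0], noisy[1])

print(f"original {d_original:.3f}  rotated {d_rotated:.3f}  noisy {d_noisy:.3f}")
```

The rotated records keep their mutual distance up to floating-point error, so clusters and nearest neighbors are unchanged, while the noisy records generally do not.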
Element-wise perturbation of categorical data is used by Evfimevski et al. (2002) to conduct secure mining of association rules and by Du and Zahn (2003) to build decision-tree classifiers. The technique used is known as Randomized Response and is mainly suitable to perturb categorical data. It was first proposed by Warner (1965) to hide sensitive responses in an interview. The technique allows interviewees to furnish confidential information only in terms of a probability.
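The following sketch shows a simplified randomized-response scheme for a yes/no attribute. It is an illustration in the spirit of the technique, not a reproduction of the original interview design: each respondent is assumed to report the true answer with probability p and its opposite otherwise, and the function names are hypothetical.

```python
import random

def randomize(truth, p, rng):
    """Report the true yes/no answer with probability p, otherwise its opposite."""
    return truth if rng.random() < p else not truth

def estimate_proportion(responses, p):
    """Estimate the true 'yes' proportion from the randomized answers.

    If pi is the true proportion, the expected observed proportion is
    p * pi + (1 - p) * (1 - pi); solving for pi gives the formula below.
    """
    if p == 0.5:
        raise ValueError("p = 0.5 carries no information about the true answers")
    observed = sum(responses) / len(responses)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)

rng = random.Random(3)
p = 0.8
truths = [rng.random() < 0.3 for _ in range(20_000)]  # synthetic: about 30% 'yes'
responses = [randomize(t, p, rng) for t in truths]
estimate = estimate_proportion(responses, p)
print(f"estimated proportion of 'yes': {estimate:.3f}")
```

No individual response reveals the respondent's true answer with certainty, yet the aggregate proportion can still be estimated.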
Projection-based perturbation involves the use of techniques such as Principal Component Analysis (PCA) and random projection to map the original data to a subspace in such a way that properties of the original space remain intact. PCA is a common technique for finding patterns, that is, highlighting similarities and differences, in data of high dimension. To analyze a high dimensional data set, PCA simplifies it by reducing it to lower dimensions. Thus, if a data set consists of N tuples and K dimensions, PCA searches for k orthogonal, that is, mutually perpendicular, K-dimensional vectors that can best be used to represent the data, where k ≤ K. The orthogonal vectors are obtained by factoring a covariance matrix into eigenvalues and eigenvectors. The eigenvector associated with the highest eigenvalue is the first principal component of the data set and accounts for the maximum variability in the data. The lowest eigenvalues and their corresponding eigenvectors may be ignored without much loss of information. PCA has been used as a data reconstruction technique by Huang, Du, and Chen (2005), and random projection is proposed by Liu et al. (2006b) to preserve both the correlations between attributes and the Euclidean distances between data vectors by multiplying the original data with a lower dimension random matrix. While PCA is unsuitable for large, high-dimensional data due to computational complexity of the order O(K^2 N) + O(K^3), random projection is not suitable for small data sets because of loss of orthogonality in data of small size. Additionally, the randomness associated with the performance of random projection is not practical in the real world.
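Random projection can be sketched as a single matrix multiplication. The version below is a minimal illustration, not the cited authors' procedure: it assumes a Gaussian random matrix whose entries have variance 1/k, which keeps squared Euclidean distances unchanged on expectation, and the function name random_projection is hypothetical.

```python
import math
import random

def random_projection(rows, k, rng):
    """Project d-dimensional rows to k dimensions with a random d x k matrix.

    Entries are i.i.d. N(0, 1/k) (an assumption of this sketch), so the
    expected squared length of a projected vector equals its original
    squared length.
    """
    d = len(rows[0])
    scale = 1.0 / math.sqrt(k)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(k)] for _ in range(d)]
    return [[sum(row[i] * matrix[i][j] for i in range(d)) for j in range(k)]
            for row in rows]

rng = random.Random(11)
d, k = 300, 100
points = [[rng.uniform(0.0, 1.0) for _ in range(d)] for _ in range(4)]  # synthetic
projected = random_projection(points, k, rng)

# ratio of projected to original distance for one pair of records
ratio = math.dist(projected[0], projected[1]) / math.dist(points[0], points[1])
print(f"distance ratio after projecting {d} -> {k} dimensions: {ratio:.3f}")
```

The ratio is close to 1 but not exactly 1, and it varies with the random matrix drawn. That variation grows as k shrinks.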
Data microaggregation is a widely used technique to hide sensitive microdata by aggregating records into groups and releasing the mean of the group to which the sensitive data belong, rather than the sensitive values themselves. In addition to the techniques mentioned here, all techniques listed under data transformation below are also examples of data microaggregation. Data microaggregation has been used to secure statistical databases in (Domingo-Ferrer & Mateo-Sanz, 2002; [...] efficient polynomial algorithm to optimally aggregate and privatize a single attribute, Domingo-Ferrer and Mateo-Sanz (2002) consider a clustering technique to aggregate all attributes, in addition to univariate (single attribute) microaggregation. Multivariate (multiple attributes) microaggregation has also been proposed in the area of data mining by Aggarwal and Yu (2004) and lately by Li and Sarkar (2006). Aggarwal and Yu (2004) split the original data into clusters of predefined size. The mean, covariance and correlations of the original data are preserved in these clusters and are used to simulate replacement data, which is disseminated for mining purposes. Li and Sarkar (2006) propose a kd-tree based approach for PPDM. This method involves selecting a non-sensitive attribute with the largest variance from all given numeric attributes and using the median of the selected attribute to divide a given data set into two groups. Selecting an attribute with the largest variance to start the splitting process optimizes the process of partitioning. Splitting data at the median of the selected attribute ensures that within each subset, the numeric values of the attribute selected to split data are relatively close to each other. The splitting process continues on the partitioned sets until the leaf nodes contain the values for the sensitive attribute. Sensitive and homogeneous values at each leaf are then replaced by the average of all sensitive values at that leaf. Both of these multivariate microaggregation techniques have their limitations. Aggarwal and Yu’s (2004) approach does not guarantee that only records closest in statistical characteristics comprise a group. The kd-tree based approach does not discuss why and how patterns are preserved.
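The kd-tree style partitioning described above can be sketched recursively. This is a rough illustration and not the published algorithm: the stopping rule (a maximum leaf size, max_leaf), the re-selection of the splitting attribute inside every partition, the record layout and the function name kd_microaggregate are all assumptions introduced here.

```python
import statistics

def kd_microaggregate(records, split_attrs, sensitive, max_leaf=2):
    """Replace each sensitive value by the mean of its kd-tree leaf.

    records     : list of dicts holding numeric attributes
    split_attrs : names of the non-sensitive attributes used for splitting
    sensitive   : name of the attribute to be masked
    max_leaf    : assumed stopping rule (largest allowed leaf size)
    The order of the returned records follows the tree, not the input.
    """
    if len(records) <= max_leaf:
        leaf_mean = statistics.fmean([r[sensitive] for r in records])
        return [{**r, sensitive: leaf_mean} for r in records]
    # choose the non-sensitive attribute with the largest variance here
    attr = max(split_attrs,
               key=lambda a: statistics.pvariance([r[a] for r in records]))
    ordered = sorted(records, key=lambda r: r[attr])
    mid = len(ordered) // 2  # split at the median position
    return (kd_microaggregate(ordered[:mid], split_attrs, sensitive, max_leaf)
            + kd_microaggregate(ordered[mid:], split_attrs, sensitive, max_leaf))

# synthetic records: age and income are non-sensitive, balance is sensitive
records = [
    {"age": 23, "income": 31_000, "balance": 1_200.0},
    {"age": 25, "income": 35_000, "balance": 1_800.0},
    {"age": 31, "income": 52_000, "balance": 4_100.0},
    {"age": 34, "income": 58_000, "balance": 3_900.0},
    {"age": 45, "income": 75_000, "balance": 9_500.0},
    {"age": 47, "income": 79_000, "balance": 8_700.0},
    {"age": 58, "income": 96_000, "balance": 15_000.0},
    {"age": 61, "income": 99_000, "balance": 14_200.0},
]
masked = kd_microaggregate(records, ["age", "income"], "balance", max_leaf=2)
for r in masked:
    print(r)
```

Because every leaf releases its own mean, the overall sum and mean of the sensitive attribute are unchanged, while no individual balance is disclosed for leaves holding more than one record.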
Data swapping was first proposed by Dalenius and Reiss (1982). The idea behind swapping is to interchange values of specific records in such a way that the underlying statistics remain unchanged. This ensures that the data retains its utility for mining purposes even after the sensitive values are masked.
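The simplest form of swapping is a random permutation of one attribute across records, sketched below. This is an assumption-laden illustration rather than the original procedure: a plain permutation preserves only the single-attribute statistics of the swapped column (its mean, variance and frequency counts), and the function name swap_attribute is hypothetical.

```python
import random

def swap_attribute(records, attr, rng):
    """Return copies of the records with the values of attr randomly interchanged."""
    values = [r[attr] for r in records]
    rng.shuffle(values)  # shuffles the copied list in place
    return [{**r, attr: v} for r, v in zip(records, values)]

rng = random.Random(5)
records = [
    {"zip": "21201", "salary": 48_000},
    {"zip": "21202", "salary": 61_000},
    {"zip": "21205", "salary": 39_000},
    {"zip": "21210", "salary": 85_000},
    {"zip": "21218", "salary": 52_000},
]  # synthetic data
swapped = swap_attribute(records, "salary", rng)

# The multiset of salaries is unchanged, so every univariate statistic of
# the salary column is identical before and after swapping.
same_values = sorted(r["salary"] for r in swapped) == sorted(r["salary"] for r in records)
print(swapped, same_values)
```

Relationships between the swapped attribute and the other attributes are not protected by this naive version, so a practical scheme needs additional constraints on which records may exchange values.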
Data transformation techniques make use of mathematical approaches such as Fourier-related transforms and the Discrete Wavelet Transform (DWT) to break a signal (or an original series) down into coefficients that retain most of the characteristics of the original series and those that do not. The former are called high-energy/low-frequency coefficients and the latter are known as low-energy/high-frequency coefficients. Most transformation techniques applied in the area of PPDM exploit this feature to preserve high-energy coefficients, and thus the original data patterns, and to discard low-energy coefficients, thereby masking sensitive data values.
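A single level of the Haar wavelet transform makes the split into high-energy and low-energy coefficients tangible. The sketch below uses the standard orthonormal Haar step on a short synthetic series. The function names are hypothetical, and discarding all detail coefficients is just one simple masking choice.

```python
import math

def haar_step(series):
    """One level of the orthonormal Haar transform on an even-length series."""
    if len(series) % 2:
        raise ValueError("series length must be even")
    s = math.sqrt(2.0)
    approx = [(series[i] + series[i + 1]) / s for i in range(0, len(series), 2)]
    detail = [(series[i] - series[i + 1]) / s for i in range(0, len(series), 2)]
    return approx, detail

def inverse_haar_step(approx, detail):
    """Invert haar_step; with the true details this recovers the series exactly."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend(((a + d) / s, (a - d) / s))
    return out

series = [20.0, 22.0, 35.0, 33.0, 50.0, 54.0, 41.0, 39.0]  # synthetic series
approx, detail = haar_step(series)

energy = sum(x * x for x in series)
approx_energy = sum(a * a for a in approx)
detail_energy = sum(d * d for d in detail)
share = approx_energy / energy  # fraction of energy in the approximation

# Discarding the low-energy detail coefficients replaces each pair of
# values by its mean: the overall pattern survives, exact values do not.
masked = inverse_haar_step(approx, [0.0] * len(detail))
print(f"energy kept: {share:.4f}", masked)
```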
Mukherjee et al. (2006) recommend using the Discrete Cosine Transform (DCT) to prepare data for Euclidean distance-based mining algorithms such as k-means clustering and k-nearest neighbor classification. The distance preserving property of the DCT is exploited. Some level of data privacy is also offered by manipulating [...] Gangopadhyay (2006) test the performance of the Haar and the Daub-4 wavelet transforms in preserving privacy and maintaining the predictive accuracies of SVMs3, Naive Bayesian and Adaptive Bayesian networks. Their experiments show that both transforms preserve the privacy of real-valued data, in addition to preserving the classification patterns.
Preserving association patterns and privacy is the focus of yet another wavelet-based PPDM approach presented at the Secure Knowledge Management (SKM) workshop in 2006 (Gangopadhyay & Ahluwalia, 2006). Here, properties of wavelets preserve the underlying patterns of the original data in the privatized data shared for mining. A sort, transform and duplicate operation of the data preprocessing phase also protects the privacy of the original data values either by changing the values or anonymizing them in the transformed data. Our approach of using only the row-orthonormal matrix reduces the row dimension of the data by a factor of two, compared to the approach of Liu et al. (2006a, 2006b), which reduces the attribute dimensions. Our methodology thus maintains the attribute semantics in the privatized dataset. The distribution of the privatized values has a mean that is identical to the mean of the distribution of the original values, but a standard deviation that is lower than the standard deviation of the original values due to the aggregation effect of wavelet transforms. The noise-reducing aggregation effect of wavelets is exploited to preserve the patterns. This is also an advantage over the random perturbation techniques discussed above, which distort the relationships between attributes. Apart from these advantages, the wavelet decomposition completes in a single iteration over the data set. It requires little storage for each sequence and linear time4 in the length of the sequence. Wavelet transforms are therefore scalable, in contrast to PCA, which is data-dependent. Any change in the data size affects the covariance between variables and hence the PCA calculations.
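The statement about the mean and standard deviation can be illustrated with pairwise averaging of sorted values, which corresponds to Haar approximation coefficients rescaled by 1/√2. This is only a sketch of the aggregation effect under that assumption and does not reproduce the sort, transform and duplicate procedure itself. The function name pairwise_average and the data are hypothetical.

```python
import statistics

def pairwise_average(values):
    """Halve the number of rows by averaging adjacent pairs in one pass.

    Equivalent to Haar approximation coefficients divided by sqrt(2),
    a rescaling assumed here so that the values stay on the original scale.
    """
    if len(values) % 2:
        raise ValueError("number of values must be even")
    return [(values[i] + values[i + 1]) / 2.0 for i in range(0, len(values), 2)]

original = sorted([30.0, 12.0, 47.0, 15.0, 34.0, 21.0])  # synthetic attribute
privatized = pairwise_average(original)

mean_before, mean_after = statistics.fmean(original), statistics.fmean(privatized)
sd_before, sd_after = statistics.pstdev(original), statistics.pstdev(privatized)
print(f"mean {mean_before:.2f} -> {mean_after:.2f}, "
      f"std {sd_before:.2f} -> {sd_after:.2f}")
```

The loop touches each value once, which is the linear-time, single-pass behavior noted above. The mean is unchanged because every original value contributes to exactly one average with equal weight.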
Probability distribution involves replacing the original data either by a sample from the same population and probability distribution as the original data (Liew, Choi, & Liew, 1985) or by its distribution (Lefons, Silvestri, & Tangorra, 1983). Liew et al. (1985) prove that privacy of a single sensitive numerical or categorical attribute can be protected by using the attribute’s probability distribution to simulate new values and by substituting the original attribute values by new values. Lefons et al. (1983) propose an analytical approach for protecting multi-numerical sensitive attributes by substituting the original sensitive database by its probability distribution.
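Replacing an attribute by simulated values can be sketched with the standard library. The example assumes that a normal distribution is an adequate model for the attribute, which is a modeling choice made here for brevity and not something prescribed by the cited work. The function name simulate_replacement is hypothetical.

```python
import random
import statistics

def simulate_replacement(values, seed=None):
    """Fit a normal distribution to the values and draw the same number of new ones.

    The normal model is an assumption of this sketch; in practice the
    family would be chosen to match the attribute being protected.
    """
    fitted = statistics.NormalDist.from_samples(values)
    return fitted.samples(len(values), seed=seed)

rng = random.Random(5)
original = [rng.gauss(50.0, 10.0) for _ in range(5000)]  # synthetic attribute
released = simulate_replacement(original, seed=9)

print(f"mean {statistics.fmean(original):.2f} -> {statistics.fmean(released):.2f}")
print(f"std  {statistics.stdev(original):.2f} -> {statistics.stdev(released):.2f}")
```

None of the released values is tied to a particular original record, yet aggregate statistics computed on the released data stay close to those of the original data.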