Secure multiparty computation (SMC) refers to a computation performed by two or more mutually untrusted parties [10, 11]. Each party owns some private data which it is not willing to share with other parties. However, all the parties are interested in performing a computation on the union of data belonging to individual parties. An example of such a situation may be a taxation office and a social security department which are both interested in mining their joint data but are legally precluded from sharing confidential individual information without the explicit consent of their clients.
The SMC problem was originally introduced in 1982 by Yao [11] and has been extensively studied since then. In addition to privacy-preserving data mining, the SMC problem has applications in other areas, including the security of statistical databases and private information retrieval (PIR).
To illustrate SMC in privacy-preserving data mining, we describe the computation of a secure sum, which is often used both to illustrate the concepts of SMC and to show how the system can be subverted if the parties collude [12]. In our example there are $s$ parties (sites), where site $i$ owns a value $x_i$ which it is not willing to share with other parties. Suppose that the sum $x = \sum_{i=1}^{s} x_i$ is in the range $[0, n)$. Then site 1 generates a random number $R_1$ in the range $[0, n)$, computes $R_2 = (R_1 + x_1) \bmod n$ and sends $R_2$ to site 2. Note that, like $R_1$, $R_2$ is also uniformly distributed over the range $[0, n)$, and thus site 2 cannot learn anything about $x_1$. Site 2 then computes $R_3 = (R_2 + x_2) \bmod n$ and sends it to site 3. Finally, site $s$ receives $R_s$, computes $R_{s+1} = (R_s + x_s) \bmod n$ and sends it back to site 1. Site 1 then calculates $x = (R_{s+1} - R_1) \bmod n$ and sends $x$ to all the parties. Note that each party is assumed to have used its correct value $x_i$.
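The ring protocol described above can be simulated in a single process. The following is a minimal sketch rather than the implementation of [12]: the function name `secure_sum` and the use of Python's `secrets` module for the random mask are assumptions made for illustration, and the true sum is assumed to be smaller than the modulus `n` so that the final reduction recovers it exactly.

```python
import secrets


def secure_sum(values, n):
    """Simulate the ring-based secure sum among len(values) sites.

    values[i] is the private input of site i+1 and n is the public modulus.
    Assumption (illustrative): 0 <= sum(values) < n.
    Returns (total, messages), where messages[i] is the value that
    site i+1 passes on to the next site in the ring.
    """
    if not 0 <= sum(values) < n:
        raise ValueError("the sum must lie in the range 0 .. n-1")

    r1 = secrets.randbelow(n)      # site 1's secret mask, uniform on 0 .. n-1
    running = r1
    messages = []
    for x in values:               # each site adds its own value modulo n
        running = (running + x) % n
        messages.append(running)

    total = (running - r1) % n     # site 1 removes its mask
    return total, messages


total, messages = secure_sum([12, 7, 30], n=100)
print(total)
```

Each entry of `messages` is masked by the random value chosen by site 1, so a single intermediate value on its own carries no information about the inputs added so far.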
If there is no collusion, party $i$ only learns the total sum $x$, and can also calculate $(x - x_i) \bmod n$, that is, the sum of the values of all the other parties. However, if two or more parties collude, they can disclose more information. For example, if the two neighbors of party $i$ (that is, parties $i-1$ and $i+1$) collude, they can learn $x_i = (R_{i+1} - R_i) \bmod n$. The protocol can be extended in such a way that each party divides its value into $m$ shares, and the sum of each share is computed separately. The ordering of parties is different for each share, so that no party has the same neighbors twice. The larger the number of shares, the more colluding parties are required to subvert the protocol, but the slower the computation. In general, collusion is considered to be a serious threat to SMC.
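Both the collusion attack and the share-based extension can be sketched in a few lines. All function names below are hypothetical. The sketch also simplifies the extension: it picks a random ordering of the parties for each share, which does not by itself guarantee that no party has the same neighbors twice, so a real deployment would need to construct the orderings deliberately.

```python
import random
import secrets


def ring_messages(values, n):
    """Return [R_1, R_2, ..., R_{s+1}] for the ring protocol with modulus n."""
    r = [secrets.randbelow(n)]             # R_1, chosen by site 1
    for x in values:
        r.append((r[-1] + x) % n)          # R_{i+1} = (R_i + x_i) mod n
    return r


def colluding_neighbours_recover(r, i, n):
    """Parties i-1 and i+1 pool R_i and R_{i+1} to recover x_i (2 <= i <= s)."""
    r_i, r_next = r[i - 1], r[i]           # the list r is 0-based, parties are 1-based
    return (r_next - r_i) % n


def split_into_shares(x, m, n):
    """Split x into m additive shares whose sum is x modulo n."""
    shares = [secrets.randbelow(n) for _ in range(m - 1)]
    shares.append((x - sum(shares)) % n)
    return shares


def ring_round(inputs, n):
    """One masked pass around the ring; returns the sum of inputs modulo n."""
    mask = secrets.randbelow(n)
    running = mask
    for v in inputs:
        running = (running + v) % n
    return (running - mask) % n


def shared_secure_sum(values, m, n):
    """Sum the j-th shares of all parties in a separate round for each j."""
    shares = [split_into_shares(x, m, n) for x in values]
    order = list(range(len(values)))
    total = 0
    for j in range(m):
        random.shuffle(order)              # simplification: random ordering per share
        total = (total + ring_round([shares[p][j] for p in order], n)) % n
    return total


values = [12, 7, 30, 5]
r = ring_messages(values, 100)
print(colluding_neighbours_recover(r, 3, 100))   # the private value of party 3
print(shared_secure_sum(values, m=3, n=100))
```

With shares, the neighbors of a party in a single round learn only one share of its value, which is uniformly distributed and reveals nothing without the remaining shares.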
In the last few years a number of SMC algorithms for various data mining tasks have been proposed [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]. Most of these algorithms make use of similar primitive computations, including secure sum, secure set union, secure size of set intersection and secure scalar product. Clifton et al. have initiated building a toolkit of such basic computation techniques, in order to facilitate the development of more complex, application-specific privacy-preserving techniques [12]. For the benefit of the interested reader, we next describe some of these application-specific techniques, and where applicable we specify which primitive computation technique was used.
Secure multiparty computation for association rule mining has been studied in [13, 14, 15, 16]. The task here is to develop an SMC for finding the global support count of an itemset. For data that is vertically partitioned among parties, and boolean attribute values, finding the frequency of an itemset is equivalent to computing the secure scalar product [14]. For horizontally partitioned data the frequency of an itemset reduces to the secure set union [15].
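The equivalence between support counting and the scalar product in the vertically partitioned case can be seen from a small example. The sketch below computes the scalar product in the clear, purely to show the arithmetic; it is not a secure protocol, and the function names and toy data are assumptions made for illustration. In a privacy-preserving setting the call to `scalar_product` would be replaced by a secure scalar product protocol.

```python
def local_indicator(columns):
    """AND together one party's boolean columns, row by row.

    columns is a list of equal-length 0/1 lists, one per attribute that the
    party holds and that belongs to the itemset of interest.
    """
    return [int(all(row)) for row in zip(*columns)]


def scalar_product(u, v):
    """Plain (insecure) scalar product of two equal-length 0/1 vectors."""
    if len(u) != len(v):
        raise ValueError("both vectors must cover the same transactions")
    return sum(a * b for a, b in zip(u, v))


# Five transactions; party 1 holds two attributes, party 2 holds one.
party1_columns = [[1, 1, 0, 1, 1],    # hypothetical attribute "bread"
                  [1, 0, 0, 1, 1]]    # hypothetical attribute "milk"
party2_columns = [[1, 1, 1, 0, 1]]    # hypothetical attribute "beer"

support = scalar_product(local_indicator(party1_columns),
                         local_indicator(party2_columns))
print(support)   # number of transactions containing all three items
```

A transaction contributes 1 to the product only if every attribute of the itemset is 1 at both parties, so the scalar product equals the support count.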
An algorithm for SMC of association rules that prevents a $k$-compromise is presented in [13], where $k$-compromise refers to the disclosure of a statistic
158 L. Brankovic, Md.Z. Islam, H. Giggins
based on fewer than $k$ participants (for more details see Chap. 12, Sect. 12.2). However, this algorithm is not resistant to colluding participants.
Another technique for horizontally partitioned datasets [16] relies on the fact that a global frequent itemset (GFI) has to be a frequent itemset in at least one of the partitions [25]. GFIs are those itemsets having a global support count greater than a user-defined threshold. In this technique, the maximal frequent itemsets (MFIs) of all partitions are locally computed by the parties. The union of all these local MFIs is then computed by a trusted third party. The support counts of all possible subsets of each of the MFIs belonging to the union are computed by the parties locally. Finally, the global summation of the support counts for each itemset is computed using the secure sum computation. GFIs can be used for various purposes, including the discovery of association rules and correlations.
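The steps of this technique can be traced on toy data. The following is a sketch under stated assumptions and not the algorithm of [16]: the threshold is taken to be a minimum support fraction `min_frac` applied both locally and globally, an itemset counts as frequent when its support is at least that fraction, the local MFIs are found by brute force, and the union and the final summation are done in the clear where the text calls for a trusted third party and a secure sum. All function names are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def local_mfis(transactions, min_frac):
    """Brute-force maximal frequent itemsets of one partition (toy sizes only)."""
    items = sorted(set().union(*transactions))
    threshold = min_frac * len(transactions)
    frequent = [frozenset(c)
                for k in range(1, len(items) + 1)
                for c in combinations(items, k)
                if support(frozenset(c), transactions) >= threshold]
    # keep only itemsets with no frequent proper superset
    return {f for f in frequent if not any(f < g for g in frequent)}


def nonempty_subsets(itemset):
    """All non-empty subsets of itemset, as frozensets."""
    items = sorted(itemset)
    return {frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)}


def global_frequent_itemsets(partitions, min_frac):
    # Steps 1 and 2: local MFIs and their union (trusted third party in the text).
    union = set().union(*(local_mfis(p, min_frac) for p in partitions))
    # Step 3: every subset of an MFI in the union is a candidate.
    candidates = set().union(*(nonempty_subsets(m) for m in union))
    # Step 4: add up local support counts (a secure sum in the text).
    total_rows = sum(len(p) for p in partitions)
    result = {}
    for c in candidates:
        count = sum(support(c, p) for p in partitions)
        if count >= min_frac * total_rows:
            result[c] = count
    return result


partitions = [
    [{"a", "b"}, {"a", "b", "c"}, {"a"}],
    [{"b", "c"}, {"b"}, {"a", "b"}, {"c"}],
]
print(global_frequent_itemsets(partitions, min_frac=0.5))
```

Because every GFI is frequent in at least one partition, it is a subset of some local MFI, so no GFI is lost by restricting the candidates to subsets of the union.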
Building a decision tree on horizontally partitioned data based on oblivious transfer was proposed in [21]. The protocol uses the well-known ID3 algorithm for building decision trees. Each party performs most of the computations independently on its own database. This increases the efficiency of the protocol. The results obtained from these independent computations are then combined using an efficient cryptographic protocol based on oblivious transfer and specifically tailored towards ID3.
A secure multiparty computation protocol for a naive Bayesian classifier on horizontally partitioned data that relies on the secure sum was proposed in [15]. The same paper also provides an extension, based on a secure algorithm for evaluating a logarithm [21], that enhances security.
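One way to see why the secure sum suffices in the horizontally partitioned case is that the parameters of a naive Bayesian classifier over nominal attributes are ratios of counts, and the global counts are sums of the counts held by the individual sites. The sketch below illustrates only this arithmetic; it is an assumption about how the sums are used and not the protocol of [15]. The counts are added in the clear where a secure sum would be used, and all names and data are hypothetical.

```python
from collections import Counter


def local_counts(rows):
    """Counts computed by one site from its own rows.

    rows is a list of (attributes, label) pairs, where attributes is a dict
    mapping attribute names to nominal values.
    """
    class_counts = Counter()
    feature_counts = Counter()
    for attributes, label in rows:
        class_counts[label] += 1
        for name, value in attributes.items():
            feature_counts[(name, value, label)] += 1
    return class_counts, feature_counts


def combine(counters):
    """Add up per-site counters (stand-in for one secure sum per count)."""
    total = Counter()
    for c in counters:
        total.update(c)
    return total


def conditional_probability(name, value, label, class_counts, feature_counts):
    """Estimate P(attribute = value | class = label) from the global counts."""
    return feature_counts[(name, value, label)] / class_counts[label]


site1 = [({"outlook": "sunny"}, "yes"), ({"outlook": "rain"}, "no")]
site2 = [({"outlook": "sunny"}, "yes"), ({"outlook": "sunny"}, "no"),
         ({"outlook": "rain"}, "yes")]

per_site = [local_counts(site1), local_counts(site2)]
class_counts = combine(c for c, _ in per_site)
feature_counts = combine(f for _, f in per_site)
print(conditional_probability("outlook", "sunny", "yes",
                              class_counts, feature_counts))
```

Only the aggregated counts are needed to classify an instance, so no site has to reveal its individual rows.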
Secure protocols for classification on vertically partitioned data relying on secure scalar product were proposed in [17, 18]. The protocol proposed in [17] builds a classifier, but does not disclose it to any of the parties, due to legal and/or commercial issues. Rather, the parties collaborate to classify an instance. However, the classifier can be reverse-engineered from knowledge of a sufficient number of classified instances.
A solution for building a decision tree on vertically partitioned data was proposed in [19]. This method is based on a secure scalar product, and uses a semi-trusted third-party commodity server in order to increase performance. A secure multiparty computation of clusters on vertically partitioned data was studied in [22]. Regression on vertically partitioned data was considered in [23, 20], while secure computing of outliers for horizontally and vertically partitioned data was studied in [24].