We now turn our attention to association rule mining. Association rule mining algorithms provide a means to discover interesting correlations (or implications) between different items in a transactional database. Informally, a transactional database is a collection of records, where each record represents a transaction made by a customer. Each transaction contains a subset of the items known to the system; the set of all such items is called the global itemset. This will be made more formal shortly.

When a transactional database is stored at a single site, there is no privacy concern other than outside attacks. When the data belongs to two or more sites, there is cause for concern. A motivating example where privacy preserving algorithms are needed involves patient data belonging to two hospitals. It may be unethical or even illegal for either hospital to share its patient data with the other. While obeying this restriction, the hospitals still wish to engage in association rule mining to determine what can be learned from the union of their data. The challenge, therefore, is for both hospitals to mine the combined data without unnecessarily disclosing individual data elements. This chapter proposes a solution to this problem, the two party privacy preserving association rule mining problem.

Association rule mining algorithms determine rules from the support and confidence of itemsets. An example of an itemset is {computer, keyboard, speakers}. If an association rule based on this itemset, such as {computer, keyboard} $\Rightarrow$ {speakers}, is discovered and revealed to the manager of a supermarket, the manager can consider grouping these items to increase sales. A popular algorithm for determining association rules efficiently is the Apriori based solution of Agrawal et al. [2]. This recursive algorithm is known to produce results in a reasonable time, but we cannot simply apply it to a data sensitive scenario like our hospital example. We need to modify the algorithm so that the privacy of the data belonging to both parties is preserved.

Generally speaking, privacy in data mining algorithms means preventing data misuse [24]. In the ideal setting, we assume that a trusted third party performs the data mining algorithm and then broadcasts the result. Of course, this is hard to realise in practice. Hence we need tools and techniques that satisfy the privacy requirements to a degree comparable to a trusted third party. The first privacy preserving data mining algorithm was introduced by Lindell and Pinkas in 2000 [65]. They present a protocol that produces a decision tree using the ID3 algorithm proposed by Quinlan [84], in which the entropy or information gain is computed privately. This is achieved through the use of Yao’s garbled circuit [103] and 1-out-of-2 oblivious transfer [33]. The main contribution is that Yao’s garbled circuit can be applied to provide privacy for both parties, since it enables private evaluation of a function.

Besides private function evaluation, there are other methods of preserving privacy in association rule mining. They include data perturbation [88, 34] and homomorphic encryption [81, 32]. Data perturbation means that random data is added to the actual records in order to preserve privacy. The randomness alleviates the privacy concerns because the data is concealed, but it also reduces the accuracy of the final result. Homomorphic encryption, on the other hand, allows the data miner to operate on the plaintext while it remains encrypted. This provides far greater control and accuracy than pure data perturbation methods. Hybrid protocols, which combine both methods, have also been presented [80, 107].
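The accuracy trade-off of data perturbation can be illustrated with a minimal sketch. It assumes the simplest form of perturbation, independent additive uniform noise, which is not necessarily the scheme used in [88, 34]. The attribute `ages`, the helper `perturb` and the noise width are made up for illustration.

```python
import random

def perturb(values, spread, rng):
    """Return a copy of values with independent uniform noise in [-spread, spread] added."""
    return [v + rng.uniform(-spread, spread) for v in values]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Made-up sensitive attribute for 10,000 records (hypothetical data).
ages = [rng.randint(20, 80) for _ in range(10_000)]

# Each released record is concealed by noise of up to +/- 10.
noisy = perturb(ages, spread=10.0, rng=rng)

true_mean = sum(ages) / len(ages)
noisy_mean = sum(noisy) / len(noisy)

# The aggregate survives approximately, but not exactly: this gap is the
# accuracy that is traded away for concealing the individual records.
print("error in the mean:", abs(true_mean - noisy_mean))
```

Individual records are distorted, while an aggregate such as the mean is only approximately preserved. A larger `spread` conceals more and distorts the aggregate more.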

The core operation of association rule mining is to compare the count frequency of a particular itemset against a threshold. The same holds when the data is stored on two separate sites. If the sizes of the two databases are $d_1$ and $d_2$, and the count frequencies of an itemset (such as $abc$) held by the two parties are $c_1$ and $c_2$, then the inequality $\frac{c_1 + c_2}{d_1 + d_2} \geq s$ tests whether the itemset is frequent, where $s$ is the minimum support threshold. To preserve the privacy of the data, this test must be performed securely. Kantarcioglu et al. suggest that a computation of the form $\frac{c_1 + c_2}{d_1 + d_2} \geq s$ can be converted into the millionaire’s problem [58], which can be solved by Yao’s garbled circuit [103]. They also expose an inherent problem with the two party association rule mining process: any rule that is supported globally and is not supported by the first party must be supported by the second party. This inherent problem is beyond the scope of this work. Our goal is to protect both databases from unnecessary disclosure. In this chapter we build on the result of [58] by providing a more reasonable solution using a new result in homomorphic encryption.
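One way to see why the support test reduces to a comparison of two private values is the following rearrangement. Writing $c_1, c_2$ for the two local counts, $d_1, d_2$ for the two database sizes and $s$ for the minimum support, the test $(c_1 + c_2)/(d_1 + d_2) \geq s$ is algebraically equivalent to $c_1 - s d_1 \geq s d_2 - c_2$, where each side depends on the data of one party only. The sketch below checks this equivalence in the clear, so it offers no privacy at all. In a protocol, the final comparison would be the step evaluated securely. The function names are hypothetical, and the exact construction in [58] may differ.

```python
from fractions import Fraction

def local_excess(count, db_size, s):
    """A party's local surplus of support: count - s * db_size (may be negative)."""
    return Fraction(count) - s * db_size

def globally_frequent(c1, d1, c2, d2, s):
    """Direct test on pooled values: (c1 + c2) / (d1 + d2) >= s."""
    return Fraction(c1 + c2, d1 + d2) >= s

def frequent_by_comparison(c1, d1, c2, d2, s):
    """The same test written as a comparison of one value per party."""
    a = local_excess(c1, d1, s)    # computable by party 1 alone
    b = -local_excess(c2, d2, s)   # computable by party 2 alone
    # Only this comparison needs both inputs; it is the millionaire-style step.
    return a >= b

s = Fraction(5, 100)  # illustrative minimum support of 5%

# 23 supporting transactions out of 500 is 4.6%, below the threshold.
print(frequent_by_comparison(18, 200, 5, 300, s))   # False
# 27 supporting transactions out of 500 is 5.4%, above the threshold.
print(frequent_by_comparison(18, 200, 9, 300, s))   # True
```

Exact rational arithmetic (`Fraction`) is used so that the two forms of the test agree even when the pooled support equals the threshold.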

Many encryption schemes have the homomorphic property [87, 32, 81]. However, until recently, all known homomorphic encryption schemes were partially homomorphic. In other words, they are homomorphic under only one operation, either addition or multiplication. An encryption scheme that supports both operations is considered fully homomorphic. The breakthrough work of Gentry provided the first secure fully homomorphic encryption scheme [38]. Essentially, he created an encryption scheme that can evaluate its own decryption circuit.
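To make "homomorphic under one operation" concrete, the following toy sketch uses the Paillier cryptosystem, which is additively homomorphic: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. Paillier is chosen here purely as an illustration; no claim is made that it is one of the schemes cited above, and it is not Gentry's fully homomorphic scheme. The primes are far too small to be secure, and all names are hypothetical.

```python
import math
import random

# Toy parameters: insecure, chosen only so the arithmetic stays small.
P, Q = 17, 19
N = P * Q                       # public modulus
N2 = N * N
G = N + 1                       # standard simple choice of generator
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P - 1, Q - 1)
MU = pow(LAM, -1, N)            # valid decryption constant when G = N + 1

def encrypt(m, rng=random):
    """Encrypt 0 <= m < N using fresh randomness r coprime to N."""
    while True:
        r = rng.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return (pow(G, m, N2) * pow(r, N, N2)) % N2

def decrypt(c):
    """Recover the plaintext from a ciphertext using the private values LAM and MU."""
    x = pow(c, LAM, N2)
    return ((x - 1) // N * MU) % N

def add_encrypted(c1, c2):
    """Homomorphic addition: the product of ciphertexts encrypts the sum (mod N)."""
    return (c1 * c2) % N2

# Two local counts are added without decrypting either one.
total = add_encrypted(encrypt(20), encrypt(22))
print(decrypt(total))  # 42
```

Only addition is available on ciphertexts here. There is no corresponding operation that multiplies two encrypted values, which is what makes the scheme partially rather than fully homomorphic.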

Association rule mining is typically a two stage process. The first stage generates a list of frequent itemsets from the set of all known items. To avoid generating a list of all possible itemsets, a threshold value known as the support is chosen. This support value filters out most itemsets that would lead to uninteresting results. The second stage generates association rules from the list of frequent itemsets. The association rules are chosen based on their confidence value. The two values, support and confidence, define how much we should trust an association rule generated by this process.

More formally, any combination of items is known as an itemset. That is, an itemset is $I_s = \{I_1 \cup I_2 \cup \dots \cup I_k\}$, where $I_i \subseteq I$. An itemset of length $k$ is known as a $k$-itemset. The general form of an association rule is $X \Rightarrow Y$, where $X \subseteq I$, $Y \subseteq I$ and $X \cap Y = \Phi$. The support of $X \Rightarrow Y$ is the probability that a transaction in the database contains both $X$ and $Y$. The confidence of $X \Rightarrow Y$, on the other hand, is the probability that a transaction containing $X$ also contains $Y$. If an association rule is of the form $AB \Rightarrow C$, the support and confidence are calculated as follows.

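\[ \mathit{Support}_{AB \Rightarrow C} = s = \frac{\sum_{i=1}^{\mathit{sites}} \mathit{SupportCount}_{ABC_i}}{\sum_{i=1}^{\mathit{sites}} \mathit{DatabaseSize}_i} \tag{6.1} \]

\[ \mathit{Support}_{AB} = \frac{\sum_{i=1}^{\mathit{sites}} \mathit{SupportCount}_{AB_i}}{\sum_{i=1}^{\mathit{sites}} \mathit{DatabaseSize}_i} \tag{6.2} \]

\[ \mathit{Confidence}_{AB \Rightarrow C} = c = \frac{\mathit{Support}_{AB \Rightarrow C}}{\mathit{Support}_{AB}} \tag{6.3} \]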
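As a small worked example of equations (6.1) to (6.3), the sketch below computes the global support and confidence of a rule AB ⇒ C from per-site counts. The two sites, their database sizes and their support counts are made-up numbers, and the function names are hypothetical. The computation is done in the clear, so it illustrates the formulas only, not a private protocol.

```python
from fractions import Fraction

def global_support(support_counts, database_sizes):
    """Equations (6.1) and (6.2): summed per-site counts over summed per-site sizes."""
    return Fraction(sum(support_counts), sum(database_sizes))

def confidence(support_rule, support_antecedent):
    """Equation (6.3): support of AB => C divided by support of AB."""
    return support_rule / support_antecedent

# Hypothetical data for two sites.
database_sizes = [200, 300]
support_count_abc = [18, 9]    # transactions containing A, B and C at each site
support_count_ab = [30, 24]    # transactions containing A and B at each site

s = global_support(support_count_abc, database_sizes)    # (6.1): 27/500
s_ab = global_support(support_count_ab, database_sizes)  # (6.2): 54/500 = 27/250
c = confidence(s, s_ab)                                   # (6.3): 1/2

print(s, s_ab, c)  # 27/500 27/250 1/2
```

Because both supports share the same denominator, the confidence reduces to the ratio of the summed counts, here 27/54.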

The Apriori algorithm is an effective method for determining association rules [2]. It works recursively, starting with finding the frequent 1-itemsets $L_1$, which have support greater than the threshold value. From the 1-itemsets, the 2-itemsets $L_2$ are found. This repeats until $L_{k+1}$ is empty. The set $L_1 \cup L_2 \cup \dots \cup L_k$ is then the set of globally frequent itemsets, denoted $L_g$. Using $L_g$, one generates all association rules that have confidence greater than $c$. We refer the interested reader to [51, 97] for a more comprehensive description of the Apriori algorithm and related algorithms.
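The level-wise procedure described above can be sketched as follows. This is a minimal, non-private illustration that runs on a single list of transactions held in the clear; it is not the protocol developed in this chapter, nor a reproduction of the implementation in [2]. The names `apriori` and `rules` and the sample transactions are hypothetical, and thresholds are tested with `>=` as an assumption.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # L1: frequent 1-itemsets.
    level = {}
    for item in {i for t in transactions for i in t}:
        single = frozenset([item])
        sup = support(single)
        if sup >= min_support:
            level[single] = sup

    frequent, k = {}, 1
    while level:                      # stop once the next level is empty
        frequent.update(level)
        k += 1
        # Join: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        next_level = {}
        for cand in candidates:
            # Prune: every (k-1)-subset of a candidate must itself be frequent.
            if all(frozenset(sub) in level for sub in combinations(cand, k - 1)):
                sup = support(cand)
                if sup >= min_support:
                    next_level[cand] = sup
        level = next_level
    return frequent

def rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence) for rules meeting min_confidence."""
    found = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[lhs]   # subsets of frequent itemsets are frequent
                if conf >= min_confidence:
                    found.append((lhs, itemset - lhs, sup, conf))
    return found

transactions = [
    {"computer", "keyboard", "speakers"},
    {"computer", "keyboard"},
    {"computer", "keyboard", "speakers"},
    {"computer", "mouse"},
    {"keyboard", "speakers"},
]
frequent = apriori(transactions, min_support=0.4)
for lhs, rhs, sup, conf in rules(frequent, min_confidence=0.6):
    print(sorted(lhs), "=>", sorted(rhs), f"support={sup:.2f} confidence={conf:.2f}")
```

On this sample data, the rule {computer, keyboard} ⇒ {speakers} has support 2/5 and confidence 2/3, since two of the three transactions containing both a computer and a keyboard also contain speakers.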
