It is now possible to specify the Probabilistic Frequent Itemset Mining (PFIM) problem as follows. Given an uncertain transaction database T, a minimum support scalar minSup and a frequentness probability threshold τ, the objective is to find all probabilistic frequent itemsets.
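The problem statement can be illustrated with a brute-force sketch on the small database of Table 15.4. This is not the algorithm proposed in this chapter: it simply enumerates every candidate itemset and every containment pattern, which is only feasible for toy inputs. The function names are hypothetical, and the code assumes that items occur independently within a transaction, so that P(X ⊆ t) is the product of the item probabilities.

```python
from itertools import combinations, product
from math import prod

# Table 15.4: each transaction maps an item to its existence probability.
DB = [
    {"A": 0.8, "B": 0.2, "D": 0.5, "F": 1.0},
    {"B": 0.1, "C": 0.7, "D": 1.0, "E": 1.0, "G": 0.1},
    {"A": 0.5, "D": 0.2, "F": 0.5, "G": 1.0},
    {"D": 0.8, "E": 0.2, "G": 0.9},
    {"C": 1.0, "D": 0.5, "F": 0.8, "G": 1.0},
    {"A": 1.0, "B": 0.2, "C": 0.1},
]

def containment_probability(itemset, transaction):
    # Assumption: items are independent, so P(X ⊆ t) is a product.
    return prod(transaction.get(item, 0.0) for item in itemset)

def frequentness_probability(itemset, db, min_sup):
    # Sum the probabilities of all containment patterns with support >= min_sup.
    probs = [containment_probability(itemset, t) for t in db]
    total = 0.0
    for world in product([False, True], repeat=len(probs)):
        if sum(world) >= min_sup:
            total += prod(p if c else 1.0 - p for c, p in zip(world, probs))
    return total

def probabilistic_frequent_itemsets(db, min_sup, tau):
    # Naive PFIM: test every non-empty itemset against the threshold tau.
    items = sorted({item for t in db for item in t})
    result = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            p = frequentness_probability(itemset, db, min_sup)
            if p >= tau:
                result[itemset] = p
    return result

result = probabilistic_frequent_itemsets(DB, min_sup=3, tau=0.9)
```

With minSup = 3 and τ = 0.9, the itemset {G} is reported while {D} is not, even though both have the same expected support.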

15.1.4 Contributions and Outline

This chapter makes the following contributions:

• A probabilistic framework will be proposed for frequent itemset mining in databases containing uncertain transactions, based on the possible worlds model (cf. Definition 9.1 in Chapter 9).

• A dynamic computation method will be presented for computing the probability that an itemset is frequent, as well as the entire SPDF of the support of an itemset, in O(N) time, assuming that minSup is a constant. Without this technique, the computation would require time exponential in the number of transactions. With the approach proposed in this chapter, the algorithm has the same time complexity as methods based on the expected support [79, 80, 150]. However, the proposed approach is considerably more effective, since it provides confidences for frequent itemsets.

• An algorithm will be proposed to mine all itemsets that are frequent with a probability of at least τ. Furthermore, an additional algorithm will be proposed that incrementally outputs the uncertain itemsets in the order of their frequentness probability. This ensures that itemsets with the highest probability of being frequent are output first, which has two additional advantages. First, it makes the approach free of the parameter τ. Second, it solves the top-k itemsets problem in uncertain databases.

The remainder of this chapter is organized as follows. Section 15.2 will present the proposed probabilistic support framework. Section 15.3 will show how to compute the frequentness probability in O(N) time. Section 15.4 will present a probabilistic frequent itemset mining algorithm. Section 15.5 will present the proposed incremental algorithm. The experiments will be presented in Section 15.6. Finally, Section 15.7 will conclude this chapter.

15.2 Probabilistic Frequent Itemsets

15.2.1 Expected Support

Previous work addressing the problem of frequent itemset mining in uncertain databases was based on the expected support [79, 80, 150], which is defined as follows.

Definition 15.4 (Expected Support) Given an uncertain transaction database T, the expected support E(X) of an itemset X is defined as $E(X) = \sum_{t \in T} P(X \subseteq t)$.

| ID | Transaction |
|----|-------------|
| t1 | (A, 0.8); (B, 0.2); (D, 0.5); (F, 1.0) |
| t2 | (B, 0.1); (C, 0.7); (D, 1.0); (E, 1.0); (G, 0.1) |
| t3 | (A, 0.5); (D, 0.2); (F, 0.5); (G, 1.0) |
| t4 | (D, 0.8); (E, 0.2); (G, 0.9) |
| t5 | (C, 1.0); (D, 0.5); (F, 0.8); (G, 1.0) |
| t6 | (A, 1.0); (B, 0.2); (C, 0.1) |

Table 15.4: Example of a larger uncertain transaction database.

Considering an itemset frequent if its expected support is above minSup has a major drawback. Uncertain transaction databases naturally involve uncertainty concerning the support of an itemset. Considering this is important when evaluating whether an itemset is frequent or not. However, this information is forfeited when using the expected support approach. In the example shown in Table 15.4, the expected support of the itemset {D} is E({D}) = 3.0. The facts that {D} occurs for certain in one transaction, namely in t2, and that there is at least one possible world where {D} occurs in five transactions are totally ignored when the expected support is used to evaluate the frequency of an itemset. Indeed, suppose minSup = 3; is it appropriate to call {D} frequent? And if so, how certain can we be that {D} is frequent? By comparison, consider the itemset {G}. It also has an expected support of 3.0, but its presence or absence in the transactions is more certain. It turns out that the probability that {D} is frequent is approximately 0.7 (cf. Subsection 15.2.3), and the probability that {G} is frequent is 0.91. While both have the same expected support, we can be quite confident that {G} is frequent, in contrast to {D}. An expected-support-based technique does not differentiate between the two.
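The expected supports quoted above can be reproduced with a few lines of Python. This is a minimal sketch, not code from the cited works: the function names are hypothetical, and for itemsets with more than one item it assumes that items occur independently within a transaction.

```python
# Table 15.4: transaction ID -> {item: existence probability}.
DB = {
    "t1": {"A": 0.8, "B": 0.2, "D": 0.5, "F": 1.0},
    "t2": {"B": 0.1, "C": 0.7, "D": 1.0, "E": 1.0, "G": 0.1},
    "t3": {"A": 0.5, "D": 0.2, "F": 0.5, "G": 1.0},
    "t4": {"D": 0.8, "E": 0.2, "G": 0.9},
    "t5": {"C": 1.0, "D": 0.5, "F": 0.8, "G": 1.0},
    "t6": {"A": 1.0, "B": 0.2, "C": 0.1},
}

def containment_probability(itemset, transaction):
    """P(X ⊆ t); assumes independent items (an assumption of this sketch)."""
    p = 1.0
    for item in itemset:
        p *= transaction.get(item, 0.0)  # a missing item has probability 0
    return p

def expected_support(itemset, db):
    """Expected support: sum of P(X ⊆ t) over all transactions t."""
    return sum(containment_probability(itemset, t) for t in db.values())

print(round(expected_support({"D"}, DB), 6))  # 3.0
print(round(expected_support({"G"}, DB), 6))  # 3.0
```

Both itemsets collapse to the same number, which is exactly the loss of information described above.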

The confidence with which an itemset is frequent is very important for interpreting uncertain itemsets. Therefore, concepts are required that allow the uncertain data to be evaluated in a probabilistic way. This section formally introduces the concept of probabilistic frequent itemsets.

15.2.2 Probabilistic Support

In uncertain transaction databases, the support of an item or itemset cannot be represented by a unique value, but must rather be represented by a discrete SPDF.

Definition 15.5 (Support Probability) Given an uncertain transaction database T and the set $\mathcal{W}$ of possible worlds (instantiations) of T, the support probability $P_i(X)$ of an itemset X is the probability that X has the support i. Formally,

\[
P_i(X) = \sum_{W \in \mathcal{W}} P(W) \cdot I_{S(X,W) = i},
\]

where S(X, W) is the support of itemset X in world W and $I_z$ is an indicator variable that is 1 if z = true and 0 otherwise.

[Figure 15.1, panel (a): SPDF of {D}, plotting the support probability P_i({D}) against the support i. Panel (b): frequentness probabilities of {D}, plotting P≥minSup({D}) against the minimum support (minSup).]

Figure 15.1: Probabilistic support of itemset {D} in the uncertain database of Table 15.4.

Intuitively, $P_i(X)$ denotes the probability that the support of X is exactly i. The support probabilities associated with an itemset X for different support values form the SPDF of the support of X.

Definition 15.6 (Support Probability Distribution Function (SPDF)) The probabilistic support of an itemset X in an uncertain transaction database T is defined by the support probabilities of X ($P_i(X)$) for all possible support values i ∈ {0, . . . , N}. This probability distribution is called the Support Probability Distribution Function (SPDF). The following statement holds: $\sum_{0 \leq i \leq N} P_i(X) = 1.0$.

Returning to the example of Table 15.4, Figure 15.1(a) shows the SPDF of itemset {D}. The number of possible worlds $|\mathcal{W}|$ that need to be considered for the computation of $P_i(X)$ is extremely large. In fact, there are $O(2^{N \cdot |I|})$ possible worlds, where |I| denotes the total number of items. The following lemma shows how to compute $P_i(X)$ without materializing all possible worlds.

Lemma 15.1 For an uncertain transaction database T with mutually independent transactions and any 0 ≤ i ≤ N, the support probability $P_i(X)$ can be computed by

\[
P_i(X) = \sum_{T' \subseteq T,\, |T'| = i} \left( \prod_{t \in T'} P(X \subseteq t) \cdot \prod_{t \in T - T'} \bigl(1 - P(X \subseteq t)\bigr) \right), \tag{15.1}
\]

where the transaction subset T' ⊆ T contains exactly i transactions.

Proof. The transaction subset T' ⊆ T contains i transactions. The probability of a world W in which all transactions in T' contain X and the remaining |T − T'| transactions do not contain X is $P(W) = \prod_{t \in T'} P(X \subseteq t) \cdot \prod_{t \in T - T'} \bigl(1 - P(X \subseteq t)\bigr)$. The sum of the probabilities of all possible worlds satisfying the above conditions corresponds to Equation (15.1).

15.2.3 Frequentness Probability

The definition of the probabilistic support now makes it possible to address the actual problem: computing the probability that an itemset is frequent, i.e., the probability that an itemset occurs in at least minSup transactions.

Definition 15.7 (Frequentness Probability) Let T be an uncertain transaction database and X be an itemset. $P_{\geq i}(X)$ denotes the probability that the support of X is at least i, i.e., $P_{\geq i}(X) = \sum_{k=i}^{N} P_k(X)$. For a given minimum support minSup ∈ {0, . . . , N}, the probability $P_{\geq \mathit{minSup}}(X)$, which is called the frequentness probability of X, denotes the probability that the support of X is at least minSup.

Figure 15.1(b) shows the frequentness probabilities of {D} for all possible minSup values in the database of Table 15.4. For example, the probability that {D} is frequent when minSup = 3 is approximately 0.7, while its frequentness probability when minSup = 4 is approximately 0.3.
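These values can be checked by summing the probabilities of all worlds whose support reaches minSup. The sketch below collapses each possible world to its containment pattern (which transactions contain the itemset), which suffices because only the support matters here. The probability lists and the function name are illustrative assumptions.

```python
from itertools import product

# P(X ⊆ t) for t1..t6 in Table 15.4, for X = {D} and X = {G}.
P_D = [0.5, 1.0, 0.2, 0.8, 0.5, 0.0]
P_G = [0.0, 0.1, 1.0, 0.9, 1.0, 0.0]

def frequentness_probability(min_sup, probs):
    """Probability that at least min_sup transactions contain the itemset."""
    total = 0.0
    for world in product([False, True], repeat=len(probs)):
        if sum(world) >= min_sup:  # support of the itemset in this world
            p_world = 1.0
            for contains, p in zip(world, probs):
                p_world *= p if contains else 1.0 - p
            total += p_world
    return total

print(round(frequentness_probability(3, P_D), 2))  # 0.71
print(round(frequentness_probability(4, P_D), 2))  # 0.29
print(round(frequentness_probability(3, P_G), 2))  # 0.91
```

The exact values 0.71 and 0.29 correspond to the approximate readings 0.7 and 0.3 from Figure 15.1(b).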

The intuition behind $P_{\geq \mathit{minSup}}(X)$ is to provide a confidence with which an itemset can be rated as frequent. With this policy, the frequentness of an itemset becomes subjective, and the decision about which candidates are reported to the user depends on the application. Hence, the minimum frequentness probability τ is used as a user-defined parameter. Some applications may need a low τ, while in other applications only highly confident results should be reported (high τ).

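In the possible worlds model, it is known that $P_{\geq i}(X) = \sum_{W \in \mathcal{W},\, S(X,W) \geq i} P(W)$. This can be computed according to Equation (15.1) by

\[
P_{\geq i}(X) = \sum_{T' \subseteq T,\, |T'| \geq i} \left( \prod_{t \in T'} P(X \subseteq t) \cdot \prod_{t \in T - T'} \bigl(1 - P(X \subseteq t)\bigr) \right). \tag{15.2}
\]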

Hence, the frequentness probability can be computed by enumerating all possible worlds satisfying the minSup condition through the direct application of Equation (15.2). However, this naïve approach is very inefficient. It is possible to speed this up significantly. Typically, minSup ≪ N, and the number of worlds with support i is at most $\binom{N}{i}$. Hence, enumerating all worlds W in which the support of X is at least minSup is much more expensive than enumerating those in which the support is less than minSup. Using the following easily verified corollary, the frequentness probability can be computed in time exponential in minSup ≪ N.

Corollary 15.1

\[
P_{\geq i}(X) = 1 - \sum_{T' \subseteq T,\, |T'| < i} \left( \prod_{t \in T'} P(X \subseteq t) \cdot \prod_{t \in T - T'} \bigl(1 - P(X \subseteq t)\bigr) \right)
\]
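The complement trick of Corollary 15.1 can be sketched as follows, again for {D} in Table 15.4. Only transaction subsets with fewer than minSup elements are enumerated. The names `P_D` and `frequentness_via_complement` are hypothetical, and the code is a direct transcription of the formula rather than an optimized implementation.

```python
from itertools import combinations
from math import prod

# P({D} ⊆ t) for t1..t6 in Table 15.4.
P_D = [0.5, 1.0, 0.2, 0.8, 0.5, 0.0]

def frequentness_via_complement(min_sup, probs):
    """1 minus the probability that fewer than min_sup transactions match."""
    n = len(probs)
    below = 0.0
    for size in range(min_sup):  # subset sizes 0 .. min_sup - 1 only
        for subset in combinations(range(n), size):
            chosen = set(subset)
            below += prod(
                probs[t] if t in chosen else 1.0 - probs[t] for t in range(n)
            )
    return 1.0 - below

print(round(frequentness_via_complement(3, P_D), 2))  # 0.71
```

For N = 6 and minSup = 3, this visits 1 + 6 + 15 = 22 subsets, whereas the direct application of Equation (15.2) visits 20 + 15 + 6 + 1 = 42. The gap widens quickly as N grows while minSup stays small.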
