
relationships.

Parts II and III of the thesis are linked in two ways. First, GIM and GRM can be applied to solve parts of the problems in part III; GIM can be applied to mine complex correlation structures and GRM can be used to mine rules based on significance. Secondly, when correlation is incorporated into the vectorised frameworks of GIM and GRM, it has a geometric interpretation as the angle between interaction vectors. This leads to an intuitive method for predictive rule mining where the search causes the antecedent interaction vector to move closer to the vector for the variable to be predicted. This is used in the methods of chapters 6 and 7. Part III is linked to part IV through the development of significant frequent itemset mining in chapter 11.

1.1.3 Probabilistic Frequent Itemset Mining in Uncertain Databases

Association analysis is one of the most important fields in data mining. It is traditionally applied to market-basket databases for analysis of consumer purchasing behaviour, but is much more widely applicable. Such databases consist of a set of `transactions', each containing the `items' a customer `purchased'. The database can be analyzed to discover frequent patterns and associations among different sets of items. The most important step in the mining process is the extraction of frequent itemsets, that is, sets of items that occur in at least minSup transactions. It is generally assumed that the items occurring in a transaction are known for certain, but this is not always the case. In many applications the data is inherently noisy, such as data collected by sensors or in satellite images. In privacy protection applications, artificial noise can be added deliberately in order to prevent reverse engineering of the data through pattern analysis. Data sets may also be aggregated: for example, by aggregating transactions by customer, it is possible to mine patterns across customers instead of transactions. The resulting probabilistic database shows the estimated purchase probabilities per item per customer rather than certain items per transaction.
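The aggregation step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the purchase probability of an item for a customer is estimated as the fraction of that customer's transactions containing the item; the function name `aggregate_by_customer` and the toy data are hypothetical and not taken from the thesis.

```python
from collections import defaultdict


def aggregate_by_customer(transactions):
    """Turn certain (customer, items) transactions into per-customer item probabilities.

    Assumption (not from the thesis): the probability is estimated as the
    fraction of the customer's transactions that contain the item.
    """
    counts = defaultdict(lambda: defaultdict(int))  # customer -> item -> count
    totals = defaultdict(int)                       # customer -> number of transactions
    for customer, items in transactions:
        totals[customer] += 1
        for item in set(items):
            counts[customer][item] += 1
    return {
        customer: {item: n / totals[customer] for item, n in item_counts.items()}
        for customer, item_counts in counts.items()
    }


# Hypothetical certain market-basket data.
transactions = [
    ("alice", {"bread", "milk"}),
    ("alice", {"bread"}),
    ("bob", {"milk"}),
    ("bob", {"milk", "beer"}),
    ("bob", {"bread", "milk"}),
    ("bob", {"beer"}),
]

# One uncertain "transaction" per customer, mapping item -> estimated probability.
uncertain_db = aggregate_by_customer(transactions)
```

Each customer now corresponds to one uncertain transaction in which items carry probabilities rather than certain presence.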

In such applications, the information captured in transactions is uncertain since the existence of an item is associated with a probability. Given an uncertain or probabilistic transaction database, it is not obvious how to identify whether an itemset is frequent because we usually cannot say for certain whether an itemset actually appears in a transaction. This makes the problem challenging.

Prior to the work in this thesis, the expected support was used to solve this problem; an itemset was considered interesting if the expectation of its support was above minSup. This approach returns an estimate of whether an itemset is frequent or not, with no indication of how good this estimate is. Since it ignores the probability distribution of the support, it can lead to itemsets being labeled frequent even if the probability that they are frequent is less than the probability that they are not frequent. Clearly, this is a problem.
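To make concrete how much information the expected support discards, the sketch below compares two hypothetical itemsets with the same expected support but different probabilities of being frequent. It assumes that transactions are independent and that `probs[i]` is the probability that the itemset occurs in transaction `i`. The support distribution is computed by brute-force enumeration of all possible worlds, which is exponential in the number of transactions and only suitable for toy examples.

```python
from itertools import product


def support_distribution(probs):
    """Enumerate every possible world; return dist where dist[k] = P(support == k)."""
    dist = [0.0] * (len(probs) + 1)
    for world in product([0, 1], repeat=len(probs)):
        p_world = 1.0
        for present, p in zip(world, probs):
            p_world *= p if present else 1.0 - p
        dist[sum(world)] += p_world
    return dist


def expected_support(probs):
    """Expected number of transactions containing the itemset."""
    return sum(probs)


def frequentness_probability(probs, min_sup):
    """P(support >= min_sup), summed over the possible-world distribution."""
    return sum(support_distribution(probs)[min_sup:])


# Hypothetical per-transaction existence probabilities of two itemsets.
itemset_a = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
itemset_b = [0.5] * 6
min_sup = 3

for name, probs in (("A", itemset_a), ("B", itemset_b)):
    print(name, expected_support(probs), frequentness_probability(probs, min_sup))
```

Both itemsets have an expected support of 3, yet itemset A is frequent with certainty while itemset B is frequent only with probability 42/64. The expected support alone cannot distinguish the two cases.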

This thesis tackles the problem from a new direction: itemsets are considered interesting if the probability that they are frequent is above a user-specified threshold τ. This probability is known as the frequentness probability. Accordingly, a Probabilistic Frequent Itemset (PFI) is defined as an itemset with a frequentness probability of at least τ. This creates two main problems:

1. Given the existential probabilities of an itemset in all transactions, how can one efficiently calculate the probability distribution of the support and hence the frequentness probability of the given itemset?

2. How can one mine all itemsets that satisfy the frequentness probability constraints efficiently? This is called the Probabilistic Frequent Itemset Mining (PFIM) problem. A PFIM algorithm has three main tasks: efficiently searching through the space of uncertain itemsets; efficiently calculating the probabilities required by problem 1 for each itemset that must be examined; and then using the solution to problem 1 to determine whether an itemset is interesting.
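For problem 1, one standard way to avoid enumerating all possible worlds is a dynamic program over the transactions. The sketch below is an illustration under stated assumptions, not the thesis's own algorithm: transactions are assumed independent, `probs[j]` is the probability that the itemset occurs in transaction `j`, and the function name is hypothetical. It keeps, for each `i` up to `min_sup`, the probability that at least `i` of the transactions seen so far contain the itemset.

```python
def frequentness_probability(probs, min_sup):
    """Return P(support >= min_sup) in O(len(probs) * min_sup) time.

    at_least[i] holds the probability that at least i of the transactions
    processed so far contain the itemset (assuming independent transactions).
    """
    if min_sup <= 0:
        return 1.0  # a support of at least 0 is certain
    at_least = [1.0] + [0.0] * min_sup
    for p in probs:
        # Update from high i to low i so that at_least[i - 1] still refers
        # to the state before the current transaction was processed.
        for i in range(min_sup, 0, -1):
            at_least[i] = at_least[i - 1] * p + at_least[i] * (1.0 - p)
    return at_least[min_sup]


# Hypothetical example: the itemset occurs with probability 0.9 and 0.8
# in two transactions.
both = frequentness_probability([0.9, 0.8], 2)         # 0.9 * 0.8
at_least_one = frequentness_probability([0.9, 0.8], 1)  # 1 - 0.1 * 0.2
```

An itemset would then be reported as a PFI when the returned value is at least τ.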

These problems are considered in part IV of this thesis:

Chapter 10 introduces and motivates the PFIM problem as an important research direction. It also solves both parts of the problem: efficient calculation of the frequentness probability is achieved by employing the Poisson binomial recurrence relation and using a divide and conquer scheme in a possible worlds model. Mining the PFIs is achieved by developing ProApriori, an algorithm based on the Apriori method with candidate generation and testing. An incremental algorithm solving the top-k PFI problem is also presented.
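The general shape of an Apriori-style search with candidate generation and testing can be sketched as follows. This is a simplified illustration and not ProApriori itself: it assumes that items within a transaction are independent, so that an itemset's existence probability in a transaction is the product of its item probabilities, and that transactions are independent of each other. All function names and the toy database are hypothetical. Pruning relies on the fact that, in every possible world, the support of a superset cannot exceed the support of any of its subsets, so its frequentness probability cannot be higher either.

```python
from itertools import combinations


def itemset_probs(db, itemset):
    """Per-transaction existence probability of the itemset (items assumed independent)."""
    result = []
    for transaction in db:
        p = 1.0
        for item in itemset:
            p *= transaction.get(item, 0.0)
        result.append(p)
    return result


def frequentness_probability(probs, min_sup):
    """P(support >= min_sup) via a recurrence over independent transactions."""
    at_least = [1.0] + [0.0] * min_sup
    for p in probs:
        for i in range(min_sup, 0, -1):
            at_least[i] = at_least[i - 1] * p + at_least[i] * (1.0 - p)
    return at_least[min_sup]


def mine_pfis(db, min_sup, tau):
    """Level-wise search returning {itemset: frequentness probability} for all PFIs."""
    level = [frozenset([item]) for item in sorted({i for t in db for i in t})]
    found = {}
    while level:
        # Test: keep candidates whose frequentness probability reaches tau.
        survivors = []
        for candidate in level:
            fp = frequentness_probability(itemset_probs(db, candidate), min_sup)
            if fp >= tau:
                found[candidate] = fp
                survivors.append(candidate)
        # Generate: join survivors into candidates one item larger, keeping
        # only those whose every subset of the previous size survived.
        size = len(level[0]) + 1
        survivor_set = set(survivors)
        next_level = set()
        for a, b in combinations(survivors, 2):
            union = a | b
            if len(union) == size and all(
                frozenset(sub) in survivor_set
                for sub in combinations(union, size - 1)
            ):
                next_level.add(union)
        level = list(next_level)
    return found


# Hypothetical uncertain database: one dict of item -> probability per transaction.
db = [
    {"a": 1.0, "b": 1.0},
    {"a": 1.0, "b": 0.5},
    {"a": 1.0, "c": 0.1},
]
pfis = mine_pfis(db, min_sup=2, tau=0.4)
```

With this toy database, the itemsets {a}, {b} and {a, b} are reported, while every candidate containing c is discarded at the first level.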

Chapter 12 improves on this by developing a probabilistic pattern growth approach inspired by the FP-Growth [47] method. Here, a compact data structure called the probabilistic frequent pattern tree (ProFP-tree) compresses probabilistic databases and allows the efficient extraction of the existence probabilities required for part 1 of the problem. The ProFP-Growth algorithm is subsequently proposed for mining all PFIs without candidate generation and solves the PFIM problem an order of magnitude faster than ProApriori. Part 1 of the problem is solved in a more intuitive manner by employing generating functions.
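The generating-function view can be illustrated with a short sketch. Assuming independent transactions, where `probs[j]` is the probability that the itemset occurs in transaction `j`, the support distribution is given by the coefficients of the polynomial obtained by multiplying the factors (1 - p_j) + p_j x: the coefficient of x^k is the probability that the support equals k. The code below is a minimal illustration with hypothetical function names, not the formulation used in chapter 12.

```python
def support_generating_function(probs):
    """Coefficients of prod_j ((1 - p_j) + p_j * x); coeffs[k] = P(support == k)."""
    coeffs = [1.0]  # the constant polynomial 1
    for p in probs:
        nxt = [0.0] * (len(coeffs) + 1)
        for k, c in enumerate(coeffs):
            nxt[k] += c * (1.0 - p)  # itemset absent from this transaction
            nxt[k + 1] += c * p      # itemset present: the degree rises by one
        coeffs = nxt
    return coeffs


def frequentness_probability(probs, min_sup):
    """Sum the coefficients of all terms of degree min_sup or higher."""
    return sum(support_generating_function(probs)[min_sup:])


# Hypothetical example with two transactions.
coeffs = support_generating_function([0.5, 0.5])
fp = frequentness_probability([0.9, 0.8], 2)
```

Since only the total mass at degree minSup and above is needed, the polynomial could also be truncated at that degree during multiplication; the untruncated form is kept here for readability.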
