Chapter 4. Analysis and Results
4.1 Comparison of the Tornillo Seismic Event with the Signal Generated by the Helmholtz Model
This section describes how association rules can be generated very efficiently. Algorithm 4.3 presents the method, and the following lemma describes it and proves its
CHAPTER 4. GEOMETRICALLY INSPIRED ITEMSET MINING IN THE
TRANSPOSE 89
Algorithm 4.1 Data types, initialisation, main loop and auxiliary methods. The primary processing is done in algorithm 4.2.
Input:
  (1) A data set (in inputFile) in transpose format (may have g(·) already applied)
  (2) f(·), ◦, F(·) and minMeasure.
Output: Completed PrefixTree (prefixTree) and SequenceMap (map) containing all F-itemsets.
Data Types:
  Pair: (Itemvector y_i, Item item)
    // y_i is the item-vector for item and corresponds to y_i in fact 1.
    // They are reused through buffer.
  State: (PrefixNode node, Itemvector y_{I′}, Iterator itemvectors, boolean top, Pair newPair, List buffer)
    // y_{I′} is the item-vector corresponding to node (and y_{I′} in fact 1).
    // buffer is used to create the Iterators (such as itemvectors) for the States
    // created to hold the children of node. buffer is needed to make use of fact 3.
    // itemvectors provides the y_i to join with y_{I′} and newPair helps in doing this.
Initialisation:
  Initialise prefixTree with its root. Initialise map and frontier as empty.
  // Create initial state:
  Iterator itemvectors = new AnnotatedItemvectorIterator(inputFile);
  // Iterator is over Pair objects and reads input one row at a time and annotates
  // the item-vector with the item it corresponds to. Could also apply g(·)
  frontier.add(new State(prefixTree.getRoot(), null, itemvectors, false, null, new LinkedList()));
Main Loop:
  while (!frontier.isEmpty())
    step(frontier.getFirst()); // See algorithm 4.2
Auxiliary Methods:
  /* Let α be the itemset corresponding to node. α ∪ {item} is the itemset
     represented by a child p of node so that p.item = item. value would be p.value.
     This method calculates p.Value by using map to look up the PrefixNodes
     corresponding to the k required subsets of α ∪ {item} to get their value values,
     value_1, ..., value_k. Then it returns F(value_1, ..., value_k). */
  double calculateF(PrefixNode node, Item item, double value) // details depend on F(·)
  /* Check whether the itemset α ∪ {item} could be interesting by exploiting the
     anti-monotonic property of F(·): use map to check whether subsets of α ∪ {item}
     (except α and (α − node.item) ∪ {item} by fact 3) exist. */
  boolean check(PrefixNode node, Item item) // details omitted
90 4.6. MINING ASSOCIATION RULES
Algorithm 4.2 Procedure to perform one expansion. state.node is the parent of the new PrefixNode (newNode) that we create if newNode.Value ≥ minMeasure. localTop is true iff we are processing the top sibling of any sub-tree. nextTop becomes newNode.top and is set so that top is true only for a node that is along the topmost branch of the prefix tree.
void step(State state)
  if (state.newPair ≠ null) // see end of method ♣
    state.buffer.add(state.newPair);
    state.newPair = null; // so it won't be added again
  Pair p = state.itemvectors.next();
  boolean localTop = !state.itemvectors.hasNext();
  if (localTop)
    // Remove state from frontier (and hence delete state.y_{I′}) as the
    // top child of node is being created in this step. Fact 6
    localFrontier.removeFirst();
  Itemvector y_{I′∪{i}} = null; double value, Value;
  boolean nextTop; // top in the next State we create.
  if (state.node.isRoot()) // we are dealing with itemsets of length 1 (so I′ = {})
    value = f(p.y_i); Value = calculateF(null, {p.item}, value); y_{I′∪{i}} = p.y_i;
    state.top = localTop; nextTop = localTop; // initialise tops.
  else
    nextTop = localTop && state.top;
    if (check(state.node, p.item)) // make use of pruning property
      if (localTop && (state.node.getDepth() > 1 || state.top)) // Fact 6 or 7
        // No longer need state.y_{I′} as this is the last child we can create under
        // state.node (and it is not a single item other than perhaps the topmost)
        y_{I′∪{i}} = state.y_{I′};
        y_{I′∪{i}} ◦= p.y_i; // can write result directly into y_{I′∪{i}}
      else // need to use additional memory for the child (y_{I′∪{i}}).
        y_{I′∪{i}} = state.y_{I′} ◦ p.y_i;
      value = f(y_{I′∪{i}}); Value = calculateF(state.node, {p.item}, value);
    else // don't need to calculate since it is known that Value < minMeasure
      value = Value = −∞
  if (Value ≥ minMeasure) // Found an interesting itemset - create newNode for it.
    PrefixNode newNode = prefixTree.createChildUnder(state.node);
    newNode.item = p.item; newNode.value = value; newNode.Value = Value;
    sequenceMap.put(newNode);
    if (state.buffer.size() > 0) // there is potential to expand newNode. Fact 5
      State newState = new State(newNode, y_{I′∪{i}}, state.buffer.iterator(),
                                 nextTop, new LinkedList());
      // add to front of frontier so depth first search. Fact 2.
      frontier.addFront(newState);
    state.newPair = p;
    // if state.node is not complete, p will be added to state.buffer after
    // newState has been completed. See ♣
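The central idea of Algorithms 4.1 and 4.2, joining the item-vector of a node with the item-vectors of its earlier interesting siblings, can be illustrated with a much simplified Java sketch. It is not GLIMIT itself: it recurses instead of maintaining an explicit frontier, always allocates a fresh vector for each child, omits check(), the prefix tree and the sequence map, and fixes ◦ = AND, f = sum and F(m) = m. The names TransposeDfsSketch, mine, expand and bits are hypothetical and chosen only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TransposeDfsSketch {
    /** Mines all itemsets with support >= minSupport; itemVectors.get(i) is the bit-vector of item i. */
    static Map<List<Integer>, Integer> mine(List<BitSet> itemVectors, int minSupport) {
        Map<List<Integer>, Integer> result = new LinkedHashMap<>();
        List<Integer> allItems = new ArrayList<>();
        for (int i = 0; i < itemVectors.size(); i++) {
            allItems.add(i);
        }
        expand(new ArrayList<>(), null, allItems, itemVectors, minSupport, result);
        return result;
    }

    /** Creates the children of the node for `prefix`, joining only with the given candidate items. */
    private static void expand(List<Integer> prefix, BitSet prefixVector, List<Integer> candidates,
                               List<BitSet> itemVectors, int minSupport,
                               Map<List<Integer>, Integer> result) {
        List<Integer> buffer = new ArrayList<>(); // frequent siblings created so far
        for (int item : candidates) {
            BitSet y = (BitSet) itemVectors.get(item).clone();
            if (prefixVector != null) {
                y.and(prefixVector);              // join: AND of the two bit-vectors
            }
            int support = y.cardinality();        // f = sum, F(m) = m
            if (support >= minSupport) {
                List<Integer> itemset = new ArrayList<>(prefix);
                itemset.add(item);
                result.put(itemset, support);
                if (!buffer.isEmpty()) {
                    // Depth first: expand the new node using only its earlier frequent siblings.
                    expand(itemset, y, buffer, itemVectors, minSupport, result);
                }
                buffer.add(item);                 // later siblings may now join with this item
            }
        }
    }

    /** Helper for the demo: a bit-vector with the given transaction indices set. */
    static BitSet bits(int... transactionIds) {
        BitSet v = new BitSet();
        for (int t : transactionIds) {
            v.set(t);
        }
        return v;
    }

    public static void main(String[] args) {
        // Transactions: {0,1}, {1,2}, {0,1,2}, {0,2}, already in transpose form.
        List<BitSet> vectors = Arrays.asList(bits(0, 2, 3), bits(0, 1, 2), bits(1, 2, 3));
        mine(vectors, 2).forEach((itemset, support) -> System.out.println(itemset + " : " + support));
    }
}
```

Because a child is only joined with siblings that were themselves found interesting, every candidate that is evaluated has at least two interesting subsets, which is a weaker form of the pruning that check() performs in Algorithm 4.2.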
correctness.
Lemma 4.8. Let s = ⟨i_1, ..., i_k⟩ = αβγ be the sequence corresponding to a prefix node n where α, β ≠ ∅. All association rules can be generated by creating all rules α ⇒ β and β ⇒ α for each leaf node in the prefix tree.
Proof. (Sketch) Given a leaf node n corresponding to a sequence s = ⟨i_1, ..., i_k⟩, algorithm 4.3 generates all the rules α ⇒ β and β ⇒ α for all α, β, γ where α ≠ ∅ is a prefix of s, γ is a possibly empty suffix of s, and β ≠ ∅ is the remaining sub-string (a suffix iff γ = ∅). That is, s = αβγ. It is not possible to generate all possible association rules that can be generated from itemset {i_1, ..., i_k} by considering only node n. Specifically, the following are missed: (1) any rules α′ ⇒ β′ or β′ ⇒ α′ where α′ is not a prefix of s, and (2) any such rules where there is a gap between α′ and β′. However, by the construction of the tree there exists another node n′ corresponding to the sequence s′ = ⟨α′, β′⟩ (since s′ ⊏ s). If n′ is not in the fringe, then by definition s′ ⊏ s″ where s″ = ⟨α′, β′, γ′⟩ for some γ′ ≠ ∅ and n″ (the node for s″) is in the fringe. Hence α′ ⇒ β′ and β′ ⇒ α′ will be generated from node(s) other than n. Finally, the longest sequences are guaranteed to be in the fringe, hence all rules will be generated (and without duplication) by induction.
In this procedure, the evaluated measures (value, Value) for α are stored in the prefix nodes visited by the algorithm, as α is a prefix of s. To obtain the evaluated measures for β, the PrefixNode (βn) corresponding to β must be obtained. This is done using a Sequence Map (map) that has also been built by the mining algorithm.
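The way rules are read off a single fringe sequence can be sketched in Java as follows. This is an illustration under simplifying assumptions, not the thesis implementation: sequences are plain lists of strings, the measure is support, and a hash map from sequences to support counts (supportOf) stands in for both the prefix nodes and the sequence map. The names RuleSketch, Rule and rulesFor are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RuleSketch {
    /** One rule antecedent => consequent with its support count and confidence. */
    static final class Rule {
        final List<String> antecedent;
        final List<String> consequent;
        final int support;
        final double confidence;

        Rule(List<String> antecedent, List<String> consequent, int support, double confidence) {
            this.antecedent = new ArrayList<>(antecedent);
            this.consequent = new ArrayList<>(consequent);
            this.support = support;
            this.confidence = confidence;
        }

        @Override
        public String toString() {
            return antecedent + " => " + consequent
                    + " (support " + support + ", confidence " + confidence + ")";
        }
    }

    /**
     * Generates the rules obtainable from one fringe sequence s. supportOf must contain the
     * support of every prefix of s and of every contiguous sub-sequence used as beta below.
     */
    static List<Rule> rulesFor(List<String> s, Map<List<String>, Integer> supportOf) {
        List<Rule> rules = new ArrayList<>();
        // alphaBeta = s[0, end): walk from the fringe node up towards the root.
        for (int end = s.size(); end >= 2; end--) {
            int supAlphaBeta = supportOf.get(s.subList(0, end));
            // alpha = s[0, split) is a prefix; beta = s[split, end) is the remaining suffix.
            for (int split = end - 1; split >= 1; split--) {
                List<String> alpha = s.subList(0, split);
                List<String> beta = s.subList(split, end);
                double confAlphaBeta = (double) supAlphaBeta / supportOf.get(alpha);
                double confBetaAlpha = (double) supAlphaBeta / supportOf.get(beta);
                rules.add(new Rule(alpha, beta, supAlphaBeta, confAlphaBeta));
                rules.add(new Rule(beta, alpha, supAlphaBeta, confBetaAlpha));
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        Map<List<String>, Integer> supportOf = new HashMap<>();
        supportOf.put(Arrays.asList("a"), 4);
        supportOf.put(Arrays.asList("b"), 4);
        supportOf.put(Arrays.asList("c"), 3);
        supportOf.put(Arrays.asList("a", "b"), 3);
        supportOf.put(Arrays.asList("b", "c"), 2);
        supportOf.put(Arrays.asList("a", "b", "c"), 2);
        rulesFor(Arrays.asList("a", "b", "c"), supportOf).forEach(System.out::println);
    }
}
```

For the sequence ⟨a, b, c⟩ the sketch produces six rules. Rules with a gap, such as a ⇒ c, are not produced from this sequence, which matches case (2) in the proof sketch: they must come from another node.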
4.7 Experiments
The GLIMIT algorithm was evaluated on two publicly available data sets from the FIMI repository⁹: T10I4D100K and T40I10D100K. These data sets have 100,000 transactions and a realistic skewed histogram of items. They have 870 and 942 items respectively. To apply GLIMIT, the data was first transposed as a pre-processing step. This is cheap, especially for sparse matrices, which is precisely what the data sets in question typically are. The data used in the experiments was transposed in 8 and 15 seconds respectively using a naive Java implementation and without exploiting sparse techniques.
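The transposition step can be sketched as follows. This is a minimal Java illustration, not the implementation timed above: the class name TransposeSketch, the method transpose, and the representation of each transaction as an int[] of item ids are assumptions made for the example. It builds one java.util.BitSet per item, in which bit t is set iff transaction t contains the item.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TransposeSketch {
    /** Returns item id -> bit-vector over transaction indices. */
    static Map<Integer, BitSet> transpose(List<int[]> transactions) {
        Map<Integer, BitSet> itemVectors = new TreeMap<>();
        for (int t = 0; t < transactions.size(); t++) {
            for (int item : transactions.get(t)) {
                // Create the item's vector on first sight, then mark transaction t.
                itemVectors.computeIfAbsent(item, k -> new BitSet(transactions.size())).set(t);
            }
        }
        return itemVectors;
    }

    public static void main(String[] args) {
        List<int[]> transactions = Arrays.asList(
                new int[] {1, 2}, new int[] {2, 3}, new int[] {1, 2, 3});
        transpose(transactions).forEach((item, bits) -> System.out.println(item + " -> " + bits));
    }
}
```

A single pass over the transactions suffices, and the work is proportional to the number of (transaction, item) pairs, which is why the step is cheap for sparse data.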
⁹ http://fimi.cs.helsinki.fi/data/
92 4.7. EXPERIMENTS
(a) Runtime and frequent itemsets. T10I4D100K. Inset shows detail for low support.
(b) Runtime and frequent itemsets. T40I10D100K.
(a) Runtime ratios. T10I4D100K.
(b) Average time taken per frequent itemset shown on two scales. T10I4D100K.
Figure 4.8: Run time results. FP-Growth and GLIMIT.
Algorithm 4.3 Generating association rules from the prefix tree. This should be called for each PrefixNode in the fringe to output all rules. We assume the measure is support and we evaluate for confidence.
void generateAssociations(PrefixNode fringeNode)
  for (PrefixNode αβn = fringeNode; αβn.item ≠ ∅; αβn = αβn.parent)
    σ_{α∪β} = αβn.M; βsize = 1;
    for (PrefixNode αn = αβn.parent(); αn.item ≠ ∅; αn = αn.parent)
      Sequence βseq = αβn.getSuffix(βsize++);
      PrefixNode βn = map.get(βseq);
      σ_α = αn.Value; σ_β = βn.Value;
      c_{α⇒β} = σ_{α∪β} / σ_α; c_{β⇒α} = σ_{α∪β} / σ_β;
      /* output the rules and their σ and c values */
      output(αn, βn, σ_{α∪β}, c_{α⇒β});
      output(βn, αn, σ_{α∪β}, c_{β⇒α});
Figure 4.9: Number of item-vectors needed and maximum frontier size. Data set: T10I4D100K.
GLIMIT was compared to publicly available implementations of FP-Growth and Apriori. The algorithms used were from ARtool¹⁰, as it is also written in Java and has been available for some time. The algorithms were not used via the supplied GUI; rather, the underlying classes were invoked directly to avoid overheads. The primary goal of this section is to show that GLIMIT is fast and efficient when compared to existing algorithms on the traditional FIM problem. Recall, however, that a major contribution of this chapter is the item-vector framework, which allows operations that previously could not be considered, and a flexible new class of algorithm that uses this framework to efficiently mine data cast into different and useful spaces. The fact that it is also very fast when applied to traditional FIM is a consequence of this. To represent item-vectors for traditional FIM, bit-vectors were used¹¹, so that each bit is set if the corresponding transaction contains the item(set). Therefore g creates the bit-vector, ◦ = AND, f(·) = sum(·) and F(m) = m.
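For traditional FIM the four functions reduce to a few bit operations, which can be sketched as follows. This is an illustration only: it uses java.util.BitSet rather than the Colt BitVector used in the experiments, and the class and method names (BitVectorFim, g, join, f, F) are assumptions chosen to mirror the symbols in the text.

```java
import java.util.BitSet;

final class BitVectorFim {
    /** g: build the bit-vector of an item from the indices of the transactions that contain it. */
    static BitSet g(int numTransactions, int... transactionIds) {
        BitSet v = new BitSet(numTransactions);
        for (int t : transactionIds) {
            v.set(t);
        }
        return v;
    }

    /** The join operator (AND): the transactions containing both itemsets. Arguments are not modified. */
    static BitSet join(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    /** f = sum: the number of set bits, i.e. the support count. */
    static int f(BitSet v) {
        return v.cardinality();
    }

    /** F(m) = m: the interestingness measure is the support itself. */
    static int F(int m) {
        return m;
    }

    public static void main(String[] args) {
        BitSet a = g(5, 0, 1, 3); // item a occurs in transactions 0, 1 and 3
        BitSet b = g(5, 1, 3, 4); // item b occurs in transactions 1, 3 and 4
        System.out.println("support({a,b}) = " + F(f(join(a, b)))); // prints support({a,b}) = 2
    }
}
```

The join here copies its first argument. Calling a.and(b) directly instead would overwrite a, which corresponds to the case in Algorithm 4.2 where the parent's item-vector is no longer needed and the result is written into it.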
Figure 4.7(a) shows the run time¹² of FP-Growth, GLIMIT and Apriori¹³ on T10I4D100K, as well as the number of frequent items. The analogous graph for T40I10D100K is shown in figure 4.7(b). Apriori was not run in this experiment as it is too slow. These graphs clearly show that when the support threshold is below a small value (about 0.29% and 1.2% for the respective data sets), FP-Growth is superior to GLIMIT. However, above this threshold GLIMIT outperforms FP-Growth significantly. Figure 4.8(a) shows this more explicitly by presenting the run time ratios for T40I10D100K. FP-Growth takes at worst 19 times as long as GLIMIT. These results indicate that GLIMIT is superior above a threshold. Furthermore, this threshold is very small and practical applications usually mine with much larger thresholds than these.
GLIMIT scales roughly linearly in the number of frequent itemsets. Figure 4.8(b) demonstrates this experimentally by showing the average time to mine a single frequent itemset. The value for GLIMIT is quite stable, rising slowly toward the end (as in these cases GLIMIT must still check itemsets, but very few of them turn out to be frequent). FP-Growth, on the other hand, clearly does not scale linearly. The reason behind these differences is that FP-Growth first builds an FP-tree. This effectively stores the entire data set (minus infrequent single items) in memory. The FP-tree is also highly cross-referenced so that searches are fast. The downside is that this takes significant time and a lot of space. This pays off extremely well when the support threshold is very low, as the frequent itemsets can be read from the tree very
¹⁰ http://www.cs.umb.edu/~laur/ARtool/.
¹¹ The Colt (http://dsd.lbl.gov/~hoschek/colt/) BitVector implementation was used.
¹² Pentium 4, 2.4GHz with 1GB RAM running Windows XP Pro.
¹³ Apriori was not run for extremely low support as it took longer than 30 minutes for minSup ≤ 0.1%.
96 4.8. CONCLUSION AND FUTURE WORK
quickly. However, when minSup is larger, much of the time and space is wasted. GLIMIT uses time and space as needed, so it does not waste as many resources, making it fast. The downside is that the operations on bit-vectors (in our experiments, of length 100,000) can be time consuming when compared to the search on the FP-tree, which is why GLIMIT cannot keep up when minSup is very small.
Figure 4.9 shows the maximum and average¹⁴ number of item-vectors our algorithm uses as a percentage of the number of items. At worst, this can be interpreted as the percentage of the data set in memory. Although the worst case space is 1.5 times the number of items, n (lemma 4.7), the figure clearly shows this is never reached in these experiments. The maximum was approximately 0.82n. By the time it got close to 1.5n, minSup would be so small that the run time would be infeasibly large anyhow. Furthermore, the space required drops quite quickly as minSup is increased (and hence the number of frequent items decreases). Figure 4.9 also shows that the maximum frontier size is very small (recall from lemma 4.7 that it is bounded above by ⌈l/2⌉). Finally, recall that the algorithm can avoid using the prefix tree and sequence map on the FIM problem, so the only space required is for the item-vectors and the frontier. That is, the space required is truly linear.
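To give a feel for these proportions, the following Java sketch converts them into bytes for T10I4D100K (n = 870 items, 100,000 transactions). It is a back-of-the-envelope estimate, not a measurement from the experiments: it assumes every item-vector is a dense, uncompressed bit-vector with one bit per transaction and ignores all object overhead. The names SpaceEstimate and bytesFor are hypothetical.

```java
final class SpaceEstimate {
    /** Bytes for the given number of dense bit-vectors of bitsPerVector bits each (no overhead). */
    static long bytesFor(double vectors, int bitsPerVector) {
        return Math.round(vectors * bitsPerVector / 8.0);
    }

    public static void main(String[] args) {
        int n = 870;        // items in T10I4D100K
        int bits = 100_000; // one bit per transaction
        System.out.println("observed maximum (0.82n): " + bytesFor(0.82 * n, bits) + " bytes");
        System.out.println("worst case (1.5n):        " + bytesFor(1.5 * n, bits) + " bytes");
    }
}
```

Under these assumptions the observed maximum of 0.82n item-vectors corresponds to roughly 8.9 MB and the 1.5n worst case to roughly 16.3 MB.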