
Algorithm and Hardware Architecture for the Discovery of Frequent Sequences

by

Osvaldo Navarro Guzmán

Thesis submitted in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE WITH SPECIALTY IN COMPUTER SCIENCE

at the

Instituto Nacional de Astrofísica, Óptica y Electrónica

Tonantzintla, Puebla

Supervised by:

René Armando Cumplido Parra, PhD, and Luis Villaseñor Pineda, PhD

© INAOE 2012

The author grants to INAOE the permission of


Acknowledgement

Thanks to my parents and my family, for their love, advice and support throughout my whole life.

Thanks to my advisors, for their guidance and the time they invested in my education at INAOE.

Thanks to CONACyT for the grant 51443 provided to the author of this work.


Abstract

Sequential pattern mining is a widely addressed problem in data mining, with applications such as analyzing Web usage, automatic text reuse detection, and analyzing purchase behavior, among others. Nevertheless, with the dramatic increase in data volume, the current approaches become inefficient when dealing with large input datasets, a large number of different symbols and low minimum supports. We propose a new sequential pattern mining algorithm, which follows a pattern-growth scheme to discover frequent patterns, that is, by recursively growing an already known frequent pattern p using frequent symbols from the projected database with respect to p. Our algorithm only maintains in memory a structure of the pseudo-projections and the symbols required in case it has to go back and try to grow a pattern with another valid element. Also, we propose a hardware architecture that implements the processes of generating pseudo-projection databases and finding frequent elements in a projection database, which comprise the most costly operations of our algorithm, in order to accelerate its running time. Experimental results showed that our algorithm has better performance and scalability in comparison with the UDDAG and PLWAP algorithms. Moreover, a performance estimate showed that our hardware architecture significantly reduces the running time of our proposed algorithm. To our knowledge, this is the first hardware architecture that tackles the problem of sequential pattern mining.


Contents

Acknowledgement i

Abstract iii

Contents iv

List of Figures vi

Nomenclature viii

1 Introduction 1

1.1 Motivation . . . 1

1.2 Implementation Platform Limitations . . . 2

1.3 Main Objective . . . 2

1.4 Specific Objectives . . . 2

1.5 Proposed Solution . . . 3

1.6 Contributions . . . 3

1.7 Document Organization . . . 3

2 Related Work 5

2.1 Sequence Pattern Mining . . . 6

2.1.1 Candidate Generation Approaches . . . 7

2.1.2 Pattern-Growth Approaches . . . 9

2.1.3 Hybrid Approaches . . . 12

2.2 Frequent Word Sequence Mining . . . 14

2.3 Pattern Mining from Data Streams . . . 16

2.4 Hardware Architectures . . . 17

2.5 Discussion . . . 18

3 Fundamentals 20

3.1 Sequence Pattern Mining . . . 20

3.2 Reconfigurable Computing . . . 23

3.2.1 Basic Concepts . . . 23

3.2.2 FPGAs . . . 24

3.2.3 Architecture Design . . . 24

4 Proposed Method 26

4.1 Problem Analysis . . . 26

4.2 Algorithm TreeMine . . . 28

4.2.1 Correctness Proof . . . 37

4.2.2 Proof of Statement 1 . . . 37

4.2.3 Proof of Statement 2 . . . 38

4.3 Architecture Design . . . 40

4.3.1 Memory Module . . . 40

4.3.2 Projection Module . . . 41

4.3.3 Valid Elements Module . . . 42

4.3.4 Control Module . . . 44

5 Experimental Results 45

5.1 Evaluation Elements . . . 45

5.2 Software Implementation Evaluation . . . 46

5.2.1 Comparative Estimate . . . 48

5.2.2 Discussion . . . 52

6 Conclusions and Future Work 53

6.1 Conclusions . . . 54

6.2 Contributions . . . 55

6.3 Future Work . . . 56


List of Figures

2.1 Related work’s general diagram . . . 6

2.2 A sequence database with a projection database and a pseudo-projection created from it, using the word ‘ability’ as reference . . . 8

2.3 DMA-Strip for the sequence AC → AH → HJL. Figure taken from (Tan et al., 2006). . . . 10

2.4 Example of a sequence database and the corresponding WAP tree. Figure, taken and modified, from (Ezeife and Lu, 2005). . . . 11

2.5 Forest of first-occurrence subtrees of a, using the example database D in Figure 2.4(a). Figure taken from (Ezeife and Lu, 2005). . . . 12

2.6 PLWAPLong aggregate tree using the example database D in Figure 2.4(a). Figure taken from (Ezeife et al., 2009). . . . 13

2.7 Example of the UDDAG data structure . . . 14

2.8 Example of the document representation by their contiguous word pairs. Figure taken from (García Hernández, 2007). . . . 15

2.9 Data structure built from the documents listed in Figure 2.8. Figure taken from (García Hernández, 2007). . . . 15

3.1 Example of a database containing 3 sequences (a) and its corresponding sequential patterns (b), for a frequency threshold of 2. . . . 22

4.1 Example of the search space representation of a sequence database . . . 29

4.2 Example of the algorithm’s functionality. The figure shows a sequence database, the set of frequent elements obtained from it, and the filtered database. . . . 32


4.3 Example of the algorithm’s functionality. The figure shows two recursive calls to the main mining process, where 3 valid patterns are found. . . . 33

4.4 Example of the algorithm’s functionality. The upper part shows the next recursive call with respect to Figure 4.3, where the input database does not contain any word. The algorithm goes back to the previous pattern and does another recursive call with a different valid word (lower part of the figure), which also does not contribute any valid pattern. . . . 34

4.5 Patterns found by growing the pattern ‘away’ in the algorithm’s example. . . . 35

4.6 Recursive calls of the Mine function to grow the pattern ‘cat’. . . . 36

4.7 Last recursive call to grow the pattern ‘cat’. The projection database does not contain any word, so the recursion stops in this direction. . . . 36

4.8 Patterns found by growing the pattern ‘cat’ in the algorithm’s example. . . . 37

4.9 General diagram of the proposed hardware architecture . . . 41

4.10 Diagram of the memory module . . . 42

4.11 Diagram of the projection module . . . 42

4.12 Diagram of the valid elements module . . . 43

4.13 Diagram of the control module . . . 44

5.1 Comparison of running time for different minimum supports. . . . 47

5.2 Performance of our proposed algorithm, at different percentages of the corpus. . . 48

5.3 Performance of our proposed algorithm, at different sizes of the symbol set. . . 49

5.4 Zoomed version of the graph in Figure 5.3. In this figure, the performance of the three algorithms at the larger symbol set sizes can be seen in more detail. . . . 49


5.5 Comparative estimate of the UDDAG algorithm, our proposed algorithm and our proposed architecture. The figure shows the number of iterations done by the UDDAG and PLWAP algorithms in a series of tests with minimum supports from 20 to 10. The algorithms are compared with a series of estimates of the number of iterations required by the hardware architecture for the same tests. . . . 51


Chapter 1

Introduction

1.1 Motivation

There is now a vast amount of data stored in digital form, available in several fields, and there is great interest in analyzing that data in order to obtain valuable information. Tackling this problem is the main objective of the research area of Knowledge Discovery in Databases (KDD), which, according to Fayyad et al. (1996), can be defined as follows: KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. In other words, KDD is a field that attempts to discover interesting patterns that help us build useful knowledge and make beneficial decisions. Within this process there is a key step, named data mining, which Fayyad et al. (1996) also define: Data mining is a step in the KDD process that consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data.

Furthermore, according to Leavitt (2002), estimates suggest that at least 80 percent of today’s data is in an unstructured textual format. Therefore, there is a need for methods that discover useful patterns in text with no given structure. Also, since text keeps a natural sequential order between words, it is desirable that the discovered patterns preserve that order. That is the main objective of the sequential pattern mining problem, i.e., to find patterns that maintain their


original sequential order, among a set of sequences.

1.2 Implementation Platform Limitations

Due to the dramatic pace at which data is collected nowadays, most current data mining methods are becoming ineffective. This is because most of the popular data mining methods were created when the common dataset size was several orders of magnitude smaller (García-Pedrajas and de Haro-García, 2011). Moreover, the majority of those methods are implemented in software. In this environment, there are several limitations that negatively affect the performance of an algorithm’s implementation, such as having to read each required instruction from memory and decode it to know which action to execute. Furthermore, if an algorithm requires an operation that cannot be carried out by any of the processor’s instructions, it must be constructed from the processor’s instruction set (Hernández, 2009). Thus, there is a need for methods that can process large databases in an acceptable time, and it is desirable to implement them in environments with convenient characteristics.

1.3 Main Objective

Design an algorithm that discovers all the frequent sequences from a sequence database, and design a hardware architecture based on that algorithm, whose running time improves by at least one order of magnitude in comparison with state-of-the-art algorithms.

1.4 Specific Objectives

• Design a hardware-oriented algorithm for the discovery of frequent sequences.

• Design a hardware architecture that implements the most expensive operations of the algorithm devised.


1.5 Proposed Solution

We present the algorithm TreeMine, a method for the discovery of sequential patterns.

TreeMine is based on the algorithm UDDAG (Chen, 2009), and also relies on the pattern-growth strategy (Mortazavi-Asl et al., 2001) to discover valid patterns by growing already mined valid patterns, traversing the search space in a depth-first fashion. We eliminated the need for the candidate-generation phase, which was a bottleneck stage of the UDDAG algorithm. Experimental results showed that our algorithm performs significantly better than the UDDAG (Chen, 2009) and PLWAP (Ezeife and Lu, 2005) algorithms, two prominent methods of sequential pattern mining. Moreover, we propose a new hardware architecture, which is based on our designed algorithm. Our architecture implements the most costly operations of our algorithm in order to accelerate its execution, which in turn improves the overall performance of our method.

1.6 Contributions

The contributions of our thesis project are as follows:

• An algorithm for the discovery of sequential patterns.

• A hardware architecture based on our proposed algorithm. To our knowledge, this is the first hardware architecture designed to mine sequential patterns.

1.7 Document Organization

This document is organized as follows. Chapter 2 presents a summary of the related work on sequential pattern mining, maximal frequent sequence mining, and the hardware architectures proposed to implement pattern mining methods. Chapter 3 presents the basic concepts and ideas necessary to understand the sequential pattern mining problem. Chapter 4 presents our proposed method and hardware architecture.


Chapter 5 describes the experiments carried out to evaluate our algorithm and architecture, as well as the results obtained from them. Finally, Chapter 6 presents the conclusions derived from this thesis project, its main contributions, and suggestions for future work.


Chapter 2

Related Work

In this chapter, a general overview of the most relevant approaches to the sequential pattern mining problem is presented. We also describe a few approaches to two similar problems: mining maximal frequent word sequences with a gap restriction, and mining patterns from data streams. The former problem differs from sequential pattern mining mainly in that it only outputs maximal sequential patterns, that is, frequent sequences that have no frequent super-sequences. Also, it deals only with sequence databases composed of text, and the proposed solutions may have features specific to this kind of database. On the other hand, pattern mining from data streams deals with finding frequent single items, itemsets or association rules from a stream of data, instead of a traditional static database. Processing itemsets from a data stream entails several issues beyond those inherent to processing itemsets from a static database, such as choosing an appropriate data processing model that can handle the very high speed at which the data is received (data which, for the same reason, can only be read once), and implementing a good strategy for memory management and resource awareness, among others (Jiang and Gruenwald, 2006). Finally, two hardware implementations for the frequent pattern mining problem, a similar problem, are described as an example of the efforts being made under the reconfigurable computing paradigm. A diagram showing these methods under their corresponding categories can be seen in Figure 2.1.


[Figure 2.1 summarizes the related work in four categories:]

Sequence Pattern Mining — Apriori-based: AprioriAll (Agrawal, 1995), SPADE (Zaki, 2001), SEQUEST (Tan et al., 2006); Pattern-growth: PLWAP (Ezeife and Lu, 2005), FOF (Peterson and Tang, 2008), PLWAPLong (Ezeife et al., 2009), TreeMine*; Hybrid: UDDAG (Chen, 2009)

Maximal Frequent Word Sequence Mining — Ahonen-Myka (2002), García Hernández (2007)

Frequent Pattern Mining Architectures — Sun et al. (2008), Mesa (2010)

Pattern Mining from Data Streams — Hou et al. (2008), Lin et al. (2006)

Figure 2.1: Related work’s general diagram

2.1 Sequence Pattern Mining

The sequence pattern mining problem was first defined in (Agrawal, 1995). Since then, many other approaches have been proposed. These methods can be categorized, by the way they discover sequential patterns, into three categories: candidate generation approaches, pattern-growth approaches and hybrid approaches.

The candidate generation approaches usually discover the sequential patterns in a breadth-first fashion, i.e., they obtain all the frequent sequences of size k at the k-th iteration of the algorithm as it traverses the search space. Also, these methods depend on a feature named generate-and-test (Mabroukeh and Ezeife, 2010) to carry out the mining. This feature entails growing already found patterns by one item at a time and then testing the generated candidates against the minimum support.

The pattern-growth strategy is characterized by traversing the search space in a depth-first fashion. It offers the advantage of not having to generate candidate patterns, which avoids the combinatorial explosion of patterns when dealing with large databases. Because of this, the pattern-growth strategy has become the most used scheme, preferred over the candidate generation strategy.

In the pattern-growth strategy, instead of generating pattern candidates and then testing each one against the minimum support, an already known valid


pattern of length k is chosen as a pivot; then a search is done in the input database for elements that concur frequently enough with that pattern, to form new valid patterns of size k+1. Next, each one of the patterns of length k+1 is used as a pivot as well, and the process repeats recursively until all the frequent patterns are found. This process is usually called pattern growing: prefix growing if the concurring element appears after the pattern, or suffix growing if it appears before the pattern. Also, to avoid searching the entire input database in each recursive call, the database projection with respect to the pivot pattern is used, so a smaller database is searched with each recursive call to the mining process. However, using database projections entails building a sequence database with each recursive call, which can be very costly in terms of memory usage. A good solution to this issue is using pseudo-projections (Mortazavi-Asl et al., 2001), a concept that has since been adopted by other approaches (Chen (2009); Antunes (2004)). The main idea behind pseudo-projections is that, instead of generating a whole new database, a set of pointers is created, each one pointing to the sequence id and the position where the subsequence that belongs to the database projection is located. This results in considerable memory savings. Figure 2.2 shows an example of a sequence database, a projection database generated from it, and the pseudo-projection that represents that projection database.
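The pseudo-projection idea can be sketched as follows. This is a minimal illustration, not any of the cited implementations: the function name and the (sequence id, position) tuple layout are our own.

```python
def pseudo_project(database, pointers, symbol):
    """Advance each (sequence id, start) pointer past the first occurrence of
    `symbol`; no suffix of the database is ever copied."""
    projected = []
    for seq_id, start in pointers:
        seq = database[seq_id]
        for pos in range(start, len(seq)):
            if seq[pos] == symbol:
                projected.append((seq_id, pos + 1))  # points at the suffix
                break
    return projected

db = [list("abcab"), list("acba"), list("bca")]
initial = [(i, 0) for i in range(len(db))]
proj_a = pseudo_project(db, initial, "a")
# the pointers stand for these suffixes of the original database
suffixes = [db[i][p:] for i, p in proj_a]
```

Each recursive call only creates a new list of pointers, so the memory cost per projection is proportional to the number of surviving sequences, not to their lengths.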

The hybrid approaches comprise methods that combine one or more features from both the candidate generation and the pattern-growth approaches.

2.1.1 Candidate Generation Approaches

Agrawal (Agrawal, 1995) proposed the AprioriAll algorithm to tackle the problem of finding sequential patterns in a database of customer transactions, where each transaction consists of a customer id, the transaction time, and the items bought in the transaction. This algorithm traverses the search space in a breadth-first fashion, as it finds all the patterns of length k in the k-th iteration before moving on to the next one. To find the patterns at an iteration level, this method generates a set of candidate patterns, which are then tested against the minimum support. The algorithm also incorporates a pruning technique based


Figure 2.2: A sequence database with a projection database and a pseudo-projection created from it, using the word ‘ability’ as reference

on the antimonotonic property that states that if a sequence cannot pass the minimum support test, all its super-sequences will also fail the test (Mabroukeh and Ezeife, 2010).

This method has the main disadvantage of generating an explosive number of candidates, particularly at early stages of the mining process, which greatly increases its running time and consumes a great amount of memory.
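Generate-and-test mining with antimonotone pruning can be sketched as follows, for the simplified case of single-item sequences. This is an illustration of the Apriori-style scheme, not the AprioriAll implementation itself; the function names are ours, and transaction grouping and time constraints are omitted.

```python
def is_subseq(pat, seq):
    """True if `pat` occurs in `seq` with order preserved (gaps allowed)."""
    it = iter(seq)
    return all(item in it for item in pat)

def support(pat, db):
    """Number of sequences in `db` that contain `pat` as a subsequence."""
    return sum(1 for seq in db if is_subseq(pat, seq))

def apriori_sequences(db, min_sup):
    items = sorted({x for seq in db for x in seq})
    freq1 = [(x,) for x in items if support((x,), db) >= min_sup]
    frequent, level = list(freq1), list(freq1)
    while level:
        # generate: grow each frequent k-pattern by one frequent item;
        # by antimonotonicity, extensions of infrequent patterns are never built
        candidates = [p + x for p in level for x in freq1]
        # test: keep only the candidates that pass the minimum support
        level = [c for c in candidates if support(c, db) >= min_sup]
        frequent.extend(level)
    return frequent

db = [list("ab"), list("ab"), list("ba")]
patterns = apriori_sequences(db, 2)
```

Even with pruning, the candidate list at each level grows with the number of frequent patterns times the number of frequent items, which is the candidate explosion described above.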

Zaki (Zaki, 2001) presents the algorithm SPADE. This method discovers not only single-item sequences, but sequences of subsets of items as well. SPADE represents the search space as a lattice of frequent sequences, which can be traversed in either depth-first or breadth-first fashion. The algorithm is composed of 3 steps: first, it finds all the frequent 1-sequences; in the second step the algorithm finds the frequent 2-sequences. The third and main step involves traversing the lattice, testing candidates against the minimum support to find the rest of the frequent sequences. In this step, an id-list is generated for each frequent sequence in the lattice, which is a list of all the input-sequence identifier and item-set identifier pairs that contain that sequence. To obtain the support count of a candidate k-pattern, the algorithm performs temporal joins over the


id-lists of any two of its (k−1)-subsequences. Moreover, as maintaining all the intermediate id-lists in memory would not be possible for databases of considerable size, Zaki (2001) proposed breaking the lattice into disjoint subsets called

equivalence classes. Thus, each equivalence class can be loaded into main memory and processed separately.

Even with the improvement of reading the original database only three times instead of constantly, this method still faces the problem of generating a huge number of candidates, especially when dealing with large databases and/or low minimum supports.
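The id-list join idea can be sketched as follows. This is a simplified illustration with our own function names, single items per event, and the simplest join case (extending a single item by another item); it is not SPADE's actual data layout.

```python
def make_idlist(db, item):
    """id-list of a single item: (sequence id, position) pairs of its occurrences."""
    return [(sid, pos) for sid, seq in enumerate(db)
            for pos, x in enumerate(seq) if x == item]

def temporal_join(idlist_p, idlist_y):
    """Temporal join: keep occurrences of the item that appear after an
    occurrence of the pattern in the same sequence."""
    return [(sid, y_pos) for sid, y_pos in idlist_y
            if any(s == sid and p_pos < y_pos for s, p_pos in idlist_p)]

def support_of(idl):
    # support = number of distinct sequences present in the id-list
    return len({sid for sid, _ in idl})

db = [list("abca"), list("bac"), list("ccb")]
ab = temporal_join(make_idlist(db, "a"), make_idlist(db, "b"))
```

The support of a candidate is read directly off the joined id-list, without re-scanning the database.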

Tan et al. (Tan et al., 2006) proposed SEQUEST, an algorithm that relies on a structure called Direct Memory Access Strips (DMA-Strips) to efficiently generate candidate patterns. A DMA-Strip represents a single sequence in a database, and is composed of an ordered list which stores each item’s label, scope and event-id. The event-id groups items based on their timestamps, and the scope is used to determine the relationship between two consecutive items in a strip. An example of a DMA-Strip can be seen in Figure 2.3. To generate a candidate pattern, the algorithm iterates through a DMA-Strip and extends one item at a time. Next, to test a candidate pattern against the minimum support, two approaches were proposed: a vertical join counting approach similar to the one used in (Zaki, 2001), and a horizontal counting approach using a hash table. Experimental results indicated that the vertical approach performed better than the horizontal approach. However, the vertical approach's performance can degrade if the frequency of extracted candidate patterns is high, as usually occurs when the minimum support is low, due to the complexity of the join approach.

2.1.2 Pattern-Growth Approaches

Ezeife and Lu (Ezeife and Lu, 2005) present a method for mining sequential patterns from a web access sequence database, which uses a data structure named

aggregate tree (Spiliopoulou, 1999) to efficiently access the sequences to be mined. First, this algorithm stores the database in a prefix tree, where each node contains a sequence element, a frequency counter and a position code, which helps to find


Figure 2.3: DMAStrip for the sequence AC → AH → HJL. Figure taken from (Tan et al., 2006).

the ancestor and the descendants of any node; in this manner, each path from the root to any node corresponds to a sequence in the original database. Next, each node of the same type in the tree is linked, to assist node traversal; an example of an aggregate tree is shown in Figure 2.4. Finally, the algorithm recursively mines the tree, using prefix conditional sequence search, to find all the sequential patterns.

One of the benefits of this method is that it only reads the original database twice, which helps to reduce I/O access costs or extra storage requirements. Also, the algorithm does not have to build a data structure at each level of recursion, which yields an improvement in running time, compared with previous works, which used the same data structure (Pei et al., 2000).

Peterson and Tang (Peterson and Tang, 2008) propose an algorithm which also uses a data structure based on the aggregate tree. To discover the frequent word sequences, the algorithm uses conditional search to traverse the search space in a depth-first fashion. This method does not link the elements of the same type in the aggregate tree, but instead builds a forest of first-occurrence subtrees as the basic data structure to represent the database projections. Given a symbol w, the forest of first-occurrence subtrees of w is a list of pointers to the first occurrences of w in the aggregate tree. The sum of the counts of the nodes pointed to by the forest gives the frequency of w in the sequence database. Also, the subtrees rooted at the children of the nodes pointed to by the list represent the projection database with respect to the word w. Because the nodes of the forest are already in the aggregate tree, memory is required only for the list of pointers.
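The frequency-from-counts idea can be sketched with a minimal count-annotated prefix tree; the node layout and function names here are illustrative, not the authors' implementation.

```python
class Node:
    def __init__(self, symbol):
        self.symbol, self.count, self.children = symbol, 0, {}

def insert(root, seq):
    """Insert a sequence into the prefix tree, incrementing node counters."""
    node = root
    for x in seq:
        node = node.children.setdefault(x, Node(x))
        node.count += 1

def first_occurrences(node, symbol):
    """Collect the first occurrence of `symbol` on each branch;
    the search does not descend below a matching node."""
    found = []
    for child in node.children.values():
        if child.symbol == symbol:
            found.append(child)
        else:
            found.extend(first_occurrences(child, symbol))
    return found

root = Node(None)
for seq in [list("abc"), list("ab"), list("ba")]:
    insert(root, seq)
# frequency of 'b' = sum of the counts of its first-occurrence nodes
freq_b = sum(n.count for n in first_occurrences(root, "b"))
```

Because each sequence contributes to the counter of at most one first-occurrence node per symbol, summing those counters yields the symbol's frequency without scanning the sequences again.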


Figure 2.4: Example of a sequence database and the corresponding WAP tree. Figure, taken and modified, from (Ezeife and Lu, 2005).

Despite the memory saving of building only a list of pointers instead of a whole projection, the algorithm still has to build the entire aggregate tree before the mining process. This would require a lot of memory when dealing with large databases, which is usually the case in the text mining area.

Ezeife et al. (Ezeife et al., 2009) present PLWAPLong, based on the algorithm PLWAP (Ezeife and Lu, 2005). PLWAPLong uses the same data structure as PLWAP, but a simpler position code numbering scheme to identify ancestor and descendant relationships in the aggregate tree. PLWAP uses a single code to identify the position of each node, which can hold up to thirty-two nodes. For long sequences that exceed thirty-two bits, PLWAP uses linked lists, which has a negative impact on performance. Instead, PLWAPLong uses two codes, start and end position, to identify the position of each node in longer sequences; an example of this structure is shown in Figure 2.6. In this way, PLWAPLong is capable of dealing with larger databases than PLWAP, with a lower impact on performance. Moreover, PLWAPLong uses a more efficient approach than PLWAP to find the first occurrences of a word in the aggregate tree, a recurrent operation in the mining process used to determine the frequency of a word in a projected database. PLWAPLong avoids checking all the occurrences


Figure 2.5: Forest of first-occurrence subtrees of a, using the example database D in Figure 2.4(a). Figure taken from (Ezeife and Lu, 2005).

of a word in the projected database when finding the first occurrences; instead, it determines the last descendant of each word in the aggregate tree, which is then used to skip unnecessary nodes.

While PLWAPLong can process larger databases than PLWAP, the algorithm was tested on databases with rather small vocabularies and average document sizes, in comparison with the ones used in text mining tasks. Furthermore, PLWAPLong also has to build the entire aggregate tree before the mining process, which is costly when dealing with databases with large vocabularies and a large average document size.

2.1.3 Hybrid Approaches

Chen (Chen, 2009) proposes a novel data structure named UpDown Directed Acyclic Graph (UDDAG), which supports bidirectional pattern growth from both ends of the discovered patterns, in order to perform fast conditional search. First, the algorithm finds the frequent words in the sequence database and uses them to filter infrequent elements out of the database. Next, a frequent word is chosen to be the root of a UDDAG, and two database projections are created; one projection is formed by the prefixes of the sequence database with respect to the


Figure 2.6: PLWAPLong aggregate tree using the example database D in Figure 2.4(a). Figure taken from (Ezeife et al., 2009).

root pattern and the other one is formed by the suffixes. Frequent words are then found in these projections, and are appended to the root pattern to form new frequent patterns; thus it is said that in each level of recursion the root pattern is grown from both ends. Finally, the algorithm combines the patterns found in both projections to form candidate patterns, which are tested against the minimum support, in order to find new frequent patterns. Each one of the projections created is treated as a new database and the previous process is repeated, recursively. An example of the UDDAG data structure can be seen in Figure 2.7.

The UDDAG algorithm is a hybrid approach, as it uses conditional search in combination with candidate generation to find all the sequential patterns. One advantage of its novel data structure is that at each level of recursion the root pattern is grown from both sides concurrently, whereas other pattern-growth methods only grow the length of the root pattern by 1 at each level of recursion. However, the need for a candidate generation phase at each level of recursion negatively impacts the algorithm’s performance, and this impact becomes unacceptable when processing large databases at low minimum supports, due to the explosive number of candidates generated.
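The bidirectional growth step can be sketched as follows. This is a simplified illustration of the idea, with our own function names and single-word prefix and suffix patterns, not Chen's UDDAG implementation.

```python
def is_subseq(pat, seq):
    it = iter(seq)
    return all(x in it for x in pat)

def combine(prefix_patterns, root, suffix_patterns):
    """Form bidirectional candidates prefix + root + suffix, growing the
    root pattern from both ends in a single step."""
    return [p + (root,) + s for p in prefix_patterns for s in suffix_patterns]

def verify(candidates, db, min_sup):
    # the combined candidates must still be tested against the minimum support
    return [c for c in candidates
            if sum(is_subseq(c, s) for s in db) >= min_sup]

db = [list("xaby"), list("xazy"), list("aby")]
cands = combine([("x",)], "a", [("y",)])
frequent = verify(cands, db, 2)
```

The `verify` step is the candidate-generation bottleneck discussed above: the number of combined candidates is the product of the prefix-side and suffix-side pattern counts.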


Figure 2.7: Example of the UDDAG data structure

2.2 Frequent Word Sequence Mining

Ahonen-Myka (Ahonen-Myka, 2002) tackles a problem similar to the one addressed in this document, proposing a method for the discovery of maximal frequent word sequences with a gap restriction in a set of documents. A frequent word sequence s is maximal if there is no super-sequence of s that is frequent as well. The algorithm starts by collecting all the ordered pairs that are frequent in the document collection and that meet the desired gap restriction. Next, each pair is expanded by adding words to it and then testing the newly formed pattern against the minimum support. The expansion continues until the new pattern is no longer frequent. After that, the resulting patterns are joined together to form candidate patterns, which are also tested against the minimum support. Next, a pruning technique is applied to remove all the patterns that are subsequences of an already found maximal frequent sequence. The whole process repeats until there are no more frequent patterns left to combine.
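The maximality pruning step can be sketched as follows; this is a minimal illustration with our own function names, ignoring the gap restriction.

```python
def is_subseq(pat, seq):
    """True if `pat` occurs in `seq` with order preserved (gaps allowed)."""
    it = iter(seq)
    return all(x in it for x in pat)

def maximal_only(patterns):
    """Keep only patterns that are not proper subsequences of another
    found pattern."""
    return [p for p in patterns
            if not any(p != q and is_subseq(p, q) for q in patterns)]

found = [("a",), ("b",), ("a", "b"), ("a", "b", "c")]
maximal = maximal_only(found)
```

Dropping every subsequence of an already found maximal sequence is what shrinks the output to maximal patterns only.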

García Hernández (García Hernández, 2007) proposes a series of algorithms to discover maximal frequent word sequences with a gap restriction from a set of documents, as well as from a single document. Based on the work of Ahonen-Myka (Ahonen-Myka, 2002), all of the proposed methods represent the input document set as the set of word pairs contained in each one of those documents.


Figure 2.8: Example of the document representation by their contiguous word pairs. Figure taken from (García Hernández, 2007).

Figure 2.9: Data structure built from the documents listed in Figure 2.8. Figure taken from (García Hernández, 2007).

An example of this representation is shown in Figure 2.8. All of the algorithms use a similar structure, which is built as follows. First, a list of the occurrences of every single pair in each one of the documents is made. Then, the pairs of every document are linked to preserve the sequential order among them; the data structure for the documents in Figure 2.8 can be seen in Figure 2.9. Finally, the data structure is traversed to discover the maximal sequential patterns, by taking each pair of words and growing that pattern until the maximal sequential patterns with that pair as a prefix are found.
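The word-pair indexing idea can be sketched as follows; the dictionary of occurrence lists shown here is a simplified stand-in for the linked structure of Figure 2.9, with our own names.

```python
from collections import defaultdict

def pair_index(documents):
    """Index every contiguous word pair by the documents and positions where
    it occurs; positions within a document preserve sequential order."""
    index = defaultdict(list)
    for doc_id, words in enumerate(documents):
        for pos in range(len(words) - 1):
            index[(words[pos], words[pos + 1])].append((doc_id, pos))
    return index

docs = [["the", "cat", "ran"], ["the", "cat", "sat"]]
idx = pair_index(docs)
# a pair is frequent if it occurs in at least `min_sup` distinct documents
freq = lambda pair: len({d for d, _ in idx[pair]})
```

Growing a pattern then amounts to following, within each document, the pair whose position immediately continues the current occurrence.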


2.3 Pattern Mining from Data Streams

Mining patterns from a data stream is a task needed in emerging applications, such as network traffic monitoring and web click stream analysis (Jiang and Gruenwald, 2006). The very high speed and the huge amount of data received from the stream make it necessary to use different strategies for mining frequent patterns than the ones used in a traditional database environment.

Hou et al. (Hou et al., 2008) propose an algorithm for mining frequent itemsets from a data stream with concept drift. Concept drift means that the underlying process that generates the data can change over time, in particular the minimum support and the distribution of the data. The algorithm finds an approximate set of frequent itemsets, whose associated error is upper-bounded with good probability. To address the problem of a changing minimum support, the method keeps track of subfrequent itemsets, which are potential frequent itemsets. These itemsets are stored in a tree-based structure called FI-tree, which is updated with every transaction arriving at the system. This structure is checked whenever a change in the minimum support could make any of the itemsets stored in the FI-tree become frequent, so they can be included in the set of frequent itemsets. The algorithm was tested for performance and precision, and showed to be effective.

Lin et al. (Lin et al., 2006) propose a method for mining frequent itemsets from data streams that also supports changing the minimum support during the mining process. In a traditional database environment, updating the set of patterns according to a new minimum support would entail re-scanning the database, looking for patterns that meet the frequency threshold. However, the very high speed at which data is received and the huge amount of data managed in data stream applications make it unfeasible to re-scan the data received each time the minimum support changes. Therefore, other strategies are needed for updating the set of patterns without having to re-mine the entire database. To achieve this, Lin et al. propose a method that reads the data stream on a bucket-by-bucket basis and stores the potentially frequent itemsets in a tree structure. To update the set of patterns according to a change in the minimum support, a synopsis vector is used, which is a data structure that only maintains statistics of past transactions, thus avoiding re-scanning the entire database.

2.4 Hardware Architectures

To our knowledge, our proposed architecture is the first hardware implementation approach that tackles the problem of sequential pattern mining. There are, however, some hardware implementation approaches for mining frequent itemsets, which is the predecessor problem of sequential pattern mining. Here we describe some of the recent efforts made to solve this problem.

Sun et al. (Sun et al., 2008) present a hardware architecture for frequent pattern mining, which is based on the FP-Growth algorithm (Han et al., 2004). The architecture relies mainly on a systolic tree to store the patterns, scanning and counting the occurrences of each one of them. A systolic tree is an arrangement of pipelined processing units in a tree pattern; the root processing unit acts as an interface to the rest of the tree. The architecture has 3 modes of operation: write, scan and count. In the write mode the systolic tree is built, by loading a single item into the tree with each clock cycle. If the item is already contained in a node, its corresponding count value is increased. Otherwise, an empty node is allocated with the item value. Next, the architecture switches to scan mode, in order to begin the discovery of the frequent itemsets. This is done by candidate itemset dictation, that is, a brute-force method is used to generate candidate itemsets, which are then sent to the systolic tree. After an itemset is entirely sent to the systolic tree, the architecture switches to count mode. In this mode, the corresponding processing units report the count of every item in the itemset to their unique parent. Each parent node collects the counts of its children and sends them to its own parent. Finally, the frequency of the itemset is returned by the root node of the systolic tree.

Mesa (Mesa, 2010) proposes an algorithm for mining frequent patterns which is specifically designed to be implemented as a hardware architecture. This method represents the input database and the itemsets as vertical bit vectors. In this way, the frequency of an itemset can be easily calculated by counting the number of 1's in the bit vector that results from ANDing the vectors of the items that form the itemset. The algorithm uses a series of interconnected binary trees to represent the search space, each one of them assigned to discover frequent itemsets that start with a different item. This structure is mapped to the architecture as a systolic tree, which has three modes of operation: fill, frequency threshold and emptying. In the first mode, the architecture sends the input database to the systolic tree, and the processing units determine the frequency of each itemset. Next, the architecture switches to frequency threshold mode, and sends the frequency threshold, along with some control signals, to the systolic tree. Finally, the architecture switches to emptying mode, where the systolic tree returns those itemsets that met the given frequency threshold.

2.5 Discussion

There have been many approaches to providing an efficient solution to the sequential pattern mining problem. However, the current methods still have problems when dealing with large databases and low minimum supports. In this case, the candidate generation approaches face the problem of generating an explosive number of candidate patterns at each iteration, which greatly slows down their performance. On the other hand, as the pattern growth approaches build a data structure to represent the input database or to store the discovered patterns, they usually suffer a great increase in running time for large databases and/or low supports, and the resulting structure is huge, consuming a great deal of memory.

All of the sequential pattern mining methods previously mentioned were implemented in software. An alternative is implementation in a hardware environment, through the design of a hardware architecture. Under this approach, there are several features that can benefit the algorithm's performance, such as pipelining of operations, parallelism, and dedicated modules for certain operations, among others. This, and the issues previously described, are the main reasons for our proposed method. We propose an algorithm that relies on the pattern-growth scheme to discover all the sequential patterns. Our method is based on the algorithm UDDAG (Chen, 2009), as it uses a similar data structure to represent each pattern, and traverses the search space in a depth-first fashion. Also, our algorithm does not build a data structure to represent the input database, but relies on pseudo-projection to do the support counting, and is capable of partitioning the problem as well. These two features are the main characteristics that made our algorithm feasible for an implementation in hardware. Moreover, our method represents the patterns discovered as a series of tree structures, instead of a directed acyclic graph like the UDDAG algorithm. In this way, only one branch of a single tree is kept in main memory during the execution of the algorithm, which helps to save memory. Moreover, our method grows patterns only in one direction with each recursive call, instead of two, which allowed us to avoid the candidate generation phase of the UDDAG algorithm, which is the main bottleneck of that approach. Experimental results showed that our algorithm performs significantly better than the UDDAG algorithm.

Our algorithm is implemented as a hardware architecture, in order to accelerate its running time. The architecture is formed by four main modules, which perform the algorithm's most resource-demanding operations: building pseudo-projection databases and searching for frequent words. Pipelining is used in both processes to accelerate the processing.


Chapter 3

Fundamentals

In this chapter we describe the basic concepts needed to understand the problem of sequence pattern mining. Also, we describe the reconfigurable computing paradigm.

We begin by presenting a formal definition of the sequence pattern mining problem. Next, we describe the basic concepts of reconfigurable computing and its advantages over other algorithm implementation approaches.

3.1 Sequence Pattern Mining

Before giving the formal definition of the sequence pattern mining problem, we need to introduce the following definitions:

Definition 1. Let Σ be the set of symbols. A non-empty sequence s is a finite succession of symbols from Σ, s = s_1 … s_m, such that s_i ∈ Σ for all 1 ≤ i ≤ m < ∞, and s_i and s_j are not necessarily different for i ≠ j. The length of sequence s = s_1 … s_m is m, and we will denote it as len(s). A sequence database D is a set of finite sequences. A pattern is a non-empty sequence.

Definition 2. A non-empty sequence is a subsequence of another sequence if it is embedded in that sequence. In particular, sequence s′ = s′_1 … s′_n is a subsequence of sequence s = s_1 … s_m, denoted as s′ ⊆ s, if and only if n ≤ m and there exist i_1, …, i_n such that 1 ≤ i_1 < … < i_n ≤ m and s′_j = s_{i_j} for all 1 ≤ j ≤ n.


Definition 3. A sequence s in D is said to support a pattern p if p is a subsequence of s. The support of a pattern p in D, denoted as Sup_D(p), is the number of sequences in D that support p. Given a threshold ξ in (0, 1], a pattern p is frequent with respect to ξ and D if Sup_D(p) ≥ ξ|D|, where |D| is the number of sequences in D. ξ|D| is called the absolute threshold and is denoted as η.

Definition 4. Given a sequence s and an element a from Σ such that a ⊆ s, the a-prefix (Antunes and Oliveira, 2003) of s is the prefix of s from the first element (the leftmost element) to the first occurrence of a, inclusive.

For example, the b-prefix of the sequence abaoba is ab. The subsequence abaob is not the b-prefix, because it extends up to the second occurrence of b instead of stopping at the first one.

Definition 5. Given a pair of sequences s = w_1 · w_2 · w_3 · … · w_n and s′ = a_1 · a_2 · a_3 · … · a_m with s′ ⊆ s, the projection of s with respect to s′, denoted as proj(s, s′), is defined as follows:

proj(s, s′) = w_i · w_{i+1} · w_{i+2} · … · w_n    (3.1)

where i = k + m and k is the smallest index such that

w_k = a_1, w_{k+1} = a_2, …, w_{k+m−1} = a_m    (3.2)

that is, the first occurrence of s′ in s starts at position k and ends at position k + m − 1 = i − 1.

Roughly speaking, we can get the projection of a sequence s with respect to a sequence s′ by taking out the symbols in s, from the first one to the last symbol of the first occurrence of s′ in s. The remaining symbols constitute the desired projection. For instance, if we have the sequences s = a·b·c·d·a·b·e and s′ = c·d, the projection of s with respect to s′ is a·b·e.

Definition 6. Given a sequence database D and a sequence s, the projection database of D with respect to s, denoted as Proj_s(D), is defined as the set of projections, with respect to s, of the sequences in D that support s, i.e., Proj_s(D) = {proj(t, s) | t ∈ D, s ⊆ t}.


Figure 3.1: Example of a database containing 3 sequences (a) and its corresponding sequential patterns (b), for a frequency threshold of 2.

Having defined the previous concepts, we can define the sequence pattern mining problem as the task of finding all the frequent sequential patterns in a given sequence database D with respect to a given threshold ξ. For instance, Figure 3.1 shows a database formed by 3 sequences, and its corresponding sequential patterns, for an absolute threshold (also called frequency threshold) of 2.

Proposition 1. Given a sequence database D, a pattern p and a single element e, the support of p·e in D is equal to the support of the element e in the projection of D with respect to p. Formally:

Sup_D(p·e) = Sup_{Proj_p(D)}(e)    (3.3)

Proof. (Tang et al., 2007)

(≥) Let s be a sequence in D that supports p·e. We represent s as s = p̂·ê, such that p̂ is the p-prefix of s, where p ⊆ p̂ and e ⊆ ê. Thus, ê is the projection of s with respect to p, and is part of Proj_p(D). Also, as e ⊆ ê and ê ∈ Proj_p(D), we have that for every sequence in D that supports p·e, there is a sequence in Proj_p(D) that supports e. This means that the support of e in Proj_p(D) is at least equal to the support of p·e in D. Therefore, Sup_{Proj_p(D)}(e) ≥ Sup_D(p·e).


(≤) Now let ẽ be a sequence in Proj_p(D) that supports e. Since ẽ is the projection of a sequence s′ in D with respect to p, we have that s′ = p̃·ẽ, where p̃ is the p-prefix of s′. Since p ⊆ p̃ and e ⊆ ẽ, then p·e ⊆ p̃·ẽ = s′. In this way, for every sequence in Proj_p(D) that supports e, there is a sequence in D that supports p·e. This means that the support of p·e in D is at least equal to the support of e in Proj_p(D). Therefore, Sup_{Proj_p(D)}(e) ≤ Sup_D(p·e).

3.2 Reconfigurable Computing

3.2.1 Basic Concepts

An algorithm can be implemented either in a software or a hardware environment. Among hardware implementation approaches, one of the most popular is using an Application Specific Integrated Circuit (ASIC). An ASIC is an integrated circuit developed for a specific purpose. ASICs are generally designed to be embedded in a product that will have a large production run, replacing in a single integrated circuit a large number of individual electronic components that would otherwise be needed. The main advantage of this approach is its high performance. This is due to the fact that ASICs are designed to carry out specific tasks and only have the hardware resources required to do so. However, they lack flexibility: once the device is fabricated, it cannot be modified. Moreover, the cost of an ASIC design is high, so they are generally reserved for production runs of several thousands of pieces (DELTA, 2010).

On the other hand, in a software environment, using programmed processors through a software application is the most popular approach. A microprocessor executes a set of instructions in order to carry out the operations required by the algorithm. If a modification is needed, it can be done easily, by changing the respective instructions, without affecting the hardware. Unfortunately, this strategy has a low performance in comparison with the previous approach.

The reconfigurable computing paradigm takes the qualities of the previous strategies. Reconfigurable computing approaches have the flexibility that ASIC approaches lack: if a modification is needed, it can be done just by modifying the architecture design and then reconfiguring the implementation device, thus avoiding complex and slow redesign processes. Moreover, reconfigurable computing implementations offer a high performance, comparable to that of ASIC approaches. This is due to various reasons, the main one being that a hardware implementation is tailor-made to the problem we want to solve, so it only includes the elements required to carry out the algorithm's operations.

3.2.2 FPGAs

FPGAs (Field-Programmable Gate Arrays) are the most popular configurable devices used in reconfigurable computing systems. An FPGA is mainly formed by a bi-dimensional array of logic cells. Each logic cell can implement any logical function and can be connected to any other cell through configurable interconnect resources. Thus, while a single cell can do little, complex logical functions can be created by interconnecting many of them. Moreover, FPGAs can have additional resources, like embedded memory blocks, embedded processors and dedicated multiplier blocks, among others.

3.2.3 Architecture Design

Usually, a reconfigurable hardware architecture is part of a system formed by a general purpose microprocessor and a reconfigurable device where the architecture is implemented. Thus, the first step in the design of a hardware architecture is the analysis of the algorithm that will be implemented, in order to find its most expensive operations and the data dependencies between them. If there is little to no data dependency between the operations, they can be carried out in parallel by the hardware architecture, which produces a decrease in the algorithm's running time; if the chosen operations are expensive and they are executed repeatedly during the algorithm's execution, the decrease in running time will be even greater. If the chosen operations have a strong data dependency between them, the parallelism available in the hardware environment cannot be exploited, and the implementation of these operations would be entirely sequential, which would not provide a significant decrease in running time in comparison with an implementation in a software environment. Therefore, the analysis and selection of operations has to be carried out carefully.


Once the operations to be carried out by the hardware architecture have been determined, an architecture design has to be created as the next step. The architecture design should perform the chosen operations while taking advantage of the particular features of the hardware environment, such as pipelining, parallelism and embedded functions. If the algorithm was originally designed to be implemented in a software environment, the architecture design also entails the difficulty of changing from the software implementation paradigm, where the algorithm is implemented as a set of instructions that are executed sequentially, to a reconfigurable computing environment, where the algorithm is implemented as a set of logical operations that can be carried out in parallel.

Once the architecture design is finished, a hardware description of the architecture is coded, usually in a hardware description language such as VHDL or Verilog. Next, a simulation is performed, to see if the architecture works correctly. Then, the architecture is implemented in a hardware device and tested for correct functionality, area and time measures. The hardware architecture is usually accompanied by a pre-processing and/or post-processing software application that handles the algorithm's operations that were not suitable to be implemented in the architecture. If this is the case, a data transmission strategy should be devised in order to obtain an optimal communication between the hardware device and the software application, such that the decrease in running time obtained from the hardware architecture is not affected by a transmission bottleneck. The characteristics of this strategy usually depend on the specifications of the hardware device and the machine where the software application is running.


Chapter 4

Proposed Method

In this chapter we describe our approach to the problem of mining sequential patterns. We begin with an analysis of the problem, to gain a general outlook on the challenges that this task entails. Next, we present our proposed algorithm, accompanied by an explanation of each step and a proof of its correctness. Finally, we present our hardware architecture, which implements our proposed algorithm.

4.1 Problem Analysis

Tackling the problem of mining sequential patterns entails accomplishing several challenges:

• How to efficiently access the sequences in the input database.

• How to determine, in an efficient way, if a pattern is valid.

• How to traverse the search space efficiently, without performing redundant operations or processing unnecessary patterns.

Besides these challenges, we had to consider including other features in our algorithm design that would help make our implementation as a hardware architecture more efficient. Among these features are independence among the input data, problem partitioning and memory efficiency, among others.


Among the main strategies to solve this problem, the most efficient is the pattern growth strategy. One of the main reasons for this is that pattern growth approaches do not generate candidate patterns, thus avoiding the processing of a considerable number of invalid patterns. This advantage becomes clearer when processing large sequence databases with a low minimum support. Because of this, we took the pattern growth strategy as the base of our approach.

Within the pattern growth strategy, the most common way to access the sequences in the input database is using database projections. With this technique, a smaller sequence database is searched for patterns with each recursive call to the mining process, so the algorithm only has to deal with those sequences that contain the growing pattern, and only with the portions of those sequences that could contain elements which co-occur frequently with the pattern. Thus, it avoids looking for valid patterns in unnecessary sequences. Besides pseudo-projection, several data structures have been used to represent the input database, in order to provide an easier access to the sequences and to save memory. Among those structures, we have tree-based representations (Ezeife and Lu (2005), Tang et al. (2007), Liu and Liu (2010), Peterson and Tang (2008)), directed acyclic graphs (Chen, 2009), linked-list based structures (García-Hernández et al., 2004), and others.

To know if a pattern meets the minimum frequency, there are different proposed solutions: a simple brute-force search of the entire database for the occurrences of the pattern (Agrawal, 1995); calculating the frequency by counting co-occurring elements in the corresponding projection databases and using Proposition 1 (Chen, 2009); or more sophisticated ones, like building a data structure that represents the sequence database and includes the frequency of every element as an easily accessible field, such that the frequency calculation of a pattern becomes easier (Ezeife and Lu (2005), Liu and Liu (2010), Peterson and Tang (2008), Tang et al. (2007), García-Hernández et al. (2004)), among others. Overall, the calculation of the support of a given pattern should avoid searching in sequences that do not contribute to the frequency count, avoiding redundant operations as much as possible.

It is desirable to traverse the search space efficiently, i.e., doing the minimum amount of operations and processing only the necessary elements and sequences, in order to save memory and improve performance as much as possible. To accomplish this, there have been several approaches, such as pruning techniques that rely directly on the antimonotonic property (Agrawal, 1995), (Mortazavi-Asl et al., 2001) to avoid counting the support of supersequences of infrequent patterns; also, other approaches use a data structure that allows them to prune candidate sequences early in the mining process ((Mabroukeh and Ezeife, 2010), (Ezeife and Lu, 2005), (Chen, 2009)).

4.2 Algorithm TreeMine

The main idea behind our proposed algorithm is to build valid patterns by growing an already known valid pattern p with a frequent element w found in the database projection with respect to pattern p, creating the pattern p·w, which Proposition 1 guarantees to be a valid pattern. This process repeats recursively, until there are no more frequent elements to append to the current pattern. The process then rolls back to the previous valid pattern found and continues growing it with another valid element. The search space, therefore, is represented as a series of trees, one for each valid pattern of length 1. Each node of a tree represents a valid pattern, and the nodes are arranged so that a pattern node is equal to its parent node plus an element appended at the end of the pattern. Figure 4.1 shows an example of the search space of a small sequence database with four different elements.

With each valid pattern discovered, a new node is built and linked to its parent node. Each node contains a valid pattern, and pointers to its parent node and its descendants, to maintain the structure of the current tree. Without loss of generality, we represented the elements in our implementation as positive integers. A pattern can then be represented as an array of integers. The data structure of a pattern node is shown next:

class NodePattern {
    int Id;                           // unique identifier of the node
    vector<int> Pattern;              // the pattern, as an array of integers
    NodePattern *Parent;              // pointer to the parent node
    list<NodePattern *> Children;     // pointers to the descendant nodes
};

Figure 4.1: Example of the search space representation of a sequence database

Properly linking every valid pattern found by the algorithm would result in a subtree of one of the search space trees. However, we do not need to store the whole tree in main memory during the mining process, but only the branch where the pattern that is being grown belongs. This is because we only need to access the ancestors of the current pattern when the algorithm rolls back to a previous pattern and tries to grow it with another valid word. This results in a significant memory saving, especially when dealing with sequence databases with a large set of different symbols (Σ).

Although our algorithm does not generate candidate patterns, but grows valid patterns from already found valid patterns instead, we still have to make sure that the newly found patterns are valid. To guarantee that, our algorithm only uses elements that meet the minimum support to build patterns. Each time the algorithm searches a projected database for elements, it only retrieves the elements that are frequent enough. In this way, when the algorithm uses those retrieved elements to grow the current valid pattern, we can be sure that the new pattern is valid as well.


Our algorithm comprises three main steps, which are described next:

The first step of our algorithm comprises preliminary operations, which prepare the input data for the main mining process. First of all, the algorithm reads the original database and finds the frequent elements; these elements correspond to the set of the valid patterns of length 1. Next, the database is read again, and the elements that are not in the set of frequent elements are removed. Then, for each frequent element found, the projected database with respect to that element is built and a call to the mining process is made, having as input the element and the projected database. The mining process then recursively finds the valid patterns that start with the corresponding element. The union of the valid patterns found in each iteration constitutes the set of the valid patterns existing in the input database.

The second step of our algorithm occurs inside the main mining process, which receives as input a sequence database, a pattern that will be used to build new patterns, and the minimum support. The first operation made in the main mining process is finding the frequent elements inside the input database. This is done inside the function GetFrequentElements. There, each of the sequences of the input database is read, and the occurrences of the elements read in each sequence are counted. Once the function has read all the sequences in the input database, those elements that met the minimum support are returned as output.

The third and final step of the proposed algorithm also occurs inside the function Mine, and comprises building new valid patterns out of the input pattern and the frequent elements found in step 2. After returning from the function GetFrequentElements, and for each of the elements returned from it, a new valid pattern is formed by appending that element to the end of the input pattern. Finally, a projection of the input database is generated and used as the input of a new recursive call to the Mine function, along with the new valid pattern built and the minimum support. The process then continues recursively, until no valid elements are found in the current input database.

Next, we show an illustrative example of the functionality of our algorithm. Figure 4.2 shows a sequence database composed of six sequences of words, or documents, and we want to mine all the sequential patterns with a frequency threshold of 2. The first step in the algorithm is to read the database and find the frequent words, which are also the


input : A sequence database D and a minimum support η.
output: The set F, containing all the frequent patterns in D.

 1  Initialization;
    // STEP 1: Find frequent elements and filter the input database.
 2  FW ← GetFrequentElements(D, η);
 3  D′ ← FilterDatabase(D, FW);
 4  F ← FW;
 5  foreach symbol a in FW do
 6      Proj ← GetProjection(D′, a);
 7      Mine(Proj, a, η);
 8  end

 9  Mine(D, p, η)
10  {
        // STEP 2: Find frequent elements in the given projection database.
11      FW ← GetFrequentElements(D, η);
        // STEP 3: Form new valid patterns by combining the pattern p with
        //         every element obtained in the previous step.
12      foreach symbol a in FW do
13          F ← F ∪ {p·a};
14          Proj ← GetProjection(D, a);
15          Mine(Proj, {p·a}, η);
16      end
17  }

18  GetFrequentElements(D, η)
19  {
20      foreach document d in D do
21          foreach word w in d do
22              if FWArray.exists(w) then
23                  if FWArray(w).sameDocument = false then
24                      FWArray(w).count++;
25                      FWArray(w).sameDocument ← true;
26                  end
27              else
28                  FWArray.add(w);
29                  FWArray(w).count ← 1;
30                  FWArray(w).sameDocument ← true;
31              end
32          end
33          foreach entry in FWArray do entry.sameDocument ← false;
34      end
35      foreach wElem in FWArray do
36          if wElem.count < η then wElem.delete;
37      end
38      return FWArray;
39  }


Figure 4.2: Example of the algorithm’s functionality. The figure shows a sequence database, the set of frequent elements obtained from it and the filtered database.

set of valid patterns of length 1. Next, we have to remove the infrequent words. The resultant database appears in the lower part of the figure. Next, we take one of the valid patterns found and start growing it. This is shown in Figure 4.3. For this example, we took the word away and built the projection database with respect to that pattern. Next, we search the projection database for frequent words. From that search, we found two valid words: from and the, which are used to build the valid patterns away from and away the. Next, we choose a word, in this case the word from, from the current set of valid words, and we build the projected database, with respect to the chosen word, of the current database (which is a projected database of the original sequence database). Finally, we use the projected database, along with the word from and the pattern away from, as the input for the next call of the mining process, and the recursion continues; this part of the process corresponds to Step 3 in the algorithm. From the input database of the new recursive call, the only valid word found is the, which is used to build the valid pattern away from the.


Figure 4.3: Example of the algorithm's functionality. The figure shows two recursive calls to the main mining process, where 3 valid patterns are found.


Figure 4.4: Example of the algorithm's functionality. The upper part shows the next recursive call with respect to Figure 4.3, where the input database does not contain any word. The algorithm goes back to the previous pattern and does another recursive call with a different valid word (lower part of the figure), which also does not contribute any valid pattern.

In the next recursive call, there are no more valid patterns to be found, as the input database does not contain any words. This is shown in the upper part of Figure 4.4.

The algorithm then has to go back to the pattern which still has unprocessed valid words in its associated valid word set, in our example the pattern away, and now chooses the word the to use in a subsequent recursive call. Also in this recursive call there are no valid patterns to be found, because the input database does not contain any words. This is shown in the lower part of Figure 4.4. Because there are no more valid words to grow the pattern away, we have now found all the patterns that start with that word. These patterns and their ancestor/descendant relations are shown as a tree structure in Figure 4.5. Next,


Figure 4.5: Patterns found by growing the pattern away in the algorithm's example.

we choose the pattern cat and we follow the same process as with the pattern away, which is shown in Figure 4.6. We build the projection database with respect to the pattern cat and then we search in that database for valid words. We only found the word runs to be frequent, which we use, along with the pattern cat, to build the valid pattern cat runs. Next, we take the word runs from the new valid word set and generate another projected database. In this database, we only find the valid word fast, so we only build the valid pattern cat runs fast. Now, we take the just discovered word fast and generate another projected database, which does not have any words, so the recursion stops there. We continue in this way until we have found all the valid patterns that start with the word cat. This is shown in Figure 4.7. These patterns and their ancestor/descendant relations are shown in Figure 4.8. The next step would be to take another valid pattern of length 1 and repeat the process. The algorithm will continue until all of the valid patterns of length 1 are processed. Then, the complete set P of valid patterns is: P = {away, away from, away the, away from the, cat, cat runs, cat runs fast, dog, dog runs, dog runs towards, towards, fast, from, from the, my, my dog, my dog runs, my dog runs towards, runs, runs fast, runs away, runs away from, runs away from the, runs from, runs from the, the, the runs}


Figure 4.6: Recursive calls of the Mine function to grow the pattern cat.

Figure 4.7: Last recursive call to grow the pattern cat. The projection database does not contain any words, so the recursion stops in this direction.


Figure 4.8: Patterns found by growing the pattern cat in the algorithm’s example.

4.2.1 Correctness Proof

Let

• D be the input database.

• η be the frequency threshold.

• Palg be the set of patterns discovered by the algorithm.

• P be the set of patterns that meet the frequency threshold.

The correctness of our algorithm can be demonstrated by proving the following two statements:

• Statement 1: If s is a pattern that meets the frequency threshold, then s is discovered by our algorithm, i.e., if s ∈ P then s ∈ Palg.

• Statement 2: If s is a pattern discovered by our algorithm, then s meets the frequency threshold, i.e., if s ∈ Palg then s ∈ P.

4.2.2 Proof of Statement 1

We show that if s ∈ P then s ∈ Palg, by mathematical induction over the length of s, or len(s), for all natural numbers.


Proof. Basis: Show that the statement holds for len(s) = 1.

As a first step, the algorithm scans the input database and finds all the words whose frequency is equal to or greater than the frequency threshold. These words correspond to the patterns of length 1. Therefore, statement 1 holds for the basis step.

Inductive step: Show that if the statement holds for a sequence s' of length k, then it also holds for a sequence s of length k + 1, where k ∈ N.

We have

s = s1 · s2 · s3 · ... · sk · sk+1   (4.1)

This can be rewritten as a pattern q formed by the first k elements of s:

s = q · sk+1   (4.2)

The pattern s is generated in steps 2 and 3 of the algorithm, with the pattern q as the input pattern: q meets the frequency threshold and is itself generated by the algorithm, by the induction hypothesis, since len(q) = k.

In step 2, the algorithm generates a list of symbols that meet the frequency threshold for the projection database proj_sk(D). Moreover, as s is a valid pattern, sk+1 is also a valid pattern, i.e., sk+1 meets the frequency threshold for the input database D. Then, by Proposition 1, we know that sk+1 also meets the frequency threshold for the projection database proj_sk(D). Therefore, sk+1 is included in the list of symbols generated in step 2, and it will then be appended at the end of q in step 3, forming the pattern s. Thus, we have shown that s ∈ Palg.

Since both the basis and the inductive step have been proved, it follows by mathematical induction that statement 1 holds for all natural numbers.

4.2.3 Proof of Statement 2

We show that if s ∈ Palg then s ∈ P, by mathematical induction over the length of s, or len(s), for all natural numbers.


Proof. Basis: Show that the statement holds for len(s) = 1.

In step 1, the algorithm scans the input database and finds all the words whose frequency is equal to or greater than the frequency threshold. These words correspond to all the patterns in P whose length is 1, thus showing that statement 2 holds for the basis step.

Inductive step: Show that if the statement holds for a sequence s' of length k, then it also holds for a sequence s of length k + 1, where k ∈ N.

Again, we have

s = s1 · s2 · s3 · ... · sk · sk+1   (4.3)

This can be rewritten as a pattern q formed by the first k elements of s:

s = q · sk+1   (4.4)

The construction of the pattern s is done in steps 2 and 3 of the algorithm. In that particular recursive call, the pattern q is the given input pattern, as q is generated by the algorithm.

In step 2, the algorithm finds the symbol sk+1 as part of the set of symbols returned by the function GetFreqElementsValid. Therefore, we know that sk+1 occurs in at least η documents in the database projection proj_q(D).

Later, in step 3, the symbol sk+1 is appended at the end of pattern q, forming the pattern s. By the inductive hypothesis, q meets the frequency threshold. Also, by Proposition 1, we know that if sk+1 meets the frequency threshold in the projection proj_q(D), then the pattern q · sk+1 meets the frequency threshold in D. Thus, we can be sure that s meets the minimum support. Therefore, we have shown that s ∈ P.

Since both the basis and the inductive step have been proved, it follows by mathematical induction that statement 2 holds for all natural numbers.
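Both directions of the proof hinge on Proposition 1, which relates support in D to support in the projected database. The equivalence can be checked numerically with a small sketch. This is our own illustrative code, not the thesis implementation: it assumes subsequence matching with document-level support, and all names are hypothetical.

```python
def contains(seq, pattern):
    """True if `pattern` occurs in `seq` as an ordered subsequence."""
    it = iter(seq)
    return all(sym in it for sym in pattern)

def support(database, pattern):
    """Number of sequences in `database` that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in database)

def project(database, pattern):
    """proj_pattern(D): suffixes following the earliest match of `pattern`."""
    projected = []
    for seq in database:
        matched = 0
        for i, sym in enumerate(seq):
            if sym == pattern[matched]:
                matched += 1
                if matched == len(pattern):
                    projected.append(seq[i + 1:])
                    break
    return projected
```

For any pattern q and symbol s, support(D, q + [s]) equals the support of [s] inside proj_q(D), because greedily matching the earliest occurrence of q leaves the longest possible suffix; this equality is what lets both proofs move between D and the projection.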


4.3 Architecture Design

To design a hardware architecture that implements our algorithm, we partitioned it into its main operations and then chose which ones are the most suitable to be performed by the hardware architecture and which ones should be carried out by a software application as a pre/post-processing phase. The main operations of our algorithm are building pseudo-projections and finding valid words within such subsets of data. These two operations are executed repeatedly during the whole execution of the algorithm and are also very costly; accelerating them therefore yields a great improvement in overall performance. That is why we chose these operations to be implemented in our hardware architecture. The rest of the operations, such as those comprising Step 1, which are executed only once, and the building of patterns, which is a rather simple operation, are more suitable to be carried out in a software environment.

Having defined which operations of the algorithm are going to be implemented in hardware, we propose a hardware architecture formed by four main modules, as shown in Figure 4.9. Two of the modules, Projection and Valid Elements, build pseudo-projections and find valid elements from them, respectively. A third module manages the sequence database, providing the Projection module with the sequences that belong to the current projection database to be processed. Finally, a control module ensures correct communication and synchronization among the rest of the modules.
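The state the architecture must retain for backtracking can be illustrated with an explicit-stack version of the mining loop: for each pattern still being grown, only its pseudo-projection and the valid symbols not yet tried need to be stored. A minimal Python sketch, with our own hypothetical names and none of the RTL detail:

```python
from collections import defaultdict

def project(database, symbol):
    """Pseudo-projection: suffix after the first occurrence of `symbol`."""
    return [seq[seq.index(symbol) + 1:] for seq in database if symbol in seq]

def frequent_symbols(database, min_support):
    """Symbols occurring in at least `min_support` sequences."""
    counts = defaultdict(int)
    for seq in database:
        for sym in set(seq):
            counts[sym] += 1
    return [sym for sym, c in counts.items() if c >= min_support]

def mine_iterative(database, min_support):
    """Explicit-stack pattern growth: each frame holds a prefix, its
    pseudo-projection, and the valid symbols not yet processed -- the
    bookkeeping the architecture keeps so it can go back and grow a
    pattern with another valid element."""
    patterns = []
    stack = [([], database, frequent_symbols(database, min_support))]
    while stack:
        prefix, db, symbols = stack[-1]
        if not symbols:
            stack.pop()                      # backtrack to the parent pattern
            continue
        sym = symbols.pop()                  # next unprocessed valid symbol
        pattern = prefix + [sym]
        patterns.append(pattern)
        proj = project(db, sym)
        stack.append((pattern, proj, frequent_symbols(proj, min_support)))
    return patterns
```

Each iteration of the loop corresponds to one round trip through the Projection and Valid Elements modules, with the stack playing the role of the module that retains earlier pseudo-projections and valid-word sets.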

4.3.1 Memory Module

The memory module, shown in Figure 4.10, provides the projection module with the projected database that is next to be processed and the element that is going to be used as a pivot to generate the next pseudo-projection. The module also keeps track of the previously generated pseudo-projections and the sets of valid words needed by the algorithm, in case it needs to go back and try to grow a pattern with another element. The memory module is formed by four submodules. The projection memory and the valid elements memory submodules store the newly generated pseudo-projection and valid element sets, respectively. The valid
