Data analysis researchers have seen in DISC systems the perfect tool to scale their computations to huge datasets. In the literature, there are many examples of successful applications of DISC systems in different areas of data analysis. In this section, we review some of the most signi- ficative example, with anl emphasis on Web mining and MapReduce.
Information retrieval has been the classical area of application for MR Indexing bags of documents is a representative application of this field, as the original purpose of MR was to build Google’s index for Web search. Starting from the original algorithm, many improvements have been proposed, from single-pass indexing to tuning the scalability and efficiency of the process. (McCreadie et al., 2009a,b, 2011).
Several information retrieval systems have been built on top of MR. Terrier (Ounis et al., 2006), Ivory (Lin et al., 2009) and MIREX (Hiemstra and Hauff, 2010) enable testing new IR techniques in a distributed set- ting with massive datasets. These systems also provide state-of-the-art implementation of common indexing and retrieval functionalities.
As we explain in Chapter 3, pairwise document similarity is a com- mon tool for a variety of problems such as clustering and cross-document coreference resolution. MR algorithms for closely related problems like fuzzy joins (Afrati et al., 2011a) and pairwise semantic similarity (Pan- tel et al., 2009) have been proposed. Traditional two-way and multiway joins have been explored by Afrati and Ullman (2010, 2011)
Logs are a gold mine of information about users’ preferences, natu- rally modeled as bags of lines. For instance frequent itemset mining of query logs is commonly used for query recommendation. Li et al. (2008) propose a massively parallel algorithm for this task. Zhou et al. (2009) design a DISC-based OLAP system for query logs that supports a vari- ety of data mining applications. Blanas et al. (2010) study several MR join algorithms for log processing, where the size of the bags is very skewed. MapReduce has proven itself as an extremely flexible tool for a num- ber of tasks in text mining such as expectation maximization (EM) and latent dirichlet allocation (LDA) (Lin and Dyer, 2010; Liu et al., 2011).
Mining massive graphs is another area of research where MapReduce has been extensively employed. Cohen (2009) presents basic patterns of computation for graph algorithms in MR, while Lin and Schatz (2010) presents more advanced performance optimizations. Kang et al. (2011) propose an algorithm to estimate the graph diameter for graphs with billions of nodes like the Web, which is by itself a challenging task.
Frequently computed metrics such as the clustering coefficient and the transitivity ratio involve counting triangles in the graph. Tsourakakis et al. (2009) and Pagh and Tsourakakis (2011) propose two approaches for approximated triangle counting on MR using randomization, while Yoon and Kim (2011) propose an improved sampling scheme. Suri and Vassilvitskii (2011) propose exact algorithms for triangle counting that scale to massive graphs on MapReduce.
MR has been used also to address NP-complete problems. Brocheler et al. (2010) and Afrati et al. (2011b) propose algorithms for subgraph matching. Chierichetti et al. (2010) tackles the problem of max cover- ing, which can be applied to vertexes and edges in a graph. Lattanzi et al. (2011) propose a filtering technique for graph matching tailored for MapReduce. In Chapter 4 we will present two MapReduce algorithms for graph b-matching with applications to social media.
A number of different graph mining tasks can be unified via general- ization of matrix operations. Starting from this observation, Kang et al. (2009) build a MR library for operation like PageRank, random walks with restart and finding connected components. Co-clustering is the si- multaneous clustering of the rows and columns of a matrix in order to find correlated sub-matrixes. Papadimitriou and Sun (2008) describe a MR implementation called Dis-Co. Liu et al. (2010) explore the problem of non-negative matrix factorization of Web data in MR.
Streaming on DISC system is still an unexplored area of research. Large scale general purpose stream processing platforms are a recent de- velopment. However, many streaming algorithms in the literature can be adapted easily. For an overview of distributed stream mining algo- rithms see the book by Aggarwal (2007). For large scale stream mining algorithms see Chapter 4 of the book by Rajaraman and Ullman (2010).
DISC technologies have also been applied to other fields, such as sci- entific simulations of earthquakes (Chen and Schlosser, 2008), collabo- rative Web filtering (Noll and Meinel, 2008) and bioinformatics (Schatz, 2009). Machine learning has been a fertile ground for MR applications such as prediction of user rating of movies (Chen and Schlosser, 2008). In general, many widely used machine learning algorithms can be im- plemented in MR (Chu et al., 2006).
Chapter 3
Similarity Self-Join in
MapReduce
Similarity self-join computes all pairs of objects in a collection such that their similarity value satisfies a given criterion. For example it can be used to find similar users in a community or similar images in a set. In this thesis we focus on bags of Web pages. The inherent sparseness of the lexicon of pages on the Web allows for effective pruning strategies that greatly reduce the number of pairs to evaluate.
In this chapter we introduce SSJ-2 and SSJ-2R, two novel parallel al- gorithms for the MapReduce framework. Our work builds on state of the art pruning techniques employed in serial algorithms for similarity join. SSJ-2 and SSJ-2R leverage the presence of a distributed file system to support communication patterns that do not fit naturally the MapRe- duce framework. SSJ-2R shows how to effectively utilize the sorting facilities offered by MapReduce in order to build a streaming algorithm. Finally, we introduce a partitioning strategy that allows to avoid memory bottlenecks by arbitrarily reducing the memory footprint of SSJ-2R.
We present a theoretical and empirical analysis of the performance of our two proposed algorithms across the three phases of MapReduce. Experimental evidence on real world data from the Web demonstrates that SSJ-2R outperforms the state of the art by a factor of 4.5.