• No se han encontrado resultados

6.1 MAIN CONTRIBUTIONS

This thesis considered the well-known graph partitioning problem. We claim that the com- putation performed on the partitionings computed by existing graph partitioning algorithms does not efficiently utilize modern HPC infrastructures. This impedes the efficiency of com- puting infrastructure as well as the scalability of the target workload. We advocate for architecture- and workload-aware graph partitioning to enable efficient distributed graph computation.

We first investigated the performance impact of modern HPC infrastructures on dis- tributed graph workloads. As a result of this study, we identified two important factors one should consider when partitioning the graphs: (1) the non-uniform network communication costs of the underlying computing infrastructures; and (2) the contention for the shared hardware resources on the memory subsystems of modern HPC clusters. We also provided a holistic view on: (a) why we have to be aware of the characteristics of modern HPC in- frastructures for distributed graph workloads; and (b) to what extent these characteristics may impact the performance of distributed graph workloads.

To avoid such negative performance impact, we proposed an architecture- and workload- aware graph partitioning algorithm, Argo, for efficient distributed graph computation on static graphs. Argo follows the same streaming model proposed by other graph parti- tioners. In this model, vertices arrive at the partitioner in a certain order along with their adjacency lists. The partitioner decides the placement of each arrived vertex to one of the partitions permanently based on the placements of the vertices previously arrived. The key novelty of Argo lies in making the vertex placement aware of (a) the non-uniform network

communication costs of the underlying computing infrastructures; and (b) the contentious- ness of the memory subsystems of modern HPC clusters. We also make Argo aware of the runtime characteristics of the target workload by encoding such information into the vertex and edge weights of the graph.

We then presented four new graph repartitioning algorithms: Aragon, Paragon, Planar, and Planar+ for efficient distributed graph computation on dynamic graphs. They all attempt to adapt the current partitioning to the changes in the graph by migrating vertices among the partitions. The migration is only allowed if the gain of moving the vertex from its current partition to an alternative partition is positive. The gain of migrating a vertex is defined as the reduction in the communication cost incurred by the vertex during the computation. One of the key contributions of this thesis is that we make the vertex gain computation process aware of both the communication heterogeneity and the contentiousness of the underlying computing infrastructures.

Out of the four new algorithms, Aragon is a centralized solution with the assumption that the graphs are small enough to be held in the memory of a single machine, whereas Paragon is a parallel version of Aragon designed for median-sized graphs. Planar and Planar+ overcome the drawbacks of Paragon by scaling it to even larger graphs and by increasing the degree of parallelism of the repartitioning algorithm. Planar+ further reduces the overhead of Planar, by introducing an efficient way of modeling the commu- nication heterogeneity and contentiousness. This, in turn, enables an optimized vertex gain computation. Making the partitioning algorithms scale efficiently against large graphs is another key contribution of the thesis.

Tables 6.1 and 6.2 provide a brief summary for the four proposed architecture- and workload-aware graph repartitioners. Table 6.1 summarizes our proposed repartitioners in terms of (a) the expected size of the graphs that each algorithm can handle; (b) the opti- mization objective; (c) the maximum and average hopcut/edgecut reduced by each algorithm when compared with the initial partitioning on the YouTube, as-skitter, com-lj, and com- orkut datasets; and (d) the maximum and average percentage of vertices migrated. Table6.2

summarizes the repartitioners in terms of (a) the largest graph evaluated; (b) the largest number of partitions evaluated; and (c) the maximum speedup achieved against LDG for

the execution of PageRank (20 Iterations) on the partitionings of the Friendster dataset as well as the actual CPU time saved. The CPU time saved is defined as:

CP U T imeSaving = (J ET Saving − repartT ime) ∗ n (6.1)

Here, J ET Saving, repartT ime, and n denote the reduction in the workload execution time, the time taken by the repartitioners to compute the partitioning, and the number of com- puting elements (cores) used, respectively.

Table 6.1: A summary of the Proposed Graph Repartitioners: Part1

Algorithms Desired Graph Size Optimization Objectives Hopcut Reduced (Figure4.36a) Edgecut Reduced (Figure4.36b)

Vertex Mig. Ratio (Figure4.37) Edgecut

or Hopcut

Mig. Cost Max Avg. Max Avg. Max Avg.

Aragon Small Hopcut X 30.0% 18.5% 17.0% 10.2% 23.0% 17.7% Paragon Median-Sized 30.2% 17.8% 16.2% 9.5% 21.6% 16.4% Planar Large 42.0% 25.3% 33.0% 19.1% 48.6% 40.1% Planar+ Large 42.4% 25.8% 34.4% 17.8% 31.0% 24.6%

uniPlanar+ Large Edgecut 39.6% 21.8% 36.8% 18.7% 28.6% 24.2%

Table 6.2: A summary of the Proposed Graph Repartitioners: Part2

Algorithms Largest Graph Evaluated Largest # of Partitions Evaluated Evaluation of PageRank on Friendster with 120 Partitions

(Figure 4.40a)

|V | |E| Max Speedup CPU Time Saved

Aragon 3M 234M 40 NA NA Paragon 124M 3.6B 240 2.09 25h Planar 3.2 27h Planar+ 43h uniPlanar+ 1.23 10h

Finally, we looked at the problem of skew-resistant graph partitioning for graph workloads with predictable runtime characteristics. Towards this, we studied the runtime characteristics of two representative traversal-style graph workloads: BFS and SSSP. Based on the study, we proposed the idea of multi-label graph partitioning (MLGP) and an appli- cation of this idea is to do skew-resistant graph partitioning.

6.2 MAIN IMPACT

The main impact of this thesis is that we identified an important aspect that has been ignored by the current graph processing community, that is, the performance impact of the underlying HPC infrastructures, especially the contentiousness of the memory subsystems, on distributed graph computation. In fact, this is also a blind spot for general distributed computation, where people often assume that the network is the bottleneck.

In particular, we made an in-depth analysis about the factors that one should consider while (re)partitioning the graph for distributed graph computing, namely, the non-uniform network communication costs (Section 2.2.1) and the contention on the memory subsystems (Section 2.2.2). We also experimentally demonstrated (1) that the network may not always be the bottleneck in modern HPC clusters (Section 2.2.3); and (2) that the contention on the memory subsystems can impact the performance of distributed graph computation significantly (Section2.2.3). Based on our analysis and our observations, we showed that even with simple managed graph (re)partitioning we can achieve significantly better performance (Chapters 3 & 4). All these observations will enable the graph processing community to rethink the design of graph (re)partitioning algorithms and even the design of distributed graph computing frameworks.