a shared-memory system using a specialized interconnect and processor architecture. With increasing nodes, it is expected that we can even outperform [51] on the hash operation on the basis of the theoretical analysis in Section 4.2.
4.5 Discussion
The study of distributed parallel hashing mainly focuses on (1) low-level communication schemes, such as the use of IBM LAPI [83], and (2) parallel programming paradigms or languages, such as Java, MPI and UPC [43, 85, 98]. In these implementations, hash operations are always accompanied by frequent and irregular remote memory access, which increases low-level communication overhead and degrades performance. Therefore, they are more suitable for processing small data than massive data.
There is a long history of theoretical studies [72, 84] of thread-level parallel approaches. By employing different hashing strategies, implementations on various platforms have achieved excellent performance [50, 51, 106]. Our implementation performs comparably to or slightly worse than the fastest one [51]; however, our approach relies on low-cost commodity hardware, which adds to its flexibility.
GPU computing has become a well-accepted parallel computing paradigm, and there are many reports of parallel hashing implementations based on it [8, 48]. These hash table implementations exhibit strong performance. However, GPU memory is limited, so such methods cannot handle hash tables of the sizes shown in this thesis. In addition, reading data into GPUs takes considerable time, which adds significant overhead to a task that is simple from the perspective of computation.
Although parallel hash joins are widely studied in modern parallel database management systems [7, 16, 20], little research focuses on the parallelism of the underlying hash tables. With the increase in the size of the datasets processed in this domain [7], we expect that the hashing strategies used in our hash tables can further improve join performance there.
The idea behind our method is straightforward, yet not trivial, and does not appear in the literature. Consequently, we believe that the evaluations conducted here and the results described are of value to the community as a basis for understanding the merits of the approach. Moreover, our theoretical analysis in Section 4.2 confirms that our structured method is faster for large datasets, a result verified through our experiments. Finally, we also contribute a range-based strategy for our hashing implementation, which is shown to be faster than the commonly used CAS method within our framework.
4.6 Conclusions
In this chapter, we proposed a high-level structured framework for parallel hashing, which has been designed for processing massive data. This framework supports (a) distributed memory while avoiding frequent remote memory access, and (b) thread coordination on a per-partition basis. Based on that, we presented an efficient parallel hashing algorithm by employing the popular CAS and our proposed range-based lock-free hashing strategies.
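As a rough illustration of point (a), the sketch below groups key-value pairs by destination partition before any insertion, so that each partition's table is filled using local writes only. It is a single-process Python sketch under stated assumptions: the names `partition_of` and `build_partitioned_tables` and the modulo placement rule are hypothetical, and the CAS and range-based strategies of the framework are not modeled here.

```python
from collections import defaultdict


def partition_of(key, num_partitions):
    # Hypothetical placement rule for integer keys (an assumption for illustration).
    return key % num_partitions


def build_partitioned_tables(pairs, num_partitions):
    # Step 1: bucket the pairs by destination partition. In a distributed run,
    # each bucket could be shipped as one batch instead of one message per key.
    outgoing = defaultdict(list)
    for key, value in pairs:
        outgoing[partition_of(key, num_partitions)].append((key, value))

    # Step 2: every partition inserts only its own batch into its local table.
    tables = [dict() for _ in range(num_partitions)]
    for partition, batch in outgoing.items():
        for key, value in batch:
            tables[partition].setdefault(key, []).append(value)
    return tables


tables = build_partitioned_tables([(1, "a"), (4, "b"), (2, "c"), (1, "d")], 3)
```

In this toy run, keys 1 and 4 land in the same partition and key 2 in another, and no insertion touches a table other than the one that owns the key.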
The experimental evaluation results show that our implementation is highly efficient and scalable in processing large datasets. Moreover, this hash framework demonstrates useful flexibility in that it can employ various hashing techniques and can be run on commodity hardware. Finally, the proposed range-based lock-free strategy is faster than the conventional CAS operation and presents better load balancing characteristics than approaches which use a single thread per partition. Additionally, we have characterized the performance of our hash implementations through extensive experiments, thereby allowing us to make a more informed choice for our high-performance implementations over distributed memory.
In the following chapters, we will present a detailed implementation of our proposed framework, and focus on techniques to improve the performance of the three core operations (encoding, joins and indexing) we have described.
Chapter 5
Scalable RDF Data Compression using X10
5.1 Introduction
Having studied triple stores and parallel hashing in the previous two chapters, we now turn to the detailed implementation of our framework. In this chapter, we propose an efficient dictionary encoding method to compress large RDF data in parallel. Our solution is based on a distributed architecture with multiple dictionaries: the RDF data is partitioned and then compressed using a dictionary on each computation node. However, as with the state-of-the-art MapReduce method [116] described in Chapter 2, there are three main challenges under this scheme:
• Consistency - a term appearing on different compute nodes should have the same id.
• Performance - ensuring consistency based on naive methods can lead to serious performance degradation.
• Load balancing - the heavy skew of terms which characterizes real world linked data [78] may lead to hotspots for the nodes responsible for encoding these popular terms.
The mapping of a term must remain unique across both space and time. For example, once the term "dbpedia:IBM" is first encoded as id "101" on node A, the same value "101" must be used when this string is encoded on another node B. Hash functions are potentially useful, but the length of the hash required to avoid collisions when processing billions of terms makes the space cost prohibitive.
We can ensure the consistency of the compression in the above example by copying the mapping [dbpedia:IBM, 101] from node A to node B, but network communication cost and dealing with concurrency (e.g. locking on data structures) would lead to low performance.
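To make the consistency requirement concrete, the following minimal Python sketch contrasts two independent per-node dictionaries, which assign ids in arrival order, with the mapping-copy fix described above. The `NodeDictionary` class, its methods, the term "dbpedia:Oracle" and the starting id are hypothetical choices made for illustration; they are not the implementation presented in this thesis.

```python
class NodeDictionary:
    """Hypothetical per-node dictionary that assigns ids in arrival order."""

    def __init__(self, first_id):
        self.next_id = first_id
        self.mapping = {}

    def encode(self, term):
        # Assign a fresh id only if the term has no mapping on this node yet.
        if term not in self.mapping:
            self.mapping[term] = self.next_id
            self.next_id += 1
        return self.mapping[term]

    def import_mapping(self, term, term_id):
        # Copy a mapping received from another node; in a real system this
        # step costs network communication and synchronization.
        self.mapping[term] = term_id
        self.next_id = max(self.next_id, term_id + 1)


# Naive case: nodes A and B encode independently.
node_a = NodeDictionary(first_id=101)
node_b = NodeDictionary(first_id=101)
id_on_a = node_a.encode("dbpedia:IBM")         # first term seen on A
node_b.encode("dbpedia:Oracle")                # B happened to see another term first
naive_id_on_b = node_b.encode("dbpedia:IBM")   # B assigns its own, different id

# Fix from the text: copy [dbpedia:IBM, 101] from A before encoding on B.
fresh_node_b = NodeDictionary(first_id=101)
fresh_node_b.import_mapping("dbpedia:IBM", id_on_a)
consistent_id_on_b = fresh_node_b.encode("dbpedia:IBM")
```

The naive run leaves the same term with two different ids, whereas the copied mapping keeps the id identical on both nodes at the price of the extra transfer.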
Compared with the two issues above, load balancing presents a bigger challenge, as the distribution of terms in the Semantic Web is highly skewed: there exist both popular terms (like the predefined RDF and RDFS vocabulary) and unpopular terms (like identifiers for entities that appear only a limited number of times). For a distributed system like ours, any compression algorithm needs to be carefully engineered so that good network communication and computational load balance are achieved. If terms are assigned using a simple hash distribution algorithm, the continuous redistribution of all the terms would undoubtedly lead to an overloaded network. Furthermore, popular terms would lead to load-balancing issues.
For the sake of explanation, let us categorize terms into three groups: high-popularity terms that appear in a significant portion of the input triples, low-popularity terms that appear fewer than a handful of times, and average-popularity terms (which also make up the largest portion of RDF data). The state-of-the-art MapReduce compression algorithm [116] efficiently processes high-popularity terms. The very first job in the algorithm is to sample and assign identifiers to popular terms, using an arbitrarily chosen threshold. These identifiers are then distributed to all nodes in the system, and used to encode terms locally at each node. This dramatically improves load balancing and speeds up computation. For the rest of the terms, the data is repartitioned, and identifiers are assigned. For low-popularity terms, this also works well, as there are not many redundant data transfers: we can either retrieve their mappings (possibly for multiple nodes), or we can send the data to the node where it is going to be encoded. In either case, the number of messages will be limited.
For average-popularity terms, the situation is different. Assume a term that appears 10000 times, and that we have 100 compute nodes. If all nodes needed to retrieve the mapping from a single node, we would need 200 messages (for example, one request and one reply per node). If we repartition the terms, we would need at least 10000 messages. One can easily see the situation reversed for a term that appears 100 times (i.e. partitioning data might be more efficient than retrieving mappings). Then the question is: how can we reconcile efficient encoding of popular and non-popular terms?
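The message counts in this example can be reproduced with a short calculation. The Python sketch below assumes that retrieving a mapping costs one request and one reply per node, and that repartitioning costs at least one message per occurrence of the term; both cost models and all function names are assumptions made for illustration, chosen so that they match the figures quoted in the text.

```python
def retrieval_messages(num_nodes):
    # Assumed cost model: one request plus one reply for every node
    # that needs the mapping.
    return 2 * num_nodes


def repartition_messages(occurrences):
    # Assumed lower bound: each occurrence of the term is sent to the
    # node responsible for encoding it.
    return occurrences


def cheaper_strategy(occurrences, num_nodes):
    # Pick the strategy with the smaller message count under the two models.
    if retrieval_messages(num_nodes) < repartition_messages(occurrences):
        return "retrieve"
    return "repartition"


frequent = cheaper_strategy(occurrences=10000, num_nodes=100)
rare = cheaper_strategy(occurrences=100, num_nodes=100)
```

Under these assumptions, the term with 10000 occurrences favors retrieving the mapping (200 versus 10000 messages), while the term with 100 occurrences favors repartitioning (100 versus 200 messages), which is the reversal the paragraph points out.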
To solve the above problems, we propose a straightforward but very efficient and scalable solution for compressing massive RDF data in parallel. We develop an algorithm and present its detailed implementation using the X10 language [22]. We evaluate performance with up to 384 cores and with datasets comprising up to 11 billion triples (1.9 TB). Compared to the state-of-the-art [116], our approach is faster (by a fac-