Valor de las seis hornadas
APÉNDICE LA MONCLOA
The architecture we adopted to evaluate the impact of virtualization on HPCaaS was built after conducting a qualitative study of available tools in the market. We targeted mainly open sources to select appropriate cloud computing platform and distributed system. Hence, this chapter presents the analysis we followed in selecting cloud platform and distributed system.
6.1.Cloud Platform Selection
To compare available cloud open sources, we tried to choose the most popular platforms. The selection of competing platforms was based on a study that compares the popularity of OpenStack, Opennebula, Eucalyptus and CloudStack in 2013 [71]. As depicted in Figure 12, the study showed that OpenStack has the largest total population index, followed by Eucalyptus, CloudStack, and Opennebula.
Figure 12: Active cloud community population [71]
Based on Figure 12, we selected to compare and study OpenStack, Opennebula and Eucalyptus. To adopt one of these cloud open sources, we used some other studies that compare their performance and quality [72-75].
In [72], authors compared some open and commercial cloud platforms. Concerning open platforms, they compared OpenNebula and Eucalyptus. To perform the comparison, they adopted a set of criteria, including storage, virtualization, network, management, security and vendor support. The results of the research showed that open-source and commercial solutions
39
can have comparable features, and that OpenNebula is the most feature-complete cloud platform when it is compared with Eucalyptus.
[73] and [74] provide a comparison study of OpenStack and OpenNebula. In [73], authors compared the performance of both cloud platforms based on measuring the time when the cloud starts instantiating VMs and the time when they are ready to accept SSH connections. The findings of the research demonstrate that OpenStack is slightly better than OpenNebula due to smaller instantiation time. Moreover, the results showed that OpenStack is more suitable for high computing due to faster instantiation of large number of VMs. In [74], authors used qualitative and quantitative analysis to compare OpenStack and OpenNebula. For the qualitative analysis, they adopted some of the following criteria: security, virtualization supported, access, image support, resource selection, storage support, high- availability support and API support. Based on the results of the qualitative study, authors concluded that OpenStack would benefit in case of auto-scaling, while OpenNebula would benefit in case of persistent storage support. For the quantitative analysis, authors measured the deployment, network overhead and the clean-up time of VMs. The results of quantitative analysis showed that each platform can be used depending on user requirements and specifications.
In [75], authors provided a comparative study of four solutions: Eucalyptus, OpenStack, OpenNebula and CloudStack. To perform the comparison, authors adopted the following criteria: storage, network, security, hypervisor, scalable and installation code openness. In short, the results of this study [75] showed that OpenStack is the preferred cloud open source. Table 2 summarizes the preferred cloud IaaS in [72-75]. Based on this table, we decided to go for OpenStack as it is known for its flexibility and total openness.
40
6.2.Distributed and Parallel System Selection
To compare available distributed and parallel systems in the market, we opted again for the popularity index of those systems. The selection of competing systems was based on a study done in [76]. The study is summarized in Figure 13 which compares the popularity index of Hadoop, MongoDB, Cassandra, CouchDB, Redis, VoltDB, Neo4j, Riak and Infinispan. The study was done in 2012, and it demonstrates the total downloads between January 2011 and March 2012. Figure 13 depicts that Hadoop is the most popular distributed system, followed by MongoDB and Cassandra.
Figure 13: Active distributed systems population [76]
Based on Figure 13, we performed a qualitative analysis of both Hadoop and MongoDB in order to end up with one selected system for the present research.
MongoDB is a document-oriented, uses a binary form of JSON called Binary JSON store data in tables with columns and rows. To provide high redundancy and make data highly available, MongoDB offers replication across multiple servers. While data is synchronized between servers using replication, MongoDB also facilitates the scale out option by supporting sharding which partitions a collection and stores the different portions on different machines. MongoDB can be built with MapReduce as to execute data in parallel at each shard [62]. On the other hand, Hadoop is an open source for distributed file system that supports processing, analyzing and storing large data sets across large clusters using MapReduce paradigm and HDFS [7]. More details about Hadoop are included in chapter 8.
41
A study done in [77] compares MongoDB and Hadoop systems. The study came up with three main conclusions; first, it is not appropriate to use MongoDB as an analytics platform; second, using Hadoop for MapReduce jobs is several times faster than using the built-in MongoDB MapReduce capability, and third, MongoDB is much slower than HDFS. Besides, a study was done in [78] did a comparison of Map-Reduce Performance of Hadoop and MongoDB. In short, the study showed that MongoDB is roughly four times slower than Hadoop in fully-distributed mode.
Table 3 summarizes the selected distributed system in [77] and [78]. Based on this table, we decided to go for Hadoop as an analytical and storage tool for the present research.
42