VIABILIDAD DE UN DISEÑO - SATELLITE POWER SIMULATOR

11. DISEÑO PRELIMINAR, SIMULACIONES Y ENSAYOS DEL SISTEMA DE POTENCIA

11.3. SATELLITE POWER SIMULATOR

11.3.7 VIABILIDAD DE UN DISEÑO

We decided to investigate how our multithreading approach could compare to a distributed approach using Hadoop. Therefore, we implemented a pattern matching application in

MapReduce. We also used the standard library’s regex package. The mappers process the input and search for the pattern. The mapper produces the intermediate output <key, value> pairs. The key is a text object consisting of the pattern group and the value is an integer indicating the location of that group. These pairs are then passed to the reducer. The reducer aggregates the results based on the keys, and outputs another <key, value> pair, where the key is the pattern group and the value is a list of locations for that pattern.

Splits in Hadoop are of two types: physical and logical. Physical splits are how the HDFS chunks the files into blocks and distributes them among the nodes. This is based on a block size parameter that can be specified either in the conf files or when the input files are ported to HDFS. The logical splits are how the MapReduce splits these chunks to be processed by the mappers. This is defined by the specification of the input file format. The mapper in Hadoop processes the input file one line at a time, then performs the mapping function, which in our case, searches for the pattern location. The reducer then groups the patterns found with their locations and writes the result to a file. We have kept the default installation configurations and have chosen to use one reducer by setting ‘setNumReduceTasks’ to 1 so that the aggregated result will be in one file. The output file thus consists of the different patterns along with their locations in the text file.

The Hadoop library contains a RegexMapper<K> class. The implementation of the map function searches for the pattern. Hence, each line is processed separately, and the pattern is searched for within that line. We have based our implementation on the RegexMapper class but have tweaked it to fit our needs. We have also implemented a reducer class that aggregates the multiple patterns found and outputs the patterns with their locations to file.

Testing on Hadoop was performed on a cluster of five virtual machines with the same specifications as those mentioned in Table 4-1. One machine served as the master and slave node concurrently, and the other four machines served as slave nodes.

Execution time using Hadoop is measured from the time the input is read up until the result is written to file. The results are recorded in Table 4-11. The first row indicates the average runtimes in milliseconds and the second row is the percent improvement over the single thread

implementation. The first column corresponds to the single thread implementation. The column in the table labeled Hadoop gives the Hadoop runtime and improvement. We obtained an 89% reduction in runtime. This is due to the fact that Hadoop reads the input one line at a time and performs its map function on each line, which might be shorter than even the substrings in the multithreaded option.

Table 4-11: Average runtimes and percent improvement for Hadoop implementation t = 1 Hadoop Single lines

Average runtime (ms) 73,199 7,670 20,880

Percent improvement 89.52 71.47

In order to better compare, we have devised another approach that consists of the single threaded option reading the input from file and searching for the pattern one line at a time. This is comparable to the way Hadoop reads input files. The results are in the last column of Table 4-11. This method performed worse than Hadoop due to the fact that Hadoop has a more

efficient file system and does reading and writing to files more efficiently. However, this method performed better than our single-threaded method of reading the whole input into a string.

Comparing these two methods in Table 4-12, we see that Hadoop outperforms the single line implementation by 63%.

Table 4-12: Runtime improvements

Single Lines Hadoop

Average runtime (ms) 20,880 7,670

However, our multithreaded approach still gave better results. We believe this is because the data that we tested is not big enough to compensate for the expensive writes to disk. This led us to explore another alternative to MapReduce. Spark is a fast general compute engine that can run on top of Yarn and HDFS. In addition to achieving Hadoop’s objectives, Spark is able to evolve past these main attributes and solve several of Hadoop’s main issues. The discussion on Spark is beyond the scope of this chapter, but will be further discussed in the next one.

4.8 Conclusion

We have presented a tool that detects a user’s function call to search for a pattern and seamlessly replaces it with our own distributed MPI-based implementation. We have used bytecode injection and the Open MPI Java bindings to run our optimized code. We have used three different approaches for message passing: point-to-point using multiple sends and receives, collective using the MPI scatter routine, and local in which the data resides on all nodes. Our experiments show a reduction in total execution time for all three approaches, and we have discussed the advantages and disadvantages of selecting one over the other. In addition to that, we have explored another form of distributed communication represented by Hadoop. Because data exchange using Hadoop is based on the filesystem, we have chosen to compare its

performance to our multithreaded approach. This allowed us to learn how Hadoop works and what its limitations are.

In document Programación de un software para el prediseño del sistema de potencia de un satélite de órbita baja "Satellite Power simulation" (página 101-168)