One interesting field of application is capacity planning. In collaboration with Tom Cornebize, we studied how HPL benefits from additional resources and how its energy consumption changes accordingly. As is generally the case for capacity planning, the machine does not (yet) exist. The platform we imagined contains 256 nodes with 12 cores per node. This is deliberately similar to the taurus cluster that was used for the validation study. This time, however, we decided to connect all nodes with a hypothetical fat-tree network, built with 16-port switches on two levels: the top layer consists of 2 switches whereas the bottom layer comprises 16. The links were set up as 10 Gbit/s Ethernet links.

The previous validation study leveraged SMPI’s emulation mechanism and hence executed every instruction of the unmodified applications. At most 144 processes were used and the workload size was limited: For HPL, a matrix of size 20,000 × 20,000, as previously used, consumes 3.2 GB of memory. For this (very small) use case, the main problem was hence not the memory consumption but the emulation time of almost two hours. This is a drastic, but not unexpected, increase from the 20 seconds it took to execute HPL with 144 MPI processes on 12 nodes of the taurus cluster. This problem is exacerbated by more processes, a more complex network and larger problem instances. Furthermore, memory does become an issue once the input matrix reaches a certain size: For a still relatively small 65,536 × 65,536 matrix, 34.3 GB of memory are required, which surpasses the entire memory of a single taurus node.

To alleviate this prohibitive resource hunger, we resorted to the two techniques presented in Section 4.3.3 that allow SMPI to emulate runs at larger scale by exploiting HPL’s regularity: first, kernel modeling, which reduces the execution time by skipping the execution of the kernels, and second, shared memory usage, which reduces the required memory. The modifications that finally allowed us to simulate HPL on a single node, using a model and at the scale of the Stampede supercomputer, are detailed in a technical report [Cor+17] that is currently being prepared for publication. For the above matrix (20,000 × 20,000), these modifications allowed us to run the (sequentially executed) emulation of 144 processes on a commodity laptop in under two minutes with as little as 43 MB of RAM used. Using all nodes and all cores of our (hypothetical) platform, the emulation of 256 × 12 = 3,072 MPI processes took around 90 minutes and required less than 1.5 GB of RAM.

Our study was limited to the two aforementioned input sizes. For each size, a total of 5 scenarios were executed; the results are depicted in Figure 8.11. For up to 12 nodes, the simulation yields the same results as were obtained in real life (red line in the top figure). This is expected because all nodes are connected to the same switch in this case and the new network topology is hence ignored. When using more nodes than can be connected to the same switch, however, a slowdown can be observed due to the added latency. This also increased the rate of power consumption, which is already elevated due to the added nodes. The larger matrix, on the other hand, was more suitable for scaling. The energy consumption continued to grow, but this is partly also due to a different ratio of communication and computation.
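The memory figures quoted above can be checked with a short calculation. The following is a minimal sketch that assumes the matrix is stored densely with 8 bytes per element (double precision) and that "GB" means 10^9 bytes; the function name `hpl_matrix_bytes` is chosen here for illustration only.

```python
def hpl_matrix_bytes(n, bytes_per_element=8):
    """Memory needed to store a dense n x n matrix.

    Assumption: 8 bytes per element (double precision).
    """
    return n * n * bytes_per_element


for n in (20_000, 65_536):
    gigabytes = hpl_matrix_bytes(n) / 1e9  # decimal gigabytes
    print(f"N = {n:>6}: {gigabytes:.2f} GB")
```

Under these assumptions, the smaller matrix needs exactly 3.2 GB and the larger one slightly more than 34.3 GB, which agrees with the values given in the text.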

8.5 Limitations

8.5.1 Model Limitations

Recall from Section 8.2 that our model is calibrated with, and hence depends on, the workload w. Since the model calculates the power consumption for an entire node, it implicitly assumes that every core either executes w (or a workload with similar characteristics, e.g., memory accesses, cache usage, I/O, ...), communicates, or is fully idle, and that the workload only changes between these three cases throughout the entire execution. For HPC applications, this is often a reasonable assumption due to their regularity. However, this restriction can be violated in three scenarios that are more difficult or currently impossible to model because the memory and cache usage is difficult to predict.

[Figure 8.11 plot: for each matrix size (20,000 and 65,536), one panel for run-time (in s) and one for energy (in kJ), plotted over nodes x processes per node (1x12 to 256x12). Series: HPL reality, simulation, ideal scaling. The region using more than one switch is marked "Above 1 switch".]

Figure 8.11: Time- and energy-to-solution extrapolated for two different matrix sizes with up to 256 × 12 = 3,072 MPI processes, interconnected by a fat-tree topology. Once a threshold is reached, adding more nodes does not yield faster performance but only increased energy consumption.

First, an application can consist of phases, i.e., the characteristics of the currently executed code change, for instance from a memory- or cache-heavy computation to an I/O-heavy checkpointing procedure (see Section 2.2.1), or when a workload is offloaded to a previously idle GPU. The energy profile of the computational workload of the application therefore does not remain constant throughout the execution. Each of these phases should then be characterized individually. This is already possible and can be done analogously to the previously described “special states” such as booting. However, the user may encounter further problems when the phases are of microscopic length (e.g., when each kernel constitutes an individual phase) because power measurement and tracing tools that support this precision are rarely available. As explained in Section 5.1, the wattmeter we used for our experiments returned a single value per second. In such a case, the measured power cannot be clearly attributed to a single phase. One can, however, track how much time $t_{p,i}$ was spent working on phase p during the i-th measurement interval. Likewise, the amount of energy $e_i$ consumed during the i-th measurement interval can easily be measured. With a large enough set of samples $(e_i, t_{1,i}, \dots, t_{N,i})$, the application of statistical estimators should make it possible to infer the consumption of each phase.
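One possible estimator for the samples described above is ordinary least squares. The sketch below is an illustration under stated assumptions, not the method used in this work: it assumes that each phase p draws a constant power P_p, so that e_i ≈ Σ_p P_p · t_{p,i}, and it solves the resulting normal equations in plain Python. The function names and the sample data (two phases drawing 200 W and 120 W) are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b]; copies keep the inputs untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("samples do not allow the phases to be separated")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def estimate_phase_power(samples):
    """Least-squares estimate of the power (in W) drawn by each phase.

    samples: list of (e_i, [t_1i, ..., t_Ni]) with e_i in joules and
    t_pi in seconds. Assumes constant power per phase (hypothetical model).
    """
    n = len(samples[0][1])
    # Normal equations (T^T T) P = T^T e.
    ata = [[sum(t[j] * t[k] for _, t in samples) for k in range(n)]
           for j in range(n)]
    ate = [sum(t[j] * e for e, t in samples) for j in range(n)]
    return solve_linear_system(ata, ate)


# Hypothetical data: two phases, five one-second measurement intervals.
true_power = [200.0, 120.0]
times = [[1.0, 0.0], [0.7, 0.3], [0.2, 0.8], [0.5, 0.5], [0.0, 1.0]]
samples = [(sum(p * t for p, t in zip(true_power, ts)), ts) for ts in times]
estimated = estimate_phase_power(samples)
print(estimated)
```

With noise-free synthetic samples, the estimate recovers the per-phase power that generated the data. Real measurements are noisy, so more samples and a check of the residuals would be needed.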

In the second problematic scenario, the CPU is shared between different applications, each with a different but constant energy consumption. In-situ applications are an example. These are applications that can be separated into mainly two parts: The first part (simulation component) generates data that are subsequently analyzed statistically by the second part (statistics component). This case is very different from the previous one, as several kernels are executed at the same time and hence potentially impact each other. Providing a single energy profile per application might therefore not be sufficient. Instead, it may be necessary to obtain an energy and performance profile of the concurrently executed components that depends on the number of cores allotted to each component.

Finally, in the third scenario, the execution of kernels and applications is no longer structured, i.e., no assumptions can be made about which workload is executed at a specific time. Highly dynamic applications and runtimes (such as StarPU, see Section 4.4) often fall into this category because the number of cores and kernels causes a combinatorial explosion of kernel combinations that could potentially execute in parallel. Alas, the cache and memory usage directly influences the energy consumption, and the only real option for faithful predictions with our model is to obtain measurements for all possible combinations.
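To give a rough idea of the size of this measurement space, the following sketch counts the distinct workload mixes on a single node. It assumes that cores are interchangeable and that each core either runs one of k kernel types or is idle, so a mix is a multiset of size c drawn from k + 1 options. This counting model and the function name `concurrent_mixes` are illustrative assumptions and are not taken from the model described above.

```python
from math import comb


def concurrent_mixes(num_kernels, num_cores):
    """Number of distinct per-node workload mixes.

    Assumption: cores are interchangeable and each core runs one of
    num_kernels kernel types or is idle (multisets of size num_cores
    over num_kernels + 1 options).
    """
    options = num_kernels + 1  # kernel types plus "idle"
    return comb(options + num_cores - 1, num_cores)


for k in (1, 4, 10):
    print(f"{k:>2} kernel types on 12 cores: {concurrent_mixes(k, 12)} mixes")
```

With a single kernel type, this reduces to the 13 load levels (0 to 12 active cores) of a 12-core node. The count grows quickly with the number of kernel types, which illustrates why exhaustive measurements soon become impractical.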

8.5.2 Experimental Limitations

Recall from Section 5.1 that our wattmeter provides a single, averaged sample per second. It lies in the very nature of an average of a non-constant series that some values must be larger and others smaller than the average. To instantiate our model, the user is required to supply the consumption when all cores are active. This value was measured by running the application on all cores at the same time, but only later did we find (through simulation) that the node’s load changes many times per second when executing NAS-LU. To exemplify this, Table 8.1 lists for each core count the absolute time during which exactly this many cores are active. As one can see, the application computes concurrently on all cores for only about 62.76 % of the entire execution time. Since our real-life power measurement is an average, the actual consumption for all cores must be higher, because the samples also include the consumption during the remaining 37.24 % of the execution when fewer cores are used. For our energy model, this implies that the consumption we extrapolate when 2 to 11 cores are used must underestimate the actual power usage, since the slope of the linear function should be steeper. We believe that this error should be accounted for in future versions but that our results are still valid: Table 8.1 shows that only very few cores are idle during this remaining time and that the energy consumption hence remains relatively high. This means that the actual maximal consumption should only differ by a few watts. Moreover, compared to the measured consumption of over 200 W, the actual error should only be a few percent.

Cores   Load    Total time (s)   Percent of total execution time (%)
0       0.000   0.105638         0.12987747
1       0.083   0.316098         0.38862918
2       0.166   0.016594         0.020401624
3       0.250   0.247750         0.30459819
4       0.333   0.284455         0.34972544
5       0.416   0.473350         0.58196389
6       0.500   1.607626         1.9765085
7       0.583   0.977106         1.2013107
8       0.666   2.194727         2.6983244
9       0.750   5.018271         6.1697528
10      0.833   5.915051         7.2723061
11      0.916   13.126249        16.138170
12      1.000   51.053753        62.768439

Table 8.1: An entire execution of NAS-LU (class C) on a single node with 12 cores, broken down by the time spent with each possible load factor and the percentage relative to the total execution time of 81.336 s.
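The size of this error can be estimated from the times listed in Table 8.1. The sketch below computes the time-weighted mean number of active cores and then inverts the averaging under the assumption that power is linear in the number of active cores. The idle and measured power values (100 W and 200 W) are hypothetical placeholders, not measurements from this work, and the function name is chosen for illustration.

```python
# Seconds spent with k active cores, k = 0..12 (values from Table 8.1).
TIME_PER_CORE_COUNT = [
    0.105638, 0.316098, 0.016594, 0.247750, 0.284455, 0.473350, 1.607626,
    0.977106, 2.194727, 5.018271, 5.915051, 13.126249, 51.053753,
]
NUM_CORES = 12

total_time = sum(TIME_PER_CORE_COUNT)
full_load_fraction = TIME_PER_CORE_COUNT[-1] / total_time
mean_active_cores = (
    sum(k * t for k, t in enumerate(TIME_PER_CORE_COUNT)) / total_time
)
mean_load = mean_active_cores / NUM_CORES


def corrected_full_load_power(p_measured, p_idle, load):
    """Full-load power implied by an averaged measurement.

    Assumes P(k) = p_idle + (p_full - p_idle) * k / NUM_CORES, so the
    time average is p_idle + (p_full - p_idle) * load.
    """
    return p_idle + (p_measured - p_idle) / load


# Hypothetical power values, used only to illustrate the correction.
p_full = corrected_full_load_power(p_measured=200.0, p_idle=100.0, load=mean_load)
print(f"mean active cores: {mean_active_cores:.2f}")
print(f"corrected full-load power: {p_full:.1f} W")
```

Because about 11 of the 12 cores are active on average, the correction stays within a few watts for these placeholder values, in line with the argument made in the text.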