VI. GENERAL METHODOLOGY
6. Prospects for potential applications
It was already mentioned in Section 4.1.2 that SimGrid is currently undergoing a major rewrite through iterative refactoring. This changing infrastructure, together with the complexity of the simulated codes, often made it necessary to contribute additional, refactored, or fully rewritten code, tests, or bug fixes to the parts of the code that I had started to (co-)maintain, most notably SMPI.
The development of SimGrid itself has taken a significant amount of time during this dissertation project. A short overview of my main contributions is hence given in the following sections.
4.5.1 Platform description
The XML description is both verbose and rigid, which makes it ill-suited for modeling large and complex platforms. The scripting language Lua has been tested for this purpose and was found to be promising for describing complex platforms programmatically; it could be extended to routing algorithms as well. However, the current Lua implementation must be maintained alongside the XML-based default, meaning that every change needs to be implemented once for XML and once for Lua. It was therefore decided not to improve Lua support but rather to move to a Python-based implementation in the future, as bindings can be generated automatically from the C++ API. This has the benefit that, by using existing, well-tested Python-based XML parsers, the C-based FleXML parser can be removed in the future. Furthermore, Python is more universally known and may hence be easier to use.
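To illustrate why a programmatic description scales better than hand-written XML, the following minimal sketch generates a small platform file with Python's standard `xml.etree.ElementTree` module. It does not use SimGrid's Python bindings. The element and attribute names (`platform`, `zone`, `host`, `link`, `route`, `link_ctn`) as well as the topology and the default values are assumptions chosen for illustration and should be checked against the SimGrid platform reference.

```python
import xml.etree.ElementTree as ET


def build_star_platform(n_hosts, speed="1Gf", bandwidth="125MBps", latency="50us"):
    """Return an XML string with n_hosts hosts, each owning one private link.

    Element and attribute names are assumptions modeled on SimGrid's XML
    platform format; the topology itself is purely illustrative.
    """
    platform = ET.Element("platform", version="4.1")
    zone = ET.SubElement(platform, "zone", id="star", routing="Full")

    # Declare all hosts, then all links.
    for i in range(n_hosts):
        ET.SubElement(zone, "host", id=f"node-{i}", speed=speed)
    for i in range(n_hosts):
        ET.SubElement(zone, "link", id=f"link-{i}",
                      bandwidth=bandwidth, latency=latency)

    # One route per host pair, crossing the private links of both endpoints.
    for src in range(n_hosts):
        for dst in range(src + 1, n_hosts):
            route = ET.SubElement(zone, "route",
                                  src=f"node-{src}", dst=f"node-{dst}")
            ET.SubElement(route, "link_ctn", id=f"link-{src}")
            ET.SubElement(route, "link_ctn", id=f"link-{dst}")

    return ET.tostring(platform, encoding="unicode")
```

The number of `route` elements grows quadratically with the number of hosts, which is what makes writing such descriptions by hand impractical for large platforms.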
4.5.2 PAPI support
During my research, several application-dependent issues were encountered when simulating with SMPI, such as seemingly inexplicably optimistic performance estimations. Investigating them proved to be extremely difficult. To determine the cause, further information was necessary, for example as provided by hardware counters; this led us to identify cache-related issues (see Chapter 6 for details).
One way of retrieving these counters is through PAPI [Ter+10], a well-known, robust, and portable API that provides means to obtain performance information by inspecting and reporting hardware counters of the CPU. Support for PAPI was contributed to SMPI and can be used to collect PAPI counters for each actor (and not just for the entire simulation). Since SimGrid’s own code can impact counters as well (e.g., total number of instructions), counter values must be stored before the execution of an actor and immediately after the execution has finished. The difference between these two counter values is then attributed to the actor itself and stored in a trace.
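The before/after bookkeeping described above can be sketched as follows. This is not the SMPI implementation: the class `CounterTracer` and the callable `read_counters` are hypothetical names, and the counter source is left abstract so that the sketch runs without PAPI installed.

```python
class CounterTracer:
    """Attribute counter increments to individual actors (illustrative only)."""

    def __init__(self, read_counters):
        # read_counters is a hypothetical callable returning a dict that maps
        # counter names to their current cumulative values.
        self._read = read_counters
        self.trace = []  # list of (actor_name, {counter: delta}) entries

    def run_actor(self, actor_name, actor_body):
        # Snapshot right before the actor runs, so that whatever the
        # simulator itself did earlier is excluded.
        before = self._read()
        actor_body()
        # Snapshot immediately after the actor yields control again.
        after = self._read()
        delta = {name: after[name] - before[name] for name in before}
        self.trace.append((actor_name, delta))
        return delta
```

A usage example with a fake counter source might look like this: a dictionary such as `{"PAPI_TOT_INS": 0}` is incremented both by simulated simulator overhead and by the actor body, and only the increments that occur inside `run_actor` end up in `trace`.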
To investigate the actor’s behavior with PAPI, counters of interest must first be declared. Currently, this is only possible for all ranks (i.e., all ranks inspect the same counters), but first steps were already taken to assign a different set of counters to each rank. However, this was not required for my work and hence not fully pursued.
4.5.3 Privatization
Privatization has been discussed in Section 4.3.1. These techniques have been implemented for several years, but the implementation was prohibitively static and supported neither the introduction of daemons in SimGrid (i.e., of non-MPI processes that execute work in the background) nor the dynamic addition of processes. Through a refactoring process, limiting constructs such as fixed-size arrays and double indirections were identified and removed.
Furthermore, instead of being stored in global variables, the privatization segments are now directly associated to each actor through a member in the corresponding class.
Although these refactorings were important to support daemons (required for our investigation of DVFS governors in Chapter 9), other projects have already benefited from these changes as well: most notably, they enabled scheduling simulations with multiple applications executing at the same time, as done by the BatSim [Dut+16a] project.
This support for dynamicity is also a prerequisite for functions that spawn child processes. Such functions, for instance MPI_Comm_spawn(), are part of the MPI standard and can now be implemented in SMPI.
4.5.4 Energy plugin
It was already discussed in Section 2.2 that energy consumption will play a critical role in the future. An interesting question is therefore the energy efficiency of an application or of an algorithm. The energy model that was developed and evaluated during this dissertation (see Chapter 8 for details) was implemented in SimGrid as a plugin. This plugin does not depend on SMPI and is therefore available for any simulation using SimGrid.
4.5.5 DVFS plugin
To investigate further options for increased energy efficiency, a new plugin providing support for several DVFS governors will be presented in Chapter 9. This plugin allows users to select DVFS governors on a per-host basis and provides several classical algorithms (performance, powersave, on-demand, ...).
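As a rough illustration of what such governors decide, the sketch below selects a performance state (pstate) per host. It is deliberately simplified and is not the plugin's code: the function names, the `up_threshold` parameter, and the convention that pstate 0 is the fastest are assumptions made for this example.

```python
def performance(pstates, load):
    """Always run at the fastest pstate (index 0 by assumption)."""
    return 0


def powersave(pstates, load):
    """Always run at the slowest pstate (last index by assumption)."""
    return len(pstates) - 1


def ondemand(pstates, load, up_threshold=0.8):
    """Simplified on-demand policy.

    Jump to the fastest pstate under high load; otherwise pick the slowest
    pstate whose speed still covers the load measured at full speed.
    """
    if load >= up_threshold:
        return 0
    needed_speed = load * pstates[0]
    for index in range(len(pstates) - 1, -1, -1):
        if pstates[index] >= needed_speed:
            return index
    return 0


def select_pstates(host_governors, host_loads, pstates):
    """Apply each host's own governor, i.e. a per-host selection."""
    return {host: governor(pstates, host_loads[host])
            for host, governor in host_governors.items()}
```

Here `pstates` is a list of speeds sorted from fastest to slowest and `load` is a utilization in the range 0 to 1; passing a different governor function for each host mirrors the per-host selection described above.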
4.5.6 Load Balancing
Distributing the load unequally over the nodes is a well-known cause of subpar performance, even on medium-sized machines. The upcoming massive increase in parallelism of exascale machines will further exacerbate the situation (see also Section 2.2.1 for a brief problem presentation). The HPC community has hence declared load balancing to be of critical importance for exascale performance [Don+11]. Unfortunately, understanding load-related issues is often a very tedious task, but without this knowledge the development of efficient algorithms is almost impossible. The development of better tools is therefore necessary [Don+11, pp. 40, 54], and the simulation approach as provided by SimGrid certainly provides valuable insight. Alas, there was no high-level API call (i.e., one that does not directly query the SimGrid core) to obtain the load of a particular node. To alleviate this situation, I developed a plugin that can be used to obtain the load of one or more arbitrary hosts at any time.
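Once per-host load values are available, a simple imbalance indicator can be derived from them. The sketch below uses the common "maximum over mean" metric, which is my choice for illustration and is not taken from the plugin; the function names and the input dictionary are hypothetical.

```python
def load_imbalance(loads):
    """Return max(load) / mean(load) - 1; 0.0 means perfectly balanced."""
    mean = sum(loads) / len(loads)
    if mean == 0:
        # No work anywhere: treat as balanced rather than dividing by zero.
        return 0.0
    return max(loads) / mean - 1.0


def most_loaded(host_loads):
    """Return the name of the host with the highest load."""
    return max(host_loads, key=host_loads.get)
```

For example, four hosts with loads 2.0, 1.0, 1.0, and 0.0 have a mean load of 1.0, so the most loaded host carries twice the average and the indicator is 1.0.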
Unfortunately, (extreme) scale comes with (extreme) complexity. Load imbalances can therefore easily be misunderstood, and solutions built into applications may work on one machine but not on others. Programmers should hence not attempt to implement their own load balancer in their applications [Don+11, p. 57], as this may in fact lead to adverse results due to the complexity and diversity of platforms. On real machines, runtime systems such as StarPU [Aug+10] or Charm++ [Acu+14] are therefore required to relieve the programmer of the responsibility to load balance the application (see also the brief discussion in Section 2.2.4).
Naturally, given the importance of the subject, the evaluation of load-balancing techniques is very interesting. Rafael Keller Tesser studied the impact of selected Charm++/AMPI [Acu+14] based load balancers with SimGrid in his dissertation project [Kel18]. Unfortunately, this implementation was based on a forked SimGrid version that was quickly outdated due to the rapid development of SimGrid and its APIs. As a contribution to a joint work [Tes+18], I rewrote the entire code base, including the load-balancing algorithm from Charm++. Chapter 9 details how this contribution was later used in my own research.
5 Experimental Methodology
Parts of this chapter were published previously as part of a preprint [Hei+17a].
5.1 Experimental Setup
All experiments for this thesis were executed using a cluster provided by the Grid’5000 infrastructure project [Bal+13]. Grid’5000 provides clusters at seven sites within France plus one in Luxembourg. We were only able to choose among the Lyon-based Grid’5000 clusters, as only they offered a hardware wattmeter with which we could measure the power consumption of a node during our experiments. The power consumption of CPUs can be obtained on modern processors through hardware counters; however, we are interested in the total node consumption, which makes the usage of a wattmeter more convenient. We therefore chose the Taurus cluster¹ because it was the largest and most “recent” (from 2011) cluster at the time. The measured power values were accessed through a specific server that queried 4 wattmeters located on-site. For each plug, a total of 3600 measurements per second were made, each with an accuracy of 0.125 W; these were subsequently averaged and returned to the wattmeter server as a single value.²
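The averaging step can be sketched as follows. The sampling rate and the 0.125 W step come from the description above; interpreting the accuracy as a quantization step and deriving energy from the per-second averages are assumptions added for illustration, and all function names are hypothetical.

```python
SAMPLES_PER_SECOND = 3600   # measurements per plug and second (from the text)
RESOLUTION_W = 0.125        # assumed here to act as a quantization step


def quantize(power_w, step_w=RESOLUTION_W):
    """Round a power reading to the nearest multiple of the step size."""
    return round(power_w / step_w) * step_w


def average_second(samples_w):
    """Collapse all samples of one second into the single reported value."""
    return sum(samples_w) / len(samples_w)


def energy_joules(per_second_averages_w):
    """Approximate energy as the sum of 1-second average powers (W * 1 s)."""
    return sum(per_second_averages_w) * 1.0
```

With one reported value per second, the energy consumed by a node over an experiment is then approximately the sum of the values that fall within the experiment's time window.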