
The previous sections gave a short overview of typical HPC systems and discussed a number of parallel programming paradigms that programmers use to harness the computing power of these systems. Parallel programming entails non-trivial challenges and, remembering the old saying that “premature optimization is the root of all evil” [33], programmers focus initially on making their parallel programs produce correct results. However, achieving correct results is a necessary condition for realizing the potential of HPC systems, but it is not a sufficient one. The goal is to maximize the amount of “completed science per cost and time” [34]. Since HPC systems are expensive and limited in lifetime, we have to optimize the applications as well as the system architecture, scheduling, and topology mapping. In this work, however, we focus specifically on applications, and the optimization goals in this case include execution time, scalability, efficiency, energy consumption, and memory usage. For example, faster execution means more “science per time”, better scalability translates into higher-fidelity simulations, and better efficiency means that fewer resources are wasted.

The process of systematic performance analysis and tuning of applications is called performance engineering [34,35]. Figure 1.6 presents a diagram of this process. It consists of three steps (plus two optional steps) arranged in a cycle that might be repeated a number of times until our performance goals are achieved. We start with initial observations that provide us with performance data. This step usually involves performing measurements by means of profiling and benchmarking. We then continue to the analysis step, in which we study the performance data along with our code in more depth using various tools and visualization techniques. We might also perform more measurements to gather specific counter and metrics data. The goal in this step is to gain an initial understanding of potential bottlenecks and identify optimization opportunities. One particular example is identifying hot spots, that is, specific places in the code in which the application spends a considerable amount of time and which are therefore good candidates for optimization. Following the analysis step are two optional steps, namely, simulation and performance modeling. Simulation is a form of rudimentary modeling, as it allows us to simulate isolated aspects of our application (e.g., specific functions or communication patterns) on hypothetical hardware, thereby providing us with accurate predictions. It appears as a separate step in the figure to emphasize that it is different from performance modeling, since it does not give us an analytical expression and might be too slow and expensive for analyzing behavior at larger scale [34]. The performance modeling step, which includes analytical modeling and empirical modeling, has a number of advantages that will be discussed below. Finally, we reach the step of code optimization, in which we apply suitable optimization strategies that might involve, for example, efficient cache use and computation-communication overlap. In general, optimization is a separate, very rich field of research, and there is no single solution that fits all HPC applications. In most cases, we would continue to another iteration of the engineering cycle to verify the optimized sections of the code and to identify new bottlenecks.

In the subsections below, we provide a brief overview of the observation, analysis, and performance modeling steps. The goal is to construct the necessary context for this dissertation rather than provide a comprehensive overview. For this reason, we do not cover the simulation and optimization steps in detail.

1.3.1 Observation

In the observation step, we collect the performance data of our code. The simplest approach is plain benchmarking, that is, running the code repeatedly (repetitions increase our confidence in the results), measuring the execution time of each run, and then collating the results. In most cases this approach is too coarse-grained and will not expose any hot spots in the code. We have to obtain more fine-grained performance data by means of either instrumentation or sampling.
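As an illustration of the plain benchmarking approach described above, the following is a minimal sketch in Python. The names `benchmark`, `summarize`, and `workload` are hypothetical and chosen only for this example; `workload` stands in for the code under test.

```python
import statistics
import time


def benchmark(func, repetitions=5):
    """Run func repeatedly and return the wall-clock time of each run in seconds."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Collate the individual timings into a few summary statistics."""
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }


def workload():
    # Hypothetical stand-in for the application being measured.
    return sum(i * i for i in range(100_000))


if __name__ == "__main__":
    print(summarize(benchmark(workload)))
```

Note that the result is a single number per run: it tells us how long the whole program took, but not where the time was spent.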

Instrumentation. Instrumentation is a technique in which performance measurement calls are inserted into the original code or are placed as hooks that intercept API calls. During code execution these calls are invoked and allow the performance measurement tool to record various performance data such as timestamps and memory usage. Well-known tools, such as Score-P [36], intercept MPI calls and insert calls before and after function invocations in the code. As a result, Score-P is able to construct a call tree with performance information (e.g., execution time, number of visits, bytes sent and received) for each node (including MPI functions). In Score-P, such a call tree is called a performance profile, as it summarizes the performance of the entire execution and can be produced without keeping all the recorded data. In

either the entire execution or parts of the execution. This data is called a trace and is kept for later, post-mortem analysis.
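The idea of inserting measurement calls before and after function invocations, and of summarizing them in a call-tree profile, can be sketched in a few lines of Python. This is only an illustration of the principle under simplifying assumptions (a single thread, a decorator instead of compiler or library hooks); it is not how Score-P is implemented, and the names `instrument`, `profile`, `solve`, and `main` are hypothetical.

```python
import time
from collections import defaultdict
from functools import wraps

# Profile keyed by call path, e.g. ("main", "solve").
profile = defaultdict(lambda: {"visits": 0, "time": 0.0})
_call_stack = []


def instrument(func):
    """Wrap func with measurement calls on entry and exit."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        _call_stack.append(func.__name__)
        path = tuple(_call_stack)
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Only the running totals are kept, not the individual events.
            entry = profile[path]
            entry["visits"] += 1
            entry["time"] += time.perf_counter() - start
            _call_stack.pop()
    return wrapper


@instrument
def solve(n):
    return sum(range(n))


@instrument
def main():
    return [solve(1000) for _ in range(3)]


if __name__ == "__main__":
    main()
    for path, data in profile.items():
        print(" -> ".join(path), data["visits"], f"{data['time']:.6f}s")
```

Because only visit counts and accumulated (inclusive) times are stored per call path, the memory needed for the profile does not grow with the length of the run.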

Sampling. One problem with instrumentation is performance perturbation, since calling functions to record performance data introduces overheads, such as longer execution times, or exacerbates existing issues, such as late arrivals of processes to MPI calls. As an alternative, sampling allows performance tools to interrupt the execution of the code at periodic intervals and record relevant performance data (e.g., function visits). Some sampling tools unwind the stack to retrieve call-path information and construct the call tree [37,38]. The execution time of a function is estimated by multiplying its percentage of visits by the total execution time. The interrupt interval (the inverse of the sampling frequency) should be chosen carefully, since a shorter interval might cause significant perturbation, while a longer one might give us less accurate performance data. Since no instrumentation is involved, sampling can be used without recompiling the code.
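The time estimate used by sampling tools, as described above, amounts to simple arithmetic. The sketch below assumes that the samples are already available as a list of function names, one entry per interrupt; the function name `estimate_times` and the sample data are hypothetical.

```python
from collections import Counter


def estimate_times(samples, total_time):
    """Estimate per-function time as (share of samples) * total execution time."""
    n = len(samples)
    if n == 0:
        return {}
    counts = Counter(samples)
    return {name: total_time * count / n for name, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical run: 10 samples taken over 20 seconds of execution.
    samples = ["compute"] * 6 + ["MPI_Allreduce"] * 3 + ["io"]
    print(estimate_times(samples, total_time=20.0))
```

With only ten samples, a function that is active for a short time may not be seen at all, which is the accuracy cost of a long interrupt interval mentioned above.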

1.3.2 Analysis

In the analysis step, we analyze the performance data collected earlier. In most cases, the goal is to reveal potential bottlenecks or identify hot spots for optimization. Good analysis tools facilitate the identification of hot spots by allowing users to easily navigate and explore the collected data. For example, Figure 1.7 shows two screenshots from well-known performance analysis tools for HPC applications, namely, Scalasca [39] and Vampir [40]. Scalasca collects performance profiles using Score-P and uses the CUBE tool [41] for their visualization and exploration. Figure 1.7a shows a snapshot of the CUBE graphical user interface, which displays performance data in three panes. The panes correspond to different dimensions of performance data, namely, metric, call tree, and system. In the left pane, users select a metric such as execution time or transferred bytes. The middle pane shows the call tree, which can be expanded or collapsed. Values for the selected metric are shown next to the tree nodes. The right pane is reserved for plugins. The default plugin is the system dimension, which presents a tree of nodes, processes, and threads. Vampir and other tools such as Extrae [42] are based on traces and visualize them along the time axis. Figure 1.7b, for example, shows a snapshot of the Vampir tool. The main pane in the upper left presents the traces, one for each process. The choice of colors for different parts of the trace provides a clear delineation between the communication phase (red) and the computation phase (blue). Such information is instrumental for identifying imbalances in the execution of the code. The lower left pane shows the trace data in the form of a plot of the floating-point operations counter in process 0. Note that the timeline and the counter plot are aligned, so that it is clear that the drop in the floating-point rate in process 0 is due to communication.

Mature performance analysis tools for HPC applications, such as Scalasca and Vampir in Figure 1.7, are the first choice in our efforts to understand and optimize our code. However, these tools are also limited in scope, since they do not show us how close we are to the optimum performance or whether issues that seem negligible at the current scale will become major problems at extreme scale.

Figure 1.7: (a) Scalasca; (b) Vampir.

1.3.3 Performance modeling

Performance modeling expresses different aspects of application performance with analytical expressions. A model of execution time as a function of the number of processes, for example, shows us how the application scales by predicting execution time at a higher scale. A model of execution time as a function of the input gives us an analytical expression of the algorithm complexity (e.g., $T(n) = 2.2n^3$ for simple matrix-matrix multiplication [34]). These models can be analytical models, that is, models constructed manually by careful analysis of the code, or empirical models, that is, models driven entirely by measurements of the code performance. Another class of models are requirements models, which express requirements, such as the number of floating-point operations needed to solve the problem, independently of the architecture. For example, the number of floating-point operations required to solve the (naïve) $n \times n$ matrix-matrix multiplication problem is $f(n) = 2n^3$, since we have three levels of nested loops with $n$ iterations each, and one multiplication and one addition in the innermost loop.
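The requirements model $f(n) = 2n^3$ for the naïve algorithm can be checked by counting the operations directly. The following Python sketch does so for small matrices stored as nested lists; the function name `matmul_with_flop_count` is hypothetical, and the code is meant to illustrate the count, not to be an efficient implementation.

```python
def matmul_with_flop_count(a, b):
    """Naive n x n matrix-matrix multiplication that also counts floating-point operations."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    flops = 0
    for i in range(n):          # first loop level: n iterations
        for j in range(n):      # second loop level: n iterations
            for k in range(n):  # third loop level: n iterations
                c[i][j] += a[i][k] * b[k][j]
                flops += 2      # one multiplication and one addition
    return c, flops


if __name__ == "__main__":
    n = 8
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    _, flops = matmul_with_flop_count(a, b)
    print(flops, 2 * n**3)
```

The count depends only on $n$ and on the algorithm, which is what makes the requirement independent of the architecture.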

Usually the performance engineering cycle in Figure 1.6 is repeated until we reach a performance level that is sufficient for our needs. The question, however, is how we know that we are close enough to the optimum performance. First, to estimate the optimum we can combine requirements models with system characteristics such as peak floating-point performance. This provides us with an understanding of the full performance potential. Second, a better goal for performance tuning would be to reach a performance level that is within some fraction of the optimum. This is where the advantages of analytical and empirical models can be used to the fullest. Not only can they give us insight into how close we are to the optimum, but they also allow us to see whether further optimization opportunities are still available.
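As a rough sketch of how a requirements model can be combined with a system characteristic, the code below divides the $2n^3$ floating-point operations of naïve matrix-matrix multiplication by a peak floating-point rate to obtain a lower bound on execution time, and then relates a measured time to that bound. The peak rate and the measured time are made-up values for illustration, the function names are hypothetical, and the bound ignores memory and communication costs.

```python
def optimum_time(n, peak_flops_per_second):
    """Lower bound on execution time: required operations divided by the peak rate."""
    required_flops = 2 * n**3  # requirements model for naive matrix multiplication
    return required_flops / peak_flops_per_second


def fraction_of_optimum(measured_time, n, peak_flops_per_second):
    """Share of the performance potential that the measured run achieves (1.0 = optimum)."""
    return optimum_time(n, peak_flops_per_second) / measured_time


if __name__ == "__main__":
    n = 1000
    peak = 1e11      # assumed peak rate of 100 Gflop/s (hypothetical system)
    measured = 0.1   # assumed measured execution time in seconds (hypothetical)
    print(f"lower bound: {optimum_time(n, peak):.3f} s")
    print(f"fraction of optimum: {fraction_of_optimum(measured, n, peak):.2f}")
```

In this made-up case the run reaches about one fifth of the potential, which would suggest that further optimization opportunities remain.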

Although analytical modeling is very useful for performance engineering, it is not easy to construct models for the different parts of an application. It is a laborious process that requires time and expert knowledge of the code. In Chapter 2, we provide an overview of the state of the art in empirical performance modeling. Specifically, we describe a technique for the automated construction of empirical models. This technique generates empirical models for each function (or code section) accurately and quickly [43]. In this way, performance modeling can be applied systematically to tune the performance of applications, as well as to optimize the system architecture. The latter aspect relates directly to the process of co-designing future systems.
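To give a flavor of what an empirical model is, the sketch below fits a single-term model $T(n) = c \cdot n^k$ to measurements by trying a few candidate exponents and keeping the one with the smallest squared error. This is a deliberately simplified illustration and not the technique of [43] described in Chapter 2; the function names, the candidate exponents, and the synthetic measurements (generated from the $2.2n^3$ example above) are assumptions made for this example.

```python
def fit_coefficient(ns, ts, exponent):
    """Least-squares estimate of c in the model t = c * n**exponent."""
    numerator = sum(t * n**exponent for n, t in zip(ns, ts))
    denominator = sum(n ** (2 * exponent) for n in ns)
    return numerator / denominator


def squared_error(ns, ts, c, exponent):
    """Sum of squared differences between the measurements and the model."""
    return sum((t - c * n**exponent) ** 2 for n, t in zip(ns, ts))


def fit_model(ns, ts, exponents=(1, 2, 3)):
    """Return (c, k) for the candidate exponent k with the smallest squared error."""
    best = None
    for k in exponents:
        c = fit_coefficient(ns, ts, k)
        error = squared_error(ns, ts, c, k)
        if best is None or error < best[2]:
            best = (c, k, error)
    return best[0], best[1]


if __name__ == "__main__":
    # Synthetic "measurements" generated from T(n) = 2.2 * n**3.
    ns = [64, 128, 256, 512]
    ts = [2.2 * n**3 for n in ns]
    c, k = fit_model(ns, ts)
    print(f"T(n) = {c:.2f} * n^{k}")
```

Real measurements are noisy and often need more than one term, which is part of what makes automated model construction a research topic in its own right.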