
In order to keep pace with the increasing level of parallelism of modern supercomputers and to allow for analyzing large-scale applications with thousands of processes, Periscope utilizes a distributed approach for performance analysis. That is, it spawns a hierarchy of communication and analysis agents (see Figure 4.3) for collecting and analyzing the performance-critical data of a parallel application. Each of the analysis agents, i.e., the leaves of the agent hierarchy, searches autonomously for inefficiencies in a subset of the application processes. The communication agents collect the information gathered by the analysis agents, aggregate it, and propagate it through the hierarchy to the master agent at the top. The master agent connects to the frontend that interacts with the user.
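The flow of results through the agent hierarchy can be sketched in C++ as a simple tree reduction. This is a minimal illustration, not Periscope's actual implementation: the types `Finding` and `Agent`, their fields, and the merge rule (summing the affected process counts and keeping the maximum severity per property name) are assumptions made for the example.

```cpp
#include <algorithm>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record of one detected inefficiency (not a Periscope type).
struct Finding {
    std::string property;  // name of the performance property
    int processes;         // number of application processes affected
    double maxSeverity;    // highest severity seen for this property
};

// Hypothetical agent node: leaves play the role of analysis agents,
// inner nodes the role of communication agents, the root the master agent.
struct Agent {
    std::vector<Finding> local;                    // filled by analysis agents
    std::vector<std::unique_ptr<Agent>> children;  // filled by inner agents

    // Collect the findings of this subtree and merge entries that share a
    // property name (assumed merge rule, chosen for illustration only).
    std::vector<Finding> report() const {
        std::map<std::string, Finding> merged;
        auto absorb = [&merged](const Finding& f) {
            auto res = merged.insert(std::make_pair(f.property, f));
            if (!res.second) {
                Finding& m = res.first->second;
                m.processes += f.processes;
                m.maxSeverity = std::max(m.maxSeverity, f.maxSeverity);
            }
        };
        for (const Finding& f : local) absorb(f);
        for (const auto& child : children) {
            for (const Finding& f : child->report()) absorb(f);
        }
        std::vector<Finding> out;
        for (const auto& kv : merged) out.push_back(kv.second);
        return out;
    }
};
```

Calling `report()` on the root yields one merged entry per property, so the amount of data handed to the frontend grows with the number of distinct properties rather than with the number of processes.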

Upon startup, the frontend analyzes the set of available processors, determines a suitable mapping of application processes and agents to the processors, and then starts the application and the hierarchy of communication and analysis agents. Afterwards, a command is propagated from the frontend down to the analysis agents, which then start the search for performance bottlenecks. The application’s performance data are collected by the application processes themselves using the Periscope performance monitoring library, which provides functions for gathering the relevant data. The necessary calls to the library functions are added by source code instrumentation, which allows for selectively instrumenting code regions, i.e., functions, loops, vector statements, OpenMP blocks, IO statements, and call sites. The monitoring library also provides the Monitoring Request Interface (MRI) to which the analysis agents connect. The MRI allows for configuring the measurements and retrieving performance data as well as starting, halting, and resuming the execution of the application.

The foundation of Periscope’s performance analysis is formed by so-called performance properties that formally represent the performance characteristics of an application. They are specified in the APART specification language ASL [40], translated into C++ classes, and loaded at runtime. An ASL performance property specification consists of three parts: condition, confidence, and severity. Condition specifies the condition that has to be met for the property to hold. It can be derived from the performance metrics that are measured by the performance monitoring library.

[Figure 4.3 diagram labels: Frontend; Performance Analysis Agent Network (Master Agent, Communication Agents, Analysis Agents); MRI Monitors; Application Processes]

Figure 4.3: Periscope consists of a frontend and a hierarchy of communication and analysis agents. The analysis agents configure the MRI-based monitors of the application processes and retrieve performance data.

Confidence is a value in the interval [0, 1] that quantifies the certainty that a property holds. Severity is an indicator of the importance of the property and of its impact on the application’s performance. For example, in the case of an MPI application, an interesting performance property is Early Receiver. It indicates that a receiving process called the receive function before the corresponding send function was called and therefore had to wait for the message to arrive. The condition of the property can be derived from the waiting time within the MPI receive operation, i.e., if the process had to wait longer than a certain threshold, the property holds. For most properties, the confidence is 1.0 as they are based on reliable measurements. The severity usually relates the time that was wasted to the total runtime of the application. That is, if the waiting time is short compared to the total runtime, the property has little impact on the application’s performance and is therefore not severe. A collection of relevant OpenMP and MPI performance properties has been assembled by the APART working group [37, 38].
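To make the three parts of a property concrete, the Early Receiver example can be sketched as a C++ class. This is an illustration under stated assumptions and not the code that Periscope generates from ASL: the interface `PerformanceProperty`, the struct `ReceiveMeasurement`, and the threshold parameter are hypothetical names, and the severity is assumed to be the plain ratio of waiting time to total runtime.

```cpp
#include <string>

// Hypothetical measurements for one MPI receive call site; the field
// names are illustrative and not taken from the monitoring library.
struct ReceiveMeasurement {
    double waitingTimeSec;   // time spent waiting inside the receive
    double totalRuntimeSec;  // total runtime of the application
};

// Hypothetical interface exposing the three parts of an ASL property.
class PerformanceProperty {
public:
    virtual ~PerformanceProperty() = default;
    virtual std::string name() const = 0;
    virtual bool condition() const = 0;     // does the property hold?
    virtual double confidence() const = 0;  // certainty, in [0, 1]
    virtual double severity() const = 0;    // impact on performance
};

class EarlyReceiver : public PerformanceProperty {
public:
    EarlyReceiver(ReceiveMeasurement m, double thresholdSec)
        : m_(m), thresholdSec_(thresholdSec) {}

    std::string name() const override { return "Early Receiver"; }

    // The property holds if the receiver waited longer than the threshold.
    bool condition() const override {
        return m_.waitingTimeSec > thresholdSec_;
    }

    // Based on a direct time measurement, hence full confidence.
    double confidence() const override { return 1.0; }

    // Assumed severity: wasted time relative to the total runtime.
    double severity() const override {
        if (m_.totalRuntimeSec <= 0.0) return 0.0;  // avoid division by zero
        return m_.waitingTimeSec / m_.totalRuntimeSec;
    }

private:
    ReceiveMeasurement m_;
    double thresholdSec_;
};
```

With 2 s of waiting time in a 100 s run and a threshold of 0.5 s, the condition holds and the severity under this assumed definition is 0.02, i.e., about 2% of the runtime is lost to waiting.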

The search for performance bottlenecks is performed in one or more experiments according to a search strategy determined by the user at startup. The search strategies are implemented as C++ classes, and custom classes may be provided by the user. Each strategy defines an initial set of hypotheses, i.e., performance properties that are to be checked in an experiment, as well as the refinement of detected properties to a new set of hypotheses that are to be checked in further experiments. Many applications in HPC have an iterative behavior, e.g., a loop in which each iteration computes the next time step of the simulation. If the application has such an iterative behavior, each iteration can be regarded as a single experiment. Otherwise, the whole program is executed for an experiment. The agents start from the initial set of hypotheses, request the information necessary for proving the hypotheses via MRI, release the application for performing the experiment, retrieve the information from the monitor after the processes have been suspended again, and evaluate which hypotheses hold. If necessary, a proven hypothesis is refined and the next experiment is performed. At the end of the local search, the detected performance properties are reported back via the agent hierarchy to the frontend. The communication agents combine similar properties found in their child agents and forward only the combined properties.
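The experiment loop of a single analysis agent can be sketched as follows. This is a simplified model rather than Periscope's strategy interface: `SearchStrategy`, `Experiment`, and `localSearch` are hypothetical names, hypotheses are reduced to strings, and the callback `runExperiment` stands in for the whole sequence of configuring the measurements via MRI, releasing the application, retrieving the data, and evaluating which hypotheses hold.

```cpp
#include <functional>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Hypothesis = std::string;

// Hypothetical strategy: an initial hypothesis set plus a table that maps
// a proven hypothesis to the refined hypotheses to check next.
struct SearchStrategy {
    std::vector<Hypothesis> initial;
    std::map<Hypothesis, std::vector<Hypothesis>> refinements;
};

// Stand-in for one experiment: takes the hypotheses to check and returns
// the subset that was proven.
using Experiment =
    std::function<std::vector<Hypothesis>(const std::vector<Hypothesis>&)>;

// Run experiments until no refined hypotheses remain; return all proven
// hypotheses in the order in which they were detected.
std::vector<Hypothesis> localSearch(const SearchStrategy& strategy,
                                    const Experiment& runExperiment) {
    std::vector<Hypothesis> found;
    std::set<Hypothesis> seen(strategy.initial.begin(),
                              strategy.initial.end());
    std::vector<Hypothesis> candidates = strategy.initial;

    while (!candidates.empty()) {
        std::vector<Hypothesis> proven = runExperiment(candidates);
        std::vector<Hypothesis> next;
        for (const Hypothesis& h : proven) {
            found.push_back(h);
            auto it = strategy.refinements.find(h);
            if (it == strategy.refinements.end()) continue;
            for (const Hypothesis& r : it->second) {
                // Check each refined hypothesis at most once.
                if (seen.insert(r).second) next.push_back(r);
            }
        }
        candidates = std::move(next);
    }
    return found;
}
```

For an iterative application, each call to `runExperiment` would correspond to one iteration of the main loop; otherwise it would correspond to a full program run.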

The current version of Periscope has been adapted to the Intel Itanium2-based HLRB2 supercomputer at the Leibniz Computing Center (LRZ). It allows for both detecting performance bottlenecks that limit the scalability on parallel systems (see next section) and analyzing the single-node performance of an application by means of hardware performance counters (see [52] for details). Additionally, it takes the machine topology into account when distributing the analysis agents. It can be used in interactive as well as in batch processing mode.