We have explored the potential for automatically exploring task decomposition strategies in sequential code. We have presented an effective search algorithm based on three simple metrics, a parametrizable cost function and a couple of heuristics. The cost function and the metrics take into account the length (duration) of the tasks, the dependences among tasks and the tasks' concurrency level. The search algorithm has been implemented on top of Tareador, a tool previously developed in the group. Our experiments demonstrate that the search algorithm finds task decompositions that provide sufficient parallelism, often higher than the parallelism of the decompositions specified by an experienced programmer.
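As an illustration of how such a cost function might combine the three kinds of information named above (task length, dependences, concurrency), the following is a minimal sketch. The metric definitions, the weights `w_len`, `w_dep`, `w_conc`, the `min_len` threshold and the `Task` type are all assumptions made for this example; they are not the formulas used in the thesis or in Tareador.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    duration: float      # task length, in arbitrary time units
    deps: tuple = ()     # names of the tasks this task depends on


def metrics(tasks, total_cores):
    """Compute three illustrative metrics for one decomposition.

    Assumes `tasks` is listed in a valid topological order.
    """
    finish = {}
    for t in tasks:
        start = max((finish[d] for d in t.deps), default=0.0)
        finish[t.name] = start + t.duration
    critical_path = max(finish.values())
    total_work = sum(t.duration for t in tasks)
    avg_len = total_work / len(tasks)                 # task length
    n_deps = sum(len(t.deps) for t in tasks)          # dependences
    concurrency = min(total_work / critical_path, total_cores)
    return avg_len, n_deps, concurrency


def cost(tasks, total_cores, w_len=1.0, w_dep=1.0, w_conc=1.0, min_len=10.0):
    """Hypothetical parametrizable cost: lower is better."""
    avg_len, n_deps, concurrency = metrics(tasks, total_cores)
    short_penalty = max(0.0, min_len - avg_len) / min_len   # tasks too fine-grained
    dep_penalty = n_deps / len(tasks)                       # dependences per task
    conc_penalty = 1.0 - concurrency / total_cores          # unused cores
    return w_len * short_penalty + w_dep * dep_penalty + w_conc * conc_penalty


# Two candidate decompositions of the same 80 units of work.
chain = [Task("a", 20.0), Task("b", 20.0, ("a",)),
         Task("c", 20.0, ("b",)), Task("d", 20.0, ("c",))]
independent = [Task(n, 20.0) for n in "abcd"]
```

Under these assumptions, an iterative search would prefer `independent` over `chain` on four cores, because its cost is lower; changing the weights shifts the trade-off between granularity, dependences and concurrency.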

As future work, we have identified the need to include a new metric that evaluates the cost of expressing the task decomposition using the syntax (and constraints) offered by the target parallel programming model (for example, traditional fork-join, data flow, ...). The automatic search should be able to quantify how expressible (or viable) a decomposition could be and to use this information to guide the iterative exploration process. The result would be the best task decomposition that can be expressed in the target programming model.

Our next step in this research is to apply our methodology to real-world applications. So far, we have worked in a reverse-engineering fashion: we start from a legacy OmpSs code and remove all OmpSs pragma annotations to obtain the sequential code. However, a sequential application obtained this way inherits a structure that is very favorable for parallelization. Starting from a genuinely legacy sequential code, the automatic search for a good task decomposition would be much harder. Still, showing that our environment can expose parallelism in legacy sequential applications is the only real proof of our concept, so this is our primary future work.

7 Related Work

In relation to the topics covered in this thesis, we present the related work in four distinct fields of research:

• Methodologies for simulating parallel execution
• Overlapping communication and computation
• Identifying bottlenecks in parallel execution
• Tools for assisted parallelization

7.1 Simulation methodologies for parallel computing

The simulation of parallel computing systems is still an unsolved issue. The simulation can be very computation intensive, since the target machine may consist of numerous processing units. Moreover, the simulation is hard to parallelize, because the separate processing units may have very complex interactions. Thus, simulating low-level details of a large-scale parallel machine results in a very computation-intensive sequential execution. Consequently, these simulations are often infeasible due to time or memory constraints.

Conventional trace-driven simulators successfully simulate MPI executions, but they fail to simulate multicore systems. Trace-driven simulators such as Dimemas [45] or MPI-SIM [71] simulate MPI parallel execution: they replay the collected traces and reconstruct the potential parallel time-behavior. However, conventional traces fail to capture time-dependent executions, that is, executions with dynamic thread scheduling and inter-thread synchronization. Therefore, with the appearance of multicore systems, trace-driven simulators no longer provide satisfactory simulations.
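The replay idea can be sketched in a few lines. The example below is a toy replay loop, not the model used by Dimemas or MPI-SIM: the event format, the fixed `latency` parameter and the assumption that sends never block are all hypothetical simplifications chosen for illustration.

```python
from collections import deque


def replay(traces, latency=1.0):
    """Replay per-rank traces and return each rank's finish time.

    traces: {rank: [("compute", duration) | ("send", dest) | ("recv", src)]}
    Assumes non-blocking sends and a fixed point-to-point latency.
    """
    clock = {r: 0.0 for r in traces}
    pc = {r: 0 for r in traces}        # index of the next event per rank
    in_flight = {}                     # (src, dst) -> deque of arrival times
    progressed = True
    while progressed:
        progressed = False
        for r, events in traces.items():
            while pc[r] < len(events):
                kind, arg = events[pc[r]]
                if kind == "compute":
                    clock[r] += arg
                elif kind == "send":
                    in_flight.setdefault((r, arg), deque()).append(clock[r] + latency)
                else:  # "recv"
                    queue = in_flight.get((arg, r))
                    if not queue:
                        break          # matching send not replayed yet; retry later
                    clock[r] = max(clock[r], queue.popleft())
                pc[r] += 1
                progressed = True
    if any(pc[r] < len(traces[r]) for r in traces):
        raise RuntimeError("deadlock: a recv has no matching send")
    return clock


traces = {
    0: [("compute", 5.0), ("send", 1), ("compute", 2.0)],
    1: [("compute", 1.0), ("recv", 0), ("compute", 3.0)],
}
finish = replay(traces, latency=1.0)
```

Because the compute durations are fixed numbers recorded in the trace, the replay cannot react to scheduling decisions made during the simulation, which is the limitation for time-dependent executions described above.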

In order to simulate time-dependent execution, many recent studies have turned to execution-driven simulators. These simulators provide cycle-accurate simulations, capturing all possible time-dependent influences. Most current execution-driven simulators are based on off-the-shelf simulation infrastructures such as M5 [14], Simics [60], SimpleScalar [8], PTLsim [90], etc. These simulators introduce extremely high overhead. Therefore, for simulating a very large system, the execution-driven approach becomes infeasible.

Finally, the newest simulation proposals try to find a sweet spot between the execution-driven and trace-driven approaches. COTSon [7] uses an AMD functional emulator together with timing models to achieve a proper combined timing. Compared to execution-driven simulators, COTSon reduces the simulation time, but also reduces simulation flexibility. TaskSim [74], on the other hand, tries to separate the application's intrinsic computation from the parallelism-related computation. The application's intrinsic computation is then replayed as in conventional trace-driven simulation, while the parallelism-related computation is recomputed during the simulation.

In this thesis, we introduced a novel simulation methodology called simulation-aware tracing. Our methodology allows simulating very low-level architectural features in a large-scale parallel machine. This is enabled by modeling the new low-level feature already during tracing. Since each MPI process is traced independently, the computation related to modeling the new feature is naturally parallelized across all MPI processes in the execution. The tracer incorporates the effects of the new feature into the trace, while the regular replay simulator replays the trace and propagates the effect of the modeled feature across the whole parallel MPI execution.
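The tracing side of this idea can be illustrated with a small sketch. Everything in it is hypothetical: the event tuples, the `cache_misses` counter and the linear `miss_penalty` model stand in for whatever low-level feature is being modeled, and they are not the tracer's actual format.

```python
def trace_with_feature(raw_events, miss_penalty):
    """Fold a hypothetical low-level model into one rank's trace.

    raw_events: [("compute", base_time, cache_misses) | ("send", dest) | ("recv", src)]
    Returns events in which each compute burst already includes the modeled cost.
    """
    adjusted = []
    for event in raw_events:
        if event[0] == "compute":
            _, base_time, cache_misses = event
            adjusted.append(("compute", base_time + cache_misses * miss_penalty))
        else:
            adjusted.append(event)     # communication events pass through unchanged
    return adjusted


raw = {
    0: [("compute", 5.0, 4), ("send", 1)],
    1: [("recv", 0), ("compute", 3.0, 2)],
}
# Each rank is processed independently, so this step parallelizes per MPI process.
adjusted_traces = {rank: trace_with_feature(events, miss_penalty=0.5)
                   for rank, events in raw.items()}
```

The adjusted traces contain only ordinary compute and communication events, so an unmodified replay simulator can consume them and carry the effect of the modeled feature into the simulated parallel timeline.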

However, a drawback of our technique is that the influence of the new feature can spread only bottom-up: a change at the low level can change the performance of the parallel execution of MPI processes, but a change in the parallel execution of MPI processes cannot change the performance at the low level. For instance, a change in cache performance can affect the scheduling of tasks, whereas a change in the scheduling of tasks cannot affect the cache performance.

Our methodology handles very complicated parallel executions, offers fast and flexible simulation, and provides rich output. To our knowledge, this is the first simulation methodology that can simulate parallel execution integrating MPI with a task-based programming model. Although our methodology originally targets dataflow parallelism, it can be easily adapted to simulate fork-join based programming models such as OpenMP or Cilk. Finally, we believe that the biggest contribution of the environment is its flexibility and rich output: the user can easily change the target platform and visually (qualitatively) inspect the effects on the parallel execution.