
it is clear that the collective communication has a larger effect on overall communication performance than the point-to-point communication. However, as already discussed, for other applications the individual point-to-point and collective models could be used to generate a model for the total communication. Again, this could be tuned using an existing message passing version of a code, and the performance of a hybrid version could then be predicted.

For the models used to predict the overheads from shared memory additions, the applicability is not as straightforward. The model for direct shared memory overheads requires some knowledge of the OpenMP overheads, or of the performance of an OpenMP version of the work loop being modelled, before it can be used to model the performance of the hybrid message passing + shared memory code. It may therefore be necessary to prototype these shared memory sections, or to carry out an in-depth profiling of the OpenMP software and hardware ecosystem being used, before this model can be applied with any degree of accuracy. A similar situation exists with the overall performance model, as stated above. A possible use for these models could be in applying existing hybrid codes to larger clusters. If an existing piece of hardware were to be expanded, the models could be used to predict the runtime on the enlarged cluster by using the existing performance data to tune the model to the current cluster size, then extrapolating this forwards to larger cluster sizes.

Overall, the communication performance models can be used with an existing pure message passing version of a code to estimate parameters using curve fitting that will give a prediction of how a hybrid message passing + shared memory version of a code will perform in terms of the point-to-point or collective communication.
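As a rough illustration of the curve-fitting step described above, the following Python sketch fits a simple two-parameter model to collective communication timings from a pure message passing run and then evaluates it at other process counts. The logarithmic model form, the timing values, the function names and the figure of 8 cores per node are all assumptions made for this example, not the models or data of this chapter, and reusing the fitted parameters for the hybrid code assumes that only the number of MPI processes changes.

```python
import math


def fit_log_model(procs, times):
    """Least-squares fit of t = a + b * log2(p); returns (a, b).

    The logarithmic form is an assumed stand-in for a collective
    communication model, not the model used in this chapter.
    """
    xs = [math.log2(p) for p in procs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, times))
    b = sxt / sxx
    a = mean_t - b * mean_x
    return a, b


def predict(a, b, procs):
    """Evaluate the fitted model at a given number of MPI processes."""
    return a + b * math.log2(procs)


# Hypothetical timings (seconds) from a pure message passing code.
measured_procs = [16, 32, 64, 128]
measured_times = [0.9, 1.1, 1.3, 1.5]

a, b = fit_log_model(measured_procs, measured_times)

# Extrapolate to 512 cores. The pure code uses one MPI process per core;
# the hybrid code is assumed to use one MPI process per 8-core node.
cores = 512
pure_mpi_estimate = predict(a, b, cores)
hybrid_estimate = predict(a, b, cores // 8)
print(f"pure MPI: {pure_mpi_estimate:.2f} s, hybrid: {hybrid_estimate:.2f} s")
```

The same fitted parameters can be evaluated at core counts beyond those measured, which is the extrapolation to larger clusters suggested above, although the accuracy of such an extrapolation depends on how well the assumed model form holds outside the measured range.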

6.6 Future Hybrid Code Performance

These performance models may be useful in predicting the future performance of hybrid message passing + shared memory codes when compared to pure message passing codes. In order to predict this performance we must make some assumptions about the type and performance of future generations of HPC hardware. Predicting the performance and types of hardware architecture in future HPC systems is a difficult task [109], but it seems reasonable to make some basic assumptions. It has already been shown that the most important factors in performance when considering hybrid message passing + shared memory codes are the number of cores per node (which affects the ratio between number of MPI processes and the number of OpenMP threads) and the communication performance of the cluster interconnect. It is therefore sensible to focus the assumptions on these hardware characteristics. Assuming that a cluster architecture based around nodes with multiple processors with multiple cores per node remains the dominant architecture for the foreseeable future, the following assumptions seem reasonable and logical given the direction of hardware development over the last few years:

• The number of cores per node is likely to increase as the number of cores per processor increases.

• The bandwidth of network interconnects is likely to increase.

• The latency of network interconnects is likely to drop.

The results examined in this thesis and the performance models in this chapter may suggest that the hybrid message passing + shared memory model would behave in a predictable way on such a future architecture. The increase in the number of cores per node would result in an increase in the number of OpenMP threads per node. This would result in an increase in shared memory overheads (direct or otherwise), suggesting that the main work of the application would take longer on lower processor numbers (with a larger problem grain size) than in the pure message passing code. However, these overheads would reduce as the number of nodes increased (and grain size decreased). The increase in the bandwidth and reduction of the latency of the interconnect should benefit the point-to-point and collective communication of both the pure message passing and hybrid message passing + shared memory codes. However, the large increase in number of MPI processes in the pure message passing version is likely to lead to a much poorer collective communication performance than in the hybrid message passing + shared memory version, where the number of MPI processes does not actually increase as the number of cores per node increases.
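The process-count argument above can be made concrete with a small sketch. The function name and the fixed count of 64 nodes are hypothetical, and the hybrid configuration is assumed to place a fixed number of MPI processes on each node, with OpenMP threads filling the remaining cores.

```python
def mpi_process_counts(nodes, cores_per_node, mpi_per_node=1):
    """Return (pure_mpi_procs, hybrid_mpi_procs, threads_per_process).

    Pure message passing runs one MPI process per core. The hybrid code
    is assumed to run `mpi_per_node` MPI processes on each node, each
    with enough OpenMP threads to occupy its share of the cores.
    """
    pure_mpi_procs = nodes * cores_per_node
    hybrid_mpi_procs = nodes * mpi_per_node
    threads_per_process = cores_per_node // mpi_per_node
    return pure_mpi_procs, hybrid_mpi_procs, threads_per_process


# Hypothetical cluster of 64 nodes as cores per node grows.
for cores_per_node in (8, 16, 32):
    pure, hybrid, threads = mpi_process_counts(64, cores_per_node)
    print(f"{cores_per_node:2d} cores/node: pure MPI = {pure:4d} processes, "
          f"hybrid = {hybrid} processes x {threads} threads")
```

Under these assumptions the pure message passing process count grows in proportion to the cores per node, while the hybrid process count stays fixed and only the thread count per process grows.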

Given the assumptions on the type of future hardware above, it is possible to suggest a potential future cluster architecture that can be used for performance predictions. Taking the Merlin cluster as a basis, an architecture is proposed consisting of nodes, each containing two eight-core processors, linked by an interconnect with a higher bandwidth and lower latency than the current interconnect, specifically twice the bandwidth (giving 40 Gb/s) and half the latency (0.9 microseconds). This effectively represents a doubling of the characteristics of the Merlin cluster. This possible hardware configuration can be used to estimate future performance. For brevity, only the pure MPI and hybrid (1 MPI) cases will be considered, as these are the most accurate in the performance model study above.

Problem sizes will be doubled when considering data transferred or work completed to take the better theoretical performance of the cluster into account, supposing that with a higher performance cluster, larger problems can be considered.
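To see how doubling the data transferred interacts with the doubled interconnect performance, the sketch below uses a simple latency plus size-over-bandwidth estimate for a single message. This linear model is a common simplification adopted here as an assumption, not necessarily the communication model of this chapter. The current values of 20 Gb/s and 1.8 microseconds are inferred from the statement that the proposed interconnect has twice the bandwidth and half the latency, and the 1 MiB message size is hypothetical.

```python
def message_time(size_bytes, latency_s, bandwidth_bits_per_s):
    """Estimated time for one message: latency + size / bandwidth."""
    return latency_s + (8 * size_bytes) / bandwidth_bits_per_s


# Interconnect characteristics (current values inferred, see lead-in).
current = {"latency_s": 1.8e-6, "bandwidth_bits_per_s": 20e9}
future = {"latency_s": 0.9e-6, "bandwidth_bits_per_s": 40e9}

size = 1024 * 1024  # hypothetical 1 MiB message on the current cluster

t_current = message_time(size, **current)
t_future = message_time(2 * size, **future)  # doubled data on future cluster

# Size-dependent part of each estimate, with the latency removed.
bandwidth_term_current = (8 * size) / current["bandwidth_bits_per_s"]
bandwidth_term_future = (8 * 2 * size) / future["bandwidth_bits_per_s"]

print(f"current: {t_current * 1e6:.1f} us, future: {t_future * 1e6:.1f} us")
```

In this simplified picture the doubled message size cancels the doubled bandwidth, so any gain per message comes from the halved latency alone. That gain matters most for small messages, where latency dominates the transfer time.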

The overall performance model can then be used to predict a possible future performance for both the hybrid message passing + shared memory code and pure message passing code. It is reasonable to assume that the overheads of communication will drop given the faster interconnect. It is also reasonable to assume that the shared memory overheads will remain constant, as faster hardware may reduce the overheads, but the larger number of threads per node (and threads per MPI process) may increase them. The parameters estimated using the current code performance can then be adjusted to predict future performance of the code. Halving the parameters related to communication terms represents an increase in the performance of the cluster interconnect, while the parameters concerning shared memory overheads remain the same as discussed above.
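The parameter adjustment described above can be sketched as follows. The parameter names and numerical values are hypothetical placeholders for whatever fitted parameters the overall model uses. The only behaviour taken from the text is that the communication parameters are halved while the shared memory parameter is left unchanged.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ModelParams:
    """Hypothetical fitted parameters of an overall performance model."""
    p2p_comm: float         # point-to-point communication parameter
    collective_comm: float  # collective communication parameter
    shared_mem: float       # shared memory overhead parameter


def project_to_future(params, comm_scale=0.5):
    """Scale the communication parameters and keep shared memory fixed.

    comm_scale=0.5 corresponds to halving the communication terms to
    represent the faster interconnect.
    """
    return replace(
        params,
        p2p_comm=params.p2p_comm * comm_scale,
        collective_comm=params.collective_comm * comm_scale,
    )


# Placeholder values standing in for parameters fitted on current hardware.
fitted = ModelParams(p2p_comm=2.0, collective_comm=4.0, shared_mem=1.5)
future = project_to_future(fitted)
print(future)
```

Keeping the scaling factor as an argument makes it straightforward to explore other assumptions about interconnect improvement without changing the fitted values.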

Using the overall performance model, adjusting for a larger problem size, double the number of processor cores per node and the faster interconnect with the lower latency, a possible predicted performance for Test 20 of DL_Poly can be seen in Figure 6.20. As can be seen, it tallies closely with the suggested performance above. The hybrid message passing + shared memory code performs worse than the pure message passing code at lower numbers of cores, while it performs much better at higher numbers, where the pure message passing code scalability is very poor. The difference seen between the two codes is larger at both ends of the scale than the current performance results show.

[Figure 6.20 plot: time (seconds) against cores (16 to 1024) for the Pure MPI and Hybrid (1 MPI) codes.]

Figure 6.20: Predicted performance on possible future hardware, 16 cores per node, doubled communication performance.

Of course, the models presented in this chapter are quite simplistic, and only allow tuning in terms of the communication overhead and the shared memory overheads, with the other parts of the models being functions of how the code is run (number of MPI processes in total, number of MPI processes per node, number of OpenMP threads, etc.). It is unknown how future hardware architecture changes may affect these overheads. For instance, a change in the architecture of memory access from processor cores could significantly reduce shared memory overheads, while an increase in the speed and use of direct remote memory access in MPI implementations could result in the collective communication performance of pure message passing codes improving greatly. In order to capture these effects fully the models would need