This section deals with the critical points of parallelization when the HERMESH method is applied. Before discussing them, we clarify the nomenclature with the simple picture shown in Figure 3.28. We use the term interface for the boundary between two independent meshes, while the term boundary refers to the boundary between the subdomains assigned to different processors (also called CPUs). The boundary of the whole domain is referred to as the real boundary.
Within the modular structure of the Alya code described in Section 2.7, the HERMESH method is implemented as a service. It can therefore be used by any of the modules corresponding to the different physics in the code, and it can run in parallel with the same performance, albeit with some particularities. More precisely, the HERMESH service is a preprocessing step in the code.

Figure 3.27: Implementation of the HERMESH method for the Navier-Stokes equations. The extension element is completely assembled for Q.
As explained in Section 2.7, METIS requires the element graph for mesh partitioning, and there are two possible options to construct it, referred to as by-nodes and by-faces. Although the latter strategy requires much less memory, when extension elements are used we have to apply the former to divide the mesh, as illustrated in Figure 3.29 for a 2D mesh. Recall that fringe nodes are connected to the extension nodes via elements, not faces. This particularity causes a problem when METIS divides the mesh using a graph built by-faces, because METIS has no way to know the neighbors of the extension elements on the side towards which they extend.
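The difference between the two graphs can be illustrated with a small sketch, assuming simplicial elements given as tuples of node ids. The functions `graph_by_nodes` and `graph_by_faces` and the toy connectivity are hypothetical and are not Alya's routines: two elements are adjacent by-nodes if they share at least one node, and by-faces if they share a face. The extension element in the example shares the face (1, 2) with its own mesh but touches the other mesh only through node 11, so the by-faces graph misses that neighbor.

```python
from collections import defaultdict
from itertools import combinations


def _graph_from_shared_entities(elements, entities_of):
    """Connect every pair of elements that share at least one entity."""
    owners = defaultdict(set)  # entity -> elements containing it
    for e, conn in enumerate(elements):
        for ent in entities_of(conn):
            owners[ent].add(e)
    graph = {e: set() for e in range(len(elements))}
    for elems in owners.values():
        for a, b in combinations(sorted(elems), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def graph_by_nodes(elements):
    # Entities are single nodes.
    return _graph_from_shared_entities(elements, lambda conn: conn)


def graph_by_faces(elements):
    # Entities are faces; for a simplex, a face is any subset of len(conn) - 1
    # nodes (an edge for triangles, a triangle for tetrahedra).
    return _graph_from_shared_entities(
        elements, lambda conn: combinations(sorted(conn), len(conn) - 1)
    )


# Hypothetical 2D example: node ids 0-2 belong to mesh A, 10-12 to mesh B.
elements = [
    (0, 1, 2),     # 0: element of mesh A owning interface face (1, 2)
    (10, 11, 12),  # 1: element of mesh B containing node 11
    (1, 2, 11),    # 2: extension element built on face (1, 2), reaching node 11
]
print(graph_by_nodes(elements)[2])  # {0, 1}
print(graph_by_faces(elements)[2])  # {0}
```

For comparison, METIS itself exposes a similar choice through the `ncommon` argument of `METIS_MeshToDual`, which sets how many nodes two elements must share to be considered adjacent.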
We have run a test to evaluate the differences between both options, by-faces and by-nodes, for mesh partitioning with the HERMESH method. The test corresponds to a cube domain formed by two meshes, one inside the other, with 374832 tetrahedra in total. After creating the extension elements that couple both meshes, we divided the composite mesh into 63 subdomains (one per processor) with each type of graph. Figure 3.30 shows three important aspects of the partition computed by METIS for this problem:
1. The top graphic represents the number of elements in each processor. One processor, number fourteen, has many more elements than the others, which implies a load imbalance.
2. The middle graphic represents the number of boundary nodes of each processor. Again, processor fourteen contains many more boundary nodes than the others, which implies a large communication volume and in turn slows down the point-to-point communication steps (e.g., the matrix-vector product).
3. The bottom graphic represents the number of neighboring processors in both cases. On average, there are more neighbors with the graph built by-faces. This may imply more time spent in communications, depending on the length of the messages, if one considers that the data transfer time is given by $t = n\alpha + \sum_{i=1}^{n} l_i \beta$, where $n$ is the number of messages, $\alpha$ is the latency or startup time, $l_i$ is the length of the $i$-th message and $\beta$ is the transfer time per unit of data. The first term corresponds to the latency and the second one depends on the message sizes.
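The transfer-time model of item 3 can be evaluated directly. The sketch below is a hypothetical helper (`transfer_time`, `alpha` and `beta` are illustrative names, and the numbers are invented, in arbitrary units): it keeps the total exchanged volume fixed and only changes the number of neighbors, so the extra cost comes entirely from the latency term $n\alpha$.

```python
def transfer_time(message_lengths, alpha, beta):
    """Data transfer time t = n*alpha + sum_i(l_i)*beta for n messages.

    alpha: latency (startup time) paid once per message.
    beta:  transfer time per unit of data.
    """
    n = len(message_lengths)
    return n * alpha + beta * sum(message_lengths)


# Hypothetical numbers: the same 3000 units of data exchanged with
# 3 neighbors (fewer, larger messages) or with 6 neighbors (more, smaller ones).
alpha, beta = 10.0, 1.0
t_few = transfer_time([1000] * 3, alpha, beta)
t_many = transfer_time([500] * 6, alpha, beta)
print(t_few, t_many)  # 3030.0 3060.0
```

Whether the latency term matters in practice depends on the ratio between $\alpha$ and the message sizes, which is why the text says more neighbors only "may" imply more communication time.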
To analyze these issues in more depth, we show a trace visualized with the Paraver tool, PARAVER (2014). Paraver displays graphically the status of the processes involved in the parallel execution of the code: the x-axis is time, and along the y-axis each process is drawn with a color that indicates its status. In particular, blue means that the process is idle. Figure 3.31 compares the parallel performance of the code using the by-nodes and by-faces graphs. The top graphics of the figure represent the assembly process (the loop over the elements, painted in green), while the bottom graphics show some matrix-vector products of the GMRES algebraic solver (painted in red). The top graphics confirm the load imbalance already observed in Figure 3.30 (Top): the assembly step in process 14 takes longer than in the other processes. On the other hand, with the by-nodes graph, five matrix-vector products are carried out in the time that the by-faces graph needs to perform only three. This is because, on average, a process has to communicate with many more processes, as already shown in Figure 3.30 (Bottom). These communications are depicted with yellow arrows in the bottom right part of the figure.

Figure 3.29: 2D mesh partitioned into 3 subdomains. (Left) Element graph based on nodes. (Right) Element graph based on faces.
Figures 3.30 and 3.31 show that the by-faces graph has several shortcomings: (1) load imbalance; (2) more communications. The reason is that the by-faces graph gives wrong information to METIS. Let us try to explain why. The by-faces graph connects elements that share faces. In a standard finite element mesh, a face is shared by at most two elements. However, the HERMESH method (as implemented in this work) can construct many extension elements from a single face. As a consequence, these elements can end up lost in the graph, unconnected to any other element. Using the element graph information, METIS tries to balance the work by equalizing the number of elements per process while minimizing the boundaries between processes. Since, for METIS, these lost elements do not communicate, it is free to assign many of them to a few processes. Yet these elements do involve communication if their nodes are located on the process boundaries. In the present example, process 14 inherits many more elements than the others.

[Figure 3.30 plots: number of elements, boundaries and neighbours versus processor (0 to 63), comparing the by-faces and by-nodes graphs.]

Figure 3.30: (Top) Number of elements per process. (Middle) Number of boundaries per process. (Bottom) Number of neighboring subdomains per process.
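One way such lost elements can arise is sketched below, assuming (hypothetically, not as a description of Alya) a face-based graph builder that stores at most two elements per face, as is natural for a standard finite element mesh. The function `graph_by_faces_two_per_face` and the toy connectivity are illustrative: when several extension elements are built on the same face, every element beyond the second is silently dropped and appears as an isolated vertex of the graph.

```python
from itertools import combinations


def graph_by_faces_two_per_face(elements):
    """Face-based element graph that assumes a face has at most two elements.

    Any third (fourth, ...) element on the same face is ignored, which is
    harmless in a standard mesh but not with many extension elements per face.
    """
    face_pair = {}  # face -> [first element, second element or None]
    for e, conn in enumerate(elements):
        for face in combinations(sorted(conn), len(conn) - 1):
            slot = face_pair.setdefault(face, [e, None])
            if slot[0] != e and slot[1] is None:
                slot[1] = e
            # Otherwise the face already has two elements: e is dropped here.
    graph = {e: set() for e in range(len(elements))}
    for first, second in face_pair.values():
        if second is not None:
            graph[first].add(second)
            graph[second].add(first)
    return graph


# Hypothetical 2D example: three extension elements built on the same face (1, 2).
elements = [
    (0, 1, 2),   # 0: mesh A element owning interface face (1, 2)
    (1, 2, 11),  # 1: extension element reaching node 11 of mesh B
    (1, 2, 12),  # 2: second extension element on the same face
    (1, 2, 13),  # 3: third extension element on the same face
]
graph = graph_by_faces_two_per_face(elements)
lost = [e for e, nbrs in graph.items() if not nbrs]
print(lost)  # [2, 3]
```

Isolated vertices contribute nothing to the edge cut, so a partitioner minimizing the cut has no reason to spread them evenly, which matches the behavior described above.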
Let us finally note that processes containing hole elements present a load imbalance in the assembly step, since hole elements are not assembled. This is confirmed by Figure 3.31 (Top). However, the corresponding degrees of freedom do exist and participate in the matrix-vector product, as seen in Figure 3.31 (Bottom). This is why we do not observe such a load imbalance in the matrix-vector product; but, since the corresponding matrix rows are null, useless work is carried out there.
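The contrast between the two steps can be quantified with a common load imbalance factor, the maximum work over the mean work. The sketch below uses invented per-process counts and the hypothetical helper `imbalance`: assembly work is taken as the number of non-hole elements, while the matrix-vector product work is taken as the number of matrix rows, including the null rows of hole degrees of freedom.

```python
def imbalance(work):
    """Load imbalance factor: maximum work divided by mean work (1.0 is perfect)."""
    return max(work) / (sum(work) / len(work))


# Hypothetical partition into 3 processes; only process 2 contains hole elements.
elements_per_proc = [6000, 6000, 6000]
hole_elements_per_proc = [0, 0, 3000]
rows_per_proc = [2000, 2000, 2000]    # matrix rows, hole dofs included
hole_rows_per_proc = [0, 0, 600]      # rows that are null because of holes

# Assembly skips hole elements, so its work follows the non-hole element count.
assembly_work = [n - h for n, h in zip(elements_per_proc, hole_elements_per_proc)]
# The matrix-vector product loops over every row, null rows included.
matvec_work = rows_per_proc
useless_fraction = sum(hole_rows_per_proc) / sum(rows_per_proc)

print(imbalance(assembly_work))  # 1.2
print(imbalance(matvec_work))    # 1.0
print(useless_fraction)          # 0.1
```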
Figure 3.31: Trace obtained with the Paraver program.