matrices in the Compressed Diagonal Storage format, both A and B of the compact schemes.
When the datatype was coded, the diagonals were mistakenly stored along the columns of the matrix member of CDS; for instance, the jth diagonal of B is stored in ⟨B var.⟩%matrix(:,j). When B undergoes a matrix-vector multiplication, the ith element of the resulting array is computed as the product of the in-band portion of B's ith row and a small portion of the given column array. That portion of the row is stored in ⟨B var.⟩%matrix(i,:), whose elements are not consecutive in memory, since Fortran stores matrices in column-major order, unlike C++, which uses row-major storage.
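The row-by-row product described above can be sketched as follows. This is a minimal illustration in Python rather than in the Fortran of the CFD code; the function names, the periodic wrap-around and the symmetric half-bandwidth are assumptions made for the example, and `band[i]` plays the role of the in-band portion of the ith row.

```python
def cds_from_dense_periodic(dense, half_bw):
    """Extract the 2*half_bw + 1 periodic diagonals of a square matrix.

    band[i][d] holds dense[i][(i + d - half_bw) % n], so band[i] is the
    in-band portion of row i (hypothetical layout chosen for this sketch).
    """
    n = len(dense)
    width = 2 * half_bw + 1
    return [[dense[i][(i + d - half_bw) % n] for d in range(width)]
            for i in range(n)]


def cds_matvec_periodic(band, x):
    """Multiply a periodic banded matrix in the layout above by a vector."""
    n = len(x)
    half_bw = (len(band[0]) - 1) // 2
    y = [0.0] * n
    for i in range(n):
        # In-band portion of row i times the matching (wrapped) slice of x.
        for d, coeff in enumerate(band[i]):
            y[i] += coeff * x[(i + d - half_bw) % n]
    return y


# Periodic tridiagonal example: 4 on the diagonal, 1 on both neighbours.
n = 5
dense = [[0.0] * n for _ in range(n)]
for i in range(n):
    dense[i][i] = 4.0
    dense[i][(i - 1) % n] = 1.0
    dense[i][(i + 1) % n] = 1.0

band = cds_from_dense_periodic(dense, half_bw=1)
y = cds_matvec_periodic(band, [1.0, 2.0, 3.0, 4.0, 5.0])
print(y)  # [11.0, 12.0, 18.0, 24.0, 25.0]
```

Each result element touches only one short row of `band`, which is why the memory layout of that row matters for the cost of the product.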
So a mandatory edit is to move from the current row-oriented implementation of the CDS format, which fits C++ better than Fortran, to a column-oriented implementation (the jth diagonal of B stored in ⟨B var.⟩%matrix(j,:)), in order to save several percentage points of the wall-clock time spent on each multiplication.²
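The difference between the two layouts can be made concrete by listing the memory offsets touched while computing one element of the product. The sketch below emulates column-major addressing in Python with 0-based indices; the helper names and the sizes `n = 8`, `n_diag = 5` are hypothetical, chosen only for illustration.

```python
def column_major_offset(row, col, n_rows):
    """Offset of element (row, col) in a column-major buffer (0-based)."""
    return row + col * n_rows


def band_offsets_current(i, n, n_diag):
    """Current layout, matrix(n, n_diag): diagonal j lives in matrix(:, j),
    so the band of row i is matrix(i, :)."""
    return [column_major_offset(i, j, n) for j in range(n_diag)]


def band_offsets_proposed(i, n, n_diag):
    """Proposed layout, matrix(n_diag, n): diagonal j lives in matrix(j, :),
    so the band of row i is matrix(:, i)."""
    return [column_major_offset(j, i, n_diag) for j in range(n_diag)]


def strides(offsets):
    """Distances between consecutive accesses."""
    return [b - a for a, b in zip(offsets, offsets[1:])]


n, n_diag = 8, 5
current = band_offsets_current(3, n, n_diag)
proposed = band_offsets_proposed(3, n, n_diag)
print(strides(current))   # [8, 8, 8, 8]
print(strides(proposed))  # [1, 1, 1, 1]
```

In the current layout consecutive band coefficients are `n` elements apart, whereas in the proposed one they are adjacent, which is the access pattern that column-major storage favors.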
² The edit was actually attempted already, and the comparison was performed in a standalone, non-parallel Fortran program; it resulted in an unexpectedly significant reduction of the wall-clock time needed for a matrix-vector product. The edit, however, was not merged into the CFD code, since it would have meant a substantial rewrite of the procedures responsible for the distribution of the compact matrices among processes, which is a fairly delicate task.
This kind of storage is the most space-efficient, as long as the matrices have a constant bandwidth, which is certainly the case when a periodic direction is concerned; otherwise several zeros are stored, thus wasting memory, as is the case for non-periodic directions. Since the bandwidth of the compact schemes' matrices varies only close to the boundaries, and since each matrix is scattered among several processes, processes not adjacent to the boundary do not suffer from this problem: in the inner part the outer diagonals are identically zero and can simply be dropped. The processes touching the boundaries, on the contrary, must store those zeros, which can represent up to half the content of the ⟨matrix⟩%matrix member of a CDS-stored banded matrix; what is more, those zeros uselessly undergo all the scalar multiplications arising from matrix-vector products, thus increasing the computational cost borne by near-boundary processes.
One suggested solution is to implement the JDS format, which is more space-efficient for matrices of non-constant bandwidth, as defined in [64]. This format should be used only along non-periodic directions, and only by those processes that touch the minus or plus boundaries.
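A jagged-diagonal layout can be sketched as below. This is an illustrative Python version written in the spirit of the JDS format, and the definition in [64] may differ in details; the function names and the small test matrix (tridiagonal in the interior, with wider one-sided rows at both boundaries) are hypothetical and are not taken from the compact schemes of this work.

```python
def jds_from_dense(dense):
    """Build a jagged-diagonal representation of a square matrix."""
    n = len(dense)
    # Shift the nonzeros of every row to the left, remembering their columns.
    rows = [[(j, v) for j, v in enumerate(row) if v != 0] for row in dense]
    # Order the rows by decreasing number of nonzeros.
    perm = sorted(range(n), key=lambda i: -len(rows[i]))
    max_len = len(rows[perm[0]]) if n else 0
    values, col_idx, jd_ptr = [], [], [0]
    for k in range(max_len):
        # The k-th jagged diagonal collects the k-th nonzero of each row.
        for i in perm:
            if len(rows[i]) <= k:
                break  # rows are sorted, so all later rows are too short
            j, v = rows[i][k]
            values.append(v)
            col_idx.append(j)
        jd_ptr.append(len(values))
    return perm, values, col_idx, jd_ptr


def jds_matvec(perm, values, col_idx, jd_ptr, x):
    """Matrix-vector product using only the stored nonzeros."""
    y = [0.0] * len(perm)
    for k in range(len(jd_ptr) - 1):
        start, end = jd_ptr[k], jd_ptr[k + 1]
        for r in range(end - start):
            y[perm[r]] += values[start + r] * x[col_idx[start + r]]
    return y


dense = [
    [2, 3, 4, 5, 0, 0],   # wider boundary row (hypothetical closure)
    [1, 4, 1, 0, 0, 0],
    [0, 1, 4, 1, 0, 0],
    [0, 0, 1, 4, 1, 0],
    [0, 0, 0, 1, 4, 1],
    [0, 0, 5, 4, 3, 2],   # wider boundary row (hypothetical closure)
]
perm, values, col_idx, jd_ptr = jds_from_dense(dense)
x = [1, 2, 3, 4, 5, 6]
print(len(values))  # 20
print(jds_matvec(perm, values, col_idx, jd_ptr, x))
```

Only the nonzero coefficients are stored and multiplied, at the price of an extra index array and a row permutation.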
6.3.4 Condensation of MPI calls
The possible advantage in abandoning/changing the master-slave logic intrinsic to the SPIKE algorithm should be investigated.
Considering only those processes belonging to a single pencil within the 3D Cartesian MPI topology, in the current implementation the first process in the pencil acts as master and uses MPI_GATHER to gather two layers of data from each remaining process, in order to build the array ψ0 of Eq. (3.18); it then solves the reduced system for ψ, and finally scatters the result back to the processes through a call to MPI_SCATTER. Within this strategy, each process sends two layers of data to the master, waits for the master, and finally receives back two new layers, whereas the master has to receive 2 × n layers, solve the system, then send the 2 × n new layers back. It is worth underlining that every first process along a pencil is master to that pencil. As a consequence, in an n × n × n MPI Cartesian grid, there are 3n − 2 processes that are masters to at least two pencils, one of which (process 0) is master to three. This is a possible reason for poor performance.
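The count of multiply-loaded masters can be checked by brute force. The sketch below assumes, as in the text, that the master of a pencil is the process with coordinate 0 along the direction of that pencil; the function name is hypothetical.

```python
from collections import Counter
from itertools import product


def master_histogram(n):
    """Map 'number of pencils a process is master to' -> number of processes
    in an n x n x n Cartesian grid."""
    hist = Counter()
    for i, j, k in product(range(n), repeat=3):
        # A process is master along a direction if its coordinate there is 0.
        hist[int(i == 0) + int(j == 0) + int(k == 0)] += 1
    return hist


n = 4
hist = master_histogram(n)
print(sorted(hist.items()))          # [(0, 27), (1, 27), (2, 9), (3, 1)]
print(hist[2] + hist[3], 3 * n - 2)  # 10 10
```

For n = 4 this gives ten processes leading at least two pencils, of which exactly one (process 0) leads three.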
An alternative is a call to MPI_ALLGATHER, which results in all processes performing 2 × n sends and 2 × n receives, so that each of them can solve the reduced system on its own and proceed. With this hypothetical strategy, all processes would do the same work that only the master process does in the strategy currently implemented, with apparently no gain. Actually, the author believes that a single call to MPI_ALLGATHER, in place of the twin calls to MPI_GATHER/MPI_SCATTER, gives an opportunity to the compiler for
The subroutine MPI_TYPE_CREATE_STRUCT could be used to group in one datatype the three datatypes used by MPI_NEIGHBOR_ALLTOALLW to exchange each component of velocity, thus allowing a single call to exchange all the near-boundary flowfield between neighboring processes, hopefully to the benefit of the compiler's optimization capabilities. Indeed, it is common knowledge that reducing the number of messages is likely to improve the performance of an MPI program [88].
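Returning to the comparison between the MPI_GATHER/MPI_SCATTER pair and MPI_ALLGATHER within a pencil, the two data flows can be mimicked by a serial toy model. The sketch below does not call MPI; the function names are hypothetical, and `solve_reduced` is only a placeholder (a running sum), not the reduced-system solve of the SPIKE algorithm.

```python
def solve_reduced(psi0):
    """Placeholder for the reduced-system solve (an assumption, NOT SPIKE):
    a running sum, so each output depends on data from several processes."""
    psi, total = [], 0.0
    for v in psi0:
        total += v
        psi.append(total)
    return psi


def gather_scatter(local_layers):
    """Master-slave pattern: gather, one solve on the master, scatter."""
    psi0 = [v for layers in local_layers for v in layers]  # gather on master
    psi = solve_reduced(psi0)                              # master only
    n_solves = 1
    # Scatter: process p receives its own two layers of the solution.
    per_process = [psi[2 * p:2 * p + 2] for p in range(len(local_layers))]
    return per_process, n_solves


def allgather(local_layers):
    """Symmetric pattern: every process gets all the data and solves alone."""
    per_process, n_solves = [], 0
    for p in range(len(local_layers)):  # each iteration mimics one process
        psi0 = [v for layers in local_layers for v in layers]  # allgather
        psi = solve_reduced(psi0)
        n_solves += 1
        per_process.append(psi[2 * p:2 * p + 2])  # keep only the local part
    return per_process, n_solves


layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # two layers per process
res_gs, solves_gs = gather_scatter(layers)
res_ag, solves_ag = allgather(layers)
print(res_gs == res_ag)      # True
print(solves_gs, solves_ag)  # 1 3
```

Both patterns deliver the same two layers to every process; the symmetric one trades redundant solves for the removal of the second collective call and of the wait on the master.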
6.3.5 Fortran-native vectorization and OpenMP-MPI