Results for CPU: Archer2
Archer2 is the UK National Supercomputing Service, with a peak performance of 28 PFLOP/s. The system has 5,860 compute nodes, each with dual AMD EPYC™ 7742 64-core processors at 2.25 GHz, giving 750,080 cores in total.
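The total core count follows directly from the node figures quoted above. A short check, where the variable names are only illustrative:

```python
# Node and socket figures quoted above for Archer2.
compute_nodes = 5860
sockets_per_node = 2      # "dual" processors per node
cores_per_socket = 64     # AMD EPYC 7742

cores_per_node = sockets_per_node * cores_per_socket
total_cores = compute_nodes * cores_per_node

print(f"{cores_per_node} cores per node, {total_cores:,} cores in total")
# prints "128 cores per node, 750,080 cores in total"
```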
Results are listed below.
Transpose real 3D array
Transpose complex 3D array
FFT transform of a 3D real array starting from X physical direction
FFT transform of a 3D complex array starting from X physical direction
FFT transform of a 3D real array starting from Z physical direction
FFT transform of a 3D complex array starting from Z physical direction
Discussion on Archer2 results
The results above show that version 2.0 of the 2DECOMP&FFT library continues to scale extremely well. The transpose tests show no difference between compilers, since they mainly exercise MPI communication and CRAY MPICH (version 8.1.23) was used for all executables. Notably, a 1D decomposition, when possible, can give up to an 80% speedup compared with the optimal 2D decomposition. This is due to a new feature of the library: when all the data are co-located in local memory, a simple copy is performed and MPI communication is avoided completely. This was not the case with the previous version of the library. With the generic FFT, the performance of the CRAY and GNU compilers tends to differ at low core counts, with GNU performing somewhat better in some cases (up to a 50% performance increase); however, the results tend to converge as the number of nodes increases. This gives some superlinear behaviour in the speedup.
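To illustrate why a 1D decomposition can avoid communication, the sketch below computes local pencil sizes for a `p_row` x `p_col` processor grid. It assumes the usual pencil convention, in which the X-to-Y transpose exchanges data among the `p_row` ranks of a group and the Y-to-Z transpose among the `p_col` ranks. The function names, the grid size and the rank counts are illustrative choices, not part of the library's API or of the benchmark setup, and the grid is assumed to divide evenly.

```python
def pencil_shapes(n, p_row, p_col):
    """Local array shapes for X-, Y- and Z-pencils (even division assumed)."""
    nx, ny, nz = n
    return {
        "x": (nx, ny // p_row, nz // p_col),  # x is complete locally
        "y": (nx // p_row, ny, nz // p_col),  # y is complete locally
        "z": (nx // p_row, ny // p_col, nz),  # z is complete locally
    }


def transpose_group_sizes(p_row, p_col):
    """Number of ranks that exchange data in each transpose (assumed convention)."""
    return {"x<->y": p_row, "y<->z": p_col}


def local_transposes(p_row, p_col):
    """Transposes whose group has a single rank, so a local copy is enough."""
    groups = transpose_group_sizes(p_row, p_col)
    return [name for name, size in groups.items() if size == 1]


grid = (512, 512, 512)  # hypothetical problem size

# 1D (slab) decomposition: X- and Y-pencils hold exactly the same data.
shapes_1d = pencil_shapes(grid, p_row=1, p_col=128)
print(shapes_1d["x"], shapes_1d["y"], local_transposes(1, 128))

# 2D decomposition on the same number of ranks: both transposes communicate.
shapes_2d = pencil_shapes(grid, p_row=8, p_col=16)
print(shapes_2d["x"], shapes_2d["y"], local_transposes(8, 16))
```

With `p_row=1` the X- and Y-pencils have the same local shape, so under this convention the X-to-Y transpose involves only one rank per group and can be done as a plain memory copy.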
FFTW has been tested only with the CRAY compiler. It gives a speedup of about 3 at low core counts, decreasing to between 1.5 and 2 at the larger node counts. The speedup with FFTW is generally very close to the ideal linear behaviour.
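For readers who want to reproduce this kind of analysis, here is a minimal sketch of how speedup, ideal linear speedup and superlinear points can be derived from a strong-scaling series. The timings below are made-up placeholders, not measurements from Archer2, and the function names are illustrative.

```python
def speedup(times):
    """Speedup of each run relative to the first (smallest) run."""
    return [times[0] / t for t in times]


def ideal_speedup(cores):
    """Ideal linear speedup: proportional to the increase in core count."""
    return [c / cores[0] for c in cores]


def superlinear_points(cores, times):
    """Core counts at which the measured speedup exceeds the ideal one."""
    pairs = zip(cores, speedup(times), ideal_speedup(cores))
    return [c for c, s, i in pairs if s > i]


# Placeholder data only: wall-clock seconds at increasing core counts.
cores = [128, 256, 512]
times = [8.0, 3.8, 2.0]

for c, s, i in zip(cores, speedup(times), ideal_speedup(cores)):
    print(f"{c:5d} cores: speedup {s:.2f} (ideal {i:.2f}, efficiency {s / i:.0%})")

print("superlinear at:", superlinear_points(cores, times))
```

A relatively slow baseline run at the smallest core count inflates every later speedup value, which is one way a curve can rise above the ideal line.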