Results for GPU: Cirrus
Cirrus is a UK national service provided by EPCC. Each node of the GPU partition of the cluster is composed of 2x Intel Xeon "Cascade Lake" 2.4 GHz 20-core CPUs together with 4x NVIDIA Tesla V100-SXM2-16GB GPUs (640 tensor cores and 5,120 CUDA cores each). The GPU partition has 36 nodes, for a total of 144 GPUs.
Results are presented below for the following benchmarks:
- Transpose of a real 3D array
- Transpose of a complex 3D array
- FFT transform of a 3D real array starting from the X physical direction
- FFT transform of a 3D complex array starting from the X physical direction
- FFT transform of a 3D real array starting from the Z physical direction
- FFT transform of a 3D complex array starting from the Z physical direction
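These benchmarks exercise the standard transpose and distributed FFT routines of the library. As a point of reference, the minimal sketch below shows how such operations are typically invoked; it follows the classic 2DECOMP&FFT interface (decomp_2d_init, transpose_x_to_y, decomp_2d_fft_init with PHYSICAL_IN_X or PHYSICAL_IN_Z, decomp_2d_fft_3d), and module organisation or routine names in the 2.0 release may differ slightly. The processor-grid values and array names are placeholders, not the benchmark driver used for these results.

```fortran
program decomp_sketch
   use MPI
   use decomp_2d
   use decomp_2d_fft
   implicit none

   integer, parameter :: nx = 512, ny = 512, nz = 512
   integer :: p_row = 0, p_col = 0      ! 0,0 lets the library choose the processor grid
   integer :: ierr
   integer, dimension(3) :: fft_start, fft_end, fft_size
   real(mytype),    allocatable :: u_x(:,:,:), u_y(:,:,:)
   complex(mytype), allocatable :: u_hat(:,:,:)

   call MPI_Init(ierr)
   call decomp_2d_init(nx, ny, nz, p_row, p_col)

   ! Transpose of a real 3D array: X-pencil -> Y-pencil
   allocate(u_x(xsize(1), xsize(2), xsize(3)))
   allocate(u_y(ysize(1), ysize(2), ysize(3)))
   u_x = 1.0_mytype
   call transpose_x_to_y(u_x, u_y)

   ! Real-to-complex FFT of a 3D array starting from the X physical direction
   ! (use PHYSICAL_IN_Z for the Z-starting variant)
   call decomp_2d_fft_init(PHYSICAL_IN_X)
   call decomp_2d_fft_get_size(fft_start, fft_end, fft_size)
   allocate(u_hat(fft_size(1), fft_size(2), fft_size(3)))
   call decomp_2d_fft_3d(u_x, u_hat)    ! forward r2c transform

   call decomp_2d_fft_finalize
   call decomp_2d_finalize
   call MPI_Finalize(ierr)
end program decomp_sketch
```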
Discussion on Cirrus results
This page presents the first scalability tests of the GPU version of release 2.0 of the 2DECOMP&FFT library.
All results have been obtained using the NVHPC compiler version 22.11 together with OpenMPI 4.1.4.
The GPU builds are tested with both CUDA-aware MPI and the NVIDIA Collective Communication Library (NCCL).
The smallest resolution case, NX=NY=NZ=512, can also fit into a single GPU, therefore results are also reported for 1/4 of a node (1 GPU) and 1/2 of a node (2 GPUs). Results with the pure MPI version instead always use at least a full node with all 40 cores available.
The GPU/CPU speedup is computed using as CPU reference time the case with 1 full node for the NX=NY=NZ=512 resolution and with 2 full nodes for NX=NY=NZ=1024, since the largest case needs at least 8 GPUs to fit in memory.
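As an illustration only, the fragment below (continuing the sketch above, so u_x, u_hat and nrank are assumed to be available) shows one way such a speedup figure can be obtained: the forward transform is timed with MPI_Wtime over a number of repetitions and a stored CPU reference time is divided by the measured time. The iteration count and reference value are placeholders; this is not the actual timing harness used for these benchmarks.

```fortran
   ! Hypothetical timing fragment: t_cpu_ref would hold the time per transform
   ! measured on the CPU reference run (1 full node for 512^3, 2 full nodes for 1024^3).
   double precision :: t_start, t_end, t_gpu, speedup
   double precision, parameter :: t_cpu_ref = 1.0d0   ! placeholder reference time [s]
   integer :: it
   integer, parameter :: niter = 100

   t_start = MPI_Wtime()
   do it = 1, niter
      call decomp_2d_fft_3d(u_x, u_hat)    ! repeated forward r2c transforms
   end do
   t_end = MPI_Wtime()

   t_gpu   = (t_end - t_start) / dble(niter)
   speedup = t_cpu_ref / t_gpu
   if (nrank == 0) write(*,*) 'time per transform (s): ', t_gpu, &
                              '  speedup vs CPU reference: ', speedup
```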
The CPU results, particularly the ones for the real and complex transposes, show an acceptable scalability, but they are not comparable with the ones presented for Archer2, particularly for the coarsest mesh resolution. This can mainly be attributed to the network, which is considerably slower than the one available on Archer2.
Communication greatly improves when using GPUs, with both CUDA-aware MPI and particularly with NCCL. For the GPU cases the slabs decomposition also tends to give better and more consistent performance, with NCCL generally 50% or more faster than CUDA-aware MPI. For the low resolution case there is a very noticeable drop in performance when moving beyond 1 node, which can again be attributed to the relatively slow interconnect. For the larger case, where at least 2 nodes are necessary to fit the case in GPU memory, the interconnect issue is less visible. The speedup of the GPU acceleration over the CPU is a factor of 5 or above, depending on the case and the resolution.