This is a short guide on best practices for achieving good performance with Neko across various architectures. It covers relevant configuration options, as well as advice on how to set up and run simulations for the best possible performance.
When building Neko, it is important to enable full optimisation for all configured backends. Setting FCFLAGS and CFLAGS to -O3 will provide an optimal build for the CPU and SX-Aurora backends, while for accelerators, a user needs to set CUDA_CFLAGS or HIP_HIPCC_FLAGS depending on the backend type (OpenCL optimisation is set via the CFLAGS variable). Additional performance can also be gained on both CPUs and GPUs by passing architecture-specific flags, either via FCFLAGS and CFLAGS (CPU, SX, and OpenCL), HIP_HIPCC_FLAGS (HIP), or CUDA_ARCH (CUDA).
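As an illustration, the configure invocations below sketch typical optimised builds. The architecture-specific values (-march=native, sm_80, gfx90a) are placeholder assumptions and should be adjusted to the actual compiler and hardware:

    # CPU (or SX-Aurora) build with full optimisation and host tuning
    ./configure FCFLAGS="-O3 -march=native" CFLAGS="-O3 -march=native"

    # CUDA build: optimised device code, targeting (for example) an sm_80 GPU
    ./configure FCFLAGS="-O3" CFLAGS="-O3" CUDA_CFLAGS="-O3" CUDA_ARCH="-arch=sm_80"

    # HIP build: optimisation and target GPU passed via HIP_HIPCC_FLAGS
    ./configure FCFLAGS="-O3" CFLAGS="-O3" HIP_HIPCC_FLAGS="-O3 --offload-arch=gfx90a"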
To avoid unnecessary data movement between host and device, it is important to use a device-aware MPI implementation and to configure Neko with --enable-device-mpi for the CUDA and HIP backends. Furthermore, to improve GPU utilisation, especially with few elements per device, it can be beneficial to configure Neko with OpenMP support (--enable-openmp) and launch the simulation with two threads, which enables the task-parallel preconditioners inside Neko.
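For example, a CUDA build combining both options could be configured and launched as below; the launcher, rank count, and case name are illustrative assumptions:

    # GPU build with device-aware MPI and OpenMP task parallelism
    ./configure --enable-device-mpi --enable-openmp FCFLAGS="-O3" CUDA_CFLAGS="-O3"

    # Two OpenMP threads per rank enable the task-parallel preconditioners
    OMP_NUM_THREADS=2 mpirun -np 8 neko case.case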
This section explains how to set up and select options for a case to achieve the best possible simulation performance.
Load balancing is essential for good simulation performance. This can be performed using graph partitioning of the mesh, either offline or online (during a simulation with Neko). Both options require Neko to be built with the optional dependency ParMetis (see Installing Neko).
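If ParMetis is installed in a non-standard location, it can be pointed to at configure time; the sketch below assumes the --with-parmetis option described in the installation instructions, with a placeholder path:

    # Build Neko with ParMetis support for mesh partitioning
    ./configure --with-parmetis=/path/to/parmetis FCFLAGS="-O3"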
The offline tool is called prepart, and is installed in the same installation directory as neko and makeneko. To partition a mesh into nparts, launch prepart as below:
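For example, assuming a mesh file named mesh.nmsh and nparts = 16 (both placeholder values), and that prepart takes the mesh file followed by the number of partitions:

    # Partition mesh.nmsh into 16 load-balanced parts
    prepart mesh.nmsh 16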
The output will be a new mesh, with an additional _<nparts> in the filename to indicate that it has been balanced for nparts.
To enable load balancing (mesh partitioning) during a simulation, set the load_balancing option to true (see Case File). Runtime load balancing will partition the mesh (in parallel) before the time-stepping loop starts, and it will also save the balanced mesh with an additional _lb in the filename.
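A minimal case-file sketch, assuming the load_balancing key sits at the top level of the case object as described in the case-file documentation (other required keys are omitted here):

    {
        "case": {
            "mesh_file": "mesh.nmsh",
            "load_balancing": true
        }
    }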
Depending on the configured backend, some simulation parameters can significantly impact performance, for example, the polynomial order. For CPU backends, performance is less sensitive to the chosen order, which should therefore be set based on the accuracy and time-to-solution needs of a given case. On accelerators, performance and device utilisation are closely tied to the polynomial order. In general, accelerators run best with the highest possible polynomial order. However, using a very high order will force down the required time-step size and, in turn, increase the time-to-solution for a given case. A good rule of thumb is to use at least seventh-order polynomials for accelerator backends.
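In the case file, the order is set via the polynomial_order key in the numerics block; a sketch assuming the key names from the case-file documentation, with illustrative values:

    {
        "case": {
            "numerics": {
                "time_order": 3,
                "polynomial_order": 7
            }
        }
    }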
Another important factor to experiment with is the combination of linear solver and preconditioner, and their associated tolerance criteria. The best combination is highly case dependent; it is therefore best to experiment with the various types provided (see Case File).
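As an illustrative starting point, solvers and preconditioners are selected per equation in the fluid block of the case file. The sketch below assumes the key layout of the case-file documentation (which may differ between versions) and shows one possible combination, not a recommendation:

    {
        "case": {
            "fluid": {
                "velocity_solver": {
                    "type": "cg",
                    "preconditioner": "jacobi",
                    "absolute_tolerance": 1e-7
                },
                "pressure_solver": {
                    "type": "gmres",
                    "preconditioner": "hsmg",
                    "absolute_tolerance": 1e-5
                }
            }
        }
    }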
When running a simulation, the only parameter a user has some control over is the number of MPI ranks to use. Neko will always distribute a mesh across all ranks in a job using a load-balanced linear distribution. With too few elements per rank, communication costs will start to dominate, reducing both scalability and performance. The opposite, with too many elements per rank, will cause the computational cost per rank to increase, at the expense of a longer overall time-to-solution.
Neko has been optimised with strong scalability in mind, and has demonstrated nearly ideal scaling and parallel efficiency with as few as 10 elements per MPI rank on CPUs or 7000-10000 elements per MPI rank on GPUs.
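As a hypothetical example, by this rule of thumb a mesh with 320,000 elements could scale to tens of thousands of CPU ranks, while on GPUs it would map well onto roughly 32 to 46 devices (7000 to 10000 elements each); beyond that, additional GPUs would be increasingly underutilised.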
Finally, achieved performance depends on many factors, for example, the interconnect; it is therefore advisable to invest time in establishing a performance baseline the first time a new machine is used.