.. role:: cpp(code)
   :language: c++

.. role:: fortran(code)
   :language: fortran

.. _sec:gpu:overview:

Overview of AMReX GPU Strategy
==============================

AMReX's GPU strategy focuses on providing performant GPU support with
minimal changes and maximum flexibility.  This allows application teams to
get running on GPUs quickly while allowing long term performance tuning and
programming model selection.  AMReX uses the native programming language
for each GPU vendor: CUDA for NVIDIA, HIP for AMD and SYCL for Intel.  This
will be designated with ``CUDA/HIP/SYCL`` throughout the documentation.
However, application teams can also use OpenACC or OpenMP in their
individual codes.

At this time, AMReX does not support cross-native language compilation
(HIP for non-AMD systems and SYCL for non-Intel systems).  It may work with
a given version, but AMReX does not track or guarantee such functionality.

When running AMReX on a CPU system, the parallelization strategy is a
combination of MPI and OpenMP using tiling, as detailed in
:ref:`sec:basics:mfiter:tiling`.  However, tiling is ineffective on GPUs
due to the overhead associated with kernel launching.  Instead, efficient
use of the GPU's resources is the primary concern.  Improving resource
efficiency allows a larger percentage of GPU threads to work
simultaneously, increasing effective parallelism and decreasing the time
to solution.

When running on CPUs, AMReX uses an ``MPI+X`` strategy where the ``X``
threads are used to perform parallelization techniques, like tiling.  The
most common ``X`` is ``OpenMP``.  On GPUs, AMReX requires ``CUDA/HIP/SYCL``
and can be further combined with other parallel GPU languages, including
``OpenACC`` and ``OpenMP``, to control the offloading of subroutines to the
GPU.  This ``MPI+X+Y`` GPU strategy has been developed to give users the
maximum flexibility to find the best combination of portability,
readability and performance for their applications.

Presented here is an overview of important features of AMReX's GPU
strategy.  Additional information that is required for creating GPU
applications is detailed throughout the rest of this chapter:

- Each MPI rank offloads its work to a single GPU.
  ``(MPI ranks == Number of GPUs)``

- Calculations that can be offloaded efficiently to GPUs use GPU threads
  to parallelize over a valid box at a time.  This is done by launching a
  large number of GPU threads that only work on a few cells each.  This
  work distribution is illustrated in :numref:`fig:gpu:threads`.

.. |a| image:: ./GPU/gpu_2.png
       :width: 100%

.. |b| image:: ./GPU/gpu_3.png
       :width: 100%

.. _fig:gpu:threads:

.. table:: Comparison of OpenMP and GPU work distribution.
           Pictures provided by Mike Zingale and the CASTRO team.

   +------------------------------------------------+------------------------------------------------+
   |                      |a|                       |                      |b|                       |
   +------------------------------------------------+------------------------------------------------+
   | OpenMP tiled box.                              | GPU threaded box.                              |
   | OpenMP threads break down the valid box        | Each GPU thread works on a few cells of the    |
   | into two large boxes (blue and orange).        | valid box. This example uses one cell per      |
   | The lo and hi of one tiled box are marked.     | thread, each thread using a box with lo = hi.  |
   +------------------------------------------------+------------------------------------------------+

- C++ macros and GPU extended lambdas are used to provide performance
  portability while making the code as understandable as possible to
  science-focused code teams.
- AMReX can utilize GPU managed memory to automatically handle memory
  movement for mesh and particle data.  Simple data structures, such as
  :cpp:`IntVect`\ s, can be passed by value, and complex data structures,
  such as :cpp:`FArrayBox`\ es, have specialized AMReX classes to handle
  the data movement for the user.  Tests have shown CUDA managed memory to
  be efficient and reliable, especially when applications remove any
  unnecessary data accesses.  However, managed memory is not used by
  :cpp:`FArrayBox` and :cpp:`MultiFab` by default.

- Application teams should strive to keep mesh and particle data
  structures on the GPU for as long as possible, minimizing movement back
  to the CPU.  This strategy lends itself to AMReX applications readily;
  the mesh and particle data can stay on the GPU for most subroutines
  except for redistribution, communication and I/O operations.

- AMReX's GPU strategy is focused on launching GPU kernels inside AMReX's
  :cpp:`MFIter` and :cpp:`ParIter` loops.  By performing GPU work within
  :cpp:`MFIter` and :cpp:`ParIter` loops, GPU work is isolated to
  independent data sets on well-established AMReX data objects, providing
  consistency and safety that also matches AMReX's coding methodology.
  Similar tools are also available for launching work outside of AMReX
  loops.

- AMReX further parallelizes GPU applications by utilizing streams.
  Streams guarantee execution order of kernels within the same stream,
  while allowing different streams to run simultaneously.  AMReX places
  each iteration of :cpp:`MFIter` loops on a separate stream, so that
  independent iterations can run concurrently while the kernels within
  each iteration run in order, maximizing GPU usage.

  The AMReX implementation of streams is illustrated in
  :numref:`fig:gpu:streams`.  The CPU runs the first iteration of the
  MFIter loop (blue), which contains three GPU kernels.  The kernels begin
  immediately in GPU Stream 1 and run in the same order they were added.
  The second (red) and third (green) iterations are similarly launched in
  Streams 2 and 3.  The fourth (orange) and fifth (purple) iterations
  require more GPU resources than remain, so they have to wait until
  resources are freed before beginning.  Meanwhile, after all the loop
  iterations are launched, the CPU reaches a synchronize in the MFIter's
  destructor and waits for all GPU launches to complete before continuing.

- The Fortran interface of AMReX does not currently have GPU support.
  AMReX recommends porting Fortran code to C++ when coding for GPUs.

.. raw:: latex

   \begin{center}

.. _fig:gpu:streams:

.. figure:: ./GPU/Streams.png

   Timeline illustration of GPU streams.  Illustrates the case of an
   MFIter loop of five iterations with three GPU kernels each, being run
   with three GPU streams.

.. raw:: latex

   \end{center}

.. _sec:gpu:build:

Building GPU Support
====================

Building with GNU Make
----------------------

To build AMReX with GPU support, add ``USE_CUDA=TRUE``, ``USE_HIP=TRUE``
or ``USE_SYCL=TRUE`` to the ``GNUmakefile`` or as a command line argument.

AMReX does not require OpenACC, but application codes can use it if it is
supported by the compiler.  For OpenACC support, add ``USE_ACC=TRUE``.
PGI, Cray and GNU compilers support OpenACC.  Thus, for OpenACC, you must
use ``COMP=pgi``, ``COMP=cray`` or ``COMP=gnu``.  Currently, only IBM is
supported with OpenMP offloading.  To use OpenMP offloading, make with
``USE_OMP_OFFLOAD=TRUE``.

Compiling AMReX with CUDA requires compiling the code through NVIDIA's
CUDA compiler driver in addition to the standard compiler.
This driver is called ``nvcc`` and it requires a host compiler to work
with.  The default host compiler for NVCC is GCC, even if ``COMP`` is set
to a different compiler.  One can change this by setting
``NVCC_HOST_COMP``.  For example, ``COMP=pgi`` alone will compile C/C++
codes with NVCC/GCC and Fortran codes with PGI, and link with PGI.  Using
``COMP=pgi`` and ``NVCC_HOST_COMP=pgi`` will compile C/C++ codes with
NVCC/PGI.

You can use ``amrex-tutorials/ExampleCodes/Basic/HelloWorld_C/`` to test
your programming environment.  For example, building with:

.. highlight:: console

::

   make COMP=gnu USE_CUDA=TRUE

should produce an executable named ``main3d.gnu.DEBUG.CUDA.ex``.  Running
it should generate output like:

.. highlight:: console

::

   $ ./main3d.gnu.DEBUG.CUDA.ex
   Initializing CUDA...
   CUDA initialized with 1 GPU
   AMReX (19.06-404-g0455b168b69c-dirty) initialized
   Hello world from AMReX version 19.06-404-g0455b168b69c-dirty
   Total GPU global memory (MB): 6069
   Free GPU global memory (MB): 5896
   [The Arena] space (MB): 4552
   [The Managed Arena] space (MB): 8
   [The Pinned Arena] space (MB): 8
   AMReX (19.06-404-g0455b168b69c-dirty) finalized

SYCL configuration variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When building with ``USE_SYCL=TRUE``, one can set the following makefile
variables to configure the build:

.. raw:: latex

   \begin{center}

.. _tab:gnumakesyclvar:

.. table:: AMReX SYCL-specific GNU Make build options

   +------------------------------+-------------------------------------------------+-------------+-----------------+
   | Variable Name                | Description                                     | Default     | Possible values |
   +==============================+=================================================+=============+=================+
   | SYCL_AOT                     | Enable SYCL ahead-of-time compilation           | FALSE       | TRUE, FALSE     |
   +------------------------------+-------------------------------------------------+-------------+-----------------+
   | SYCL_AOT_GRF_MODE            | Specify AOT register file mode                  | Default     | Default, Large, |
   |                              |                                                 |             | AutoLarge       |
   +------------------------------+-------------------------------------------------+-------------+-----------------+
   | AMREX_INTEL_ARCH             | Specify target if AOT is enabled                | None        | pvc, etc.       |
   +------------------------------+-------------------------------------------------+-------------+-----------------+
   | SYCL_SPLIT_KERNEL            | Enable SYCL kernel splitting                    | FALSE       | TRUE, FALSE     |
   +------------------------------+-------------------------------------------------+-------------+-----------------+
   | USE_ONEDPL                   | Enable SYCL's oneDPL algorithms                 | NO          | YES, NO         |
   +------------------------------+-------------------------------------------------+-------------+-----------------+
   | SYCL_SUB_GROUP_SIZE          | Specify subgroup size                           | 32          | 64, 32, 16      |
   +------------------------------+-------------------------------------------------+-------------+-----------------+
   | SYCL_PARALLEL_LINK_JOBS      | Number of parallel jobs in device link          | 1           | 1, 2, 3, etc.   |
   +------------------------------+-------------------------------------------------+-------------+-----------------+

.. raw:: latex

   \end{center}

Building with CMake
-------------------

To build AMReX with GPU support in CMake, add
``-DAMReX_GPU_BACKEND=CUDA|HIP|SYCL`` to the ``cmake`` invocation, for
CUDA, HIP and SYCL, respectively.  By default, AMReX uses 256 threads per
GPU block/group in most situations.  This can be changed with
``-DAMReX_GPU_MAX_THREADS=N``, where ``N`` is 128 for example.
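For example, a minimal configure-and-build sequence for a CUDA system
(the source path, thread count and job count are illustrative) might look
like:

.. highlight:: console

::

   cmake -S /path/to/amrex -B build -DAMReX_GPU_BACKEND=CUDA -DAMReX_GPU_MAX_THREADS=128
   cmake --build build -j 8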
Enabling CUDA support
^^^^^^^^^^^^^^^^^^^^^

To build AMReX with CUDA support in CMake, add
``-DAMReX_GPU_BACKEND=CUDA`` to the ``cmake`` invocation.  For a full list
of CUDA-specific configuration options, check the
:ref:`table <tab:cmakecudavar>` below.

.. raw:: latex

   \begin{center}

.. _tab:cmakecudavar:

.. table:: AMReX CUDA-specific build options

   +------------------------------+-------------------------------------------------+-------------+-----------------+
   | Variable Name                | Description                                     | Default     | Possible values |
   +==============================+=================================================+=============+=================+
   | AMReX_CUDA_ARCH              | CUDA target architecture                        | Auto        | User-defined    |
   +------------------------------+-------------------------------------------------+-------------+-----------------+
   | AMReX_CUDA_FASTMATH          | Enable CUDA fastmath library                    | YES         | YES, NO         |
   +------------------------------+-------------------------------------------------+-------------+-----------------+
   | AMReX_CUDA_BACKTRACE         | Host function symbol names (e.g. cuda-memcheck) | Auto        | YES, NO         |
   +------------------------------+-------------------------------------------------+-------------+-----------------+
   | AMReX_CUDA_COMPILATION_TIMER | CSV table with time for each compilation phase  | NO          | YES, NO         |
   +------------------------------+-------------------------------------------------+-------------+-----------------+
   | AMReX_CUDA_DEBUG             | Device debug information (optimizations: off)   | YES: Debug  | YES, NO         |
   +------------------------------+-------------------------------------------------+-------------+-----------------+
   | AMReX_CUDA_ERROR_CAPTURE_THIS| Error if a CUDA lambda captures a class' this   | NO          | YES, NO         |
   +------------------------------+-------------------------------------------------+-------------+-----------------+
   | AMReX_CUDA_ERROR_CROSS       | Error if a host function is called from a host  | NO          | YES, NO         |
   | _EXECUTION_SPACE_CALL        | device function                                 |             |                 |
   +------------------------------+-------------------------------------------------+-------------+-----------------+
   | AMReX_CUDA_KEEP_FILES        | Keep intermediate files (folder: nvcc_tmp)      | NO          | YES, NO         |
   +------------------------------+-------------------------------------------------+-------------+-----------------+
   | AMReX_CUDA_LTO               | Enable CUDA link-time-optimization              | NO          | YES, NO         |
   +------------------------------+-------------------------------------------------+-------------+-----------------+
   | AMReX_CUDA_MAXREGCOUNT       | Limits the number of CUDA registers available   | 255         | User-defined    |
   +------------------------------+-------------------------------------------------+-------------+-----------------+
   | AMReX_CUDA_PTX_VERBOSE       | Verbose code generation statistics in ptxas     | NO          | YES, NO         |
   +------------------------------+-------------------------------------------------+-------------+-----------------+
   | AMReX_CUDA_SHOW_CODELINES    | Source information in PTX (optimizations: on)   | Auto        | YES, NO         |
   +------------------------------+-------------------------------------------------+-------------+-----------------+
   | AMReX_CUDA_SHOW_LINENUMBERS  | Line-number information (optimizations: on)     | Auto        | YES, NO         |
   +------------------------------+-------------------------------------------------+-------------+-----------------+
   | AMReX_CUDA_WARN_CAPTURE_THIS | Warn if a CUDA lambda captures a class' this    | YES         | YES, NO         |
   +------------------------------+-------------------------------------------------+-------------+-----------------+

.. raw:: latex

   \end{center}
The target architecture to build for can be specified via the
configuration option ``-DAMReX_CUDA_ARCH=<target>``, where ``<target>``
can be either the name of the NVIDIA GPU generation, e.g. ``Turing``,
``Volta``, ``Ampere``, or its compute capability, e.g. ``10.0``, ``9.0``.
For example, on Cori GPUs you can specify the architecture as follows:

.. highlight:: console

::

   cmake [options] -DAMReX_GPU_BACKEND=CUDA -DAMReX_CUDA_ARCH=Volta /path/to/amrex/source

If no architecture is specified, CMake will default to the architecture
defined in the *environment variable* ``AMREX_CUDA_ARCH`` (note: all
caps).  If the latter is not defined, CMake will try to determine which
GPU architecture is supported by the system.  If more than one is found,
CMake will build for all of them.  If autodetection fails, a list of
"common" architectures is assumed.

Multiple CUDA architectures can also be set manually as a
semicolon-separated list, e.g. ``-DAMReX_CUDA_ARCH=7.0;8.0``.  Building
for multiple CUDA architectures will generally result in a larger library
and longer build times.

**Note that AMReX supports NVIDIA GPU architectures with compute
capability 6.0 or higher and CUDA Toolkit version 11.0 or higher.**

In order to import the CUDA-enabled AMReX library into your CMake
project, you need to include the following code into the appropriate
CMakeLists.txt file:

.. highlight:: console

::

   # Find a CUDA-enabled AMReX installation
   find_package(AMReX REQUIRED CUDA)

If instead of using an external installation of AMReX you prefer to
include AMReX as a subproject in your CMake setup, we strongly encourage
you to use the ``AMReX_SetupCUDA`` module as shown below if the CMake
version is less than 3.20:

.. highlight:: console

::

   # Enable CUDA in your CMake project
   enable_language(CUDA)

   # Include the AMReX-provided CUDA setup module -- OBSOLETE with CMake >= 3.20
   if(CMAKE_VERSION VERSION_LESS 3.20)
      include(AMReX_SetupCUDA)
   endif()

   # Include AMReX source directory ONLY AFTER the two steps above
   add_subdirectory(/path/to/amrex/source/dir)

To ensure consistency between CUDA-enabled AMReX and any CMake target
that links against it, we provide the helper function
``setup_target_for_cuda_compilation()``:

.. highlight:: console

::

   # Set all sources for my_target
   target_sources(my_target PRIVATE source1 source2 source3 ...)

   # Set up my_target to be compiled with CUDA and linked against CUDA-enabled AMReX
   # MUST be done AFTER all sources have been assigned to my_target
   setup_target_for_cuda_compilation(my_target)

   # Link against amrex
   target_link_libraries(my_target PUBLIC AMReX::amrex)

Enabling HIP Support
^^^^^^^^^^^^^^^^^^^^

To build AMReX with HIP support in CMake, add ``-DAMReX_GPU_BACKEND=HIP``,
``-DAMReX_AMD_ARCH=<target>`` and ``-DCMAKE_CXX_COMPILER=<hip-compiler>``
to the ``cmake`` invocation.  If you don't need Fortran features
(``AMReX_FORTRAN=OFF``), it is recommended to use AMD's ``clang++`` as the
HIP compiler (see the relevant ROCm/HIP issues for rocm/HIP <= 4.2.0).

In AMReX CMake, the HIP compiler is treated as a special C++ compiler, and
therefore the standard CMake variables used to customize the compilation
process for C++, for example ``CMAKE_CXX_FLAGS``, can be used for HIP as
well.  Since CMake does not support autodetection of HIP compilers/target
architectures yet, ``CMAKE_CXX_COMPILER`` must be set to a valid HIP
compiler, i.e. ``clang++`` or ``hipcc``, and ``AMReX_AMD_ARCH`` to the
target architecture you are building for.
Thus, **AMReX_AMD_ARCH and CMAKE_CXX_COMPILER are required user inputs
when AMReX_GPU_BACKEND=HIP**.  As with CUDA, the *environment variable*
``AMREX_AMD_ARCH`` (note: all caps) is also read, and the C++ compiler can
be hinted as usual, e.g. with ``export CXX=$(which clang++)``.

Below is an example configuration for HIP on Tulip:

.. highlight:: console

::

   cmake -S . -B build -DAMReX_GPU_BACKEND=HIP -DCMAKE_CXX_COMPILER=$(which clang++) -DAMReX_AMD_ARCH="gfx906;gfx908" # [other options]
   cmake --build build -j 6

Enabling SYCL Support
^^^^^^^^^^^^^^^^^^^^^

To build AMReX with SYCL support in CMake, add
``-DAMReX_GPU_BACKEND=SYCL -DCMAKE_CXX_COMPILER=<sycl-compiler>`` to the
``cmake`` invocation.  For a full list of SYCL-specific configuration
options, check the :ref:`table <tab:cmakesyclvar>` below.

In AMReX CMake, the SYCL compiler is treated as a special C++ compiler,
and therefore the standard CMake variables used to customize the
compilation process for C++, for example ``CMAKE_CXX_FLAGS``, can be used
for SYCL as well.  Since CMake does not support autodetection of SYCL
compilers yet, ``CMAKE_CXX_COMPILER`` must be set to a valid SYCL
compiler, i.e. ``icpx``.  Thus **CMAKE_CXX_COMPILER is a required user
input when AMReX_GPU_BACKEND=SYCL**.  At this time, **the only supported
SYCL compiler is icpx**.

Below is an example configuration for SYCL:

.. highlight:: console

::

   cmake -DAMReX_GPU_BACKEND=SYCL -DCMAKE_CXX_COMPILER=$(which icpx) [other options] /path/to/amrex/source

.. raw:: latex

   \begin{center}

.. _tab:cmakesyclvar:

.. table:: AMReX SYCL-specific build options

   +-------------------------------+----------------------------------------------+-------------+------------------+
   | Variable Name                 | Description                                  | Default     | Possible values  |
   +===============================+==============================================+=============+==================+
   | AMReX_SYCL_AOT                | Enable SYCL ahead-of-time compilation        | NO          | YES, NO          |
   +-------------------------------+----------------------------------------------+-------------+------------------+
   | AMReX_SYCL_AOT_GRF_MODE       | Specify AOT register file mode               | Default     | Default, Large,  |
   |                               |                                              |             | AutoLarge        |
   +-------------------------------+----------------------------------------------+-------------+------------------+
   | AMREX_INTEL_ARCH              | Specify target if AOT is enabled             | None        | pvc, etc.        |
   +-------------------------------+----------------------------------------------+-------------+------------------+
   | AMReX_SYCL_SPLIT_KERNEL       | Enable SYCL kernel splitting                 | YES         | YES, NO          |
   +-------------------------------+----------------------------------------------+-------------+------------------+
   | AMReX_SYCL_ONEDPL             | Enable SYCL's oneDPL algorithms              | NO          | YES, NO          |
   +-------------------------------+----------------------------------------------+-------------+------------------+
   | AMReX_SYCL_SUB_GROUP_SIZE     | Specify subgroup size                        | 32          | 64, 32, 16       |
   +-------------------------------+----------------------------------------------+-------------+------------------+
   | AMReX_PARALLEL_LINK_JOBS      | Specify number of parallel link jobs         | 1           | positive integer |
   +-------------------------------+----------------------------------------------+-------------+------------------+

.. raw:: latex

   \end{center}

.. ===================================================================

.. _sec:gpu:namespace:

Gpu Namespace and Macros
========================

Most GPU related classes and functions are in ``namespace Gpu``, which is
inside ``namespace amrex``.
For example, the GPU configuration class ``Device`` can be referred to as
``amrex::Gpu::Device``.

For portability, AMReX defines macros for the CUDA/HIP function
qualifiers, and they should be preferred so that the same code also
compiles when ``USE_CUDA=FALSE`` and ``USE_HIP=FALSE``.  These include:

.. highlight:: c++

::

   #define AMREX_GPU_HOST        __host__
   #define AMREX_GPU_DEVICE      __device__
   #define AMREX_GPU_GLOBAL      __global__
   #define AMREX_GPU_HOST_DEVICE __host__ __device__

Note that when AMReX is not built with ``CUDA/HIP/SYCL``, these macros
expand to nothing.

When AMReX is compiled with ``USE_CUDA=TRUE``, ``USE_HIP=TRUE``,
``USE_SYCL=TRUE``, or ``USE_ACC=TRUE``, the preprocessor macros
``AMREX_USE_CUDA``, ``AMREX_USE_HIP``, ``AMREX_USE_SYCL``, or
``AMREX_USE_ACC``, respectively, are defined for conditional programming,
as well as ``AMREX_USE_GPU``.  The ``AMREX_USE_GPU`` definition can be
used in application code if different functionality should be used when
AMReX is built with GPU support.  When AMReX is compiled with
``USE_OMP_OFFLOAD=TRUE``, ``AMREX_USE_OMP_OFFLOAD`` is defined.

The macros ``AMREX_IF_ON_DEVICE((code_for_device))`` and
``AMREX_IF_ON_HOST((code_for_host))`` should be used when a
``__host__ __device__`` function requires separate code for the CPU and
GPU implementations.

.. ===================================================================

.. _sec:gpu:memory:

Memory Allocation
=================

To provide portability and improve memory allocation performance, AMReX
provides a number of memory pools.  When compiled without GPU support,
all :cpp:`Arena`\ s use standard :cpp:`new` and :cpp:`delete` operators.
With GPU support, the :cpp:`Arena`\ s each allocate a specific type of GPU
memory:

.. raw:: latex

   \begin{center}

.. _tab:gpu:arena:

.. table:: Memory Arenas

   +---------------------+---------------------------------------------------+
   | Arena               | Memory Type                                       |
   +=====================+===================================================+
   | The_Arena()         | managed or device memory                          |
   +---------------------+---------------------------------------------------+
   | The_Device_Arena()  | device memory, could be an alias to The_Arena()   |
   +---------------------+---------------------------------------------------+
   | The_Managed_Arena() | managed memory, could be an alias to The_Arena()  |
   +---------------------+---------------------------------------------------+
   | The_Pinned_Arena()  | pinned memory                                     |
   +---------------------+---------------------------------------------------+

.. raw:: latex

   \end{center}

The :cpp:`Arena` object returned by these calls provides access to two
functions:

.. highlight:: c++

::

   void* alloc (std::size_t sz);
   void free (void* p);

:cpp:`The_Arena()` is used for memory allocation of data in
:cpp:`BaseFab`.  By default, it allocates device memory.  This can be
changed with the boolean runtime parameter
``amrex.the_arena_is_managed=1``.  When managed memory is enabled, the
data in a :cpp:`MultiFab` is placed in managed memory and is accessible
from both the CPU host and the GPU device.  This allows application codes
to develop their GPU capability gradually.  The behavior of
:cpp:`The_Managed_Arena()` likewise depends on the
``amrex.the_arena_is_managed`` parameter.  If
``amrex.the_arena_is_managed=0``, :cpp:`The_Managed_Arena()` is a separate
pool of managed memory.  If ``amrex.the_arena_is_managed=1``,
:cpp:`The_Managed_Arena()` is simply aliased to :cpp:`The_Arena()` to
reduce memory fragmentation.
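As a sketch of direct :cpp:`Arena` usage (the buffer size and its use here
are purely illustrative):

.. highlight:: c++

::

   // Allocate scratch space from The_Arena (device memory by default).
   std::size_t nbytes = 256 * sizeof(amrex::Real);
   auto* p = static_cast<amrex::Real*>(amrex::The_Arena()->alloc(nbytes));

   // ... use p in GPU kernels launched on the current stream ...

   amrex::The_Arena()->free(p);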
In :cpp:`amrex::Initialize`, a large amount of GPU device memory is
allocated and kept in :cpp:`The_Arena()`.  The default is 3/4 of the total
device memory, and it can be changed with a :cpp:`ParmParse` parameter,
``amrex.the_arena_init_size``, in bytes.  The default initial size for the
other arenas is 8388608 (i.e., 8 MB).  For :cpp:`The_Managed_Arena()` and
:cpp:`The_Device_Arena()`, it can be changed with
``amrex.the_managed_arena_init_size`` and
``amrex.the_device_arena_init_size``, respectively, if they are not
aliases to :cpp:`The_Arena()`.  For :cpp:`The_Pinned_Arena()`, it can be
changed with ``amrex.the_pinned_arena_init_size``.

The user can also specify a release threshold for these arenas.  If the
memory usage in an arena is below the threshold, the arena will keep the
memory for later reuse; otherwise it will try to release memory back to
the system if it is not being used.  By default, the release threshold for
:cpp:`The_Arena()` is set to a huge number that prevents the memory from
being released automatically, and it can be changed with the parameter
``amrex.the_arena_release_threshold``.  For :cpp:`The_Pinned_Arena()`, the
default release threshold is the size of the total device memory, and the
runtime parameter is ``amrex.the_pinned_arena_release_threshold``.  If it
is a separate arena, the behavior of :cpp:`The_Device_Arena()` or
:cpp:`The_Managed_Arena()` can be changed with
``amrex.the_device_arena_release_threshold`` or
``amrex.the_managed_arena_release_threshold``.  Note that the units for
all the parameters discussed above are bytes.

All these arenas also have a member function :cpp:`freeUnused()` that can
be used to manually release unused memory back to the system.  If you want
to print out the current memory usage of the Arenas, you can call
:cpp:`amrex::Arena::PrintUsage()`.

When AMReX is built with SUNDIALS turned on,
:cpp:`amrex::sundials::The_SUNMemory_Helper()` can be provided to SUNDIALS
data structures so that they use the appropriate Arena object when
allocating memory.  For example, it can be provided to the SUNDIALS CUDA
vector:

.. highlight:: c++

::

   N_Vector x = N_VNewWithMemHelp_Cuda(size, use_managed_memory, *The_SUNMemory_Helper());

.. ===================================================================

.. _sec:gpu:classes:

GPU Safe Classes and Functions
==============================

AMReX GPU work takes place inside of MFIter and particle loops.
Accordingly, there are two ways classes and functions have been modified
to interact with the GPU:

1. A number of functions used within these loops are labelled with
   ``AMREX_GPU_HOST_DEVICE`` and can be called on the device.  This
   includes member functions, such as :cpp:`IntVect::type()`, as well as
   non-member functions, such as :cpp:`amrex::min` and :cpp:`amrex::max`.
   In specialized cases, classes such as ``IntVect`` are labelled such
   that objects can be constructed and destructed on the device and their
   member functions can be called there.

2. Functions that contain MFIter or particle loops have been rewritten to
   contain device launches.  For example, the :cpp:`FillBoundary` function
   cannot be called from device code, but calling it from the CPU will
   launch GPU kernels if AMReX is compiled with GPU support.

Necessary and convenient AMReX functions and objects have been given a
device version and/or device access.  In this section, we discuss some
examples of AMReX device classes and functions that are important for
programming GPUs.
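As a simple illustration of the first category, the same qualifiers can be
applied to user code so that a helper is callable from both host and
device.  A minimal sketch (the function itself is hypothetical):

.. highlight:: c++

::

   AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
   amrex::Real clamp_value (amrex::Real x, amrex::Real lo, amrex::Real hi)
   {
       // amrex::min and amrex::max are callable on both host and device.
       return amrex::min(amrex::max(x, lo), hi);
   }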
GpuArray, Array1D, Array2D, and Array3D
---------------------------------------

:cpp:`GpuArray`, :cpp:`Array1D`, :cpp:`Array2D`, and :cpp:`Array3D` are
trivial types that work on both host and device.  They can be used
whenever a fixed size array needs to be passed to the GPU or created on
the GPU.  A variety of functions in AMReX return :cpp:`GpuArray` and the
result can be lambda-captured into GPU code.  For example,
:cpp:`GeometryData::CellSizeArray()`,
:cpp:`GeometryData::InvCellSizeArray()` and :cpp:`Box::length3d()` all
return :cpp:`GpuArray`\ s.

.. _sec:gpu:classes:asyncarray:

AsyncArray
----------

Where :cpp:`GpuArray` is a statically-sized array designed to be passed by
value onto the device, :cpp:`AsyncArray` is a dynamically-sized array
container designed to work between the CPU and GPU.  :cpp:`AsyncArray`
stores a CPU pointer and a GPU pointer and coordinates the movement of an
array of objects between the two.  It can take initial values from the
host and move them to the device.  It can copy the data from device back
to host.  It can also be used as scratch space on the device.

The call to delete the memory is added to the GPU stream as a callback
function in the destructor of :cpp:`AsyncArray`.  This guarantees that the
memory allocated by :cpp:`AsyncArray` continues to exist after the
:cpp:`AsyncArray` object goes out of scope, until all GPU kernels in the
stream have completed, without forcing the code to synchronize.  The
resulting :cpp:`AsyncArray` class is "async-safe", meaning it can be
safely used in asynchronous code regions that contain both CPU work and
GPU launches, including :cpp:`MFIter` loops.

:cpp:`AsyncArray` is also portable.  When AMReX is compiled without GPU
support, the object only stores and handles the CPU version of the data.

An example using :cpp:`AsyncArray` is given below.

.. highlight:: c++

::

   Real h_s = 0.0;
   AsyncArray<Real> aa_s(&h_s, 1); // Build AsyncArray of size 1
   Real* d_s = aa_s.data();        // Get associated device pointer
   for (MFIter mfi(mf); mfi.isValid(); ++mfi)
   {
       Vector<Real> h_v = a_cpu_function();
       AsyncArray<Real> aa_v1(h_v.data(), h_v.size());
       Real* d_v1 = aa_v1.data();  // A device copy of the data

       std::size_t n = ...;
       AsyncArray<Real> aa_v2(n);  // Allocate temporary space on device
       Real* d_v2 = aa_v2.data();  // A device pointer to uninitialized data

       ...
       // gpu kernels using the data pointed to by d_v1 and atomically
       // updating the data pointed to by d_s.
       // d_v2 can be used as scratch space and to pass data
       // between kernels.

       // If needed, we can copy the data back to host using
       //     AsyncArray::copyToHost(host_pointer, number_of_elements);

       // At the end of each iteration the compiler inserts a call to the
       // destructor of aa_v* on the cpu.  The objects aa_v* are deleted,
       // but their associated memory pointed to by d_v* is not deleted
       // until the gpu kernels in this iteration finish.
   }
   aa_s.copyToHost(&h_s, 1); // Copy the value back to host

Gpu Vectors
-----------

AMReX also provides a number of dynamic vectors for use with GPU kernels.
These are configured to use the different AMReX memory Arenas, as
summarized in the table below.  By using the memory Arenas, we can avoid
expensive allocations and deallocations when (for example) resizing
vectors.

.. raw:: latex

   \begin{center}

.. _tab:gpu:gpuvectors:

.. table:: Memory Arenas Associated with each Gpu Vector

   +----------------+----------------------+
   | Vector         | Arena                |
   +================+======================+
   | DeviceVector   | The_Arena()          |
   +----------------+----------------------+
   | HostVector     | The_Pinned_Arena()   |
   +----------------+----------------------+
   | ManagedVector  | The_Managed_Arena()  |
   +----------------+----------------------+

.. raw:: latex

   \end{center}
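A minimal sketch of typical :cpp:`Gpu::DeviceVector` usage, assuming the
data is produced by a kernel and only the raw pointer is handed to the
device (see the caveats below; the size is illustrative):

.. highlight:: c++

::

   int n = 128;                      // illustrative size
   Gpu::DeviceVector<Real> dv(n);    // allocated from The_Arena()
   Real* p = dv.data();              // raw pointer for the kernel to capture
   amrex::ParallelFor(n, [=] AMREX_GPU_DEVICE (int i) noexcept
   {
       p[i] = 0.0;
   });
   Gpu::streamSynchronize();         // before reading or destroying on the host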
These classes behave almost identically to an :cpp:`amrex::Vector` (see
:ref:`sec:basics:vecandarr`), except that they can only hold
"plain-old-data" objects (e.g., :cpp:`Real`\ s, integers, AMReX particle
structs).  If you want a resizable vector that doesn't use a memory Arena,
simply use :cpp:`amrex::Vector`.

Note that, even if the data in the vector is managed and available on
GPUs, the member functions of, e.g., :cpp:`Gpu::ManagedVector` are not.
To use the data on the GPU, it is necessary to pass the underlying data
pointer into the GPU kernels.  The managed data pointer can be accessed
using the :cpp:`data()` member function.

Be aware: resizing of dynamically allocated memory on the GPU is
unsupported.  All resizing of the vector should be done on the CPU, in a
manner that avoids race conditions with concurrent GPU kernels.

Also note: :cpp:`Gpu::ManagedVector` is not async-safe.  It cannot be
safely constructed inside of an MFIter loop with GPU kernels, and great
care should be used when accessing :cpp:`Gpu::ManagedVector` data on GPUs
to avoid race conditions.

MultiFab Reductions
-------------------

AMReX provides functions for performing standard reduction operations on
:cpp:`MultiFab`\ s, including :cpp:`MultiFab::sum` and
:cpp:`MultiFab::max`.  When AMReX is built with GPU support, these
functions automatically implement the corresponding reductions on GPUs in
an efficient manner.

The function template :cpp:`ParReduce` can be used to implement
user-defined reductions over :cpp:`MultiFab`\ s.  For example, the
following function computes the total kinetic energy using the data in a
:cpp:`MultiFab` storing the mass and momentum density.

.. highlight:: c++

::

   Real compute_ek (MultiFab const& mf)
   {
       auto const& ma = mf.const_arrays();
       return ParReduce(TypeList<ReduceOpSum>{}, TypeList<Real>{},
                        mf, IntVect(0), // zero ghost cells
           [=] AMREX_GPU_DEVICE (int box_no, int i, int j, int k) noexcept
               -> GpuTuple<Real>
           {
               Array4<Real const> const& a = ma[box_no];
               Real rho   = a(i,j,k,0);
               Real rhovx = a(i,j,k,1);
               Real rhovy = a(i,j,k,2);
               Real rhovz = a(i,j,k,3);
               Real ek = (rhovx*rhovx+rhovy*rhovy+rhovz*rhovz)/(2.*rho);
               return { ek };
           });
   }

As another example, the following function computes the max- and 1-norm
of a :cpp:`MultiFab` in the masked region specified by an
:cpp:`iMultiFab`.

.. highlight:: c++

::

   GpuTuple<Real,Real> compute_norms (MultiFab const& mf, iMultiFab const& mask)
   {
       auto const& data_ma = mf.const_arrays();
       auto const& mask_ma = mask.const_arrays();
       return ParReduce(TypeList<ReduceOpMax,ReduceOpSum>{}, TypeList<Real,Real>{},
                        mf, IntVect(0), // zero ghost cells
           [=] AMREX_GPU_DEVICE (int box_no, int i, int j, int k) noexcept
               -> GpuTuple<Real,Real>
           {
               if (mask_ma[box_no](i,j,k)) {
                   Real a = std::abs(data_ma[box_no](i,j,k));
                   return { a, a };
               } else {
                   return { 0., 0. };
               }
           });
   }

It should be noted that the reduction result of :cpp:`ParReduce` is local
to each MPI process; it is the user's responsibility to perform MPI
communication if a global result is needed.

Box, IntVect and IndexType
--------------------------

In AMReX, :cpp:`Box`, :cpp:`IntVect` and :cpp:`IndexType` are classes for
representing indices.  These classes and most of their member functions,
including constructors and destructors, have both host and device
versions, so they can be used freely in device code.
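For instance, a sketch that constructs and queries these objects inside a
device lambda, assuming ``bx`` is a :cpp:`Box` and ``a`` an
:cpp:`Array4<Real>` captured from the enclosing scope (the zeroing of the
corner cell is purely illustrative):

.. highlight:: c++

::

   amrex::ParallelFor(bx, [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept
   {
       amrex::IntVect iv(AMREX_D_DECL(i,j,k));  // IntVect constructed on the device
       if (iv == bx.smallEnd()) {               // Box and IntVect member calls
           a(i,j,k) = 0.0;
       }
   });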
Geometry
--------

AMReX's :cpp:`Geometry` class is not a GPU safe class.  However, we often
need to use geometric information such as the cell size and physical
coordinates in GPU kernels.  We can use the following member functions and
pass the returned values to GPU kernels:

.. highlight:: c++

::

   GpuArray<Real,AMREX_SPACEDIM> ProbLoArray () const noexcept;
   GpuArray<Real,AMREX_SPACEDIM> ProbHiArray () const noexcept;
   GpuArray<int ,AMREX_SPACEDIM> isPeriodicArray () const noexcept;
   GpuArray<Real,AMREX_SPACEDIM> CellSizeArray () const noexcept;
   GpuArray<Real,AMREX_SPACEDIM> InvCellSizeArray () const noexcept;

Alternatively, we can copy the data into a GPU safe class that can be
passed by value to GPU kernels.  This class is called
:cpp:`GeometryData`, and it is created by calling :cpp:`Geometry::data()`.
The accessor functions of :cpp:`GeometryData` are identical to those of
:cpp:`Geometry`.

.. _sec:gpu:classes:basefab:

BaseFab, FArrayBox, IArrayBox
-----------------------------

:cpp:`BaseFab`, :cpp:`IArrayBox` and :cpp:`FArrayBox` have some GPU
support.  They cannot be constructed in device code unless they are
constructed as an alias to an :cpp:`Array4`.  Many of their member
functions can be used in device code as long as they have been constructed
in device memory.  Some of the device member functions include
:cpp:`array`, :cpp:`dataPtr`, :cpp:`box`, :cpp:`nComp`, and :cpp:`setVal`.

All :cpp:`BaseFab` objects in a :cpp:`FabArray`, including
:cpp:`IArrayBox` and :cpp:`FArrayBox`, which are derived from
:cpp:`BaseFab`, are allocated in CPU memory, while the array data they
contain are allocated in either device or managed memory.  We cannot pass
a :cpp:`BaseFab` object by value because it does not have a copy
constructor.  However, we can make an :cpp:`Array4` using the member
function :cpp:`BaseFab::array()`, and pass it by value to GPU kernels.  In
GPU device code, we can use the :cpp:`Array4` or, if necessary, make an
alias :cpp:`BaseFab` from the :cpp:`Array4`.  For example,

.. highlight:: c++

::

   AMREX_GPU_HOST_DEVICE void g (FArrayBox& fab) { ... }

   AMREX_GPU_HOST_DEVICE void f (Box const& bx, Array4<Real> const& a)
   {
       FArrayBox fab(a, bx.ixType());
       g(fab);
   }

.. _sec:gpu:classes:elixir:

Elixir
------

We often have temporary :cpp:`FArrayBox`\ es in :cpp:`MFIter` loops.
These objects go out of scope at the end of each iteration.  Because of
the asynchronous nature of GPU kernel execution, their destructors might
get called before their data are used on the GPU.  :cpp:`Elixir` can be
used to extend the life of the data.  For example,

.. highlight:: c++

::

   for (MFIter mfi(mf); mfi.isValid(); ++mfi)
   {
       const Box& bx = mfi.tilebox();
       FArrayBox tmp_fab(bx, numcomps);
       Elixir tmp_eli = tmp_fab.elixir();
       Array4<Real> const& tmp_arr = tmp_fab.array();

       // GPU kernels using the temporary
   }

Without :cpp:`Elixir`, the code above would likely cause memory errors
because the temporary :cpp:`FArrayBox` is deleted on the CPU before the
GPU kernels use its memory.  With :cpp:`Elixir`, ownership of the memory
is transferred to the :cpp:`Elixir` object, which is guaranteed to be
async-safe.

Async Arena
-----------

CUDA 11.2 introduced a stream-ordered memory allocator.  This feature
enables AMReX to solve the temporary memory allocation and deallocation
issue discussed above using a memory pool.  Instead of using
:cpp:`Elixir`, we can write code as shown below.
.. highlight:: c++

::

   for (MFIter mfi(mf); mfi.isValid(); ++mfi)
   {
       const Box& bx = mfi.tilebox();
       FArrayBox tmp_fab(bx, numcomps, The_Async_Arena());
       Array4<Real> const& tmp_arr = tmp_fab.array();

       // GPU kernels using the temporary
   }

This is now the recommended approach because it is usually more efficient
than :cpp:`Elixir`.  Note that the code above also works for CUDA older
than 11.2, HIP and SYCL, where it is equivalent to using :cpp:`Elixir`.
By default, the release threshold for the memory pool is unlimited.  One
can adjust it with the :cpp:`ParmParse` parameter
``amrex.the_async_arena_release_threshold``.

.. _sec:gpu:launch:

Kernel Launch
=============

In this section, how to offload work to the GPU will be demonstrated.
AMReX supports offloading work with CUDA, HIP, SYCL, OpenACC, or OpenMP.
When using CUDA, HIP, or SYCL, AMReX provides users with portable C++
function calls or C++ macros that launch a user-defined lambda function.
When compiled without CUDA/HIP/SYCL, the lambda function is run on the
CPU.  When compiled with CUDA/HIP/SYCL, the launch function prepares and
launches the lambda function on the GPU.  The preparation includes
calculating the appropriate number of blocks and threads, selecting the
CUDA stream, HIP stream or SYCL queue, and defining the appropriate work
chunk for each GPU thread.  When using OpenACC or OpenMP offloading
pragmas, users add the appropriate pragmas to their work loops and
functions to offload to the GPU.  These work in conjunction with AMReX's
internal CUDA-based memory management, described earlier, to ensure the
required data is available on the GPU when the offloaded function is
executed.

The available launch schemes are presented here in three categories:
launching nested loops over Boxes or 1D arrays, launching generic work,
and launching using OpenACC or OpenMP pragmas.  The latest versions of the
examples used in this section of the documentation can be found in the
AMReX source code in the `Launch`_ tutorials.  Users should also refer to
Chapter :ref:`Chap:Basics` as needed for information about basic AMReX
classes.

.. _`Launch`: https://amrex-codes.github.io/amrex/tutorials_html/GPU_Tutorial.html#launch

AMReX also recommends writing primary floating point operation kernels in
C++ using AMReX's :cpp:`Array4` object syntax.  It provides a
multi-dimensional array syntax, similar in appearance to Fortran, while
maintaining performance.  The details can be found in the Array4 and C++
kernel discussions of the Basics chapter.

.. Overview table???

.. _sec:gpu:for:

Launching C++ nested loops
--------------------------

The most common AMReX work construct is a set of nested loops over the
cells in a box.  AMReX provides C++ functions and macro equivalents to
port nested loops efficiently onto the GPU.  There are three different
nested loop GPU launches: a 4D launch for work over a box and a number of
components, a 3D launch for work over a box, and a 1D launch for work over
an arbitrary number of elements.  Each of these launches provides a
performance portable set of nested loops for both CPU and GPU
applications.

These loop launches should only be used when each iteration of the nested
loop is independent of the other iterations.  Accordingly, they are marked
with ``AMREX_PRAGMA_SIMD`` when running on the CPU, and they should only
be used for ``simd``-capable nested loops.  Calculations that cannot
vectorize should be rewritten wherever possible to allow efficient
utilization of GPU hardware.
However, it is important for applications to use these launches whenever
appropriate because they contain optimizations for both the CPU and GPU
variations of nested loops.  For example, on the GPU the spatial
coordinate loops are reduced to a single loop and the component loop is
moved to the innermost position.  AMReX's launch functions apply the
appropriate optimizations for compiling both with and without GPU support
in a compact and readable format.

AMReX also provides a variation of the launch function that is
implemented as a C++ macro.  It behaves identically to the function, but
hides the lambda function from the user.  There are some subtle
differences between the two implementations that will be discussed.  It
is up to the user to select which version they would like to use.  For
simplicity, the function variation will be discussed throughout the rest
of this documentation; however, all code snippets will also include the
macro variation for reference.

A 4D example of the launch function, :cpp:`amrex::ParallelFor`, is given
here:

.. highlight:: c++

::

   int ncomp = mf.nComp();
   for (MFIter mfi(mf,TilingIfNotGPU()); mfi.isValid(); ++mfi)
   {
       const Box& bx = mfi.tilebox();
       Array4<Real> const& fab = mf.array(mfi);
       amrex::ParallelFor(bx, ncomp,
       [=] AMREX_GPU_DEVICE (int i, int j, int k, int n)
       {
           fab(i,j,k,n) += 1.;
       });

       /* MACRO VARIATION:                              /
       /                                                /
       / AMREX_PARALLEL_FOR_4D ( bx, ncomp, i, j, k, n, /
       / {                                              /
       /     fab(i,j,k,n) += 1.;                        /
       / });                                            */
   }

This code works whether it is compiled for GPUs or CPUs.
:cpp:`TilingIfNotGPU()` returns ``false`` in the GPU case to turn off
tiling and maximize the amount of work given to the GPU in each launch.
When tiling is off, :cpp:`tilebox()` returns the :cpp:`validbox()`.  The
:cpp:`BaseFab::array()` function returns a lightweight :cpp:`Array4`
object that defines access to the underlying :cpp:`FArrayBox` data.  The
:cpp:`Array4` is then captured by the C++ lambda function passed to the
launch function.

``amrex::ParallelFor()`` expands into different variations of a
quadruply-nested :cpp:`for` loop depending on the dimensionality and
whether it is being run on the CPU or GPU.  The best way to understand
this function is to take a look at the 4D :cpp:`amrex::ParallelFor` that
is implemented when AMReX is compiled without GPU support, such as
``USE_CUDA=FALSE``.  A simplified version is reproduced here:

.. highlight:: c++

::

   void ParallelFor (Box const& box, int ncomp, /* LAMBDA FUNCTION */)
   {
       const Dim3 lo = amrex::lbound(box);
       const Dim3 hi = amrex::ubound(box);

       for (int n = 0; n < ncomp; ++n) {
           for (int z = lo.z; z <= hi.z; ++z) {
           for (int y = lo.y; y <= hi.y; ++y) {
           AMREX_PRAGMA_SIMD
           for (int x = lo.x; x <= hi.x; ++x) {
               /* LAUNCH LAMBDA FUNCTION (x,y,z,n) */
           }}}
       }
   }

:cpp:`amrex::ParallelFor` takes a :cpp:`Box` and a number of components,
which define the bounds of the quadruply-nested :cpp:`for` loop, and a
lambda function to run on each iteration of the nested loop.  The lambda
function takes the loop iterators as parameters, allowing the current
cell to be indexed in the lambda.  In addition to the loop indices, the
lambda function captures any necessary objects defined in the local
scope.  CUDA lambda functions can only capture by value, as the captured
information must be copied onto the device.  In this example, the lambda
function captures an :cpp:`Array4` object, ``fab``, that defines how to
access the :cpp:`FArrayBox`.  The lambda uses ``fab`` to increment the
value of each cell within the :cpp:`Box` ``bx``.  If AMReX is compiled
with GPU support, this incrementation is performed on the GPU, with GPU
optimized loops.

This 4D launch can also be used to work over any sequential set of
components, by passing the number of consecutive components and adding
the component iterator to the starting component:
:cpp:`fab(i,j,k,n_start+n)`.
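For instance, a sketch that updates ``ncomp_sub`` consecutive components
starting at component ``scomp`` (both names and values are illustrative):

.. highlight:: c++

::

   int scomp = 1;      // starting component (illustrative)
   int ncomp_sub = 2;  // number of consecutive components (illustrative)
   amrex::ParallelFor(bx, ncomp_sub,
   [=] AMREX_GPU_DEVICE (int i, int j, int k, int n)
   {
       fab(i,j,k,scomp+n) += 1.;
   });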
The 3D variation of the loop launch does not include a component loop and
has the syntax shown here:

.. highlight:: c++

::

   for (MFIter mfi(mf,TilingIfNotGPU()); mfi.isValid(); ++mfi)
   {
       const Box& bx = mfi.tilebox();
       Array4<Real> const& fab = mf.array(mfi);
       amrex::ParallelFor(bx,
       [=] AMREX_GPU_DEVICE (int i, int j, int k)
       {
           fab(i,j,k) += 1.;
       });

       /* MACRO VARIATION:                    /
       /                                      /
       / AMREX_PARALLEL_FOR_3D ( bx, i, j, k, /
       / {                                    /
       /     fab(i,j,k) += 1.;                /
       / });                                  */
   }

Finally, a 1D version is available for looping over a number of elements,
such as particles.  An example of a 1D function launch is given here:

.. highlight:: c++

::

   for (MFIter mfi(mf); mfi.isValid(); ++mfi)
   {
       FArrayBox& fab = mf[mfi];
       Real* AMREX_RESTRICT p = fab.dataPtr();
       const long nitems = fab.box().numPts() * fab.nComp();
       amrex::ParallelFor(nitems,
       [=] AMREX_GPU_DEVICE (long idx)
       {
           p[idx] += 1.;
       });

       /* MACRO VARIATION:                    /
       /                                      /
       / AMREX_PARALLEL_FOR_1D ( nitems, idx, /
       / {                                    /
       /     p[idx] += 1.;                    /
       / });                                  */
   }

Instead of passing an :cpp:`Array4`, :cpp:`FArrayBox::dataPtr()` is
called to obtain a pointer to the :cpp:`FArrayBox` data.  This is an
alternative way to access the :cpp:`FArrayBox` data on the GPU.  Instead
of passing a :cpp:`Box` to define the loop bounds, a :cpp:`long` or
:cpp:`int` number of elements is passed to bound the single :cpp:`for`
loop.  This construct can be used to work on any contiguous set of memory
by passing the number of elements to work on and indexing the pointer to
the starting element: :cpp:`p[idx + 15]`.

GPU block size
--------------

By default, :cpp:`ParallelFor` launches ``AMREX_GPU_MAX_THREADS`` threads
per GPU block, where ``AMREX_GPU_MAX_THREADS`` is a compile-time constant
with a default value of 256.  Users can also explicitly specify the
number of threads per block with :cpp:`ParallelFor<MY_BLOCK_SIZE>(...)`,
where ``MY_BLOCK_SIZE`` is a multiple of the warp size (e.g., 128).  This
allows users to tune the performance of individual kernels.

Launching general kernels
-------------------------

To launch more general work on the GPU, AMReX provides a standard launch
function: :cpp:`amrex::launch`.  Instead of creating nested loops, this
function prepares the device launch based on a :cpp:`Box`, launches an
appropriately sized GPU kernel and constructs a thread :cpp:`Box` that
defines the work for each thread.  On the CPU, the thread :cpp:`Box` is
set equal to the total launch :cpp:`Box`, so tiling works as expected.
On the GPU, the thread :cpp:`Box` usually contains a single cell to allow
all GPU threads to be utilized effectively.  An example of a generic
function launch is shown here:

.. highlight:: c++

::

   for (MFIter mfi(mf,TilingIfNotGPU()); mfi.isValid(); ++mfi)
   {
       const Box& bx = mfi.tilebox();
       Array4<Real> const& arr = mf.array(mfi);
       amrex::launch(bx,
       [=] AMREX_GPU_DEVICE (Box const& tbx)
       {
           plusone_array4(tbx, arr);
           FArrayBox fab(arr, tbx.ixType());
           plusone_fab(tbx, fab); // this version takes FArrayBox
       });

       /* MACRO VARIATION                                   /
       /                                                    /
       / AMREX_LAUNCH_DEVICE_LAMBDA ( bx, tbx,              /
       / {                                                  /
       /     plusone_array4(tbx, arr);                      /
       /     plusone_fab(tbx, FArrayBox(arr,tbx.ixType())); /
       / });                                                */
   }

It also shows how to make an :cpp:`FArrayBox` from an :cpp:`Array4` when
needed.  Note that :cpp:`FArrayBox`\ es cannot be passed to GPU kernels
directly.
:cpp:`TilingIfNotGPU()` returns ``false`` in the GPU case to turn off
tiling and maximize the amount of work given to the GPU in each launch,
which substantially improves performance.  When tiling is off,
:cpp:`tilebox()` returns the :cpp:`validbox()` of the :cpp:`FArrayBox`
for that iteration.

Offloading work using OpenACC or OpenMP pragmas
-----------------------------------------------

When using OpenACC or OpenMP with AMReX, the GPU offloading work is done
with pragmas placed on the nested loops.  This leaves the :cpp:`MFIter`
loop largely unchanged.  An example GPU pragma based :cpp:`MFIter` loop
that calls a Fortran function is given here:

.. highlight:: c++

::

   for (MFIter mfi(mf,TilingIfNotGPU()); mfi.isValid(); ++mfi)
   {
       const Box& bx = mfi.tilebox();
       FArrayBox& fab = mf[mfi];

       plusone_acc(BL_TO_FORTRAN_BOX(bx),
                   BL_TO_FORTRAN_ANYD(fab));
   }

The function ``plusone_acc`` is a CPU host function.  The
:cpp:`FArrayBox` reference from :cpp:`operator[]` is a reference to an
:cpp:`FArrayBox` in host memory whose data has been placed in GPU memory.
``BL_TO_FORTRAN_BOX`` and ``BL_TO_FORTRAN_ANYD`` behave identically to
the implementations used on the CPU.  These macros return the individual
components of the AMReX C++ objects to allow passing to the Fortran
function.  The corresponding OpenACC labelled loop in ``plusone_acc`` is:

.. highlight:: fortran

::

   !dat = pointer to fab's GPU data
   !$acc kernels deviceptr(dat)
   do k = lo(3), hi(3)
     do j = lo(2), hi(2)
       do i = lo(1), hi(1)
          dat(i,j,k) = dat(i,j,k) + 1.0_amrex_real
       end do
     end do
   end do
   !$acc end kernels

Since the data pointer passed to ``plusone_acc`` points to device memory,
OpenACC can be told the data is available on the device using the
``deviceptr`` construct.  For further details about OpenACC programming,
consult the OpenACC user's guide.

The OpenMP implementation of this loop is similar; only the pragmas
change to obtain the proper offloading.  The OpenMP labelled version of
this loop is:

.. highlight:: fortran

::

   !dat = pointer to fab's GPU data
   !$omp target teams distribute parallel do collapse(3) schedule(static,1) is_device_ptr(dat)
   do k = lo(3), hi(3)
     do j = lo(2), hi(2)
       do i = lo(1), hi(1)
          dat(i,j,k) = dat(i,j,k) + 1.0_amrex_real
       end do
     end do
   end do

In this case, ``is_device_ptr`` is used to indicate that ``dat`` is
available in device memory.  For further details about programming with
OpenMP for GPU offloading, consult the OpenMP user's guide.

Kernel launch details
---------------------

CUDA (and HIP) kernel calls are asynchronous: they return before the
kernel finishes on the GPU.  So the :cpp:`MFIter` loop finishes iterating
on the CPU and is ready to move on to the next work before the actual
work completes on the GPU.  To guarantee consistency, there is an
implicit device synchronization (a GPU barrier) in the destructor of
:cpp:`MFIter`.  This ensures that all GPU work inside of an :cpp:`MFIter`
loop will complete before code outside of the loop is executed.  Any
kernel launches made outside of an :cpp:`MFIter` loop must ensure
appropriate device synchronization occurs.  This can be done by calling
:cpp:`Gpu::streamSynchronize()`.

CUDA and HIP support multiple streams; kernels launched in the same
stream are executed sequentially, but kernels launched in different
streams may run in parallel.  For each iteration of :cpp:`MFIter`, AMReX
uses a different GPU stream (up to 4 streams in total).  This allows each
iteration of an :cpp:`MFIter` loop to run independently of the others
while its own kernels run in the expected sequence, maximizing the use of
GPU parallelism.  However, AMReX uses the default GPU stream outside of
:cpp:`MFIter` loops.
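As a sketch of such a launch outside an :cpp:`MFIter` loop, where the
user must synchronize explicitly (assume ``p`` is a device pointer and
``n`` its length, set up elsewhere):

.. highlight:: c++

::

   amrex::ParallelFor(n, [=] AMREX_GPU_DEVICE (int i) noexcept
   {
       p[i] = 0.0;
   });
   Gpu::streamSynchronize();  // no MFIter destructor will synchronize for us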
AMReX's launch macros and functions take a C++ lambda function.  Lambda
functions used for launches on the GPU have some restrictions the user
must understand.  First, the function enclosing the extended lambda must
not have private or protected access within its parent class, otherwise
the code will not compile.  This can be fixed by changing the access of
the enclosing function to public.

Another pitfall that must be considered: if the lambda function accesses
a member of the enclosing class, the lambda actually captures the
:cpp:`this` pointer by value and accesses variables and functions via
:cpp:`this->`.  If the object is not accessible on the GPU, the code will
not work as intended.  For example,

.. highlight:: c++

::

   class MyClass {
   public:
       Box bx;
       int m;  // integer created on the host.

       void f () {
           amrex::launch(bx,
           [=] AMREX_GPU_DEVICE (Box const& tbx)
           {
               printf("m = %d\n", m);  // Failed attempt to use m on the GPU.
           });
       }
   };

The function ``f`` in the code above will not work unless the
:cpp:`MyClass` object is in unified memory.  If it is undesirable to put
the object into unified memory, a local copy of the information can be
created for the lambda to capture.  For example:

.. highlight:: c++

::

   class MyClass {
   public:
       Box bx;
       int m;

       void f () {
           int local_m = m;  // Local temporary copy of m.
           amrex::launch(bx,
           [=] AMREX_GPU_DEVICE (Box const& tbx)
           {
               printf("m = %d\n", local_m);  // Lambda captures local_m by value.
           });
       }
   };

C++ macros have some important limitations.  For example, commas outside
of a set of parentheses are interpreted by the macro, leading to errors
such as:

.. highlight:: c++

::

   AMREX_LAUNCH_DEVICE_LAMBDA (bx, tbx,
   {
       Real a, b;   <---- Error.  Macro reads "{ Real a" as a parameter
                          and "b; }" as another.
       Real a;      <---- OK
       Real b;
   });

One should also avoid using :cpp:`continue` and :cpp:`return` inside the
macros because the body is not an actual :cpp:`for` loop.  Users that
choose to implement the macro launches should be aware of the limitations
of C++ preprocessing macros to ensure GPU offloading is done properly.

Finally, AMReX's most common CPU threading strategy for GPU/CPU systems
is to utilize OpenMP threads to maintain multi-threaded parallelism on
work chosen to run on the host.  This means OpenMP pragmas should be
maintained where CPU work is performed and usually turned off where work
is offloaded onto the GPU.  OpenMP pragmas can be turned off using the
conditional pragma and :cpp:`Gpu::notInLaunchRegion()`, as shown below:

.. highlight:: c++

::

   #ifdef AMREX_USE_OMP
   #pragma omp parallel if (Gpu::notInLaunchRegion())
   #endif

It is generally expected that simply using OpenMP threads to launch GPU
work more quickly will show little improvement, or may even perform
worse.  So this conditional statement should be added to MFIter loops
that contain GPU work, unless users specifically test the performance or
are designing more complex workflows that require OpenMP.

.. _sec:gpu:stream:

Stream and Synchronization
==========================

As mentioned in Section :ref:`sec:gpu:overview`, AMReX uses a number of
GPU streams that are either CUDA streams, HIP streams or SYCL queues.
Many GPU functions (e.g., :cpp:`ParallelFor` and :cpp:`Gpu::copyAsync`)
are asynchronous with respect to the host.
To facilitate the synchronization that is sometimes necessary, AMReX
provides :cpp:`Gpu::streamSynchronize()` and
:cpp:`Gpu::streamSynchronizeAll()` to synchronize the current stream and
all AMReX streams, respectively.  For performance reasons, one should try
to minimize the number of synchronization calls.  For example,

.. highlight:: c++

::

   // The synchronous version is NOT recommended
   Gpu::copy(Gpu::deviceToHost, ....);
   Gpu::copy(Gpu::deviceToHost, ....);
   Gpu::copy(Gpu::deviceToHost, ....);

   // NOT recommended because of unnecessary synchronization
   Gpu::copyAsync(Gpu::deviceToHost, ....);
   Gpu::streamSynchronize();
   Gpu::copyAsync(Gpu::deviceToHost, ....);
   Gpu::streamSynchronize();
   Gpu::copyAsync(Gpu::deviceToHost, ....);
   Gpu::streamSynchronize();

   // recommended
   Gpu::copyAsync(Gpu::deviceToHost, ....);
   Gpu::copyAsync(Gpu::deviceToHost, ....);
   Gpu::copyAsync(Gpu::deviceToHost, ....);
   Gpu::streamSynchronize();

In addition to stream synchronization, there is also
:cpp:`Gpu::synchronize()`, which performs a device-wide synchronization.
However, a device-wide synchronization is usually excessive and it might
interfere with other libraries (e.g., MPI).

.. _sec:gpu:example:

An Example of Migrating to GPU
==============================

The nature of GPU programming poses difficulties for a number of common
AMReX patterns, such as the one below:

.. highlight:: c++

::

   // Given MultiFab uin and uout
   #ifdef AMREX_USE_OMP
   #pragma omp parallel
   #endif
   {
       FArrayBox q;
       for (MFIter mfi(uin,true); mfi.isValid(); ++mfi)
       {
           const Box& tbx = mfi.tilebox();
           const Box& gbx = amrex::grow(tbx,1);

           q.resize(gbx);

           // Do some work with uin[mfi] as input and q as output.
           // The output region is gbx.
           f1(gbx, q, uin[mfi]);

           // Then do more work with q as input and uout[mfi] as output.
           // The output region is tbx.
           f2(tbx, uout[mfi], q);
       }
   }

There are several issues in migrating this code to GPUs that need to be
addressed.  First, functions ``f1`` and ``f2`` have different work
regions (``gbx`` and ``tbx``, respectively) and there are data
dependencies between the two (``q``).  This makes it difficult to put
them into a single GPU kernel, so two separate kernels will be launched,
one for each function.

As we have discussed, AMReX uses multiple CUDA streams, HIP streams or
SYCL queues for launching kernels.  Because ``q`` is used inside the
:cpp:`MFIter` loop, multiple GPU kernels on different streams would be
accessing its data.  This creates a race condition.  One way to fix this
is to move ``FArrayBox q`` inside the loop to make it local to each
iteration and use :cpp:`Elixir` to make it async-safe (see Section
:ref:`sec:gpu:classes:elixir`).  This strategy works well for the GPU.
However, it is not optimal for OpenMP CPU threads when the GPU is not
used, because of the memory allocation inside the OpenMP parallel region.
It turns out that it is actually unnecessary to make ``FArrayBox q``
local to each iteration when :cpp:`Elixir` is used to extend the lifetime
of its floating point data.  The code below shows how to rewrite the
example in a performance portable way.
.. _sec:gpu:example:

An Example of Migrating to GPU
==============================

The nature of GPU programming poses difficulties for a number of common AMReX
patterns, such as the one below:

.. highlight:: c++

::

    // Given MultiFab uin and uout
    #ifdef AMREX_USE_OMP
    #pragma omp parallel
    #endif
    {
        FArrayBox q;
        for (MFIter mfi(uin,true); mfi.isValid(); ++mfi)
        {
            const Box& tbx = mfi.tilebox();
            const Box& gbx = amrex::grow(tbx,1);
            q.resize(gbx);
            // Do some work with uin[mfi] as input and q as output.
            // The output region is gbx.
            f1(gbx, q, uin[mfi]);
            // Then do more work with q as input and uout[mfi] as output.
            // The output region is tbx.
            f2(tbx, uout[mfi], q);
        }
    }

Several issues need to be addressed in migrating this code to GPUs. First,
functions ``f1`` and ``f2`` have different work regions (``gbx`` and ``tbx``,
respectively) and there is a data dependency between them (``q``). This makes
it difficult to put them into a single GPU kernel, so two separate kernels
will be launched, one for each function.

As we have discussed, AMReX uses multiple CUDA streams, HIP streams, or SYCL
queues for launching kernels. Because ``q`` is declared outside the
:cpp:`MFIter` loop, GPU kernels on different streams would access its data
simultaneously. This creates a race condition. One way to fix this is to
move ``FArrayBox q`` inside the loop, making it local to each iteration, and
to use :cpp:`Elixir` to make it async-safe (see Section
:ref:`sec:gpu:classes:elixir`). This strategy works well for GPUs. However,
it is not optimal for OpenMP CPU threads when the GPU is not used, because of
the repeated memory allocation inside the OpenMP parallel region. It turns
out to be unnecessary to make ``FArrayBox q`` local to each iteration when
:cpp:`Elixir` is used to extend the life of its floating point data: on the
GPU each :cpp:`resize` then effectively provides a fresh, async-safe
allocation, while on the CPU the memory is simply reused. The code below
shows how to rewrite the example in a performance portable way.

.. highlight:: c++

::

    // Given MultiFab uin and uout
    #ifdef AMREX_USE_OMP
    #pragma omp parallel if (Gpu::notInLaunchRegion())
    #endif
    {
        FArrayBox q;
        for (MFIter mfi(uin,TilingIfNotGPU()); mfi.isValid(); ++mfi)
        {
            const Box& tbx = mfi.tilebox();
            const Box& gbx = amrex::grow(tbx,1);
            q.resize(gbx);
            Elixir eli = q.elixir();
            Array4<Real> const& qarr = q.array();
            Array4<Real const> const& uinarr = uin.const_array(mfi);
            Array4<Real> const& uoutarr = uout.array(mfi);
            amrex::launch(gbx,
            [=] AMREX_GPU_DEVICE (Box const& b)
            {
                f1(b, qarr, uinarr);
            });
            amrex::launch(tbx,
            [=] AMREX_GPU_DEVICE (Box const& b)
            {
                f2(b, uoutarr, qarr);
            });
        }
    }

.. ===================================================================

.. _sec:gpu:assertion:

Assertions and Error Checking
=============================

To help debugging, we often use :cpp:`amrex::Assert` and :cpp:`amrex::Abort`.
These functions are GPU safe and can be used in GPU kernels (a sketch is
shown at the end of this section). However, implementing these functions
requires additional GPU registers, which reduces overall performance.
Therefore, by default these functions and the macro ``AMREX_ALWAYS_ASSERT``
are no-ops in optimized builds (e.g., ``DEBUG=FALSE`` using the GNU Make
build system) when called from kernels run on the GPU. Calls to these
functions from GPU kernels are active in debug builds and can optionally be
activated at compile time for optimized builds (e.g., ``DEBUG=FALSE`` and
``USE_ASSERTION=TRUE`` using the GNU Make build system).

In CPU code, :cpp:`AMREX_GPU_ERROR_CHECK()` can be called to check the health
of previous GPU launches. This call looks up the return message from the most
recently completed GPU launch and aborts if it was not successful. Many
kernel launch macros, as well as the :cpp:`MFIter` destructor, include a call
to :cpp:`AMREX_GPU_ERROR_CHECK()`. This prevents additional launches from
being issued if a previous launch caused an error and ensures all GPU
launches within an :cpp:`MFIter` loop completed successfully before
continuing work.

However, due to asynchronicity, determining the source of an error can be
difficult. Even if GPU kernels launched earlier in the code result in a CUDA
or HIP error, the error may not be reported at a nearby call to
:cpp:`AMREX_GPU_ERROR_CHECK()` by the CPU. When tracking down a launch error,
:cpp:`Gpu::synchronize()`, :cpp:`Gpu::streamSynchronize()`, or
:cpp:`Gpu::streamSynchronizeAll()` can be used to synchronize the device, the
current GPU stream, or all GPU streams, respectively, and isolate the
specific launch that caused the error. This error-checking macro will not
return any information for SYCL.
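For illustration, a device-side assertion might look like the following
sketch (``bx`` and the :cpp:`Array4` ``rho`` are hypothetical; the assertion
is active only in debug builds or when assertions are enabled at compile
time):

.. highlight:: c++

::

    amrex::ParallelFor(bx,
    [=] AMREX_GPU_DEVICE (int i, int j, int k)
    {
        // Aborts if a non-positive density is encountered (debug builds).
        AMREX_ALWAYS_ASSERT(rho(i,j,k) > 0.0);
    });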
.. ===================================================================

.. _sec:gpu:particle:

Particle Support
================

As with :cpp:`MultiFab`, particle data stored in AMReX ``ParticleContainer``
classes is stored in GPU memory when AMReX is compiled with ``USE_CUDA=TRUE``.
This means that the :cpp:`dataPtr` associated with particles can be passed
into GPU kernels. These kernels can be launched with a variety of approaches,
including CUDA C/Fortran and OpenACC. An example Fortran particle subroutine
offloaded via OpenACC might look like the following:

.. highlight:: fortran

::

    subroutine push_position_boris(np, structs, uxp, uyp, uzp, gaminv, dt)

      use em_particle_module, only : particle_t
      use amrex_fort_module, only : amrex_real
      implicit none

      integer,          intent(in), value :: np
      type(particle_t), intent(inout)     :: structs(np)
      real(amrex_real), intent(in)        :: uxp(np), uyp(np), uzp(np), gaminv(np)
      real(amrex_real), intent(in), value :: dt

      integer :: ip

      !$acc parallel deviceptr(structs, uxp, uyp, uzp, gaminv)
      !$acc loop gang vector
      do ip = 1, np
         structs(ip)%pos(1) = structs(ip)%pos(1) + uxp(ip)*gaminv(ip)*dt
         structs(ip)%pos(2) = structs(ip)%pos(2) + uyp(ip)*gaminv(ip)*dt
         structs(ip)%pos(3) = structs(ip)%pos(3) + uzp(ip)*gaminv(ip)*dt
      end do
      !$acc end loop
      !$acc end parallel

    end subroutine push_position_boris

Note the use of the :fortran:`!$acc parallel deviceptr` clause to specify
which data has been placed in device memory. This instructs OpenACC to treat
those variables as if they already live on the device, bypassing the usual
copies. For complete examples of a particle code that has been ported to GPUs
using CUDA, OpenACC, and OpenMP, please see the tutorial
`Electromagnetic PIC`_.

.. _`Electromagnetic PIC`: https://amrex-codes.github.io/amrex/tutorials_html/Particles_Tutorial.html#electromagneticpic

GPU-aware implementations of many common particle operations are provided
with AMReX, including neighbor list construction and traversal, particle-mesh
deposition and interpolation, parallel reductions of particle data, and a set
of transformation and filtering operations that are useful when operating on
sets of particles. For examples of these features in use, please see
:cpp:`Tests/Particles/`.
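Particle kernels can also be written directly in C++ with
:cpp:`amrex::ParallelFor` over the particles of each tile. Below is a minimal
sketch under assumed names (``pc`` is a hypothetical particle container at
level ``lev``, ``MyParIter`` its iterator type, and the position shift is
purely illustrative):

.. highlight:: c++

::

    for (MyParIter pti(pc, lev); pti.isValid(); ++pti)
    {
        auto& aos = pti.GetArrayOfStructs();
        auto* pstruct = aos().data();    // particle structs in GPU memory
        const int np = pti.numParticles();
        amrex::ParallelFor(np,
        [=] AMREX_GPU_DEVICE (int ip)
        {
            auto& p = pstruct[ip];
            p.pos(0) += 1.0e-3;          // illustrative position update
        });
    }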
Finally, the parallel communication of particle data has been ported and
optimized for performance on GPU platforms. This includes
:cpp:`Redistribute()`, which moves particles back to the proper grids after
their positions have changed, as well as :cpp:`fillNeighbors()` and
:cpp:`updateNeighbors()`, which are used to exchange halo particles. As with
:cpp:`MultiFab` data, these have been designed to minimize host / device
traffic as much as possible, and can take advantage of the CUDA-aware MPI
implementations available on platforms such as ORNL's Summit.

.. _sec:gpu:profiling:

Profiling with GPUs
===================

When profiling for GPUs, AMReX recommends ``nvprof``, NVIDIA's command-line
profiler. ``nvprof`` returns data on how long each kernel launch lasted on
the GPU, the number of threads and registers used, the occupancy of the GPU,
and recommendations for improving the code. For more information on how to
use ``nvprof``, see NVIDIA's User's Guide as well as the help pages of your
favorite supercomputing facility that uses NVIDIA GPUs.

AMReX's internal profilers currently cannot hook into profiling information
on the GPU, and an efficient way to time and retrieve that information is
being explored. In the meantime, AMReX's timers can be used to report some
generic timings that are useful in categorizing an application. Due to the
asynchronous launching of GPU kernels, any AMReX timers inside of
asynchronous regions or inside GPU kernels will not measure useful
information. However, since the :cpp:`MFIter` synchronizes when it is
destroyed, any timer wrapped around an :cpp:`MFIter` loop will yield a
consistent timing of the entire set of GPU launches contained within. For
example:

.. highlight:: c++

::

    BL_PROFILE_VAR("A_NAME", blp);    // Profiling start
    for (MFIter mfi(mf); mfi.isValid(); ++mfi)
    {
        // gpu work
    }
    BL_PROFILE_VAR_STOP(blp);         // Profiling stop

For now, this is the best way to profile GPU codes using ``TinyProfiler``.
If you require further profiling detail, use ``nvprof``.

.. _sec:gpu:performance:

Performance Tips
================

Here are some helpful performance tips to keep in mind when working with
AMReX for GPUs:

* To obtain the best performance when using CUDA kernel launches, all device
  functions called within the launch region should be inlined. Inlined
  functions use substantially fewer registers, freeing up GPU resources to
  perform other tasks. This increases parallel performance and greatly
  reduces runtime. Functions are inlined by putting their definitions in the
  ``.H`` file and using the ``AMREX_FORCE_INLINE`` AMReX macro. Examples can
  be found in the `Launch`_ tutorial. For example (a fuller sketch appears
  after this list):

  .. highlight:: c++

  ::

      AMREX_GPU_DEVICE AMREX_FORCE_INLINE
      void plusone_cudacpp (amrex::Box const& bx, amrex::FArrayBox& fab) { ... }

* Pay attention to which GPUs your job scheduler assigns to each MPI rank.
  In most cases you will achieve the best performance when a single MPI rank
  is assigned to each GPU and has boxes large enough to saturate that GPU's
  compute capacity. There are some cases where multiple MPI ranks per GPU can
  make sense, typically when some portion of your code is not GPU accelerated
  and you want many MPI ranks to speed up that part, but this is the minority
  of cases. For example, on OLCF Summit you would want to ensure that your
  resource sets contain one MPI rank and one GPU each, using
  ``jsrun -n N -a 1 -c 7 -g 1``, where ``N`` is the total number of MPI
  ranks/GPUs you want to use. (See the OLCF `job step viewer
  <https://jobstepviewer.olcf.ornl.gov/>`_ for more information.) Conversely,
  if you choose to have multiple GPUs visible to each MPI rank, AMReX will
  assign MPI ranks to GPUs in round-robin fashion. This may be suboptimal,
  because the round-robin scheme is not aware of locality benefits such as
  having an MPI rank on the same socket as the GPU it manages.
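Expanding on the inlining tip above, a fuller sketch might look like the
following (the function, ``bx``, and the :cpp:`Array4` ``a`` are
hypothetical):

.. highlight:: c++

::

    // In a .H file: a device function forced inline at its call sites.
    AMREX_GPU_DEVICE AMREX_FORCE_INLINE
    void plusone (int i, int j, int k, amrex::Array4<amrex::Real> const& a)
    {
        a(i,j,k) += 1.0;
    }

    // At the call site, the inlined function is used inside a GPU launch.
    amrex::ParallelFor(bx,
    [=] AMREX_GPU_DEVICE (int i, int j, int k)
    {
        plusone(i, j, k, a);
    });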
.. ===================================================================

.. _sec:gpu:parameters:

Inputs Parameters
=================

The following input parameters control the behavior of AMReX when running on
GPUs. They should be prefixed by ``amrex.`` in your :cpp:`inputs` file.

+----------------------------+------------------------------------------------------------------------+-------------+----------+
| Parameter                  | Description                                                            | Type        | Default  |
+============================+========================================================================+=============+==========+
| use_gpu_aware_mpi          | Whether to use GPU memory for communication buffers during MPI calls. | Bool        | 0        |
|                            | If true, the buffers will use device memory. If false (i.e., 0), they |             |          |
|                            | will use pinned memory. In practice, we find it is not always worth   |             |          |
|                            | it to use GPU-aware MPI.                                               |             |          |
+----------------------------+------------------------------------------------------------------------+-------------+----------+
| abort_on_out_of_gpu_memory | If the size of free memory on the GPU is less than the size of a      | Bool        | 0        |
|                            | requested allocation, AMReX will call :cpp:`amrex::Abort()` with an   |             |          |
|                            | error describing how much free memory there is and what was           |             |          |
|                            | requested.                                                             |             |          |
+----------------------------+------------------------------------------------------------------------+-------------+----------+
| the_arena_is_managed       | Whether :cpp:`The_Arena()` allocates managed memory.                  | Bool        | 0        |
+----------------------------+------------------------------------------------------------------------+-------------+----------+
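As a usage sketch, these parameters could be set in an :cpp:`inputs` file as
follows (the values shown are illustrative, not recommendations)::

    amrex.use_gpu_aware_mpi = 1
    amrex.abort_on_out_of_gpu_memory = 1
    amrex.the_arena_is_managed = 0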