Overview of AMReX GPU Strategy

AMReX’s GPU strategy focuses on providing performant GPU support with minimal changes and maximum flexibility. This allows application teams to get running on GPUs quickly while allowing long term performance tuning and programming model selection. AMReX uses the native programming language for GPUs: CUDA for NVIDIA, HIP for AMD and DPC++ for Intel. This will be designated with CUDA/HIP/DPC++ throughout the documentation. However, application teams can also use OpenACC or OpenMP in their individual codes.

At this time, AMReX does not support cross-native language compilation (HIP for non-AMD systems and DPC++ for non Intel systems). It may work with a given version, but AMReX does not track or guarantee such functionality.

When running AMReX on a CPU system, the parallelization strategy is a combination of MPI and OpenMP using tiling, as detailed in MFIter with Tiling. However, tiling is ineffective on GPUs due to the overhead associated with kernel launching. Instead, efficient use of the GPU’s resources is the primary concern. Improving resource efficiency allows a larger percentage of GPU threads to work simultaneously, increasing effective parallelism and decreasing the time to solution.

When running on CPUs, AMReX uses an MPI+X strategy where the X threads are used to perform parallelization techniques, like tiling. The most common X is OpenMP. On GPUs, AMReX requires CUDA/HIP/DPC++ and can be further combined with other parallel GPU languages, including OpenACC and OpenMP, to control the offloading of subroutines to the GPU. This MPI+CUDA+X GPU strategy has been developed to give users the maximum flexibility to find the best combination of portability, readability and performance for their applications.

Presented here is an overview of important features of AMReX’s GPU strategy. Additional information that is required for creating GPU applications is detailed throughout the rest of this chapter:

  • Each MPI rank offloads its work to a single GPU. (MPI ranks == Number of GPUs)

  • Calculations that can be offloaded efficiently to GPUs use GPU threads to parallelize over a valid box at a time. This is done by launching over a large number GPU threads that only work on a few cells each. This work distribution is illustrated in Table 11.

Table 11 Comparison of OpenMP and GPU work distribution. Pictures provided by Mike Zingale and the CASTRO team.

a

b

OpenMP tiled box. OpenMP threads break down the valid box into two large boxes (blue and orange). The lo and hi of one tiled box are marked.

GPU threaded box. Each GPU thread works on a few cells of the valid box. This example uses one cell per thread, each thread using a box with lo = hi.

  • C++ macros and GPU extended lambdas are used to provide performance portability while making the code as understandable as possible to science-focused code teams.

  • AMReX utilizes GPU managed memory to automatically handle memory movement for mesh and particle data. Simple data structures, such as IntVects can be passed by value and complex data structures, such as FArrayBoxes, have specialized AMReX classes to handle the data movement for the user. Tests have shown CUDA managed memory to be efficient and reliable, especially when applications remove any unnecessary data accesses.

  • Application teams should strive to keep mesh and particle data structures on the GPU for as long as possible, minimizing movement back to the CPU. This strategy lends itself to AMReX applications readily; the mesh and particle data can stay on the GPU for most subroutines except for of redistribution, communication and I/O operations.

  • AMReX’s GPU strategy is focused on launching GPU kernels inside AMReX’s MFIter and ParIter loops. By performing GPU work within MFIter and ParIter loops, GPU work is isolated to independent data sets on well-established AMReX data objects, providing consistency and safety that also matches AMReX’s coding methodology. Similar tools are also available for launching work outside of AMReX loops.

  • AMReX further parallelizes GPU applications by utilizing streams. Streams guarantee execution order of kernels within the same stream, while allowing different streams to run simultaneously. AMReX places each iteration of MFIter loops on separate streams, allowing each independent iteration to be run simultaneously and sequentially, while maximizing GPU usage.

    The AMReX implementation of streams is illustrated in Fig. 12. The CPU runs the first iteration of the MFIter loop (blue), which contains three GPU kernels. The kernels begin immediately in GPU Stream 1 and run in the same order they were added. The second (red) and third (green) iterations are similarly launched in Streams 2 and 3. The fourth (orange) and fifth (purple) iterations require more GPU resources than remain, so they have to wait until resources are freed before beginning. Meanwhile, after all the loop iterations are launched, the CPU reaches a synchronize in the MFIter’s destructor and waits for all GPU launches to complete before continuing.

  • The Fortran interface of AMReX does not currently have GPU support. AMReX recommends porting Fortran code to C++ when coding for GPUs.

_images/Streams.png

Fig. 12 Timeline illustration of GPU streams. Illustrates the case of an MFIter loop of five iterations with three GPU kernels each being ran with three GPU streams.

Building GPU Support

Building with GNU Make

To build AMReX with GPU support, add USE_CUDA=TRUE, USE_HIP=TRUE or USE_DPCPP=TRUE to the GNUmakefile or as a command line argument.

AMReX does not require OpenACC, but application codes can use them if they are supported by the compiler. For OpenACC support, add USE_ACC=TRUE. PGI, Cray and GNU compilers support OpenACC. Thus, for OpenACC, you must use COMP=pgi, COMP=cray or COMP=gnu.

Currently, only IBM is supported with OpenMP offloading. To use OpenMP offloading, make with USE_OMP_OFFLOAD=TRUE.

Compiling AMReX with CUDA requires compiling the code through NVIDIA’s CUDA compiler driver in addition to the standard compiler. This driver is called nvcc and it requires a host compiler to work through. The default host compiler for NVCC is GCC even if COMP is set to a different compiler. One can change this by setting NVCC_HOST_COMP. For example, COMP=pgi alone will compile C/C++ codes with NVCC/GCC and Fortran codes with PGI, and link with PGI. Using COMP=pgi and NVCC_HOST_COMP=pgi will compile C/C++ codes with PGI and NVCC/PGI.

You can use amrex-tutorials/ExampleCodes/Basic/HelloWorld_C/ to test your programming environment. For example, building with:

make COMP=gnu USE_CUDA=TRUE

should produce an executable named main3d.gnu.DEBUG.CUDA.ex. You can run it and that will generate results like:

$ ./main3d.gnu.DEBUG.CUDA.ex
Initializing CUDA...
CUDA initialized with 1 GPU
AMReX (19.06-404-g0455b168b69c-dirty) initialized
Hello world from AMReX version 19.06-404-g0455b168b69c-dirty
Total GPU global memory (MB): 6069
Free  GPU global memory (MB): 5896
[The         Arena] space (MB): 4552
[The Managed Arena] space (MB): 8
[The  Pinned Arena] space (MB): 8
AMReX (19.06-404-g0455b168b69c-dirty) finalized

Building with CMake

Enabling CUDA support

To build AMReX with CUDA support in CMake, add -DAMReX_GPU_BACKEND=CUDA to the cmake invocation. For a full list of CUDA-specific configuration options, check the table below.

Table 12 AMReX CUDA-specific build options

Variable Name

Description

Default

Possible values

AMReX_CUDA_ARCH

CUDA target architecture

Auto

User-defined

AMReX_CUDA_FASTMATH

Enable CUDA fastmath library

YES

YES, NO

AMReX_CUDA_BACKTRACE

Host function symbol names (e.g. cuda-memcheck)

Auto

YES, NO

AMReX_CUDA_COMPILATION_TIMER

CSV table with time for each compilation phase

NO

YES, NO

AMReX_CUDA_DEBUG

Device debug information (optimizations: off)

YES: Debug

YES, NO

AMReX_CUDA_ERROR_CAPTURE_THIS

Error if a CUDA lambda captures a class’ this

NO

YES, NO

AMReX_CUDA_ERROR_CROSS _EXECUTION_SPACE_CALL

Error if a host function is called from a host device function

NO

YES, NO

AMReX_CUDA_KEEP_FILES

Keep intermediately files (folder: nvcc_tmp)

NO

YES, NO

AMReX_CUDA_LTO

Enable CUDA link-time-optimization

NO

YES, NO

AMReX_CUDA_MAX_THREADS

Max number of CUDA threads per block

256

User-defined

AMReX_CUDA_MAXREGCOUNT

Limits the number of CUDA registers available

255

User-defined

AMReX_CUDA_PTX_VERBOSE

Verbose code generation statistics in ptxas

NO

YES, NO

AMReX_CUDA_SHOW_CODELINES

Source information in PTX (optimizations: on)

Auto

YES, NO

AMReX_CUDA_SHOW_LINENUMBERS

Line-number information (optimizations: on)

Auto

YES, NO

AMReX_CUDA_WARN_CAPTURE_THIS

Warn if a CUDA lambda captures a class’ this

YES

YES, NO

The target architecture to build for can be specified via the configuration option -DAMReX_CUDA_ARCH=<target-architecture>, where <target-architecture> can be either the name of the NVIDIA GPU generation, i.e. Turing, Volta, Ampere, ... , or its compute capability, i.e. 10.0, 9.0, ... . For example, on Cori GPUs you can specify the architecture as follows:

cmake [options] -DAMReX_GPU_BACKEND=CUDA -DAMReX_CUDA_ARCH=Volta /path/to/amrex/source

If no architecture is specified, CMake will default to the architecture defined in the environment variable AMREX_CUDA_ARCH (note: all caps). If the latter is not defined, CMake will try to determine which GPU architecture is supported by the system. If more than one is found, CMake will build for all of them. If autodetection fails, a list of “common” architectures is assumed. Multiple CUDA architectures can also be set manually as semicolon-separated list, e.g. -DAMReX_CUDA_ARCH=7.0;8.0. Building for multiple CUDA architectures will generally result in a larger library and longer build times.

Note that AMReX supports NVIDIA GPU architectures with compute capability 6.0 or higher and CUDA Toolkit version 9.0 or higher.

In order to import the CUDA-enabled AMReX library into your CMake project, you need to include the following code into the appropriate CMakeLists.txt file:

# Find CUDA-enabled AMReX installation
find_package(AMReX REQUIRED CUDA)

If instead of using an external installation of AMReX you prefer to include AMReX as a subproject in your CMake setup, we strongly encourage you to use the AMReX_SetupCUDA module as shown below if the CMake version is less than 3.20:

# Enable CUDA in your CMake project
enable_language(CUDA)

# Include the AMReX-provided CUDA setup module -- OBSOLETE with CMake >= 3.20
if(CMAKE_VERSION VERSION_LESS 3.20)
    include(AMReX_SetupCUDA)
endif()

# Include AMReX source directory ONLY AFTER the two steps above
add_subdirectory(/path/to/amrex/source/dir)

To ensure consistency between CUDA-enabled AMReX and any CMake target that links against it, we provide the helper function setup_target_for_cuda_compilation():

# Set all sources for my_target
target_sources(my_target source1 source2 source3 ...)

# Setup my_target to be compiled with CUDA and be linked against CUDA-enabled AMReX
# MUST be done AFTER all sources have been assigned to my_target
setup_target_for_cuda_compilation(my_target)

# Link against amrex
target_link_libraries(my_target AMReX::amrex)

Enabling HIP Support

To build AMReX with HIP support in CMake, add -DAMReX_GPU_BACKEND=HIP -DAMReX_AMD_ARCH=<target-arch> -DCMAKE_CXX_COMPILER=<your-hip-compiler> to the cmake invocation. If you don’t need Fortran features (AMReX_FORTRAN=OFF), it is recommended to use AMD’s clang++ as the HIP compiler. (Please see these issues for reference in rocm/HIP <= 4.2.0 [1] [2].)

In AMReX CMake, the HIP compiler is treated as a special C++ compiler and therefore the standard CMake variables used to customize the compilation process for C++, for example CMAKE_CXX_FLAGS, can be used for HIP as well.

Since CMake does not support autodetection of HIP compilers/target architectures yet, CMAKE_CXX_COMPILER must be set to a valid HIP compiler, i.e. clang++ or hipcc or nvcc, and AMReX_AMD_ARCH to the target architecture you are building for. Thus AMReX_AMD_ARCH and CMAKE_CXX_COMPILER are required user-inputs when AMReX_GPU_BACKEND=HIP. We again read also an environment variable: AMREX_AMD_ARCH (note: all caps) and the C++ compiler can be hinted as always, e.g. with export CXX=$(which clang++). Below is an example configuration for HIP on Tulip:

cmake -S . -B build -DAMReX_GPU_BACKEND=HIP -DCMAKE_CXX_COMPILER=$(which clang++) -DAMReX_AMD_ARCH="gfx906;gfx908"  # [other options]
cmake --build build -j 6

Enabling SYCL Support

To build AMReX with SYCL support in CMake, add -DAMReX_GPU_BACKEND=SYCL -DCMAKE_CXX_COMPILER=<your-sycl-compiler> to the cmake invocation. For a full list of SYCL-specific configuration options, check the table below.

In AMReX CMake, the SYCL compiler is treated as a special C++ compiler and therefore the standard CMake variables used to customize the compilation process for C++, for example CMAKE_CXX_FLAGS, can be used for DPCPP as well.

Since CMake does not support autodetection of SYCL compilers yet, CMAKE_CXX_COMPILER must be set to a valid SYCL compiler. i.e. dpcpp. Thus CMAKE_CXX_COMPILER is a required user-input when AMReX_GPU_BACKEND=SYCL. At this time, the only supported SYCL compiler is dpcpp. Below is an example configuration for SYCL:

cmake -DAMReX_GPU_BACKEND=SYCL -DCMAKE_CXX_COMPILER=$(which dpcpp)  [other options] /path/to/amrex/source
Table 13 AMReX SYCL-specific build options

Variable Name

Description

Default

Possible values

AMReX_DPCPP_AOT

Enable DPCPP ahead-of-time compilation

NO

YES, NO

AMREX_INTEL_ARCH

Specify target if AOT is enabled

None

Gen9, etc.

AMReX_DPCPP_SPLIT_KERNEL

Enable DPCPP kernel splitting

YES

YES, NO

AMReX_DPCPP_ONEDPL

Enable DPCPP’s oneDPL algorithms

NO

YES, NO

Gpu Namespace and Macros

Most GPU related classes and functions are in namespace Gpu, which is inside namespace amrex. For example, the GPU configuration class Device can be referenced to at amrex::Gpu::Device.

For portability, AMReX defines some macros for CUDA function qualifiers and they should be preferred to allow execution with USE_CUDA=FALSE. These include:

#define AMREX_GPU_HOST        __host__
#define AMREX_GPU_DEVICE      __device__
#define AMREX_GPU_GLOBAL      __global__
#define AMREX_GPU_HOST_DEVICE __host__ __device__

Note that when AMReX is not built with CUDA/HIP/DPC++, these macros expand to empty space.

When AMReX is compiled with USE_CUDA=TRUE, the preprocessor macros AMREX_USE_CUDA and AMREX_USE_GPU are defined for conditional programming. When AMReX is compiled with USE_ACC=TRUE, AMREX_USE_ACC is defined. When AMReX is compiled with USE_OMP_OFFLOAD=TRUE, AMREX_USE_OMP_OFFLOAD is defined.

In addition to AMReX’s preprocessor macros, CUDA provides the __CUDA_ARCH__ macro which is only defined when in device code. __CUDA_ARCH__ should be used when a __host__ __device__ function requires separate code for the CPU and GPU implementations.

Memory Allocation

To provide portability and improve memory allocation performance, AMReX provides a number of memory pools. When compiled without GPU support, all Arenas use standard new and delete operators. With GPU support, the Arenas each allocate with a specific type of GPU memory:

Table 14 Memory Arenas

Arena

Memory Type

The_Arena()

managed or device memory

The_Device_Arena()

device memory, could be an alias to The_Arena()

The_Managed_Arena()

managed memory, could be an alias to The_Arena()

The_Pinned_Arena()

pinned memory

The Arena object returned by these calls provides access to two functions:

void* alloc (std::size_t sz);
void free (void* p);

The_Arena() is used for memory allocation of data in BaseFab. By default, it allocates managed memory. This can be changed with a boolean runtime parameter amrex.the_arena_is_managed. Therefore the data in a MultiFab is placed in managed memory by default and is accessible from both CPU host and GPU device. This allows application codes to develop their GPU capability gradually. The behavior of The_Managed_Arena() likewise depends on the amrex.the_arena_is_managed parameter. If amrex.the_arena_is_managed=0, The_Managed_Arena() is a separate pool of managed memory. If amrex.the_arena_is_managed=1, The_Managed_Arena() is simply aliased to The_Arena() to reduce memory fragmentation.

In amrex::Initialize, a large amount of GPU device memory is allocated and is kept in The_Arena(). The default is 3/4 of the total device memory, and it can be changed with a ParmParse parameter, amrex.the_arena_init_size, in the unit of bytes. The default initial size for other arenas is 8388608 (i.e., 8 MB). For The_Managed_Arena() and The_Device_Arena(), it can be changed with amrex.the_managed_arena_init_size and amrex.the_device_arena_init_size, respectively, if they are not an alias to The_Arena(). For The_Pinned_Arena(), it can be changed with amrex.the_pinned_arena_init_size. The user can also specify a release threshold for these arenas. If the memory usage in an arena is below the threshold, the arena will keep the memory for later reuse, otherwise it will try to release memory back to the system if it is not being used. By default, the release threshold for The_Arena() is set to be a huge number that prevents the memory being released automatically, and it can be changed with a parameter, amrex.the_arena_release_threshold. For The_Pinned_Arena(), the default release threshold is the size of the total device memory, and the runtime parameter is amrex.the_pinned_arena_release_threshold. If it is a separate arena, the behavior of The_Device_Area() or The_Managed_Arena() can be changed with amrex.the_device_arena_release_threshold or amrex.the_managed_arena_release_threshold. Note that the units for all the parameter discussed above are bytes. All these areans also have a member function freeUnused() that can be used to manually release unused memory back to the system.

If you want to print out the current memory usage of the Arenas, you can call amrex::Arena::PrintUsage(). When AMReX is built with SUNDIALS turned on, amrex::sundials::The_SUNMemory_Helper() can be provided to SUNDIALS data structures so that they use the appropriate Arena object when allocating memory. For example, it can be provided to the SUNDIALS CUDA vector:

N_Vector x = N_VNewWithMemHelp_Cuda(size, use_managed_memory, *The_SUNMemory_Helper());

GPU Safe Classes and Functions

AMReX GPU work takes place inside of MFIter and particle loops. Therefore, there are two ways classes and functions have been modified to interact with the GPU:

1. A number of functions used within these loops are labelled using AMREX_GPU_HOST_DEVICE and can be called on the device. This includes member functions, such as IntVect::type(), as well as non-member functions, such as amrex::min and amrex::max. In specialized cases, classes are labeled such that the object can be constructed, destructed and its functions can be implemented on the device, including IntVect.

2. Functions that contain MFIter or particle loops have been rewritten to contain device launches. For example, the FillBoundary function cannot be called from device code, but calling it from CPU will launch GPU kernels if AMReX is compiled with GPU support.

Necessary and convenient AMReX functions and objects have been given a device version and/or device access.

In this section, we discuss some examples of AMReX device classes and functions that are important for programming GPUs.

GpuArray, Array1D, Array2D, and Array3D

GpuArray, Array1D, Array2D, and Array3D are trivial types that work on both host and device. They can be used whenever a fixed size array needs to be passed to the GPU or created on GPU. A variety of functions in AMReX return GpuArray and they can be lambda-captured to GPU code. For example, GeometryData::CellSizeArray(), GeometryData::InvCellSizeArray() and Box::length3d() all return GpuArrays.

AsyncArray

Where the GpuArray is a statically-sized array designed to be passed by value onto the device, AsyncArray is a dynamically-sized array container designed to work between the CPU and GPU. AsyncArray stores a CPU pointer and a GPU pointer and coordinates the movement of an array of objects between the two. It can take initial values from the host and move them to the device. It can copy the data from device back to host. It can also be used as scratch space on device.

The call to delete the memory is added to the GPU stream as a callback function in the destructor of AsyncArray. This guarantees the memory allocated in AsyncArray continues to exist after the AsyncArray object is deleted when going out of scope until after all GPU kernels in the stream are completed without forcing the code to synchronize. The resulting AsyncArray class is “async-safe”, meaning it can be safely used in asynchronous code regions that contain both CPU work and GPU launches, including MFIter loops.

AsyncArray is also portable. When built without USE_CUDA, the object only stores and handles the CPU version of the data.

An example using AsyncArray is given below,

Real h_s = 0.0;
AsyncArray<Real> aa_s(&h_s, 1);  // Build AsyncArray of size 1
Real* d_s = aa_s.data();         // Get associated device pointer

for (MFIter mfi(mf); mfi.isValid(); ++mfi)
{
    Vector<Real> h_v = a_cpu_function();
    AsyncArray<Real> aa_v1(h_v.data(), h_v.size());
    Real* d_v1 = aa_v1.data();  // A device copy of the data

    std::size_t n = ...;
    AsyncArray<Real> aa_v2(n);  // Allocate temporary space on device
    Real* d_v2 = aa_v2.data();  // A device pointer to uninitialized data

    ... // gpu kernels using the data pointed by d_v1 and atomically
        // updating the data pointed by d_s.
        // d_v2 can be used as scratch space and for pass data
        // between kernels.

    // If needed, we can copy the data back to host using
    // AsyncArray::copyToHost(host_pointer, number_of_elements);

    // At the end of each loop the compiler inserts a call to the
    // destructor of aa_v* on cpu.  Objects aa_v* are deleted, but
    // their associated memory pointed by d_v* is not deleted
    // immediately until the gpu kernels in this loop finish.
}

aa_s.copyToHost(&h_s, 1); // Copy the value back to host

Gpu Vectors

AMReX also provides a number of dynamic vectors for use with GPU kernels. These are configured to use the different AMReX memory Arenas, as summarized below. By using the memory Arenas, we can avoid expensive allocations and deallocations when (for example) resizing vectors.

Table 15 Memory Arenas Associated with each Gpu Vector

Vector

Arena

DeviceVector

The_Arena()

HostVector

The_Pinned_Arena()

ManagedVector

The_Managed_Arena()

These classes behave identically to an amrex::Vector, (see Vector, Array, GpuArray, Array1D, Array2D, and Array3D), except that they can only hold “plain-old-data” objects (e.g. Reals, integers, amrex Particles, etc… ). If you want a resizable vector that doesn’t use a memory Arena, simply use amrex::Vector.

Note that, even if the data in the vector is managed and available on GPUs, the member functions of e.g. Gpu::ManagedVector are not. To use the data on the GPU, it is necessary to pass the underlying data pointer in to the GPU kernels. The managed data pointer can be accessed using the data() member function.

Be aware: resizing of dynamically allocated memory on the GPU is unsupported. All resizing of the vector should be done on the CPU, in a manner that avoids race conditions with concurrent GPU kernels.

Also note: Gpu::ManagedVector is not async-safe. It cannot be safely constructed inside of an MFIter loop with GPU kernels and great care should be used when accessing Gpu::ManagedVector data on GPUs to avoid race conditions.

MultiFab Reductions

AMReX provides functions for performing standard reduction operations on MultiFabs, including MultiFab::sum and MultiFab::max. When USE_CUDA=TRUE, these functions automatically implement the corresponding reductions on GPUs in an efficient manner.

Function template ParReduce can be used to implement user-defined reduction functions over MultiFabs. For example, the following function computes the sum of total kinetic energy using the data in a MultiFab storing the mass and momentum density.

Real compute_ek (MultiFab const& mf)
{
    auto const& ma = mf.const_arrays();
    return ParReduce(TypeList<ReduceOpSum>{}, TypeList<Real>{},
                     mf, IntVect(0), // zero ghost cells
           [=] AMREX_GPU_DEVICE (int box_no, int i, int j, int k)
               noexcept -> GpuTuple<Real>
           {
               Array4<Real const> const& a = ma[box_no];
               Real rho = a(i,j,k,0);
               Real rhovx = a(i,j,k,1);
               Real rhovy = a(i,j,k,2);
               Real rhovz = a(i,j,k,3);
               Real ek = (rhovx*rhovx+rhovy*rhovy+rhovz*rhovz)/(2.*rho);
               return { ek };
           });
}

As another example, the following function computes the max- and 1-norm of a MultiFab in the masked region specified by an iMultiFab.

GpuTuple<Real,Real> compute_norms (MultiFab const& mf,
                                   iMulitiFab const& mask)
{
    auto const& data_ma = mf.const_arrays();
    auto const& mask_ma = mask.const_arrays();
    return ParReduce(TypeList<ReduceOpMax,ReduceOpSum>{},
                     TypeList<Real,Real>{},
                     mf, IntVect(0), // zero ghost cells
           [=] AMREX_GPU_DEVICE (int box_no, int i, int j, int k)
               noexcept -> GpuTuple<Real,Real>
           {
               if (mask_ma[box_no](i,j,k)) {
                   Real a = amrex::Math::abs(data_ma[box_no](i,j,k));
                   return { a, a };
               } else {
                   return { 0., 0. };
               }
           });
}

It should be noted that the reduction result of ParReduce is local and it is the user’s responsibility if MPI communication is needed.

Box, IntVect and IndexType

In AMReX, Box, IntVect and IndexType are classes for representing indices. These classes and most of their member functions, including constructors and destructors, have both host and device versions. They can be used freely in device code.

Geometry

AMReX’s Geometry class is not a GPU safe class. However, we often need to use geometric information such as cell size and physical coordinates in GPU kernels. We can use the following member functions and pass the returned values to GPU kernels:

GpuArray<Real,AMREX_SPACEDIM> ProbLoArray () const noexcept;
GpuArray<Real,AMREX_SPACEDIM> ProbHiArray () const noexcept;
GpuArray<int,AMREX_SPACEDIM> isPeriodicArray () const noexcept;
GpuArray<Real,AMREX_SPACEDIM> CellSizeArray () const noexcept;
GpuArray<Real,AMREX_SPACEDIM> InvCellSizeArray () const noexcept;

Alternatively, we can copy the data into a GPU safe class that can be passed by value to GPU kernels. This class is called GeometryData, which is created by calling Geometry::data(). The accessor functions of GeometryData are identical to Geometry.

BaseFab, FArrayBox, IArrayBox

BaseFab<T>, IArrayBox and FArrayBox have some GPU support. They cannot be constructed in device code unless they are constructed as an alias to Array4. Many of their member functions can be used in device code as long as they have been constructed in device memory. Some of the device member functions include array, dataPtr, box, nComp, and setVal.

All BaseFab<T> objects in FabArray<FAB> are allocated in CPU memory, including IArrayBox and FArrayBox, which are derived from BaseFab, although the array data contained are allocated in managed memory. We cannot pass a BaseFab object by value because they do not have copy constructor. However, we can make an Array4 using member function BaseFab::array(), and pass it by value to GPU kernels. In GPU device code, we can use Array4 or, if necessary, we can make an alias BaseFab from an Array4. For example,

AMREX_GPU_HOST_DEVICE void g (FArrayBox& fab) { ... }

AMREX_GPU_HOST_DEVICE void f (Box const& bx, Array4<Real> const& a)
{
  FArrayBox fab(a,bx.ixType());
  g(fab);
}

Elixir

We often have temporary FArrayBoxes in MFIter loops. These objects go out of scope at the end of each iteration. Because of the asynchronous nature of GPU kernel execution, their destructors might get called before their data are used on GPU. Elixir can be used to extend the life of the data. For example,

for (MFIter mfi(mf); mfi.isValid(); ++mfi) {
  const Box& bx = mfi.tilebox();
  FArrayBox tmp_fab(bx, numcomps);
  Elixir tmp_eli = tmp_fab.elixir();
  Array4<Real> const& tmp_arr = tmp_fab.array();

  // GPU kernels using the temporary
}

Without Elixir, the code above will likely cause memory errors because the temporary FArrayBox is deleted on cpu before the gpu kernels use its memory. With Elixir, the ownership of the memory is transferred to Elixir that is guaranteed to be async-safe.

Async Arena

CUDA 11.2 has introduced a new feature, stream-ordered CUDA memory allocator. This feature enables AMReX to solve the temporary memory allocation and deallocation issue discussed above using a memory pool. Instead of using Elixir, we can write code like below,

for (MFIter mfi(mf); mfi.isValid(); ++mfi) {
  const Box& bx = mfi.tilebox();
  FArrayBox tmp_fab(bx, numcomps, The_Async_Arena());
  Array4<Real> const& tmp_arr = tmp_fab.array();

  // GPU kernels using the temporary
}

This is now the recommended way because it’s usually more efficient than Elixir. Note that the code above works for CUDA older than 11.2, HIP and DPC++ as well, and it’s equivalent to using Elixir in these cases. By default, the release threshold for the memory pool is unlimited. One can adjust it with ParmParse parameter, amrex.the_async_arena_release_threshold.

Kernel Launch

In this section, how to offload work to the GPU will be demonstrated. AMReX supports offloading work with CUDA, OpenACC, or OpenMP.

When using CUDA, AMReX provides users with portable C++ function calls or C++ macros that launch a user-defined lambda function. When compiled without CUDA, the lambda function is ran on the CPU. When compiled with CUDA, the launch function prepares and launches the lambda function on the GPU. The preparation includes calculating the appropriate number of blocks and threads, selecting the CUDA stream and defining the appropriate work chunk for each CUDA thread.

When using OpenACC or OpenMP offloading pragmas, the users add the appropriate pragmas to their work loops and functions to offload to the GPU. These work in conjunction with AMReX’s internal CUDA-based memory management, described earlier, to ensure the required data is available on the GPU when the offloaded function is executed.

The available launch schema are presented here in three categories: launching nested loops over Boxes or 1D arrays, launching generic work and launching using OpenACC or OpenMP pragmas. The latest versions of the examples used in this section of the documentation can be found in the AMReX source code in the Launch tutorials. Users should also refer to Chapter Basics as needed for information about basic AMReX classes.

AMReX also recommends writing primary floating point operation kernels in C++ using AMReX’s Array4 object syntax. It provides a multi-dimensional array syntax, similar in appearance to Fortran, while maintaining performance. The details can be found in Array4 and C++ Kernel.

Launching C++ nested loops

The most common AMReX work construct is a set of nested loops over the cells in a box. AMReX provides C++ functions and macro equivalents to port nested loops efficiently onto the GPU. There are 3 different nested loop GPU launches: a 4D launch for work over a box and a number of components, a 3D launch for work over a box and a 1D launch for work over a number of arbitrary elements. Each of these launches provides a performance portable set of nested loops for both CPU and GPU applications.

These loop launches should only be used when each iteration of the nested loop is independent of other iterations. Therefore, these launches have been marked with AMREX_PRAGMA_SIMD when using the CPU and they should only be used for simd-capable nested loops. Calculations that cannot vectorize should be rewritten wherever possible to allow efficient utilization of GPU hardware.

However, it is important for applications to use these launches whenever appropriate because they contain optimizations for both CPU and GPU variations of nested loops. For example, on the GPU the spatial coordinate loops are reduced to a single loop and the component loop is moved to these inner most loop. AMReX’s launch functions apply the appropriate optimizations for USE_CUDA=TRUE and USE_CUDA=FALSE in a compact and readable format.

AMReX also provides a variation of the launch function that is implemented as a C++ macro. It behaves identically to the function, but hides the lambda function from the user. There are some subtle differences between the two implementations, that will be discussed. It is up to the user to select which version they would like to use. For simplicity, the function variation will be discussed throughout the rest of this documentation, however all code snippets will also include the macro variation for reference.

A 4D example of the launch function, amrex::ParallelFor, is given here:

int ncomp = mf.nComp();
for (MFIter mfi(mf,TilingIfNotGPU()); mfi.isValid(); ++mfi)
{
    const Box& bx = mfi.tilebox();
    Array4<Real> const& fab = mf.array(mfi);

    amrex::ParallelFor(bx, ncomp,
    [=] AMREX_GPU_DEVICE (int i, int j, int k, int n)
    {
        fab(i,j,k,n) += 1.;
    });

    /* MACRO VARIATION:
    /
    /   AMREX_PARALLEL_FOR_4D ( bx, ncomp, i, j, k, n,
    /   {
    /       fab(i,j,k,n) += 1.;
    /   });
    */
}

This code works whether it is compiled for GPUs or CPUs. TilingIfNotGPU() returns false in the GPU case to turn off tiling and maximize the amount of work given to the GPU in each launch. When tiling is off, tilebox() returns the validbox(). The BaseFab::array() function returns a lightweight Array4 object that defines access to the underlying FArrayBox data. The Array4s is then captured by the C++ lambda functions defined in the launch function.

amrex::ParallelFor() expands into different variations of a quadruply-nested for loop depending dimensionality and whether it is being implemented on CPU or GPU. The best way to understand this function is to take a look at the 4D amrex::ParallelFor that is implemented when USE_CUDA=FALSE. A simplified version is reproduced here:

void ParallelFor (Box const& box, int ncomp, /* LAMBDA FUNCTION */)
{
    const Dim3 lo = amrex::lbound(box);
    const Dim3 hi = amrex::ubound(box);

    for (int n = 0; n < ncomp; ++n) {
        for (int z = lo.z; z <= hi.z; ++z) {
        for (int y = lo.y; y <= hi.y; ++y) {
        AMREX_PRAGMA_SIMD
        for (int x = lo.x; x <= hi.x; ++x) {
            /* LAUNCH LAMBDA FUNCTION (x,y,z,n) */
        }}}
    }
}

amrex::ParallelFor takes a Box and a number of components, which define the bounds of the quadruply-nested for loop, and a lambda function to run on each iteration of the nested loop. The lambda function takes the loop iterators as parameters, allowing the current cell to be indexed in the lambda. In addition to the loop indices, the lambda function captures any necessary objects defined in the local scope.

CUDA lambda functions can only capture by value, as the information must be able to be copied onto the device. In this example, the lambda function captures a Array4 object, fab, that defines how to access the FArrayBox. The macro uses fab to increment the value of each cell within the Box bx. If USE_CUDA=TRUE, this incrementation is performed on the GPU, with GPU optimized loops.

This 4D launch can also be used to work over any sequential set of components, by passing the number of consecutive components and adding the iterator to the starting component: fab(i,j,k,n_start+n).

The 3D variation of the loop launch does not include a component loop and has the syntax shown here:

for (MFIter mfi(mf,TilingIfNotGPU()); mfi.isValid(); ++mfi)
{
    const Box& bx = mfi.tilebox();
    Array4<Real> const& fab = mf.array(mfi);
    amrex::ParallelFor(bx,
    [=] AMREX_GPU_DEVICE (int i, int j, int k)
    {
        fab(i,j,k) += 1.;
    });

    /* MACRO VARIATION:
    /
    /   AMREX_PARALLEL_FOR_3D ( bx, i, j, k,
    /   {
    /       fab(i,j,k) += 1.;
    /   });
    */
}

Finally, a 1D version is available for looping over a number of elements, such as particles. An example of a 1D function launch is given here:

for (MFIter mfi(mf); mfi.isValid(); ++mfi)
{
    FArrayBox& fab = mf[mfi];
    Real* AMREX_RESTRICT p = fab.dataPtr();
    const long nitems = fab.box().numPts() * fab.nComp();

    amrex::ParallelFor(nitems,
    [=] AMREX_GPU_DEVICE (long idx)
    {
        p[idx] += 1.;
    });

    /* MACRO VARIATION:
    /
    /   AMREX_PARALLEL_FOR_1D ( nitems, idx,
    /   {
    /       p[idx] += 1.;
    /   });
    */
}

Instead of passing an Array4, FArrayBox::dataPtr() is called to obtain a CUDA managed pointer to the FArrayBox data. This is an alternative way to access the FArrayBox data on the GPU. Instead of passing a Box to define the loop bounds, a long or int number of elements is passed to bound the single for loop. This construct can be used to work on any contiguous set of memory by passing the number of elements to work on and indexing the pointer to the starting element: p[idx + 15].

GPU block size

By default, ParallelFor launches AMREX_GPU_MAX_THREADS threads per GPU block, where AMREX_GPU_MAX_THREADS is a compile-time constant with a default value of 256. The users can also explcitly specify the number of threads per block by ParallelFor<MY_BLOCK_SIZE>(...), where MY_BLOCK_SIZE is a multiple of the warp size (e.g., 128). This allows the users to do performance tuning for individual kernels.

Launching general kernels

To launch more general work on the GPU, AMReX provides a standard launch function: amrex::launch. Instead of creating nested loops, this function prepares the device launch based on a Box, launches with an appropriate sized GPU kernel and constructs a thread Box that defines the work for each thread. On the CPU, the thread Box is set equal to the total launch Box, so tiling works as expected. On the GPU, the thread Box usually contains a single cell to allow all GPU threads to be utilized effectively.

An example of a generic function launch is shown here:

for (MFIter mfi(mf,TilingIfNotGPU()); mfi.isValid(); ++mfi)
{
    const Box& bx = mfi.tilebox();
    Array4<Real> const& arr = mf.array(mfi);

    amrex::launch(bx,
    [=] AMREX_GPU_DEVICE (Box const& tbx)
    {
        pluseone_array4(tbx, arr);
        FArrayBox fab(arr, tbx.ixType());
        plusone_fab(tbx, fab); // this version takes FArrayBox
    });

    /* MACRO VARIATION
    /
    /   AMREX_LAUNCH_DEVICE_LAMBDA ( bx, tbx,
    /   {
    /       plusone_array4(tbx, arr);
    /       plusone_fab(tbx, FArrayBox(arr,tbx.ixType()));
    /   });
    */
}

It also shows how to make a FArrayBox from Array4 when needed. Note that FarrayBoxes cannot be passed to GPU kernels directly. TilingIfNotGPU() returns false in the GPU case to turn off tiling and maximize the amount of work given to the GPU in each launch, which substantially improves performance. When tiling is off, tilebox() returns the validbox() of the FArrayBox for that iteration.

Offloading work using OpenACC or OpenMP pragmas

When using OpenACC or OpenMP with AMReX, the GPU offloading work is done with pragmas placed on the nested loops. This leaves the MFIter loop largely unchanged. An example GPU pragma based MFIter loop that calls a Fortran function is given here:

for (MFIter mfi(mf,TilingIfNotGPU()); mfi.isValid(); ++mfi)
{
    const Box& bx = mfi.tilebox();
    FArrayBox& fab = mf[mfi];
    plusone_acc(BL_TO_FORTRAN_BOX(tbx),
                BL_TO_FORTRAN_ANYD(fab));
}

The function plusone_acc is a CPU host function. The FArrayBox reference from operator[] is a reference to a FArrayBox in host memory with data that has been placed in managed CUDA memory. BL_TO_FORTRAN_BOX and BL_TO_FORTRAN_ANYD behave identically to implementations used on the CPU. These macros return the individual components of the AMReX C++ objects to allow passing to the Fortran function.

The corresponding OpenACC labelled loop in plusone_acc is:

!dat = pointer to fab's managed data

!$acc kernels deviceptr(dat)
do       k = lo(3), hi(3)
   do    j = lo(2), hi(2)
      do i = lo(1), hi(1)
         dat(i,j,k) = dat(i,j,k) + 1.0_amrex_real
      end do
   end do
end do
!$acc end kernels

Since the data pointer passed to plusone_acc points to unified memory, OpenACC can be told the data is available on the device using the deviceptr construct. For further details about OpenACC programming, consult the OpenACC user’s guide.

The OpenMP implementation of this loop is similar, only requiring changing the pragmas utilized to obtain the proper offloading. The OpenMP labelled version of this loop is:

!dat = pointer to fab's managed data

!$omp target teams distribute parallel do collapse(3) schedule(static,1) is_device_ptr(dat)
do       k = lo(3), hi(3)
   do    j = lo(2), hi(2)
      do i = lo(1), hi(1)
         dat(i,j,k) = dat(i,j,k) + 1.0_amrex_real
      end do
   end do
end do

In this case, is_device_ptr is used to indicate that dat is available in device memory. For further details about programming with OpenMP for GPU offloading, consult the OpenMP user’s guide.

Kernel launch details

CUDA kernel calls are asynchronous and they return before the kernel is finished on the GPU. So the MFIter loop finishes iterating on the CPU and is ready to move on to the next work before the actual work completes on the GPU. To guarantee consistency, there is an implicit device synchronization (a GPU barrier) in the destructor of MFIter. This ensures that all GPU work inside of an MFIter loop will complete before code outside of the loop is executed. Any CUDA kernel launches made outside of an MFIter loop must ensure appropriate device synchronization occurs. This can be done by calling Gpu::streamSynchronize().

CUDA supports multiple streams and kernels. Kernels launched in the same stream are executed sequentially, but different streams of kernel launches may be run in parallel. For each iteration of MFIter, AMReX uses a different CUDA stream (up to 16 streams in total). This allows each iteration of an MFIter loop to run independently, but in the expected sequence, and maximize the use of GPU parallelism. However, AMReX uses the default CUDA stream outside of MFIter loops.

Launching kernels with AMReX’s launch macros or functions implement a C++ lambda function. Lambdas functions used with CUDA have some restrictions the user must understand. First, the function enclosing the extended lambda must not have private or protected access within its parent class, otherwise the code will not compile. This can be fixed by changing the access of the enclosing function to public.

Another pitfall that must be considered: if the lambda function accesses a member of the enclosing class, the lambda function actually captures this pointer by value and accesses variables and functions via this->. If the object is not accessible on GPU, the code will not work as intended. For example,

class MyClass {
public:
    Box bx;
    int m;                           // Unmanaged integer created on the host.
    void f () {
        amrex::launch(bx,
        [=] AMREX_GPU_DEVICE (Box const& tbx)
        {
            printf("m = %d\n", m);   // Failed attempt to use m on the GPU.
        });
    }
};

The function f in the code above will not work unless the MyClass object is in unified memory. If it is undesirable to put the object into unified memory, a local copy of the information can be created for the lambda to capture. For example:

class MyClass {
public:
    Box bx;
    int m;
    void f () {
        int local_m = m;                  // Local temporary copy of m.
        amrex::launch(bx,
        [=] AMREX_GPU_DEVICE (Box const& tbx)
        {
            printf("m = %d\n", local_m);  // Lambda captures local_m by value.
        });
    }
};

C++ macros have some important limitations. For example, commas outside of a set of parentheses are interpreted by the macro, leading to errors such as:

AMREX_PARALLEL_FOR_3D (bx, tbx,
{
    Real a, b;   <---- Error. Macro reads "{ Real a" as a parameter
                                             and "b; }" as
                                             another.
    Real a;      <---- OK
    Real b;
});

One should also avoid using continue and return inside the macros because it is not an actual for loop. Users that choose to implement the macro launches should be aware of the limitations of C++ preprocessing macros to ensure GPU offloading is done properly.

Finally, AMReX’s most common CPU threading strategy for GPU/CPU systems is to utilize OpenMP threads to maintain multi-threaded parallelism on work chosen to run on the host. This means OpenMP pragmas should be maintained where CPU work is performed and usually turned off where work is offloaded onto the GPU. OpenMP pragmas can be turned off using the conditional pragma and Gpu::notInLaunchRegion(), as shown below:

#ifdef AMREX_USE_OMP
#pragma omp parallel if (Gpu::notInLaunchRegion())
#endif

It is generally expected that simply using OpenMP threads to launch GPU work quicker will show little improvement or even perform worse. So, this conditional statement should be added to MFIter loops that contain GPU work, unless users specifically test the performance or are designing more complex workflows that require OpenMP.

Stream and Synchronization

As mentioned in Section Overview of AMReX GPU Strategy, AMReX uses a number of GPU streams that are either CUDA streams or HIP streams or SYCL queues. Many GPU functions (e.g., ParallelFor and Gpu::copyAsync) are asynchronous with respect to the host. To facilitate synchronization that is sometimes necessary, AMReX provides Gpu::streamSynchronize() and Gpu::streamSynchronizeAll() to synchronize the current stream and all AMReX streams, respectively. For performance reasons, one should try to minimize the number of synchronization calls. For example,

// The synchronous version is NOT recommended
Gpu::copy(Gpu::deviceToHost, ....);
Gpu::copy(Gpu::deviceToHost, ....);
Gpu::copy(Gpu::deviceToHost, ....);

// NOT recommended because of unnecessary synchronization
Gpu::copyAsync(Gpu::deviceToHost, ....);
Gpu::streamSynchronize();
Gpu::copyAsync(Gpu::deviceToHost, ....);
Gpu::streamSynchronize();
Gpu::copyAsync(Gpu::deviceToHost, ....);
Gpu::streamSynchronize();

// recommended
Gpu::copyAsync(Gpu::deviceToHost, ....);
Gpu::copyAsync(Gpu::deviceToHost, ....);
Gpu::copyAsync(Gpu::deviceToHost, ....);
Gpu::streamSynchronize();

In addition to stream synchronization, there is also Gpu::synchronize() that will perform a device wide synchronization. However, a device wide synchronization is usually too excessive and it might interfere with other libraries (e.g., MPI).

An Example of Migrating to GPU

The nature of GPU programming poses difficulties for a number of common AMReX patterns, such as the one below:

// Given MultiFab uin and uout
#ifdef AMREX_USE_OMP
#pragma omp parallel
#endif
{
  FArrayBox q;
  for (MFIter mfi(uin,true); mfi.isValid(); ++mfi)
  {
    const Box& tbx = mfi.tilebox();
    const Box& gbx = amrex::grow(tbx,1);
    q.resize(gbx);

    // Do some work with uin[mfi] as input and q as output.
    // The output region is gbx;
    f1(gbx, q, uin[mfi]);

    // Then do more work with q as input and uout[mfi] as output.
    // The output region is tbx.
    f2(tbx, uout[mfi], q);
  }
}

There are several issues in migrating this code to GPUs that need to be addressed. First, functions f1 and f2 have different work regions (tbx and gbx, respectively) and there are data dependencies between the two (q). This makes it difficult to put them into a single GPU kernel, so two separate kernels will be launched, one for each function.

As we have discussed, AMReX uses multiple CUDA streams for launching kernels. Because q is used inside MFIter loops, multiple GPU kernels on different streams are accessing its data. This creates a race condition. One way to fix this is to move FArrayBox q inside the loop to make it local to each loop and use Elixir to make it async-safe (see Section Elixir). This strategy works well for GPU. However it is not optimal for OpenMP CPU threads when CUDA is not used, because of the memory allocation inside OpenMP parallel region. It turns out it is actually unnecessary to make FArrayBox q local to each iteration when Elixir is used to extend the life of its floating point data. The code below shows an example of how to rewrite the example in a performance portable way.

// Given MultiFab uin and uout
#ifdef AMREX_USE_OMP
#pragma omp parallel if (Gpu::notInLaunchRegion())
#endif
{
  FArrayBox q;
  for (MFIter mfi(uin,TilingIfNotGPU()); mfi.isValid(); ++mfi)
  {
    const Box& tbx = mfi.tilebox();
    const Box& gbx = amrex::grow(tbx,1);
    q.resize(gbx);
    Elixir eli = q.elixir();
    Array4<Real> const& qarr = q.array();

    Array4<Real const> const& uinarr = uin.const_array(mfi);
    Array4<Real> const& uoutarr = uout.array(mfi);

    amrex::launch(gbx,
    [=] AMREX_GPU_DEVICE (Box const& b)
    {
      f1(b, qarr, uinarr);
    });

    amrex::launch(tbx,
    [=] AMREX_GPU_DEVICE (Box const& b)
    {
      f2(b, uoutarr, qarr);
    });
  }
}

Assertions and Error Checking

To help debugging, we often use amrex::Assert and amrex::Abort. These functions are GPU safe and can be used in GPU kernels. However, implementing these functions requires additional GPU registers, which will reduce overall performance. Therefore, it is preferred to implement such calls in debug mode only by wrapping the calls using #ifdef AMREX_DEBUG.

In CPU code, AMREX_GPU_ERROR_CHECK() can be called to check the health of previous GPU launches. This call looks up the return message from the most recently completed GPU launch and aborts if it was not successful. Many kernel launch macros as well as the MFIter destructor include a call to AMREX_GPU_ERROR_CHECK(). This prevents additional launches from being called if a previous launch caused an error and ensures all GPU launches within an MFIter loop completed successfully before continuing work.

However, due to asynchronicity, determining the source of the error can be difficult. Even if GPU kernels launched earlier in the code result in a CUDA error, the error may not be output at a nearby call to AMREX_GPU_ERROR_CHECK() by the CPU. When tracking down a CUDA launch error, Gpu::synchronize(), Gpu::streamSynchronize(), or Gpu::streamSynchronizeAll() can be used to synchronize the device, the current GPU stream, or all GPU streams, respectively, and track down the specific launch that causes the error.

Particle Support

As with MultiFab, particle data stored in AMReX ParticleContainer classes are stored in unified memory when AMReX is compiled with USE_CUDA=TRUE. This means that the dataPtr associated with particles is managed and can be passed into GPU kernels. These kernels can be launched with a variety of approaches, including Cuda C / Fortran and OpenACC. An example Fortran particle subroutine offloaded via OpenACC might look like the following:

subroutine push_position_boris(np, structs, uxp, uyp, uzp, gaminv, dt)

use em_particle_module, only : particle_t
use amrex_fort_module, only : amrex_real
implicit none

integer,          intent(in), value  :: np
type(particle_t), intent(inout)      :: structs(np)
real(amrex_real), intent(in)         :: uxp(np), uyp(np), uzp(np), gaminv(np)
real(amrex_real), intent(in), value  :: dt

integer                              :: ip

!$acc parallel deviceptr(structs, uxp, uyp, uzp, gaminv)
!$acc loop gang vector
do ip = 1, np
    structs(ip)%pos(1) = structs(ip)%pos(1) + uxp(ip)*gaminv(ip)*dt
    structs(ip)%pos(2) = structs(ip)%pos(2) + uyp(ip)*gaminv(ip)*dt
    structs(ip)%pos(3) = structs(ip)%pos(3) + uzp(ip)*gaminv(ip)*dt
end do
!$acc end loop
!$acc end parallel

end subroutine push_position_boris

Note the use of the !$acc parallel deviceptr clause to specify which data has been placed in managed memory. This instructs OpenACC to treat those variables as if they already live on the device, bypassing the usual copies. For complete examples of a particle code that has been ported to GPUs using Cuda, OpenACC, and OpenMP, please see the tutorial Electromagnetic PIC.

GPU-aware implementations of many common particle operations are provided with AMReX, including neighbor list construction and traversal, particle-mesh deposition and interpolation, parallel reductions of particle data, and a set of transformation and filtering operations that are useful when operating on sets of particles. For examples of these features in use, please see Tests/Particles/.

Finally, the parallel communication of particle data has been ported and optimized for performance on GPU platforms. This includes Redistribute(), which moves particles back to the proper grids after their positions have changed, as well as fillNeighbors() and updateNeighbors(), which are used to exchange halo particles. As with MultiFab data, these have been designed to minimize host / device traffic as much as possible, and can take advantage of the Cuda-aware MPI implementations available on platforms such as ORNL’s Summit.

Profiling with GPUs

When profiling for GPUs, AMReX recommends nvprof, NVIDIA’s visual profiler. nvprof returns data on how long each kernel launch lasted on the GPU, the number of threads and registers used, the occupancy of the GPU and recommendations for improving the code. For more information on how to use nvprof, see NVIDIA’s User’s Guide as well as the help web pages of your favorite supercomputing facility that uses NVIDIA GPUs.

AMReX’s internal profilers currently cannot hook into profiling information on the GPU and an efficient way to time and retrieve that information is being explored. In the meantime, AMReX’s timers can be used to report some generic timers that are useful in categorizing an application.

Due to the asynchronous launching of GPU kernels, any AMReX timers inside of asynchronous regions or inside GPU kernels will not measure useful information. However, since the MFIter synchronizes when being destroyed, any timer wrapped around an MFIter loop will yield a consistent timing of the entire set of GPU launches contained within. For example:

BL_PROFILE_VAR("A_NAME", blp);     // Profiling start
for (MFIter mfi(mf); mfi.isValid(); ++mfi)
{
    // gpu works
}
BL_PROFILE_STOP(blp);              // Profiling stop

For now, this is the best way to profile GPU codes using TinyProfiler. If you require further profiling detail, use nvprof.

Performance Tips

Here are some helpful performance tips to keep in mind when working with AMReX for GPUs:

  • To obtain the best performance when using CUDA kernel launches, all device functions called within the launch region should be inlined. Inlined functions use substantially fewer registers, freeing up GPU resources to perform other tasks. This increases parallel performance and greatly reduces runtime. Functions are written inline by putting their definitions in the .H file and using the AMREX_FORCE_INLINE AMReX macro. Examples can be found in in the Launch tutorial. For example:

AMREX_GPU_DEVICE
AMREX_FORCE_INLINE
void plusone_cudacpp (amrex::Box const& bx, amrex::FArrayBox& fab)
{
    ...
}
  • Pay attention to what GPUs your job scheduler is assigning to each MPI rank. In most cases you’ll achieve the best performance when a single MPI rank is assigned to each GPU, and has boxes large enough to saturate that GPU’s compute capacity. While there are some cases where multiple MPI ranks per GPU can make sense (typically this would be when you have some portion of your code that is not GPU accelerated and want to have many MPI ranks to make that part faster), this is probably the minority of cases. For example, on OLCF Summit you would want to ensure that your resource sets contain one MPI rank and GPU each, using jsrun -n N -a 1 -c 7 -g 1, where N is the total number of MPI ranks/GPUs you want to use. (See the OLCF [job step viewer](https://jobstepviewer.olcf.ornl.gov/) for more information.)

    Conversely, if you choose to have multiple GPUs visible to each MPI rank, AMReX will attempt to do the best job it can assigning MPI ranks to GPUs by doing round robin assignment. This may be suboptimal because this assignment scheme would not be aware of locality benefits that come from having an MPI rank be on the same socket as the GPU it is managing. If you know the hardware layout of the system you’re running on, specifically the number of GPUs per socket (M) and number of GPUs per node (N), you can set the preprocessor defines -DAMREX_GPUS_PER_SOCKET=M and -DAMREX_GPUS_PER_NODE=N, which are exposed in the GNU Make system through the variables GPUS_PER_SOCKET and GPUS_PER_NODE respectively (see an example in Tools/GNUMake/sites/Make.olcf). Then AMReX can ensure that each MPI rank selects a GPU on the same socket as that rank (assuming your MPI implementation supports MPI 3.)

Inputs Parameters

The following inputs parameters control the behavior of amrex when running on GPUs. They should be prefaced by “amrex” in your inputs file.

Description

Type

Default

use_gpu_aware_mpi

Whether to use GPU memory for communication buffers during MPI calls. If true, the buffers will use device memory. If false, they will use pinned memory. In practice, we find it is usually not worth it to use GPU aware MPI.

Bool

False

abort_on_out_of_gpu_memory

If the size of free memory on the GPU is less than the size of a requested allocation, AMReX will call AMReX::Abort() with an error describing how much free memory there is and what was requested.

Bool

False

the_arena_is_managed

Whether The_Arena() allocates managed memory.

Bool

True