Debugging is an art. Everyone has their own favorite method. Here we offer a few tips we have found to be useful.

To help with debugging, AMReX handles several C standard library signals raised during a run. This gives it a chance to print more information using the Linux/Unix backtrace capability. The signals include segmentation faults (or "segfaults"), interruption by the user (Control-C), assertion errors, and floating point exceptions (NaNs, division by zero, and overflow). Handling of segfaults, assertion errors, and interruption by Control-C is enabled by default. Note that AMREX_ASSERT() is active only when the code is compiled with DEBUG=TRUE or USE_ASSERTION=TRUE in GNU make, or with -DCMAKE_BUILD_TYPE=Debug or -DAMReX_ASSERTIONS=YES in CMake.

Trapping of floating point exceptions is not enabled by default unless the code is compiled with DEBUG=TRUE in GNU make, or with -DCMAKE_BUILD_TYPE=Debug or -DAMReX_FPE=YES in CMake, which turn on the relevant compiler flags if supported. Alternatively, one can always use runtime parameters to control the handling of floating point exceptions: amrex.fpe_trap_invalid for NaNs, amrex.fpe_trap_zero for division by zero, and amrex.fpe_trap_overflow for overflow.

To trap the use of uninitialized values more effectively, AMReX also initializes the FArrayBoxes in MultiFabs and arrays allocated by bl_allocate to signaling NaNs when it is compiled with TEST=TRUE or DEBUG=TRUE in GNU make, or with -DCMAKE_BUILD_TYPE=Debug in CMake. One can also control this setting for FArrayBox with the runtime parameter fab.init_snan. Note that Macs with M1 and M2 chips, which use the Arm64 architecture, are not able to trap division by zero.
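The idea behind initializing data to signaling NaNs can be illustrated with a small standard-library sketch. It does not use AMReX; make_snan_buffer and count_nans are hypothetical helper names introduced here for illustration only. A buffer is filled with signaling NaNs, and any entry that is never overwritten remains detectable as a NaN. When trapping of invalid operations is enabled, the first arithmetic use of such a value would raise a floating point exception instead of propagating silently.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not an AMReX function): allocate a buffer whose
// entries all start as signaling NaNs, mimicking the effect of
// initializing FArrayBox data to signaling NaNs.
std::vector<double> make_snan_buffer(std::size_t n)
{
    return std::vector<double>(n, std::numeric_limits<double>::signaling_NaN());
}

// Hypothetical helper: count the entries that are still NaN, i.e. the
// entries that were never overwritten with real data.
std::size_t count_nans(const std::vector<double>& v)
{
    std::size_t count = 0;
    for (double x : v) {
        if (std::isnan(x)) { ++count; }
    }
    return count;
}
```

For example, after make_snan_buffer(8) and assigning values to only the first six entries, count_nans reports the two entries that were left uninitialized.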

One can get more information than the backtrace of the call stack by instrumenting the code. Here is an example. Suppose you know that the line Real rho = state(cell,0); is causing a segfault. You could add a print statement before that line, but it might print thousands (or even millions) of lines before the segfault is hit. Instead, you could do the following:

#include <AMReX_BLBackTrace.H>
#include <sstream>

std::ostringstream ss;
ss << "state.box() = " << state.box() << " cell = " << cell;
BL_BACKTRACE_PUSH(ss.str()); // PUSH takes std::string

Real rho = state(cell,0);  // state is a Fab, and cell is an IntVect.

BL_BACKTRACE_POP(); // One can omit this line.  In that case,
                    // there is an implicit POP when "PUSH" is
                    // out of scope.

When the segfault is hit, you will see only the last message that was pushed, rather than one line of output for every cell visited before the crash.
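The implicit POP on scope exit follows the usual RAII pattern. The sketch below shows that pattern with the standard library only. It is not AMReX's implementation (which lives in AMReX_BLBackTrace.H and may differ), and context_stack, ScopedContext, and current_context are hypothetical names used for illustration.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a stack of context messages that a crash
// handler could print.
std::vector<std::string>& context_stack()
{
    static std::vector<std::string> stack;
    return stack;
}

// RAII guard: pushes a message on construction and pops it on
// destruction, so leaving the scope acts as an implicit POP.
class ScopedContext
{
public:
    explicit ScopedContext(std::string msg)
    {
        context_stack().push_back(std::move(msg));
    }
    ~ScopedContext() { context_stack().pop_back(); }

    ScopedContext(const ScopedContext&) = delete;
    ScopedContext& operator=(const ScopedContext&) = delete;
};

// The innermost live message, i.e. the most recent PUSH that has not
// yet been popped. Returns an empty string if nothing is pushed.
std::string current_context()
{
    return context_stack().empty() ? std::string() : context_stack().back();
}
```

If a ScopedContext is created inside a loop body, each iteration replaces the previous message, so at the point of a crash only the message for the current iteration is still on the stack.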

Writing a MultiFab to disk with

VisMF::Write(const FabArray<FArrayBox>& mf, const std::string& name)

in AMReX_VisMF.H and examining it with Amrvis (section Amrvis) can be helpful as well. In AMReX_MultiFabUtil.H, the function

void print_state(const MultiFab& mf, const IntVect& cell, const int n=-1,
                 const IntVect& ng = IntVect::TheZeroVector());

can output the data for a single cell. n is the component, with the default being to print all components. ng is the number of ghost cells to include.

Valgrind is one of our favorite debugging tools. For MPI runs, one can tell Valgrind to output to different files for different processes. For example,

mpiexec -n 4 valgrind --leak-check=yes --track-origins=yes --log-file=vallog.%p ./foo.exe ...

Breaking into Debuggers

In order to break into debuggers and use modern IDEs, the backtrace signal handling described above needs to be disabled.

The following runtime options need to be set in order to prevent AMReX from catching the break signals before a debugger can attach to a crashing process:

amrex.throw_exception = 1
amrex.signal_handling = 0

This default behavior can also be modified by applications; see, for example, this custom application initializer.

Basic GPU Debugging

The asynchronous nature of GPU execution can make tracking down bugs complex. The relative timing of improperly coded functions can cause variations in output, and the point at which an error message appears may not correspond directly to the place in the code that caused it. One strategy to isolate specific kernel failures is to add amrex::Gpu::synchronize() or amrex::Gpu::streamSynchronize() after every ParallelFor or similar amrex::launch type call. These synchronization commands halt execution of the code until the GPU or GPU stream, respectively, has finished processing all previously requested tasks, thereby making it easier to locate and identify sources of error.
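Why synchronizing after every launch localizes a failure can be sketched on the CPU with a toy model. This is only an analogy under stated assumptions: ToyStream and first_failing_kernel are hypothetical names, not AMReX, CUDA, or HIP types. In the model, launch() merely queues work, and errors become visible only when synchronize() drains the queue.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

// Toy stand-in for a GPU stream (hypothetical): launch() only queues
// work; errors surface when synchronize() runs the queued work.
class ToyStream
{
public:
    void launch(std::function<void()> kernel)
    {
        m_queue.push_back(std::move(kernel));
    }

    // Runs all queued work. Returns false if any queued kernel threw.
    bool synchronize()
    {
        bool ok = true;
        for (auto& kernel : m_queue) {
            try { kernel(); } catch (const std::exception&) { ok = false; }
        }
        m_queue.clear();
        return ok;
    }

private:
    std::vector<std::function<void()>> m_queue;
};

// Synchronize after every launch so that a failure is attributed to
// exactly one kernel. Returns the index of the first failing kernel,
// or kernels.size() if all of them succeed.
std::size_t first_failing_kernel(const std::vector<std::function<void()>>& kernels)
{
    ToyStream stream;
    for (std::size_t i = 0; i < kernels.size(); ++i) {
        stream.launch(kernels[i]);
        if (!stream.synchronize()) { return i; }
    }
    return kernels.size();
}
```

With a single synchronize() at the end, the model could report only that some queued kernel failed. Synchronizing inside the loop trades performance for the ability to name the failing launch, so this is something to do while debugging rather than in production runs.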

CUDA-Specific Tests

  • To test if your kernels have launched, run:

    nvprof ./

    If using NVIDIA Nsight Systems instead, access nvprof functionality with:

    nsys nvprof ./

  • Run nvprof -o profile%p.nvvp ./main3d.xxxx or nsys profile -o nsys_out.%q{SLURM_PROCID}.%q{SLURM_JOBID} ./ for a small problem and examine page faults using nvvp or nsight-sys $(pwd)/nsys_out.#.######.qdrep.

  • Run under cuda-memcheck or its successor, compute-sanitizer, to identify memory errors.

  • Run under cuda-gdb to identify kernel errors.

  • To help identify race conditions, globally disable asynchronicity of kernel launches for all CUDA applications by setting CUDA_LAUNCH_BLOCKING=1 in your environment variables. This will ensure that only one CUDA kernel will run at a time.

AMD ROCm-Specific Tests

  • To test if your kernels have launched, run:

    rocprof ./

  • Run rocprof --hsa-trace --stats --timestamp on --roctx-trace ./main3d.xxxx for a small problem and examine tracing using chrome://tracing.

  • Run under rocgdb for source-level debugging.

  • To help identify if there are race conditions, globally disable asynchronicity of kernel launches by setting CUDA_LAUNCH_BLOCKING=1 or HIP_LAUNCH_BLOCKING=1 in your environment variables. This will ensure only one kernel will run at a time. See the AMD ROCm docs’ chicken bits section for more debugging environment variables.

Intel GPU-Specific Tests

  • To test if your kernels have launched, run:

    ./ze_tracer ./

  • Run Intel Advisor, advisor --collect=survey ./ for a small problem with 1 MPI process and examine metrics.

  • Run under gdb with the Intel Distribution for GDB.

  • To report back-end information, set ZE_DEBUG=1 in your environment variables.