hacc/SWFFT, developed by Adrian Pope et al. at Argonne National Lab, provides the functionality to perform forward and reverse Fast Fourier Transforms (FFT) within a fully parallelized framework built in C++ and F90. In the words of HACC’s developers, SWFFT is a “distributed-memory, pencil-decomposed, parallel 3D FFT.” [1] The SWFFT source code is also contained in the following directory within AMReX: amrex/Src/Extern/SWFFT. [2]

Pencil Redistribution

As input, SWFFT takes three-dimensional arrays of data distributed across block-structured grids, and redistributes the data into “pencil” grids in \(z, x,\) and then \(y\), belonging to different MPI processes. After each pencil conversion, a 1D FFT is performed on the data along the pencil direction using calls to the FFTW [3] library. The README files in the tutorial directories specify the relationship between the number of grids and the number of MPI processes that should be used. The hacc/SWFFT README document by Adrian Pope et al. explains restrictions on grid dimensions in relation to the number of MPI processes [1] [2]:

[…] A rule of thumb is that [SWFFT] generally works when the number of vertices along one side of the global 3D grid (“ng”) can be factored into small primes, and when the number of MPI ranks can also be factored into small primes. I believe that all of the unique prime factors of the number of MPI ranks must be present in the set of prime factors of the grid, eg. if you have 20 MPI ranks then ng must be a multiple of 5 and 2. The CheckDecomposition utility is provided to check (on one rank) whether a proposed grid size and number of MPI ranks will work, which can be done before submitting a large test with TestDfft/TestFDfft.

The relationship between the number of processes versus global grid dimensions is determined by how the total number of grids can be factored from a three dimensional grid structure (block structured grids) into a two dimensional structure (pencil arrays), as shown in the figures below.

The following figures illustrate how data is distributed from block structured grids to pencil arrays within SWFFT, where the colors of each box indicate which MPI rank it belongs to:

Table 21 SWFFT Redistribution from \(4 \times 4 \times 4\) Box Array into Pencils



(a) Block structured grids: \(N_x=4,N_y=4,N_z=4\)
(b) Z-pencils: \(N_x=8,N_y=8,N_z=1\)
Table 22 SWFFT Redistribution from \(2 \times 2 \times 2\) Box Array into Pencils



(a) Block structured grids: \(N_x=2,N_y=2,N_z=2\)
(b) Z-pencils: \(N_x=4,N_y=2,N_z=1\)



(c) X-pencils: \(N_x=1,N_y=4,N_z=2\)
(d) Y-pencils: \(N_x=4,N_y=1,N_z=2\)

Using the same number of AMReX grids as processes has been verified to work in the SWFFT Poisson and SWFFT Simple tutorials. This can be illustrated by the following equation for the total number of grids, \(N_{b}\), in a regularly structured domain:

\[N_{b} = m_{bi} m_{bj} = n_{bi} n_{bj} n_{bk},\]

where \(n_{bi}, n_{bj},\) and \(n_{bk}\) are the number of grids, or boxes, in the \(x, y,\) and \(z\) dimensions of the block-structured grid. Analogously, for pencil distributions, \(m_{bi}\) and \(m_{bj}\) are the number of grids along the remaining dimensions if pencils are taken in the \(k\) direction. There are many possible ways of redistributing the data, for example \(m_{bi} = n_{bi}n_{bk}\) & \(m_{bj} = n_{bj}\) is one possible simple configuration. However, it is evident from the figures above that the SWFFT redistribution algorithm has a more sophisticated method for finding the prime factors of the grid.


AMReX contains two SWFFT tutorials, SWFFT Poisson and SWFFT Simple:

  • SWFFT Poisson solves a Poisson equation with periodic boundary conditions. In it, both a forward FFT and reverse FFT are called to solve the equation, however, no reordering of the DFT data in k-space is performed.

  • SWFFT Simple is useful if the objective is to simply take a forward FFT of data, and the DFT’s ordering in k-space matters to the user. This tutorial initializes a 3D or 2D MultiFab, takes a forward FFT, and then redistributes the data in k-space back to the “correct,” 0 to \(2\pi\), ordering. The results are written to a plot file.