Types of Profiling
AMReX’s built-in profiling works through objects that start and stop timers based on user-placed macros or an object’s constructor and destructor. The results from these timers are stored in a global list that is consolidated and printed during finalization, or at a user-defined flush point.
Currently, AMReX has two options for built-in profiling: Tiny Profiling and Full Profiling.
Tiny Profiling
To enable “Tiny Profiling” with GNU Make edit the options in the file GNUMakefile
to show,
TINY_PROFILE = TRUE
PROFILE = FALSE
If building with CMake, set the following CMake flags,
AMReX_TINY_PROFILE = ON
AMReX_BASE_PROFILE = OFF
Note
If you set PROFILE = TRUE
(or AMReX_BASE_PROFILE =
ON
) to enable full profiling then this will override the TINY_PROFILE
flag
and tiny profiling will be disabled.
Output
At the end of a run, a summary of exclusive and inclusive function times will
be written to stdout
. This output includes the minimum and maximum (over
processes) time spent in each routine as well as the average and the maximum
percentage of total run time. See the sample output below.
TinyProfiler total time across processes [min...avg...max]: 1.765...1.765...1.765
---------------------------------------------------------------------------------
Name NCalls Excl. Min Excl. Avg Excl. Max Max %
---------------------------------------------------------------------------------
mfix_level::EvolveFluid 1 1.602 1.668 1.691 95.83%
FabArray::FillBoundary() 11081 0.02195 0.03336 0.06617 3.75%
FabArrayBase::getFB() 22162 0.02031 0.02147 0.02275 1.29%
PC<...>::WriteAsciiFile() 1 0.00292 0.004072 0.004551 0.26%
---------------------------------------------------------------------------------
Name NCalls Incl. Min Incl. Avg Incl. Max Max %
---------------------------------------------------------------------------------
mfix_level::Evolve() 1 1.69 1.723 1.734 98.23%
mfix_level::EvolveFluid 1 1.69 1.723 1.734 98.23%
FabArray::FillBoundary() 11081 0.04236 0.05485 0.08826 5.00%
FabArrayBase::getFB() 22162 0.02031 0.02149 0.02275 1.29%
The tiny profiler automatically writes the results to stdout
at the end of your
code, when amrex::Finalize();
is reached. However, you may want to write
partial profiling results to ensure your information is saved when you may fail
to converge or if you expect to run out of allocated time. Partial results can
be written at user-defined points in the code by inserting the line:
BL_PROFILE_TINY_FLUSH();
Any timers that have not reached their BL_PROFILE_VAR_STOP
call or exited
their scope and deconstructed will not be included in these partial outputs.
(e.g., a properly instrumented main()
should show a time of zero in all
partial outputs.) Therefore, it is recommended to place these flush calls in
easily identifiable regions of your code and outside of as many profiling
timers as possible, such as immediately before or after writing a checkpoint.
Also, since flush calls will print multiple, similar looking outputs to stdout
,
it is also recommended to wrap any BL_PROFILE_TINY_FLUSH();
calls in
informative amrex::Print()
lines to ensure accurate identification of each
set of timers.
Hot Spots and Load Balance
The output of TinyProfiler can help us to identify hot spots. For example, the following output shows the top three hot spots of a linear solver test running on 4 MPI processes.
--------------------------------------------------------------------------------------------
Name NCalls Excl. Min Excl. Avg Excl. Max Max %
--------------------------------------------------------------------------------------------
MLPoisson::Fsmooth() 560 0.4775 0.4793 0.4815 34.97%
MLPoisson::Fapply() 114 0.1103 0.113 0.1167 8.48%
FabArray::Xpay() 109 0.1 0.1013 0.1038 7.54%
In this test, there are 16 boxes evenly distributed among 4 MPI processes. The
output above shows that the load is perfectly balanced. However, if the load
is not balanced, the results can be very different and sometimes
misleading. For example, if we put 2, 2, 6 and 6 boxes on processes 0, 1, 2
and 3, respectively, the top three hot spots now include two MPI
communication functions, FillBoundary
and ParallelCopy
.
--------------------------------------------------------------------------------------------
Name NCalls Excl. Min Excl. Avg Excl. Max Max %
--------------------------------------------------------------------------------------------
FillBoundary_finish() 607 0.01568 0.3367 0.6574 41.97%
MLPoisson::Fsmooth() 560 0.2133 0.4047 0.5973 38.13%
FabArray::ParallelCopy_finish() 231 0.002977 0.09748 0.1895 12.10%
The reason that the MPI communication appears slow is that the lightly loaded processes have to wait for messages sent by the heavily loaded processes. See also Profiling Options for a diagnostic option that may provide more insight on the load imbalance.
Full Profiling
If you set PROFILE = TRUE
then a bl_prof
directory will be written that
contains detailed per-task timings for each processor. This will be written in
nfiles
files (where nfiles
is specified by the user). The information
in the directory can be analyzed by the AMRProfParser tool
within Amrvis. In addition, an
exclusive-only set of function timings will be written to stdout
.
Trace Profiling
If you set
TRACE_PROFILE = TRUE
in addition toPROFILE = TRUE
, then the profiler keeps track of when each profiled function is called and thebl_prof
directory will include the function call stack. This is especially useful when core functions, such asFillBoundary
can be called from many different regions of the code. Using trace profiling allows one to specify regions in the code that can be analyzed for profiling information independently from other regions.
Communication Profiling
If you set
COMM_PROFILE = TRUE
in addition toPROFILE = TRUE
, then thebl_prof
directory will contain additional information about MPI communication (point-to-point timings, data volume, barrier/reduction times, etc.).TRACE_PROFILE = TRUE
andCOMM_PROFILE = TRUE
can be set together.
The AMReX-specific profiling tools are currently under development and this documentation will reflect the latest status in the development branch.
Instrumenting C++ Code
AMReX profiler objects are created and managed through BL_PROF
macros.
To start, you must at least instrument main(), i.e.:
int main(...)
{
amrex::Initialize(argc,argv);
BL_PROFILE_VAR("main()",pmain);
<AMReX code block>
BL_PROFILE_VAR_STOP(pmain);
amrex::Finalize();
}
Or:
void main_main()
{
BL_PROFILE("main()");
<AMReX code block>
}
int main(...)
{
amrex::Initialize(argc,argv);
main_main();
amrex::Finalize();
}
You can then instrument any of your functions, or code blocks. There are four general profiler macro types available:
1) A scoped timer, BL_PROFILE
:
These timers generate their own object names, so they can’t be controlled after being defined. However, they are the cleanest and easiest to work with in many situations. They time from the point where the macro is called until the end of the enclosing scope. This macro is ideal for timing an entire function. For example:
void YourClass::YourFunction()
{
BL_PROFILE("YourClass::YourFunction()"); // Timer starts here.
< Your Function Code Block>
} // <------ Timer goes out of scope here, calling stop and returning the function time.
Note that all AMReX timers are scoped and will call “stop” when the corresponding object is destroyed. This macro is unique because it can only stop when it goes out of scope.
2) A named, scoped timer, BL_PROFILE_VAR
:
In some cases, using scopes to control a timer is not ideal. In such cases, you can use the
_VAR_
macros to create a named timer that can be controlled through _START_
and _STOP_
macros.
_VAR_
signifies that the macro takes a variable name. For example, to time a function without scoping:
BL_PROFILE_VAR("Flaten::FORT_FLATENX()", anyname); // Create and start "anyname".
FORT_FLATENX(arg1, arg2);
BL_PROFILE_VAR_STOP(anyname); // Stop the "anyname" timer object.
This can also be used to selectively time with the same scope. For example, to include Func_0
and Func_2
, but not Func_1
:
BL_PROFILE_VAR("MyFuncs()", myfuncs); // the first one
MyFunc_0(args);
BL_PROFILE_VAR_STOP(myfuncs);
MyFunc_1(args);
BL_PROFILE_VAR_START(myfuncs);
MyFunc_2(arg);
BL_PROFILE_VAR_STOP(myfuncs);
Remember, these are still scoped. So, the scoped timer example can be reproduced exactly with named
timers by just using the _VAR
macro:
void YourClass::YourFunction()
{
BL_PROFILE_VAR("YourClass::YourFunction()", pmain); // Timer starts here.
< Your Function Code Block>
} // <------ Timer goes out of scope here correctly, without a STOP call.
3) A named, scoped timer that doesn’t auto-start, BL_PROFILE_VAR_NS
:
Sometimes, a complicated scoping may mean the profiling object needs to be defined before it’s
started. To create a named AMReX timer that doesn’t start automatically, use the _NS_
macros.
(“NS” stands for “no start”). For example, this implementation times MyFunc0
and MyFunc1
but not any of the
“Additional Code” blocks:
{
BL_PROFILE_VAR_NS("MyFuncs()", myfuncs); // dont start the timer
<Additional Code A>
{
BL_PROFILE_VAR_START(myfuncs);
MyFunc_0(arg);
BL_PROFILE_VAR_STOP(myfuncs);
}
<Additional Code B>
{
BL_PROFILE_VAR_START(myfuncs);
MyFunc_1(arg);
BL_PROFILE_VAR_STOP(myfuncs);
<Additional Code C>
}
}
Note
The _NS_
macro must, by necessity, also be a _VAR_
macro. Otherwise, you would never be
able to turn the timer on!
4) Designate a sub-region to profile, BL_PROFILE_REGION
:
Often, it’s helpful to look at a subset of timers separately from the complete profile. For example, you may want to view the timing of a specific time step or isolate everything inside the “Chemistry” part of the code. This can be accomplished by designating profile regions. All timers within a named region will be included both in the full analysis, as well as in a separate sub-analysis.
Regions are meant to be large contiguous blocks of code, and should be used sparingly and purposefully to produce useful profiling reports. As such, the possible region options are purposefully limited.
Scoped Regions
When using the Tiny Profiler, the only available region macro is the scoped macro. To create a region that profiles the MyFuncs code block, including all timers in the “Additional Code” regions, add macros in the following way:
{
BL_PROFILE_REGION("MyFuncs");
<Additional Code A>
{
BL_PROFILE("MyFunc0");
MyFunc_0(arg);
}
<Additional Code B>
{
BL_PROFILE("MyFunc1");
MyFunc_1(arg);
<Additional Code C>
}
}
The MyFuncs
region appears in the Tiny Profiler output as an additional table.
The following output example, mimics the above code. In it, the region is
indicated by REG::MyFuncs
.
BEGIN REGION MyFuncs
-------------------------------------------------------------
Name NCalls Excl. Min Excl. Avg Excl. Max Max %
-------------------------------------------------------------
MyFunc0 1000 4.402 4.402 4.402 14.19%
MyFunc1 1000 4.39 4.39 4.39 14.15%
REG::MyFuncs 1000 0.0168 0.0168 0.0168 0.05%
-------------------------------------------------------------
-------------------------------------------------------------
Name NCalls Incl. Min Incl. Avg Incl. Max Max %
-------------------------------------------------------------
REG::MyFuncs 1000 8.809 8.809 8.809 28.39%
MyFunc0 1000 4.402 4.402 4.402 14.19%
MyFunc1 1000 4.39 4.39 4.39 14.15%
-------------------------------------------------------------
END REGION MyFuncs
Named Regions
If using the Full Profiler, named region objects are also available.
Named regions allow control of start and stop points without relying on scope.
These macros use slightly modified
_VAR_
, _START_
and _STOP_
formatting. The first
argument is the name, followed by the profile variable. Names
for each section can differ, but because the profiler variable will be used
to group the sections into a region, it must be the same.
Consider the following example:
{
BL_PROFILE_REGION_VAR("RegionAC",reg_ac);
<Code Block A>
BL_PROFILE_REGION_VAR_STOP("RegionAC", reg_ac);
{
MyFunc_0(arg);
}
BL_PROFILE_REGION_VAR("RegionB", reg_b)
<Code Block B>
BL_PROFILE_REGION_VAR_STOP("RegionB", reg_b);
{
MyFunc_1(arg);
BL_PROFILE_REGION_VAR_START("SecondRegionAC", reg_ac);
<Code Block C>
BL_PROFILE_REGION_VAR_STOP("SecondRegionAC", reg_ac);
}
}
Here, <Code Block A>
and <Code Block C>
are
grouped into one region labeled “RegionAC” for profiling. <Code Block B>
is isolated in its own group.
Any timers inside MyFunc_0
and MyFunc_1
are not included in the
region groupings.
Instrumenting Fortran90 Code
When using the full profiler, Fortran90 functions can also be instrumented with the following calls:
call bl_proffortfuncstart("my_function")
...
call bl_proffortfuncstop("my_function")
Note that the start and stop calls must be matched before leaving the
scope of the corresponding start. Moreover, it is necessary to take into
account all possible code paths. Therefore, you may need to add bl_proffortfuncstop
in multiple locations, such as before any returns, at the end of the function
and at the point in the function where you want to stop profiling. The profiling
output will only warn of any bl_proffortfuncstart
calls that were not stopped with
bl_proffortfuncstop
calls when in debug mode.
For functions with a high number of calls, there is a lighter-weight interface,
call bl_proffortfuncstart_int(n)
...
call bl_proffortfuncstop_int(n)
where n
is an integer in the range [1,mFortProfsIntMaxFuncs]
.
mFortProfsIntMaxFuncs
is currently set to 32. The profiled
function will be named FORTFUNC_n
in the profiler output,
unless you rename it with BL_PROFILE_CHANGE_FORT_INT_NAME(fname, int)
where fname
is a std::string
and int
is the integer n
in the bl_proffortfuncstart_int/bl_proffortfuncstop_int
calls.
BL_PROFILE_CHANGE_FORT_INT_NAME
should be called in main()
.
Warning
Fortran functions cannot be profiled when using the Tiny Profiler. You will need to turn on the Full Profiler to receive the results from fortran instrumentation.
Profiling Options
AMReX’s communication algorithms are often regions of code that increase in wall clock time
when the application is load imbalanced, due to the MPI_Wait calls in these functions.
To better understand if this is occurring and by how much, you can turn on an AMReX timed
synchronization with the runtime variable: amrex.use_profiler_syncs=1
This adds named timers
beginning with SyncBeforeComms
immediately prior to the start of the FillBoundary,
ParallelCopy and particle Redistribute functions, isolating any prior load imbalance to that timer
before beginning the comm operation.
This is a diagnostic tool and may slow your code down, so it is not recommended to turn this on for production runs.
Note
Note: the SyncBeforeComms
timer is not equal to your load imbalance. It only captures imbalance
between the comm functions and the previous sync point; there may be other load imbalances
captured elsewhere. Also, the timer reports in terms of MPI rank, so if the most imbalanced
rank changes throughout the simulation, the timer will be an underestimation.
The effect on the communication timers may be more helpful: they will show the time to complete communications if there was no load imbalance. This means the difference between a case with and without this profiler sync may be a more useful metric for analysis.
AMRProfParser
AMRProfParser
is a tool for processing and analyzing the bl_prof
database. It is a command line application that can create performance
summaries, plotfiles showing point-to-point communication and timelines, HTML
call trees, communication call statistics, function timing graphs, and other
data products. The parser’s data services functionality can be called from an
interactive environment such as Amrvis, from a sidecar for dynamic performance
optimization, and from other utilities such as the command line version of the
parser itself. It has been integrated into Amrvis for visual interpretation of
the data allowing Amrvis to open the bl_prof
database like a plotfile but with
interfaces appropriate to profiling data. AMRProfParser and Amrvis can be run
in parallel both interactively and in batch mode.