perf is the most powerful performance profiler for Linux, featuring support for various hardware Performance Monitoring Units, as well as integration with the kernel's performance events framework.
We will only look at how can the
perf command can be used to profile SIMD code.
Full system profiling is outside of the scope of this book.
The first step is to record a program's execution during an average workload. It helps if you can isolate the parts of your program which have performance issues, and set up a benchmark which can be easily (re)run.
Build the benchmark binary in release mode, after having enabled debug info:
$ cargo build --release Finished release [optimized + debuginfo] target(s) in 0.02s
Then use the
perf record subcommand:
$ perf record --call-graph=dwarf ./target/release/my-program [ perf record: Woken up 10 times to write data ] [ perf record: Captured and wrote 2,356 MB perf.data (292 samples) ]
Instead of using
--call-graph=dwarf, which can become pretty slow, you can use
--call-graph=lbr if you have a processor with support for Last Branch Record
(i.e. Intel Haswell and newer).
perf will, by default, record the count of CPU cycles it takes to execute
various parts of your program. You can use the
-e command line option
to enable other performance events, such as
to get a list of all hardware counters supported by your CPU.
The next step is getting a bird's eye view of the program's execution.
perf provides a
ncurses-based interface which will get you started.
perf report to open a visualization of your program's performance:
perf report --hierarchy -M intel
--hierarchy will display a tree-like structure of where your program spent
most of its time.
-M intel enables disassembly output with Intel syntax, which
is subjectively more readable than the default AT&T syntax.
Here is the output from profiling the
- 100,00% nbody - 94,18% nbody + 93,48% [.] nbody_lib::simd::advance + 0,70% [.] nbody_lib::run + 5,06% libc-2.28.so
If you move with the arrow keys to any node in the tree, you can the press
perf annotate that node. This means it will:
disassemble the function
associate every instruction with the percentage of time which was spent executing it
interleaves the disassembly with the source code, assuming it found the debug symbols (you can use
sto toggle this behaviour)
perf will, by default, open the instruction which it identified as being the
hottest spot in the function:
0,76 │ movapd xmm2,xmm0 0,38 │ movhlps xmm2,xmm0 │ addpd xmm2,xmm0 │ unpcklpd xmm1,xmm2 12,50 │ sqrtpd xmm0,xmm1 1,52 │ mulpd xmm0,xmm1
In this case,
sqrtpd will be highlighted in red, since that's the instruction
which the CPU spends most of its time executing.
Valgrind is a set of tools which initially helped C/C++ programmers find unsafe memory accesses in their code. Nowadays the project also has
a heap profiler called
a cache utilization profiler called
a call-graph performance profiler called