Introduction
What is SIMD
History of SIMD in Rust
Discover packed_simd
Writing fast and portable SIMD algorithms using packed_simd is, unfortunately, not trivial. There are many pitfalls that one should be aware of, and some idioms that help avoid those pitfalls.
This book attempts to document these best practices and provides practical examples of how to apply them to your code.
Floating-point math
This chapter contains information pertaining to working with floating-point numbers.
Short Vector Math Library
Approximate functions
Fused Multiply Add
Enabling target features
Not all processors of a certain architecture will have SIMD processing units, and using a SIMD instruction which is not supported will trigger undefined behavior.
To allow building safe, portable programs, the Rust compiler will not, by default,
generate any sort of vector instructions, unless it can statically determine
they are supported. For example, on AMD64, SSE2 support is architecturally guaranteed.
The x86_64-apple-darwin
target enables up to SSSE3. To get a definitive list of
which features are enabled by default on various platforms, refer to the target
specifications in the compiler's source code.
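You can also ask the compiler directly which features it enables by default for a given target. For example (assuming the grep filter is available on your system):
$ rustc --print cfg --target=x86_64-apple-darwin | grep target_feature
This prints one target_feature="..." line per feature enabled by default.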
Using RUSTFLAGS
One of the easiest ways to benefit from SIMD is to allow the compiler to generate code using certain vector instruction extensions.
The environment variable RUSTFLAGS
can be used to pass options for code
generation to the Rust compiler. These flags will affect all compiled crates.
There are two flags which can be used to enable specific vector extensions:
target-feature
- Syntax: -C target-feature=<features>
- Provides the compiler with a comma-separated set of instruction extensions to enable.
  Example: Use -C target-feature=+sse3,+avx to enable generating instructions for Streaming SIMD Extensions 3 and Advanced Vector Extensions.
- To list target triples for all targets supported by Rust, use: rustc --print target-list
- To list all supported target features for a certain target triple, use: rustc --target=${TRIPLE} --print target-features
- Note that all CPU features are independent, and will have to be enabled individually.
  Example: Setting -C target-feature=+avx2 will not enable fma, even though all CPUs which support AVX2 also support FMA. To enable both, one has to use -C target-feature=+avx2,+fma.
- Some features also depend on other features, which need to be enabled for the target instructions to be generated.
  Example: Unless v7 is specified as the target CPU (see below), to enable NEON on ARM it is necessary to use -C target-feature=+v7,+neon.
target-cpu
- Syntax: -C target-cpu=<cpu>
- Sets the identifier of a CPU family / model for which to build and optimize the code.
  Example: RUSTFLAGS='-C target-cpu=cortex-a75'
- To list all supported target CPUs for a certain target triple, use: rustc --target=${TRIPLE} --print target-cpus
  Example: rustc --target=i686-pc-windows-msvc --print target-cpus
- The compiler will translate this into a list of target features. Therefore, individual feature checks (#[cfg(target_feature = "...")]) will still work properly.
- It will cause the code generator to optimize the generated code for that specific CPU model.
- Using native as the CPU model will cause Rust to generate and optimize code for the CPU running the compiler. It is useful when building programs which you plan to only use locally. This should never be used when the generated programs are meant to be run on other computers, such as when packaging for distribution or cross-compiling.
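To illustrate how these flags interact with feature checks, here is a minimal sketch. The add_slices function is hypothetical; the point is that #[cfg(target_feature = "...")] reacts to whatever RUSTFLAGS enables, whether via target-feature or target-cpu:

```rust
// Hypothetical example: compile-time selection of a SIMD code path.
// Build with e.g. `RUSTFLAGS='-C target-feature=+avx2,+fma' cargo build --release`
// or `RUSTFLAGS='-C target-cpu=native' cargo build --release`.

#[cfg(target_feature = "avx2")]
fn add_slices(a: &[f32], b: &[f32], out: &mut [f32]) {
    // Only compiled when AVX2 is enabled at compile time; the compiler is
    // then free to auto-vectorize this loop using AVX2 instructions.
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

#[cfg(not(target_feature = "avx2"))]
fn add_slices(a: &[f32], b: &[f32], out: &mut [f32]) {
    // Fallback compiled when AVX2 was not enabled.
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

fn main() {
    let a = [1.0_f32; 8];
    let b = [2.0_f32; 8];
    let mut out = [0.0_f32; 8];
    add_slices(&a, &b, &mut out);
    assert!(out.iter().all(|&v| v == 3.0));
}
```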
The target_feature
attribute
Inlining
Detecting host features at runtime
Bounds checking
Reading and writing packed vectors to/from slices is checked by default. Independently of the configuration options used, the safe functions:
- Simd<[T; N]>::from_slice_aligned(&s[..])
- Simd<[T; N]>::write_to_slice_aligned(&mut s[..])
always check that:
- the slice is big enough to hold the vector
- the slice is suitably aligned to perform an aligned load/store for a Simd<[T; N]> (this alignment is often much larger than that of T).
There are _unaligned versions that use unaligned loads and stores, as well as unsafe _unchecked versions that do not perform any checks iff debug-assertions = false / debug = false. That is, the _unchecked methods do still assert size and alignment in debug builds and could also do so in release builds depending on the configuration options.
These assertions often have a significant performance impact, and you should be aware of them.
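The following is a minimal sketch, assuming the packed_simd crate's f32x4 type and the slice methods described above, contrasting the checked and unchecked variants:

```rust
use packed_simd::f32x4;

fn main() {
    let data = [1.0_f32, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];

    // Checked, unaligned load: verifies that the slice is long enough.
    let a = f32x4::from_slice_unaligned(&data[..4]);

    // Unchecked, unaligned load: skips the length check in release builds,
    // so the caller must guarantee the slice holds at least 4 elements.
    let b = unsafe { f32x4::from_slice_unaligned_unchecked(&data[4..]) };

    let mut out = [0.0_f32; 4];
    // Checked, unaligned store back into a slice.
    (a + b).write_to_slice_unaligned(&mut out);
    assert_eq!(out, [6.0, 8.0, 10.0, 12.0]);
}
```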
Vertical and horizontal operations
In SIMD terminology, each vector has a certain "width" (number of lanes). A vector processor is able to perform two kinds of operations on a vector:
- Vertical operations: operate on two vectors of the same width; the result has the same width
Example: vertical addition of two f32x4
vectors
%0 == | 2 | -3.5 | 0 | 7 |
+ + + +
%1 == | 4 | 1.5 | -1 | 0 |
= = = =
%0 + %1 == | 6 | -2 | -1 | 7 |
- Horizontal operations: reduce the elements of two vectors in some way; the result's elements combine information from the two original vectors
Example: horizontal addition of two u64x2
vectors
%0 == | 1 | 3 |
└─+───┘
└───────┐
│
%1 == | 4 | -1 | │
└─+──┘ │
└───┐ │
│ │
┌─────│───┘
▼ ▼
%0 + %1 == | 4 | 3 |
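To make this concrete in code, here is a minimal sketch (assuming the packed_simd vector types used throughout this book) of a vertical addition and a horizontal sum reduction:

```rust
use packed_simd::f32x4;

fn main() {
    let a = f32x4::new(2.0, -3.5, 0.0, 7.0);
    let b = f32x4::new(4.0, 1.5, -1.0, 0.0);

    // Vertical operation: lane-wise addition, one result lane per input lane.
    let v = a + b;
    assert_eq!(v, f32x4::new(6.0, -2.0, -1.0, 7.0));

    // Horizontal operation: reduces all lanes of one vector to a single scalar.
    let s = a.sum();
    assert_eq!(s, 5.5);
}
```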
Performance consideration of horizontal operations
The result of vertical operations, like vector negation: -a
, for a given lane,
does not depend on the result of the operation for the other lanes. The result
of horizontal operations, like the vector sum
reduction: a.sum()
, depends on
the value of all vector lanes.
In virtually all architectures vertical operations are fast, while horizontal operations are, by comparison, very slow.
Consider the following two functions for computing the sum of all f32
values
in a slice:
```rust
use packed_simd::f32x4;

fn fast_sum(x: &[f32]) -> f32 {
    assert!(x.len() % 4 == 0);
    let mut sum = f32x4::splat(0.); // [0., 0., 0., 0.]
    for i in (0..x.len()).step_by(4) {
        sum += f32x4::from_slice_unaligned(&x[i..]);
    }
    sum.sum()
}

fn slow_sum(x: &[f32]) -> f32 {
    assert!(x.len() % 4 == 0);
    let mut sum: f32 = 0.;
    for i in (0..x.len()).step_by(4) {
        sum += f32x4::from_slice_unaligned(&x[i..]).sum();
    }
    sum
}
```
The inner loop over the slice is where the bulk of the work actually happens.
There, the fast_sum
function performs vertical operations into a vector, doing
a single horizontal reduction at the end, while the slow_sum
function performs
horizontal vector operations inside of the loop.
On all widely-used architectures, fast_sum
is a large constant factor faster
than slow_sum
. You can run the slice_sum example and see for yourself. On
the particular machine tested there the algorithm using the horizontal vector
addition is 2.7x slower than the one using vertical vector operations!
Performance profiling
While the rest of the book provides practical advice on how to improve the performance of SIMD code, this chapter is dedicated to performance profiling. Profiling consists of recording a program's execution in order to identify program hotspots.
Important: most profilers require debug information in order to accurately
link the program hotspots back to the corresponding source code lines. Rust will
disable debug info generation by default for optimized builds, but you can change
that in your Cargo.toml
.
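For example, one way to keep debug info in optimized builds is to enable it for the release profile in Cargo.toml:

```toml
[profile.release]
debug = true
```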
Performance profiling on Linux
Using perf
perf is the most powerful performance profiler for Linux, featuring support for various hardware Performance Monitoring Units, as well as integration with the kernel's performance events framework.
We will only look at how the perf command can be used to profile SIMD code.
Full system profiling is outside of the scope of this book.
Recording
The first step is to record a program's execution during an average workload. It helps if you can isolate the parts of your program which have performance issues, and set up a benchmark which can be easily (re)run.
Build the benchmark binary in release mode, after having enabled debug info:
$ cargo build --release
Finished release [optimized + debuginfo] target(s) in 0.02s
Then use the perf record
subcommand:
$ perf record --call-graph=dwarf ./target/release/my-program
[ perf record: Woken up 10 times to write data ]
[ perf record: Captured and wrote 2,356 MB perf.data (292 samples) ]
Instead of using --call-graph=dwarf
, which can become pretty slow, you can use
--call-graph=lbr
if you have a processor with support for Last Branch Record
(i.e. Intel Haswell and newer).
perf
will, by default, record the count of CPU cycles it takes to execute
various parts of your program. You can use the -e
command line option
to enable other performance events, such as cache-misses
. Use perf list
to get a list of all hardware counters supported by your CPU.
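For example, to sample cache misses instead of CPU cycles (assuming the same benchmark binary as above), one could run:
$ perf record -e cache-misses --call-graph=dwarf ./target/release/my-program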
Viewing the report
The next step is getting a bird's eye view of the program's execution.
perf
provides an ncurses
-based interface which will get you started.
Use perf report
to open a visualization of your program's performance:
perf report --hierarchy -M intel
--hierarchy
will display a tree-like structure of where your program spent
most of its time. -M intel
enables disassembly output with Intel syntax, which
is subjectively more readable than the default AT&T syntax.
Here is the output from profiling the nbody
benchmark:
- 100,00% nbody
- 94,18% nbody
+ 93,48% [.] nbody_lib::simd::advance
+ 0,70% [.] nbody_lib::run
+ 5,06% libc-2.28.so
If you move with the arrow keys to any node in the tree, you can then press a
to have perf
annotate that node. This means it will:
-
disassemble the function
-
associate every instruction with the percentage of time which was spent executing it
-
interleave the disassembly with the source code, assuming it found the debug symbols (you can use
s
to toggle this behaviour)
perf
will, by default, open the instruction which it identified as being the
hottest spot in the function:
0,76 │ movapd xmm2,xmm0
0,38 │ movhlps xmm2,xmm0
│ addpd xmm2,xmm0
│ unpcklpd xmm1,xmm2
12,50 │ sqrtpd xmm0,xmm1
1,52 │ mulpd xmm0,xmm1
In this case, sqrtpd
will be highlighted in red, since that's the instruction
which the CPU spends most of its time executing.
Using Valgrind
Valgrind is a set of tools which initially helped C/C++ programmers find unsafe memory accesses in their code. Nowadays the project also has
-
a heap profiler called
massif
-
a cache utilization profiler called
cachegrind
-
a call-graph performance profiler called
callgrind
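For example, assuming the same benchmark binary used earlier, a callgrind profile could be collected and inspected with:
$ valgrind --tool=callgrind ./target/release/my-program
$ callgrind_annotate callgrind.out.<pid>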
Machine code analysis tools
The microarchitecture of modern CPUs
While you might have heard of Instruction Set Architectures, such as x86, arm,
or mips, the term microarchitecture (also written here as µ-arch) refers to the
internal details of an actual family of CPUs, such as Intel's Haswell or AMD's
Jaguar.
Replacing scalar code with SIMD code will improve performance on all CPUs supporting the required vector extensions. However, due to microarchitectural differences, the actual speed-up at runtime might vary.
Example: a simple example arises when optimizing for AMD K8 CPUs. The assembly generated for an empty function should look like this:
nop
ret
The nop
is used to align the ret
instruction for better performance.
However, the compiler will actually generate the following code:
repz ret
The repz
instruction will repeat the following instruction until a certain
condition is met. Of course, in this situation, the function will simply immediately
return, and the ret
instruction is still aligned.
However, AMD K8's branch predictor performs better with the latter code.
For those looking to absolutely maximize performance for a certain target µ-arch,
you will have to read some CPU manuals, or ask the compiler to do it for you
with -C target-cpu
.
Summary of CPU internals
Modern processors are able to execute instructions out-of-order for better performance, by utilizing tricks such as branch prediction, instruction pipelining, or superscalar execution.
SIMD instructions are also subject to these optimizations, meaning it can get pretty difficult to determine where the slowdown happens. For example, if the profiler reports a store operation is slow, one of two things could be happening:
-
the store is limited by the CPU's memory bandwidth, which is actually an ideal scenario, all things considered;
-
memory bandwidth is nowhere near its peak, but the value to be stored is at the end of a long chain of operations, and this store is where the profiler encountered the pipeline stall;
Since most profilers are simple tools which don't understand the subtleties of instruction scheduling, you may need a machine code analysis tool to pinpoint such issues.
Analyzing the machine code
Certain tools have knowledge of internal CPU microarchitecture, i.e. they know
-
how many physical register files a CPU actually has
-
what is the latency / throughput of an instruction
-
what µ-ops are generated for a set of instructions
and many other architectural details.
These tools are therefore able to provide accurate information as to why some instructions are inefficient, and where the bottleneck is.
The disadvantage is that the output of these tools requires advanced knowledge of the target architecture to understand, i.e. they cannot point out what the cause of the issue is explicitly.
Intel's Architecture Code Analyzer (IACA)
IACA is a free tool offered by Intel for analyzing the performance of various computational kernels.
Being a proprietary, closed source tool, it only supports Intel's µ-arches.