Machine code analysis tools

The microarchitecture of modern CPUs

While you might have heard of Instruction Set Architectures such as x86, ARM, or MIPS, the term microarchitecture (also written here as µ-arch) refers to the internal details of an actual family of CPUs, such as Intel's Haswell or AMD's Jaguar.

Replacing scalar code with SIMD code will generally improve performance on all CPUs supporting the required vector extensions. However, due to microarchitectural differences, the actual speed-up at runtime may vary from one CPU family to another.

Example: when optimizing for AMD K8 CPUs, the assembly generated for an empty function should look like this:

nop
ret

The nop is used to align the ret instruction for better performance. However, the compiler will actually generate the following code:

repz ret

repz is a prefix rather than an instruction: it normally repeats the string instruction that follows it until a certain condition is met. It has no defined effect on ret, and in practice CPUs ignore it, so the function simply returns immediately, and the ret instruction is still aligned. However, AMD K8's branch predictor performs better with the latter code.

If you want to maximize performance for a specific target µ-arch, you will have to read some CPU manuals, or ask the compiler to do it for you with -C target-cpu.
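
One way to see what -C target-cpu changed is to ask the compiler which target features were enabled at build time. The sketch below is only an illustration: the function name enabled_features and the hand-picked list of features are assumptions made for this example, and it only reports x86 features. Building with rustc -C target-cpu=native (or with a specific CPU name such as haswell) should enable more of them than a default build does.

```rust
/// Returns the names of a few x86 SIMD features that were enabled at
/// compile time. The list is a hand-picked illustration, not exhaustive.
fn enabled_features() -> Vec<&'static str> {
    let candidates = [
        ("sse2", cfg!(target_feature = "sse2")),
        ("sse4.2", cfg!(target_feature = "sse4.2")),
        ("avx", cfg!(target_feature = "avx")),
        ("avx2", cfg!(target_feature = "avx2")),
        ("fma", cfg!(target_feature = "fma")),
    ];
    candidates
        .iter()
        .filter(|(_, on)| *on) // keep only the features that are switched on
        .map(|(name, _)| *name)
        .collect()
}

fn main() {
    println!("compile-time target features: {:?}", enabled_features());
}
```

Note that cfg! reflects what the compiler was allowed to assume, not what the machine running the binary supports.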

Summary of CPU internals

Modern processors execute instructions out of order for better performance, and combine this with techniques such as branch prediction, instruction pipelining, and superscalar execution.

SIMD instructions are also subject to these optimizations, meaning it can get pretty difficult to determine where the slowdown happens. For example, if the profiler reports a store operation is slow, one of two things could be happening:

the store is limited by the CPU's memory bandwidth, which is actually an ideal scenario, all things considered;
memory bandwidth is nowhere near its peak, but the value to be stored is at the end of a long chain of operations, and this store is where the profiler encountered the pipeline stall.
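
The second case is easier to see in code. In the sketch below (the function names and the choice of four accumulators are assumptions made for illustration), sum_chained makes every addition depend on the previous one, so the final store has to wait for the entire chain, while sum_split keeps four independent chains that an out-of-order core can overlap. In both versions a profiler would likely blame the store into out, since that is where the wait becomes visible.

```rust
/// One long dependency chain: each add needs the previous result.
fn sum_chained(xs: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for &x in xs {
        acc += x;
    }
    acc
}

/// Four independent chains, combined only at the end.
fn sum_split(xs: &[f32]) -> f32 {
    let mut acc = [0.0f32; 4];
    let chunks = xs.chunks_exact(4);
    let tail = chunks.remainder(); // elements that do not fill a chunk of 4
    for c in chunks {
        for i in 0..4 {
            acc[i] += c[i];
        }
    }
    let mut total = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for &x in tail {
        total += x;
    }
    total
}

fn main() {
    let data: Vec<f32> = (1..=10).map(|i| i as f32).collect();
    let mut out = [0.0f32; 2];
    out[0] = sum_chained(&data); // store waits on a chain of 10 adds
    out[1] = sum_split(&data); // store waits on much shorter chains
    println!("{:?}", out);
}
```

Because floating-point addition is not associative, the two functions can differ in the last bits for other inputs, which is one reason a compiler generally will not make this transformation on its own.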

Since most profilers are simple tools which don't understand the subtleties of instruction scheduling, you

Analyzing the machine code

Certain tools have knowledge of internal CPU microarchitecture, i.e. they know

how many physical register files a CPU actually has
what the latency / throughput of an instruction is
what µ-ops are generated for a set of instructions

and many other architectural details.
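
To make the latency / throughput distinction concrete, here is a back-of-the-envelope estimate. The numbers are hypothetical (an instruction with a latency of 4 cycles and a reciprocal throughput of 0.5 cycles), and the struct and function names are made up for this sketch; real analysis tools model far more than this.

```rust
/// Hypothetical timing data for one instruction.
struct Timing {
    /// Cycles until the result is available to a dependent instruction.
    latency: f64,
    /// Average cycles between issuing two independent instances.
    reciprocal_throughput: f64,
}

/// Rough cycle estimate for `n` instructions that each depend on the previous one.
fn dependent_chain_cycles(t: &Timing, n: u32) -> f64 {
    f64::from(n) * t.latency
}

/// Rough cycle estimate for `n` fully independent instructions.
fn independent_cycles(t: &Timing, n: u32) -> f64 {
    f64::from(n) * t.reciprocal_throughput
}

fn main() {
    // Illustrative values only, not taken from any CPU manual.
    let add = Timing { latency: 4.0, reciprocal_throughput: 0.5 };
    println!("dependent:   {} cycles", dependent_chain_cycles(&add, 100));
    println!("independent: {} cycles", independent_cycles(&add, 100));
}
```

Under these assumed numbers the same 100 instructions differ by a factor of eight depending only on how they depend on each other.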

These tools are therefore able to provide accurate information as to why some instructions are inefficient, and where the bottleneck is.

The disadvantage is that understanding the output of these tools requires advanced knowledge of the target architecture: they cannot explicitly point out the cause of the issue.

Intel's Architecture Code Analyzer (IACA)

IACA is a free tool offered by Intel for analyzing the performance of various computational kernels.

Being a proprietary, closed-source tool, it only supports Intel's µ-arches.

Rust SIMD Performance Guide
