# Vertical and horizontal operations

In SIMD terminology, each vector has a certain "width" (number of lanes). A vector processor is able to perform two kinds of operations on a vector:

• Vertical operations: operate on two vectors of the same width, result has same width

Example: vertical addition of two `f32x4` vectors

``````  %0     == | 2 | -3.5 |  0 | 7 |
+     +     +   +
%1     == | 4 |  1.5 | -1 | 0 |
=     =     =   =
%0 + %1  == | 6 |  -2  | -1 | 7 |
``````
• Horizontal operations: reduce the elements of two vectors in some way, the result's elements combine information from the two original ones

Example: horizontal addition of two `u64x2` vectors

``````  %0     == | 1 |  3 |
└─+───┘
└───────┐
│
%1     == | 4 | -1 |  │
└─+──┘    │
└───┐   │
│   │
┌─────│───┘
▼     ▼
%0 + %1  == | 4 |   3 |
``````

## Performance consideration of horizontal operations

The result of vertical operations, like vector negation: `-a`, for a given lane, does not depend on the result of the operation for the other lanes. The result of horizontal operations, like the vector `sum` reduction: `a.sum()`, depends on the value of all vector lanes.

In virtually all architectures vertical operations are fast, while horizontal operations are, by comparison, very slow.

Consider the following two functions for computing the sum of all `f32` values in a slice:

``````
#![allow(unused)]
fn main() {
fn fast_sum(x: &[f32]) -> f32 {
assert!(x.len() % 4 == 0);
let mut sum = f32x4::splat(0.); // [0., 0., 0., 0.]
for i in (0..x.len()).step_by(4) {
sum += f32x4::from_slice_unaligned(&x[i..]);
}
sum.sum()
}

fn slow_sum(x: &[f32]) -> f32 {
assert!(x.len() % 4 == 0);
let mut sum: f32 = 0.;
for i in (0..x.len()).step_by(4) {
sum += f32x4::from_slice_unaligned(&x[i..]).sum();
}
sum
}
}
``````

The inner loop over the slice is where the bulk of the work actually happens. There, the `fast_sum` function perform vertical operations into a vector, doing a single horizontal reduction at the end, while the `slow_sum` function performs horizontal vector operations inside of the loop.

On all widely-used architectures, `fast_sum` is a large constant factor faster than `slow_sum`. You can run the slice_sum example and see for yourself. On the particular machine tested there the algorithm using the horizontal vector addition is 2.7x slower than the one using vertical vector operations!