Crate packed_simd_2
Portable packed SIMD vectors
This crate is proposed for stabilization as `std::packed_simd` in RFC 2366: `std::simd`.
The examples available in the `examples/` sub-directory of the crate showcase how to use the library in practice.
Table of contents

- Introduction
- Vector types
- Basic operations
- Conditional operations
- Conversions
- Hardware Features
Introduction
This crate exports `Simd<[T; N]>`, a packed vector of `N` elements of type `T`, as well as many type aliases for this type: for example, `f32x4`, which is just an alias for `Simd<[f32; 4]>`.
The operations on packed vectors are, by default, “vertical”, that is, they are applied to each vector lane in isolation from the others:
```rust
let a = i32x4::new(1, 2, 3, 4);
let b = i32x4::new(5, 6, 7, 8);
assert_eq!(a + b, i32x4::new(6, 8, 10, 12));
```
Many “horizontal” operations are also provided:
```rust
let a = i32x4::new(1, 2, 3, 4);
assert_eq!(a.wrapping_sum(), 10);
```
On virtually all architectures vertical operations are fast, while horizontal operations are, by comparison, much slower. That is, the most portably-efficient way of performing a reduction over a slice is to collect the results into a vector using vertical operations, and then to perform a single horizontal operation at the end:
```rust
fn reduce(x: &[i32]) -> i32 {
    assert!(x.len() % 4 == 0);
    let mut sum = i32x4::splat(0); // [0, 0, 0, 0]
    for i in (0..x.len()).step_by(4) {
        sum += i32x4::from_slice_unaligned(&x[i..]);
    }
    sum.wrapping_sum()
}

let x = [0, 1, 2, 3, 4, 5, 6, 7];
assert_eq!(reduce(&x), 28);
```
Vector types
The vector type aliases are named according to the following scheme:
```
{element_type}x{number_of_lanes} == Simd<[element_type; number_of_lanes]>
```
where the following element types are supported:
- `i{element_width}`: signed integer
- `u{element_width}`: unsigned integer
- `f{element_width}`: float
- `m{element_width}`: mask (see below)
- `*{const,mut} T`: `const` and `mut` pointers
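As a small sketch of the scheme (assuming the crate is used under its package name `packed_simd_2`), the aliases are plain type aliases and are interchangeable with the underlying `Simd` type:

```rust
use packed_simd_2::{f32x4, Simd};

// `f32x4` is just an alias for `Simd<[f32; 4]>`, so both spellings name
// exactly the same type:
let a: f32x4 = f32x4::splat(1.0);
let b: Simd<[f32; 4]> = a;
assert_eq!(a, b);
```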
Basic operations
```rust
// Sets all elements to `0`:
let a = i32x4::splat(0);

// Reads a vector from a slice:
let mut arr = [0, 0, 0, 1, 2, 3, 4, 5];
let b = i32x4::from_slice_unaligned(&arr);

// Reads the 4-th element of a vector:
assert_eq!(b.extract(3), 1);

// Returns a new vector where the 4-th element is replaced with `1`:
let a = a.replace(3, 1);
assert_eq!(a, b);

// Writes a vector to a slice:
let a = a.replace(2, 1);
a.write_to_slice_unaligned(&mut arr[4..]);
assert_eq!(arr, [0, 0, 0, 1, 0, 0, 1, 1]);
```
Conditional operations
One often needs to perform an operation on some lanes of the vector. Vector masks, like `m32x4`, allow selecting on which vector lanes an operation is to be performed:
```rust
let a = i32x4::new(1, 1, 2, 2);

// Add `1` to the first two lanes of the vector.
let m = m16x4::new(true, true, false, false);
let a = m.select(a + 1, a);

assert_eq!(a, i32x4::splat(2));
```
The elements of a vector mask are either `true` or `false`. Here `true` means that a lane is “selected”, while `false` means that a lane is not selected.

All vector masks implement a `mask.select(a: T, b: T) -> T` method that works on all vectors that have the same number of lanes as the mask. The resulting vector contains the elements of `a` for those lanes for which the mask is `true`, and the elements of `b` otherwise.

The example constructs a mask with the first two lanes set to `true` and the last two lanes set to `false`. This selects the first two lanes of `a + 1` and the last two lanes of `a`, producing a vector where the first two lanes have been incremented by `1`.
Note: `mask.select` can be used on vector types that have the same number of lanes as the mask. The example shows this by using `m16x4` instead of `m32x4`. It is typically more performant to use a mask element width equal to the element width of the vectors being operated upon. This is, however, not true for 512-bit wide vectors when targeting AVX-512, where the most efficient masks use only 1 bit per element.
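For illustration only, here is a sketch of the same selection written with a width-matched mask (assuming the crate is imported as `packed_simd_2`):

```rust
use packed_simd_2::{i32x4, m32x4};

let a = i32x4::new(1, 1, 2, 2);

// `m32x4` has both the same lane count and the same element width as
// `i32x4`, which is usually the more performant mask choice (outside of
// the AVX-512 case mentioned above).
let m = m32x4::new(true, true, false, false);
assert_eq!(m.select(a + 1, a), i32x4::splat(2));
```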
All vertical comparison operations return masks:
```rust
let a = i32x4::new(1, 1, 3, 3);
let b = i32x4::new(2, 2, 0, 0);

// ge: >= (greater than or equal to; see also lt, le, gt, eq, ne).
let m = a.ge(i32x4::splat(2));

if m.any() {
    // all / any / none allow coherent control flow
    let d = m.select(a, b);
    assert_eq!(d, i32x4::new(2, 2, 3, 3));
}
```
Conversions
- Lossless widening conversions: `From`/`Into` are implemented for vectors with the same number of lanes when the conversion is value preserving (same as in `std`).

- Safe bitwise conversions: the cargo feature `into_bits` provides the `IntoBits`/`FromBits` traits (`x.into_bits()`). These perform safe bitwise `transmute`s when all bit patterns of the source type are valid bit patterns of the target type, and are also implemented for the architecture-specific vector types of `std::arch`. For example, `let x: u8x8 = m8x8::splat(true).into_bits();` is provided because all `m8x8` bit patterns are valid `u8x8` bit patterns. However, the opposite is not true: not all `u8x8` bit patterns are valid `m8x8` bit patterns, so this operation cannot be performed safely using `x.into_bits()`; one needs to use `unsafe { crate::mem::transmute(x) }` for that, making sure that the value in the `u8x8` is a valid bit pattern of `m8x8`.

- Numeric casts (`as`): these are performed using `FromCast`/`Cast` (`x.cast()`), just like `as`:
  - casting integer vectors whose lane types have the same size (e.g. `i32xN` -> `u32xN`) is a no-op,
  - casting from a larger integer to a smaller integer (e.g. `u32xN` -> `u8xN`) will truncate,
  - casting from a smaller integer to a larger integer (e.g. `u8xN` -> `u32xN`) will:
    - zero-extend if the source is unsigned, or
    - sign-extend if the source is signed,
  - casting from a float to an integer will round the float towards zero,
  - casting from an integer to a float will produce the floating-point representation of the integer, rounding to nearest, ties to even,
  - casting from an `f32` to an `f64` is perfect and lossless,
  - casting from an `f64` to an `f32` rounds to nearest, ties to even.

  Note that numeric casts are not very “precise”: they are sometimes lossy and sometimes value preserving.
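A rough sketch tying the three conversion families together (the `into_bits()` line assumes the `into_bits` cargo feature is enabled; the concrete lane types are chosen purely for illustration):

```rust
use packed_simd_2::*;

// Lossless widening via `From`/`Into`: every `i8` value fits in an `i16`.
let widened: i16x4 = i8x4::new(1, -2, 3, -4).into();
assert_eq!(widened, i16x4::new(1, -2, 3, -4));

// Safe bitwise conversion (cargo feature `into_bits`): every `m8x8`
// bit pattern is a valid `u8x8` bit pattern.
let _bits: u8x8 = m8x8::splat(true).into_bits();

// Numeric cast behaving like `as` on each lane: floats round towards zero.
let truncated: i32x4 = f32x4::new(1.9, -2.9, 3.5, -4.5).cast();
assert_eq!(truncated, i32x4::new(1, -2, 3, -4));
```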
Hardware Features
This crate can use different hardware features based on your configured `RUSTFLAGS`. For example, with no configured `RUSTFLAGS`, `u64x8` on x86_64 will use SSE2 operations like `PCMPEQD`. If you configure `RUSTFLAGS='-C target-feature=+avx2,+avx'` on supported x86_64 hardware, the same `u64x8` may use wider AVX2 operations like `VPCMPEQQ`. It is important for performance and for hardware support requirements that you choose an appropriate set of `target-feature` and `target-cpu` options during builds. For more information, see the Performance guide.
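As a sketch of what this means in source code (the function below is illustrative and not part of the crate; only the generated machine code changes between builds):

```rust
use packed_simd_2::u64x8;

// The portable source stays the same across builds; the instructions the
// compiler emits for the lane-wise comparison depend on the enabled target
// features (e.g. SSE2 by default on x86_64, or wider AVX2 compares such as
// VPCMPEQQ when built with `RUSTFLAGS='-C target-feature=+avx2,+avx'`).
pub fn any_equal(a: u64x8, b: u64x8) -> bool {
    a.eq(b).any()
}
```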
Macros
| Name | Description |
|---|---|
| `shuffle` | Shuffles vector elements. |
Structs
| Name | Description |
|---|---|
| `LexicographicallyOrdered` | Wrapper over `T` implementing a lexicographical order via the `PartialOrd` and/or `Ord` traits. |
| `Simd` | Packed SIMD vector type. |
| `m8` | 8-bit wide mask. |
| `m16` | 16-bit wide mask. |
| `m32` | 32-bit wide mask. |
| `m64` | 64-bit wide mask. |
| `m128` | 128-bit wide mask. |
| `msize` | `isize`-wide mask. |
Traits
| Name | Description |
|---|---|
| `Cast` | Numeric cast from `Self` to `T`. |
| `FromBits` | Safe lossless bitwise conversion from `T` to `Self`. |
| `FromCast` | Numeric cast from `T` to `Self`. |
| `IntoBits` | Safe lossless bitwise conversion from `Self` to `T`. |
| `Mask` | This trait is implemented by all mask types. |
| `SimdArray` | Trait implemented by arrays that can be SIMD types. |
| `SimdVector` | This trait is implemented by all SIMD vector types. |
Type Definitions
| Name | Description |
|---|---|
| `cptrx2` | A vector with 2 `*const T` lanes. |
| `cptrx4` | A vector with 4 `*const T` lanes. |
| `cptrx8` | A vector with 8 `*const T` lanes. |
| `f32x2` | A 64-bit vector with 2 `f32` lanes. |
| `f32x4` | A 128-bit vector with 4 `f32` lanes. |
| `f32x8` | A 256-bit vector with 8 `f32` lanes. |
| `f32x16` | A 512-bit vector with 16 `f32` lanes. |
| `f64x2` | A 128-bit vector with 2 `f64` lanes. |
| `f64x4` | A 256-bit vector with 4 `f64` lanes. |
| `f64x8` | A 512-bit vector with 8 `f64` lanes. |
| `i8x2` | A 16-bit vector with 2 `i8` lanes. |
| `i8x4` | A 32-bit vector with 4 `i8` lanes. |
| `i8x8` | A 64-bit vector with 8 `i8` lanes. |
| `i8x16` | A 128-bit vector with 16 `i8` lanes. |
| `i8x32` | A 256-bit vector with 32 `i8` lanes. |
| `i8x64` | A 512-bit vector with 64 `i8` lanes. |
| `i16x2` | A 32-bit vector with 2 `i16` lanes. |
| `i16x4` | A 64-bit vector with 4 `i16` lanes. |
| `i16x8` | A 128-bit vector with 8 `i16` lanes. |
| `i16x16` | A 256-bit vector with 16 `i16` lanes. |
| `i16x32` | A 512-bit vector with 32 `i16` lanes. |
| `i32x2` | A 64-bit vector with 2 `i32` lanes. |
| `i32x4` | A 128-bit vector with 4 `i32` lanes. |
| `i32x8` | A 256-bit vector with 8 `i32` lanes. |
| `i32x16` | A 512-bit vector with 16 `i32` lanes. |
| `i64x2` | A 128-bit vector with 2 `i64` lanes. |
| `i64x4` | A 256-bit vector with 4 `i64` lanes. |
| `i64x8` | A 512-bit vector with 8 `i64` lanes. |
| `i128x1` | A 128-bit vector with 1 `i128` lane. |
| `i128x2` | A 256-bit vector with 2 `i128` lanes. |
| `i128x4` | A 512-bit vector with 4 `i128` lanes. |
| `isizex2` | A vector with 2 `isize` lanes. |
| `isizex4` | A vector with 4 `isize` lanes. |
| `isizex8` | A vector with 8 `isize` lanes. |
| `m8x2` | A 16-bit vector mask with 2 `m8` lanes. |
| `m8x4` | A 32-bit vector mask with 4 `m8` lanes. |
| `m8x8` | A 64-bit vector mask with 8 `m8` lanes. |
| `m8x16` | A 128-bit vector mask with 16 `m8` lanes. |
| `m8x32` | A 256-bit vector mask with 32 `m8` lanes. |
| `m8x64` | A 512-bit vector mask with 64 `m8` lanes. |
| `m16x2` | A 32-bit vector mask with 2 `m16` lanes. |
| `m16x4` | A 64-bit vector mask with 4 `m16` lanes. |
| `m16x8` | A 128-bit vector mask with 8 `m16` lanes. |
| `m16x16` | A 256-bit vector mask with 16 `m16` lanes. |
| `m16x32` | A 512-bit vector mask with 32 `m16` lanes. |
| `m32x2` | A 64-bit vector mask with 2 `m32` lanes. |
| `m32x4` | A 128-bit vector mask with 4 `m32` lanes. |
| `m32x8` | A 256-bit vector mask with 8 `m32` lanes. |
| `m32x16` | A 512-bit vector mask with 16 `m32` lanes. |
| `m64x2` | A 128-bit vector mask with 2 `m64` lanes. |
| `m64x4` | A 256-bit vector mask with 4 `m64` lanes. |
| `m64x8` | A 512-bit vector mask with 8 `m64` lanes. |
| `m128x1` | A 128-bit vector mask with 1 `m128` lane. |
| `m128x2` | A 256-bit vector mask with 2 `m128` lanes. |
| `m128x4` | A 512-bit vector mask with 4 `m128` lanes. |
| `mptrx2` | A vector with 2 `*mut T` lanes. |
| `mptrx4` | A vector with 4 `*mut T` lanes. |
| `mptrx8` | A vector with 8 `*mut T` lanes. |
| `msizex2` | A vector mask with 2 `msize` lanes. |
| `msizex4` | A vector mask with 4 `msize` lanes. |
| `msizex8` | A vector mask with 8 `msize` lanes. |
| `u8x2` | A 16-bit vector with 2 `u8` lanes. |
| `u8x4` | A 32-bit vector with 4 `u8` lanes. |
| `u8x8` | A 64-bit vector with 8 `u8` lanes. |
| `u8x16` | A 128-bit vector with 16 `u8` lanes. |
| `u8x32` | A 256-bit vector with 32 `u8` lanes. |
| `u8x64` | A 512-bit vector with 64 `u8` lanes. |
| `u16x2` | A 32-bit vector with 2 `u16` lanes. |
| `u16x4` | A 64-bit vector with 4 `u16` lanes. |
| `u16x8` | A 128-bit vector with 8 `u16` lanes. |
| `u16x16` | A 256-bit vector with 16 `u16` lanes. |
| `u16x32` | A 512-bit vector with 32 `u16` lanes. |
| `u32x2` | A 64-bit vector with 2 `u32` lanes. |
| `u32x4` | A 128-bit vector with 4 `u32` lanes. |
| `u32x8` | A 256-bit vector with 8 `u32` lanes. |
| `u32x16` | A 512-bit vector with 16 `u32` lanes. |
| `u64x2` | A 128-bit vector with 2 `u64` lanes. |
| `u64x4` | A 256-bit vector with 4 `u64` lanes. |
| `u64x8` | A 512-bit vector with 8 `u64` lanes. |
| `u128x1` | A 128-bit vector with 1 `u128` lane. |
| `u128x2` | A 256-bit vector with 2 `u128` lanes. |
| `u128x4` | A 512-bit vector with 4 `u128` lanes. |
| `usizex2` | A vector with 2 `usize` lanes. |
| `usizex4` | A vector with 4 `usize` lanes. |
| `usizex8` | A vector with 8 `usize` lanes. |