Rust's Unsafe Code Guidelines Reference
This document is a past effort by the UCG WG to provide a "guide" for writing unsafe code that "recommends" what kinds of things unsafe code can and cannot do, and that documents which guarantees unsafe code may rely on. It is largely abandoned right now. However, the glossary is actively maintained.
Unless stated otherwise, the information in the guide is mostly a "recommendation" and still subject to change.
Glossary
ABI (of a type)
The function call ABI or short ABI of a type defines how it is passed by-value across a function boundary.
Possible ABIs include passing the value directly in zero or more registers, or passing it indirectly as a pointer to the actual data.
The space of all possible ABIs is huge and extremely target-dependent.
Rust therefore generally does not define the ABI of any type; it only defines when two types are ABI-compatible, which means that it is legal to call a function declared with an argument or return type T using a declaration or function pointer with argument or return type U.
Note that ABI compatibility is stricter than layout compatibility.
For instance, #[repr(C)] struct S(i32) is (guaranteed to be) layout-compatible with i32, but it is not ABI-compatible.
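As a small illustration of the layout half of this statement, the sketch below checks that S and i32 agree in size and alignment and reinterprets one as the other. The helper name s_to_i32 is made up for this example. The ABI half cannot be demonstrated safely: calling a function through a pointer with the mismatched signature is exactly what the text rules out.

```rust
use std::mem::{align_of, size_of};

#[repr(C)]
struct S(i32);

/// Hypothetical helper: reinterprets an `S` as an `i32`.
/// This relies only on layout compatibility, not on ABI compatibility.
fn s_to_i32(s: S) -> i32 {
    // SAFETY: `S` is `#[repr(C)]` with a single `i32` field, so both types
    // have the same size and alignment, and every valid `S` is a valid `i32`.
    unsafe { std::mem::transmute::<S, i32>(s) }
}

fn main() {
    assert_eq!(size_of::<S>(), size_of::<i32>());
    assert_eq!(align_of::<S>(), align_of::<i32>());
    assert_eq!(s_to_i32(S(7)), 7);
    // Not shown: casting `fn(S)` to `fn(i32)` and calling it. That would need
    // ABI compatibility, which layout compatibility does not provide.
}
```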
Abstract Byte
The byte is the smallest unit of storage in Rust. Memory allocations are thought of as storing a list of bytes, and at the lowest level each load returns a list of bytes and each store takes a list of bytes and puts it into memory. (The representation relation then defines how to convert between those lists of bytes and higher-level values such as mathematical integers or pointers.)
However, a byte in the Rust Abstract Machine is more complicated than just an integer in 0..256 -- think of it as there being some extra "shadow state" that is relevant for the Abstract Machine execution (in particular, for whether this execution has UB), but that disappears when compiling the program to assembly.
That's why we call it an abstract byte, to distinguish it from the physical machine byte in 0..256.
The most obvious "shadow state" is tracking whether memory is initialized.
See this blog post for details, but the gist of it is that bytes in memory are more like Option<u8>, where None indicates that this byte is uninitialized.
Operations like copy work on that representation, so if you copy from some uninitialized memory into initialized memory, the target memory becomes "de-initialized".
Another piece of shadow state is pointer provenance: the Abstract Machine tracks the "origin" of each pointer value to enforce the rule that a pointer used to access some memory is "based on" the original pointer produced when that memory got allocated.
This provenance must be preserved when the pointer is stored to memory and loaded again later, which implies that abstract bytes must be able to carry provenance.
Without committing to the exact shape of provenance in Rust, we can therefore say that an (abstract) byte in the Rust Abstract Machine looks as follows:
pub enum AbstractByte<Provenance> {
    /// An uninitialized byte.
    Uninit,
    /// An initialized byte with a value in `0..256`,
    /// optionally with some provenance (if it is encoding a pointer).
    Init(u8, Option<Provenance>),
}
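To make the "de-initialization" effect of copying concrete, here is a toy model of memory as a slice of abstract bytes. It is a simplified sketch, not how any real implementation works: provenance is left out, and the names ToyByte and toy_copy are invented for this example.

```rust
/// Simplified abstract byte for illustration (no provenance).
#[derive(Clone, Copy, Debug, PartialEq)]
enum ToyByte {
    Uninit,
    Init(u8),
}

/// Copies `len` abstract bytes from index `src` to index `dst` within `mem`.
/// The shadow state travels with the byte: copying `Uninit` over an
/// initialized byte leaves the target uninitialized.
fn toy_copy(mem: &mut [ToyByte], src: usize, dst: usize, len: usize) {
    mem.copy_within(src..src + len, dst);
}

fn main() {
    let mut mem = [
        ToyByte::Init(1),
        ToyByte::Init(2),
        ToyByte::Uninit,
        ToyByte::Uninit,
    ];
    // Copy the two uninitialized bytes over the two initialized ones.
    toy_copy(&mut mem, 2, 0, 2);
    assert_eq!(mem, [ToyByte::Uninit; 4]);
}
```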
Aliasing
Aliasing occurs when one pointer or reference points to a "span" of memory that overlaps with the span of another pointer or reference. A span of memory is similar to how a slice works: there's a base byte address as well as a length in bytes.
Note: a full aliasing model for Rust, defining when aliasing is allowed and when not, has not yet been defined. The purpose of this definition is to define when aliasing happens, not when it is allowed. The most developed potential aliasing model so far is Stacked Borrows.
Consider the following example:
fn main() {
    let u: u64 = 7_u64;
    let r: &u64 = &u;
    let s: &[u8] = unsafe {
        core::slice::from_raw_parts(&u as *const u64 as *const u8, 8)
    };
    let (head, tail) = s.split_first().unwrap();
}
In this case, both r and s alias each other, since they both point to all of the bytes of u.
However, head and tail do not alias each other: head points to the first byte of u and tail points to the other seven bytes of u after it. Both head and tail alias s; any overlap is sufficient to count as an alias.
The span of a pointer or reference is the size of the value being pointed to or referenced. Depending on the type, you can determine the size as follows:
- For a type T that is Sized, the span length of a pointer or reference to T is found with size_of::<T>().
- When T is not Sized, the story is a little trickier:
  - If you have a reference r, you can use size_of_val(r) to determine the span of the reference.
  - If you have a pointer p, you must unsafely convert that to a reference before you can use size_of_val. There is not currently a safe way to determine the span of a pointer to an unsized type.
The Data layout chapter also has more information on the sizes of different types.
One interesting side effect of these rules is that references and pointers to Zero Sized Types never alias each other, because their span length is always 0 bytes.
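The definition of aliasing as overlapping spans can be written as a short check. The following is a sketch under the assumption that a span is a (base address, length in bytes) pair; the helper names span_of and spans_overlap are made up for this example and are not part of any library.

```rust
use std::mem::size_of_val;

/// Returns the span of a reference as (base address, length in bytes).
fn span_of<T: ?Sized>(r: &T) -> (usize, usize) {
    (r as *const T as *const u8 as usize, size_of_val(r))
}

/// Two spans alias if they share at least one byte.
/// A span of length 0 covers no bytes, so it never aliases anything.
fn spans_overlap(a: (usize, usize), b: (usize, usize)) -> bool {
    let (a_base, a_len) = a;
    let (b_base, b_len) = b;
    a_len > 0 && b_len > 0 && a_base < b_base + b_len && b_base < a_base + a_len
}

fn main() {
    let bytes = [0u8; 8];
    let whole: &[u8; 8] = &bytes;
    let s: &[u8] = &bytes[..];
    let (head, tail) = s.split_first().unwrap();

    assert!(spans_overlap(span_of(whole), span_of(s))); // same eight bytes
    assert!(!spans_overlap(span_of(head), span_of(tail))); // disjoint parts
    assert!(spans_overlap(span_of(head), span_of(s))); // partial overlap counts
    assert!(spans_overlap(span_of(tail), span_of(s)));

    // References to zero-sized types have span length 0 and never alias.
    let unit = ();
    assert!(!spans_overlap(span_of(&unit), span_of(&unit)));
}
```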
It is also important to know that LLVM IR has a noalias attribute that works somewhat differently from this definition. However, that's considered a low-level detail of a particular Rust implementation. When programming Rust, the Abstract Rust Machine is intended to operate according to the definition here.
Allocation
An allocation is a chunk of memory that is addressable from Rust. Allocations are created for objects on the heap, for stack-allocated variables, for globals (statics and consts), but also for objects that do not have Rust-inspectable data such as functions and vtables. An allocation has a contiguous range of memory addresses that it covers, and it can generally only be deallocated all at once. (Though in the future, we might allow allocations with holes, and we might allow growing/shrinking an allocation.) This range can be empty, but even empty allocations have a base address that they are located at. The base address of an allocation is not necessarily unique; but if two distinct allocations have the same base address then at least one of them must be empty.
Pointer arithmetic is generally only possible within an allocation: provenance ensures that each pointer "remembers" which allocation it points to, and accesses are only permitted if the address is in range of the allocation associated with the pointer.
Data inside an allocation is stored as abstract bytes; in particular, allocations do not track which type the data inside them has.
Interior mutability
Interior Mutation means mutating memory where there also exists a live shared reference pointing to the same memory; or mutating memory through a pointer derived from a shared reference. "live" here means a value that will be "used again" later. "derived from" means that the pointer was obtained by casting a shared reference and potentially adding an offset. This is not yet precisely defined, which will be fixed as part of developing a precise aliasing model.
Finding live shared references propagates recursively through references, but not through raw pointers.
So, for example, if data immediately pointed to by a &T or & &mut T is mutated, that's interior mutability.
If data immediately pointed to by a *const T or &*const T is mutated, that's not interior mutability.
Interior mutability refers to the ability to perform interior mutation without causing UB.
All interior mutation in Rust has to happen inside an UnsafeCell, so all data structures that have interior mutability must (directly or indirectly) use UnsafeCell for this purpose.
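A minimal sketch of a type with interior mutability built directly on UnsafeCell is shown below. The Counter type is invented for this example; in real code, std::cell::Cell already provides this behavior.

```rust
use std::cell::UnsafeCell;

/// Hypothetical counter that can be incremented through a shared reference.
struct Counter {
    value: UnsafeCell<u32>,
}

impl Counter {
    fn new() -> Self {
        Counter { value: UnsafeCell::new(0) }
    }

    /// Mutates through `&self`: this is interior mutation.
    fn increment(&self) {
        // SAFETY: `Counter` is not `Sync` (because `UnsafeCell` is not), and
        // no reference to the inner value ever escapes, so nothing else can
        // access the value during this write.
        unsafe { *self.value.get() += 1 }
    }

    fn get(&self) -> u32 {
        // SAFETY: same reasoning as in `increment`.
        unsafe { *self.value.get() }
    }
}

fn main() {
    let c = Counter::new();
    let (r1, r2) = (&c, &c); // two live shared references
    r1.increment();
    r2.increment();
    assert_eq!(c.get(), 2);
}
```

Without the UnsafeCell, writing through a pointer derived from &self would be Undefined Behavior, even if the write happened inside an unsafe block.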
Layout
The layout of a type defines its size and alignment as well as the offsets of its subobjects (e.g. fields of structs/unions/enums/... or elements of arrays, and the discriminant of enums).
Note that layout does not capture everything that there is to say about how a type is represented on the machine; it notably does not include ABI or Niches.
Note: Originally, layout and representation were treated as synonyms, and Rust language features like the #[repr] attribute reflect this.
In this document, layout and representation are not synonyms.
Memory Address
A memory address is an integer value that identifies where in the process' memory some data is stored. This will typically be a virtual address, if the Rust process runs as a regular user-space program. It can also be a physical address for bare-metal / kernel code. Rust doesn't really care either way; the point is that it's an address as understood by the CPU: it's what the load/store instructions need to identify where in memory to perform the load/store.
Note that a pointer in Rust is not just a memory address. A pointer value consists of a memory address and provenance.
Niche
The niche of a type determines invalid bit-patterns that will be used by layout optimizations.
For example, &mut T has at least one niche, the "all zeros" bit-pattern. This niche is used by layout optimizations like "enum discriminant elision" to guarantee that Option<&mut T> has the same size as &mut T.
While all niches are invalid bit-patterns, not all invalid bit-patterns are niches. For example, "all bits uninitialized" is an invalid bit-pattern for &mut T, but this bit-pattern cannot be used by layout optimizations, and is not a niche.
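The effect of the niche can be observed with size_of. The sketch below compares a few types with and without an Option wrapper; it only observes sizes and does not show how the compiler chooses the encoding.

```rust
use std::mem::size_of;

fn main() {
    // `&mut T` and `Box<T>` can never be null, so `None` can be encoded as
    // the all-zeros bit-pattern without a separate discriminant.
    assert_eq!(size_of::<Option<&mut u32>>(), size_of::<&mut u32>());
    assert_eq!(size_of::<Option<Box<u32>>>(), size_of::<Box<u32>>());

    // Every bit-pattern of `u32` is valid, so there is no niche to reuse and
    // `Option<u32>` needs extra space for its discriminant.
    assert!(size_of::<Option<u32>>() > size_of::<u32>());
}
```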
Padding
Padding (of a type T) refers to the space that the compiler leaves between fields of a struct or enum variant to satisfy alignment requirements, and before/after variants of a union or enum to make all variants equally sized.
Padding can be thought of as the type containing secret fields of type [Pad; N] for some hypothetical type Pad (of size 1) with the following properties:
- Pad is valid for any byte, i.e., it has the same validity invariant as MaybeUninit<u8>.
- Copying Pad ignores the source byte, and writes any value to the target byte. Or, equivalently (in terms of Abstract Machine behavior), copying Pad marks the target byte as uninitialized.
Note that padding is a property of the type and not the memory: reading from the padding of an &Foo (by casting to a byte reference) may produce initialized values if the &Foo is pointing to memory that was initialized (for example, if it was originally a byte buffer initialized to 0), but the moment you perform a typed copy out of that reference you will have uninitialized padding bytes in the copy.
We can also define padding in terms of the representation relation:
A byte at index i is a padding byte for type T if, for all values v and lists of bytes b such that v and b are related at T (let's write this Vrel_T(v, b)), changing b at index i to any other byte yields a b' such that v and b' are related (Vrel_T(v, b')).
In other words, the byte at index i is entirely ignored by Vrel_T (the value relation for T), and two lists of bytes that only differ in padding bytes relate to the same value(s), if any.
This definition works fine for product types (structs, tuples, arrays, ...). The desired notion of "padding byte" for enums and unions is still unclear.
Place
A place (called "lvalue" in C and "glvalue" in C++) is the result of computing a place expression. A place is basically a pointer (pointing to some location in memory, potentially carrying provenance), but might contain more information such as size or alignment (the details will have to be determined as the Rust Abstract Machine gets specified more precisely). A place has a type, indicating the type of values that it stores.
The key operations on a place are:
- Storing a value of the same type in it (when it is used on the left-hand side of an assignment).
- Loading a value of the same type from it (through the place-to-value coercion).
- Converting between a place (of type T) and a pointer value (of type &T, &mut T, *const T or *mut T) using the & and * operators. This is also the only way a place can be "stored": by converting it to a value first.
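The three operations can be seen in a few lines of ordinary Rust. This is only an illustrative sketch; the function name place_demo is made up for this example.

```rust
/// Walks through the key operations on the place `x`.
fn place_demo() -> u32 {
    let mut x: u32 = 1;

    // Load: the place expression `x` is used where a value is needed.
    let loaded = x;

    // Store: the place `x` appears on the left-hand side of an assignment.
    x = loaded + 1; // x is now 2

    // Place to pointer value: `&mut x` turns the place into a reference
    // value, which is then coerced to a raw pointer.
    let p: *mut u32 = &mut x;

    // Pointer value back to place: `*p` is a place expression again.
    // SAFETY: `p` was just derived from `x`, which is live and not otherwise
    // accessed while the write happens.
    unsafe { *p += 10 } // x is now 12

    x
}

fn main() {
    assert_eq!(place_demo(), 12);
}
```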
Pointer Provenance
The provenance of a pointer is used to distinguish pointers that point to the same memory address (i.e., pointers that, when cast to usize, will compare equal).
Provenance is extra state that only exists in the Rust Abstract Machine; it is needed to specify program behavior but not present any more when the program runs on real hardware.
In other words, pointers that only differ in their provenance can not be distinguished any more in the final binary (but provenance can influence how the compiler translates the program).
The exact form of provenance in Rust is unclear. It is also unclear whether provenance applies to more than just pointers, i.e., one could imagine integers having provenance as well (so that pointer provenance can be preserved when pointers are cast to an integer and back). In the following, we give some examples of what provenance could look like.
Using provenance to track originating allocation. For example, we have to distinguish pointers to the same location if they originated from different allocations. Cross-allocation pointer arithmetic does not lead to usable pointers, so the Rust Abstract Machine somehow has to remember the original allocation to which a pointer pointed. It could use provenance to achieve this:
fn main() {
    // Let's assume the two allocations here have base addresses 0x100 and 0x200.
    // We write pointer provenance as `@N` where `N` is some kind of ID uniquely
    // identifying the allocation.
    let raw1 = Box::into_raw(Box::new(13u8));
    let raw2 = Box::into_raw(Box::new(42u8));
    let raw2_wrong = raw1.wrapping_add(raw2.wrapping_sub(raw1 as usize) as usize);
    // These pointers now have the following values:
    // raw1 points to address 0x100 and has provenance @1.
    // raw2 points to address 0x200 and has provenance @2.
    // raw2_wrong points to address 0x200 and has provenance @1.
    // In other words, raw2 and raw2_wrong have same *address*...
    assert_eq!(raw2 as usize, raw2_wrong as usize);
    // ...but it would be UB to dereference raw2_wrong, as it has the wrong *provenance*:
    // it points to address 0x200, which is in allocation @2, but the pointer
    // has provenance @1.
}
This kind of provenance also exists in C/C++, but Rust is more permissive by (a) providing a way to do pointer arithmetic across allocation boundaries without causing immediate UB (though, as we have seen, the resulting pointer still cannot be used for locations outside the allocation it originates), and (b) by allowing pointers to always be compared safely, even if their provenance differs. For some more information, see this document proposing a more precise definition of provenance for C.
Using provenance for Rust's aliasing rules. Another example of pointer provenance is the "tag" from Stacked Borrows. For some more information, see this blog post.
Representation (relation)
A representation of a value is a list of (abstract) bytes that is used to store or "represent" that value in memory.
We also sometimes speak of the representation of a type; this should more correctly be called the representation relation as it relates values of this type to lists of bytes that represent this value.
The term "relation" here is used in the mathematical sense: the representation relation is a predicate that, given a value and a list of bytes, says whether this value is represented by that list of bytes (val -> list byte -> Prop).
The relation should be functional for a fixed list of bytes (i.e., every list of bytes has at most one associated value).
It is partial in both directions: not all values have a representation (e.g. the mathematical integer 300 has no representation at type u8), and not all lists of bytes correspond to a value of a specific type (e.g. lists of the wrong size correspond to no value, and the list consisting of the single byte 0x10 corresponds to no value of type bool).
For a fixed value, there can be many representations (e.g., when considering type #[repr(C)] Pair(u8, u16), the second byte is a padding byte so changing it does not affect the value represented by a list of bytes).
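The properties above can be illustrated with a toy version of the relation, written as decode and encode functions. This is a deliberately simplified sketch: it works on plain u8 bytes (no uninitialized bytes, no provenance), it assumes the usual #[repr(C)] layout of Pair(u8, u16) with one padding byte at index 1, and the function names are made up for this example.

```rust
/// Bytes -> value at type `bool`: partial, since only 0x00 and 0x01 decode.
fn decode_bool(bytes: &[u8]) -> Option<bool> {
    match bytes {
        [0] => Some(false),
        [1] => Some(true),
        _ => None, // wrong length, or a byte such as 0x10
    }
}

/// Value -> bytes at type `u8`: partial, since 300 has no representation.
fn encode_u8(n: u32) -> Option<[u8; 1]> {
    u8::try_from(n).ok().map(|b| [b])
}

/// Bytes -> value at type `Pair(u8, u16)`, assuming the layout
/// [u8 field, padding byte, two bytes of the u16 field].
fn decode_pair(bytes: &[u8]) -> Option<(u8, u16)> {
    match bytes {
        // The padding byte is ignored entirely.
        [a, _pad, lo, hi] => Some((*a, u16::from_ne_bytes([*lo, *hi]))),
        _ => None,
    }
}

fn main() {
    assert_eq!(decode_bool(&[0x10]), None);
    assert_eq!(encode_u8(300), None);
    // Two byte lists that differ only in the padding byte represent the
    // same value: one value, many representations.
    assert_eq!(decode_pair(&[1, 0x00, 2, 0]), decode_pair(&[1, 0xFF, 2, 0]));
}
```

Because decoding is a function of the bytes, each list of bytes maps to at most one value, which is the functional property stated above.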
See the value domain for an example how values and representation relations can be made more precise.
Soundness (of code / of a library)
Soundness is a type system concept (actually originating from the study of logics) and means that the type system is "correct" in the sense that well-typed programs actually have the desired properties.
For Rust, this means well-typed programs cannot cause Undefined Behavior.
This promise only extends to safe code however; for unsafe code, it is up to the programmer to uphold this contract.
Accordingly, we say that a library (or an individual function) is sound if it is impossible for safe code to cause Undefined Behavior using its public API. Conversely, the library/function is unsound if safe code can cause Undefined Behavior.
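As a sketch of the distinction, consider two ways of exposing an unchecked slice access. Both function names are hypothetical. The first is sound because no safe caller can trigger Undefined Behavior through it; the second would be unsound as a safe function, so it is declared unsafe and documents its requirement instead.

```rust
/// Sound: the emptiness check guarantees the unchecked access is in bounds
/// for every possible safe caller.
pub fn first_or_zero(v: &[u32]) -> u32 {
    if v.is_empty() {
        0
    } else {
        // SAFETY: `v` is non-empty, so index 0 is in bounds.
        unsafe { *v.get_unchecked(0) }
    }
}

/// Would be unsound if it were a safe `fn`: a safe caller could pass an
/// empty slice. Marking it `unsafe` moves the proof obligation to the caller.
///
/// # Safety
/// `v` must be non-empty.
pub unsafe fn first_unchecked(v: &[u32]) -> u32 {
    // SAFETY: the caller promises that `v` is non-empty.
    unsafe { *v.get_unchecked(0) }
}

fn main() {
    assert_eq!(first_or_zero(&[]), 0);
    assert_eq!(first_or_zero(&[5, 6]), 5);
    // SAFETY: the slice has one element.
    assert_eq!(unsafe { first_unchecked(&[9]) }, 9);
}
```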
Undefined Behavior
Undefined Behavior is a concept of the contract between the Rust programmer and the compiler: The programmer promises that the code exhibits no undefined behavior. In return, the compiler promises to compile the code in a way that the final program does on the real hardware what the source program does according to the Rust Abstract Machine. If it turns out the program does have undefined behavior, the contract is void, and the program produced by the compiler is essentially garbage (in particular, it is not bound by any specification; the program does not even have to be well-formed executable code).
In Rust, the Nomicon and the Reference both have a list of behavior that the language considers undefined. Rust promises that safe code cannot cause Undefined Behavior---the compiler and authors of unsafe code take the burden of this contract on themselves. For unsafe code, however, the burden is still on the programmer.
Also see: Soundness.
Validity and safety invariant
The validity invariant is an invariant that all data must uphold any time it is accessed or copied in a typed manner. This invariant is known to the compiler and exploited by optimizations such as improved enum layout or eliding in-bounds checks.
In terms of MIR statements, "accessed or copied" means whenever an assignment statement is executed. That statement has a type (LHS and RHS must have the same type), and the data being assigned must be valid at that type. Moreover, arguments passed to a function must be valid at the type given in the callee signature, and the return value of a function must be valid at the type given in the caller signature. OPEN QUESTION: Are there more cases where data must be valid?
In terms of code, some data computed by TERM is valid at type T if and only if the following program does not have UB:
fn main() { unsafe {
let t: T = std::mem::transmute(TERM);
} }
The safety invariant is an invariant that safe code may assume all data to uphold. This invariant is used to justify which operations safe code can perform. The safety invariant can be temporarily violated by unsafe code, but must always be upheld when interfacing with unknown safe code. It is not relevant when arguing whether some program has UB, but it is relevant when arguing whether some code safely encapsulates its unsafety -- in other words, it is relevant when arguing whether some library is sound.
In terms of code, some data computed by TERM (possibly constructed from some arguments that can be assumed to satisfy the safety invariant) is safe at type T if and only if the following library function can be safely exposed to arbitrary (safe) code as part of the public library interface:
pub fn make_something(arguments: U) -> T { unsafe {
std::mem::transmute(TERM)
} }
One example of valid-but-unsafe data is a &str or String that's not well-formed UTF-8: the compiler will not run its own optimizations that would cause any trouble here, so unsafe code may temporarily violate the invariant that strings are UTF-8.
However, functions on &str/String may assume the string to be UTF-8, meaning they may cause UB if the string is not UTF-8.
This means that unsafe code violating the UTF-8 invariant must not perform string operations (it may operate on the data as a byte slice though), or else it risks UB.
Moreover, such unsafe code must not return a non-UTF-8 string to the "outside" of its safe abstraction boundary, because that would mean safe code could cause UB by doing bad_function().chars().count().
To summarize: Data must always be valid, but it only must be safe in safe code. For some more information, see this blog post.
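One way to respect the safety invariant at an abstraction boundary is to check it before any data leaves the unsafe code. The sketch below does this for the UTF-8 example; the function name bytes_to_string is hypothetical, and String::from_utf8 already offers the same check in the standard library.

```rust
/// Returns a `String` only if `bytes` is well-formed UTF-8, so the value
/// handed to safe code always satisfies the safety invariant of `String`.
pub fn bytes_to_string(bytes: Vec<u8>) -> Option<String> {
    if std::str::from_utf8(&bytes).is_ok() {
        // SAFETY: the bytes were just checked to be valid UTF-8.
        Some(unsafe { String::from_utf8_unchecked(bytes) })
    } else {
        // Returning the unchecked string here would let safe callers reach
        // UB, for example via `.chars().count()`.
        None
    }
}

fn main() {
    assert_eq!(bytes_to_string(vec![0x68, 0x69]), Some("hi".to_string()));
    assert_eq!(bytes_to_string(vec![0xFF]), None);
}
```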
Value
A value (called "value of the expression" or "rvalue" in C and "prvalue" in C++) is what gets stored in a place, and also the result of computing a value expression. A value has a type, and it denotes the abstract mathematical concept that is represented by data in our programs.
For example, a value of type u8 is a mathematical integer in the range 0..256.
Values can be (according to their type) turned into a list of (abstract) bytes, which is called a representation of the value.
Values are ephemeral; they arise during the computation of an instruction but are only ever persisted in memory through their representation.
(This is comparable to how run-time data in a program is ephemeral and is only ever persisted in serialized form.)
Zero-sized type / ZST
Types with zero size are called zero-sized types, which is abbreviated as "ZST". This document also uses the "1-ZST" abbreviation, which stands for "one-aligned zero-sized type", to refer to zero-sized types with an alignment requirement of 1.
For example, () is a "1-ZST" but [u16; 0] is not because it has an alignment requirement of 2.
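Both properties can be checked with size_of and align_of. The helper names is_zst and is_one_zst below are made up for this sketch.

```rust
use std::mem::{align_of, size_of};

/// True if `T` is a zero-sized type.
fn is_zst<T>() -> bool {
    size_of::<T>() == 0
}

/// True if `T` is zero-sized and has an alignment requirement of 1.
fn is_one_zst<T>() -> bool {
    size_of::<T>() == 0 && align_of::<T>() == 1
}

fn main() {
    assert!(is_one_zst::<()>());
    assert!(is_zst::<[u16; 0]>());
    assert!(!is_one_zst::<[u16; 0]>()); // alignment 2, inherited from u16
    assert!(!is_zst::<u8>());
}
```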
Data layout
Layout of structs and tuples
Disclaimer: This chapter represents the consensus from issues #11 and #12. The statements in here are not (yet) "guaranteed" not to change until an RFC ratifies them.
Tuple types
In general, an anonymous tuple type (T1..Tn) of arity N is laid out "as if" there were a corresponding tuple struct declared in libcore:
#[repr(Rust)]
struct TupleN<P1..Pn:?Sized>(P1..Pn);
In this case, (T1..Tn) would be compatible with TupleN<T1..Tn>.
As discussed below, this generally means that the compiler is free to re-order field layout as it wishes. Thus, if you would like a guaranteed layout from a tuple, you are generally advised to create a named struct with a #[repr(C)] annotation (see the section on structs for more details).
Note that the final element of a tuple (Pn) is marked as ?Sized to permit unsized tuple coercion -- this is implemented on nightly but is currently unstable (tracking issue). In the future, we may extend unsizing to other elements of tuples as well.
Other notes on tuples
Some related discussion:
- RFC #1582 proposed that tuple structs should have a "nested layout", where e.g. (T1, T2, T3) would in fact be laid out as (T1, (T2, T3)). The purpose of this was to permit variadic matching and so forth against some suffix of the struct. This RFC was not accepted, however. This layout requires extra padding and seems somewhat surprising: it means that the layout of tuples and tuple structs would diverge significantly from structs with named fields.
Struct types
Structs come in two principal varieties:
// Structs with named fields
struct Foo { f1: T1, .., fn: Tn }
// Tuple structs
struct Foo(T1, .., Tn);
In terms of their layout, tuple structs can be understood as equivalent to a named struct with fields named 0..n-1:
struct Foo {
0: T1,
...
n-1: Tn
}
(In fact, one may use such field names in patterns or in accessor expressions like foo.0.)
The degrees of freedom the compiler has when computing the layout of an inhabited struct or tuple are the order of the fields and the "gaps" (often called padding) before, between, and after the fields. The layout of these fields themselves is already entirely determined by their types, and since we intend to allow creating references to fields (&s.f1), structs do not have any wiggle-room there.
This can be visualized as follows:
[ <--> [field 3] <-----> [field 1] <-> [ field 2 ] <--> ]
Figure 1 (struct-field layout): The <-...-> and [ ... ] denote the differently-sized gaps and fields, respectively.
Here, the individual fields are blocks of fixed size (determined by the field's layout). The compiler freely picks an order for the fields to be in (this does not have to be the order of declaration in the source), and it picks the gaps between the fields (under some constraints, such as alignment).
For uninhabited structs or tuples like (i32, !) that do not have a valid inhabitant, the compiler has more freedom. After all, no references to fields can ever be taken. For example, such structs might be zero-sized.
How exactly the compiler picks order and gaps, as well as other aspects of layout beyond size and field offset, can be controlled by a #[repr] attribute:
- #[repr(Rust)] -- the default.
- #[repr(C)] -- request C compatibility
- #[repr(align(N))] -- specify the alignment
- #[repr(packed)] -- request packed layout where fields are not internally aligned
- #[repr(transparent)] -- request that a "wrapper struct" be treated "as if" it were an instance of its field type when passed as an argument
Default layout ("repr rust")
With the exception of the guarantees provided below, the default layout of structs is not specified.
As of this writing, we have not reached a full consensus on what limitations should exist on possible struct field layouts, so effectively one must assume that the compiler can select any layout it likes for each struct on each compilation, and it is not required to select the same layout across two compilations. This implies that (among other things) two structs with the same field types may not be laid out in the same way (for example, the hypothetical struct representing tuples may be laid out differently from user-declared structs).
Known things that can influence layout (non-exhaustive):
- the type of the struct fields and the layout of those types
- compiler settings, including esoteric choices like optimization fuel
A note on determinism. The definition above does not guarantee determinism between executions of the compiler -- two executions may select different layouts, even if all inputs are identical. Naturally, in practice, the compiler aims to produce deterministic output for a given set of inputs. However, it is difficult to produce a comprehensive summary of the various factors that may affect the layout of structs, and so for the time being we have opted for a conservative definition.
Compiler's current behavior. As of the time of this writing, the compiler will reorder struct fields to minimize the overall size of the struct (and in particular to eliminate padding due to alignment restrictions).
Layout is presently defined not in terms of a "fully monomorphized" struct definition but rather in terms of its generic definition along with a set of substitutions (values for each type parameter; lifetime parameters do not affect layout). This distinction is important because of unsizing -- if the final field has generic type, the compiler will not reorder it, to allow for the possibility of unsizing. E.g., struct Foo { x: u16, y: u32 } and struct Foo<T> { x: u16, y: T } where T = u32 are not guaranteed to be identical.
Zero-sized structs
For repr(Rust), repr(packed(N)), repr(align(N)), and repr(C) structs: if all fields of a struct have size 0, then the struct has size 0.
For example, all these types are zero-sized:
use std::mem::size_of;

#[repr(align(32))]
struct Zst0;

#[repr(C)]
struct Zst1(Zst0);

struct Zst2(Zst1, Zst0);

fn main() {
    assert_eq!(size_of::<Zst0>(), 0);
    assert_eq!(size_of::<Zst1>(), 0);
    assert_eq!(size_of::<Zst2>(), 0);
}
In particular, a struct with no fields is a ZST, and if it has no repr attribute it is moreover a 1-ZST as it also has no alignment requirements.
Single-field structs
A struct with only one field has the same layout as that field.
Structs with 1-ZST fields
For the purposes of struct layout 1-ZST fields are ignored.
In particular, if all but one field are 1-ZST, then the struct is equivalent to a single-field struct. In other words, if all but one field is a 1-ZST, then the entire struct has the same layout as that one field.
Similarly, if all fields are 1-ZST, then the struct has the same layout as a struct with no fields, and is itself a 1-ZST.
For example:
type Zst1 = ();
struct S1(i32, Zst1); // same layout as i32

type Zst2 = [u16; 0];
struct S2(Zst2, Zst1); // same layout as Zst2

struct S3(Zst1); // same layout as Zst1
Unresolved questions
During the course of the discussion in #11 and #12, various suggestions arose to limit the compiler's flexibility. These questions are currently considered unresolved and -- for each of them -- an issue has been opened for further discussion on the repository. This section documents the questions and gives a few light details, but the reader is referred to the issues for further discussion.
Homogeneous structs (#36). If you have homogeneous structs, where all the N fields are of a single type T, can we guarantee a mapping to the memory layout of [T; N]? How do we map between the field names and the indices? What about zero-sized types?
Deterministic layout (#35). Can we say that layout is some deterministic function of a certain, fixed set of inputs? This would allow you to be sure that if you do not alter those inputs, your struct layout would not change, even if it meant that you can't predict precisely what it will be. For example, we might say that struct layout is a function of the struct's generic types and its substitutions, full stop -- this would imply that any two structs with the same definition are laid out the same. This might interfere with our ability to do profile-guided layout or to analyze how a struct is used and optimize based on that. Some would call that a feature.
C-compatible layout ("repr C")
For structs tagged #[repr(C)], the compiler will apply a C-like layout scheme. See section 6.7.2.1 of the C17 specification for a detailed write-up of what such rules entail (as well as the relevant specs for your platform). For most platforms, however, this means the following:
- Field order is preserved.
- The first field begins at offset 0.
- Assuming the struct is not packed, each field's offset is aligned [1] to the ABI-mandated alignment for that field's type, possibly creating unused padding bits.
- The total size of the struct is rounded up to its overall alignment.
[1] Aligning an offset O to an alignment A means to round up the offset O until it is a multiple of the alignment A.
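The rules above can be sketched as a small layout calculator. This is an illustration of the common case only, assuming a non-packed struct whose overall alignment is the largest field alignment; the names align_up, repr_c_layout and Example are made up for this example, and the authoritative rules remain those of the platform's C ABI.

```rust
use std::mem::{align_of, size_of};

/// Rounds `offset` up to the next multiple of `align` (assumed nonzero).
fn align_up(offset: usize, align: usize) -> usize {
    (offset + align - 1) / align * align
}

/// Takes the (size, alignment) of each field in declaration order and
/// returns (field offsets, total size, overall alignment).
fn repr_c_layout(fields: &[(usize, usize)]) -> (Vec<usize>, usize, usize) {
    let mut offset = 0; // the first field begins at offset 0
    let mut max_align = 1;
    let mut offsets = Vec::new();
    for &(size, align) in fields {
        offset = align_up(offset, align); // align the field's offset
        offsets.push(offset);
        offset += size;
        max_align = max_align.max(align);
    }
    // The total size is rounded up to the overall alignment.
    (offsets, align_up(offset, max_align), max_align)
}

#[repr(C)]
struct Example {
    a: u8,
    b: u32,
    c: u16,
}

fn main() {
    let fields = [
        (size_of::<u8>(), align_of::<u8>()),
        (size_of::<u32>(), align_of::<u32>()),
        (size_of::<u16>(), align_of::<u16>()),
    ];
    let (offsets, size, align) = repr_c_layout(&fields);

    // Compare against what the compiler actually chose for `Example`.
    let e = Example { a: 1, b: 2, c: 3 };
    let base = &e as *const Example as usize;
    let actual = vec![
        &e.a as *const u8 as usize - base,
        &e.b as *const u32 as usize - base,
        &e.c as *const u16 as usize - base,
    ];
    assert_eq!(offsets, actual);
    assert_eq!(size, size_of::<Example>());
    assert_eq!(align, align_of::<Example>());
}
```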
The intention is that if one has a set of C struct declarations and a corresponding set of Rust struct declarations, all of which are tagged with #[repr(C)], then the layout of those structs will all be identical. Note that this setup implies that none of the structs in question can contain any #[repr(Rust)] structs (or Rust tuples), as those would have no corresponding C struct declaration -- as #[repr(Rust)] types have undefined layout, you cannot safely declare their layout in a C program.
See also the notes on ABI compatibility under the section on #[repr(transparent)].
Structs with no fields. One area where Rust layout can deviate from C/C++ -- even with #[repr(C)] -- comes about with "empty structs" that have no fields. In C, an empty struct declaration like struct Foo { } is illegal. However, both gcc and clang support options to enable such structs, and assign them size zero. Rust behaves the same way -- empty structs have size 0 and alignment 1 (unless an explicit #[repr(align)] is present). C++, in contrast, gives empty structs a size of 1, unless they are inherited from or they are fields that have the [[no_unique_address]] attribute, in which case they do not increase the overall size of the struct.
Structs of zero-size. It is also possible to have structs that have fields but still have zero size. In this case, the size of the struct would be zero, but its alignment may be greater. For example, #[repr(C)] struct Foo { x: [u16; 0] } would have an alignment of 2 bytes by default. (This matches the behavior in gcc and clang.)
Structs with fields of zero-size. If a #[repr(C)] struct contains a field of zero size, that field does not occupy space in the struct; it can, however, affect the offsets of subsequent fields if it induces padding due to the alignment of its type. (This matches the behavior in gcc and clang.)
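The three cases above (no fields, zero size with larger alignment, and a zero-sized field that induces padding) can be checked with a short sketch. The struct names and the `offset_of_b` helper are hypothetical and exist only for this illustration.

```rust
use std::mem::{align_of, size_of};

#[repr(C)]
struct NoFields {}

#[repr(C)]
struct ZeroSized {
    x: [u16; 0],
}

#[repr(C)]
struct WithZeroSizedField {
    a: u8,
    z: [u32; 0], // occupies no space, but must be aligned like u32
    b: u8,
}

/// Offset of `b`, measured from a live value.
fn offset_of_b() -> usize {
    let v = WithZeroSizedField { a: 1, z: [], b: 2 };
    let base = &v as *const WithZeroSizedField as usize;
    let field = &v.b as *const u8 as usize;
    field - base
}

fn main() {
    // An empty struct has size 0 and alignment 1.
    assert_eq!(size_of::<NoFields>(), 0);
    assert_eq!(align_of::<NoFields>(), 1);
    // Zero size, but the alignment of the field's element type.
    assert_eq!(size_of::<ZeroSized>(), 0);
    assert_eq!(align_of::<ZeroSized>(), align_of::<u16>());
    // The zero-sized field pushes `b` to the next u32-aligned offset.
    assert_eq!(offset_of_b(), align_of::<u32>());
    assert_eq!(size_of::<WithZeroSizedField>(), 2 * align_of::<u32>());
}
```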
C++ compatibility hazard. As noted above when discussing structs
with no fields, C++ treats empty structs like struct Foo { }
differently from C and Rust. This can introduce subtle compatibility
hazards. If you have an empty struct in your C++ code and you make the
"naive" translation into Rust, even tagging with #[repr(C)]
will not
produce layout- or ABI-compatible results.
Fixed alignment
The #[repr(align(N))]
attribute may be used to raise the alignment
of a struct, as described in The Rust Reference.
Packed layout
The #[repr(packed(N))]
attribute may be used to impose a maximum
limit on the alignments for individual fields. It is most commonly
used with an alignment of 1, which makes the struct as small as
possible. For example, in a #[repr(packed(2))]
struct, a u8
or
u16
would be aligned at 1- or 2-bytes respectively (as normal), but
a u32
would be aligned at only 2 bytes instead of 4. In the absence
of an explicit #[repr(align)]
directive, #[repr(packed(N))]
also
sets the alignment for the struct as a whole to N bytes.
The resulting fields may not fall at properly aligned boundaries in
memory. This makes it unsafe to create a Rust reference (&T
or &mut T
) to those fields, as the compiler requires that all reference
values must always be aligned (so that it can use more efficient
load/store instructions at runtime). See the Rust reference for more
details.
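As a sketch of how one might read a field of a packed struct without creating a reference to it: `Packed2` and `read_b` are hypothetical names, and the size and alignment asserted here follow from the packed(2) rule described above combined with #[repr(C)] field ordering.

```rust
use std::mem::{align_of, size_of};
use std::ptr;

#[repr(C, packed(2))]
struct Packed2 {
    a: u8,
    b: u32, // aligned to only 2 bytes inside this struct
}

/// Read `b` through a raw pointer, never forming a `&u32`.
fn read_b(p: &Packed2) -> u32 {
    // `addr_of!` yields a raw pointer without an intermediate reference;
    // `read_unaligned` tolerates the reduced alignment.
    unsafe { ptr::addr_of!(p.b).read_unaligned() }
}

fn main() {
    assert_eq!(align_of::<Packed2>(), 2);
    assert_eq!(size_of::<Packed2>(), 6);
    let p = Packed2 { a: 1, b: 0xDEAD_BEEF };
    assert_eq!(read_b(&p), 0xDEAD_BEEF);
    // let r = &p.b; // rejected by the compiler: reference to a packed field
}
```

Copying the field by value (`let b = p.b;`) is also permitted, since no reference is created.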
Function call ABI compatibility
In general, when invoking functions that use the C ABI, #[repr(C)]
structs are guaranteed to be passed in the same way as their
corresponding C counterpart (presuming one exists). #[repr(Rust)]
structs have no such guarantee. This means that if you have an extern "C"
function, you cannot pass a #[repr(Rust)]
struct as one of its
arguments. Instead, one would typically pass #[repr(C)]
structs (or
possibly pointers to Rust-structs, if those structs are opaque on the
other side, or the callee is defined in Rust).
However, there is a subtle point about C ABIs: in some C ABIs, passing
a struct with one field of type T
as an argument is not
equivalent to just passing a value of type T
. So e.g. if you have a
C function that is defined to take a uint32_t
:
void some_function(uint32_t value) { .. }
It is incorrect to pass in a struct as that value, even if that struct is #[repr(C)] and has only one field:
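#[repr(C)]
struct Foo { x: u32 }
extern "C" {
    // Declared with the wrong argument type: the C side expects a plain uint32_t.
    fn some_function(value: Foo);
}
unsafe { some_function(Foo { x: 22 }) }; // Bad!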
Instead, you should declare the struct with #[repr(transparent)]
,
which specifies that Foo
should use the ABI rules for its field
type, u32
. This is useful when using "wrapper structs" in Rust to
give stronger typing guarantees.
#[repr(transparent)]
can only be applied to structs with a single
field whose type T
has non-zero size, along with some number of
other fields whose types are all zero-sized (typically
std::marker::PhantomData
fields). The struct then takes on the "ABI
behavior" of the type T
that has non-zero size.
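A minimal sketch of the wrapper-struct pattern follows. `Length`, `Meters`, `double` and `call_through_wrapper` are hypothetical names; the function-pointer transmute relies on the assumption, stated in the standard library's notes on ABI compatibility, that a #[repr(transparent)] struct is ABI-compatible with its single non-zero-sized field.

```rust
use std::marker::PhantomData;
use std::mem::{align_of, size_of, transmute};

struct Meters; // hypothetical unit marker, never instantiated

#[repr(transparent)]
struct Length<Unit>(u32, PhantomData<Unit>);

// Stands in for a C function taking a plain uint32_t.
extern "C" fn double(value: u32) -> u32 {
    value * 2
}

fn call_through_wrapper(len: Length<Meters>) -> u32 {
    // Assumption: Length<Meters> and u32 are ABI-compatible because of
    // #[repr(transparent)], so the pointer may be called at this type.
    let f: extern "C" fn(Length<Meters>) -> u32 =
        unsafe { transmute(double as extern "C" fn(u32) -> u32) };
    f(len)
}

fn main() {
    assert_eq!(size_of::<Length<Meters>>(), size_of::<u32>());
    assert_eq!(align_of::<Length<Meters>>(), align_of::<u32>());
    assert_eq!(call_through_wrapper(Length(21, PhantomData)), 42);
}
```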
(Note further that the Rust ABI is undefined and theoretically may vary from compiler revision to compiler revision.)
Unresolved question: Guaranteeing compatible layouts?
One key unresolved question was whether we would want to guarantee
that two #[repr(Rust)]
structs whose fields have the same types are
laid out in a "compatible" way, such that one could be transmuted to
the other. @rkruppe laid out a number of
examples
where this might be a reasonable thing to expect. As currently
written, and in an effort to be conservative, we make no such
guarantee, though we do not firmly rule out doing such a thing in the future.
It seems like it may well be desirable to -- at minimum -- guarantee
that #[repr(Rust)]
layout is "some deterministic function of the
struct declaration and the monomorphized types of its fields". Note
that it is not sufficient to consider the monomorphized type of a
struct's fields: due to unsizing coercions, it matters whether the
struct is declared in a generic way or not, since the "unsized" field
must presently be laid out last in the
structure. (Note
that tuples are always coercible (see #42877 for more information),
and are always declared as generics.) This implies that our
"deterministic function" also takes as input the form in which the
fields are declared in the struct.
However, that rule is not true today. For example, the compiler includes an option (called "optimization fuel") that will enable us to alter the layout of only the "first N" structs declared in the source. When one is accidentally relying on the layout of a structure, this can be used to track down the struct that is causing the problem.
There are also benefits to having fewer guarantees. For example:
- Code hardening tools can be used to randomize the layout of individual structs.
- Profile-guided optimization might analyze how instances of a particular struct are used and tweak the layout (e.g., to insert padding and reduce false sharing).
As a more declarative alternative, @alercah proposed a possible
extension
that would permit one to declare that the layout of two structs or
types are compatible (e.g., #[repr(as(Foo))] struct Bar { .. }
),
thus permitting safe transmutes (and also ABI compatibility). One
might also use some weaker form of #[repr(C)]
to specify a "more
deterministic" layout. These areas need future exploration.
Counteropinions and other notes
@joshtriplett argued against reordering struct fields, suggesting instead that it would be better if users reordered fields themselves. However, there are a number of downsides to such a proposal (and -- further -- it does not match our existing behavior):
- In a generic struct, the best ordering of fields may not be known ahead of time, so the user cannot do it manually.
- If layout is defined, and a library exposes a struct with all public fields, then clients may be more likely to assume that the layout of that struct is stable. If they were to write unsafe code that relied on this assumption, that would break if fields were reordered. But libraries may well expect the freedom to reorder fields. This case is weakened because of the requirement to write unsafe code (after all, one can always write unsafe code that relies on virtually any implementation detail); if we were to permit safe casts that rely on the layout, then reordering fields would clearly be a breaking change (see also this comment and this thread).
- Many people would prefer the field ordering to be chosen for "readability" and not optimal layout.
Layout of scalar types
Disclaimer: This chapter represents the consensus from issue #9. The statements in here are not (yet) "guaranteed" not to change until an RFC ratifies them.
This documents the memory layout and considerations for bool
, char
, floating
point types (f{32, 64}
), and integral types ({i,u}{8,16,32,64,128,size}
).
These types are all scalar types, representing a single value, and have no
layout #[repr()]
flags.
bool
Rust's bool
has the same layout as C17's _Bool
, that is, its size
and alignment are implementation-defined. Any bool
can be
cast into an integer, taking on the values 1 (true
) or 0 (false
).
Note: on all platforms that Rust currently supports, its size and alignment are 1, and its ABI class is INTEGER - see Rust Layout and ABIs.
char
Rust's char is 32 bits wide and represents a Unicode scalar value. The alignment of char is implementation-defined.
Note: Rust's char type is not layout-compatible with C / C++ char types. The C / C++ char types correspond to either Rust's i8 or u8 types on all currently supported platforms, depending on their signedness. Rust does not support C platforms in which C char is not 8 bits wide.
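The difference between Rust's char and C's char can be illustrated with a small sketch. `is_scalar_value` is a hypothetical helper; `std::ffi::c_char` is the standard library alias for the platform's C char type.

```rust
use std::ffi::c_char;
use std::mem::size_of;

/// True if `bits` is a valid Unicode scalar value, i.e. a valid `char`.
fn is_scalar_value(bits: u32) -> bool {
    char::from_u32(bits).is_some()
}

fn main() {
    // Rust's char is four bytes; C's char is one byte.
    assert_eq!(size_of::<char>(), 4);
    assert_eq!(size_of::<c_char>(), 1);
    // Not every 32-bit pattern is a valid char: surrogates and values
    // above 0x10FFFF are excluded.
    assert!(is_scalar_value(0x41));
    assert!(!is_scalar_value(0xD800));
    assert!(!is_scalar_value(0x11_0000));
}
```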
isize
and usize
The isize
and usize
types are pointer-sized signed and unsigned integers.
They have the same layout as the pointer types for which the pointee is
Sized
, and are layout compatible with C's uintptr_t
and intptr_t
types.
Note: C99 7.18.2.4 requires uintptr_t and intptr_t to be at least 16 bits wide. All platforms we currently support have a C platform, and as a consequence, isize/usize are at least 16 bits wide for all of them.
Note: Rust's usize and C's unsigned types are not equivalent. C's unsigned is at least as large as a short, allowed to have padding bits, etc. but it is not necessarily pointer-sized.
Note: in the current Rust implementation, the layouts of isize and usize determine the following:
- the maximum size of Rust allocations is limited to isize::MAX. The LLVM getelementptr instruction uses signed-integer field offsets. Rust calls getelementptr with the inbounds flag which assumes that field offsets do not overflow,
- the maximum number of elements in an array is usize::MAX ([T; N: usize]). Only ZST arrays can probably be this large in practice, non-ZST arrays are bound by the maximum size of Rust values,
- the maximum value in bytes by which a pointer can be offset using ptr.add or ptr.offset is isize::MAX.
These limits have not gone through the RFC process and are not guaranteed to hold.
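One place where the isize::MAX limit is observable today is `std::alloc::Layout`, which refuses to describe larger allocations. The sketch below uses a hypothetical helper, `fits_allocation_limit`, and reflects the current standard library behavior rather than a guarantee made by this document.

```rust
use std::alloc::Layout;
use std::mem::size_of;

/// True if a byte-aligned allocation of `size` bytes can be described by a `Layout`.
fn fits_allocation_limit(size: usize) -> bool {
    Layout::from_size_align(size, 1).is_ok()
}

fn main() {
    // isize and usize are pointer-sized.
    assert_eq!(size_of::<usize>(), size_of::<*const u8>());
    assert_eq!(size_of::<isize>(), size_of::<*const u8>());
    // Layouts larger than isize::MAX bytes are rejected.
    assert!(fits_allocation_limit(isize::MAX as usize));
    assert!(!fits_allocation_limit(isize::MAX as usize + 1));
}
```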
Fixed-width integer types
For all Rust's fixed-width integer types {i,u}{8,16,32,64,128}
it holds that:
- these types have no padding bits,
- their size exactly matches their bit-width,
- negative values of signed integer types are represented using 2's complement.
Furthermore, Rust's signed and unsigned fixed-width integer types
{i,u}{8,16,32,64}
have the same layout as the C fixed-width integer types from
the <stdint.h>
header {u,}int{8,16,32,64}_t
. These fixed-width integer types
are therefore safe to use directly in C FFI where the corresponding C
fixed-width integer types are expected.
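The properties listed above (no padding bits, size equal to bit-width, 2's complement) can be observed directly. This is a small sketch; `bit_width` is a hypothetical helper.

```rust
use std::mem::size_of;

/// Storage width of `T` in bits, derived from its size in bytes.
fn bit_width<T>() -> usize {
    size_of::<T>() * 8
}

fn main() {
    // Size matches bit-width exactly, so there is no room for padding bits.
    assert_eq!(bit_width::<u8>(), 8);
    assert_eq!(bit_width::<i16>(), 16);
    assert_eq!(bit_width::<u32>(), 32);
    assert_eq!(bit_width::<i64>(), 64);
    assert_eq!(bit_width::<u128>(), 128);
    // 2's complement: -1 is all ones, MIN has only the top bit set.
    assert_eq!((-1i8).to_ne_bytes(), [0xFF]);
    assert_eq!(-1i32 as u32, u32::MAX);
    assert_eq!(i16::MIN as u16, 0x8000);
}
```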
The alignment of Rust's {i,u}128
is unspecified and allowed to change.
Note: While the C standard does not define fixed-width 128-bit wide integer types, many C compilers provide non-standard __int128 types as a language extension. The layout of {i,u}128 in the current Rust implementation does not match that of these C types, see rust-lang/#54341.
Layout compatibility with C native integer types
The specification of native C integer types, char
, short
, int
, long
,
... as well as their unsigned
variants, guarantees a lower bound on their size,
e.g., short
is at least 16-bit wide and at least as wide as char
.
Their exact sizes are implementation-defined.
Libraries like libc
use knowledge of this implementation-defined behavior on
each platform to select a layout-compatible Rust fixed-width integer type when
interfacing with native C integer types (e.g. libc::c_int
).
Note: Rust does not support C platforms on which the C native integer types are not compatible with any of Rust's fixed-width integer types (e.g. because of padding-bits, lack of 2's complement, etc.).
Fixed-width floating point types
Rust's f32
and f64
single (32-bit) and double (64-bit) precision
floating-point types have IEEE-754 binary32
and binary64
floating-point
layouts, respectively.
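The IEEE-754 layout can be inspected through `to_bits`. In this sketch, `decompose` is a hypothetical helper that splits a binary32 value into its sign, biased exponent, and mantissa fields.

```rust
/// Split an f32 into (sign, biased exponent, mantissa) per IEEE-754 binary32.
fn decompose(x: f32) -> (u32, u32, u32) {
    let bits = x.to_bits();
    (bits >> 31, (bits >> 23) & 0xFF, bits & 0x7F_FFFF)
}

fn main() {
    // 1.0 = +1.0 * 2^0, stored with exponent bias 127.
    assert_eq!(1.0f32.to_bits(), 0x3F80_0000);
    assert_eq!(decompose(1.0), (0, 127, 0));
    // -2.0 = -1.0 * 2^1.
    assert_eq!(decompose(-2.0), (1, 128, 0));
    // binary64 uses an 11-bit exponent with bias 1023.
    assert_eq!(1.0f64.to_bits(), 0x3FF0_0000_0000_0000);
}
```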
When the platform's "math.h" header defines the __STDC_IEC_559__ macro,
Rust's floating-point types are safe to use directly in C FFI where the
appropriate C types are expected (f32
for float
, f64
for double
).
If the C platform's "math.h"
header does not define the __STDC_IEC_559__
macro, whether using f32
and f64
in C FFI is safe or not for which C type is
implementation-defined.
Note: the libc crate uses knowledge of each platform's implementation-defined behavior to provide portable libc::c_float and libc::c_double types that can be used to safely interface with C via FFI.
Layout of Rust enum
types
Disclaimer: Some parts of this section were decided in RFCs, but others represent the consensus from issue #10. The text will attempt to clarify which parts are "guaranteed" (owing to the RFC decision) and which parts are still in a "preliminary" state, at least until we start to open RFCs ratifying parts of the Unsafe Code Guidelines effort.
Note: This document has not yet been updated to RFC 2645.
Categories of enums
Empty enums. Enums with no variants can never be instantiated and
are equivalent to the !
type. They do not accept any #[repr]
annotations.
Fieldless enums. The simplest form of enum is one where none of the variants have any fields:
enum SomeEnum {
    Variant1,
    Variant2,
    Variant3,
}
Such enums correspond quite closely with enums in the C language (though there are important differences as well). Presuming that they have more than one variant, these sorts of enums are always represented as a simple integer, though the size will vary.
Fieldless enums may also specify the value of their discriminants explicitly:
enum SomeEnum {
    Variant22 = 22,
    Variant44 = 44,
    Variant45,
}
As in C, discriminant values that are not specified are defined as either 0 (for the first variant) or as one more than the prior variant.
Data-carrying enums. Enums with at least one variant with fields are called "data-carrying" enums. Note that for the purposes of this definition, it is not relevant whether the variant fields are zero-sized. Therefore this enum is considered "data-carrying":
enum Foo {
    Bar(()),
    Baz,
}
repr annotations accepted on enums
In general, enums may be annotated using the following #[repr]
tags:
- A specific integer type (called Int as a shorthand below):
  - #[repr(u8)]
  - #[repr(u16)]
  - #[repr(u32)]
  - #[repr(u64)]
  - #[repr(i8)]
  - #[repr(i16)]
  - #[repr(i32)]
  - #[repr(i64)]
- C-compatible layout: #[repr(C)]
- C-compatible layout with a specified discriminant size:
  - #[repr(C, u8)]
  - #[repr(C, u16)]
  - etc
Note that manually specifying the alignment using #[repr(align)]
is
not permitted on an enum.
The set of repr annotations accepted by an enum depends on its category, as defined above:
- Empty enums: no repr annotations are permitted.
- Fieldless enums: #[repr(Int)]-style and #[repr(C)] annotations are permitted, but #[repr(C, Int)] annotations are not.
- Data-carrying enums: all repr annotations are permitted.
Enum layout rules
The rules for enum layout vary depending on the category.
Layout of an empty enum
An empty enum is an enum with no variants; empty enums can never
be instantiated and are logically equivalent to the "never type"
!
. #[repr]
annotations are not accepted on empty enums. Empty
enums are guaranteed to have the same layout as !
(zero size and
alignment 1).
Layout of a fieldless enum
If there is no #[repr]
attached to a fieldless enum, the compiler
will represent it using an integer of sufficient size to store the
discriminants for all possible variants -- note that if there is only
one variant, then 0 bits are required, so it is possible that the enum
may have zero size. In the absence of a #[repr]
annotation, the
number of bits used by the compiler are not defined and are subject to
change.
When a #[repr(Int)]
-style annotation is attached to a fieldless enum
(one without any data for its variants), it will cause the enum to be
represented as a simple integer of the specified size Int
. This must
be sufficient to store all the required discriminant values.
The #[repr(C)]
annotation is equivalent, but it selects the same
size as the C compiler would use for the given target for an
equivalent C-enum declaration.
Combining a C
and Int
repr
(e.g., #[repr(C, u8)]
) is
not permitted on a fieldless enum.
The values used for the discriminant will match up with what is specified (or automatically assigned) in the enum definition. For example, the following enum defines the discriminants for its variants as 22 and 23 respectively:
enum Foo {
    // Specify discriminant of this variant as 22:
    Variant22 = 22,

    // Default discriminant is one more than the previous,
    // so 23 will be assigned.
    Variant23
}
Note: some C compilers offer flags (e.g., -fshort-enums
) that
change the layout of enums from the default settings that are standard
for the platform. The integer size selected by #[repr(C)]
is defined
to match the default settings for a given target, when no such
flags are supplied. If interop with code that uses other flags is
desired, then one should either specify the sizes of enums manually or
else use an alternate target definition that is tailored to the
compiler flags in use.
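As a sketch of working with a #[repr(Int)] fieldless enum: the enum-to-integer direction is a plain cast, while the integer-to-enum direction is best done with an explicit check, since not every integer is a valid discriminant. The `from_u8` helper is hypothetical.

```rust
use std::mem::size_of;

#[repr(u8)]
#[derive(Debug, PartialEq, Clone, Copy)]
enum Foo {
    Variant22 = 22,
    Variant23, // implicitly 23
}

/// Convert a raw integer back into `Foo`, rejecting invalid discriminants.
fn from_u8(raw: u8) -> Option<Foo> {
    match raw {
        22 => Some(Foo::Variant22),
        23 => Some(Foo::Variant23),
        _ => None,
    }
}

fn main() {
    // The enum is represented as a u8.
    assert_eq!(size_of::<Foo>(), 1);
    assert_eq!(Foo::Variant22 as u8, 22);
    assert_eq!(Foo::Variant23 as u8, 23);
    assert_eq!(from_u8(23), Some(Foo::Variant23));
    assert_eq!(from_u8(0), None);
}
```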
Layout of data-carrying enums with an explicit repr annotation
This section concerns data-carrying enums with an explicit repr annotation of some form. The memory layout of such cases was specified in RFC 2195 and is therefore normative.
The layout of data-carrying enums that do not have an explicit repr annotation is generally undefined, but with certain specific exceptions: see the next section for details.
Explicit repr annotation without C compatibility
When an enum is tagged with #[repr(Int)]
for some integral type
Int
(e.g., #[repr(u8)]
), it will be represented as a C-union of a
series of #[repr(C)]
structs, one per variant. Each of these structs
begins with an integral field containing the discriminant, which
specifies which variant is active. They then contain the remaining
fields associated with that variant.
Example. The following enum uses a repr(u8) annotation:
#[repr(u8)]
enum TwoCases {
    A(u8, u16),
    B(u16),
}
This will be laid out equivalently to the following more complex Rust types:
#[repr(C)]
union TwoCasesRepr {
    A: TwoCasesVariantA,
    B: TwoCasesVariantB,
}

#[derive(Copy, Clone)]
#[repr(u8)]
enum TwoCasesTag { A, B }

#[derive(Copy, Clone)]
#[repr(C)]
struct TwoCasesVariantA(TwoCasesTag, u8, u16);

#[derive(Copy, Clone)]
#[repr(C)]
struct TwoCasesVariantB(TwoCasesTag, u16);
Note that the TwoCasesVariantA
and TwoCasesVariantB
structs are
#[repr(C)]
; this is needed to ensure that the TwoCasesTag
value
appears at offset 0 in both cases, so that we can read it to determine
the current variant.
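Because the tag sits at offset 0 in every variant, unsafe code can read it through a pointer cast. The following is a minimal sketch; `tag_of` is a hypothetical helper, and the tag values 0 and 1 assume the default discriminant assignment.

```rust
use std::mem::size_of;

#[repr(u8)]
enum TwoCases {
    A(u8, u16),
    B(u16),
}

/// Read the discriminant of a #[repr(u8)] enum through a pointer cast.
fn tag_of(value: &TwoCases) -> u8 {
    // SAFETY: with #[repr(u8)] the enum starts with a u8 tag at offset 0,
    // whichever variant is active.
    unsafe { *(value as *const TwoCases as *const u8) }
}

fn main() {
    assert_eq!(tag_of(&TwoCases::A(1, 2)), 0);
    assert_eq!(tag_of(&TwoCases::B(3)), 1);
    assert_eq!(size_of::<TwoCases>(), 4);
}
```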
Explicit repr annotation with C compatibility
When the #[repr]
tag includes C
, e.g., #[repr(C)]
or #[repr(C, u8)]
, the layout of enums is changed to better match C++ enums. In
this mode, the data is laid out as a tuple of (discriminant, union)
,
where union
represents a C union of all the possible variants. The
type of the discriminant will be the integral type specified (u8, etc) -- if no type is specified, then the compiler will select one based on what size a fieldless enum would have with the same number of variants.
This layout, while more compatible and arguably more obvious, is also
less efficient than the non-C compatible layout in some cases in terms
of total size. For example, the TwoCases
example given in the
previous section only occupies 4 bytes with #[repr(u8)]
, but would
occupy 6 bytes with #[repr(C, u8)]
, as more padding is required.
Example. The following enum:
#[repr(C, Int)]
enum MyEnum {
A(u32),
B(f32, u64),
C { x: u32, y: u8 },
D,
}
is equivalent to the following Rust definition:
#[repr(C)]
struct MyEnumRepr {
tag: MyEnumTag,
payload: MyEnumPayload,
}
#[repr(Int)]
enum MyEnumTag { A, B, C, D }
#[repr(C)]
union MyEnumPayload {
A: u32,
B: MyEnumPayloadB,
C: MyEnumPayloadC,
D: (),
}
#[repr(C)]
struct MyEnumPayloadB(f32, u64);
#[repr(C)]
struct MyEnumPayloadC { x: u32, y: u8 }
This enum can also be represented in C++ as follows:
#include <stdint.h>
enum class MyEnumTag: CppEquivalentOfInt { A, B, C, D };
struct MyEnumPayloadB { float _0; uint64_t _1; };
struct MyEnumPayloadC { uint32_t x; uint8_t y; };
union MyEnumPayload {
uint32_t A;
MyEnumPayloadB B;
MyEnumPayloadC C;
};
struct MyEnum {
MyEnumTag tag;
MyEnumPayload payload;
};
Layout of data-carrying enums without a repr annotation
If no explicit #[repr]
attribute is used, then the layout of a
data-carrying enum is typically not specified. However, in certain
select cases, there are guaranteed layout optimizations that may
apply, as described below.
Discriminant elision on Option-like enums
(Meta-note: The content in this section is not fully described by any RFC and is therefore "non-normative". Parts of it were specified in rust-lang/rust#60300).
Definition. An option-like enum is a 2-variant enum where:
- the enum has no explicit #[repr(...)], and
- one variant has a single field, and
- the other variant has no fields (the "unit variant").
The simplest example is Option<T>
itself, where the Some
variant
has a single field (of type T
), and the None
variant has no
fields. Other enums that fit the same template also qualify.
Definition. The payload of an option-like enum is the single
field which it contains; in the case of Option<T>
, the payload has
type T
.
Definition. In some cases, the payload type may contain illegal
values, which are called niches. For example, a value of type &T
may never be NULL
, and hence defines a niche consisting of the
bitstring 0
. Similarly, the standard library types NonZeroU8
and friends may never be zero, and hence also define the value of 0
as a niche.
The niche values must be disjoint from the values allowed by the validity invariant. The validity invariant is, as of this writing, the current active discussion topic in the unsafe code guidelines process. rust-lang/rust#60300 specifies that the following types have at least one niche (the all-zeros bit-pattern):
&T
&mut T
extern "C" fn
core::num::NonZero*
core::ptr::NonNull<T>
#[repr(transparent)] struct
around one of the types in this list.
Option-like enums where the payload defines at least one niche value are guaranteed to be represented using the same memory layout as their payload. This is called discriminant elision, as there is no explicit discriminant value stored anywhere. Instead, niche values are used to represent the unit variant.
The most common example is that Option<&u8> can be represented as a nullable &u8 reference -- the None variant is then represented using the niche value zero. This is because a valid &u8 value can never be zero, so if we see a zero value, we know that this must be the None variant.
Example. The type Option<&u32>
will be represented at runtime as
a nullable pointer. FFI interop often depends on this property.
Example. As fn
types are non-nullable, the type Option<extern "C" fn()>
will be represented at runtime as a nullable function
pointer (which is therefore equivalent to a C function pointer) . FFI
interop often depends on this property.
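Discriminant elision can be observed with a short sketch. `as_nullable` is a hypothetical helper; the transmute relies on the guarantee described above that Option<&T> has the same layout as &T, with None represented as the null pointer.

```rust
use std::mem::{size_of, transmute};
use std::num::NonZeroU8;

/// View an `Option<&u32>` as a nullable raw pointer.
fn as_nullable(opt: Option<&u32>) -> *const u32 {
    // Relies on discriminant elision: `None` is the all-zeros bit pattern.
    unsafe { transmute::<Option<&u32>, *const u32>(opt) }
}

fn main() {
    // No separate discriminant is stored for these option-like enums.
    assert_eq!(size_of::<Option<&u32>>(), size_of::<&u32>());
    assert_eq!(size_of::<Option<NonZeroU8>>(), size_of::<u8>());
    assert_eq!(
        size_of::<Option<extern "C" fn()>>(),
        size_of::<extern "C" fn()>()
    );

    let x = 5u32;
    assert!(as_nullable(None).is_null());
    assert_eq!(as_nullable(Some(&x)), &x as *const u32);
}
```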
Example. The following enum definition is not option-like, as it has two unit variants:
enum Enum1<T> {
    Present(T),
    Absent1,
    Absent2,
}
Example. The following enum definition is not option-like,
as it has an explicit repr
attribute.
#[repr(u8)]
enum Enum2<T> {
    Present(T),
    Absent1,
}
Layout of enums with a single variant
NOTE: the guarantees in this section have not been approved by an RFC process.
Data-carrying enums with a single variant without a repr()
annotation have
the same layout as the variant field. Fieldless enums with a single variant
have the same layout as a unit struct.
For example, here:
struct UnitStruct;
enum FieldlessSingleVariant { FieldlessVariant }

struct SomeStruct { x: u32 }
enum DataCarryingSingleVariant {
    DataCarryingVariant(SomeStruct),
}
FieldlessSingleVariant has the same layout as UnitStruct, and DataCarryingSingleVariant has the same layout as SomeStruct.
Unresolved questions
See Issue #79:
- Layout of multi-variant enums where only one variant is inhabited.
Layout of unions
Disclaimer: This chapter represents the consensus from issue #13. The statements in here are not (yet) "guaranteed" not to change until an RFC ratifies them.
Note: This document has not yet been updated to RFC 2645.
Layout of individual union fields
A union consists of several variants, one for each field. All variants have the same size and start at the same memory address, such that in memory the variants overlap. This can be visualized as follows:
[ <--> [field0_ty] <----> ]
[ <----> [field1_ty] <--> ]
[ <---> [field2_ty] <---> ]
Figure 1 (union-field layout): Each row in the picture shows the layout of
the union for each of its variants. The <-...->
and [ ... ]
denote the
differently-sized gaps and fields, respectively.
The individual fields ([field{i}_ty]) are blocks of fixed size determined by
the field's layout. Since we allow creating references to union fields
(&u.i
), the only degrees of freedom the compiler has when computing the layout
of a union are the size of the union, which can be larger than the size of its
largest field, and the offset of each union field within its variant. How these
are picked depends on certain constraints like, for example, the alignment
requirements of the fields, the #[repr]
attribute of the union
, etc.
Unions with default layout ("repr(Rust)
")
Except for the guarantees provided below for some specific cases, the default layout of Rust unions is, in general, unspecified.
That is, there are no general guarantees about the offset of the fields, whether all fields have the same offset, what the call ABI of the union is, etc.
Rationale
As of this writing, we want to keep the option of using non-zero offsets open for the future; whether this is useful depends on what exactly the compiler-assumed invariants about union contents are. This might become clearer after the validity of unions is settled.
Even if the offsets happen to be all 0, there might still be differences in the
function call ABI. If you need to pass unions by-value across an FFI boundary,
you have to use #[repr(C)]
.
Layout of unions with a single non-zero-sized field
The layout of unions with a single non-1-ZST field is the same as the layout of that field if it has no padding bytes.
For example, here:
use std::mem::{size_of, align_of};

#[derive(Copy, Clone)]
#[repr(transparent)]
struct SomeStruct(i32);

#[derive(Copy, Clone)]
struct Zst;

union U0 {
    f0: SomeStruct,
    f1: Zst,
}

fn main() {
    assert_eq!(size_of::<U0>(), size_of::<SomeStruct>());
    assert_eq!(align_of::<U0>(), align_of::<SomeStruct>());
}
the union U0
has the same layout as SomeStruct
, because SomeStruct
has no
padding bits - it is equivalent to an i32
due to repr(transparent)
- and
because Zst
is a 1-ZST.
On the other hand, here:
use std::mem::{size_of, align_of};

#[derive(Copy, Clone)]
struct SomeOtherStruct(i32);

#[derive(Copy, Clone)]
#[repr(align(16))]
struct Zst2;

union U1 {
    f0: SomeOtherStruct,
    f1: Zst2,
}

fn main() {
    assert_eq!(size_of::<U1>(), align_of::<Zst2>());
    assert_eq!(align_of::<U1>(), align_of::<Zst2>());
    assert_eq!(align_of::<Zst2>(), 16);
}
the layout of U1
is unspecified because:
- Zst2 is not a 1-ZST, and
- SomeOtherStruct has an unspecified layout and could contain padding bytes.
C-compatible layout ("repr C")
The layout of repr(C)
unions follows the C layout scheme. Per sections
6.5.8.5 and 6.7.2.1.16 of the C11 specification, this means that the offset
of every field is 0. Unsafe code can cast a pointer to the union to a field type
to obtain a pointer to any field, and vice versa.
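The pointer cast described above can be sketched as follows. `IntOrFloat` and `float_bits` are hypothetical names introduced for this illustration.

```rust
#[repr(C)]
union IntOrFloat {
    i: u32,
    f: f32,
}

/// Reinterpret the union's storage as a u32 by casting the union pointer.
fn float_bits(u: &IntOrFloat) -> u32 {
    // Every field of a #[repr(C)] union is at offset 0, so a pointer to the
    // union is also a pointer to each of its fields.
    let p = u as *const IntOrFloat as *const u32;
    // SAFETY: the storage holds 4 initialized bytes, and any such bit
    // pattern is a valid u32.
    unsafe { *p }
}

fn main() {
    let u = IntOrFloat { f: 1.0 };
    assert_eq!(float_bits(&u), 1.0f32.to_bits());
    let v = IntOrFloat { i: 7 };
    assert_eq!(float_bits(&v), 7);
}
```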
Padding
Since all fields are at offset 0, repr(C)
unions do not have padding before
their fields. They can, however, have padding in each union variant after the
field, to make all variants have the same size.
Moreover, the entire union can have trailing padding, to make sure the size is a multiple of the alignment:
use std::mem::{size_of, align_of};

#[repr(C, align(2))]
union U { x: u8 }

fn main() {
    // The repr(align) attribute raises the alignment requirement of U to 2
    assert_eq!(align_of::<U>(), 2);
    // This introduces trailing padding, raising the union size to 2
    assert_eq!(size_of::<U>(), 2);
}
Note: Fields are overlapped instead of laid out sequentially, so unlike structs there is no "between the fields" that could be filled with padding.
Zero-sized fields
repr(C)
union fields of zero-size are handled in the same way as in struct
fields, matching the behavior of GCC and Clang for unions in C when zero-sized
types are allowed via their language extensions.
That is, these fields occupy zero-size and participate in the layout computation of the union as usual:
use std::mem::{size_of, align_of};

#[repr(C)]
union U {
    x: u8,
    y: [u16; 0],
}

fn main() {
    // The zero-sized type [u16; 0] raises the alignment requirement to 2
    assert_eq!(align_of::<U>(), 2);
    // This in turn introduces trailing padding, raising the union size to 2
    assert_eq!(size_of::<U>(), 2);
}
C++ compatibility hazard: C++ does, in general, give a size of 1 to types with no fields. When such types are used as a union field in C++, a "naive" translation of that code into Rust will not produce a compatible result. Refer to the struct chapter for further details.
Layout of reference and pointer types
Disclaimer: Everything this section says about pointers to dynamically sized types represents the consensus from issue #16, but has not been stabilized through an RFC. As such, this is preliminary information.
Terminology
Reference types are types of the form &T
, &mut T
.
Raw pointer types are types of the form *const T
or *mut T
.
Representation
The alignments of &T, &mut T, *const T and *mut T are the same, and are at least the word size.
- If T is a sized type then the alignment of &T is the word size.
- The alignment of &dyn Trait is the word size.
- The alignment of &[T] is the word size.
- The alignment of &str is the word size.
- Alignment in other cases may be more than the word size (e.g., for other dynamically sized types).
The sizes of &T, &mut T, *const T and *mut T are the same, and are at least one word.
- If T is a sized type then the size of &T is one word.
- The size of &dyn Trait is two words.
- The size of &[T] is two words.
- The size of &str is two words.
- Size in other cases may be more than one word (e.g., for other dynamically sized types).
Notes
The layouts of &T
, &mut T
, *const T
and *mut T
are the same.
If T
is sized, references and pointers to T
have a size and alignment of one
word and have therefore the same layout as C pointers.
warning: while the layout of references and pointers is compatible with the layout of C pointers, references come with a validity invariant that does not allow them to be used when they could be NULL, unaligned, dangling, or, in the case of &mut T, aliasing.
We do not make any guarantees about the layout of
multi-trait objects &(dyn Trait1 + Trait2)
or references to other dynamically sized types,
other than that they are at least word-aligned, and have size at least one word.
The layout of &dyn Trait
when Trait
is a trait is the same as that of:
#[repr(C)]
struct DynObject {
    data: *const u8,
    vtable: *const u8,
}
note: In the layout of &mut dyn Trait the field data is of the type *mut u8.
The layout of &[T]
is the same as that of:
#[repr(C)]
struct Slice<T> {
    ptr: *const T,
    len: usize,
}
note: In the layout of &mut [T] the field ptr is of the type *mut T.
The layout of &str
is the same as that of &[u8]
, and the layout of &mut str
is
the same as that of &mut [u8]
.
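The sizes stated in this chapter can be checked with a short sketch. `slice_parts` is a hypothetical helper; it uses the stable `as_ptr` and `len` accessors rather than transmuting to the `Slice` struct above, since that layout is still preliminary.

```rust
use std::fmt::Debug;
use std::mem::size_of;

/// Split a slice reference into its pointer and length components.
fn slice_parts<T>(s: &[T]) -> (*const T, usize) {
    (s.as_ptr(), s.len())
}

fn main() {
    let word = size_of::<usize>();
    // Pointers and references to sized types are one word.
    assert_eq!(size_of::<&u64>(), word);
    assert_eq!(size_of::<*mut u64>(), word);
    // Slices, str and trait objects carry one word of extra metadata.
    assert_eq!(size_of::<&[u64]>(), 2 * word);
    assert_eq!(size_of::<&str>(), 2 * word);
    assert_eq!(size_of::<&dyn Debug>(), 2 * word);

    let data = [1u16, 2, 3];
    let (ptr, len) = slice_parts(&data);
    assert_eq!(len, 3);
    // SAFETY: `ptr` and `len` were just obtained from a live slice.
    let rebuilt = unsafe { std::slice::from_raw_parts(ptr, len) };
    assert_eq!(rebuilt, &data[..]);
}
```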
Representation of Function Pointers
Terminology
In Rust, a function pointer type is either fn(Args...) -> Ret
,
extern "ABI" fn(Args...) -> Ret
, unsafe fn(Args...) -> Ret
, or
unsafe extern "ABI" fn(Args...) -> Ret
.
A function pointer is the address of a function,
and has function pointer type.
The pointer is implicit in the fn
type,
and they have no lifetime of their own;
therefore, function pointers are assumed to point to
a block of code with static lifetime.
This is not necessarily always true,
since, for example, you can unload a dynamic library.
Therefore, this is only a safety invariant,
not a validity invariant;
as long as one doesn't call a function pointer which points to freed memory,
it is not undefined behavior.
In C, a function pointer type is Ret (*)(Args...)
, or Ret ABI (*)(Args...)
,
and values of function pointer type are either a null pointer value,
or the address of a function.
Representation
The ABI and layout of (unsafe)? (extern "ABI")? fn(Args...) -> Ret
is exactly that of the corresponding C type --
the lack of a null value does not change this.
On common platforms, this means that *const ()
and fn(Args...) -> Ret
have
the same ABI and layout. This is, in fact, guaranteed by POSIX and Windows.
This means that for the vast majority of platforms,
fn go_through_pointer(x: fn()) -> fn() {
    let ptr = x as *const ();
    unsafe { std::mem::transmute::<*const (), fn()>(ptr) }
}
is both perfectly safe, and, in fact, required for some APIs -- notably,
GetProcAddress
on Windows requires you to convert from void (*)()
to
void*
, to get the address of a variable;
and the opposite is true of dlsym
, which requires you to convert from
void*
to void (*)()
in order to get the address of functions.
This conversion is not guaranteed by Rust itself, however;
it is simply a property of the implementation. If the underlying platform allows this conversion,
so will Rust.
However, null values are not supported by the Rust function pointer types --
just like references, the expectation is that you use Option
to create
nullable pointers. Option<fn(Args...) -> Ret>
will have the exact same ABI
as fn(Args...) -> Ret
, but additionally allows null pointer values.
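A minimal sketch of a nullable callback parameter, assuming a C caller that may pass a null pointer. The names `double` and `apply_or_identity` are invented for this example.

```rust
use std::mem::size_of;
use std::os::raw::c_int;

/// A sample callback with the C ABI (hypothetical).
pub extern "C" fn double(x: c_int) -> c_int {
    x * 2
}

/// Calls `cb` if it is present. A C caller would pass a null function
/// pointer to select the `None` branch.
pub extern "C" fn apply_or_identity(
    cb: Option<extern "C" fn(c_int) -> c_int>,
    x: c_int,
) -> c_int {
    match cb {
        Some(f) => f(x),
        None => x,
    }
}

/// `Option` adds no space on top of the function pointer itself.
pub fn option_is_free() -> bool {
    size_of::<Option<fn()>>() == size_of::<fn()>()
}
```

Matching on the `Option` forces the null case to be handled before the pointer can be called.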
Use
Function pointers are mostly useful for talking to C -- in Rust, you would
mostly use T: Fn()
instead of fn()
. If talking to a C API,
the same caveats as apply to other FFI code should be followed.
As an example, we shall implement the following C interface in Rust:
struct Cons {
int data;
struct Cons *next;
};
struct Cons *cons(struct Cons *self, int data);
/*
notes:
- func must be non-null
- thunk may be null, and shall be passed unchanged to func
- self may be null, in which case no iteration is done
*/
void iterate(struct Cons const *self, void (*func)(int, void *), void *thunk);
bool for_all(struct Cons const *self, bool (*func)(int, void *), void *thunk);
use std::{
    ffi::c_void,
    os::raw::c_int,
};

#[repr(C)]
pub struct Cons {
    data: c_int,
    next: Option<Box<Cons>>,
}

#[no_mangle]
pub extern "C" fn cons(node: Option<Box<Cons>>, data: c_int) -> Box<Cons> {
    Box::new(Cons { data, next: node })
}

#[no_mangle]
pub unsafe extern "C" fn iterate(
    node: Option<&Cons>,
    func: unsafe extern "C" fn(c_int, *mut c_void), // note - non-nullable
    thunk: *mut c_void, // note - this is a thunk, so it's just passed raw
) {
    let mut it = node;
    while let Some(node) = it {
        func(node.data, thunk);
        it = node.next.as_ref().map(|x| &**x);
    }
}

#[no_mangle]
pub unsafe extern "C" fn for_all(
    node: Option<&Cons>,
    func: unsafe extern "C" fn(c_int, *mut c_void) -> bool,
    thunk: *mut c_void,
) -> bool {
    let mut it = node;
    // Match on `it`, not `node`, so that the loop advances through the list.
    while let Some(node) = it {
        if !func(node.data, thunk) {
            return false;
        }
        it = node.next.as_ref().map(|x| &**x);
    }
    true
}
Layout of Rust array types and slices
Layout of Rust array types
Array types, [T; N]
, store N
values of type T
with a stride that is
equal to the size of T
. Here, stride is the distance between each pair of
consecutive values within the array.
The offset of the first array element is 0
, that is, a pointer to the array
and a pointer to its first element both point to the same memory address.
The alignment of an array type is greater than or equal to the alignment of its
element type. If the element type is repr(C)
the layout of the array is
guaranteed to be the same as the layout of a C array with the same element type.
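The stride and offset rules can be observed directly by comparing element addresses. This is a small sketch with hypothetical helper names, fixed to arrays of length 4 for brevity.

```rust
use std::mem::{align_of, size_of};

/// Byte distance between the first two elements of `arr`.
pub fn stride_of<T>(arr: &[T; 4]) -> usize {
    let first = &arr[0] as *const T as usize;
    let second = &arr[1] as *const T as usize;
    second - first
}

/// True if a pointer to the array and a pointer to its first element
/// hold the same address.
pub fn starts_at_first_element<T>(arr: &[T; 4]) -> bool {
    arr as *const [T; 4] as usize == &arr[0] as *const T as usize
}

/// Size and alignment of a sample array type.
pub fn array_size_and_align() -> (usize, usize) {
    (size_of::<[u16; 5]>(), align_of::<[u16; 5]>())
}
```

For a `[u32; 4]`, `stride_of` returns `size_of::<u32>()`, which is 4.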
Note: the type of array arguments in C function signatures, e.g.,
void foo(T x[N])
, decays to a pointer. That is, these functions do not take arrays as arguments; they take a pointer to the first element of the array instead. Array types are therefore improper C types (not C FFI safe) in Rust foreign function declarations, e.g., extern { fn foo(x: [T; N]) -> [U; M]; }
. Pointers to arrays are fine: extern { fn foo(x: *const [T; N]) -> *const [U; M]; }
, and structs and unions containing arrays are also fine.
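As an illustration of the "pointer to array" form, here is a sketch of a Rust function with the C ABI that takes its array by pointer. The function `sum4` is hypothetical and not part of any real interface.

```rust
use std::os::raw::c_int;

/// Sums four values passed through a pointer to an array.
///
/// # Safety
/// `xs` must be non-null, aligned, and point to four initialized `c_int`s.
pub unsafe extern "C" fn sum4(xs: *const [c_int; 4]) -> c_int {
    // SAFETY: guaranteed by the caller, see the contract above.
    let arr: &[c_int; 4] = unsafe { &*xs };
    arr.iter().sum()
}
```

A caller on the Rust side can pass `&array`, which coerces to `*const [c_int; 4]`.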
Arrays of zero-size
Arrays [T; N]
have zero size if and only if their count N
is zero or their
element type T
is zero-sized.
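The two ways an array can be zero-sized, plus one counterexample, in a short sketch (the helper name is made up):

```rust
use std::mem::size_of;

/// Sizes of three sample array types.
pub fn zero_sized_examples() -> [usize; 3] {
    [
        size_of::<[u64; 0]>(),  // count is zero
        size_of::<[(); 100]>(), // element type is zero-sized
        size_of::<[u64; 1]>(),  // neither condition holds
    ]
}
```

This returns `[0, 0, 8]`.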
Layout compatibility with packed SIMD vectors
The layout of packed SIMD vector types [1] requires the size and alignment of the vector elements to match. That is, types with packed SIMD vector layout are layout compatible with arrays having the same element type and the same number of elements as the vector.
[1] The packed SIMD vector layout is the layout of repr(simd)
types like __m128
.
Layout of Rust slices
The layout of a slice [T]
of length N
is the same as that of a [T; N]
array.
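One way to observe this is to compare a slice borrowed from an array against the array type itself, using `size_of_val` and `align_of_val`. The helper below is a hypothetical sketch fixed to `[u32; 3]`.

```rust
use std::mem::{align_of_val, size_of, size_of_val};

/// True if the slice view of `arr` has the same size, alignment, and
/// starting address as the array it was borrowed from.
pub fn slice_matches_array(arr: &[u32; 3]) -> bool {
    let slice: &[u32] = &arr[..];
    size_of_val(slice) == size_of::<[u32; 3]>()
        && align_of_val(slice) == align_of_val(arr)
        && slice.as_ptr() == arr.as_ptr()
}
```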
Layout of packed SIMD vectors
Disclaimer: This chapter represents the consensus from issue #38. The statements in here are not (yet) "guaranteed" not to change until an RFC ratifies them.
Rust currently exposes packed [1] SIMD vector types like __m128
to users, but it
does not expose a way for users to construct their own vector types.
The set of currently-exposed packed SIMD vector types is implementation-defined and it is currently different for each architecture.
[1] "Packed" denotes that these SIMD vectors have a compile-time fixed size, distinguishing them from SIMD vector types whose size is only known at run-time. Rust currently only supports packed SIMD vector types. This is elaborated further in RFC2366.
Packed SIMD vector types
Packed SIMD vector types are repr(simd)
homogeneous tuple-structs containing
N
elements of type T
where N
is a power-of-two and the size and alignment
requirements of T
are equal:
#[repr(simd)]
struct Vector<T, N>(T_0, ..., T_(N - 1));
The set of supported values of T
and N
is implementation-defined.
The size of Vector
is N * size_of::<T>()
and its alignment is an
implementation-defined function of T
and N
greater than or equal to
align_of::<T>()
. That is:
assert_eq!(size_of::<Vector<T, N>>(), size_of::<T>() * N);
assert!(align_of::<Vector<T, N>>() >= align_of::<T>());
In particular, two distinct repr(simd)
vector types that have the same T
and the
same N
have the same size and alignment.
Vector elements are laid out in source field order, enabling random access to vector elements by reinterpreting the vector as an array:
union U {
vec: Vector<T, N>,
arr: [T; N]
}
assert_eq!(size_of::<Vector<T, N>>(), size_of::<[T; N]>());
assert!(align_of::<Vector<T, N>>() >= align_of::<[T; N]>());
unsafe {
let u = U { vec: Vector<T, N>(t_0, ..., t_(N - 1)) };
assert_eq!(u.vec.0, u.arr[0]);
// ...
assert_eq!(u.vec.(N - 1), u.arr[N - 1]);
}
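Since users cannot define `Vector<T, N>` themselves, a concrete version of the check above has to use one of the exposed types. The sketch below uses `__m128` from `std::arch::x86_64` and therefore only does real work on x86_64; on other architectures the hypothetical helper `roundtrip` returns `None`.

```rust
/// Reinterprets four `f32` lanes as an `__m128` and back.
#[cfg(target_arch = "x86_64")]
pub fn roundtrip(lanes: [f32; 4]) -> Option<[f32; 4]> {
    use std::arch::x86_64::__m128;
    use std::mem::{align_of, size_of, transmute};

    // Same size as the array, with at least the array's alignment.
    assert_eq!(size_of::<__m128>(), size_of::<[f32; 4]>());
    assert!(align_of::<__m128>() >= align_of::<[f32; 4]>());

    // SAFETY: both types are 16 bytes and every bit pattern is valid for each.
    let v: __m128 = unsafe { transmute(lanes) };
    Some(unsafe { transmute::<__m128, [f32; 4]>(v) })
}

/// Fallback for targets without `__m128`.
#[cfg(not(target_arch = "x86_64"))]
pub fn roundtrip(_lanes: [f32; 4]) -> Option<[f32; 4]> {
    None
}
```

A round trip alone does not prove the lane order; it only shows that the two representations are interchangeable at this size.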
Unresolved questions
-
Blocked: Should the layout of packed SIMD vectors be the same as that of homogeneous tuples? Such that:
union U {
    vec: Vector<T, N>,
    tup: (T_0, ..., T_(N-1)),
}
assert_eq!(size_of::<Vector<T, N>>(), size_of::<(T_0, ..., T_(N-1))>());
assert!(align_of::<Vector<T, N>>() >= align_of::<(T_0, ..., T_(N-1))>());
unsafe {
    let u = U { vec: Vector(t_0, ..., t_(N - 1)) };
    assert_eq!(u.vec.0, u.tup.0);
    // ...
    assert_eq!(u.vec.(N - 1), u.tup.(N - 1));
}
This is blocked on the resolution of issue #36 about the layout of homogeneous structs and tuples.
-
MaybeUninit<T>
does not have the same repr
as T
, so MaybeUninit<Vector<T, N>>
is not repr(simd)
, which has performance consequences and means that MaybeUninit<Vector<T, N>>
is not C-FFI safe.
Validity
Validity of unions
Disclaimer: This chapter is a work-in-progress. What's contained here represents the consensus from issue #73. The statements in here are not (yet) "guaranteed" not to change until an RFC ratifies them.
Validity of unions with zero-sized fields
A union containing a zero-sized field can contain any bit pattern. An example of such
a union is MaybeUninit
.
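A minimal sketch of what this permits in practice: a `MaybeUninit<u32>` may hold uninitialized memory, and only the final `assume_init` requires the contents to be a valid `u32`. The function name is hypothetical.

```rust
use std::mem::MaybeUninit;

/// Creates an uninitialized slot, fills it, and only then reads it.
pub fn init_later() -> u32 {
    // Holding uninitialized bytes in the union is fine.
    let mut slot = MaybeUninit::<u32>::uninit();
    slot.write(7);
    // SAFETY: the slot was initialized by `write` just above.
    unsafe { slot.assume_init() }
}
```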
Validity of function pointers
Disclaimer: This chapter is a work-in-progress. What's contained here represents the consensus from issue #72. The statements in here are not (yet) "guaranteed" not to change until an RFC ratifies them.
A function pointer is "valid" (in the sense that it can be produced without causing immediate UB) if and only if it is non-null.
That makes this code UB:
fn bad() {
    let x: fn() = unsafe { std::mem::transmute(0usize) }; // This is UB!
}
However, any integer value other than NULL is allowed for function pointers:
fn good() {
    let x: fn() = unsafe { std::mem::transmute(1usize) }; // This is not UB.
}
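When a null value has to be representable, the same transmute can target `Option<fn()>` instead, where the all-zero pattern reads as `None`. This sketch assumes a target where function pointers and `usize` have the same size; the helper name is made up.

```rust
use std::mem::transmute;

/// Reinterprets a zero address as a nullable function pointer.
pub fn null_becomes_none() -> bool {
    // SAFETY: assumes `usize` and `Option<fn()>` have the same size,
    // and that the all-zero pattern is the `None` value.
    let f: Option<fn()> = unsafe { transmute(0usize) };
    f.is_none()
}
```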
Optimizations
We should turn
// y unused
let mut x = f();
g(&mut x);
y = x;
// x unused
into
y = f();
g(&mut y);
to avoid a copy.
The potential issue here is g
storing the pointer it got as an argument elsewhere.
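To make the concern concrete, here is a sketch in which `g` stashes the address it was given, so that later code can ask whether that address belongs to `y`. The thread-local `STASH` and the bodies of `f` and `g` are invented for illustration.

```rust
use std::cell::Cell;

thread_local! {
    // Where `g` records the address it was given (illustrative only).
    static STASH: Cell<*const i32> = Cell::new(std::ptr::null());
}

fn f() -> i32 {
    1
}

fn g(x: &mut i32) {
    *x += 1;
    // The address escapes: it outlives the `&mut` borrow as a raw pointer.
    STASH.with(|s| s.set(x as *const i32));
}

/// Runs the unoptimized sequence. Returns `y` and whether the stashed
/// address is the address of `y`.
pub fn run() -> (i32, bool) {
    let mut x = f();
    g(&mut x);
    let y = x;
    let stashed = STASH.with(|s| s.get());
    (y, std::ptr::eq(stashed, &y))
}
```

The first component is always 2. The second is the interesting part: if `x` and `y` are merged into one location, the comparison can change its answer, which is exactly the observable difference the optimization would have to be allowed to introduce.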