- Start Date: 2014-11-12
- RFC PR: rust-lang/rfcs#474
- Rust Issue: rust-lang/rust#20034
Summary
This RFC reforms the design of the std::path module in preparation for API
stabilization. The path API must deal with many competing demands, and the
current design handles many of them, but suffers from some significant problems
given in “Motivation” below. The RFC proposes a redesign modeled loosely on the
current API that addresses these problems while maintaining the advantages of
the current design.
Motivation
The design of a path abstraction is surprisingly hard. Paths work radically differently on different platforms, so providing a cross-platform abstraction is challenging. On some platforms, paths are not required to be in Unicode, posing ergonomic and semantic difficulties for a Rust API. These difficulties are compounded if one also tries to provide efficient path manipulation that does not, for example, require extraneous copying. And, of course, the API should be easy and pleasant to use.
The current std::path module makes a strong effort to balance these design
constraints, but over time a few key shortcomings have emerged.
Semantic problems
Most importantly, the current std::path module makes some semantic assumptions
about paths that have turned out to be incorrect.
Normalization
Paths in std::path are always normalized, meaning that a/../b is treated like b (among other things). Unfortunately, this kind of normalization changes the meaning of paths when symbolic links are present: if a is a symbolic link, then the relative paths a/../b and b may refer to completely different locations. See this issue for more detail.
For this reason, most path libraries do not perform full normalization of paths, though they may normalize paths like a/./b to a/b. Instead, they offer (1) methods to optionally normalize and (2) methods to normalize based on the contents of the underlying file system.
Since our current normalization scheme can silently and incorrectly alter the meaning of paths, it needs to be changed.
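As a minimal sketch of the non-normalizing behavior argued for here, the following uses the std::path that was eventually stabilized (Rust 1.0 and later), not the API under discussion in this RFC. The helper name components_of is hypothetical. It shows that a/../b keeps its .. component and is not treated as b:

```rust
use std::path::{Component, Path};

/// Hypothetical helper: collects the components of `path` exactly as the
/// standard library parses them (no file system access is involved).
fn components_of(path: &str) -> Vec<Component<'_>> {
    Path::new(path).components().collect()
}

fn main() {
    let parts = components_of("a/../b");

    // The `..` component survives parsing instead of cancelling out `a`.
    assert_eq!(parts.len(), 3);
    assert_eq!(parts[1], Component::ParentDir);

    // Consequently the two paths are not considered equal.
    assert_ne!(Path::new("a/../b"), Path::new("b"));
    println!("{:?}", parts);
}
```

Because no .. is ever collapsed, a symbolic link named a keeps its meaning until the operating system resolves the path.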
Unicode and Windows
In the original std::path design, it was assumed that all paths on Windows were Unicode. However, it turns out that the Windows filesystem APIs actually work with UCS-2, which roughly means that they accept arbitrary sequences of u16 values but interpret them as UTF-16 when it is valid to do so.
The current std::path implementation is built around the assumption that Windows paths can be represented as Rust string slices, and will need to be substantially revised.
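To make the "arbitrary u16 sequences" point concrete, here is a small platform-independent sketch. It does not touch any Windows API; it only shows that a u16 sequence containing an unpaired surrogate is not valid UTF-16, so it has no lossless representation as a Rust string. The helper name is_valid_utf16 is hypothetical:

```rust
/// Hypothetical helper: returns true if `units` is well-formed UTF-16.
fn is_valid_utf16(units: &[u16]) -> bool {
    String::from_utf16(units).is_ok()
}

fn main() {
    // "dir" encoded as UTF-16 is well-formed.
    let well_formed: Vec<u16> = "dir".encode_utf16().collect();

    // 'd', an unpaired high surrogate (0xD800), then 'r'.
    let ill_formed: [u16; 3] = [0x0064, 0xD800, 0x0072];

    assert!(is_valid_utf16(&well_formed));
    assert!(!is_valid_utf16(&ill_formed));

    // A lossy conversion has to substitute U+FFFD, losing information.
    assert_eq!(String::from_utf16_lossy(&ill_formed), "d\u{FFFD}r");
}
```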
Ergonomic problems
Because paths in general are not in Unicode, the std::path module cannot rely on an internal string or string slice representation. That in turn causes trouble for methods like dirname that are intended to extract a subcomponent of a path: what should it return?
There are basically three possible options, and today’s std::path module chooses all of them:
- Yield a byte sequence: dirname yields an &[u8]
- Yield a string slice, accounting for potential non-UTF-8 values: dirname_str yields an Option<&str>
- Yield another path: dir_path yields a Path
This redundancy is present for most of the decomposition methods. The saving grace is that, in general, path methods consume BytesContainer values, so one can use the &[u8] variant but continue to work with other path methods. But in general &[u8] values are not ergonomic to work with, and the explosion in methods makes the module more (superficially) complex than one might expect.
You might be tempted to provide only the third option, but Path values are owned and mutable, so that would imply cloning on every decomposition operation. For applications like Cargo that work heavily with paths, this would be an unfortunate (and seemingly unnecessary) overhead.
Organizational problems
Finally, the std::path module presents a somewhat complex API organization:
- The Path type is a direct alias of a platform-specific path type.
- The GenericPath trait provides most of the common API expected on both platforms.
- The GenericPathUnsafe trait provides a few unsafe/unchecked functions for performance reasons.
- The posix and windows submodules provide their own Path types and a handful of platform-specific functionality (in particular, windows provides support for working with volumes and “verbatim” paths prefixed with \\?\).
This organization needs to be updated to match current conventions and simplified if possible.
One thing to note: with the current organization, it is possible to work with non-native paths, which can sometimes be useful for interoperation. The new design should retain this functionality.
Detailed design
Note: this design is influenced by the Boost filesystem library and by Scheme48 and Racket’s approach to encoding issues on Windows.
Overview
The basic design uses DST to follow the same pattern as Vec<T>/[T] and String/str: there is a PathBuf type for owned, mutable paths and an unsized Path type for slices. The various “decomposition” methods for extracting components of a path all return slices, and PathBuf itself derefs to Path.
The result is an API that is both efficient and ergonomic: there is no need to allocate/copy when decomposing a path, but there is also no need to provide multiple variants of methods to extract bytes versus Unicode strings. For example, the Path slice type provides a single method for converting to a str slice (when applicable).
A key aspect of the design is that there is no internal normalization of paths at all. Aside from solving the symbolic link problem, this choice also has useful ramifications for the rest of the API, described below.
The proposed API deals with the other problems mentioned above, and also brings the module in line with current Rust patterns and conventions. These details will be discussed after getting a first look at the core API.
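The owned/slice split can be sketched with the std::path that was eventually stabilized, which keeps the PathBuf/Path pairing but differs from this proposal in details (for example, it uses parent rather than dir_path, and file_name yields an OsStr rather than a Path). The function describe below is a hypothetical illustration; every value it returns borrows from the input, so no decomposition allocates:

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: splits a path into directory, file name, and
/// extension. All three results borrow from `path`; nothing is copied.
fn describe(path: &Path) -> (Option<&Path>, Option<&str>, Option<&str>) {
    (
        path.parent(),
        path.file_name().and_then(|name| name.to_str()),
        path.extension().and_then(|ext| ext.to_str()),
    )
}

fn main() {
    // An owned, growable path, akin to String.
    let mut owned = PathBuf::from("src/lib");
    owned.push("path.rs");

    // `&PathBuf` coerces to `&Path` through Deref, akin to &String -> &str.
    let (dir, name, ext) = describe(&owned);
    assert_eq!(dir, Some(Path::new("src/lib")));
    assert_eq!(name, Some("path.rs"));
    assert_eq!(ext, Some("rs"));
}
```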
The cross-platform API
The proposed core, cross-platform API provided by the new std::path is as follows:
// A sized, owned type akin to String:
pub struct PathBuf { .. }

// An unsized slice type akin to str:
pub struct Path { .. }

// Some ergonomics and generics, following the pattern in String/str and Vec<T>/[T]
impl Deref<Path> for PathBuf { ... }
impl BorrowFrom<PathBuf> for Path { ... }

// A replacement for BytesContainer; used to cut down on explicit coercions
pub trait AsPath for Sized? {
    fn as_path(&self) -> &Path;
}

impl<Sized? P> PathBuf where P: AsPath {
    pub fn new<T: IntoString>(path: T) -> PathBuf;

    pub fn push(&mut self, path: &P);
    pub fn pop(&mut self) -> bool;

    pub fn set_file_name(&mut self, file_name: &P);
    pub fn set_extension(&mut self, extension: &P);
}

// These will ultimately replace the need for `push_many`
impl<Sized? P> FromIterator<P> for PathBuf where P: AsPath { .. }
impl<Sized? P> Extend<P> for PathBuf where P: AsPath { .. }

impl<Sized? P> Path where P: AsPath {
    pub fn new(path: &str) -> &Path;

    pub fn as_str(&self) -> Option<&str>;
    pub fn to_str_lossy(&self) -> Cow<String, str>; // Cow will replace MaybeOwned
    pub fn to_owned(&self) -> PathBuf;

    // iterate over the components of a path
    pub fn iter(&self) -> Iter;

    pub fn is_absolute(&self) -> bool;
    pub fn is_relative(&self) -> bool;
    pub fn is_ancestor_of(&self, other: &P) -> bool;

    pub fn path_relative_from(&self, base: &P) -> Option<PathBuf>;
    pub fn starts_with(&self, base: &P) -> bool;
    pub fn ends_with(&self, child: &P) -> bool;

    // The "root" part of the path, if absolute
    pub fn root_path(&self) -> Option<&Path>;

    // The "non-root" part of the path
    pub fn relative_path(&self) -> &Path;

    // The "directory" portion of the path
    pub fn dir_path(&self) -> &Path;

    pub fn file_name(&self) -> Option<&Path>;
    pub fn file_stem(&self) -> Option<&Path>;
    pub fn extension(&self) -> Option<&Path>;

    pub fn join(&self, path: &P) -> PathBuf;

    pub fn with_file_name(&self, file_name: &P) -> PathBuf;
    pub fn with_extension(&self, extension: &P) -> PathBuf;
}

pub struct Iter<'a> { .. }

impl<'a> Iterator<&'a Path> for Iter<'a> { .. }

pub const SEP: char = ..
pub const ALT_SEPS: &'static [char] = ..

pub fn is_separator(c: char) -> bool { .. }
There is plenty of overlap with today’s API, and the methods being retained here largely have the same semantics.
But there are also a few potentially surprising aspects of this design that merit comment:
- Why does PathBuf::new take IntoString? It needs an owned buffer internally, and taking a string means that Unicode input is guaranteed, which works on all platforms. (In general, the assumption is that non-Unicode paths are most commonly produced by reading a path from the filesystem, rather than creating new ones. As we’ll see below, there are platform-specific ways to create non-Unicode paths.)
- Why no Path::as_bytes method? There is no cross-platform way to expose paths directly in terms of byte sequences, because each platform extends beyond Unicode in its own way. In particular, Unix platforms accept arbitrary u8 sequences, while Windows accepts arbitrary u16 sequences (both modulo disallowing interior 0s). The u16 sequences provided by Windows do not have a canonical encoding as bytes; this RFC proposes to use WTF-8 internally (see below), but does not expose that choice in the API.
- What about interior nulls? Currently various Rust system APIs will panic when given strings containing interior null values because, while these are valid UTF-8, it is not possible to send them as-is to C APIs that expect null-terminated strings. The API here follows the same approach, panicking if given a path with an interior null.
- Why do file_name and extension operations work with Path rather than some other type? In particular, it may seem strange to view an extension as a path. But doing so allows us to not reveal platform differences about the various character sets used in paths. By and large, extensions in practice will be valid Unicode, so the various methods going to and from str will suffice. But as with paths in general, there are platform-specific ways of working with non-Unicode data, explained below.
- Where did push_many and friends go? They’re replaced by implementing FromIterator and Extend, following a similar pattern with the Vec type. (Some work will be needed to retain full efficiency when doing so.)
- How does Path::new work? The ability to directly get a &Path from an &str (i.e., with no allocation or other work) is a key part of the representation choices, which are described below.
- Where is the normalize method? Since the path type no longer internally normalizes, it may be useful to explicitly request normalization. This can be done by writing let normalized: PathBuf = p.iter().collect() for a path p, because the iterator performs some on-the-fly normalization (see below). Note: this normalization does not include removing .., for the reasons explained at the beginning of the RFC.
- What does the iterator yield? Unlike today’s components, the iter method here will begin with root_path if there is one. Thus, a/b/c will yield a, b and c, while /a/b/c will yield /, a, b and c.
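The "collect the iterator to normalize" idiom and the root-first iteration order can be tried against the std::path that was eventually stabilized, which behaves similarly on these points even though its method names differ from this proposal (components plays the role of iter here). The helper name lightly_normalized is hypothetical:

```rust
use std::path::{Component, Path, PathBuf, MAIN_SEPARATOR};

/// Hypothetical helper: rebuilds a path from its components. Interior `.`
/// components and repeated separators are dropped; `..` is kept as-is.
fn lightly_normalized(path: &Path) -> PathBuf {
    path.components().collect()
}

fn main() {
    // `a/./b//c` is rebuilt with single separators and no interior `.`.
    let expected = format!("a{0}b{0}c", MAIN_SEPARATOR);
    let rebuilt = lightly_normalized(Path::new("a/./b//c"));
    assert_eq!(rebuilt.to_str(), Some(expected.as_str()));

    // `..` is deliberately left alone.
    let kept = lightly_normalized(Path::new("a/../b"));
    assert!(kept.components().any(|c| c == Component::ParentDir));

    // Iterating a rooted path starts with the root itself.
    let first = Path::new("/a/b/c").components().next();
    assert_eq!(first, Some(Component::RootDir));
}
```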
Important semantic rules
The path API is designed to satisfy several semantic rules described below.
Note that == here is lazily normalizing, treating ./b as b and a//b as a/b; see the next section for more details.
Suppose p is some &Path and dot == Path::new("."):
p == p.join(dot)
p == dot.join(p)

p == p.root_path().unwrap_or(dot)
      .join(p.relative_path())

p.relative_path() == match p.root_path() {
    None => p,
    Some(root) => p.path_relative_from(root).unwrap()
}

p == p.dir_path()
      .join(p.file_name().unwrap_or(dot))

p == p.iter().collect()

p == match p.file_name() {
    None => p,
    Some(name) => p.with_file_name(name)
}

p == match p.extension() {
    None => p,
    Some(ext) => p.with_extension(ext)
}

p == match (p.file_stem(), p.extension()) {
    (Some(stem), Some(ext)) => p.with_file_name(stem).with_extension(ext),
    _ => p
}
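Several of these rules can be checked mechanically. The sketch below does so against the std::path that was eventually stabilized, so it is only an approximation of the proposal: parent stands in for dir_path, the root_path/relative_path rules have no direct counterpart and are skipped, and the dot.join(p) rule is omitted because the stabilized comparison treats a leading . as significant. The function name holds_roundtrip_rules is hypothetical:

```rust
use std::path::{Path, PathBuf};

/// Hypothetical checker: returns true if `p` satisfies the round-trip
/// rules that have a direct counterpart in the stabilized API.
fn holds_roundtrip_rules(p: &Path) -> bool {
    let dot = Path::new(".");

    // p == p.join(dot): a trailing `.` does not change the path.
    let join_dot = p.join(dot) == p;

    // p == p.iter().collect()
    let recollect = p.iter().collect::<PathBuf>() == p;

    // Replacing the file name or extension with itself is a no-op.
    let same_name = match p.file_name() {
        None => true,
        Some(name) => p.with_file_name(name) == p,
    };
    let same_ext = match p.extension() {
        None => true,
        Some(ext) => p.with_extension(ext) == p,
    };

    // Directory part joined with the file name gives back the path.
    let split_join = match (p.parent(), p.file_name()) {
        (Some(dir), Some(name)) => dir.join(name) == p,
        _ => true,
    };

    join_dot && recollect && same_name && same_ext && split_join
}

fn main() {
    assert!(holds_roundtrip_rules(Path::new("dir/sub/file.txt")));
    assert!(holds_roundtrip_rules(Path::new("/usr/lib")));
}
```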
Representation choices, Unicode, and normalization
A lot of the design in this RFC depends on a key property: both Unix and Windows paths can be easily represented as a flat byte sequence “compatible” with UTF-8. For Unix platforms, this is trivial: they accept any byte sequence, and will generally interpret the byte sequences as UTF-8 when valid to do so. For Windows, this representation involves a clever hack – proposed formally as WTF-8 – that encodes its native UCS-2 in a generalization of UTF-8. This RFC will not go into the details of that hack; please read Simon’s excellent writeup if you’re interested.
The upshot of all of this is that we can uniformly represent path slices as newtyped byte slices, and any UTF-8 encoded data will “do the right thing” on all platforms.
Furthermore, by not doing any internal, up-front normalization, it’s possible to provide a Path::new that goes from &str to &Path with no intermediate allocation or validation. In the common case that you’re working with Rust strings to construct paths, there is zero overhead. It also means that Path::new(some_str).as_str() == Some(some_str).
The main downside of this choice is that some of the path functionality must cope with non-normalized paths. So, for example, the iterator must skip . path components (unless it is the entire path), and similarly for methods like pop. In general, methods that yield new path slices are expected to work as if:
- ./b is just b
- a//b is just a/b
and comparisons between paths should also behave as if the paths had been normalized in this way.
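A short sketch of the zero-cost round trip and the lazily normalizing comparison, again written against the std::path that was eventually stabilized (where the conversion method is named to_str rather than as_str). The helper name same_path is hypothetical, and the sketch only exercises the a//b and interior . cases:

```rust
use std::path::Path;

/// Hypothetical helper: compares two strings as paths rather than as text.
fn same_path(a: &str, b: &str) -> bool {
    Path::new(a) == Path::new(b)
}

fn main() {
    // Path::new borrows the string as-is, so the round trip is exact even
    // for text that is not in normalized form.
    let raw = "some//dir/./file.txt";
    assert_eq!(Path::new(raw).to_str(), Some(raw));

    // Comparison behaves as if redundant separators and interior `.`
    // components had been removed.
    assert!(same_path("a//b", "a/b"));
    assert!(same_path("a/./b", "a/b"));

    // `..` is never collapsed, so these remain different paths.
    assert!(!same_path("a/../b", "b"));
}
```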
Organization and platform-specific APIs
Finally, the proposed API is organized as std::path with unix and windows submodules, as today. However, there is no GenericPath or GenericPathUnsafe; instead, the API given above is implemented as a trivial wrapper around path implementations provided by either the unix or the windows submodule (based on #[cfg]). In other words:
- std::path::windows::Path works with Windows-style paths
- std::path::unix::Path works with Unix-style paths
- std::path::Path is a thin newtype wrapper around the current platform’s path implementation
This organization makes it possible to manipulate foreign paths by working with the appropriate submodule.
In addition, each submodule defines some extension traits, explained below, that supplement the path API with functionality relevant to its variant of path.
But what if you’re writing a platform-specific application and wish to use the extended functionality directly on std::path::Path? In this case, you will be able to import the appropriate extension trait via os::unix or os::windows, depending on your platform. This is part of a new, general strategy for explicitly “opting-in” to platform-specific features by importing from os::some_platform (where the some_platform submodule is available only on that platform).
Unix
On Unix platforms, the only additional functionality is to let you work directly with the underlying byte representation of various path types:
pub trait UnixPathBufExt {
    fn from_vec(path: Vec<u8>) -> Self;
    fn into_vec(self) -> Vec<u8>;
}

pub trait UnixPathExt {
    fn from_bytes(path: &[u8]) -> &Self;
    fn as_bytes(&self) -> &[u8];
}
This is acceptable because the platform supports arbitrary byte sequences (usually interpreted as UTF-8).
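The opt-in pattern can be illustrated with the stabilized standard library, where the Unix byte-level methods live on an extension trait at std::os::unix::ffi::OsStrExt rather than on the UnixPathExt trait sketched above. The function non_utf8_path and the file name bytes are hypothetical; on non-Unix targets the sketch simply returns None:

```rust
use std::path::PathBuf;

/// Hypothetical helper: builds a path whose name is not valid UTF-8.
#[cfg(unix)]
fn non_utf8_path() -> Option<PathBuf> {
    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt; // explicit opt-in to Unix-only methods

    // 0xE9 is Latin-1 for 'é'; followed by '.', it is not valid UTF-8.
    let raw: &[u8] = b"caf\xE9.txt";
    let path = PathBuf::from(OsStr::from_bytes(raw));

    // The underlying bytes round-trip unchanged.
    assert_eq!(path.as_os_str().as_bytes(), raw);
    Some(path)
}

/// Fallback for platforms without the Unix extension trait.
#[cfg(not(unix))]
fn non_utf8_path() -> Option<PathBuf> {
    None
}

fn main() {
    if let Some(path) = non_utf8_path() {
        // The cross-platform string view reports that conversion fails...
        assert!(path.to_str().is_none());
        // ...while the lossy view substitutes U+FFFD for the bad byte.
        assert_eq!(path.to_string_lossy(), "caf\u{FFFD}.txt");
    }
}
```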
Windows
On Windows, the additional APIs allow you to convert to/from UCS-2 (roughly, arbitrary u16 sequences interpreted as UTF-16 when applicable); because the name “UCS-2” does not have a clear meaning, these APIs use u16_slice and will be carefully documented. They also provide the remaining Windows-specific path decomposition functionality that today’s path module supports.
pub trait WindowsPathBufExt {
    fn from_u16_slice(path: &[u16]) -> Self;
    fn make_non_verbatim(&mut self) -> bool;
}

pub trait WindowsPathExt {
    fn is_cwd_relative(&self) -> bool;
    fn is_vol_relative(&self) -> bool;
    fn is_verbatim(&self) -> bool;
    fn prefix(&self) -> PathPrefix;
    fn to_u16_slice(&self) -> Vec<u16>;
}

enum PathPrefix<'a> {
    Verbatim(&'a Path),
    VerbatimUNC(&'a Path, &'a Path),
    VerbatimDisk(&'a Path),
    DeviceNS(&'a Path),
    UNC(&'a Path, &'a Path),
    Disk(&'a Path),
}
Drawbacks
The DST/slice approach is conceptually more complex than today’s API, but in practice seems to yield a much tighter API surface.
Alternatives
Due to the known semantic problems, it is not really an option to retain the current path implementation. As explained above, supporting UCS-2 also means that the various byte-slice methods in the current API are untenable, so the API also needs to change.
Probably the main alternative to the proposed API would be to not use DST/slices, and instead use owned paths everywhere (probably doing some normalization of . at the same time). While the resulting API would be simpler in some respects, it would also be substantially less efficient for common operations.
Unresolved questions
It is not clear how best to incorporate the WTF-8 implementation (or how much to incorporate) into libstd.
There has been a long debate over whether paths should implement Show given that they may contain non-UTF-8 data. This RFC does not take a stance on that (the API may include something like today’s display adapter), but a follow-up RFC will address the question more generally.