- Feature Name: c_str_literal
- Start Date: 2022-11-15
- RFC PR: rust-lang/rfcs#3348
- Rust Issue: rust-lang/rust#105723
Summary
c"…" string literals.
Motivation
Looking at the number of cstr!() invocations on GitHub alone (about 3.2k files with matches), C string literals appear to be a widely used feature. Implementing cstr!() as a macro_rules or proc_macro requires non-trivial code to get it completely right (e.g. refusing embedded nul bytes), and is still less flexible than it should be (e.g. in terms of accepted escape codes).
In Rust 2021, we reserved prefixes for (string) literals, so let’s make use of that.
Guide-level explanation
c"abc" is a C string literal. A nul byte (b'\0') is appended to it in memory and the result is a &CStr.
All escape codes and characters accepted by "" and b"" literals are accepted, except nul bytes. So, both UTF-8 and non-UTF-8 data can co-exist in a C string, e.g. c"hello\x80我叫\u{1F980}".
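As a rough illustration of the in-memory result, the sketch below assembles the bytes that c"hello\x80我叫\u{1F980}" is described as producing. It uses the existing CStr::from_bytes_with_nul API rather than the proposed literal syntax, and the helper name mixed_c_string_bytes is made up for this example.

```rust
use std::ffi::CStr;

/// Hypothetical helper: the bytes the proposed literal
/// c"hello\x80我叫\u{1F980}" is described as occupying in memory.
fn mixed_c_string_bytes() -> Vec<u8> {
    // The byte escape \x80 is stored as the single byte 0x80,
    // which is not valid UTF-8 on its own.
    let mut bytes = b"hello\x80".to_vec();
    // Unicode characters and \u{…} escapes are stored as their UTF-8 encoding.
    bytes.extend_from_slice("我叫\u{1F980}".as_bytes());
    // Exactly one terminating nul byte is appended.
    bytes.push(0);
    bytes
}

fn main() {
    let bytes = mixed_c_string_bytes();
    let c = CStr::from_bytes_with_nul(&bytes).expect("no interior nul bytes");
    println!("{} bytes before the nul", c.to_bytes().len());
    println!("valid UTF-8 as a whole: {}", c.to_str().is_ok());
}
```

Because of the lone 0x80 byte, the contents as a whole are not valid UTF-8, even though the characters after it are.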
The raw string literal variant is prefixed with cr, for example cr"\" and cr##"Hello "world"!"## (just like r"" and br"").
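To make the raw variant concrete, here is a small sketch of the contents the two raw examples would have, modelled with existing br"" byte string literals and an explicitly appended nul. The function raw_examples is a hypothetical name used only for this illustration.

```rust
use std::ffi::CStr;

/// Hypothetical helper: C strings modelled on cr"\" and cr##"Hello "world"!"##,
/// built with today's API instead of the proposed syntax.
fn raw_examples() -> (&'static CStr, &'static CStr) {
    // In a raw literal the backslash is an ordinary byte, not an escape.
    let backslash = CStr::from_bytes_with_nul(b"\\\0").unwrap();
    // Between ##…## delimiters, double quotes need no escaping.
    let quoted = CStr::from_bytes_with_nul(b"Hello \"world\"!\0").unwrap();
    (backslash, quoted)
}

fn main() {
    let (backslash, quoted) = raw_examples();
    // Compare against the equivalent raw byte string literals.
    assert_eq!(backslash.to_bytes(), &br"\"[..]);
    assert_eq!(quoted.to_bytes(), &br##"Hello "world"!"##[..]);
}
```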
Reference-level explanation
Two new string literal types: c"…" and cr#"…"#.
Accepted escape codes: Quote & Unicode & Byte.
Nul bytes are disallowed, whether as escape code or source character (e.g. "\0", "\x00", "\u{0}" or "␀").
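The existing runtime constructor CStr::from_bytes_with_nul already enforces the same rule, so it can serve as a sketch of what the literal would check at compile time. The helper is_valid_c_string_body below is a hypothetical name, not part of this proposal.

```rust
use std::ffi::CStr;

/// Hypothetical helper: returns true if `contents` (without a terminator)
/// could be the body of a C string, i.e. it contains no nul byte.
fn is_valid_c_string_body(contents: &[u8]) -> bool {
    let mut with_nul = contents.to_vec();
    with_nul.push(0);
    // from_bytes_with_nul fails if a nul byte appears before the final one.
    CStr::from_bytes_with_nul(&with_nul).is_ok()
}

fn main() {
    println!("abc: {}", is_valid_c_string_body(b"abc"));
    // Every spelling of the nul byte yields the same byte, so all are rejected.
    println!("\\0: {}", is_valid_c_string_body(b"a\0bc"));
    println!("\\x00: {}", is_valid_c_string_body(b"\x00"));
    println!("\\u{{0}}: {}", is_valid_c_string_body("\u{0}".as_bytes()));
}
```

With the literal syntax, the failing cases would be rejected by the compiler instead of returning an error at run time.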
Unicode characters are accepted and encoded as UTF-8. That is, c"🦀", c"\u{1F980}" and c"\xf0\x9f\xa6\x80" are all accepted and equivalent.
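The equivalence of the three spellings can be checked today with ordinary string and byte string literals. This is only a sketch of the encoding rule, and crab_encodings is a made-up helper name.

```rust
/// Hypothetical helper: the same character written as a source character,
/// as a unicode escape, and as explicit byte escapes.
fn crab_encodings() -> [&'static [u8]; 3] {
    [
        "🦀".as_bytes(),
        "\u{1F980}".as_bytes(),
        &b"\xf0\x9f\xa6\x80"[..],
    ]
}

fn main() {
    let [source_char, unicode_escape, byte_escapes] = crab_encodings();
    // All three spellings denote the same four UTF-8 bytes.
    assert_eq!(source_char, unicode_escape);
    assert_eq!(source_char, byte_escapes);
}
```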
The type of the expression is &core::ffi::CStr. So, the CStr type will have to become a lang item. (no_core programs that don’t use c"" string literals won’t need to define this lang item.)
Interactions with string related macros:
- The concat macro will not accept these literals, just like it doesn’t accept byte string literals.
- The format_args macro will not accept such a literal as the format string, just like it doesn’t accept a byte string literal.
(This might change in the future. E.g. format_args!(c"…") would be cool, but that would require generalizing the macro and fmt::Arguments to work for other kinds of strings. (Ideally also for b"…".))
Rationale and alternatives
- No c"" literal, but just a cstr!() macro. (Possibly as part of the standard library.)
  This requires complicated machinery to implement correctly.
  The trivial implementation of using concat!($s, "\0") is problematic for several reasons, including non-string input and embedded nul bytes. (The unstable concat_bytes!() solves some of the problems.)
  The popular cstr crate is a proc macro to work around the limitations of a macro_rules implementation, but that also has many downsides.
  Even if we had the right language features for a trivial correct implementation, there are many code bases where C strings are the primary form of string, making cstr!("..") syntax quite cumbersome.
- No c"" literal, but make it possible for "" to implicitly become a &CStr through magic.
  We already allow integer literals (e.g. 123) to become one of many types, so perhaps we could do the same to string literals.
  (It could be a built-in fixed set of types (e.g. just str, [u8], and CStr), or it could be something extensible through something like a const trait FromStringLiteral. Not sure how that would exactly work, but it sounds cool.)
- Allowing only valid UTF-8 and unicode-oriented escape codes (like in "…", e.g. 螃蟹 or \u{1F980} but not \xff).
  For regular string literals, we have this restriction because &str is required to be valid UTF-8. However, C literals (and objects of our &CStr type) aren’t necessarily valid UTF-8.
- Allowing only ASCII characters and byte-oriented escape codes (like in b"…", e.g. \xff but not 螃蟹 or \u{1F980}).
  While C literals (and &CStr) aren’t necessarily valid UTF-8, they often do contain UTF-8 data. Refusing to put UTF-8 in it would make the feature less useful and would unnecessarily make it harder to use unicode in programs that mainly use C strings.
- Having separate c"…" and bc"…" string literal prefixes for UTF-8 and non-UTF-8.
  Both of those would be the same type (&CStr). Unless we add a special “always valid UTF-8 C string” type, there’s not much use in separating them.
- Use z instead of c (z"…"), for “zero terminated” instead of “C string”.
  We already have a type called CStr for this, so c seems consistent.
- Also add c'…' as c_char literal.
  It’d be identical to b'…', except it’d be a c_char instead of u8.
  This would easily lead to unportable code, since c_char is i8 or u8 depending on the platform. (Not a wrapper type, but a direct type alias.) E.g. fn f(_: i8) {} f(c'a'); would compile only on some platforms.
  An alternative is to allow c'…' to implicitly be either a u8 or i8. (Just like integer literals can implicitly become one of many types.)
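To illustrate the first alternative above, here is a minimal sketch of the trivial concat!-based macro and two of the problems mentioned. The macro name naive_cstr_bytes is hypothetical, and it yields raw bytes rather than a &CStr so that the failure modes can be observed safely.

```rust
use std::ffi::CStr;

/// Hypothetical naive macro: appends a nul with concat! and yields the bytes.
macro_rules! naive_cstr_bytes {
    ($s:literal) => {
        concat!($s, "\0").as_bytes()
    };
}

fn main() {
    // The intended use works.
    assert!(CStr::from_bytes_with_nul(naive_cstr_bytes!("abc")).is_ok());

    // Non-string input is silently accepted: concat! stringifies the integer.
    assert_eq!(naive_cstr_bytes!(123), &b"123\0"[..]);

    // An embedded nul byte compiles fine; the problem only shows up
    // when the bytes are validated at run time.
    assert!(CStr::from_bytes_with_nul(naive_cstr_bytes!("a\0b")).is_err());
}
```

A macro that skipped the runtime validation (for instance by using an unchecked constructor) would hand C code a string that ends at the first embedded nul.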
Drawbacks
- The CStr type needs some work. &CStr is currently a wide pointer, but it’s supposed to be a thin pointer. See https://doc.rust-lang.org/1.65.0/src/core/ffi/c_str.rs.html#87
  It’s not a blocker, but we might want to try to fix that before stabilizing c"…".
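The wide versus thin pointer difference can be observed by comparing sizes. The sketch below only prints what the current toolchain reports. The helper name pointer_sizes is made up, and the numbers depend on the target and on how CStr is defined in the standard library version in use.

```rust
use std::ffi::{c_char, CStr};
use std::mem::size_of;

/// Hypothetical helper: (size of &CStr, size of *const c_char) in bytes.
fn pointer_sizes() -> (usize, usize) {
    (size_of::<&CStr>(), size_of::<*const c_char>())
}

fn main() {
    let (rust_ref, c_ptr) = pointer_sizes();
    // A wide pointer carries a length next to the address, so it is larger
    // than the plain char pointer that C APIs expect.
    println!("&CStr: {rust_ref} bytes, *const c_char: {c_ptr} bytes");
}
```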
Prior art
- C has C string literals ("…"). :)
- Nim has cstring"…".
- COBOL has Z"…".
- Probably a lot more languages, but it’s hard to search for. :)
Unresolved questions
- Also add c'…' C character literals? (u8, i8, c_char, or something more flexible?)
- Should we make &CStr a thin pointer before stabilizing this? (If so, how?)
- Should the (unstable) concat_bytes macro accept C string literals? (If so, should it evaluate to a C string or byte string?)
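For context on the first question, this is roughly how a C character is written portably today, without a c'…' literal: start from a byte literal and convert explicitly. The helper name ascii_to_c_char is an assumption made for this sketch.

```rust
use std::ffi::c_char;

/// Hypothetical helper: converts an ASCII byte to the platform's c_char.
/// The cast is a no-op where c_char is u8 and a reinterpretation where it is i8.
fn ascii_to_c_char(byte: u8) -> c_char {
    byte as c_char
}

fn main() {
    let a = ascii_to_c_char(b'a');
    // c_char::MIN is negative where c_char is i8 and zero where it is u8.
    println!("c_char is signed on this target: {}", c_char::MIN != 0);
    println!("'a' as c_char: {a}");
}
```

Code written this way compiles on both kinds of platform, which is the property a c'…' literal typed directly as c_char would have to preserve.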
Future possibilities
(These aren’t necessarily all good ideas.)
- Make concat!() or concat_bytes!() work with c"…".
- Make format_args!(c"…") (and format_args!(b"…")) work.
- Improve the &CStr type, and make it FFI safe.
- Accept unicode characters and escape codes in b"" literals too: RFC 3349.
- More prefixes! w"", os"", path"", utf16"", brokenutf16"", utf32"", wtf8"", ebcdic"", …
- No more prefixes! Have let a: &CStr = "…"; work through magic, removing the need for prefixes. (That won’t happen any time soon probably, so that shouldn’t block c"…" now.)