- Feature Name: str-words
- Start Date: 2015-04-10
- RFC PR: rust-lang/rfcs#1054
- Rust Issue: rust-lang/rust#24543
Rename or replace `str::words` to side-step the ambiguity of “a word”.
The `str::words` method is currently marked `#[unstable(reason = "the precise algorithm to use is unclear")]`. Indeed, the concept of “a word” is not easy to define in the presence of punctuation or of languages with various conventions, including not using spaces at all to separate words.
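To illustrate the ambiguity, here is a small sketch of whitespace-based splitting applied to text with punctuation and to a sentence in a language that does not separate words with spaces. The helper name `whitespace_chunks` and the sample strings are assumptions chosen for illustration; they are not part of this RFC.

```rust
/// Hypothetical helper: split on Unicode whitespace and drop empty pieces.
fn whitespace_chunks(s: &str) -> Vec<&str> {
    s.split(char::is_whitespace)
        .filter(|piece| !piece.is_empty())
        .collect()
}

fn main() {
    // Punctuation stays attached to the neighboring "word".
    assert_eq!(whitespace_chunks("Hello, world!"), ["Hello,", "world!"]);

    // A Japanese sentence containing several words but no spaces
    // comes back as a single piece.
    assert_eq!(whitespace_chunks("今日は良い天気です"), ["今日は良い天気です"]);
}
```

Neither result is obviously what a user would expect from a method called “words”.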
Issue #15628 suggests changing the algorithm to be based on the Word Boundaries section of Unicode Standard Annex #29: Unicode Text Segmentation.
While a Rust implementation of UAX#29 would be useful, it belongs on crates.io more than in `std`:
It carries significant complexity that may be surprising from something that looks as simple as a parameter-less “words” method in the standard library. Users may not be aware of how subtle defining “a word” can be.
It is not a definitive answer. The standard itself notes:
It is not possible to provide a uniform set of rules that resolves all issues across languages or that handles all ambiguous situations within a given language. The goal for the specification presented in this annex is to provide a workable default; tailored implementations can be more sophisticated.
and gives many examples of such ambiguous situations.
Therefore, `std` would be better off avoiding the question of defining word boundaries entirely.
Rename the `words` method to `split_whitespace`, and keep the current behavior unchanged. (That is, return an iterator equivalent to `s.split(char::is_whitespace).filter(|s| !s.is_empty())`.)
Rename the return type `std::str::Words` to `std::str::SplitWhitespace`.
Optionally, keep a `words` wrapper method for a while, both `#[deprecated]` and `#[unstable]`, with an error message that suggests `split_whitespace` or the chosen alternative.
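The intended behavior of the renamed method can be checked with a short sketch. It assumes a toolchain in which the rename to `split_whitespace` has landed; the helper `split_and_filter` and the sample inputs are hypothetical names and data used only for illustration.

```rust
/// Hypothetical helper: the explicit split-then-filter form of the same iteration.
fn split_and_filter(s: &str) -> Vec<&str> {
    s.split(char::is_whitespace)
        .filter(|piece| !piece.is_empty())
        .collect()
}

fn main() {
    // Sample inputs (assumptions): empty input, leading/trailing/repeated
    // spaces, tab and newline, and U+3000 IDEOGRAPHIC SPACE.
    let samples = ["", "  a  b ", "one\ttwo\nthree", "x\u{3000}y"];

    for &s in samples.iter() {
        let renamed: Vec<&str> = s.split_whitespace().collect();
        // Both forms should yield the same non-empty substrings.
        assert_eq!(renamed, split_and_filter(s));
    }
}
```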
`split_whitespace` is very similar to the existing `str::split<P: Pattern>(&self, P)` method, and having a separate method seems like weak API design. (But see below.)
Replace `str::words` with `struct Whitespace;` with a custom `Pattern` implementation, which can be used in `str::split`. However, this requires the `Whitespace` symbol to be imported separately.
Remove `str::words` entirely and tell users to use `s.split(char::is_whitespace).filter(|s| !s.is_empty())` instead.
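If `str::words` were removed, users would have to call `str::split` with a whitespace predicate themselves. The sketch below shows why a `filter` step is then needed: each whitespace character acts as its own delimiter, so runs of whitespace and trailing whitespace produce empty slices. The function names `raw_split` and `filtered_split` are hypothetical and exist only for this illustration.

```rust
/// Hypothetical helper: split on every single whitespace character.
fn raw_split(s: &str) -> Vec<&str> {
    s.split(char::is_whitespace).collect()
}

/// Hypothetical helper: the same split, with empty slices discarded.
fn filtered_split(s: &str) -> Vec<&str> {
    s.split(char::is_whitespace)
        .filter(|piece| !piece.is_empty())
        .collect()
}

fn main() {
    let s = "a  b ";

    // Two consecutive spaces and the trailing space each yield an empty slice.
    assert_eq!(raw_split(s), ["a", "", "b", ""]);

    // Dropping the empty slices leaves only the whitespace-separated pieces.
    assert_eq!(filtered_split(s), ["a", "b"]);
}
```

The cost of this alternative is that every caller has to remember the `filter` step.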
Is there a better alternative?