Bug #18931
closed
Inconsistent handling of invalid codepoints in String#lstrip and String#rstrip
Description
When attempting to strip a string, there are three basic options when an invalid code point is encountered:
- Ignore the code point
- Strip the code point
- Raise an exception
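The three options can be made concrete with a minimal sketch. The helper `lstrip_with_policy` and its `on_invalid:` keyword are hypothetical names used only for illustration; they are not part of Ruby, and the whitespace set is an approximation of what `String#lstrip` removes.

```ruby
# Hypothetical helper (not a Ruby API): strips leading whitespace and applies
# an explicit policy when an invalid code point is met during the scan.
def lstrip_with_policy(str, on_invalid:)
  # Assumed whitespace set: null, tab, LF, VT, FF, CR, space.
  whitespace = ["\0", "\t", "\n", "\v", "\f", "\r", " "]
  offset = 0
  str.each_char do |ch|
    if !ch.valid_encoding?
      case on_invalid
      when :ignore then break                   # treat as a non-whitespace boundary
      when :strip  then offset += ch.bytesize   # treat as if it were whitespace
      when :raise  then raise ArgumentError, "invalid byte sequence in #{str.encoding}"
      else raise ArgumentError, "unknown policy: #{on_invalid.inspect}"
      end
    elsif whitespace.include?(ch)
      offset += ch.bytesize
    else
      break
    end
  end
  # Slice by bytes so the remainder is returned untouched, broken bytes included.
  str.byteslice(offset, str.bytesize - offset)
end

sample = (+" \x80 abc").force_encoding(Encoding::UTF_8)
p lstrip_with_policy(sample, on_invalid: :ignore)
p lstrip_with_policy(sample, on_invalid: :strip)
```

With `:raise`, the same call would raise `ArgumentError` instead of returning a string.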
For background, Ruby does not consider the string's code range for lstrip or rstrip. It permits stripping strings with an ENC_CODERANGE_BROKEN code range so long as no invalid code point is encountered while performing the loop to remove whitespace. What it does when such a code point is encountered, however, is not consistent between lstrip and rstrip.
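ENC_CODERANGE_BROKEN is a C-level flag, but from Ruby code the same condition can be observed through `String#valid_encoding?`. A small sketch, with the strings forced to UTF-8 so the result does not depend on the source encoding:

```ruby
# Build the strings explicitly as UTF-8.
broken = (+" \x80abc").force_encoding(Encoding::UTF_8)
valid  = (+" abc").force_encoding(Encoding::UTF_8)

puts broken.valid_encoding?  # false: "\x80" is a lone continuation byte in UTF-8
puts valid.valid_encoding?   # true
p broken.bytes               # the raw bytes remain accessible either way
```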
String#lstrip will raise an invalid byte sequence error whenever it reaches an invalid code point while scanning leading whitespace:
> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]
> ruby -e 'p " \x80abc".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
from -e:1:in `<main>'
> ruby -e 'p " \x80 abc".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
from -e:1:in `<main>'
> ruby -e 'p "\x80 abc".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
from -e:1:in `<main>'
> ruby -e 'p "\x80".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
from -e:1:in `<main>'
> ruby -e ' p " a\x80bc".lstrip'
"a\x80bc" # This one is okay because the broken code point appears after a non-whitespace code point.
Things get a lot messier with String#rstrip, however. Depending on context, rstrip may raise an exception, treat the broken code point as a non-whitespace boundary and stop processing, or treat the broken code point as if it were whitespace and remove it.
String#rstrip will ignore the invalid code point if it immediately follows a non-whitespace code point:
> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]
> ruby -e 'p "abc\x80 ".rstrip'
"abc\x80"
> ruby -e 'p "abc\x80".rstrip'
"abc\x80"
String#rstrip will remove the invalid code point if it is surrounded by whitespace:
> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]
> ruby -e 'p "abc \x80".rstrip'
"abc"
> ruby -e 'p "abc \x80 ".rstrip'
"abc"
> ruby -e 'p " \x80 ".rstrip'
""
String#rstrip will raise an exception if no valid, non-whitespace code points appear before it:
> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]
> ruby -e 'p "\x80 ".rstrip'
-e:1:in `rstrip': invalid byte sequence in UTF-8 (ArgumentError)
from -e:1:in `<main>'
> ruby -e 'p "\x80".rstrip'
-e:1:in `rstrip': invalid byte sequence in UTF-8 (ArgumentError)
from -e:1:in `<main>'
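Because the results above are specific to the Ruby build shown, it can help to re-run the same inputs on another build. The following is a minimal probe script; `probe` is a hypothetical helper name, and the script makes no assumption about which outcome a given version produces.

```ruby
# Hypothetical helper: returns the inspected result of the strip call,
# or a description of the exception if the call raises.
def probe(str, method_name)
  str.public_send(method_name).inspect
rescue ArgumentError, EncodingError => e
  "#{e.class}: #{e.message}"
end

# Inputs taken from the examples in this report.
samples = [" \x80abc", "\x80 abc", "abc\x80 ", "abc \x80", " \x80 ", "\x80 "]

puts RUBY_DESCRIPTION
samples.each do |raw|
  str = (+raw).force_encoding(Encoding::UTF_8)
  puts "#{str.inspect}: lstrip=#{probe(str, :lstrip)} rstrip=#{probe(str, :rstrip)}"
end
```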
It looks to me like the current behavior is a byproduct of the functions chosen for finding code point boundaries, rather than something deliberately chosen. E.g., rb_str_lstrip will call rb_enc_codepoint_len, which raises on invalid code points, while rb_str_rstrip calls rb_enc_prev_char, which doesn't perform the same code point validation. I think it'd make for a better user experience if lstrip and rstrip behaved consistently with each other, which would also unify the behavior within rstrip itself. What that behavior should be needs to be decided and I'm hoping to reach consensus on the semantics in this issue.
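Until the semantics are settled, a caller who needs predictable results can choose a policy explicitly by scrubbing the string first, so that the strip methods only ever see valid input. This is a sketch of a caller-side workaround using `String#scrub`, not a proposal for what lstrip or rstrip should do.

```ruby
input = (+" \x80 abc").force_encoding(Encoding::UTF_8)

# Policy A: drop invalid bytes entirely, then strip.
dropped = input.scrub("").lstrip

# Policy B: replace invalid bytes with U+FFFD (the default replacement for
# UTF-8 strings), which is not whitespace, then strip.
replaced = input.scrub.lstrip

p dropped                  # "abc"
p replaced == "\uFFFD abc" # true
```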