Bug #18931
closedInconsistent handling of invalid codepoints in String#lstrip and String#rstrip
Description
When attempting to strip a string, there are three basic options when an invalid code point is encountered:
- Ignore the code point
- Strip the code point
- Raise an exception
For background, Ruby does not consider the string's code range for lstrip
or rstrip
. It permits stripping strings with a ENC_CODERANGE_BROKEN
so long as any invalid code points are not encountered while performing the loop to remove whitespace. What it does when such a code point is encountered, however, is not consistent between lstrip
and rstrip
.
String#lstrip
will unconditionally raise an invalid byte sequence error:
> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]
> ruby -e 'p " \x80abc".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
from -e:1:in `<main>'
> ruby -e 'p " \x80 abc".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
from -e:1:in `<main>'
> ruby -e 'p "\x80 abc".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
from -e:1:in `<main>'
> ruby -e 'p "\x80".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
from -e:1:in `<main>'
> ruby -e ' p " a\x80bc".lstrip'
"a\x80bc" # This one is okay because the broken code point appears after a non-whitespace code point.
Things get a lot messier with String#rstrip
, however. Depending on context, rstrip
may raise an exception, treat the broken code point as a non-whitespace boundary and stop processing, or treat the broken code point as if it were whitespace and remove it.
String#rstrip
will ignore the invalid code point if it immediately follows a non-whitespace code point:
> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]
> ruby -e 'p "abc\x80 ".rstrip'
"abc\x80"
> ruby -e 'p "abc\x80".rstrip'
"abc\x80"
String#rstrip
will remove the invalid code point if it is surround by whitespace:
> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]
> ruby -e 'p "abc \x80".rstrip'
"abc"
> ruby -e 'p "abc \x80 ".rstrip'
"abc"
> ruby -e 'p " \x80 ".rstrip'
""
String#rstrip
will raise an exception if no valid, non-whitespace code points appear before it:
> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]
> ruby -e 'p "\x80 ".rstrip'
-e:1:in `rstrip': invalid byte sequence in UTF-8 (ArgumentError)
from -e:1:in `<main>'
> ruby -e 'p "\x80".rstrip'
-e:1:in `rstrip': invalid byte sequence in UTF-8 (ArgumentError)
from -e:1:in `<main>'
It looks to me like the current behavior is a byproduct of the functions chosen for finding code point boundaries, rather than something deliberately chosen. E.g., rb_str_lstrip
will call rb_enc_codepoint_len
, which raises on invalid code points, while rb_str_rstrip
calls rb_enc_prev_char
, which doesn't perform the same code point validation. I think it'd make for a better user experience if lstrip
and rstrip
behaved consistently with each other, which would then unify the behavior in rstrip
. What that behavior should be needs to be decided and I'm hoping to reach consensus on the semantics in this issue.
Updated by nirvdrum (Kevin Menard) over 2 years ago
My own take on three options, with no significance to the order, are:
Ignore the code point
The documentation for lstrip
is "Returns a copy of the receiver with leading whitespace removed." It seems fairly straightforward and there's no mention of string validation; raising an exception might violate user expectations.
Treating broken code points the same as any other non-whitespace code point would be logically consistent. An additional benefit is the method could be implemented more efficiently as the whitespace check can be done without calculating code point boundaries. Only ASCII whitespace code points are stripped and those, by definition, are only one byte wide. However, if lstrip
and rstrip
ever evolve to handle non-ASCII whitespace we'll be back to calculating code point boundaries.
Strip the code point
Despite rstrip
doing it in some cases, I don't think removing the invalid code points is what an end user would expect and runs counter to the method's documentation.
Raise an exception
Given lstrip
's behavior, raising in all cases would be the most backward-compatible and is consistent with equivalent expressions (e.g., " \x80 abc".sub(/^\s+/, "")
will raise an error on the invalid byte sequence). While the documentation makes no mention of string validation, encountering an invalid code point is arguably an exceptional condition.
Updated by jeremyevans0 (Jeremy Evans) over 2 years ago
I submitted a pull request to always raise an exception for rstrip strings with broken coderange: https://github.com/ruby/ruby/pull/6282
That may not match the lstrip behavior exactly, as it complains about cases where the broken coderange is before the last non-whitespace character. However, it seems the simplest solution, and I'm not sure we want to go out of our way to support broken strings in rstrip.
Updated by jeremyevans (Jeremy Evans) about 2 years ago
- Status changed from Open to Closed
Applied in changeset git|571d21fd4a2e877f49b4ff918832bda9a5e8f91c.
Make String#rstrip{,!} raise Encoding::CompatibilityError for broken coderange
It's questionable whether we want to allow rstrip to work for strings
where the broken coderange occurs before the trailing whitespace and
not after, but this approach is probably simpler, and I don't think
users should expect string operations like rstrip to work on broken
strings.
In some cases, this changes rstrip to raise
Encoding::CompatibilityError instead of ArgumentError. However, as
the problem is related to an encoding issue in the receiver, and due
not due to an issue with an argument, I think
Encoding::CompatibilityError is the more appropriate error.
Fixes [Bug #18931]