Bug #18931: Inconsistent handling of invalid codepoints in String#lstrip and String#rstrip - Ruby - Ruby Issue Tracking System


Added by nirvdrum (Kevin Menard) over 3 years ago. Updated about 3 years ago.

Status: Closed
Assignee:
Target version:
ruby -v: ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]
Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN

[ruby-core:109264]

Description

When attempting to strip a string, there are three basic options when an invalid code point is encountered:

1. Ignore the code point
2. Strip the code point
3. Raise an exception

For background, Ruby does not consider the string's code range for lstrip or rstrip. It permits stripping a string whose code range is ENC_CODERANGE_BROKEN, so long as no invalid code point is encountered during the loop that removes whitespace. What it does when such a code point is encountered, however, is not consistent between lstrip and rstrip.
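
As a small illustration of what a "broken" string looks like from Ruby code, the sketch below uses String#valid_encoding?, which reports false for strings containing invalid byte sequences. The variable names are hypothetical and chosen only for this example; the two strings are taken from the cases in this report.

```ruby
# Both strings contain the stray byte \x80, so neither is valid UTF-8.
after_text  = " a\x80bc"   # invalid byte comes after a non-whitespace character
before_text = " \x80abc"   # invalid byte sits inside the leading whitespace run

[after_text, before_text].each do |s|
  puts "#{s.inspect} valid_encoding? => #{s.valid_encoding?}"
end
```

Both strings are equally invalid as a whole; what differs is only where the invalid byte sits relative to the whitespace being stripped.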

String#lstrip raises an invalid byte sequence error whenever it reaches an invalid code point before any non-whitespace code point:

> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]

> ruby -e 'p " \x80abc".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e 'p " \x80 abc".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e 'p "\x80 abc".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e 'p "\x80".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e ' p " a\x80bc".lstrip'
"a\x80bc"   # This one is okay because the broken code point appears after a non-whitespace code point.

Things get a lot messier with String#rstrip, however. Depending on context, rstrip may raise an exception, treat the broken code point as a non-whitespace boundary and stop processing, or treat the broken code point as if it were whitespace and remove it.

String#rstrip will ignore the invalid code point if it immediately follows a non-whitespace code point:

> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]

> ruby -e 'p "abc\x80 ".rstrip'
"abc\x80"

> ruby -e 'p "abc\x80".rstrip'
"abc\x80"

String#rstrip will remove the invalid code point if it is preceded by whitespace, whether or not more whitespace follows it:

> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]

> ruby -e 'p "abc \x80".rstrip'
"abc"

> ruby -e 'p "abc \x80 ".rstrip'
"abc"

> ruby -e 'p " \x80 ".rstrip'
""

String#rstrip will raise an exception if no valid, non-whitespace code points appear before it:

> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]

> ruby -e 'p "\x80 ".rstrip'
-e:1:in `rstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e 'p "\x80".rstrip'
-e:1:in `rstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'
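
For anyone who wants to re-run all of the cases above in one go, here is a minimal sketch of a harness. The helper name strip_outcome and the cases table are hypothetical, made up for this example. It rescues EncodingError in addition to ArgumentError purely as a precaution, on the assumption that other Ruby versions might raise a different exception class; no expected output is listed because the results depend on the Ruby version in use.

```ruby
# Run one strip method and report either its return value or the exception class.
def strip_outcome(str, meth)
  [:ok, str.public_send(meth)]
rescue ArgumentError, EncodingError => e
  [:raised, e.class]
end

# Inputs copied from the transcripts in this report.
cases = {
  lstrip: [" \x80abc", " \x80 abc", "\x80 abc", "\x80", " a\x80bc"],
  rstrip: ["abc\x80 ", "abc\x80", "abc \x80", "abc \x80 ", " \x80 ", "\x80 ", "\x80"],
}

cases.each do |meth, inputs|
  inputs.each do |input|
    kind, value = strip_outcome(input, meth)
    puts format("%-7s %-14s %-7s %s", meth, input.inspect, kind, value.inspect)
  end
end
```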

It looks to me like the current behavior is a byproduct of the functions chosen for finding code point boundaries, rather than something deliberately chosen. E.g., rb_str_lstrip will call rb_enc_codepoint_len, which raises on invalid code points, while rb_str_rstrip calls rb_enc_prev_char, which doesn't perform the same code point validation. I think it'd make for a better user experience if lstrip and rstrip behaved consistently with each other, which would then unify the behavior in rstrip. What that behavior should be needs to be decided and I'm hoping to reach consensus on the semantics in this issue.
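
To make the three candidate semantics concrete, here is a rough pure-Ruby model of a forward strip loop with a selectable policy for invalid code points. This is a sketch for discussion only, not Ruby's C implementation: model_lstrip, the on_invalid: keyword, and the STRIPPABLE set (ASCII whitespace plus NUL) are all assumptions made for the example. The :raise policy mirrors the lstrip behavior reported above; the backward scan used by rstrip is not modeled.

```ruby
# Characters treated as strippable in this sketch (an assumption).
STRIPPABLE = [" ", "\t", "\n", "\v", "\f", "\r", "\0"].freeze

# Hypothetical model: walk characters from the left and apply a policy
# whenever an invalid code point is reached inside the strippable run.
def model_lstrip(str, on_invalid: :raise)
  offset = 0
  str.each_char do |ch|
    if !ch.valid_encoding?
      case on_invalid
      when :raise  then raise ArgumentError, "invalid byte sequence in #{str.encoding}"
      when :ignore then break   # treat it as a non-whitespace boundary and stop
      when :strip  then nil     # treat it like whitespace and keep going
      else raise ArgumentError, "unknown policy: #{on_invalid.inspect}"
      end
    elsif !STRIPPABLE.include?(ch)
      break                     # first valid non-whitespace character ends the scan
    end
    offset += ch.bytesize
  end
  str.byteslice(offset, str.bytesize - offset)
end

p model_lstrip(" \x80abc", on_invalid: :ignore)  # prints "\x80abc"
p model_lstrip(" \x80abc", on_invalid: :strip)   # prints "abc"
p model_lstrip(" a\x80bc")                       # prints "a\x80bc"
```

Whichever policy is chosen, applying the same one to both the forward and the backward scan is what would make lstrip and rstrip agree.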

Related issues 1 (0 open — 1 closed)


Updated by nirvdrum (Kevin Menard) over 3 years ago, #1 [ruby-core:109265]

Updated by jeremyevans0 (Jeremy Evans) over 3 years ago, #2 [ruby-core:109668]

Updated by jeremyevans (Jeremy Evans) about 3 years ago, #3

Updated by nobu (Nobuyoshi Nakada) about 1 year ago, #4
