Bug #19784: String#delete_prefix! problem - Ruby - Ruby Issue Tracking System

Actions

Copy link

Bug #19784

closed

String#delete_prefix! problem

Added by inversion (Yura Babak) about 2 years ago. Updated almost 2 years ago.

Status:

Closed

Assignee:

Target version:

ruby -v:

ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]

Backport:

3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN

[ruby-core:114276]

Description

Here is the snipped and the question is in the comments:

fp = 'with_BOM_16.txt'
body = File.read(fp).force_encoding('UTF-8')

p body                            # "\xFF\xFE1\u00001\u0000"
p body.start_with?("\xFF\xFE")    # true

body.delete_prefix!("\xFF\xFE")   # !!! why doesn't work?
p body                            # "\xFF\xFE1\u00001\u0000"
p body.start_with?("\xFF\xFE")    # true

body[0, 2] = ''
p body                            # "1\u00001\u0000"
p body.start_with?("\xFF\xFE")    # false

Works same
on Linux (ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux])
and Windows (ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x64-mingw-ucrt])

Actions

Copy link

#1 [ruby-core:114278]

Updated by byroot (Jean Boussier) about 2 years ago

I suspect it's because "\xFF\xFE1\u00001\u0000" (UTF-8) has invalid encoding.

That said, if starts_with? returns true, delete_prefix should arguably work.

I'll try to see if I can find why, but in the meantime I tested that you can workaround this by casting the string to Encoding::BINARY:

str = "\xff\xfe1\u00001\u0000"
str.force_encoding(Encoding::BINARY)
str.delete_prefix!("\xff\xfe".b)
str.force_encoding(Encoding::UTF_8)
p str # => "1\u00001\u0000"

Actions

Copy link

#2 [ruby-core:114279]

Updated by byroot (Jean Boussier) about 2 years ago

Ok, so as suspected, in the code strings with a broken encoding are very explicitly rejected: https://github.com/ruby/ruby/commit/10082360b9124c3eaabfccf4fe10a3640c40a05e#diff-430d86fdb6c4a558ab0f1b6648bbfae1720e8bde84f026e95a52740014752040R9201

if (is_broken_string(prefix)) return 0;

This is from the initial implementation decided in [Feature #12694], and the PR implementing it was https://github.com/ruby/ruby/pull/1632

I don't see much discussion of that behavior in either the ticket or PR, but there are tests for it so it was intentional from the PR author, not necessarily from committers.

I'd be in favor of changing it so that it matches start_with? behavior, but I think that needs to be discussed at the developer meeting. It's also unclear whether if accepted it would be as a feature or a bug fix.

Actions

Copy link

#3 [ruby-core:114281]

Updated by byroot (Jean Boussier) about 2 years ago

Ticket added to the next dev meeting.

Actions

Copy link

#4 [ruby-core:114294]

Updated by duerst (Martin Dürst) about 2 years ago

I agree that start_with? and delete_prefix! should probably be aligned. But in general, encoding problems should lead to early errors, not be deferred.

Actions

Copy link

#5 [ruby-core:114486]

Updated by mame (Yusuke Endoh) almost 2 years ago

Discussed at the dev meeting.
We agreed that both start_with? and delete_prefix should compare character by character if valid, and byte by byte for invalid.

"\xFF\xFE".start_with?("\xFF")   #=> true   # not changed
"\xFF\xFE".delete_prefix("\xFF") #=> "\xFE" # changed
"Ä".start_with?("\xC3")          #=> false  # changed
"Ä".delete_prefix("\xC3")        #=> "Ä"    # not changed

Note that encoding validity is checked on a position-by-position basis:

"\xFFÄ".delete_prefix("\xFF") #=> should be "Ä"
"\xFFÄ".delete_prefix("\xFF\xC3") #=> should be "\xFFÄ"
"\xFFÄ".delete_prefix("\xFF\xC3\x84") #=> should be ""

Actions

Copy link

#6 [ruby-core:114494]

Updated by Eregon (Benoit Daloze) almost 2 years ago

mame (Yusuke Endoh) wrote in #note-5:

Note that encoding validity is checked on a position-by-position basis:

That's not clear to me.
Do you mean that if the argument or if the receiver String is not String#valid_encoding?, then we compare byte-by-byte, and otherwise we compare character-by-character ?

I think we should not consider whether a given substring is valid, that would be very expensive to check.

I wonder if always comparing byte-by-byte for both is not a lot simpler, faster, and also what most users expect.
And of course still have the encoding check to ensure both strings have a compatible encoding.

Actions

Copy link

#7 [ruby-core:114527]

Updated by mame (Yusuke Endoh) almost 2 years ago

Do you mean that if the argument or if the receiver String is not String#valid_encoding?, then we compare byte-by-byte, and otherwise we compare character-by-character ?

No.

I think we should not consider whether a given substring is valid

I think it's this way. In the case of "\xFF\xC3\x84" (= "\xFFÄ"), the byte sequence from byteoffset 0 is invalid, so we take out one byte "\xFF", and the next byte sequence from byteoffset 1 is valid, so we take out two bytes (one character) "\xC3\x84", and so on, I think @akr (Akira Tanaka) or @naruse (Yui NARUSE) can explain the rationale.

Actions

Copy link

Updated by nobu (Nobuyoshi Nakada) almost 2 years ago

Status changed from Open to Closed

Applied in changeset git|b054c2fe06598f1141fdc337b10046f41f0e227c.

[Bug #19784] Fix behaviors against prefix with broken encoding

String#start_with?
String#delete_prefix
String#delete_prefix!

Actions

Copy link

#9 [ruby-core:114563]

Updated by byroot (Jean Boussier) almost 2 years ago

Note that we have the same issue with end_with? and delete_suffix.

Actions

Copy link

#10 [ruby-core:114608]

Updated by jhawthorn (John Hawthorn) almost 2 years ago

@ywenc and I found a regression from this patch. We have some code handling a broken UTF-8 String with a combination of valid and invalid bytes (UTF-8 followed by binary, which IMO should probably be binary encoded, but it's surprising that the behaviour changed).

"hello\xBE".start_with?("hello") #=> false in trunk, was true on 3.2
"hello\xFE".start_with?("hello") #=> true (both 3.2 and trunk, intended behaviour)
"hello\xBE".delete_prefix("hello") => "\xBE" (both on 3.2 and trunk), because we skip the check when the prefix is valid
"\xFFhello\xBE".delete_prefix("\xFFhello") => "\xFFhello\xBE" in trunk, "\xBE" on 3.2

This is because we're looking at character following the prefix, observing that it looks like a UTF-8 continuation byte, and so returns false.

This approach might work for ends_with?/delete_suffix, where we don't break on an invalid character in the suffix, but doesn't feel right for prefixes. It sounds like the intended design is that to the user this should feel like we were comparing from the start of the strings char-by-char for valid and byte-by-byte for invalid.

We added tests and tried using the end of the previous character, rather than the "start" of the current, to determine if the prefix ends at a char boundary. https://github.com/ruby/ruby/pull/8348

Actions

Copy link

Also available in: Atom PDF

Like1

Like0Like0Like1Like0Like0Like0Like0Like1Like1Like2

Project

General

Profile

Ruby

Tags

Custom queries

Bug #19784

String#delete_prefix! problem

Updated by byroot (Jean Boussier) about 2 years ago

Updated by byroot (Jean Boussier) about 2 years ago

Updated by byroot (Jean Boussier) about 2 years ago

Updated by duerst (Martin Dürst) about 2 years ago

Updated by mame (Yusuke Endoh) almost 2 years ago

Updated by Eregon (Benoit Daloze) almost 2 years ago

Updated by mame (Yusuke Endoh) almost 2 years ago

Updated by nobu (Nobuyoshi Nakada) almost 2 years ago

Updated by byroot (Jean Boussier) almost 2 years ago

Updated by jhawthorn (John Hawthorn) almost 2 years ago