Bug #19784
closedString#delete_prefix! problem
Description
Here is the snipped and the question is in the comments:
fp = 'with_BOM_16.txt'
body = File.read(fp).force_encoding('UTF-8')
p body # "\xFF\xFE1\u00001\u0000"
p body.start_with?("\xFF\xFE") # true
body.delete_prefix!("\xFF\xFE") # !!! why doesn't work?
p body # "\xFF\xFE1\u00001\u0000"
p body.start_with?("\xFF\xFE") # true
body[0, 2] = ''
p body # "1\u00001\u0000"
p body.start_with?("\xFF\xFE") # false
Works same
on Linux (ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux])
and Windows (ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x64-mingw-ucrt])
Updated by byroot (Jean Boussier) over 1 year ago
I suspect it's because "\xFF\xFE1\u00001\u0000" (UTF-8)
has invalid encoding.
That said, if starts_with?
returns true, delete_prefix
should arguably work.
I'll try to see if I can find why, but in the meantime I tested that you can workaround this by casting the string to Encoding::BINARY
:
str = "\xff\xfe1\u00001\u0000"
str.force_encoding(Encoding::BINARY)
str.delete_prefix!("\xff\xfe".b)
str.force_encoding(Encoding::UTF_8)
p str # => "1\u00001\u0000"
Updated by byroot (Jean Boussier) over 1 year ago
Ok, so as suspected, in the code strings with a broken encoding are very explicitly rejected: https://github.com/ruby/ruby/commit/10082360b9124c3eaabfccf4fe10a3640c40a05e#diff-430d86fdb6c4a558ab0f1b6648bbfae1720e8bde84f026e95a52740014752040R9201
if (is_broken_string(prefix)) return 0;
This is from the initial implementation decided in [Feature #12694], and the PR implementing it was https://github.com/ruby/ruby/pull/1632
I don't see much discussion of that behavior in either the ticket or PR, but there are tests for it so it was intentional from the PR author, not necessarily from committers.
I'd be in favor of changing it so that it matches start_with?
behavior, but I think that needs to be discussed at the developer meeting. It's also unclear whether if accepted it would be as a feature or a bug fix.
Updated by byroot (Jean Boussier) over 1 year ago
Ticket added to the next dev meeting.
Updated by duerst (Martin Dürst) over 1 year ago
I agree that start_with?
and delete_prefix!
should probably be aligned. But in general, encoding problems should lead to early errors, not be deferred.
Updated by mame (Yusuke Endoh) over 1 year ago
Discussed at the dev meeting.
We agreed that both start_with?
and delete_prefix
should compare character by character if valid, and byte by byte for invalid.
"\xFF\xFE".start_with?("\xFF") #=> true # not changed
"\xFF\xFE".delete_prefix("\xFF") #=> "\xFE" # changed
"Ä".start_with?("\xC3") #=> false # changed
"Ä".delete_prefix("\xC3") #=> "Ä" # not changed
Note that encoding validity is checked on a position-by-position basis:
"\xFFÄ".delete_prefix("\xFF") #=> should be "Ä"
"\xFFÄ".delete_prefix("\xFF\xC3") #=> should be "\xFFÄ"
"\xFFÄ".delete_prefix("\xFF\xC3\x84") #=> should be ""
Updated by Eregon (Benoit Daloze) over 1 year ago
mame (Yusuke Endoh) wrote in #note-5:
Note that encoding validity is checked on a position-by-position basis:
That's not clear to me.
Do you mean that if the argument or if the receiver String is not String#valid_encoding?
, then we compare byte-by-byte, and otherwise we compare character-by-character ?
I think we should not consider whether a given substring is valid, that would be very expensive to check.
I wonder if always comparing byte-by-byte for both is not a lot simpler, faster, and also what most users expect.
And of course still have the encoding check to ensure both strings have a compatible encoding.
Updated by mame (Yusuke Endoh) over 1 year ago
Do you mean that if the argument or if the receiver String is not
String#valid_encoding?
, then we compare byte-by-byte, and otherwise we compare character-by-character ?
No.
I think we should not consider whether a given substring is valid
I think it's this way. In the case of "\xFF\xC3\x84"
(= "\xFFÄ"
), the byte sequence from byteoffset 0 is invalid, so we take out one byte "\xFF"
, and the next byte sequence from byteoffset 1 is valid, so we take out two bytes (one character) "\xC3\x84"
, and so on, I think @akr (Akira Tanaka) or @naruse (Yui NARUSE) can explain the rationale.
Updated by nobu (Nobuyoshi Nakada) over 1 year ago
- Status changed from Open to Closed
Applied in changeset git|b054c2fe06598f1141fdc337b10046f41f0e227c.
[Bug #19784] Fix behaviors against prefix with broken encoding
- String#start_with?
- String#delete_prefix
- String#delete_prefix!
Updated by byroot (Jean Boussier) over 1 year ago
Note that we have the same issue with end_with?
and delete_suffix
.
Updated by jhawthorn (John Hawthorn) over 1 year ago
@ywenc and I found a regression from this patch. We have some code handling a broken UTF-8 String with a combination of valid and invalid bytes (UTF-8 followed by binary, which IMO should probably be binary encoded, but it's surprising that the behaviour changed).
"hello\xBE".start_with?("hello") #=> false in trunk, was true on 3.2
"hello\xFE".start_with?("hello") #=> true (both 3.2 and trunk, intended behaviour)
"hello\xBE".delete_prefix("hello") => "\xBE" (both on 3.2 and trunk), because we skip the check when the prefix is valid
"\xFFhello\xBE".delete_prefix("\xFFhello") => "\xFFhello\xBE" in trunk, "\xBE" on 3.2
This is because we're looking at character following the prefix, observing that it looks like a UTF-8 continuation byte, and so returns false.
This approach might work for ends_with?/delete_suffix, where we don't break on an invalid character in the suffix, but doesn't feel right for prefixes. It sounds like the intended design is that to the user this should feel like we were comparing from the start of the strings char-by-char for valid and byte-by-byte for invalid.
We added tests and tried using the end of the previous character, rather than the "start" of the current, to determine if the prefix ends at a char boundary. https://github.com/ruby/ruby/pull/8348