Feature #9111
open
Encoding-free String comparison
Added by sawa (Tsuyoshi Sawada) about 11 years ago.
Updated about 11 years ago.
Description
=begin
Currently, strings with the same content but different encodings count as different strings. This causes strange behaviour, as shown below (noted in the Stack Overflow question http://stackoverflow.com/questions/19977788/strange-behavior-in-packed-ruby-strings#19978206):
[128].pack("C") # => "\x80"
[128].pack("C") == "\x80" # => false
Since [128].pack("C") has the encoding ASCII-8BIT and "\x80" (by default) has the encoding UTF-8, the two strings are not equal.
Also, comparison of strings with different encodings may end up with a messy, unintended result.
I suggest that the comparison String#<=> should not be based on the respective encoding of the strings, but all the strings should be internally converted to UTF-8 for the purpose of comparison.
=end
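The mismatch described above can be reproduced with a short script. This is a minimal sketch, assuming a Ruby version where string literals default to UTF-8 (2.0 or later); the variable names are illustrative only and do not come from the report.

```ruby
# Two strings holding the same single byte, tagged with different encodings.
packed  = [128].pack("C")   # Array#pack with "C" yields an ASCII-8BIT (binary) string
literal = "\x80"            # a literal in UTF-8 source is tagged UTF-8

packed_encoding  = packed.encoding    # Encoding::ASCII_8BIT
literal_encoding = literal.encoding   # Encoding::UTF_8

# The raw bytes are identical...
same_bytes = packed.bytes == literal.bytes

# ...but String#== also takes the encodings into account.
equal_as_is = packed == literal

# String#<=> likewise does not report the two strings as equal.
cmp_as_is = packed <=> literal

# Retagging a copy of the literal as binary (bytes untouched) makes them equal.
equal_after_retag = packed == literal.dup.force_encoding(Encoding::ASCII_8BIT)

puts "same bytes:        #{same_bytes}"
puts "equal as is:       #{equal_as_is}"
puts "equal after retag: #{equal_after_retag}"
```

Note that `force_encoding` only changes the encoding tag; it does not transcode the bytes, which is why it is applied to a copy here.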
sawa (Tsuyoshi Sawada) wrote:
I suggest that the comparison String#<=> should not be based on the respective encoding of the strings, but all the strings should be internally converted to UTF-8 for the purpose of comparison.
Always converting all strings to UTF-8 is unacceptable; the conversion should be restricted to comparison with an ASCII-8BIT string.
Following nobu's suggestion, I came up with several possibilities:
When two strings with different encodings are compared by String#<=>, one of the following options should be taken:
- Emit a warning
- Raise an error
- Convert one of the strings to the encoding of the other.
I am not sure which option would be best, but I feel the behaviour should not be left as it is now.
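The three options listed above could be sketched as follows. This is only an illustration: `compare_across_encodings` and its `on_mismatch:` keyword are hypothetical names, not part of Ruby or of this proposal, and "convert" is assumed here to mean retagging the bytes with `force_encoding` rather than transcoding them.

```ruby
# Hypothetical helper (not a Ruby API) illustrating the three options.
def compare_across_encodings(a, b, on_mismatch: :convert)
  # Same encoding: nothing special to do.
  return a <=> b if a.encoding == b.encoding

  case on_mismatch
  when :warn
    # Option 1: warn, then fall back to the current behaviour.
    warn "comparing #{a.encoding} string with #{b.encoding} string"
    a <=> b
  when :raise
    # Option 2: refuse to compare.
    raise Encoding::CompatibilityError,
          "incompatible encodings: #{a.encoding} and #{b.encoding}"
  when :convert
    # Option 3 (assumption): treat b's bytes as if they were in a's encoding.
    a <=> b.dup.force_encoding(a.encoding)
  else
    raise ArgumentError, "unknown option: #{on_mismatch.inspect}"
  end
end

binary = [128].pack("C")
utf8   = "\x80"
puts compare_across_encodings(binary, utf8, on_mismatch: :convert)
```

Retagging is chosen over `String#encode` in this sketch because transcoding arbitrary binary bytes to UTF-8 can itself raise an error; which behaviour the proposal intends is left open in the thread.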
What about strings with the same encoding and different content that are rendered the same?
For example, "â" can be made from "a" + "^" somehow; should they also be treated as equal?
Hanmac: "â" can be made from "a" + "^"
Treating them the same is too much, I think. There are various marking methods. For example, â would have a different marking in TeX. Assuming them equal is going too far. They should be treated differently.
- Related to Feature #10084: Add Unicode String Normalization to String class added
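The composed-character case raised in the thread can be illustrated with `String#unicode_normalize`, which is available in Ruby 2.2 and later. This is a sketch under the assumption that the "a" + "^" example refers to a base letter followed by a Unicode combining circumflex (U+0302); the variable names are illustrative only.

```ruby
# Two ways to write "â" in the same encoding (UTF-8).
precomposed = "\u00E2"    # single code point: LATIN SMALL LETTER A WITH CIRCUMFLEX
decomposed  = "a\u0302"   # "a" followed by COMBINING CIRCUMFLEX ACCENT

# Plain comparison works on the code points, so the strings differ.
equal_raw = precomposed == decomposed

# Explicit normalization to NFC composes the pair into one code point.
equal_nfc = precomposed == decomposed.unicode_normalize(:nfc)

puts "equal without normalization: #{equal_raw}"
puts "equal after NFC:             #{equal_nfc}"
```

Keeping normalization as an explicit step leaves the plain comparison strict, which matches the position in the thread that such strings should not be treated as equal by default.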