Encoding-free String comparison
Currently, strings with the same content but with different encodings count as different strings. This causes strange behaviour as below (noted in StackOverflow question http://stackoverflow.com/questions/19977788/strange-behavior-in-packed-ruby-strings#19978206):
[128].pack("C") # => "\x80"
[128].pack("C") == "\x80" # => false
Since [128].pack("C") has the encoding ASCII-8BIT and "\x80" (by default) has the encoding UTF-8, the two strings are not equal.
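The mismatch can be reproduced with a short script. This is only a sketch: the variable names are made up for illustration, and the literal is tagged as UTF-8 explicitly so that the result does not depend on the source file's encoding.

```ruby
packed  = [128].pack("C")                          # one byte, 0x80, tagged ASCII-8BIT
literal = "\x80".dup.force_encoding("UTF-8")       # what a plain "\x80" literal is in a UTF-8 source file

same_bytes    = packed.bytes == literal.bytes      # both hold the single byte 0x80
equal_as_is   = packed == literal                  # encodings differ and the byte is non-ASCII
equal_as_bin  = packed == literal.b                # String#b returns an ASCII-8BIT copy

puts "same bytes:          #{same_bytes}"
puts "equal as is:         #{equal_as_is}"
puts "equal as binary:     #{equal_as_bin}"
```

Comparing `bytes`, or comparing against `String#b`, makes the intent ("compare the raw bytes") explicit without changing how `==` behaves.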
Also, comparison of strings with different encodings may end up with a messy, unintended result.
I suggest that the comparison String#<=> should not be based on the respective encoding of the strings, but all the strings should be internally converted to UTF-8 for the purpose of comparison.
#1 [ruby-core:58338] Updated by nobu (Nobuyoshi Nakada) almost 4 years ago
sawa (Tsuyoshi Sawada) wrote:
I suggest that the comparison String#<=> should not be based on the respective encoding of the strings, but all the strings should be internally converted to UTF-8 for the purpose of comparison.
It's unacceptable to always convert all strings to UTF-8; any change should be restricted to comparison with an ASCII-8BIT string.
#2 [ruby-core:58339] Updated by sawa (Tsuyoshi Sawada) almost 4 years ago
Following nobu's suggestion, I came up with the following several possibilities:
When two strings with different encodings are compared by String#<=>, one of the following options should be taken:
- Issue a warning
- Raise an error
- Convert one of the strings to the encoding of the other.
I am not sure which option would be the best, but I feel the behaviour should not be left as it is now.
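To make the three options concrete, here is a rough sketch. `compare_with_policy` and its `policy` argument are hypothetical names invented for this illustration; nothing like it exists in Ruby, and it is not a proposal for how String#<=> itself would be implemented.

```ruby
# Hypothetical helper, not part of Ruby: compares two strings and applies one
# of the three policies listed above when their encodings differ.
def compare_with_policy(a, b, policy)
  if a.encoding != b.encoding
    case policy
    when :warn
      # Option 1: report the mismatch on stderr, then compare as usual.
      warn "comparing #{a.encoding} string with #{b.encoding} string"
    when :raise
      # Option 2: refuse to compare at all.
      raise Encoding::CompatibilityError,
            "cannot compare #{a.encoding} with #{b.encoding}"
    when :convert
      # Option 3: transcode b into a's encoding. String#encode itself raises
      # if b cannot be represented in (or converted to) that encoding.
      b = b.encode(a.encoding)
    else
      raise ArgumentError, "unknown policy: #{policy.inspect}"
    end
  end
  a <=> b
end

utf8   = "\u00E9"                   # "é" encoded as UTF-8
latin1 = utf8.encode("ISO-8859-1")  # the same character, different bytes

plain_equal    = utf8 == latin1
converted_cmp  = compare_with_policy(utf8, latin1, :convert)

puts "plain ==:          #{plain_equal}"
puts "<=> after convert: #{converted_cmp}"
```

The conversion option only works when one encoding can be transcoded to the other, which is not the case for arbitrary ASCII-8BIT data; that limitation is part of why the choice between the options is not obvious.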
#4 [ruby-core:58354] Updated by sawa (Tsuyoshi Sawada) almost 4 years ago
Hanmac: "â" can be made from "a" + "\u0302"
Treating them as the same is going too far, I think. There are various ways of marking an accent. For example, â would be marked differently in TeX. Assuming all of these to be equal is too much; they should be treated as different.
#5 [ruby-core:58364] Updated by Hanmac (Hans Mackowiak) almost 4 years ago
I found the Wikipedia source: http://en.wikipedia.org/wiki/Combining_character
It's not about treating "a" or "\u0302" the same as "â", but there is a way to glue the characters together.
I think that's also a reason for http://api.rubyonrails.org/classes/String.html#method-i-mb_chars ?
I also found another interesting gem: http://rubygems.org/gems/unicode_utils
With that it is also possible to do something like this: "ä".upcase => "Ä"
There is another page about combining characters: http://sbp.so/supercombiner
#6 [ruby-core:58459] Updated by naruse (Yui NARUSE) almost 4 years ago
Hanmac (Hans Mackowiak) wrote:
What about strings with the same encoding but different content that are rendered the same?
For example, "â" can be made from "a" + "\u0302" somehow; should they also be treated as equal?
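The standard practice is to compare NFD("â") == NFD("a" + "\u0302"), where NFD is Unicode Normalization Form D (canonical decomposition).
To apply NFD, you can use one of several libraries.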
see also http://bibwild.wordpress.com/2013/11/19/benchmarking-ruby-unicode-normalization-alternatives/
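As an illustration of the NFD comparison: Ruby 2.2 and later ship String#unicode_normalize in the standard library, so on those versions the check can be sketched without an external gem. The variable names below are made up for the example, and the libraries benchmarked in the link above may expose different APIs.

```ruby
composed   = "\u00E2"    # "â" as the single code point U+00E2
decomposed = "a\u0302"   # "a" followed by U+0302 COMBINING CIRCUMFLEX ACCENT

raw_equal = composed == decomposed
nfd_equal = composed.unicode_normalize(:nfd) == decomposed.unicode_normalize(:nfd)

puts "raw ==:       #{raw_equal}"
puts "NFD then ==:  #{nfd_equal}"
```

Both strings are UTF-8 here, so the difference is purely one of Unicode normalization and is independent of the encoding question raised at the top of the thread.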