Feature #9111
Open
Encoding-free String comparison
Description
=begin
Currently, strings with the same content but different encodings count as different strings. This causes surprising behaviour, as shown below (noted in the Stack Overflow question http://stackoverflow.com/questions/19977788/strange-behavior-in-packed-ruby-strings#19978206):
[128].pack("C") # => "\x80"
[128].pack("C") == "\x80" # => false
Since [128].pack("C") has the encoding ASCII-8BIT and "\x80" (by default) has the encoding UTF-8, the two strings are not equal.
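A minimal Ruby sketch of the mismatch described above, assuming a UTF-8 source file. `same_bytes?` is a hypothetical helper introduced only for illustration; it is not part of Ruby or of this proposal.

```ruby
# Assumes the source file is UTF-8 (the default since Ruby 2.0).

# Hypothetical helper: compare two strings byte for byte, ignoring their
# encoding tags. String#b returns a copy tagged as ASCII-8BIT.
def same_bytes?(a, b)
  a.b == b.b
end

packed  = [128].pack("C")  # tagged ASCII-8BIT
literal = "\x80"           # tagged UTF-8 (and not valid UTF-8)

p packed.encoding == literal.encoding  # false
p packed.bytes == literal.bytes        # true: both hold the single byte 0x80
p packed == literal                    # false: same byte, different encodings
p same_bytes?(packed, literal)         # true
```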
Also, comparison of strings with different encodings may end up with a messy, unintended result.
I suggest that the comparison String#<=> should not be based on the respective encoding of the strings, but all the strings should be internally converted to UTF-8 for the purpose of comparison.
=end
Updated by nobu (Nobuyoshi Nakada) about 11 years ago
sawa (Tsuyoshi Sawada) wrote:
I suggest that the comparison String#<=> should not be based on the respective encoding of the strings, but all the strings should be internally converted to UTF-8 for the purpose of comparison.
It's unacceptable to always convert all strings to UTF-8; this should be restricted to comparison with an ASCII-8BIT string.
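One possible reading of this restriction, sketched in Ruby: fall back to a byte comparison only when at least one operand is tagged ASCII-8BIT, and otherwise keep the existing behaviour. `binary_aware_eq?` is a hypothetical name, and this interpretation is an assumption rather than something spelled out in the comment above.

```ruby
# Hypothetical helper, not part of Ruby: compare bytes only when one side
# is a binary (ASCII-8BIT) string; otherwise defer to String#==.
def binary_aware_eq?(a, b)
  if a.encoding == Encoding::BINARY || b.encoding == Encoding::BINARY
    a.b == b.b   # String#b returns an ASCII-8BIT copy of the string
  else
    a == b       # unchanged behaviour for all other encoding pairs
  end
end

p binary_aware_eq?([128].pack("C"), "\x80")            # true
p binary_aware_eq?("é", "é".encode("ISO-8859-1"))      # false
```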
Updated by sawa (Tsuyoshi Sawada) about 11 years ago
Following nobu's suggestion, I came up with the following several possibilities:
When two strings with different encodings are compared by String#<=>, one of the following options should be taken:
- Emit a warning
- Raise an error
- Convert one of the strings to the encoding of the other
I am not sure which option would be best, but I feel the behaviour should not be left as it is now.
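The three options listed above can be sketched as a single helper with a selectable policy. `compare_with_policy` and its `policy:` keyword are hypothetical names used only for illustration; this is not how String#<=> is implemented.

```ruby
# Hypothetical helper, not part of Ruby. Applies one of the three proposed
# policies when the operands carry different encodings.
def compare_with_policy(a, b, policy: :convert)
  return a <=> b if a.encoding == b.encoding

  case policy
  when :warn
    # Option 1: emit a warning, then compare as Ruby does today.
    warn "comparing #{a.encoding} string with #{b.encoding} string"
    a <=> b
  when :raise
    # Option 2: refuse to compare.
    raise Encoding::CompatibilityError,
          "cannot compare #{a.encoding} string with #{b.encoding} string"
  when :convert
    # Option 3: transcode b to a's encoding first. String#encode raises
    # if a character cannot be converted (for example, high bytes in an
    # ASCII-8BIT string).
    a <=> b.encode(a.encoding)
  else
    raise ArgumentError, "unknown policy: #{policy.inspect}"
  end
end

utf8   = "é"
latin1 = "é".encode("ISO-8859-1")

p utf8 == latin1                                      # false
p compare_with_policy(utf8, latin1, policy: :convert) # 0
```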
Updated by Hanmac (Hans Mackowiak) about 11 years ago
What about strings with the same encoding but different content that are rendered the same?
For example, "â" can be made from "a" + "^" somehow; should they also be treated as equal?
Updated by sawa (Tsuyoshi Sawada) about 11 years ago
Hanmac: "â" can be made from "a" + "^"
Treating them the same is too much, I think. There are various notations. For example, â would be written differently in TeX. Assuming they are equal goes too far; they should be treated as different.
Updated by Hanmac (Hans Mackowiak) about 11 years ago
I found the Wikipedia source: http://en.wikipedia.org/wiki/Combining_character
It's not about treating "^a" or "a^" the same as "â", but there is a way to glue the characters together.
I think that's also a reason for http://api.rubyonrails.org/classes/String.html#method-i-mb_chars ?
I found another interesting gem: http://rubygems.org/gems/unicode_utils
With it, it is also possible to do something like this: "ä".upcase => "Ä"
There is another page about combining characters: http://sbp.so/supercombiner
Updated by naruse (Yui NARUSE) about 11 years ago
Hanmac (Hans Mackowiak) wrote:
What about strings with the same encoding but different content that are rendered the same?
For example, "â" can be made from "a" + "^" somehow; should they also be treated as equal?
The standard practice is to compare NFD("â") == NFD("a" + "^").
To apply NFD, you can use one of several libraries.
See also http://bibwild.wordpress.com/2013/11/19/benchmarking-ruby-unicode-normalization-alternatives/
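A minimal sketch of NFD-based comparison, assuming Ruby 2.2 or later, where String#unicode_normalize is available in the standard library. `nfd_equal?` is a hypothetical helper name. Note that the combining character here is U+0302 (COMBINING CIRCUMFLEX ACCENT), not the ASCII "^" (U+005E), which does not combine with "a" under normalization.

```ruby
# Hypothetical helper, not part of Ruby: compare two strings after
# normalizing both to NFD (canonical decomposition).
def nfd_equal?(a, b)
  a.unicode_normalize(:nfd) == b.unicode_normalize(:nfd)
end

precomposed = "\u00E2"   # "â" as a single code point
combined    = "a\u0302"  # "a" followed by COMBINING CIRCUMFLEX ACCENT

p precomposed == combined            # false: different code point sequences
p nfd_equal?(precomposed, combined)  # true
p nfd_equal?(precomposed, "a^")      # false: ASCII "^" is not a combining mark
```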
Updated by duerst (Martin Dürst) over 10 years ago
- Related to Feature #10084: Add Unicode String Normalization to String class added