Feature #9111: Encoding-free String comparison - Ruby - Ruby Issue Tracking System

Added by sawa (Tsuyoshi Sawada) over 12 years ago. Updated over 12 years ago.

Status: Open

Assignee:

Target version:

[ruby-core:58337]

Description

=begin
Currently, strings with the same content but with different encodings count as different strings. This causes strange behaviour as below (noted in StackOverflow question http://stackoverflow.com/questions/19977788/strange-behavior-in-packed-ruby-strings#19978206):

[128].pack("C")             # => "\x80"
[128].pack("C") == "\x80"   # => false

Since [128].pack("C") has the encoding ASCII-8BIT and "\x80" (by default) has the encoding UTF-8, the two strings are not equal.
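
The mismatch described above can be inspected directly. The following is a minimal sketch, assuming Ruby 2.0 or later (UTF-8 default source encoding, `String#b` available); `comparison_report` is a hypothetical helper name used only for illustration, not part of Ruby.

```ruby
# Hypothetical helper: summarize why two strings do or do not compare equal.
def comparison_report(a, b)
  {
    encodings: [a.encoding, b.encoding],  # encoding tag of each string
    same_bytes: a.bytes == b.bytes,       # raw byte sequences, ignoring encoding
    equal: a == b,                        # what String#== actually reports
    equal_as_binary: a.b == b.b           # comparison after retagging copies as binary
  }
end

packed  = [128].pack("C")  # ASCII-8BIT (binary) string holding the byte 0x80
literal = "\x80"           # UTF-8 tagged literal under the default source encoding

p comparison_report(packed, literal)
```

Both strings hold the same single byte; only the encoding tag differs, and `String#b` returns a binary-tagged copy without modifying the receiver.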

Also, ordering strings with different encodings via String#<=> may give a confusing, unintended result.

I suggest that the comparison String#<=> should not be based on the respective encoding of the strings, but all the strings should be internally converted to UTF-8 for the purpose of comparison.

=end

Related issues 1 (0 open — 1 closed)

Updated by nobu (Nobuyoshi Nakada) over 12 years ago
#1 [ruby-core:58338]

sawa (Tsuyoshi Sawada) wrote:

I suggest that the comparison String#<=> should not be based on the respective encoding of the strings, but all the strings should be internally converted to UTF-8 for the purpose of comparison.

Always converting all strings to UTF-8 is unacceptable; the change should be restricted to comparisons with an ASCII-8BIT string.
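
One possible reading of this restriction, sketched as user-level code rather than as a change to String itself: fall back to a byte-wise comparison only when at least one side is ASCII-8BIT. The name `binary_aware_eql?` is hypothetical, and this is an interpretation of the comment, not a confirmed design.

```ruby
# Hypothetical helper, not part of Ruby: compare byte-wise only when
# one of the operands is tagged ASCII-8BIT (Encoding::BINARY).
def binary_aware_eql?(a, b)
  if a.encoding == Encoding::BINARY || b.encoding == Encoding::BINARY
    a.b == b.b   # String#b returns binary-tagged copies; originals are untouched
  else
    a == b       # ordinary encoding-aware comparison
  end
end

p binary_aware_eql?([128].pack("C"), "\x80")
```

Strings in two different non-binary encodings still go through the ordinary `==`, so this sketch does not convert anything to UTF-8.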

Updated by sawa (Tsuyoshi Sawada) over 12 years ago
#2 [ruby-core:58339]

Following nobu's suggestion, I came up with several possibilities:

When two strings with different encodings are to be compared by String#<=>, one of the following options should be taken:

Emit a warning
Raise an error
Convert one of the strings to the encoding of the other

I am not sure which option would be best, but I feel the behaviour should not be left as it is now.
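
The three options can be compared side by side in a small sketch. `compare_with_policy` and its `policy` symbols are hypothetical names chosen for illustration; the sketch assumes `Encoding.compatible?` is an acceptable test for "different encodings that actually matter", which is an assumption and not something the proposal specifies.

```ruby
# Hypothetical helper, not part of Ruby: apply one of the three proposed
# behaviours when the operands have incompatible encodings.
def compare_with_policy(a, b, policy)
  # ASCII-only strings, or strings in the same encoding, compare as usual.
  return a <=> b if Encoding.compatible?(a, b)

  case policy
  when :warn
    warn "comparing #{a.encoding} string with #{b.encoding} string"
    a <=> b
  when :raise
    raise Encoding::CompatibilityError,
          "incompatible encodings: #{a.encoding} and #{b.encoding}"
  when :convert
    # May itself raise if b has no valid conversion to a's encoding.
    a <=> b.encode(a.encoding)
  else
    raise ArgumentError, "unknown policy: #{policy.inspect}"
  end
end

utf8   = "\u00E9"                   # e with acute accent, UTF-8
latin1 = utf8.encode("ISO-8859-1")  # same character, different bytes
p compare_with_policy(utf8, latin1, :convert)
```

Note that the `:convert` branch can fail for binary data such as `[128].pack("C")`, since ASCII-8BIT bytes above 0x7F have no defined conversion to UTF-8.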

Updated by Hanmac (Hans Mackowiak) over 12 years ago
#3 [ruby-core:58343]

What about strings with the same encoding but different content that render the same?
For example, "â" can somehow be made from "a" + "^"; should they also be treated as equal?

Updated by sawa (Tsuyoshi Sawada) over 12 years ago
#4 [ruby-core:58354]

Hanmac: "â" can be made from "a" + "^"

Treating them as the same goes too far, I think. There are various notations for such characters; for example, â is written differently in TeX. They should be treated as different strings.

Updated by Hanmac (Hans Mackowiak) over 12 years ago
#5 [ruby-core:58364]

I found the Wikipedia source: http://en.wikipedia.org/wiki/Combining_character
It's not about treating "^a" or "a^" the same as "â"; rather, there is a way to glue the characters together.

I think that's also a reason for http://api.rubyonrails.org/classes/String.html#method-i-mb_chars ?

I found another interesting gem: http://rubygems.org/gems/unicode_utils
With it, it is also possible to do something like this: "ä".upcase => "Ä"

There is another page about combining characters: http://sbp.so/supercombiner
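
To make the combining-character point concrete, here is a small sketch showing that the precomposed character and the base-plus-combining sequence are different strings at the code point level, even though both are UTF-8. `hex_codepoints` and the two constants are hypothetical names for illustration.

```ruby
# Hypothetical helper: list a string's code points in U+XXXX notation.
def hex_codepoints(str)
  str.codepoints.map { |cp| format("U+%04X", cp) }
end

PRECOMPOSED = "\u00E2"   # LATIN SMALL LETTER A WITH CIRCUMFLEX
COMBINED    = "a\u0302"  # "a" followed by COMBINING CIRCUMFLEX ACCENT

p hex_codepoints(PRECOMPOSED)
p hex_codepoints(COMBINED)
p PRECOMPOSED == COMBINED
```

The combining mark is U+0302, not the ASCII caret "^" (U+005E); "a" + "^" is simply the two-character string "a^".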

Updated by naruse (Yui NARUSE) over 12 years ago
#6 [ruby-core:58459]

Hanmac (Hans Mackowiak) wrote:

What about strings with the same encoding but different content that render the same?
For example, "â" can somehow be made from "a" + "^"; should they also be treated as equal?

The standard practice is to compare the strings after normalization: NFD("â") == NFD("a" + "^").
You can use a library to apply NFD.
See also http://bibwild.wordpress.com/2013/11/19/benchmarking-ruby-unicode-normalization-alternatives/
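
As a sketch of comparison under NFD: Ruby 2.2 and later ship `String#unicode_normalize` (the feature tracked in the related issue #10084), so no external library is needed on those versions. The helper name `canonically_equal?` is hypothetical, and the sketch assumes both strings are in a Unicode encoding such as UTF-8.

```ruby
# Hypothetical helper: compare two Unicode strings after NFD normalization.
# Requires Ruby 2.2+ for String#unicode_normalize.
def canonically_equal?(a, b)
  a.unicode_normalize(:nfd) == b.unicode_normalize(:nfd)
end

precomposed = "\u00E2"   # single precomposed code point
combined    = "a\u0302"  # base letter plus combining circumflex

p canonically_equal?(precomposed, combined)
p canonically_equal?(precomposed, "a^")
```

Here "^" in the comment above is taken to mean the combining circumflex U+0302; with a literal ASCII caret the two strings remain different even after normalization.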

Updated by duerst (Martin Dürst) almost 12 years ago
#7 [ruby-core:63961]

Related to Feature #10084: Add Unicode String Normalization to String class added
