Bug #10097: Case-insensitive Regexp matching for Windows-1252 not working for ŠšŽžŒœÿŸ - Ruby - Ruby Issue Tracking System

Added by duerst (Martin Dürst) almost 11 years ago. Updated over 9 years ago.

Status: Closed
Assignee: -
Target version: -
ruby -v: 1.9.3p545
Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN

[ruby-core:64049]

Description

By chance I had a look at enc/iso_8859_1.c and found

ENC_REPLICATE("Windows-1252", "ISO-8859-1")

on line 288. But this does not work for case folding:

# http://en.wikipedia.org/wiki/Windows-1252
s1 = "\u0160".encode 'windows-1252' # 'Š'
r1 = Regexp.new("\u0161".encode('windows-1252'), Regexp::IGNORECASE) # /š/i
s1 =~ r1
   # => nil
s2 = "\u0178".encode 'windows-1252' # 'Ÿ'
r2 = Regexp.new("\u00FF".encode('windows-1252'), Regexp::IGNORECASE) # /ÿ/i
s2 =~ r2
   # => nil
s3 = "\u00C0".encode 'windows-1252' # 'À'
r3 = Regexp.new("\u00E0".encode('windows-1252'), Regexp::IGNORECASE) # /à/i
s3 =~ r3
   # => 0

So case-insensitive matching works when both characters are in ISO-8859-1, but not when one (ÿŸ) or both (ŠšŽžŒœ) characters are outside ISO-8859-1.
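
For reference, the affected pairs can be enumerated rather than listed by hand. The following is a minimal sketch, not part of the original report: it assumes Ruby 2.4 or later (where String#downcase applies Unicode case mapping to UTF-8 strings), and the helper name cp1252_extra_case_pairs is hypothetical. It walks the bytes 0x80..0x9F, where Windows-1252 differs from ISO-8859-1, and keeps the uppercase letters whose lowercase form is also representable in Windows-1252.

```ruby
# Hypothetical helper (not a Ruby API): find uppercase letters in the
# 0x80..0x9F range of Windows-1252 whose lowercase form also exists in
# Windows-1252. Assumes Ruby >= 2.4 for Unicode-aware String#downcase.
def cp1252_extra_case_pairs
  pairs = []
  (0x80..0x9F).each do |byte|
    begin
      # Interpret the byte as Windows-1252, then convert to UTF-8.
      upper = [byte].pack('C').force_encoding('windows-1252').encode('UTF-8')
      lower = upper.downcase
      next if lower == upper        # not an uppercase letter
      lower.encode('windows-1252')  # raises if not representable
      pairs << [upper, lower]
    rescue EncodingError
      next                          # unassigned byte or unmappable result
    end
  end
  pairs
end

# Print each pair together with its Windows-1252 byte values.
cp1252_extra_case_pairs.each do |upper, lower|
  printf("%s (0x%02X) <-> %s (0x%02X)\n",
         upper, upper.encode('windows-1252').ord,
         lower, lower.encode('windows-1252').ord)
end
```

The pairs found this way are the ones a Windows-1252 case-folding table would need in addition to what ISO-8859-1 already covers.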

#1 [ruby-core:64071]

Updated by nobu (Nobuyoshi Nakada) almost 11 years ago

Description updated (diff)

Is this correct?
https://github.com/nobu/ruby/compare/windows-1252

#2 [ruby-core:64072]

Updated by duerst (Martin Dürst) almost 11 years ago

Nobuyoshi Nakada wrote:

> Is this correct?
> https://github.com/nobu/ruby/compare/windows-1252

Thanks a lot for this very quick work!

Unfortunately, it's not correct. I haven't checked everything, but at least cp1252_get_case_fold_codes_by_str doesn't deal with the special cases in get_case_fold_codes_by_str for ss/SS/ß.
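
To illustrate why ß needs special handling: its case-folded form is the two-character string "ss", so a one-byte-to-one-byte lookup table cannot express it. The sketch below is not the code in enc/iso_8859_1.c. It uses String#downcase(:fold) on UTF-8 strings, which assumes Ruby 2.4 or later, and the helper name same_under_case_folding? is hypothetical.

```ruby
# Hypothetical helper (not a Ruby API): compare two UTF-8 strings after
# Unicode case folding. Assumes Ruby >= 2.4 for String#downcase(:fold).
def same_under_case_folding?(a, b)
  a.downcase(:fold) == b.downcase(:fold)
end

p "\u00DF".downcase(:fold)                        # ß folds to "ss"
p same_under_case_folding?("\u00DF", "SS")        # => true
p same_under_case_folding?("\u0160", "\u0161")    # Š vs š => true
```

A one-to-one pair such as Š/š fits a simple table, while ß/ss/SS requires a fold that changes the string length.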

I suggest that we do some more exploratory work before addressing this bug directly.

First, I suspect that other encodings (windows-12xx, ...) have very similar problems.

Second, I found this bug because I was trying to find out what information the encoding primitives already provide for case folding and case conversion.

I have only just started that, but depending on what I/we find, we may want/need to:

1) use this information and be done;
2) use this information and add some more information separately;
3) change this information (e.g. add or change some primitives) so that it covers all the needs for case conversion;
4) provide the information for case conversion completely separately.

I suggest that we hold off on fixing this bug until we are able to rule out choice 3).

If there is a (short, up-to-date) summary of what each of the encoding primitives does, that would help me a lot (Japanese would be okay).

#3 [ruby-core:64093]