Bug #10097

Case-insensitive Regexp matching for Windows-1252 not working for ŠšŽžŒœÿŸ

Added by Martin Dürst about 1 year ago. Updated 12 months ago.

ruby -v: 1.9.3p545
Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN


By chance I had a look at enc/iso_8859_1.c and found

ENC_REPLICATE("Windows-1252", "ISO-8859-1")

on line 288. But this does not work for case folding:

# http://en.wikipedia.org/wiki/Windows-1252
s1 = "\u0160".encode 'windows-1252' # 'Š'
r1 = Regexp.new("\u0161".encode('windows-1252'), Regexp::IGNORECASE) # /š/i
s1 =~ r1
   # => nil
s2 = "\u0178".encode 'windows-1252' # 'Ÿ'
r2 = Regexp.new("\u00FF".encode('windows-1252'), Regexp::IGNORECASE) # /ÿ/i
s2 =~ r2
   # => nil
s3 = "\u00C0".encode 'windows-1252' # 'À'
r3 = Regexp.new("\u00E0".encode('windows-1252'), Regexp::IGNORECASE) # /à/i
s3 =~ r3
   # => 0

So case-insensitive matching works when both characters are in iso-8859-1, but not when one (ÿŸ) or both (ŠšŽžŒœ) characters are not in iso-8859-1.
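
For anyone wanting to reproduce this across all four affected pairs, the following is a minimal sketch rather than a definitive test. The helper names `cp1252_byte` and `folds?` are made up for illustration and are not part of Ruby. The byte values follow from the Windows-1252 code page. The match result for the non-Latin-1 pairs depends on the Ruby version in use, so no output is shown for them.

```ruby
# encoding: utf-8

# Upper/lower pairs from the report, written as Unicode escapes.
PAIRS = {
  "\u0160" => "\u0161", # S with caron
  "\u017D" => "\u017E", # Z with caron
  "\u0152" => "\u0153", # OE ligature
  "\u0178" => "\u00FF", # Y with diaeresis (only the lowercase is in ISO-8859-1)
  "\u00C0" => "\u00E0", # A with grave (control: both are in ISO-8859-1)
}

# Hypothetical helper: the single Windows-1252 byte for a character.
def cp1252_byte(char)
  char.encode("Windows-1252").bytes.first
end

# Hypothetical helper: true if /lower/i matches upper in Windows-1252.
def folds?(upper, lower)
  re = Regexp.new(lower.encode("Windows-1252"), Regexp::IGNORECASE)
  !(upper.encode("Windows-1252") =~ re).nil?
end

PAIRS.each do |upper, lower|
  printf("0x%02X / 0x%02X  folds: %s\n",
         cp1252_byte(upper), cp1252_byte(lower), folds?(upper, lower))
end
```

The pairs that fail on 1.9.3 sit at 0x8A/0x9A, 0x8E/0x9E, 0x8C/0x9C and 0x9F/0xFF, that is, wherever the uppercase byte lies in the 0x80-0x9F range that ISO-8859-1 reserves for control characters.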


#1 Updated by Nobuyoshi Nakada about 1 year ago

  • Description updated

#2 Updated by Martin Dürst about 1 year ago

Nobuyoshi Nakada wrote:

Is this correct?

Thanks a lot for this very quick work!

Unfortunately, it's not correct. I haven't checked everything, but at least cp1252_get_case_fold_codes_by_str doesn't deal with the special cases in get_case_fold_codes_by_str for ss/SS/ß.
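
A quick way to compare the two encodings on the ß special case might look like the sketch below. `sharp_s_fold_results` is a hypothetical helper name, and the results depend on the Ruby version and on whether the patch is applied, so none are listed here.

```ruby
# Hypothetical helper: match "ss" and "SS" against /ß/i in the given encoding.
# Returns [text, match position or nil] pairs.
def sharp_s_fold_results(encoding)
  sharp_s = "\u00DF".encode(encoding) # U+00DF, byte 0xDF in both encodings
  pattern = Regexp.new(sharp_s, Regexp::IGNORECASE)
  %w[ss SS].map { |text| [text, text.encode(encoding) =~ pattern] }
end

%w[ISO-8859-1 Windows-1252].each do |encoding|
  sharp_s_fold_results(encoding).each do |text, pos|
    puts "#{encoding}: #{text.inspect} =~ /sharp-s/i  # => #{pos.inspect}"
  end
end
```

If the two encodings print different results, the Windows-1252 fold functions have diverged from the ISO-8859-1 ones on this case.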

I suggest that we do some more exploratory work before addressing this bug directly.

First, I suspect that other (windows-12xx,...) encodings have very similar problems.
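
To check that suspicion, something like the following probe could be run against a few of the other code pages. This is only a sketch: `case_insensitive_match?` is a made-up helper name, the sample characters are my own choice (one cased letter pair per encoding), and I have not listed results because they depend on the Ruby version.

```ruby
# Hypothetical probe: does /lower/i match upper in the given encoding?
def case_insensitive_match?(encoding, upper, lower)
  re = Regexp.new(lower.encode(encoding), Regexp::IGNORECASE)
  !(upper.encode(encoding) =~ re).nil?
end

# One upper/lower pair per encoding, as Unicode escapes (assumed samples).
SAMPLES = [
  ["Windows-1250", "\u0160", "\u0161"], # Latin S with caron
  ["Windows-1251", "\u0410", "\u0430"], # Cyrillic A
  ["Windows-1253", "\u0391", "\u03B1"], # Greek alpha
  ["Windows-1254", "\u015E", "\u015F"], # Latin S with cedilla
]

SAMPLES.each do |encoding, upper, lower|
  puts "#{encoding}: #{case_insensitive_match?(encoding, upper, lower)}"
end
```

Any line printing `false` would point to the same kind of gap as the one reported here for Windows-1252.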

Second, I found this bug because I was trying to find out what information the encoding primitives already provide for case folding and case conversion.

I have only just started that, but depending on what I/we find, we may want/need to:

1) use this information and be done;
2) use this information and add some more information separately;
3) change this information (e.g. add or change some primitives) so that it covers all the needs for case conversion;
4) provide the information for case conversion completely separately.

I suggest that we hold off on fixing this bug until we are able to rule out choice 3).

If there is a (short, up-to-date) summary of what each of the encoding primitives does, that would help me a lot (Japanese would be okay).

#3 Updated by Nobuyoshi Nakada about 1 year ago

I've forgotten the test file, "test/ruby/enc/test_windows_1252.rb", and added it now.
What tests are needed?

#4 Updated by Martin Dürst 12 months ago

Nobuyoshi Nakada wrote:

I've forgotten the test file, "test/ruby/enc/test_windows_1252.rb", and added it now.
What tests are needed?

Kimihito Matsui, one of my students, is working on tests (not only for Windows-1252, but also for other encodings).

Can you (or somebody else) tell me what the case-related encoding primitives are supposed to do?
