Bug #10097

closed

Case-insensitive Regexp matching for Windows-1252 not working for ŠšŽžŒœÿŸ

Added by duerst (Martin Dürst) almost 10 years ago. Updated over 8 years ago.

Status:
Closed
Assignee:
-
Target version:
-
ruby -v:
1.9.3p545
[ruby-core:64049]

Description

By chance I had a look at enc/iso_8859_1.c and found

ENC_REPLICATE("Windows-1252", "ISO-8859-1")

on line 288. But this does not work for case folding:

# http://en.wikipedia.org/wiki/Windows-1252
s1 = "\u0160".encode 'windows-1252' # 'Š'
r1 = Regexp.new("\u0161".encode('windows-1252'), Regexp::IGNORECASE) # /š/i
s1 =~ r1
   # => nil
s2 = "\u0178".encode 'windows-1252' # 'Ÿ'
r2 = Regexp.new("\u00FF".encode('windows-1252'), Regexp::IGNORECASE) # /ÿ/i
s2 =~ r2
   # => nil
s3 = "\u00C0".encode 'windows-1252' # 'À'
r3 = Regexp.new("\u00E0".encode('windows-1252'), Regexp::IGNORECASE) # /à/i
s3 =~ r3
   # => 0

So case-insensitive matching works when both characters are in ISO-8859-1 (Àà), but not when one (Ÿ of the pair ÿŸ) or both (ŠšŽžŒœ) characters are outside ISO-8859-1.
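The three checks above can be generalized into a small loop over all the affected pairs. This is only a sketch: the constant and method names (`PAIRS`, `case_insensitive_match?`, `RESULTS`) are made up for illustration, and the outcome for each pair depends on whether the Ruby build in use still treats Windows-1252 as a replica of ISO-8859-1.

```ruby
# Upper/lower pairs from the report, written as Unicode escapes.
# The last pair is the control case: both characters exist in ISO-8859-1.
PAIRS = {
  "\u0160" => "\u0161", # Š / š
  "\u017D" => "\u017E", # Ž / ž
  "\u0152" => "\u0153", # Œ / œ
  "\u0178" => "\u00FF", # Ÿ / ÿ
  "\u00C0" => "\u00E0", # À / à
}

# Hypothetical helper: true if /lower/i matches upper in the given encoding.
def case_insensitive_match?(upper, lower, encoding = 'windows-1252')
  subject = upper.encode(encoding)
  pattern = Regexp.new(lower.encode(encoding), Regexp::IGNORECASE)
  !(subject =~ pattern).nil?
end

RESULTS = PAIRS.map { |up, low| [up, low, case_insensitive_match?(up, low)] }

# Print code points rather than the characters themselves, so the output
# does not depend on the terminal encoding.
RESULTS.each do |up, low, ok|
  puts format("U+%04X / U+%04X: %s", up.ord, low.ord, ok ? "match" : "no match")
end
```

On an affected build, only the control pair would be expected to report a match.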

Updated by duerst (Martin Dürst) over 9 years ago

Nobuyoshi Nakada wrote:

Is this correct?
https://github.com/nobu/ruby/compare/windows-1252

Thanks a lot for this very quick work!

Unfortunately, it's not correct. I haven't checked everything, but at least cp1252_get_case_fold_codes_by_str doesn't deal with the special cases in get_case_fold_codes_by_str for ss/SS/ß.
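To see from the Ruby side what the ss/SS/ß special case is about, one can probe how /ß/i behaves against a few candidate strings in Windows-1252. This is an exploratory sketch, not a statement of the expected results: the constant names are invented here, and what is printed for "ss" and "SS" depends on the case-folding code of the Ruby build in use.

```ruby
ENCODING = 'windows-1252'

# /ß/i with the pattern encoded in Windows-1252 (ß is byte 0xDF there).
SHARP_S = Regexp.new("\u00DF".encode(ENCODING), Regexp::IGNORECASE)

# Candidate subjects: the two-letter foldings and ß itself.
CANDIDATES = ["ss", "SS", "\u00DF"].map { |s| s.encode(ENCODING) }

# Record the match position (or nil) for each candidate.
SHARP_S_RESULTS = CANDIDATES.map { |s| [s, s =~ SHARP_S] }

SHARP_S_RESULTS.each do |s, pos|
  puts "#{s.encode('UTF-8').inspect}: #{pos.inspect}"
end
```

Running the same probe with `ENCODING = 'iso-8859-1'` gives a baseline to compare against.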

I suggest that we do some more exploratory work before addressing this bug directly.

First, I suspect that other (windows-12xx,...) encodings have very similar problems.

Second, I found this bug because I was trying to find out what information the encoding primitives already provide for case folding and case conversion.

I have only just started that, but depending on what I/we find, we may want/need to:

  1. use this information and be done;
  2. use this information and add some more information separately;
  3. change this information (e.g. add or change some primitives) so that it covers all the needs for case conversion;
  4. provide the information for case conversion completely separately.

I suggest that we hold off on fixing this bug until we are able to rule out choice 3).

If there is a (short, up-to-date) summary of what each of the encoding primitives does, that would help me a lot (Japanese would be okay).

Updated by nobu (Nobuyoshi Nakada) over 9 years ago

I've forgotten the test file, "test/ruby/enc/test_windows_1252.rb", and added it now.
What tests are needed?

Updated by duerst (Martin Dürst) over 9 years ago

Nobuyoshi Nakada wrote:

I've forgotten the test file, "test/ruby/enc/test_windows_1252.rb", and added it now.
What tests are needed?

Kimihito Matsui, one of my students, is working on tests (not only for windows 1252, but also for other encodings).

Can you (or somebody else) tell me what the case-related encoding primitives are supposed to do?
(Could someone explain how the upper-case/lower-case related primitives work and what their roles are? Thank you in advance.)

Updated by duerst (Martin Dürst) over 8 years ago

Nobuyoshi Nakada wrote:

Is this correct?
https://github.com/nobu/ruby/compare/windows-1252

Sorry for the very slow response. Please commit. Thanks!

Updated by nobu (Nobuyoshi Nakada) over 8 years ago

  • Status changed from Open to Closed

Applied in changeset r53046.


enc/windows_1252.c: new
