Bug #10097

Case-insensitive Regexp matching for Windows-1252 not working for ŠšŽžŒœÿŸ

Added by Martin Dürst over 1 year ago. Updated about 2 months ago.

Status:
Closed
Priority:
Normal
Assignee:
-
ruby -v:
1.9.3p545
Backport:
2.0.0: UNKNOWN, 2.1: UNKNOWN
[ruby-core:64049]

Description

By chance I had a look at enc/iso_8859_1.c and found

ENC_REPLICATE("Windows-1252", "ISO-8859-1")

on line 288. But this does not work for case folding:

# http://en.wikipedia.org/wiki/Windows-1252
s1 = "\u0160".encode 'windows-1252' # 'Š'
r1 = Regexp.new("\u0161".encode('windows-1252'), Regexp::IGNORECASE) # /š/i
s1 =~ r1
   # => nil
s2 = "\u0178".encode 'windows-1252' # 'Ÿ'
r2 = Regexp.new("\u00FF".encode('windows-1252'), Regexp::IGNORECASE) # /ÿ/i
s2 =~ r2
   # => nil
s3 = "\u00C0".encode 'windows-1252' # 'À'
r3 = Regexp.new("\u00E0".encode('windows-1252'), Regexp::IGNORECASE) # /à/i
s3 =~ r3
   # => 0

So case-insensitive matching works when both characters are in ISO-8859-1 (À/à), but fails when one character (Ÿ, whose lower-case ÿ is in ISO-8859-1) or both characters (Šš, Žž, Œœ) are outside ISO-8859-1.
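The checks above can be generalized into a small loop over all the pairs named in the title. This is only a sketch: `case_insensitive_match?`, `reported_pairs` and `report` are hypothetical helper names, not part of Ruby or of any patch, and which pairs print FAIL depends on the Ruby version being run.

```ruby
# Hypothetical helper: true when `upper` is matched by a case-insensitive
# regexp built from `lower`, with both transcoded to `encoding`.
def case_insensitive_match?(upper, lower, encoding = 'windows-1252')
  subject = upper.encode(encoding)
  pattern = Regexp.new(Regexp.escape(lower.encode(encoding)), Regexp::IGNORECASE)
  !(subject =~ pattern).nil?
end

# Upper/lower pairs named in this report, plus one control pair.
def reported_pairs
  {
    "\u0160" => "\u0161", # S with caron
    "\u017D" => "\u017E", # Z with caron
    "\u0152" => "\u0153", # OE ligature
    "\u0178" => "\u00FF", # Y with diaeresis (lower case is in ISO-8859-1)
    "\u00C0" => "\u00E0"  # A with grave (control: both in ISO-8859-1)
  }
end

# One line per pair, marked 'ok' or 'FAIL'.
def report(pairs = reported_pairs)
  pairs.map do |upper, lower|
    status = case_insensitive_match?(upper, lower) ? 'ok' : 'FAIL'
    format('U+%04X / U+%04X: %s', upper.ord, lower.ord, status)
  end
end

puts report
```

On an affected Ruby, the first four pairs would be expected to print FAIL and the control pair ok, matching the results shown above.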

Associated revisions

Revision 53046
Added by Nobuyoshi Nakada about 2 months ago

enc/windows_1252.c: new

  • enc/windows_1252.c: separate from ISO-8859-1 to fix 0x80..0x9e range. [Bug #10097]

History

#2 [ruby-core:64072] Updated by Martin Dürst over 1 year ago

Nobuyoshi Nakada wrote:

Is this correct?
https://github.com/nobu/ruby/compare/windows-1252

Thanks a lot for this very quick work!

Unfortunately, it's not correct. I haven't checked everything, but at least cp1252_get_case_fold_codes_by_str doesn't deal with the special cases in get_case_fold_codes_by_str for ss/SS/ß.
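One way to probe the ss/SS/ß special case from the Ruby side is sketched below. `sharp_s_probe` is a hypothetical helper name, and the results are deliberately not stated here because they depend on the Ruby version and on how the encoding implements case folding; comparing the output for ISO-8859-1 and Windows-1252 would show whether the two behave the same.

```ruby
# Hypothetical probe: does a case-insensitive regexp for "ß" match "ss" and
# "SS" in the given encoding? Returns e.g. {"ss" => true/false, "SS" => ...}.
def sharp_s_probe(encoding)
  sharp_s = "\u00DF".encode(encoding)               # ß in the target encoding
  pattern = Regexp.new(sharp_s, Regexp::IGNORECASE) # /ß/i
  %w[ss SS].map { |s| [s, !(s.encode(encoding) =~ pattern).nil?] }.to_h
end

%w[iso-8859-1 windows-1252].each do |encoding|
  puts "#{encoding}: #{sharp_s_probe(encoding).inspect}"
end
```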

I suggest that we do some more exploratory work before addressing this bug directly.

First, I suspect that other (windows-12xx,...) encodings have very similar problems.
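A rough way to survey other single-byte encodings for the same problem is sketched below. It is an assumption-laden sketch, not an existing tool: `failing_case_pairs` is a hypothetical name, and it relies on `String#downcase` handling non-ASCII characters in UTF-8 (Ruby 2.4 or later) to decide which bytes form an upper/lower pair.

```ruby
# Hypothetical survey: for each byte 0x80..0xFF of a single-byte encoding,
# find its lower-case partner via UTF-8 and check that a case-insensitive
# regexp for the lower-case byte matches the upper-case byte.
# Returns the failing pairs as [upper, lower] UTF-8 strings.
def failing_case_pairs(encoding)
  failures = []
  (0x80..0xFF).each do |byte|
    upper = [byte].pack('C').force_encoding(encoding)
    begin
      lower = upper.encode('UTF-8').downcase.encode(encoding)
    rescue EncodingError
      next # byte undefined in `encoding`, or lower case not representable
    end
    next if lower == upper || lower.bytesize != 1

    pattern = Regexp.new(Regexp.escape(lower), Regexp::IGNORECASE)
    failures << [upper.encode('UTF-8'), lower.encode('UTF-8')] unless upper =~ pattern
  end
  failures
end

%w[windows-1250 windows-1251 windows-1252].each do |encoding|
  puts "#{encoding}: #{failing_case_pairs(encoding).inspect}"
end
```

An empty list for an encoding means no failing pair was found by this particular check; it does not cover multi-character folds such as ß.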

Second, I found this bug because I was trying to find out what information the encoding primitives already provide for case folding and case conversion.

I have only just started that, but depending on what I/we find, we may want/need to:

1) use this information and be done;
2) use this information and add some more information separately;
3) change this information (e.g. add or change some primitives) so that it covers all the needs for case conversion;
4) provide the information for case conversion completely separately.

I suggest that we hold off on fixing this bug until we are able to rule out choice 3).

If there is a (short, up-to-date) summary of what each of the encoding primitives does, that would help me a lot (Japanese would be okay).

#3 [ruby-core:64093] Updated by Nobuyoshi Nakada over 1 year ago

I've forgotten the test file, "test/ruby/enc/test_windows_1252.rb", and added it now.
What tests are needed?

#4 [ruby-core:64185] Updated by Martin Dürst over 1 year ago

Nobuyoshi Nakada wrote:

I've forgotten the test file, "test/ruby/enc/test_windows_1252.rb", and added it now.
What tests are needed?

Kimihito Matsui, one of my students, is working on tests (not only for Windows-1252, but also for other encodings).

Can you (or somebody else) tell me what the case-related encoding primitives are supposed to do?
(Could someone explain how the upper-case/lower-case related primitives work and what their role is? Thank you in advance.)

#5 [ruby-core:72051] Updated by Martin Dürst about 2 months ago

Nobuyoshi Nakada wrote:

Is this correct?
https://github.com/nobu/ruby/compare/windows-1252

Sorry for the very slow response. Please commit. Thanks!

#6 Updated by Nobuyoshi Nakada about 2 months ago

  • Status changed from Open to Closed

Applied in changeset r53046.

