Case-insensitive Regexp matching for Windows-1252 not working for ŠšŽžŒœÿŸ
ruby -v: 1.9.3p545 | Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN
By chance I had a look at enc/iso_8859_1.c and found
on line 288. But this does not work for case folding:
# http://en.wikipedia.org/wiki/Windows-1252
s1 = "\u0160".encode 'windows-1252' # 'Š'
r1 = Regexp.new("\u0161".encode('windows-1252'), Regexp::IGNORECASE) # /š/i
s1 =~ r1 # => nil
s2 = "\u0178".encode 'windows-1252' # 'Ÿ'
r2 = Regexp.new("\u00FF".encode('windows-1252'), Regexp::IGNORECASE) # /ÿ/i
s2 =~ r2 # => nil
s3 = "\u00C0".encode 'windows-1252' # 'À'
r3 = Regexp.new("\u00E0".encode('windows-1252'), Regexp::IGNORECASE) # /à/i
s3 =~ r3 # => 0
So case-insensitive matching works when both characters are in iso-8859-1, but not when one (ÿŸ) or both (ŠšŽžŒœ) characters are not in iso-8859-1.
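To see exactly which case pairs fall into this category, the Windows-1252 repertoire can be enumerated. The following is only a sketch: it assumes a Ruby whose String#downcase applies Unicode case mapping (2.4 or later, so not the 1.9.3 reported above), and the method names cp1252_chars and case_pairs_outside_latin1 are made up for illustration.

```ruby
# Decode bytes 0x80..0xFF from Windows-1252 to UTF-8.
# Bytes that are unassigned in Windows-1252 raise on conversion and are skipped.
def cp1252_chars
  (0x80..0xFF).map do |byte|
    begin
      [byte].pack('C').force_encoding('Windows-1252').encode('UTF-8')
    rescue EncodingError
      nil
    end
  end.compact
end

# Collect [upper, lower] pairs where both members exist in Windows-1252
# but at least one lies outside ISO-8859-1 (code point above U+00FF).
# Assumption: String#downcase does Unicode case mapping (Ruby 2.4+).
def case_pairs_outside_latin1
  chars = cp1252_chars
  chars.each_with_object([]) do |upper, pairs|
    lower = upper.downcase
    next if lower == upper              # not an uppercase letter
    next unless chars.include?(lower)   # partner is not in Windows-1252
    next if upper.ord <= 0xFF && lower.ord <= 0xFF # both are in ISO-8859-1
    pairs << [upper, lower]
  end
end

puts case_pairs_outside_latin1.map(&:join).join(' ')
```

This should print the four pairs Šš Œœ Žž Ÿÿ, which are the eight characters named in the subject line.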
#2 Updated by Martin Dürst 9 months ago
Nobuyoshi Nakada wrote:
Is this correct?
Thanks a lot for this very quick work!
Unfortunately, it's not correct. I haven't checked everything, but at least cp1252_get_case_fold_codes_by_str doesn't deal with the special cases in get_case_fold_codes_by_str for ss/SS/ß.
I suggest that we do some more exploratory work before addressing this bug directly.
First, I suspect that other (windows-12xx,...) encodings have very similar problems.
Second, I found this bug because I was trying to find out what information the encoding primitives already provide for case folding and case conversion.
I have only just started that, but depending on what I/we find, we may want/need to:
1) use this information and be done;
2) use this information and add some more information separately;
3) change this information (e.g. add or change some primitives) so that it covers all the needs for case conversion;
4) provide the information for case conversion completely separately.
I suggest that we hold off on fixing this bug until we are able to rule out choice 3).
If there is a (short, up-to-date) summary of what each of the encoding primitives does, that would help me a lot (Japanese would be okay).
#4 Updated by Martin Dürst 9 months ago
Nobuyoshi Nakada wrote:
I've forgotten the test file, "test/ruby/enc/test_windows_1252.rb", and added it now.
What tests are needed?
Kimihito Matsui, one of my students, is working on tests (not only for Windows-1252, but also for other encodings).
Can you (or somebody else) tell me what the case-related encoding primitives are supposed to do?
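As a possible starting point for such tests, the check from the original report can be wrapped in a small helper that works for any encoding. This is a rough sketch, not code from Ruby's test suite; folds_together?, fold_report and PAIRS are hypothetical names, and the results for the non-ISO-8859-1 pairs will depend on the Ruby version being tested.

```ruby
# Hypothetical helper: does `upper` match /lower/i once both strings
# have been transcoded to `encoding`?
def folds_together?(upper, lower, encoding)
  subject = upper.encode(encoding)
  pattern = Regexp.new(lower.encode(encoding), Regexp::IGNORECASE)
  !(subject =~ pattern).nil?
end

# Pairs taken from the report: À/à, Š/š, Ÿ/ÿ (written as escapes so the
# source file encoding does not matter).
PAIRS = [["\u00C0", "\u00E0"], ["\u0160", "\u0161"], ["\u0178", "\u00FF"]]

# Return [pair, matched?] for each pair in the given encoding.
def fold_report(encoding, pairs = PAIRS)
  pairs.map do |upper, lower|
    [upper + lower, folds_together?(upper, lower, encoding)]
  end
end

fold_report('windows-1252').each do |pair, matched|
  puts "#{pair}: #{matched ? 'match' : 'no match'}"
end
```

The same report could be run against the other windows-12xx encodings by passing a different encoding name and a list of pairs that exist in that encoding.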