Bug #4044

Regex matching errors when using \W character class and /i option

Added by Ben Hoskings over 3 years ago. Updated 5 months ago.

[ruby-core:33139]
Status:Feedback
Priority:Normal
Assignee:Yui NARUSE
Category:core
Target version:next minor
ruby -v:ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]
Backport:

Description

=begin
Hi all,

Josh Bassett and I just discovered an issue with regex matches on ruby-1.9.2p0. (We reduced it while we were hacking on gemcutter.)

The case-insensitive (/i) option together with the non-word character class (\W) matches inconsistently against the alphabet. Specifically, the regex doesn't match properly against the letters 'k' and 's'.

The following expression demonstrates the problem in irb:

 puts ('a'..'z').to_a.map {|c| [c, c.ord, c[/[^\W]/i] ].inspect }

As a reference, the following two expressions are working properly:

 puts ('a'..'z').to_a.map {|c| [c, c.ord, c[/[^\W]/] ].inspect }
 puts ('a'..'z').to_a.map {|c| [c, c.ord, c[/[\w]/i] ].inspect }

Cheers
Ben Hoskings & Josh Bassett
=end


Related issues

Duplicated by ruby-trunk - Bug #5871: regexp \W matches some word characters when inside a case... Rejected 01/10/2012
Duplicated by ruby-trunk - Bug #7534: /(?i:[\W])/ and /(?i:[\w])/ both match "s" Closed 12/08/2012
Duplicated by ruby-trunk - Bug #7533: Oniguruma hates the letter 's' :( Closed 12/08/2012
Duplicated by ruby-trunk - Bug #9087: swallowing "s" letters when "i" flag is on Closed 11/06/2013

Associated revisions

Revision 43657
Added by Zachary Scott 5 months ago

History

#1 Updated by Benoit Daloze over 3 years ago

=begin
On 11 November 2010 09:08, Ben Hoskings redmine@ruby-lang.org wrote:

Bug #4044: Regex matching errors when using \W character class and /i option
http://redmine.ruby-lang.org/issues/show/4044

Author: Ben Hoskings
Status: Open, Priority: Normal
Category: core, Target version: 1.9.2
ruby -v: ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]

Hi all,

Josh Bassett and I just discovered an issue with regex matches on ruby-1.9.2p0. (We reduced it while we were hacking on gemcutter.)

The case-insensitive (/i) option together with the non-word character class (\W) matches inconsistently against the alphabet. Specifically, the regex doesn't match properly against the letters 'k' and 's'.

The following expression demonstrates the problem in irb:

   puts ('a'..'z').to_a.map {|c| [c, c.ord, c[/[^\W]/i] ].inspect }

As a reference, the following two expressions are working properly:

   puts ('a'..'z').to_a.map {|c| [c, c.ord, c[/[^\W]/] ].inspect }
   puts ('a'..'z').to_a.map {|c| [c, c.ord, c[/[\w]/i] ].inspect }

Cheers
Ben Hoskings & Josh Bassett

Confirmed on trunk (ruby 1.9.3dev (2010-11-09 trunk 29728)
[x86_64-darwin10.4.0]).

=end

#2 Updated by Yui NARUSE over 3 years ago

  • Status changed from Open to Assigned
  • Assignee set to Yui NARUSE

=begin
I confirmed this, but this may take long.
=end

#3 Updated by Heesob Park over 3 years ago

=begin
I confirmed this on ruby 1.9.3dev (2010-11-10) [i386-mswin32_90]
irb(main):001:0> /[^\W]/iu =~ 'k'
=> nil
irb(main):002:0> /[^\W]/iu =~ 's'
=> nil

This bug is due to multiple Case Unfold definitions in unicode.c

static const CaseUnfold11Type CaseUnfold_11[] = {
{ 0x0061, {1, {0x0041 }}},
{ 0x0062, {1, {0x0042 }}},
{ 0x0063, {1, {0x0043 }}},
{ 0x0064, {1, {0x0044 }}},
{ 0x0065, {1, {0x0045 }}},
{ 0x0066, {1, {0x0046 }}},
{ 0x0067, {1, {0x0047 }}},
{ 0x0068, {1, {0x0048 }}},
{ 0x006a, {1, {0x004a }}},
{ 0x006b, {2, {0x212a, 0x004b }}}, //----- 'k'
{ 0x006c, {1, {0x004c }}},
{ 0x006d, {1, {0x004d }}},
{ 0x006e, {1, {0x004e }}},
{ 0x006f, {1, {0x004f }}},
{ 0x0070, {1, {0x0050 }}},
{ 0x0071, {1, {0x0051 }}},
{ 0x0072, {1, {0x0052 }}},
{ 0x0073, {2, {0x0053, 0x017f }}}, //---- 's'
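
To see why only 'k' and 's' stand out, it may help to model a few of the entries above in Ruby. This is only an illustrative sketch: `CASE_UNFOLD_SAMPLE` and `multi_unfold_letters` are hypothetical names, and just five entries are copied from the excerpt.

```ruby
# Hypothetical Ruby model of a few CaseUnfold_11 entries quoted above.
# Keys are folded code points; values are the code points that fold to them.
CASE_UNFOLD_SAMPLE = {
  0x006a => [0x004a],
  0x006b => [0x212a, 0x004b],   # 'k': KELVIN SIGN and 'K'
  0x006c => [0x004c],
  0x0072 => [0x0052],
  0x0073 => [0x0053, 0x017f],   # 's': 'S' and LATIN SMALL LETTER LONG S
}.freeze

# Return the letters whose unfold list contains a code point outside ASCII.
def multi_unfold_letters(table)
  table.select { |_folded, sources| sources.any? { |cp| cp > 0x7f } }
       .keys
       .map { |cp| cp.chr(Encoding::UTF_8) }
end

p multi_unfold_letters(CASE_UNFOLD_SAMPLE)
```

With these sample entries, only 'k' and 's' have a non-ASCII partner, which lines up with the two letters that misbehave in the report.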

And a possible patch is

--- regparse.c 2010-11-12 15:10:07.000000000 +0900
+++ regparse.c.new 2010-11-12 15:29:34.000000000 +0900
@@ -5075,7 +5075,7 @@
     int is_in = onig_is_code_in_cc(env->enc, from, cc);
 #ifdef CASE_FOLD_IS_APPLIED_INSIDE_NEGATIVE_CCLASS
     if ((is_in != 0 && !IS_NCCLASS_NOT(cc)) ||
-        (is_in == 0 && IS_NCCLASS_NOT(cc))) {
+        (is_in == 0 && IS_NCCLASS_NOT(cc) && from < SINGLE_BYTE_SIZE)) {
       if (ONIGENC_MBC_MINLEN(env->enc) > 1 || *to >= SINGLE_BYTE_SIZE) {
         add_code_range0(&(cc->mbuf), env, *to, *to, 0);
       }

=end

#4 Updated by Yui NARUSE over 3 years ago

=begin
(2010/11/12 15:36), Heesob Park wrote:

And a possible patch is

--- regparse.c 2010-11-12 15:10:07.000000000 +0900
+++ regparse.c.new 2010-11-12 15:29:34.000000000 +0900
@@ -5075,7 +5075,7 @@
     int is_in = onig_is_code_in_cc(env->enc, from, cc);
 #ifdef CASE_FOLD_IS_APPLIED_INSIDE_NEGATIVE_CCLASS
     if ((is_in != 0 && !IS_NCCLASS_NOT(cc)) ||
-        (is_in == 0 && IS_NCCLASS_NOT(cc))) {
+        (is_in == 0 && IS_NCCLASS_NOT(cc) && from < SINGLE_BYTE_SIZE)) {
       if (ONIGENC_MBC_MINLEN(env->enc) > 1 || *to >= SINGLE_BYTE_SIZE) {
         add_code_range0(&(cc->mbuf), env, *to, *to, 0);
       }

Thank you for a patch, but it breaks
/[\u0100]/i=~"\u0101"

--
NARUSE, Yui naruse@airemix.jp

=end

#5 Updated by Heesob Park over 3 years ago

=begin
2010/11/14 NARUSE, Yui naruse@airemix.jp:

(2010/11/12 15:36), Heesob Park wrote:

And a possible patch is

--- regparse.c 2010-11-12 15:10:07.000000000 +0900
+++ regparse.c.new 2010-11-12 15:29:34.000000000 +0900
@@ -5075,7 +5075,7 @@
     int is_in = onig_is_code_in_cc(env->enc, from, cc);
 #ifdef CASE_FOLD_IS_APPLIED_INSIDE_NEGATIVE_CCLASS
     if ((is_in != 0 && !IS_NCCLASS_NOT(cc)) ||
-        (is_in == 0 && IS_NCCLASS_NOT(cc))) {
+        (is_in == 0 && IS_NCCLASS_NOT(cc) && from < SINGLE_BYTE_SIZE)) {
       if (ONIGENC_MBC_MINLEN(env->enc) > 1 || *to >= SINGLE_BYTE_SIZE) {
         add_code_range0(&(cc->mbuf), env, *to, *to, 0);
       }

Thank you for a patch, but it breaks
/[\u0100]/i=~"\u0101"

OK, Here is a revised patch

--- regparse.c 2010-11-15 10:02:34.000000000 +0900
+++ regparse.c.new 2010-11-15 10:01:20.000000000 +0900
@@ -5075,7 +5075,9 @@
     int is_in = onig_is_code_in_cc(env->enc, from, cc);
 #ifdef CASE_FOLD_IS_APPLIED_INSIDE_NEGATIVE_CCLASS
     if ((is_in != 0 && !IS_NCCLASS_NOT(cc)) ||
-        (is_in == 0 && IS_NCCLASS_NOT(cc))) {
+        (is_in == 0 && IS_NCCLASS_NOT(cc) &&
+         ((from < SINGLE_BYTE_SIZE && *to < SINGLE_BYTE_SIZE) ||
+          (from >= SINGLE_BYTE_SIZE && *to >= SINGLE_BYTE_SIZE)))) {
       if (ONIGENC_MBC_MINLEN(env->enc) > 1 || *to >= SINGLE_BYTE_SIZE) {
         add_code_range0(&(cc->mbuf), env, *to, *to, 0);
       }

Regards,
Park Heesob

=end

#6 Updated by Yui NARUSE over 3 years ago

=begin
It is still a hack.
Current behavior has a reason:
\W -> (ignore case) -> \W (\u017F) + s + S + ... -> not

An experimental patch is following but this is also wrong.

diff --git a/ChangeLog b/ChangeLog
index 18567e3..9dbe329 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,8 @@
+Wed Nov 17 17:19:02 2010 NARUSE, Yui naruse@ruby-lang.org
+
+ * regparse.c: don't apply ignore case to posix bracket, character
+ type, and character property.
+
 Wed Nov 17 15:16:48 2010 NARUSE, Yui naruse@ruby-lang.org
 
 * regint.h (OnigOpInfoType): constify name.

diff --git a/regparse.c b/regparse.c
index bf40603..118081f 100644
--- a/regparse.c
+++ b/regparse.c
@@ -4270,6 +4270,8 @@ code_exist_check(OnigCodePoint c, UChar* from, UChar* end, int ignore_escaped,
   return 0;
 }
 
+static int cclass_case_fold(Node** np, CClassNode* cc, ScanEnv* env);
+
 static int
 parse_char_class(Node** np, OnigToken* tok, UChar** src, UChar* end,
                  ScanEnv* env)
@@ -4279,13 +4281,14 @@ parse_char_class(Node** np, OnigToken* tok, UChar** src, UChar* end,
   UChar *p;
   Node* node;
   CClassNode *cc, *prev_cc;
-  CClassNode work_cc;
+  CClassNode work_cc, cased_cc;
 
   enum CCSTATE state;
   enum CCVALTYPE val_type, in_type;
   int val_israw, in_israw;
 
   prev_cc = (CClassNode* )NULL;
+  initialize_cclass(&cased_cc);
   *np = NULL_NODE;
   r = fetch_token_in_cc(tok, src, end, env);
   if (r == TK_CHAR && tok->u.c == '^' && tok->escaped == 0) {
@@ -4406,7 +4409,7 @@ parse_char_class(Node** np, OnigToken* tok, UChar** src, UChar* end,
       break;
 
     case TK_POSIX_BRACKET_OPEN:
-      r = parse_posix_bracket(cc, &p, end, env);
+      r = parse_posix_bracket(&cased_cc, &p, end, env);
       if (r < 0) goto err;
       if (r == 1) {  /* is not POSIX bracket */
         CC_ESC_WARN(env, (UChar* )"[");
@@ -4419,7 +4422,7 @@ parse_char_class(Node** np, OnigToken* tok, UChar** src, UChar* end,
       break;
 
     case TK_CHAR_TYPE:
-      r = add_ctype_to_cc(cc, tok->u.prop.ctype, tok->u.prop.not, env);
+      r = add_ctype_to_cc(&cased_cc, tok->u.prop.ctype, tok->u.prop.not, env);
       if (r != 0) return r;
 
     next_class:
@@ -4433,7 +4436,7 @@ parse_char_class(Node** np, OnigToken* tok, UChar** src, UChar* end,
 
         ctype = fetch_char_property_to_ctype(&p, end, env);
         if (ctype < 0) return ctype;
-        r = add_ctype_to_cc(cc, ctype, tok->u.prop.not, env);
+        r = add_ctype_to_cc(&cased_cc, ctype, tok->u.prop.not, env);
         if (r != 0) return r;
         goto next_class;
       }
@@ -4501,7 +4504,7 @@ parse_char_class(Node** np, OnigToken* tok, UChar** src, UChar* end,
         r = parse_char_class(&anode, tok, &p, end, env);
         if (r == 0) {
           acc = NCCLASS(anode);
-          r = or_cclass(cc, acc, env);
+          r = or_cclass(&cased_cc, acc, env);
         }
         onig_node_free(anode);
         if (r != 0) goto err;
@@ -4519,6 +4522,13 @@ parse_char_class(Node** np, OnigToken* tok, UChar** src, UChar* end,
         and_start = 1;
         state = CCS_START;
 
+        if (IS_IGNORECASE(env->option)) {
+          cclass_case_fold(np, cc, env);
+        }
+        if (IS_NOT_NULL(&cased_cc)) {
+          r = or_cclass(cc, &cased_cc, env);
+          initialize_cclass(&cased_cc);
+        }
         if (IS_NOT_NULL(prev_cc)) {
           r = and_cclass(prev_cc, cc, env);
           if (r != 0) goto err;
@@ -4556,6 +4566,13 @@ parse_char_class(Node** np, OnigToken* tok, UChar** src, UChar* end,
     if (r != 0) goto err;
   }
 
+  if (IS_IGNORECASE(env->option)) {
+    cclass_case_fold(np, cc, env);
+  }
+  if (IS_NOT_NULL(&cased_cc)) {
+    r = or_cclass(cc, &cased_cc, env);
+    initialize_cclass(&cased_cc);
+  }
   if (IS_NOT_NULL(prev_cc)) {
     r = and_cclass(prev_cc, cc, env);
     if (r != 0) goto err;
@@ -5136,6 +5153,32 @@ i_apply_case_fold(OnigCodePoint from, OnigCodePoint to[],
 }
 
 static int
+cclass_case_fold(Node** np, CClassNode* cc, ScanEnv* env)
+{
+  int r;
+  IApplyCaseFoldArg iarg;
+  iarg.env = env;
+  iarg.cc = cc;
+  iarg.alt_root = NULL_NODE;
+  iarg.ptail = &(iarg.alt_root);
+
+  r = ONIGENC_APPLY_ALL_CASE_FOLD(env->enc, env->case_fold_flag,
+                                  i_apply_case_fold, &iarg);
+  if (r != 0) {
+    onig_node_free(iarg.alt_root);
+    return r;
+  }
+  if (IS_NOT_NULL(iarg.alt_root)) {
+    Node* work = onig_node_new_alt(*np, iarg.alt_root);
+    if (IS_NULL(work)) {
+      onig_node_free(iarg.alt_root);
+      return ONIGERR_MEMORY;
+    }
+    *np = work;
+  }
+  return r;
+}
+static int
 parse_exp(Node** np, OnigToken* tok, int term,
           UChar** src, UChar* end, ScanEnv* env)
 {
@@ -5382,35 +5425,8 @@ parse_exp(Node** np, OnigToken* tok, int term,
 
     case TK_CC_OPEN:
       {
-        CClassNode* cc;
-
         r = parse_char_class(np, tok, src, end, env);
         if (r != 0) return r;
-
-        cc = NCCLASS(*np);
-        if (IS_IGNORECASE(env->option)) {
-          IApplyCaseFoldArg iarg;
-          iarg.env = env;
-          iarg.cc = cc;
-          iarg.alt_root = NULL_NODE;
-          iarg.ptail = &(iarg.alt_root);
-
-          r = ONIGENC_APPLY_ALL_CASE_FOLD(env->enc, env->case_fold_flag,
-                                          i_apply_case_fold, &iarg);
-          if (r != 0) {
-            onig_node_free(iarg.alt_root);
-            return r;
-          }
-          if (IS_NOT_NULL(iarg.alt_root)) {
-            Node* work = onig_node_new_alt(*np, iarg.alt_root);
-            if (IS_NULL(work)) {
-              onig_node_free(iarg.alt_root);
-              return ONIGERR_MEMORY;
-            }
-            *np = work;
-          }
-        }
       }
       break;
 
diff --git a/test/ruby/test_regexp.rb b/test/ruby/test_regexp.rb
index 346979d..aaceacf 100644
--- a/test/ruby/test_regexp.rb
+++ b/test/ruby/test_regexp.rb
@@ -190,6 +190,16 @@ class TestRegexp < Test::Unit::TestCase
     assert_equal(false, /(?i:a)/.casefold?)
   end
 
+  def test_caseless_match
+    assert_match(/a/iu, "A")
+    assert_match(/[a-z]/iu, "A")
+    assert_not_match(/[:lower:]/iu, "A")
+    assert_not_match(/\p{Ll}/iu, "A")
+    assert_not_match(/\p{Lower}/iu, "A")
+    assert_match(/[\p{Lower}]/iu, "A")
+    assert_match(/[\W]/iu, "A")
+  end
+
   def test_options
     assert_equal(Regexp::IGNORECASE, /a/i.options)
     assert_equal(Regexp::EXTENDED, /a/x.options)
    =end

#7 Updated by Yui NARUSE over 3 years ago

  • Status changed from Assigned to Rejected

=begin
I think, current behavior is reasonable.
=end

#8 Updated by Mark Towfiq about 3 years ago

=begin
Yui NARUSE wrote:

I think, current behavior is reasonable.

Perhaps there is a misunderstanding? The current behavior means that \W does not mean [^A-Za-z0-9_] in Ruby 1.9 in some cases. This is basic functionality: if people cannot trust the Regexp class abbreviations, this will be very difficult. This works correctly in Ruby 1.8.7, BTW. I believe this is a critical bug which must be fixed urgently.

Mark Towfiq
CTO, FanSnap
=end

#9 Updated by Yui NARUSE about 3 years ago

=begin

The current behavior means that \W does not mean [^A-Za-z0-9_] in Ruby 1.9 in some cases.

Unicode ignore case breaks it.
http://unicode.org/reports/tr21/

212A; C; 006B; # KELVIN SIGN
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
http://www.unicode.org/Public/UNIDATA/CaseFolding.txt

\W includes U+212A and U+00DF
/i adds U+006B (k) and U+0073 (S) to [\W]
^ reverses the class; it doesn't include k & S.

This works correctly in Ruby 1.8.7 BTW.

1.8 doesn't have Unicode ignore case.
=end
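
The three steps described above (\W, then /i, then ^) can be modelled with plain sets. This is only a sketch under stated assumptions: word characters are taken to be ASCII [A-Za-z0-9_], the universe is a handful of sample characters, and the fold groups for 'k' and 's' follow the unfold entries quoted earlier in this thread (U+212A and U+017F). `FOLD_GROUPS`, `case_close` and `negated_non_word_under_i` are hypothetical names, not Oniguruma internals.

```ruby
require "set"

# Assumption: word characters are ASCII [A-Za-z0-9_] only.
WORD = Set.new(("a".."z").to_a + ("A".."Z").to_a + ("0".."9").to_a + ["_"])

# Tiny sample universe: a few ASCII letters plus two non-ASCII code points.
UNIVERSE = Set.new(["a", "k", "s", "K", "S", "\u212A", "\u017F"])

# Case-fold equivalence groups for the sample characters.
FOLD_GROUPS = [
  ["a", "A"],
  ["k", "K", "\u212A"],   # KELVIN SIGN folds to 'k'
  ["s", "S", "\u017F"],   # LONG S folds to 's'
].freeze

# Add every case variant of each member, as /i does for a bracket class.
def case_close(set)
  closed = set.dup
  FOLD_GROUPS.each do |group|
    closed.merge(group) if group.any? { |ch| set.include?(ch) }
  end
  closed
end

# Model of [^\W] under /i: expand \W, case-close it, then negate.
def negated_non_word_under_i
  non_word = UNIVERSE - WORD        # models [\W]
  folded   = case_close(non_word)   # models [\W] with /i
  UNIVERSE - folded                 # models [^\W] with /i
end

p negated_non_word_under_i.to_a
```

In this model only "a" survives the negation: 'k' and 's' were pulled into the class by their non-ASCII partners before the class was inverted.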

#10 Updated by Martin Dürst over 2 years ago

  • Status changed from Rejected to Open

In reply to my analysis at https://bugs.ruby-lang.org/issues/5871#note-7, Yui Naruse suggested at https://bugs.ruby-lang.org/issues/5871#note-8 that I open this issue rather than #5871, which I'm doing herewith.

Yui also suggested that I propose a concrete plan. My current proposal is that we analyse what casing data is being used in what places when using /i (case insensitive matching) in regular expressions, and that we then fix that. If we don't make progress, I'll also write to the Unicode mailing list to hopefully collect input from other implementers.

By the way, can somebody explain the following difference:

$ ruby -e "puts /[\W]|\u1234/i.match('k').inspect"
#<MatchData "k">

$ ruby -e "puts /\W|\u1234/i.match('k').inspect"
nil

(|\u1234 is there just to force the regexp to be in UTF-8.)

#11 Updated by Yui NARUSE about 2 years ago

  • Status changed from Open to Feedback

#12 Updated by Kenta Murata about 2 years ago

I think this is bug:

$ ruby -e "puts /[\W]|\u1234/i.match('k').inspect"
#<MatchData "k">
$ ruby -e "puts /[\W]|\u1234/.match('k').inspect"
nil

#13 Updated by Akira Tanaka about 2 years ago

Interesting example:

% ruby -ve '("a".."z").each {|ch| p(/[\W]/i.match(ch)) }'
ruby 2.0.0dev (2012-03-16 trunk 35049) [x86_64-linux]
-e:1: warning: character class has duplicated range: /[\W]/
nil
nil
nil
nil
nil
nil
nil
nil
nil
nil
#<MatchData "k">
nil
nil
nil
nil
nil
nil
nil
#<MatchData "s">
nil
nil
nil
nil
nil
nil
nil

#14 Updated by Martin Dürst about 2 years ago

Hello Yui,

We discussed this issue at today's developers' meeting in Akihabara.

There was wide consensus among the attendees that it is very strange to have 'k' and 's' included in the set of non-word (\W) characters. Therefore we are sorry, but we don't agree with your https://bugs.ruby-lang.org/issues/4044#note-7.

duerst (Martin Dürst) wrote:

My current proposal is that we analyse what casing data is being used in what places when using /i (case insensitive matching) in regular expressions, and that we then fix that.

We have discussed this a bit. The first question is what \w should refer to in Ruby. I personally would hope that in the long term, we can move this to include all word characters (i.e. also non-ascii Latin, other scripts, Hiragana, Katakana, Kanji,...). But the general opinion today was that we should keep this as ASCII only currently. Anyway, this bug is independent of this problem, because in both cases, \w includes 'k' and 's', and therefore in both cases, \W must not include 'k' nor 's'.

Also, we noted that regular expression components such as \w or \W should be independent of whether /i is set or not. The reason for that is that \w already takes care of combining lower- and upper-case characters. So there's nothing a /i can improve, and it should not make things worse.
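
The expectation in the previous paragraph can be written as a small check. This is a sketch only; `same_with_ignorecase?` is a hypothetical helper, not part of Ruby, and it simply compares a pattern compiled with and without Regexp::IGNORECASE over a list of characters.

```ruby
# Hypothetical helper: does adding /i change which of +chars+ match +source+?
def same_with_ignorecase?(source, chars)
  plain  = Regexp.new(source)
  folded = Regexp.new(source, Regexp::IGNORECASE)
  chars.all? { |ch| plain.match(ch).nil? == folded.match(ch).nil? }
end

letters = ("a".."z").to_a
puts same_with_ignorecase?('\w', letters)      # \w should not care about /i
puts same_with_ignorecase?('[^\W]', letters)   # result depends on the Ruby build
```

On a Ruby affected by this bug the second call would be expected to report a difference, because 'k' and 's' stop matching once /i is added.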

By the way, can somebody explain the following difference:

$ ruby -e "puts /[\W]|\u1234/i.match('k').inspect"
#<MatchData "k">

$ ruby -e "puts /\W|\u1234/i.match('k').inspect"
nil

(|\u1234 is there just to force the regexp to be in UTF-8.)

I suspect that this is due to the fact that \W in character classes gets expanded to an actual list of characters (or ranges) before case-extension (/i), whereas \W outside character classes does not get affected by case-extension.
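
A quick way to compare the two forms side by side is to list which lowercase letters each one matches. This is a sketch; `matched_letters` is a hypothetical helper, and the second result depends on whether the Ruby in use still has this bug.

```ruby
# Hypothetical helper: lowercase ASCII letters matched by +regexp+.
def matched_letters(regexp)
  ("a".."z").select { |ch| regexp.match(ch) }
end

p matched_letters(/\W/i)     # bare \W
p matched_letters(/[\W]/i)   # \W inside a bracket expression
```

If the explanation above is right, the bare form should list no letters at all, while the bracketed form lists 'k' and 's' on affected versions.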

Given the above, I have reopened this bug. I hope to be able to help you over the next two weeks, but I hope you can take the lead.

Regards, Martin.

#15 Updated by Ian MacLeod almost 2 years ago

One additional note is that this only seems to occur when \W is in a character group:

➜ ruby -ve '("a".."z").each {|ch| p(/\W/i.match(ch)) }'
ruby 1.9.3p125 (2012-02-16 revision 34643) [x86_64-darwin12.0.0]
nil
nil
nil
nil
nil
nil
nil
nil
nil
nil
nil
nil
nil
nil
nil
nil
nil
nil
nil
nil
nil
nil
nil
nil
nil
nil

Edit: sorry if this is duplicate info (unsure)

#16 Updated by Ben Hoskings over 1 year ago

Hi all, long time no see :)

naruse (Yui NARUSE) wrote:

=begin

The current behavior means that \W does not mean [^A-Za-z0-9_] in Ruby 1.9 in some cases.

Unicode ignore case breaks it.
http://unicode.org/reports/tr21/

212A; C; 006B; # KELVIN SIGN
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
http://www.unicode.org/Public/UNIDATA/CaseFolding.txt

\W includes U+212A and U+00DF
/i adds U+006B (k) and U+0073 (S) to [\W]
^ reverses the class; it doesn't include k & S.

I think I see the misunderstanding: there are multiple characters that render as 'k' and 's'.

K, S, k, s are basic word characters, and so [^\W] should match them (along with all A-Z and a-z):
0x004B (Latin capital letter K)
0x0053 (Latin capital letter S)
0x006B (Latin small letter k)
0x0073 (Latin small letter s)

But, I'm not sure how [\W] should treat these characters:
0x00DF (Latin small letter sharp s)
0x017F (Latin small letter long s)
0x212A (Kelvin sign)

The important thing is that all the characters in A-Z (0x41-0x5A) & a-z (0x61-0x7A) are word characters, so [^\W] should match all of them.

Cheers,
Ben

#17 Updated by Matthew Kerwin over 1 year ago

ben_h (Ben Hoskings) wrote:

But, I'm not sure how [\W] should treat these characters:
0x00DF (Latin small letter sharp s)
0x017F (Latin small letter long s)
0x212A (Kelvin sign)

Can you just fall back on the Unicode categories? If we define "word characters" as Letters and Numbers, U+212A is {Lu} and thus a word character. Similarly U+017F is {Ll}.

Seems a bit weird in the case of Kelvin (also the Angstrom Sign U+212B = {Lu}) but at least Unicode is a fixed and universally accessible standard.
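
The categories mentioned above can be checked from Ruby itself with \p{...} property classes. This is a small sketch; `CANDIDATES` and `letter_category` are hypothetical names, and only the Lu and Ll categories are tested.

```ruby
# Code points discussed above, written as escapes so the file stays ASCII.
CANDIDATES = {
  "KELVIN SIGN"                => "\u212A",
  "ANGSTROM SIGN"              => "\u212B",
  "LATIN SMALL LETTER LONG S"  => "\u017F",
  "LATIN SMALL LETTER SHARP S" => "\u00DF",
}.freeze

# Report which of the letter categories Lu / Ll a character belongs to.
def letter_category(ch)
  return "Lu" if ch =~ /\p{Lu}/
  return "Ll" if ch =~ /\p{Ll}/
  nil
end

CANDIDATES.each do |name, ch|
  printf("U+%04X %-28s %s\n", ch.ord, name, letter_category(ch).inspect)
end
```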

#18 Updated by Yui NARUSE about 1 year ago

  • Target version changed from 1.9.2 to next minor

#19 Updated by Rodrigo Rosenfeld Rosas 6 months ago

Shouldn't this bug be mentioned in the docs for \W in the Regexp documentation?

http://www.ruby-doc.org/core-2.0.0/Regexp.html

People would like to be aware of it until it's fixed.

#20 Updated by Martin Dürst 5 months ago

On 2013/11/07 21:50, rosenfeld (Rodrigo Rosenfeld Rosas) wrote:

Issue #4044 has been updated by rosenfeld (Rodrigo Rosenfeld Rosas).

Shouldn't this bug be mentioned in the docs for \W in the Regexp documentation?

http://www.ruby-doc.org/core-2.0.0/Regexp.html

People would like to be aware of it until it's fixed.

I'd really prefer it to be fixed, but if you want to contribute a patch
on the docu, that would help.

Regards, Martin.


#22 Updated by Zachary Scott 5 months ago

  • Status changed from Feedback to Closed
  • % Done changed from 0 to 100

This issue was solved with changeset r43657.
Ben, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


#23 Updated by Zachary Scott 5 months ago

  • Status changed from Closed to Feedback
  • % Done changed from 100 to 0

Ooops, didn't mean to close this only mention..
