Feature #8678: Allow invalid string to work with regexp - Ruby - Ruby Issue Tracking System

Added by naruse (Yui NARUSE) over 12 years ago. Updated over 8 years ago.

Status: Assigned
Assignee: matz (Yukihiro Matsumoto)
Target version:
[ruby-core:<unknown>]

Description

Legacy Ruby 1.8 could match a regexp against broken strings (strings containing invalid byte sequences).
At that time, people could search for characters in binary data.

Since Ruby 1.9, Ruby raises an exception when a regexp is matched against a broken string.
As a result, character-wise regexp matching against binary data has become hard.
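The behaviour described above can be reproduced with a short script. This is a minimal sketch and not part of the patch: the helper name `try_match` is hypothetical, and the two workarounds shown (`String#scrub` and a byte-wise search on a binary copy) are existing alternatives, not the proposed feature.

```ruby
# A UTF-8 string with invalid bytes (\x80, \x81) around a valid "あ" (E3 81 82).
broken = "\x80\xE3\x81\x82\x81".dup.force_encoding(Encoding::UTF_8)

# Hypothetical helper: returns the match index, or the error message if
# Ruby refuses to match against the broken string.
def try_match(re, str)
  re =~ str
rescue ArgumentError => e
  e.message
end

# Matching directly against the broken string is expected to fail.
direct = try_match(/\p{Hiragana}/, broken)

# Workaround 1: replace invalid bytes first. This alters the data.
scrubbed = broken.scrub("?")
scrubbed_index = try_match(/\p{Hiragana}/, scrubbed)

# Workaround 2: search byte-wise on a binary copy. This keeps the data
# intact but gives up character-wise matching such as \p{Hiragana}.
binary_index = broken.b.index("\xE3\x81\x82".b)

puts "direct match:   #{direct.inspect}"
puts "after scrub:    #{scrubbed_index.inspect}"
puts "byte-wise find: #{binary_index.inspect}"
```

Neither workaround gives character-wise matching on the untouched data, which is the gap this issue is about.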

The following patch allows such matching via the constant Regexp::LOOSEENCODING.

commit eb0111ff7ae3f563ce201c4a5f724f121336d42d
Author: NARUSE, Yui <naruse@ruby-lang.org>
Date: Mon Jul 22 05:37:44 2013 +0900

* Regexp
  * New constant:
    * Regexp::ENCODINGLOOSE: declare execute matching even if the target string
      is invalid byte sequence. [experimental]

diff --git a/NEWS b/NEWS
index f5fe388..ade0b03 100644
--- a/NEWS
+++ b/NEWS
@@ -35,6 +35,11 @@ with all sufficient information, see the ChangeLog file.
 * misc
   * Mutex#owned? is no longer experimental.

+* Regexp
+  * New constant:
+    * Regexp::ENCODINGLOOSE: declare execute matching even if the target string
+      is invalid byte sequence. [experimental]
+
 * String
   * New methods:
     * String#scrub and String#scrub! verify and fix invalid byte sequence.
diff --git a/re.c b/re.c
index e5cc79d..230a2e0 100644
--- a/re.c
+++ b/re.c
@@ -256,6 +256,7 @@ rb_memsearch(const void *x0, long m, const void *y0, long n, rb_encoding *enc)

 #define REG_LITERAL FL_USER5
 #define REG_ENCODING_NONE FL_USER6
+#define REG_ENCODING_LOOSE FL_USER7

 #define KCODE_FIXED FL_USER4

@@ -263,6 +264,7 @@ rb_memsearch(const void *x0, long m, const void *y0, long n, rb_encoding *enc)
     (ONIG_OPTION_IGNORECASE|ONIG_OPTION_MULTILINE|ONIG_OPTION_EXTEND)
 #define ARG_ENCODING_FIXED 16
 #define ARG_ENCODING_NONE 32
+#define ARG_ENCODING_LOOSE 64

 static int
 char_to_option(int c)
@@ -1251,7 +1253,8 @@ rb_reg_prepare_enc(VALUE re, VALUE str, int warn)
 {
     rb_encoding *enc = 0;

-    if (rb_enc_str_coderange(str) == ENC_CODERANGE_BROKEN) {
+    if (!(RBASIC(re)->flags & REG_ENCODING_LOOSE) &&
+        rb_enc_str_coderange(str) == ENC_CODERANGE_BROKEN) {
         rb_raise(rb_eArgError,
             "invalid byte sequence in %s",
             rb_enc_name(rb_enc_get(str)));
@@ -2433,6 +2436,9 @@ rb_reg_initialize(VALUE obj, const char *s, long len, rb_encoding *enc,
     if (options & ARG_ENCODING_NONE) {
         re->basic.flags |= REG_ENCODING_NONE;
     }
+    if (options & ARG_ENCODING_LOOSE) {
+        re->basic.flags |= REG_ENCODING_LOOSE;
+    }

     re->ptr = make_regexp(RSTRING_PTR(unescaped), RSTRING_LEN(unescaped), enc,
                           options & ARG_REG_OPTION_MASK, err,
@@ -3091,6 +3097,7 @@ rb_reg_options(VALUE re)
     options = RREGEXP(re)->ptr->options & ARG_REG_OPTION_MASK;
     if (RBASIC(re)->flags & KCODE_FIXED) options |= ARG_ENCODING_FIXED;
     if (RBASIC(re)->flags & REG_ENCODING_NONE) options |= ARG_ENCODING_NONE;
+    if (RBASIC(re)->flags & REG_ENCODING_LOOSE) options |= ARG_ENCODING_LOOSE;
     return options;
 }

@@ -3579,6 +3586,8 @@ Init_Regexp(void)
     rb_define_const(rb_cRegexp, "FIXEDENCODING", INT2FIX(ARG_ENCODING_FIXED));
     /* see Regexp.options and Regexp.new */
     rb_define_const(rb_cRegexp, "NOENCODING", INT2FIX(ARG_ENCODING_NONE));
+    /* see Regexp.options and Regexp.new */
+    rb_define_const(rb_cRegexp, "LOOSEENCODING", INT2FIX(ARG_ENCODING_LOOSE));

     rb_global_variable(&reg_cache);


diff --git a/string.c b/string.c
index 1d784e3..caf0baf 100644
--- a/string.c
+++ b/string.c
@@ -3970,7 +3970,7 @@ str_gsub(int argc, VALUE *argv, VALUE str, int bang)
     cp = sp;
     str_enc = STR_ENC_GET(str);
     rb_enc_associate(dest, str_enc);
-    ENC_CODERANGE_SET(dest, rb_enc_asciicompat(str_enc) ? ENC_CODERANGE_7BIT : ENC_CODERANGE_VALID);
+    /*ENC_CODERANGE_SET(dest, rb_enc_asciicompat(str_enc) ? ENC_CODERANGE_7BIT : ENC_CODERANGE_VALID);*/

     do {
         n++;
diff --git a/test/ruby/test_regexp.rb b/test/ruby/test_regexp.rb
index 11e86ec..b8f6897 100644
--- a/test/ruby/test_regexp.rb
+++ b/test/ruby/test_regexp.rb
@@ -8,6 +8,10 @@ class TestRegexp < Test::Unit::TestCase
     $VERBOSE = nil
   end

+  def u(str)
+    str.dup.force_encoding(Encoding::UTF_8)
+  end
+
   def teardown
     $VERBOSE = @verbose
   end
@@ -958,6 +962,17 @@ class TestRegexp < Test::Unit::TestCase
     }
   end

+  def test_encoding_loose
+    str = u("\x80\xE3\x81\x82\x81")
+    assert_equal(0, Regexp.new(".", Regexp::LOOSEENCODING) =~ str)
+    assert_equal(1, Regexp.new(u('\p{Any}'), Regexp::LOOSEENCODING) =~ str)
+    assert_equal(1, Regexp.new("\u3042", Regexp::LOOSEENCODING) =~ str)
+    assert_equal(1, Regexp.new(u('\p{Hiragana}'), Regexp::LOOSEENCODING) =~ str)
+    assert_equal(0, Regexp.new(u('\A.\p{Hiragana}.\z'), Regexp::LOOSEENCODING) =~ str)
+    str = u("\xf1\x80\xE3\x81\x82\x81")
+    assert_equal(0, Regexp.new(u('\A..\p{Hiragana}.\z'), Regexp::LOOSEENCODING) =~ str)
+  end
+
   # This assertion is for porting x2() tests in testpy.py of Onigmo.
   def assert_match_at(re, str, positions, msg = nil)
     re = Regexp.new(re) unless re.is_a?(Regexp)
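The option constants touched by the patch are bit flags that Regexp#options reports back. The sketch below relies only on constants that exist in released Ruby (Regexp::FIXEDENCODING and Regexp::NOENCODING). Regexp::LOOSEENCODING is only proposed here, so the code checks for it with const_defined? and does not assume it is available; the variable name `loose_supported` is hypothetical.

```ruby
# Existing encoding-related option bits (ARG_ENCODING_FIXED / ARG_ENCODING_NONE in re.c).
fixed_bit = Regexp::FIXEDENCODING
none_bit  = Regexp::NOENCODING

# The /n modifier sets NOENCODING and /u sets FIXEDENCODING;
# both are visible through Regexp#options.
binary_re = /a/n
utf8_re   = /a/u

puts "NOENCODING set on /a/n:    #{(binary_re.options & none_bit) != 0}"
puts "FIXEDENCODING set on /a/u: #{(utf8_re.options & fixed_bit) != 0}"

# The patch would add a third bit (64). Only use it if this Ruby has it.
loose_supported = Regexp.const_defined?(:LOOSEENCODING)

if loose_supported
  str = "\x80\xE3\x81\x82\x81".dup.force_encoding(Encoding::UTF_8)
  re  = Regexp.new(".", Regexp::LOOSEENCODING)
  puts "loose match index: #{(re =~ str).inspect}"
else
  puts "Regexp::LOOSEENCODING is not defined in this Ruby"
end
```

On a Ruby built without the patch, the last branch is the one that runs, so the script stays safe to execute either way.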

Updated by matz (Yukihiro Matsumoto) over 12 years ago
#1 [ruby-core:56183]

Updated by duerst (Martin Dürst) over 12 years ago
#2

Updated by naruse (Yui NARUSE) over 12 years ago
#3 [ruby-core:56214]

Updated by hsbt (Hiroshi SHIBATA) about 12 years ago
#4 [ruby-core:60303]

Updated by naruse (Yui NARUSE) over 8 years ago
#5
