Feature #8678

Allow invalid string to work with regexp

Added by Yui NARUSE 9 months ago. Updated 3 months ago.

[ruby-core:<unknown>]
Status:Assigned
Priority:Normal
Assignee:Yukihiro Matsumoto
Category:M17N
Target version:current: 2.2.0

Description

Legacy Ruby 1.8 could regexp match with broken strings.
People can find characters from binary data on the age.

After Ruby 1.9, Ruby raises Exception if it does regexp match with broken strings.
So it became hard to work with character-wise regexp matching with binary data.

Following patch allows it with the constant Regexp::LOOSEENCODING.

commit eb0111ff7ae3f563ce201c4a5f724f121336d42d
Author: NARUSE, Yui naruse@ruby-lang.org
Date: Mon Jul 22 05:37:44 2013 +0900

* Regexp
  * New constant:
    * Regexp::ENCODINGLOOSE: declare execute matching even if the target string
      is invalid byte sequence. [experimental]

diff --git a/NEWS b/NEWS
index f5fe388..ade0b03 100644
--- a/NEWS
+++ b/NEWS
@@ -35,6 +35,11 @@ with all sufficient information, see the ChangeLog file.
* misc
* Mutex#owned? is no longer experimental.

+* Regexp
+ * New constant:
+ * Regexp::ENCODINGLOOSE: declare execute matching even if the target string
+ is invalid byte sequence. [experimental]
+
* String
* New methods:
* String#scrub and String#scrub! verify and fix invalid byte sequence.
diff --git a/re.c b/re.c
index e5cc79d..230a2e0 100644
--- a/re.c
+++ b/re.c
@@ -256,6 +256,7 @@ rbmemsearch(const void *x0, long m, const void *y0, long n, rbencoding *enc)

#define REGLITERAL FLUSER5
#define REGENCODINGNONE FLUSER6
+#define REG
ENCODINGLOOSE FLUSER7

#define KCODEFIXED FLUSER4

@@ -263,6 +264,7 @@ rbmemsearch(const void *x0, long m, const void *y0, long n, rbencoding *enc)
(ONIGOPTIONIGNORECASE|ONIGOPTIONMULTILINE|ONIGOPTIONEXTEND)
#define ARGENCODINGFIXED 16
#define ARGENCODINGNONE 32
+#define ARGENCODINGLOOSE 64

static int
chartooption(int c)
@@ -1251,7 +1253,8 @@ rbregprepareenc(VALUE re, VALUE str, int warn)
{
rb
encoding *enc = 0;

  • if (rbencstrcoderange(str) == ENCCODERANGE_BROKEN) {
  • if (!(RBASIC(re)->flags & REGENCODINGLOOSE) &&
  • rbencstrcoderange(str) == ENCCODERANGEBROKEN) { rbraise(rbeArgError, "invalid byte sequence in %s", rbencname(rbencget(str))); @@ -2433,6 +2436,9 @@ rbreginitialize(VALUE obj, const char *s, long len, rbencoding *enc, if (options & ARGENCODINGNONE) { re->basic.flags |= REGENCODINGNONE; }
  • if (options & ARGENCODINGLOOSE) {
  • re->basic.flags |= REGENCODINGLOOSE;
  • }

    re->ptr = makeregexp(RSTRINGPTR(unescaped), RSTRINGLEN(unescaped), enc,
    options & ARG
    REGOPTIONMASK, err,
    @@ -3091,6 +3097,7 @@ rbregoptions(VALUE re)
    options = RREGEXP(re)->ptr->options & ARGREGOPTIONMASK;
    if (RBASIC(re)->flags & KCODE
    FIXED) options |= ARGENCODINGFIXED;
    if (RBASIC(re)->flags & REGENCODINGNONE) options |= ARGENCODINGNONE;

  • if (RBASIC(re)->flags & REGENCODINGLOOSE) options |= ARGENCODINGLOOSE;
    return options;
    }

@@ -3579,6 +3586,8 @@ InitRegexp(void)
rb
defineconst(rbcRegexp, "FIXEDENCODING", INT2FIX(ARGENCODINGFIXED));
/* see Regexp.options and Regexp.new /
rbdefineconst(rbcRegexp, "NOENCODING", INT2FIX(ARGENCODING_NONE));
+ /
see Regexp.options and Regexp.new */
+ rbdefineconst(rbcRegexp, "LOOSEENCODING", INT2FIX(ARGENCODING_LOOSE));

 rb_global_variable(&reg_cache);

diff --git a/string.c b/string.c
index 1d784e3..caf0baf 100644
--- a/string.c
+++ b/string.c
@@ -3970,7 +3970,7 @@ strgsub(int argc, VALUE *argv, VALUE str, int bang)
cp = sp;
str
enc = STRENCGET(str);
rbencassociate(dest, strenc);
- ENC
CODERANGESET(dest, rbencasciicompat(strenc) ? ENCCODERANGE7BIT : ENCCODERANGEVALID);
+ /ENCCODERANGESET(dest, rbencasciicompat(strenc) ? ENCCODERANGE7BIT : ENCCODERANGE_VALID);/

 do {
n++;

diff --git a/test/ruby/testregexp.rb b/test/ruby/testregexp.rb
index 11e86ec..b8f6897 100644
--- a/test/ruby/testregexp.rb
+++ b/test/ruby/test
regexp.rb
@@ -8,6 +8,10 @@ class TestRegexp < Test::Unit::TestCase
$VERBOSE = nil
end

  • def u(str)
  • str.dup.forceencoding(Encoding::UTF8)
  • end
    +
    def teardown
    $VERBOSE = @verbose
    end
    @@ -958,6 +962,17 @@ class TestRegexp < Test::Unit::TestCase
    }
    end

  • def testencodingloose

  • str = u("\x80\xE3\x81\x82\x81")

  • assert_equal(0, Regexp.new(".", Regexp::LOOSEENCODING) =~ str)

  • assert_equal(1, Regexp.new(u('\p{Any}'), Regexp::LOOSEENCODING) =~ str)

  • assert_equal(1, Regexp.new("\u3042", Regexp::LOOSEENCODING) =~ str)

  • assert_equal(1, Regexp.new(u('\p{Hiragana}'), Regexp::LOOSEENCODING) =~ str)

  • assert_equal(0, Regexp.new(u('\A.\p{Hiragana}.\z'), Regexp::LOOSEENCODING) =~ str)

  • str = u("\xf1\x80\xE3\x81\x82\x81")

  • assert_equal(0, Regexp.new(u('\A..\p{Hiragana}.\z'), Regexp::LOOSEENCODING) =~ str)

  • end
    +

    This assertion is for porting x2() tests in testpy.py of Onigmo.

    def assertmatchat(re, str, positions, msg = nil)
    re = Regexp.new(re) unless re.is_a?(Regexp)

History

#1 Updated by Yukihiro Matsumoto 9 months ago

I am positive. I'd rather want to make this default (if possible).

Matz.

#2 Updated by Martin Dürst 9 months ago

Sorry to be late with my comment.

naruse (Yui NARUSE) wrote:

Legacy Ruby 1.8 could regexp match with broken strings.

Well, in Ruby 1.8, strings were binary, so this isn't much of a surprise.

People can find characters from binary data on the age.

Sorry, I don't understad "on the age"? Can you explain (Japanese is fine).

After Ruby 1.9, Ruby raises Exception if it does regexp match with broken strings.

My understanding is that in Ruby 1.9, we don't test for valid encoding at each corner, because otherwise Ruby would be too slow, but we don't promote or allow operations on invalid data if the check happens anyway.

Creating functionality that is targetted at invalid data starts a slippery slope. We may get more and more requests for places where invalid encoding should produce some "sensible" result, and it will be more and more complex to remember all the rules. "Invalid data doesn't match." is much simpler to work with.

So it became hard to work with character-wise regexp matching with binary data.

Don't we have BINARY encoding for binary data?

How would matching character-wise regexps in binary data actually work? For single-byte encodings, it's very easy, because in many cases, there is no invalid data. For other encodings, in particular UTF-8 and GB-18030 (and also Shift_JIS,...), it may be difficult to define what exactly happens, i.e. what bytes exactly are treated as binary data.

Next, what are the security implications (in particular if this is on by default, as Matz proposes)?

Also, what exactly happens to bytes or byte sequences that are invalid? Are they matched by any part of a regular expression (e.g. a simple /./)? Are they counted for positions? Can they be matched literally with \x? Are they non-word characters? ... Is there a way to match them directly (e.g. by putting invalid data into the regexp)? My guess is that all these questions may have different preferred answers depending on the exact use case.

So the next question is what is the actual use case? Finding sequences that match characters in binary data seems to be the use case, but this can be done by converting the characters being searched for to binary encoding and using a binary regexp on binary data. So I don't see how this provides new functionality. Another way to address it is to clean up the data first, becaus this should anyway happen better sooner than later.

#3 Updated by Yui NARUSE 9 months ago

duerst (Martin Dürst) wrote:

Sorry to be late with my comment.

naruse (Yui NARUSE) wrote:

People can find characters from binary data on the age.

Sorry, I don't understad "on the age"? Can you explain (Japanese is fine).

I mean "During the Ruby 1.8 era people can find characters from binary data by regexp matching"

After Ruby 1.9, Ruby raises Exception if it does regexp match with broken strings.

My understanding is that in Ruby 1.9, we don't test for valid encoding at each corner, because otherwise Ruby would be too slow, but we don't promote or allow operations on invalid data if the check happens anyway.

This affect only for regexp related one.
Other methods are note affected.

Creating functionality that is targetted at invalid data starts a slippery slope. We may get more and more requests for places where invalid encoding should produce some "sensible" result, and it will be more and more complex to remember all the rules. "Invalid data doesn't match." is much simpler to work with.

It is OK if the request is good one.

So it became hard to work with character-wise regexp matching with binary data.

Don't we have BINARY encoding for binary data?

How would matching character-wise regexps in binary data actually work? For single-byte encodings, it's very easy, because in many cases, there is no invalid data. For other encodings, in particular UTF-8 and GB-18030 (and also Shift_JIS,...), it may be difficult to define what exactly happens, i.e. what bytes exactly are treated as binary data.

As far as I tested, the behavior of my patch is likely to what people will expect.

Next, what are the security implications (in particular if this is on by default, as Matz proposes)?

It is what I mainly worry about.

Also, what exactly happens to bytes or byte sequences that are invalid? Are they matched by any part of a regular expression (e.g. a simple /./)? Are they counted for positions? Can they be matched literally with \x? Are they non-word characters? ... Is there a way to match them directly (e.g. by putting invalid data into the regexp)? My guess is that all these questions may have different preferred answers depending on the exact use case.

See tests in my patch.

So the next question is what is the actual use case? Finding sequences that match characters in binary data seems to be the use case, but this can be done by converting the characters being searched for to binary encoding and using a binary regexp on binary data. So I don't see how this provides new functionality. Another way to address it is to clean up the data first, becaus this should anyway happen better sooner than later.

#4 Updated by Hiroshi SHIBATA 3 months ago

  • Target version changed from 2.1.0 to current: 2.2.0

Also available in: Atom PDF