Project

General

Profile

Actions

Feature #18653

closed

Use RE2 for Regexp

Added by mame (Yusuke Endoh) 10 months ago. Updated 10 months ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:108016]

Description

I have tried to use RE2 as Ruby's regular expression engine. As it turns out, it seems difficult to merge it to Ruby right now. But I'd like to share some of my findings for those who may consider doing the same in the future.

What I did

Here is my prototype.

https://github.com/mame/ruby/tree/re2-prototype

My prototype first attempts to use RE2 for any Regexp matching. If it is impossible for any reason, it degenerates into onigmo, which is the regular expression engine that Ruby is currently using. This hybrid approach is the same as TruffleRuby's. Actually I was inspired by @eregon's talk at RubyKaigi takeout 2021.

My prototype degenerates into onigmo in the following cases.

  • A Regexp uses any feature that RE2 does not support
    • For example, lookahead, back references (\1), and many advanced features are not supported by RE2.
    • Notably, RE2 does not support a large repeat like /a{0,9999}/, which is often used in some actual projects.
  • The encoding of a match target string is not in UTF-8, US-ASCII, or ASCII-8BIT
    • This is because RE2 supports only UTF-8 and Latin1 encoding.
    • This means, the backend engine is not determined when the Regexp object is created. It can switch to each other depending on the encoding of a match target string.
  • A Regexp uses any option but //m.
    • Even //i degenerates to onigmo because they are incompatible. In RE2, /ss/i =~ "ß" does not match.
    • We may support //x by preprocessing a pattern string to remove spaces.
  • A Regexp includes ^.
    • In onigmo, ^ matches "the beginning of the string" or "after \n and before any character".
    • In RE2, ^ matches "the beginning of the string" or "after \n".
    • For example, "abc\n" =~ /^$/ does not match in onigmo, but does in RE2. Some actual projects (like rdoc) seem to depend on this behavior of onigmo.
  • A Regexp includes \b.
    • In onigmo, \b matches the boundary of space and non-space.
    • In RE2, \b matches ASCII word boundary.
    • For example, "α " =~ /.\b./ does match in onigmo, but does not in RE2.

Also, my prototype applies some preprocessing to a pattern string. For example, it replaces \s with /[\t\n\v\f\r\x20]/ because /\s/ does not match with "\v" in RE2.

Evaluation

My prototype passes almost all tests in make test-all (except some tests that are checking warning messages emitted from onigmo).

By running rails new foo && cd foo && rails s, half of Regexps are processed by RE2: 865 unique Regexps are processed by RE2, and 811 unique Regexps are by onigmo.

I think that it is possible to increase the percentage of RE2 by increasing custom preprocessing, but I am not sure that it would pay for the complexity of adding new code.

In terms of performance, make rdoc takes 20.2s before the patch, and 22.6s after the patch ;-( I guess that degeneration to onigmo is the main overhead.

Notes / Problems

  • According to make test-spec, there are still some minor incompatibilities: for example, /(a|())*/.match("aaa"); $1 #=> RE2: "a", onigmo: "".
  • One of the main motivations to use RE2 is security measure against ReDoS. However, RE2 supports only UTF-8 and Latin1, so it seems difficult for us to satisfy this motivation (unless we decide to ignore non-ASCII / non-UTF-8 encodings).
  • My prototype requires C++ compiler because RE2 only provides C++ API.
  • RE2 seems not to provide interruption API. So, we cannot stop RE2 matching by Ctrl+C. (In general, RE2 matching is faster enough, but it can take longer when a target string is long enough.)

Updated by vo.x (Vit Ondruch) 10 months ago

Could you please elaborate what was the motivation for this experiment? Was it to evaluate compatibility? Or possible speed gains?

Updated by mame (Yusuke Endoh) 10 months ago

vo.x (Vit Ondruch) wrote in #note-1:

Could you please elaborate what was the motivation for this experiment?

My original motivation was a security measure against ReDoS. If RE2 worked well, we might not have to introduce Regexp.timeout= (#17837).

Also, I just wanted to give it a try because @eregon's talk in RubyKaigi 2021 was interesting to me.

Updated by duerst (Martin Dürst) 10 months ago

mame (Yusuke Endoh) wrote in #note-2:

My original motivation was a security measure against ReDoS. If RE2 worked well, we might not have to introduce Regexp.timeout= (#17837).

Aren't there still regular expressions that RE2 can't handle but that may be used as ReDoS, or may otherwise take way longer than the person who wrote it thinks?

Updated by mame (Yusuke Endoh) 10 months ago

duerst (Martin Dürst) wrote in #note-3:

Aren't there still regular expressions that RE2 can't handle but that may be used as ReDoS

Yes there are. According to TruffleRuby's approach, the interpreter warns a Regexp that RE2 can't handle, and encourage users to rewrite the Regexp by using only features that RE2 supports.

Precisely, TruffleRuby does not use RE2 but TRegex which is their custom DFA-based implementation that is much more compatible with oniguruma/onigmo than RE2.

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0