Feature #18653
closedUse RE2 for Regexp
Description
I have tried to use RE2 as Ruby's regular expression engine. As it turns out, it seems difficult to merge it to Ruby right now. But I'd like to share some of my findings for those who may consider doing the same in the future.
What I did¶
Here is my prototype.
https://github.com/mame/ruby/tree/re2-prototype
My prototype first attempts to use RE2 for any Regexp matching. If it is impossible for any reason, it degenerates into onigmo, which is the regular expression engine that Ruby is currently using. This hybrid approach is the same as TruffleRuby's. Actually I was inspired by @eregon's talk at RubyKaigi takeout 2021.
My prototype degenerates into onigmo in the following cases.
- A Regexp uses any feature that RE2 does not support
- For example, lookahead, back references (
\1
), and many advanced features are not supported by RE2. - Notably, RE2 does not support a large repeat like
/a{0,9999}/
, which is often used in some actual projects.
- For example, lookahead, back references (
- The encoding of a match target string is not in UTF-8, US-ASCII, or ASCII-8BIT
- This is because RE2 supports only UTF-8 and Latin1 encoding.
- This means, the backend engine is not determined when the Regexp object is created. It can switch to each other depending on the encoding of a match target string.
- A Regexp uses any option but
//m
.- Even
//i
degenerates to onigmo because they are incompatible. In RE2,/ss/i =~ "ß"
does not match. - We may support
//x
by preprocessing a pattern string to remove spaces.
- Even
- A Regexp includes
^
.- In onigmo,
^
matches "the beginning of the string" or "after \n and before any character". - In RE2,
^
matches "the beginning of the string" or "after \n". - For example,
"abc\n" =~ /^$/
does not match in onigmo, but does in RE2. Some actual projects (like rdoc) seem to depend on this behavior of onigmo.
- In onigmo,
- A Regexp includes
\b
.- In onigmo,
\b
matches the boundary of space and non-space. - In RE2,
\b
matches ASCII word boundary. - For example,
"α " =~ /.\b./
does match in onigmo, but does not in RE2.
- In onigmo,
Also, my prototype applies some preprocessing to a pattern string. For example, it replaces \s
with /[\t\n\v\f\r\x20]/
because /\s/
does not match with "\v"
in RE2.
Evaluation¶
My prototype passes almost all tests in make test-all
(except some tests that are checking warning messages emitted from onigmo).
By running rails new foo && cd foo && rails s
, half of Regexps are processed by RE2: 865 unique Regexps are processed by RE2, and 811 unique Regexps are by onigmo.
I think that it is possible to increase the percentage of RE2 by increasing custom preprocessing, but I am not sure that it would pay for the complexity of adding new code.
In terms of performance, make rdoc
takes 20.2s before the patch, and 22.6s after the patch ;-( I guess that degeneration to onigmo is the main overhead.
Notes / Problems¶
- According to
make test-spec
, there are still some minor incompatibilities: for example,/(a|())*/.match("aaa"); $1 #=> RE2: "a", onigmo: ""
. - One of the main motivations to use RE2 is security measure against ReDoS. However, RE2 supports only UTF-8 and Latin1, so it seems difficult for us to satisfy this motivation (unless we decide to ignore non-ASCII / non-UTF-8 encodings).
- My prototype requires C++ compiler because RE2 only provides C++ API.
- RE2 seems not to provide interruption API. So, we cannot stop RE2 matching by Ctrl+C. (In general, RE2 matching is faster enough, but it can take longer when a target string is long enough.)
Updated by vo.x (Vit Ondruch) over 2 years ago
Could you please elaborate what was the motivation for this experiment? Was it to evaluate compatibility? Or possible speed gains?
Updated by mame (Yusuke Endoh) over 2 years ago
vo.x (Vit Ondruch) wrote in #note-1:
Could you please elaborate what was the motivation for this experiment?
My original motivation was a security measure against ReDoS. If RE2 worked well, we might not have to introduce Regexp.timeout= (#17837).
Also, I just wanted to give it a try because @eregon's talk in RubyKaigi 2021 was interesting to me.
Updated by duerst (Martin Dürst) over 2 years ago
mame (Yusuke Endoh) wrote in #note-2:
My original motivation was a security measure against ReDoS. If RE2 worked well, we might not have to introduce Regexp.timeout= (#17837).
Aren't there still regular expressions that RE2 can't handle but that may be used as ReDoS, or may otherwise take way longer than the person who wrote it thinks?
Updated by mame (Yusuke Endoh) over 2 years ago
duerst (Martin Dürst) wrote in #note-3:
Aren't there still regular expressions that RE2 can't handle but that may be used as ReDoS
Yes there are. According to TruffleRuby's approach, the interpreter warns a Regexp that RE2 can't handle, and encourage users to rewrite the Regexp by using only features that RE2 supports.
Precisely, TruffleRuby does not use RE2 but TRegex which is their custom DFA-based implementation that is much more compatible with oniguruma/onigmo than RE2.