Project

General

Profile

Actions

Feature #18653

closed

Use RE2 for Regexp

Added by mame (Yusuke Endoh) almost 3 years ago. Updated almost 3 years ago.

Status:
Rejected
Assignee:
-
Target version:
-
[ruby-core:108016]

Description

I have tried to use RE2 as Ruby's regular expression engine. As it turns out, it seems difficult to merge it to Ruby right now. But I'd like to share some of my findings for those who may consider doing the same in the future.

What I did

Here is my prototype.

https://github.com/mame/ruby/tree/re2-prototype

My prototype first attempts to use RE2 for any Regexp matching. If it is impossible for any reason, it degenerates into onigmo, which is the regular expression engine that Ruby is currently using. This hybrid approach is the same as TruffleRuby's. Actually I was inspired by @eregon's talk at RubyKaigi takeout 2021.

My prototype degenerates into onigmo in the following cases.

  • A Regexp uses any feature that RE2 does not support
    • For example, lookahead, back references (\1), and many advanced features are not supported by RE2.
    • Notably, RE2 does not support a large repeat like /a{0,9999}/, which is often used in some actual projects.
  • The encoding of a match target string is not in UTF-8, US-ASCII, or ASCII-8BIT
    • This is because RE2 supports only UTF-8 and Latin1 encoding.
    • This means, the backend engine is not determined when the Regexp object is created. It can switch to each other depending on the encoding of a match target string.
  • A Regexp uses any option but //m.
    • Even //i degenerates to onigmo because they are incompatible. In RE2, /ss/i =~ "ß" does not match.
    • We may support //x by preprocessing a pattern string to remove spaces.
  • A Regexp includes ^.
    • In onigmo, ^ matches "the beginning of the string" or "after \n and before any character".
    • In RE2, ^ matches "the beginning of the string" or "after \n".
    • For example, "abc\n" =~ /^$/ does not match in onigmo, but does in RE2. Some actual projects (like rdoc) seem to depend on this behavior of onigmo.
  • A Regexp includes \b.
    • In onigmo, \b matches the boundary of space and non-space.
    • In RE2, \b matches ASCII word boundary.
    • For example, "α " =~ /.\b./ does match in onigmo, but does not in RE2.

Also, my prototype applies some preprocessing to a pattern string. For example, it replaces \s with /[\t\n\v\f\r\x20]/ because /\s/ does not match with "\v" in RE2.

Evaluation

My prototype passes almost all tests in make test-all (except some tests that are checking warning messages emitted from onigmo).

By running rails new foo && cd foo && rails s, half of Regexps are processed by RE2: 865 unique Regexps are processed by RE2, and 811 unique Regexps are by onigmo.

I think that it is possible to increase the percentage of RE2 by increasing custom preprocessing, but I am not sure that it would pay for the complexity of adding new code.

In terms of performance, make rdoc takes 20.2s before the patch, and 22.6s after the patch ;-( I guess that degeneration to onigmo is the main overhead.

Notes / Problems

  • According to make test-spec, there are still some minor incompatibilities: for example, /(a|())*/.match("aaa"); $1 #=> RE2: "a", onigmo: "".
  • One of the main motivations to use RE2 is security measure against ReDoS. However, RE2 supports only UTF-8 and Latin1, so it seems difficult for us to satisfy this motivation (unless we decide to ignore non-ASCII / non-UTF-8 encodings).
  • My prototype requires C++ compiler because RE2 only provides C++ API.
  • RE2 seems not to provide interruption API. So, we cannot stop RE2 matching by Ctrl+C. (In general, RE2 matching is faster enough, but it can take longer when a target string is long enough.)
Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0