https://bugs.ruby-lang.org/https://bugs.ruby-lang.org/favicon.ico?17113305112022-11-07T02:32:21ZRuby Issue Tracking SystemRuby master - Feature #19104: Introduce the cache-based optimization for Regexp matchinghttps://bugs.ruby-lang.org/issues/19104?journal_id=999642022-11-07T02:32:21Zmatz (Yukihiro Matsumoto)matz@ruby.or.jp
<ul></ul><p>It sounds good. My only concern is memory consumption, but I don't think guessing game won't work for this case.<br>
Merge this PR and experiment with 3.1 previews.</p>
<p>Matz.</p> Ruby master - Feature #19104: Introduce the cache-based optimization for Regexp matchinghttps://bugs.ruby-lang.org/issues/19104?journal_id=999652022-11-07T02:43:03Znobu (Nobuyoshi Nakada)nobu@ruby-lang.org
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/99965/diff?detail_id=63444">diff</a>)</li></ul> Ruby master - Feature #19104: Introduce the cache-based optimization for Regexp matchinghttps://bugs.ruby-lang.org/issues/19104?journal_id=999852022-11-07T18:26:23ZEregon (Benoit Daloze)
<ul></ul><p>From <a href="https://bugs.ruby-lang.org/issues/19074#note-6" class="external">https://bugs.ruby-lang.org/issues/19074#note-6</a> from <a class="user active user-mention" href="https://bugs.ruby-lang.org/users/18">@mame (Yusuke Endoh)</a>:</p>
<blockquote>
<p>... (2) should we provide a way to tell users if a given regexp is optimizable or not (e.g., a warning at unoptimizable regexp creation, a new API like <code>Regexp#optimizable?</code>, or a new opt-in Regexp flag <code>/foo/r</code> to raise a SyntaxError when it is not optimizable, etc.)</p>
</blockquote>
<p>I think a command-line flag or a new warning category is best for this.<br>
For instance I envisioned <code>--warn-slow-regexps</code> in <a href="https://eregon.me/blog/assets/research/just-in-time-compiling-ruby-regexps-on-truffleruby.pdf" class="external">https://eregon.me/blog/assets/research/just-in-time-compiling-ruby-regexps-on-truffleruby.pdf</a> slide 18.<br>
A warning category is nice because it can be used to e.g. raise a exception on such slow Regexps and so ensure an application doesn't use any such Regexp.</p>
<p>"a new API like <code>Regexp#optimizable?</code>" seems useless to me, the value of this is checking all Regexps in the system can be efficiently executed.<br>
Same problem with the opt-in flag, to be useful it needs to apply to all Regexps.</p>
<p>We need to keep in mind that the set of "slow regexps" can change per Ruby/Regexp implementation and over time.<br>
Although it seems it's very very similar features being problematic between this approach and <a href="https://eregon.me/blog/assets/research/just-in-time-compiling-ruby-regexps-on-truffleruby.pdf" class="external">https://eregon.me/blog/assets/research/just-in-time-compiling-ruby-regexps-on-truffleruby.pdf</a> slide 11.</p>
<p>I'll summarize it here.</p>
<p>Not supported on both:</p>
<ul>
<li>back-reference <code>\1, \k<name></code> in the Regexp (not in replacement strings: #gsub)</li>
<li>recursive subexpression call <code>(?<sqbr>[\g<sqbr>*])</code>
</li>
<li>negative lookahead <code>(?!)</code> and lookbehind <code>(?<!)</code>
</li>
<li>atomic groups <code>(?>)</code>
</li>
<li>absent operator <code>(?~)</code>
</li>
<li>possessive quantifiers <code>*+</code>, <code>++</code>, <code>?+</code>, <code>{n,m}+</code>
</li>
<li>too large bounded or fixed times repetition (e.g. <code>/(a|b){100000,200000}/</code>)</li>
</ul>
<p>Not supported on CRuby, but supported on TruffleRuby:</p>
<ul>
<li>non-recursive subexpression call</li>
<li>positive lookahead and lookbehind</li>
</ul>
<p>More might be supported in the future.</p> Ruby master - Feature #19104: Introduce the cache-based optimization for Regexp matchinghttps://bugs.ruby-lang.org/issues/19104?journal_id=1001182022-11-16T07:37:26Zbyroot (Jean Boussier)byroot@ruby-lang.org
<ul></ul><p>I believe this change may have introduced a weird bug which is causing the <code>sass</code> gem to fail in unpredictable ways. I was able to produce a short reproduction script:</p>
<pre><code class="ruby syntaxhl" data-language="ruby"><span class="k">module</span> <span class="nn">Sass</span>
<span class="no">H</span> <span class="o">=</span> <span class="sr">/[0-9a-fA-F]/</span>
<span class="no">UNICODE</span> <span class="o">=</span> <span class="sr">/\\</span><span class="si">#{</span><span class="no">H</span><span class="si">}</span><span class="sr">{1,6}[ \t\r\n\f]?/</span>
<span class="n">s</span> <span class="o">=</span> <span class="s1">'\u{80}-\u{D7FF}\u{E000}-\u{FFFD}\u{10000}-\u{10FFFF}'</span>
<span class="no">NONASCII</span> <span class="o">=</span> <span class="sr">/[</span><span class="si">#{</span><span class="n">s</span><span class="si">}</span><span class="sr">]/</span>
<span class="no">ESCAPE</span> <span class="o">=</span> <span class="sr">/</span><span class="si">#{</span><span class="no">UNICODE</span><span class="si">}</span><span class="sr">|\\[^0-9a-fA-F\r\n\f]/</span>
<span class="no">NMSTART</span> <span class="o">=</span> <span class="sr">/[_a-zA-Z]|</span><span class="si">#{</span><span class="no">NONASCII</span><span class="si">}</span><span class="sr">|</span><span class="si">#{</span><span class="no">ESCAPE</span><span class="si">}</span><span class="sr">/</span>
<span class="no">NMCHAR</span> <span class="o">=</span> <span class="sr">/[a-zA-Z0-9_-]|</span><span class="si">#{</span><span class="no">NONASCII</span><span class="si">}</span><span class="sr">|</span><span class="si">#{</span><span class="no">ESCAPE</span><span class="si">}</span><span class="sr">/</span>
<span class="no">VALID_UNIT</span> <span class="o">=</span> <span class="sr">/</span><span class="si">#{</span><span class="no">NMSTART</span><span class="si">}#{</span><span class="no">NMCHAR</span><span class="si">}</span><span class="sr">|%*/</span>
<span class="mi">100_000</span><span class="p">.</span><span class="nf">times</span> <span class="k">do</span>
<span class="nb">print</span> <span class="s1">'.'</span>
<span class="k">raise</span> <span class="s2">"WTF?"</span> <span class="k">if</span> <span class="s2">"%"</span> <span class="o">!~</span> <span class="no">VALID_UNIT</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre>
<p>The above regexp will sometime not match after a random number of iterations.</p>
<p>cc <a class="user active user-mention" href="https://bugs.ruby-lang.org/users/18">@mame (Yusuke Endoh)</a></p> Ruby master - Feature #19104: Introduce the cache-based optimization for Regexp matchinghttps://bugs.ruby-lang.org/issues/19104?journal_id=1001212022-11-16T14:09:21Zbyroot (Jean Boussier)byroot@ruby-lang.org
<ul></ul><p>Not sure if helpful, but after working around this bug, we realized that it wasn't just returning the wrong value.</p>
<p>Avoiding this codepath also caused a lot of weird crashes to vanish (e.g. <code>Segmentation fault at 0x0000000000000000</code> and <code>[BUG] Tried to mark T_NONE</code>). So I heavily suspect that one of the side effects of this bug is to overwrite parts of the heap with <code>0</code>s.</p> Ruby master - Feature #19104: Introduce the cache-based optimization for Regexp matchinghttps://bugs.ruby-lang.org/issues/19104?journal_id=1001232022-11-16T14:47:08Zmake_now_just (Hiroya Fujinami)make.just.on@gmail.com
<ul></ul><p>I created a pull request for the report. Thanks <a class="user active user-mention" href="https://bugs.ruby-lang.org/users/7941">@byroot (Jean Boussier)</a>.</p>
<p><a href="https://github.com/ruby/ruby/pull/6744" class="external">https://github.com/ruby/ruby/pull/6744</a></p> Ruby master - Feature #19104: Introduce the cache-based optimization for Regexp matchinghttps://bugs.ruby-lang.org/issues/19104?journal_id=1005422022-12-10T06:41:14Zmmizutani (Minoru Mizutani)
<ul></ul><p>Regex fuzzing encountered an edge-case regression:</p>
<pre><code class="bash syntaxhl" data-language="bash"><span class="nv">$ </span>ruby <span class="nt">--version</span>
ruby 3.1.2p20 <span class="o">(</span>2022-04-12 revision 4491bb740a<span class="o">)</span> <span class="o">[</span>x86_64-linux]
<span class="nv">$ </span><span class="nb">time </span>ruby <span class="nt">-e</span> <span class="s1">'puts /あ(?~い)/.match?("あ")'</span>
<span class="nb">true
</span>0.04s user 0.04s system 74% cpu 0.118 total
</code></pre>
<pre><code class="bash syntaxhl" data-language="bash"><span class="nv">$ </span>ruby <span class="nt">--version</span>
ruby 3.2.0rc1 <span class="o">(</span>2022-12-10<span class="o">)</span> <span class="o">[</span>x86_64-linux]
<span class="nv">$ </span><span class="nb">time </span>ruby <span class="nt">-e</span> <span class="s1">'puts /あ(?~い)/.match?("あ")'</span>
<span class="c"># This regex matching involving an absent operator and non-ASCII characters runs indefinitely and consumes gigabytes of memory, eventually killed by the OS.</span>
</code></pre> Ruby master - Feature #19104: Introduce the cache-based optimization for Regexp matchinghttps://bugs.ruby-lang.org/issues/19104?journal_id=1005462022-12-10T19:25:16Zmame (Yusuke Endoh)mame@ruby-lang.org
<ul></ul><p>mmizutani (Minoru Mizutani) wrote in <a href="#note-7">#note-7</a>:</p>
<blockquote>
<p>Regex fuzzing encountered an edge-case regression:</p>
</blockquote>
<p>Thanks. This has nothing to do with this ticket. It's a different issue caused by my change <a class="changeset" title="Prevent potential buffer overrun in onigmo A code pattern `p + enclen(enc, p, pend)` may lead to..." href="https://bugs.ruby-lang.org/projects/ruby-master/repository/git/revisions/1d2d25dcadda0764f303183ac091d0c87b432566">1d2d25dcadda0764f303183ac091d0c87b432566</a> .</p> Ruby master - Feature #19104: Introduce the cache-based optimization for Regexp matchinghttps://bugs.ruby-lang.org/issues/19104?journal_id=1005672022-12-12T04:56:22Znobu (Nobuyoshi Nakada)nobu@ruby-lang.org
<ul><li><strong>Status</strong> changed from <i>Open</i> to <i>Closed</i></li></ul><p>Applied in changeset <a class="changeset" title="[DOC] NEWS about [Feature #19104]" href="https://bugs.ruby-lang.org/projects/ruby-master/repository/git/revisions/f093b619a4863be96e6ebfa2fd58c77f4a360eae">git|f093b619a4863be96e6ebfa2fd58c77f4a360eae</a>.</p>
<hr>
<p>[DOC] NEWS about [Feature <a class="issue tracker-2 status-5 priority-4 priority-default closed" title="Feature: Introduce the cache-based optimization for Regexp matching (Closed)" href="https://bugs.ruby-lang.org/issues/19104">#19104</a>]</p>