<p>Ruby Issue Tracking System (https://bugs.ruby-lang.org/)</p>
<p>Ruby master - Bug #16158: "st" Character Sequence In Regex Look-Behind Causes Illegal Pattern Error When Combined With POSIX Bracket Expressions And Case Insensitivity Flag</p>
<p>michaeltomko (Michael Tomko), 2019-09-09T15:19:04Z (https://bugs.ruby-lang.org/issues/16158?journal_id=81482):</p>
<p>Sorry, I pasted the success examples incorrectly in the original post. Here they are again:</p>
<pre><code>2.5.0 :007 > pat = /((?<!Costa)Mesa|Arlington(?=(\p{Space}|\p{Punct})+(AZ|Arizona)))/
=> /((?<!Costa)Mesa|Arlington(?=(\p{Space}|\p{Punct})+(AZ|Arizona)))/
2.5.0 :008 > pat = /((?<!Costa)Mesa|Arlington(?=([:space:]|[:punct:])+(AZ|Arizona)))/i
=> /((?<!Costa)Mesa|Arlington(?=([:space:]|[:punct:])+(AZ|Arizona)))/i
2.5.0 :009 > pat = /((?<!Costa)Mesa|Arlington(?=(\s|\W)+(AZ|Arizona)))/i
=> /((?<!Costa)Mesa|Arlington(?=(\s|\W)+(AZ|Arizona)))/i
2.5.0 :056 > pat = /(?<!a st)(?i)(?<!juice)\p{Space}/
=> /(?<!a st)(?i)(?<!juice)\p{Space}/
2.5.0 :058 > pat = /(?<!a st)(?i)(?<!stark)\p{Space}/
=> /(?<!a st)(?i)(?<!stark)\p{Space}/
</code></pre>
<p>jeremyevans0 (Jeremy Evans), 2019-09-09T23:49:28Z (https://bugs.ruby-lang.org/issues/16158?journal_id=81495):</p>
<ul><li><strong>Status</strong> changed from <i>Open</i> to <i>Feedback</i></li></ul><p>I tried on Ruby 2.5.6 and was not able to reproduce:</p>
<pre><code>$ irb25
irb(main):001:0> pat = /(?<!a st)\p{Space}/i
=> /(?<!a st)\p{Space}/i
irb(main):002:0> pat = /(?i)(?<!a st)\p{Space}/
=> /(?i)(?<!a st)\p{Space}/
irb(main):003:0> pat = /(?<!Costa)Mesa(\p{Space}|\p{Punct})+(AZ|Arizona)/i
=> /(?<!Costa)Mesa(\p{Space}|\p{Punct})+(AZ|Arizona)/i
irb(main):004:0> RUBY_VERSION
=> "2.5.6"
</code></pre>
<p>Can you please try 2.5.6 and see if the problem has been fixed?</p>
<p>Dan0042 (Daniel DeLorme), 2019-09-11T13:09:57Z (https://bugs.ruby-lang.org/issues/16158?journal_id=81513):</p>
<p>I can reproduce the bug in every Ruby version I tried (2.5.6, 2.6.3, and 2.7.0dev):</p>
<pre><code>$ 2.5 ruby -ve 'pat = /(?<!a st)\p{Space}/i'
ruby 2.5.6p167 (2019-05-30 revision 67709) [x86_64-linux]
-e:1: invalid pattern in look-behind: /(?<!a st)\p{Space}/i
$ 2.6 ruby -ve 'pat = /(?<!a st)\p{Space}/i'
ruby 2.6.3p69 (2019-07-28 revision 67716) [x86_64-linux]
-e:1: invalid pattern in look-behind: /(?<!a st)\p{Space}/i
$ 2.7 ruby -ve 'pat = /(?<!a st)\p{Space}/i'
ruby 2.7.0dev (2019-09-02T16:53:49Z master a848b62819) [x86_64-linux]
-e:1: invalid pattern in look-behind: /(?<!a st)\p{Space}/i
</code></pre>
<p>jeremyevans0 (Jeremy Evans), 2019-09-11T13:28:15Z (https://bugs.ruby-lang.org/issues/16158?journal_id=81514):</p>
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Open</i></li><li><strong>ruby -v</strong> changed from <i>2.5.0, 2.5.3</i> to <i>ruby 2.7.0dev (2019-09-11 master 146677a1e7) [x86_64-openbsd6.5]</i></li></ul><p>This appears to be an issue if the default encoding is UTF-8:</p>
<pre><code>$ ruby -ve 'pat = /(?<!a st)\p{Space}/i'
ruby 2.6.4p104 (2019-08-28 revision 67798) [x86_64-openbsd]
$ LC_CTYPE=en_US.UTF-8 ruby -ve 'pat = /(?<!a st)\p{Space}/i'
ruby 2.6.4p104 (2019-08-28 revision 67798) [x86_64-openbsd]
-e:1: invalid pattern in look-behind: /(?<!a st)\p{Space}/i
</code></pre>
<p>michaeltomko (Michael Tomko), 2019-09-11T15:59:59Z (https://bugs.ruby-lang.org/issues/16158?journal_id=81519):</p>
<p>Thank you both.</p>
<p>I can confirm that the encoding is a factor. It's an issue even if UTF-8 is not the default: the same pattern source compiles when it is tagged as ISO-8859-5 but fails when it is tagged as UTF-8.</p>
<pre><code>2.5.6 :013 > str = "(?<!a st)\\p{Space}".force_encoding("ISO-8859-5")
=> "(?<!a st)\\p{Space}"
2.5.6 :014 > Regexp.new(str,"i")
=> /(?<!a st)\p{Space}/i
2.5.6 :015 > str = "(?<!a st)\\p{Space}".force_encoding("UTF-8")
=> "(?<!a st)\\p{Space}"
2.5.6 :016 > Regexp.new(str,"i")
Traceback (most recent call last):
3: from (irb):16
2: from (irb):16:in `new'
1: from (irb):16:in `initialize'
RegexpError (invalid pattern in look-behind: /(?<!a st)\p{Space}/i)
</code></pre>
<p>duerst (Martin Dürst), 2019-09-17T08:48:59Z (https://bugs.ruby-lang.org/issues/16158?journal_id=81560):</p>
<p>I had a hunch, and have now been able to confirm it:</p>
<p>The problem must be related to the fact that there is an 'st' ligature (U+FB06) in Unicode. The problem occurs for all the Latin ligatures from U+FB00 up to and including U+FB06, i.e. for 'ff', 'fi', 'fl', 'ffi', 'ffl', long s with t, and 'st'. It also occurs for the components of the Armenian ligatures that follow them, e.g.</p>
<pre><code>$ ruby -ve 'pat = /(?<!a \u0574\u0576)\p{Space}/i'
ruby 2.7.0dev (2019-07-06T03:43:38Z trunk f296c260ef) [x86_64-cygwin]
-e:1: invalid pattern in look-behind: /(?<!a \u0574\u0576)\p{Space}/i
</code></pre>
<p>It doesn't occur for Hebrew ligatures:</p>
<pre><code>$ ruby -ve 'pat = /(?<!a \u05D9\u05B4)\p{Space}/i'
ruby 2.7.0dev (2019-07-06T03:43:38Z trunk f296c260ef) [x86_64-cygwin]
</code></pre>
<p>My guess is that this is because Latin and Armenian have case conversion, but Hebrew doesn't. This would be consistent with the fact that the error is only produced when matching is case-insensitive.</p>
<p>duerst (Martin Dürst), 2019-09-17T09:37:53Z (https://bugs.ruby-lang.org/issues/16158?journal_id=81561):</p>
<ul></ul><p>Some more information: The onigmo documentation says (<a href="https://github.com/k-takata/Onigmo/blob/master/doc/RE#L270" class="external">https://github.com/k-takata/Onigmo/blob/master/doc/RE#L270</a>):</p>
<pre><code> Subexp of look-behind must be fixed-width.
But top-level alternatives can be of various lengths.
ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.
</code></pre>
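<p>The two examples from the quoted documentation can be tried directly from Ruby. The following is only a minimal sketch: <code>compiles?</code> is a hypothetical helper name introduced here (it is not part of Ruby or Onigmo), and the trailing <code>x</code> in each pattern is added just to give the look-behind something to precede.</p>

```ruby
# Hypothetical helper (not a Ruby or Onigmo API): reports whether the
# regex engine accepts a pattern source instead of raising RegexpError.
def compiles?(source, options = 0)
  Regexp.new(source, options)
  true
rescue RegexpError
  false
end

# Top-level alternatives of different widths: documented as OK.
puts compiles?("(?<=a|bc)x")

# Variable-width alternative nested inside the look-behind: documented as not allowed.
puts compiles?("(?<=aaa(?:b|cd))x")
```

<p>The same helper can be pointed at the patterns in this report, with <code>Regexp::IGNORECASE</code> as the second argument, to check how a particular Ruby build and pattern encoding behave.</p>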
<p>What Onigmo apparently does internally is treat the st ligature as case-equivalent to upper-case ST, which in turn is case-equivalent to lower-case st. You can see this as follows:</p>
<pre><code>$ ruby -ve 'puts(/\uFB06/i =~ "most")'
ruby 2.7.0dev (2019-07-06T03:43:38Z trunk f296c260ef) [x86_64-cygwin]
2
</code></pre>
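<p>As a rough illustration of the case equivalence shown above, the sketch below compares character counts and repeats the match with and without the <code>i</code> option. The names <code>ST_LIGATURE</code> and <code>ligature_match_index</code> are made up for this example.</p>

```ruby
# Illustrative names only; neither is part of Ruby.
ST_LIGATURE = "\uFB06" # LATIN SMALL LIGATURE ST, a single character

# Index at which the ligature pattern matches `text`, or nil if it does not.
def ligature_match_index(text, ignore_case:)
  options = ignore_case ? Regexp::IGNORECASE : 0
  Regexp.new(ST_LIGATURE, options) =~ text
end

puts ST_LIGATURE.length   # the ligature is one character
puts "st".length          # its two-letter counterpart is two characters

# With the i option the one-character pattern matches the two letters "st".
p ligature_match_index("most", ignore_case: true)

# Without it there is no match, so this prints nil.
p ligature_match_index("most", ignore_case: false)
```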
<p>The st ligature is a single character, so its length is 1, but the length of ST and st is 2. So with the //i option, st no longer seems to be fixed-width, and that's why Onigmo refuses to deal with it and produces an error. So in some way this is as per spec, although it's surprising and annoying.</p>
<p>jeremyevans0 (Jeremy Evans), 2019-10-17T22:25:07Z (https://bugs.ruby-lang.org/issues/16158?journal_id=82138):</p>
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-1 priority-4 priority-default" href="/issues/14838">Bug #14838</a>: RegexpError with double "s" in look-behind assertion in case-insensitive unicode regexp</i> added</li></ul>