Bug #18009

Regexps \w and \W with /i option and /u option produce inconsistent results under nested negation and intersection

Added by jirkamarsik (Jirka Marsik) over 2 years ago. Updated about 1 month ago.

Status: Open
Assignee: -
Target version: -
ruby -v: ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-linux]
[ruby-core:104422]

Description

This is a follow-up to issue 4044. Its fix (https://github.com/k-takata/Onigmo/issues/4) handled the cases reported in the original issue, but other cases were missed and still produce inconsistent results.

If the \w character set is used inside a nested negated character class, it is not picked up by the part of the character class analyzer that is responsible for limiting the case-folding of certain character sets (like \w and \W) across the ASCII boundary. As a result, /[^\w]/iu and /[[^\w]]/iu match different sets of characters.

irb(main):001:0> ("a".."z").to_a.join.scan(/\W/iu)
=> []
irb(main):002:0> ("a".."z").to_a.join.scan(/[^\w]/iu)
=> []
irb(main):003:0> ("a".."z").to_a.join.scan(/[[^\w]]/iu)
=> ["k", "s"]

This can also be demonstrated using the inverted matcher:

irb(main):004:0> ("a".."z").to_a.join.scan(/\w/iu).length
=> 26
irb(main):005:0> ("a".."z").to_a.join.scan(/[^[^\w]]/iu).length
=> 24
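
To check whether a particular Ruby build is affected, the comparison can be automated. The following is only a minimal sketch; `pattern_diff` is a hypothetical helper name and is not part of Ruby or Onigmo.

```ruby
# Hypothetical helper: lists the characters on which two patterns disagree.
def pattern_diff(a, b, chars)
  chars.reject { |c| a.match?(c) == b.match?(c) }
end

letters = ("a".."z").to_a

# Going by the sessions above, an affected build should list k and s here;
# a build where the two classes agree prints an empty array.
p pattern_diff(/[^\w]/iu, /[[^\w]]/iu, letters)
p pattern_diff(/\w/iu, /[^[^\w]]/iu, letters)
```

Each pair of patterns is expected to be equivalent, so any non-empty result points at an inconsistency of the kind described in this report.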

A similar issue arises with character class intersection. The idea behind the pattern compiler's analysis is that characters are allowed to case-fold across the ASCII boundary only if they are included in the character class by some means other than being part of \w (or one of several other character sets that receive special treatment). Therefore, in the examples below, /[\w]/iu will not match the Kelvin sign \u212a, because that would mean crossing the ASCII boundary from k to \u212a. However, /[kx]/iu will match the Kelvin sign, because the k was not contributed by \w and is therefore not subject to the ASCII boundary restriction. (We have to use /[kx]/iu instead of /[k]/iu in our examples, or else the pattern analyzer would replace [k] with k and follow a different code path.)

irb(main):006:0> /[\w]/iu.match("\u212a")
=> nil
irb(main):007:0> /[kx]/iu.match("\u212a")
=> #<MatchData "K">
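
For context, k and s show up in these examples because each has a non-ASCII character that case-folds to it (the Kelvin sign \u212a and the long s \u017f). A small sketch using String#downcase with the :fold option illustrates what "crossing the ASCII boundary" means here; `folds_into_ascii?` is a hypothetical helper written for this illustration, not something Onigmo exposes.

```ruby
# Hypothetical helper: true when a non-ASCII character case-folds to ASCII,
# i.e. when case-insensitive matching would cross the ASCII boundary.
def folds_into_ascii?(ch)
  !ch.ascii_only? && ch.downcase(:fold).ascii_only?
end

kelvin = "\u212a" # KELVIN SIGN
long_s = "\u017f" # LATIN SMALL LETTER LONG S

p kelvin.downcase(:fold) == "k"  # true
p long_s.downcase(:fold) == "s"  # true
p folds_into_ascii?(kelvin)      # true
p folds_into_ascii?("k")         # false, already ASCII
```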

The problem appears when we intersect these two character sets. Since [kx] is a subset of \w, we would expect their intersection to behave the same as [kx], but that is not the case.

irb(main):008:0> /[\w&&kx]/iu.match("\u212a")
=> nil

The underlying issue in these cases is the manner in which the ascCc character set is computed during the parsing of character classes. The ascCc character set should contain all characters of the character class except those contributed by \w and similar character sets. Currently, these character sets are essentially ignored when ascCc is calculated. That works well for set union and for top-most negation (which is handled explicitly), but it does not handle nested set negation or set intersection.
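
The effect can be illustrated with a toy model built on Ruby's Set. This is only a sketch of the description above, not Onigmo's actual code: the tiny universe, the [cc, asc] pair representation, and all names (WORD, word, lit, union, inter, neg) are assumptions made for illustration.

```ruby
require "set"

# Toy universe and a toy stand-in for \w (assumptions for illustration only).
UNIVERSE = Set["k", "x", "-", "\u212a"]
WORD     = Set["k", "x"]

# Each node is a pair [cc, asc]: the class itself and the "explicitly
# contributed" set. Ignoring \w means it contributes nothing to asc.
def word
  [WORD, Set[]]
end

def lit(*chars)
  s = Set.new(chars)
  [s, s]
end

def union(a, b)
  [a[0] | b[0], a[1] | b[1]]
end

def inter(a, b)
  [a[0] & b[0], a[1] & b[1]]
end

def neg(a)
  [UNIVERSE - a[0], UNIVERSE - a[1]]
end

# Union behaves as intended: only the explicitly listed k ends up in asc.
p union(word, lit("k"))[1].to_a           # ["k"]

# Intersection: k and x were listed explicitly, yet asc comes out empty.
p inter(word, lit("k", "x"))[1].to_a      # []

# Nested negation: asc becomes the whole universe, so every character
# looks as if it had been contributed explicitly.
p neg(word)[1] == UNIVERSE                # true
```

In this model, the empty asc for the intersection is consistent with /[\w&&kx]/iu refusing to match the Kelvin sign, and the over-full asc for the nested negation is consistent with /[[^\w]]/iu matching k and s.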

Updated by mjrzasa (Maciek Rząsa) about 1 month ago

One more case:

[26] pry(main)> ("a".."z").to_a.join.scan(/[\W]/iu)
=> ["st"]