Bug #18012
Case-insensitive character classes can only match multiple code points when top-level character class is not negated
Description
Some Unicode characters case-fold to strings of multiple code points, e.g. the ligature \ufb00 can match the string ff.
irb(main):001:0> /\A[\ufb00]\z/i.match("\ufb00")
=> #<MatchData "ff">
irb(main):002:0> /\A[\ufb00]\z/i.match("ff")
=> #<MatchData "ff">
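For context, the multi-code-point fold can be seen outside the regex engine. This is a minimal sketch, assuming `String#downcase(:fold)` applies the same full Unicode case folding the regex engine relies on; the variable names are illustrative only.

```ruby
# U+FB00 LATIN SMALL LIGATURE FF is a single code point.
ligature = "\ufb00"

# Full case folding (assumed here to mirror what the regex engine uses
# for /i) can expand one code point into several.
folded = ligature.downcase(:fold)

puts format("input:  %d code point(s) %p", ligature.length, ligature.codepoints)
puts format("folded: %d code point(s) %p", folded.length, folded.codepoints)
```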
As expected, when we negate this character class, we can no longer match either the ligature character \ufb00 or the string ff.
irb(main):003:0> /\A[^\ufb00]\z/i.match("\ufb00")
=> nil
irb(main):004:0> /\A[^\ufb00]\z/i.match("ff")
=> nil
Then, when we add a second negation, the \ufb00 ligature reappears in the character set but the string ff is no longer accepted.
irb(main):005:0> /\A[^[^\ufb00]]\z/i.match("\ufb00")
=> #<MatchData "ff">
irb(main):006:0> /\A[^[^\ufb00]]\z/i.match("ff")
=> nil
This reveals that the multi-code-point matches in character classes are blocked by negation. However, this is implemented only by checking whether the topmost character class is negated. If we wrap the character class in another set of brackets, the semantics change.
irb(main):007:0> /\A[[^[^\ufb00]]]\z/i.match("\ufb00")
=> #<MatchData "ff">
irb(main):008:0> /\A[[^[^\ufb00]]]\z/i.match("ff")
=> #<MatchData "ff">
The cause behind this discrepancy (the fact that [^[^\ufb00]] and [[^[^\ufb00]]] match different strings) is the extra IS_NCCLASS_NOT check in i_apply_case_fold (https://github.com/ruby/ruby/blob/9eae8cdefba61e9e51feb30a4b98525593169666/regparse.c#L5568).
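The eight checks in this report can be collected into one script to compare behavior across Ruby versions. This is only a sketch: the patterns and inputs come from the transcripts in this report, while the constant names, the labels, and the `fold_matrix` helper are hypothetical. It prints whatever the running interpreter does and makes no claim about which results are correct.

```ruby
# Patterns from the report; the string keys are display labels only.
PATTERNS = {
  "[\\ufb00]"       => /\A[\ufb00]\z/i,
  "[^\\ufb00]"      => /\A[^\ufb00]\z/i,
  "[^[^\\ufb00]]"   => /\A[^[^\ufb00]]\z/i,
  "[[^[^\\ufb00]]]" => /\A[[^[^\ufb00]]]\z/i,
}.freeze

# The two inputs used throughout the report.
INPUTS = { "ligature" => "\ufb00", "ff" => "ff" }.freeze

# Hypothetical helper: maps each pattern label to a hash of
# input label => true/false (whether the pattern matched).
def fold_matrix
  PATTERNS.to_h do |label, regex|
    [label, INPUTS.transform_values { |input| regex.match?(input) }]
  end
end

fold_matrix.each do |label, row|
  puts format("%-16s %p", label, row)
end
```

If the last two rows differ in the "ff" column, the interpreter shows the discrepancy described here, since the outer brackets should not change which strings match.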