Bug #21870
Regexp: Warnings when using slightly overlapping \p{...} classes
Added by jneen (Jeanine Adkisson) 20 days ago. Updated 3 days ago.
Description
$VERBOSE = true
# warning: character class has duplicated range: /[\p{Word}\p{S}]/
regex = /[\p{Word}\p{S}]/
As far as I can tell this is a perfectly valid and non-redundant set of unicode properties, but I am still being spammed with warnings. Using /(?:\p{Word}|\p{S})/ is kind of a workaround, but it is slower (see benchmarks below), and also less clear.
They do overlap somewhat, but I think the deeper issue is there is not a convenient way to express this without falling back to raw unicode ranges.
For a similar example, consider /[\p{Word}\p{Cf}]/, which overlap precisely on ZWJ and ZWNJ. Even with this very small overlap, Ruby issues a warning, despite neither class being removable without changing the meaning of the regexp. The regexp is valid and as far as I can tell has no practical issues - Onigmo seems to be capable of intersecting overlapping codepoint ranges.
This warning was introduced back in 2009 with #1831, to help surface instances of things like /[:lower:]/ instead of /[[:lower:]]/, but even then the reporter suggested only warning if the class both begins and ends with :.
Is it appropriate to warn here? Is this a job best left to a static linter like Rubocop, which didn't exist at the time #1831 was opened? Or perhaps would it be better to warn only in the very specific case that #1831 was opened to address?
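For reference, a minimal sketch of the mistake #1831 was meant to surface. The variable names are illustrative only; the behavior shown is ordinary character class semantics.

```ruby
# A POSIX bracket written without its outer brackets is just a plain
# character class containing the characters ':', 'l', 'o', 'w', 'e', 'r'.
typo     = /[:lower:]/
intended = /[[:lower:]]/

p typo.match?('a')      # 'a' is not one of : l o w e r
p typo.match?(':')      # the colon itself is in the class
p intended.match?('a')  # any lowercase letter
p intended.match?(':')  # the colon is not lowercase
```

The repeated `:` in the first pattern is what makes the duplicated-range check fire there.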
Updated by jneen (Jeanine Adkisson) 19 days ago
#1
- ruby -v changed from 4.0.0, 4.0.1, earlier versions to a lesser extent to 4.0.1
Updated by tompng (tomoya ishida) 19 days ago
#2
[ruby-core:124718]
I found 130 characters (5 sets of the 26-letter alphabet) that match both \p{S} and \p{Word}.
Visually, they look like alphabet-like symbol characters:
(0..0x10ffff).select{(s=''<<it; s=~/\p{Word}/&&s=~/\p{S}/) rescue false}.map{''<<it}.join
# ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏ
# ⓐⓑⓒⓓⓔⓕⓖⓗⓘⓙⓚⓛⓜⓝⓞⓟⓠⓡⓢⓣⓤⓥⓦⓧⓨⓩ
# 🄰🄱🄲🄳🄴🄵🄶🄷🄸🄹🄺🄻🄼🄽🄾🄿🅀🅁🅂🅃🅄🅅🅆🅇🅈🅉
# 🅐🅑🅒🅓🅔🅕🅖🅗🅘🅙🅚🅛🅜🅝🅞🅟🅠🅡🅢🅣🅤🅥🅦🅧🅨🅩
# 🅰🅱🅲🅳🅴🅵🅶🅷🅸🅹🅺🅻🅼🅽🅾🅿🆀🆁🆂🆃🆄🆅🆆🆇🆈🆉
I'm not sure how to read unicode properties, but it looks like these characters are Alphabetic:Yes and also in Other_Symbol category https://util.unicode.org/UnicodeJsps/character.jsp?a=%E2%92%B6
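That reading can be checked from Ruby itself. A small sketch, assuming the property names \p{So} (Other_Symbol) and \p{Alphabetic} as spelled by Onigmo, using the first character from the list above:

```ruby
ch = "\u24B6" # CIRCLED LATIN CAPITAL LETTER A, the first character listed above

other_symbol = ch.match?(/\p{So}/)         # general category Other_Symbol
alphabetic   = ch.match?(/\p{Alphabetic}/) # derived property Alphabetic
word         = ch.match?(/\p{Word}/)

p [other_symbol, alphabetic, word]
```

A character that is both Alphabetic and Other_Symbol lands in \p{Word} and in \p{S} at once, which is where the overlap comes from.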
Updated by jneen (Jeanine Adkisson) 19 days ago
#3
[ruby-core:124719]
I see! So they do have some overlap. Is it really correct to warn here though? "Fixing" the warning would require falling back to manual unicode ranges.
Updated by jneen (Jeanine Adkisson) 19 days ago
#4
- Subject changed from Regexp: Warnings when using multiple non-overlapping \p{...} classes to Regexp: Warnings when using slightly overlapping \p{...} classes
Updated by jneen (Jeanine Adkisson) 19 days ago
#5
- Description updated (diff)
Updated by jneen (Jeanine Adkisson) 19 days ago
#6
[ruby-core:124724]
Another example of this is /[\p{Word}\p{Cf}]/, which seem to overlap precisely on ZWNJ (U+200C) and ZWJ (U+200D).
[1] pry(main)> (0..0x10ffff).select{(s=[it].pack('U'); s=~/\p{Word}/&&s=~/\p{Cf}/) rescue false}.map{it.to_s 16 }
=> ["200c", "200d"]
[2] pry(main)> /[\p{Word}\p{Cf}]/
(pry):5: warning: character class has duplicated range: /[\p{Word}\p{Cf}]/
=> /[\p{Word}\p{Cf}]/
[3] pry(main)>
Updated by jneen (Jeanine Adkisson) 19 days ago
#7
[ruby-core:124725]
- Description updated (diff)
That specific case also appears to have changed, e.g. on 3.4.1:
[2] pry(main)> (0..0x10ffff).select{(s=[it].pack('U'); s=~/\p{Word}/&&s=~/\p{Cf}/) rescue false}.map{it.to_s 16}
=> []
Maybe for preset classes like \p{...} and [[:alpha:]] we should only warn if one range completely subsumes another?
Updated by jneen (Jeanine Adkisson) 19 days ago
#8
- Description updated (diff)
Updated by mame (Yusuke Endoh) 19 days ago
#9
[ruby-core:124728]
Updated by mame (Yusuke Endoh) 19 days ago
#10
- Related to Bug #21503: \p{Word} does not match on \p{Join_Control} while docs say it does added
Updated by trinistr (Alexander Bulancov) 18 days ago
#11
[ruby-core:124736]
Using /(\p{Word}|\p{S})/ is kind of a workaround, but it is slower.
Have you tried a non-capturing group? /(?:\p{Word}|\p{S})/ should have better performance.
Updated by kddnewton (Kevin Newton) 18 days ago
#12
[ruby-core:124737]
This might be a good opportunity to add the || operator from the Unicode spec (https://www.unicode.org/reports/tr18/#Subtraction_and_Intersection). We could make that one not warn, because it's explicitly desired. As in:
$VERBOSE = true
regex = /[\p{Word}\p{S}]/ # warning
regex = /[\p{Word}||\p{S}]/ # no warning
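The || union above is only a proposal, but the && intersection operator already works inside Ruby character classes. A small sketch that uses it to probe the overlap directly; the sample characters are chosen here for illustration:

```ruby
# Matches only characters that are in BOTH \p{Word} and \p{S}.
overlap = /[\p{Word}&&\p{S}]/

circled_a = "\u24B6" # in both classes, per the list earlier in the thread
p overlap.match?(circled_a)
p overlap.match?('a') # a word character, but not a symbol
p overlap.match?('+') # a symbol, but not a word character
```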
Updated by jneen (Jeanine Adkisson) 18 days ago
· Edited
#13
[ruby-core:124739]
trinistr (Alexander Bulancov) wrote in #note-11:
Using /(\p{Word}|\p{S})/ is kind of a workaround, but it is slower.
Have you tried a non-capturing group? /(?:\p{Word}|\p{S})/ should have better performance.
This is what I actually tested. Still much slower.
mame (Yusuke Endoh) wrote in #note-9:
jneen (Jeanine Adkisson) wrote in #note-7:
That specific case also appears to have changed, e.g. on 3.4.1:
It is an intentional bug fix. See #21503.
While I understand your trouble, this warning is functioning exactly as intended. How do you suggest resolving it?
I suppose the question is - what is the purpose of a warning here? What fix are you asking the code author to implement? If my downstream users are running with warnings on and Ruby prints 1000 lines of warnings loading my library, what exactly am I being warned about?
Is there a specific danger to using overlapping character classes? Or should this kind of thing live in a linter like Rubocop, which has overrides and toggles?
Updated by maxfelsher (Max Felsher) 18 days ago
#14
[ruby-core:124750]
If I'm reading the history right, the warning was added in #1831 in order to catch mistakes like a regexp defined as /[:lower:]/ (as opposed to /[[:lower:]]/, I assume). I can see the value in that, but it does seem like there should be a way to list overlapping character classes without a warning (and without turning warnings off completely).
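Until such a mechanism exists, one stopgap is to silence warnings only while the pattern is compiled. The sketch below assumes the warning is gated on $VERBOSE, as the report's `$VERBOSE = true` suggests; `quiet_regexp` is a hypothetical helper, not an existing API, and $VERBOSE is process-global, so this is not safe across threads.

```ruby
# Hypothetical helper: compile a pattern with $VERBOSE switched off,
# then restore the previous setting even if compilation raises.
def quiet_regexp(source, options = 0)
  saved = $VERBOSE
  $VERBOSE = nil
  Regexp.new(source, options)
ensure
  $VERBOSE = saved
end

# Built at runtime from a string, so nothing is compiled at parse time.
WORD_OR_SYMBOL = quiet_regexp('[\p{Word}\p{S}]')
p WORD_OR_SYMBOL.match?('+')
```

This keeps warnings on for the rest of the program, at the cost of giving up the regexp literal.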
Updated by jneen (Jeanine Adkisson) 17 days ago
#15
[ruby-core:124761]
That's a very interesting find!
I do think it makes sense to warn if an explicitly written character repeats in a character class, or if the class begins and ends with a colon. But for overlapping unicode properties, there doesn't seem to be any danger in including both in a character class.
That said, there's still an argument that all of this is a job for a linter. Rubocop didn't exist until about a year after #1831 was opened.
Updated by jneen (Jeanine Adkisson) 17 days ago
· Edited
#16
[ruby-core:124764]
Some benchmarks:
$ ruby --version
ruby 4.0.1 (2026-01-13 revision e04267a14b) +PRISM [arm64-darwin25]
require 'benchmark'
LENGTH = 1000000
REPEAT = 100
TEST_STR = 'a' * LENGTH
Benchmark.bm do |bm|
  bm.report "char class:" do
    REPEAT.times { /[\p{Word}\p{S}]*/o.match?(TEST_STR) }
  end
  bm.report "alternation:" do
    REPEAT.times { /(?:\p{Word}|\p{S})*/o.match?(TEST_STR) }
  end
end
output:
                  user     system      total        real
char class:   0.634908   0.302112   0.937020 (  0.937089)
alternation:  0.983069   0.449849   1.432918 (  1.433005)
The alternation syntax is understandably a bit slower, as it would be two nodes in the state machine rather than one unified range test. I expect this effect would be worse when more unicode properties are piled on (as they tend to be in practice), resulting in extra nodes.
Either way, /[\p{Word}\p{S}]/ is a perfectly valid regular expression that as far as I know doesn't have any practical issues, so I don't think it is helpful to warn. A warning might make sense if one class completely subsumes another (say, /[\p{Alnum}\p{Alpha}]/), but even then I don't think it's particularly helpful, or anything that couldn't be handled by a static linter.
Updated by jneen (Jeanine Adkisson) 17 days ago
#17
- Description updated (diff)
Updated by jneen (Jeanine Adkisson) 17 days ago
#18
- Description updated (diff)
Updated by jneen (Jeanine Adkisson) 10 days ago
#19
[ruby-core:124846]
This isn't even possible to work around by targeting RUBY_VERSION, as Ruby warns even in unreachable cases:
regex = if RUBY_VERSION < '4'
  /[\p{Word}\p{Cf}]/
else
  /[\p{Word}]/
end
This still warns on Ruby 4+, even though that branch is not reachable in that version.
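A possible way around this, assuming the warning comes from compiling the literal when the file is parsed, is to build the version-specific pattern from a string so that it is only compiled if its branch actually runs. A sketch, with `WORD_OR_FORMAT` as an illustrative name:

```ruby
WORD_OR_FORMAT =
  if RUBY_VERSION < '4'
    # Compiled only when this branch executes, never on Ruby 4+.
    Regexp.new('[\p{Word}\p{Cf}]')
  else
    /[\p{Word}]/
  end

p WORD_OR_FORMAT.match?('a')
```

The single-quoted string keeps the backslashes intact, so the pattern text is identical to the literal.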
Updated by jneen (Jeanine Adkisson) 4 days ago
#20
[ruby-core:124875]
Having looked through the onigmo code a bit now, I can think of a few ways forward.
a) Simply don't warn on overlapping ctype classes.
I believe this would only involve removing the check on line 1860 from regparse.c. This would preserve a warning for /[:foo:]/, as in #1831, as well as maybe rarer situations like /[a-fb-g]/. It would not warn on cases like /[a-z\p{Word}]/ or /[\p{Alnum}\p{Word}]/. Whether this is a common enough mistake to warrant a warning I'm not entirely sure. I will also check the performance characteristics of these, in case overlapping ranges are a performance issue (which I doubt, but I think it is best to check).
b) Find a way to check if a character class or range completely subsumes another.
I honestly am not sure how I would go about implementing this, as it is a much deeper check which would require a greater understanding of onigmo internals than I have so far. The idea would be to warn on /[a-z\p{Word}]/ but not on e.g. /[_-z\p{Word}]/, since the range _-z contains a character not matched by \p{Word}. This would also catch /[\p{Alnum}\p{Word}]/.
c) Rethink the overlapping character warning entirely, and (maybe) more specifically target things like /[:x:]/.
This would involve warning only if the first and last character of a char class are literal :. Similar to (a), it may turn out that repeated characters in classes are not a performance or correctness issue worth warning about at all. But this is a judgment I leave to the team.
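To make the distinction in (b) concrete, here is a brute-force sketch that classifies how two patterns relate by testing code points one at a time. It runs in Ruby rather than inside onigmo, so it says nothing about how the check would be implemented in regparse.c; `SCALARS`, `matched` and `relation` are hypothetical names.

```ruby
# Every Unicode scalar value (surrogates excluded, as they cannot be encoded).
SCALARS = [0..0xD7FF, 0xE000..0x10FFFF].flat_map(&:to_a).freeze

# Code points whose one-character string matches the pattern.
def matched(regexp, codepoints = SCALARS)
  codepoints.select { |cp| regexp.match?([cp].pack('U')) }
end

# Classify the relationship between the sets matched by a and b.
def relation(a, b, codepoints = SCALARS)
  in_a = matched(a, codepoints)
  in_b = matched(b, codepoints)
  return :disjoint if (in_a & in_b).empty?
  return :equal if in_a == in_b
  return :a_subsumes_b if (in_b - in_a).empty?
  return :b_subsumes_a if (in_a - in_b).empty?
  :partial_overlap
end

# Restricted to ASCII to keep the example fast.
p relation(/[a-z]/, /\p{Word}/, 0..0x7F)
p relation(/[_-z]/, /\p{Word}/, 0..0x7F)
```

Under proposal (b), only the subsumption results would warn, and a partial overlap would stay silent.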
Updated by jneen (Jeanine Adkisson) 4 days ago
#21
[ruby-core:124876]
A quick benchmark shows we are within error bars for matching performance:
#!/usr/bin/env ruby
require 'benchmark'
NON_REPEAT = Regexp.new("[" + ("a-z" * 1) + "]")
YES_REPEAT = Regexp.new("[" + ("a-z" * 100000) + "]")
Benchmark.bm do |bm|
  bm.report('non-repeat') { 1000000.times { NON_REPEAT.match?('a') } }
  bm.report('yes-repeat') { 1000000.times { YES_REPEAT.match?('a') } }
end
Output:
; ruby /tmp/regex-test
                 user     system      total        real
non-repeat   0.105758   0.000233   0.105991 (  0.106004)
yes-repeat   0.103658   0.000223   0.103881 (  0.103881)
Updated by duerst (Martin Dürst) 3 days ago
#22
[ruby-core:124881]
Using two or more overlapping Unicode properties may not be very frequent, but in most cases isn't a mistake. If a user writes /[\p{Word}\p{S}]/, that expression should just match all word characters and all symbol characters, because that's most probably what the user wanted. The fact that there are some characters that are both word characters and symbol characters is irrelevant for that query, and should not produce a warning. There are many overlapping Unicode properties, because Unicode properties identify different aspects of characters (e.g. script, block, age, numeric properties,...). If we want to continue to warn about /[:lower:]/, that's fine, but we should warn about that specific case, not overlapping properties in general.