Bug #21870
Open
Regexp: Warnings when using slightly overlapping \p{...} classes
Added by jneen (Jeanine Adkisson) 2 days ago. Updated about 1 hour ago.
Description
$VERBOSE = true
# warning: character class has duplicated range: /[\p{Word}\p{S}]/
regex = /[\p{Word}\p{S}]/
As far as I can tell, this is a perfectly valid and non-overlapping set of Unicode properties, but I am still being spammed with warnings. Using /(\p{Word}|\p{S})/ is kind of a workaround, but it is slower.
Edit: They do overlap somewhat, but I think the deeper issue is that there is no convenient way to express this without falling back to raw Unicode ranges.
Updated by jneen (Jeanine Adkisson) 1 day ago
#1
- ruby -v changed from 4.0.0, 4.0.1, earlier versions to a lesser extent to 4.0.1
Updated by tompng (tomoya ishida) 1 day ago
#2
[ruby-core:124718]
I found 130 characters (5 sets of the 26-letter alphabet) that match both \p{S} and \p{Word}.
Visually, they are letter-like symbol characters:
(0..0x10ffff).select{(s=''<<it; s=~/\p{Word}/&&s=~/\p{S}/) rescue false}.map{''<<it}.join
# ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏ
# ⓐⓑⓒⓓⓔⓕⓖⓗⓘⓙⓚⓛⓜⓝⓞⓟⓠⓡⓢⓣⓤⓥⓦⓧⓨⓩ
# 🄰🄱🄲🄳🄴🄵🄶🄷🄸🄹🄺🄻🄼🄽🄾🄿🅀🅁🅂🅃🅄🅅🅆🅇🅈🅉
# 🅐🅑🅒🅓🅔🅕🅖🅗🅘🅙🅚🅛🅜🅝🅞🅟🅠🅡🅢🅣🅤🅥🅦🅧🅨🅩
# 🅰🅱🅲🅳🅴🅵🅶🅷🅸🅹🅺🅻🅼🅽🅾🅿🆀🆁🆂🆃🆄🆅🆆🆇🆈🆉
I'm not sure how to read Unicode properties, but it looks like these characters are Alphabetic:Yes and also in the Other_Symbol category: https://util.unicode.org/UnicodeJsps/character.jsp?a=%E2%92%B6
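To make the "5 sets of 26" structure easier to see, the scan above can be rewritten to report contiguous runs of code points rather than the characters themselves. This is a minimal sketch; the helper names `overlapping_codepoints` and `contiguous_runs` are made up for illustration, and the example only scans U+2400..U+24FF to keep it quick (pass `0..0x10FFFF` for the full scan).

```ruby
# Code points in `range` whose character matches both patterns.
# (Hypothetical helper, not part of Ruby.)
def overlapping_codepoints(a, b, range = 0..0x10FFFF)
  range.select do |cp|
    next false if (0xD800..0xDFFF).cover?(cp) # surrogates are not valid characters
    ch = [cp].pack("U")
    ch.match?(a) && ch.match?(b)
  end
end

# Collapse a sorted list of code points into [first, last] runs.
def contiguous_runs(codepoints)
  codepoints.slice_when { |prev, cur| cur != prev + 1 }
            .map { |run| [run.first, run.last] }
end

runs = contiguous_runs(overlapping_codepoints(/\p{Word}/, /\p{S}/, 0x2400..0x24FF))
runs.each do |first, last|
  puts format("U+%04X..U+%04X (%d code points)", first, last, last - first + 1)
end
```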
Updated by jneen (Jeanine Adkisson) 1 day ago
#3
[ruby-core:124719]
I see! So they do have some overlap. Is it really correct to warn here, though? "Fixing" the warning would require falling back to manual Unicode ranges.
Updated by jneen (Jeanine Adkisson) 1 day ago
#4
- Subject changed from Regexp: Warnings when using multiple non-overlapping \p{...} classes to Regexp: Warnings when using slightly overlapping \p{...} classes
Updated by jneen (Jeanine Adkisson) 1 day ago
#5
- Description updated (diff)
Updated by jneen (Jeanine Adkisson) 1 day ago
#6
[ruby-core:124724]
Another example of this is /[\p{Word}\p{Cf}]/, which seems to overlap precisely on ZWNJ (U+200C) and ZWJ (U+200D).
[1] pry(main)> (0..0x10ffff).select{(s=[it].pack('U'); s=~/\p{Word}/&&s=~/\p{Cf}/) rescue false}.map{it.to_s 16 }
=> ["200c", "200d"]
[2] pry(main)> /[\p{Word}\p{Cf}]/
(pry):5: warning: character class has duplicated range: /[\p{Word}\p{Cf}]/
=> /[\p{Word}\p{Cf}]/
[3] pry(main)>
Updated by jneen (Jeanine Adkisson) 1 day ago
#7
[ruby-core:124725]
- Description updated (diff)
That specific case also appears to have changed, e.g. on 3.4.1:
[2] pry(main)> (0..0x10ffff).select{(s=[it].pack('U'); s=~/\p{Word}/&&s=~/\p{Cf}/) rescue false}.map{it.to_s 16}
=> []
Maybe for preset classes like \p{...} and [[:alpha:]] we should only warn if one range completely subsumes another?
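For illustration, the "completely subsumes" condition can be checked from the outside by brute force over code points. This is only a sketch of the condition itself, not of how the regexp engine decides when to warn; `subsumes?` is a hypothetical helper name, and the example limits the scan to the BMP to keep it fast.

```ruby
# True when every character in `range` matched by `inner` is also matched
# by `outer`. (Hypothetical helper, not part of Ruby.)
def subsumes?(outer, inner, range = 0..0x10FFFF)
  range.none? do |cp|
    next false if (0xD800..0xDFFF).cover?(cp) # skip surrogates
    ch = [cp].pack("U")
    ch.match?(inner) && !ch.match?(outer)
  end
end

# Uppercase letters are all word characters, so a warning here would
# point at a genuinely redundant class.
puts subsumes?(/\p{Word}/, /\p{Lu}/, 0..0xFFFF)

# Symbols only partially overlap with word characters ("+" is a symbol
# but not a word character), so neither class is redundant.
puts subsumes?(/\p{Word}/, /\p{S}/, 0..0xFFFF)
```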
Updated by jneen (Jeanine Adkisson) 1 day ago
#8
- Description updated (diff)
Updated by mame (Yusuke Endoh) about 22 hours ago
#9
[ruby-core:124728]
Updated by mame (Yusuke Endoh) about 22 hours ago
#10
- Related to Bug #21503: \p{Word} does not match on \p{Join_Control} while docs say it does added
Updated by trinistr (Alexander Bulancov) about 14 hours ago
#11
[ruby-core:124736]
Using /(\p{Word}|\p{S})/ is kind of a workaround, but it is slower.
Have you tried a non-capturing group? /(?:\p{Word}|\p{S})/ should have better performance.
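One way to compare the two forms locally is a rough timing loop like the one below. It is a sketch, not a rigorous benchmark: the sample text, repeat counts, and the `time_scan` helper are arbitrary choices made for illustration, and results will vary by Ruby version and input.

```ruby
# Time `iterations` full scans of `text` with `regex`, in seconds.
# (Hypothetical helper, not part of Ruby.)
def time_scan(regex, text, iterations = 10)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { text.scan(regex) }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

SAMPLE_TEXT = ("héllo wörld + Ⓐ € 123 " * 1_000).freeze
CHAR_CLASS  = /[\p{Word}\p{S}]/     # warns under $VERBOSE = true
ALTERNATION = /(?:\p{Word}|\p{S})/

# Both forms should select the same characters; only the timing differs.
raise "patterns disagree" unless SAMPLE_TEXT.scan(CHAR_CLASS) == SAMPLE_TEXT.scan(ALTERNATION)

printf("character class: %.4fs\n", time_scan(CHAR_CLASS, SAMPLE_TEXT))
printf("alternation:     %.4fs\n", time_scan(ALTERNATION, SAMPLE_TEXT))
```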
Updated by kddnewton (Kevin Newton) about 14 hours ago
#12
[ruby-core:124737]
This might be a good opportunity to add the || operator from the Unicode spec (https://www.unicode.org/reports/tr18/#Subtraction_and_Intersection). We could make that one not warn, because it's explicitly desired. As in:
$VERBOSE = true
regex = /[\p{Word}\p{S}]/ # warning
regex = /[\p{Word}||\p{S}]/ # no warning
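For comparison, Ruby character classes already support the && intersection operator and nested bracket sets, so the overlap can be named directly, and the union can be written with the second operand trimmed so the two sets are disjoint. This is a sketch of what is expressible today; whether the trimmed form actually avoids the duplicated-range warning was not verified here, and the constant names are made up for illustration.

```ruby
# Characters that are in BOTH properties (the overlap discussed above).
IN_BOTH = /[\p{Word}&&\p{S}]/

# Union written so the operands are disjoint: all word characters, plus
# symbols that are not word characters.
IN_EITHER = /[\p{Word}[\p{S}&&\P{Word}]]/

%w[a + Ⓐ].each do |ch|
  puts "#{ch}: both=#{ch.match?(IN_BOTH)} either=#{ch.match?(IN_EITHER)}"
end
```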
Updated by jneen (Jeanine Adkisson) about 12 hours ago
· Edited
#13
[ruby-core:124739]
trinistr (Alexander Bulancov) wrote in #note-11:
Using /(\p{Word}|\p{S})/ is kind of a workaround, but it is slower.
Have you tried a non-capturing group? /(?:\p{Word}|\p{S})/ should have better performance.
This is what I actually tested. Still much slower.
mame (Yusuke Endoh) wrote in #note-9:
jneen (Jeanine Adkisson) wrote in #note-7:
That specific case also appears to have changed, e.g. on 3.4.1:
It is an intentional bug fix. See #21503.
While I understand your trouble, this warning is functioning exactly as intended. How do you suggest resolving it?
I suppose the question is - what is the purpose of a warning here? What fix are you asking the code author to implement? If my downstream users are running with warnings on and Ruby prints 1000 lines of warnings loading my library, what exactly am I being warned about?
Is there a specific danger to using overlapping character classes? Or should this kind of thing live in a linter like Rubocop, which has overrides and toggles?
Updated by maxfelsher (Max Felsher) about 1 hour ago
#14
[ruby-core:124750]
If I'm reading the history right, the warning was added in #1831 in order to catch mistakes like a regexp defined as /[:lower:]/ (as opposed to /[[:lower:]]/, I assume). I can see the value in that, but it does seem like there should be a way to list overlapping character classes without a warning (and without turning warnings off completely).
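As a stopgap for silencing just this message while leaving other warnings on, a process can filter it through the Warning module. The sketch below assumes the duplicated-range warning is routed through Warning.warn on the Ruby version and parser in use, which was not confirmed in this thread; the module name is hypothetical. Note that this is a process-wide hook, so it is better suited to an application than to a library.

```ruby
# Hypothetical filter module (not part of Ruby): drops one specific warning
# and passes everything else through to the default Warning.warn.
module IgnoreDuplicatedRangeWarning
  PATTERN = /character class has duplicated range/

  def warn(message, **kwargs)
    return if PATTERN.match?(message)
    super
  end
end

Warning.extend(IgnoreDuplicatedRangeWarning)

$VERBOSE = true
# Assumption: this warning goes through Warning.warn and is therefore dropped.
Regexp.new('[\p{Word}\p{S}]')
```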