Bug #21870
Open
Regexp: Warnings when using slightly overlapping \p{...} classes
Description
$VERBOSE = true
# warning: character class has duplicated range: /[\p{Word}\p{S}]/
regex = /[\p{Word}\p{S}]/
As far as I can tell this is a perfectly valid and non-overlapping set of Unicode properties, but I am still being spammed with warnings. Using /(\p{Word}|\p{S})/ is kind of a workaround, but it is slower.
Edit: They do overlap somewhat, but I think the deeper issue is that there is no convenient way to express this without falling back to raw Unicode ranges.
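To check from a script whether a given pattern triggers the warning, one option is to compile it with $VERBOSE enabled and capture what lands on $stderr. This is only a minimal sketch: the helper name regexp_compile_warnings is hypothetical, and it assumes the warning is written through $stderr when the pattern is compiled with Regexp.new. What it prints depends on the Ruby version.

```ruby
require "stringio"

# Hypothetical helper (not part of Ruby): compile `source` with $VERBOSE
# enabled and return whatever was written to $stderr while doing so.
# Assumption: regexp compile warnings are written through $stderr.
def regexp_compile_warnings(source)
  saved_stderr = $stderr
  saved_verbose = $VERBOSE
  $stderr = StringIO.new
  $VERBOSE = true
  Regexp.new(source)
  $stderr.string
ensure
  # Always restore the global state, even if compilation raised.
  $stderr = saved_stderr
  $VERBOSE = saved_verbose
end

# Compare the character-class form with the alternation workaround.
puts regexp_compile_warnings('[\p{Word}\p{S}]').inspect
puts regexp_compile_warnings('\p{Word}|\p{S}').inspect
```

An empty string means nothing was written to $stderr for that pattern.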
Updated by jneen (Jeanine Adkisson) about 20 hours ago
- ruby -v changed from 4.0.0, 4.0.1, earlier versions to a lesser extent to 4.0.1
Updated by tompng (tomoya ishida) about 15 hours ago
I found 130 characters (5 sets of the 26 letters of the alphabet) that match both \p{S} and \p{Word}.
Visually, they look like alphabet-like symbol characters:
(0..0x10ffff).select{(s=''<<it; s=~/\p{Word}/&&s=~/\p{S}/) rescue false}.map{''<<it}.join
# ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏ
# ⓐⓑⓒⓓⓔⓕⓖⓗⓘⓙⓚⓛⓜⓝⓞⓟⓠⓡⓢⓣⓤⓥⓦⓧⓨⓩ
# 🄰🄱🄲🄳🄴🄵🄶🄷🄸🄹🄺🄻🄼🄽🄾🄿🅀🅁🅂🅃🅄🅅🅆🅇🅈🅉
# 🅐🅑🅒🅓🅔🅕🅖🅗🅘🅙🅚🅛🅜🅝🅞🅟🅠🅡🅢🅣🅤🅥🅦🅧🅨🅩
# 🅰🅱🅲🅳🅴🅵🅶🅷🅸🅹🅺🅻🅼🅽🅾🅿🆀🆁🆂🆃🆄🆅🆆🆇🆈🆉
I'm not sure how to read Unicode properties, but it looks like these characters are Alphabetic:Yes and also in the Other_Symbol category: https://util.unicode.org/UnicodeJsps/character.jsp?a=%E2%92%B6
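To see where the overlap sits rather than just listing the characters, the matching code points can be collapsed into contiguous ranges. The sketch below is a variation on the one-liner above; the helper names (matches_both?, overlapping_codepoints, collapse_into_ranges) are hypothetical, and the result depends on the Unicode tables of the Ruby version in use. It scans only a small block to stay fast; pass 0..0x10ffff for the full scan.

```ruby
# Hypothetical helper: true if the character for `codepoint` matches both
# patterns. Code points that cannot be matched (e.g. surrogates) count as
# non-matching, mirroring the `rescue false` in the one-liner.
def matches_both?(codepoint, pattern_a, pattern_b)
  char = [codepoint].pack("U")
  char.match?(pattern_a) && char.match?(pattern_b)
rescue StandardError
  false
end

# Hypothetical helper: code points in `range` matched by both patterns.
def overlapping_codepoints(pattern_a, pattern_b, range)
  range.select { |cp| matches_both?(cp, pattern_a, pattern_b) }
end

# Hypothetical helper: collapse a sorted list of integers into inclusive
# ranges of consecutive values.
def collapse_into_ranges(codepoints)
  codepoints.slice_when { |a, b| b != a + 1 }.map { |run| run.first..run.last }
end

# Scan only the Enclosed Alphanumerics block (U+2460..U+24FF) here.
overlap = overlapping_codepoints(/\p{Word}/, /\p{S}/, 0x2460..0x24FF)
collapse_into_ranges(overlap).each do |r|
  puts format("U+%04X..U+%04X (%d code points)", r.first, r.last, r.size)
end
```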
Updated by jneen (Jeanine Adkisson) about 15 hours ago
I see! So they do have some overlap. Is it really correct to warn here, though? "Fixing" the warning would require falling back to manual Unicode ranges.
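One possible alternative to manual ranges, offered only as a sketch, is to subtract the overlap inside the class using the && intersection operator, so that the two operands are disjoint by construction. It matches the same characters as the alternation, but whether it actually silences the warning on a given Ruby version is an assumption that has not been verified here. The constant names are hypothetical.

```ruby
# Assumption: intersecting \p{S} with \P{Word} (the complement of \p{Word})
# keeps the two operands of the union disjoint. Whether this avoids the
# "duplicated range" warning has NOT been verified.
UNION_BY_DIFFERENCE  = /[\p{Word}[\p{S}&&\P{Word}]]/
UNION_BY_ALTERNATION = /\p{Word}|\p{S}/

# A few sample characters: letter, underscore, digit, plus sign,
# CIRCLED LATIN CAPITAL LETTER A, euro sign, space, exclamation mark.
SAMPLES = ["a", "_", "5", "+", "\u24B6", "\u20AC", " ", "!"].freeze

SAMPLES.each do |ch|
  same = ch.match?(UNION_BY_DIFFERENCE) == ch.match?(UNION_BY_ALTERNATION)
  puts format("U+%04X same_result=%s", ch.ord, same)
end
```

The two patterns describe the same set, so every sample should report the same result for both; only the way the class is spelled differs.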
Updated by jneen (Jeanine Adkisson) about 4 hours ago
- Subject changed from Regexp: Warnings when using multiple non-overlapping \p{...} classes to Regexp: Warnings when using slightly overlapping \p{...} classes
Updated by jneen (Jeanine Adkisson) about 4 hours ago
- Description updated (diff)
Updated by jneen (Jeanine Adkisson) about 4 hours ago
Another example of this is /[\p{Word}\p{Cf}]/, which seems to overlap precisely on ZWNJ (U+200C) and ZWJ (U+200D).
[1] pry(main)> (0..0x10ffff).select{(s=[it].pack('U'); s=~/\p{Word}/&&s=~/\p{Cf}/) rescue false}.map{it.to_s 16 }
=> ["200c", "200d"]
[2] pry(main)> /[\p{Word}\p{Cf}]/
(pry):5: warning: character class has duplicated range: /[\p{Word}\p{Cf}]/
=> /[\p{Word}\p{Cf}]/
[3] pry(main)>
Updated by jneen (Jeanine Adkisson) about 4 hours ago
- Description updated (diff)
That specific case also appears to have changed, e.g. on 3.4.1:
[2] pry(main)> (0..0x10ffff).select{(s=[it].pack('U'); s=~/\p{Word}/&&s=~/\p{Cf}/) rescue false}.map{it.to_s 16}
=> []
Maybe for preset classes like \p{...} and [[:alpha:]] we should only warn if one range completely subsumes another?
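The distinction being proposed can be sketched on plain integer ranges. This is an illustration only, not how Ruby's regexp engine is implemented: the function names are hypothetical, the range lists are toy data rather than real property tables, and each list is assumed to be normalized (sorted, with adjacent ranges already merged).

```ruby
# True if any range in `a` shares at least one value with a range in `b`.
def ranges_overlap?(a, b)
  a.any? { |x| b.any? { |y| x.first <= y.last && y.first <= x.last } }
end

# True if every range in `inner` lies entirely inside some range of `outer`.
# Assumes `outer` is normalized (adjacent ranges already merged).
def covers_all?(outer, inner)
  inner.all? { |r| outer.any? { |o| o.cover?(r) } }
end

# Classify two classes as :disjoint, :subsumes, or :partial_overlap.
def class_relation(a, b)
  return :disjoint unless ranges_overlap?(a, b)
  return :subsumes if covers_all?(a, b) || covers_all?(b, a)
  :partial_overlap
end

# Under the proposal, only full subsumption would produce a warning.
def warn_under_proposal?(a, b)
  class_relation(a, b) == :subsumes
end

# Toy data, not real Unicode property tables.
toy_word   = [0x30..0x39, 0x41..0x5A, 0x200C..0x200D]
toy_digits = [0x30..0x39]
toy_format = [0xAD..0xAD, 0x200B..0x200F]

p class_relation(toy_word, toy_digits)       # one class contains the other
p class_relation(toy_word, toy_format)       # classes share only part of a range
p warn_under_proposal?(toy_word, toy_format)
```

With this rule, a class fully contained in another would still warn, while a partial overlap like the ZWNJ/ZWJ case would not.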
Updated by jneen (Jeanine Adkisson) about 4 hours ago
- Description updated (diff)
Updated by mame (Yusuke Endoh) about 2 hours ago
- Related to Bug #21503: \p{Word} does not match on \p{Join_Control} while docs say it does added