Bug #19417
Closed
Regexp \p{Word} and [[:word:]] do not match Unicode Other_Number character
Description
According to the documentation for Regexp, \p{Word} and [[:word:]] both match a character in one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation. However, neither matches U+00B2, which is in the Other_Number category (a subcategory of Number).
puts "Ruby version: %s" % RUBY_VERSION
puts "\\p{Word} matches? %s" % /\p{Word}/u.match?("\u00B2")
puts "[[:word:]] matches? %s" % /[[:word:]]/u.match?("\u00B2")
puts "Is a Number character? %s" % /\p{Number}/u.match?("\u00B2")
puts "Is an Other_Number character? %s" % /\p{Other_Number}/u.match?("\u00B2")
Expected output:
Ruby version: 3.2.0
\p{Word} matches? true
[[:word:]] matches? true
Is a Number character? true
Is an Other_Number character? true
Actual output:
Ruby version: 3.2.0
\p{Word} matches? false
[[:word:]] matches? false
Is a Number character? true
Is an Other_Number character? true
I notice that the upstream Onigmo library doc defines the [[:word:]] class as "Letter | Mark | Decimal_Number | Connector_Punctuation", meaning that it only matches certain number characters (which would exclude U+00B2). I am not sure how \p{Word} is defined, though. But perhaps the documentation needs to be changed?
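To illustrate the distinction between Number and Decimal_Number discussed above, here is a minimal sketch that checks U+00B2 against each property. The variable names (`superscript_two`, `checks`) are only illustrative and not part of the original report.

```ruby
# U+00B2 SUPERSCRIPT TWO has the general category Other_Number (No).
superscript_two = "\u00B2"

# Compare the broad Number category with its Decimal_Number subcategory,
# and with \p{Word} itself.
checks = {
  "Number"         => /\p{Number}/.match?(superscript_two),
  "Other_Number"   => /\p{Other_Number}/.match?(superscript_two),
  "Decimal_Number" => /\p{Decimal_Number}/.match?(superscript_two),
  "Word"           => /\p{Word}/.match?(superscript_two),
}

checks.each do |name, matched|
  puts format("\\p{%s} matches? %s", name, matched)
end
```

If \p{Word} is built on Decimal_Number rather than Number, the last two checks should both be false while the first two are true.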
Updated by jeremyevans0 (Jeremy Evans) almost 2 years ago
Assuming this is a documentation bug, I submitted a pull request to fix it: https://github.com/ruby/ruby/pull/7287
Updated by janosch-x (Janosch Müller) almost 2 years ago
regarding the documentation, letter in the upstream doc is also incorrect, so the downstream doc actually has two errors.
as implemented here, word actually matches anything with the alphabetic property (effectively a superset of the letter category, comprising about 1600 more chars).
demonstration:
%w[
word
letter
mark
decimal_number
connector_punctuation
alpha
].select { |p| eval("/\\p{#{p}}/ =~ ?Ⅷ") } # roman eight
# => ["word", "alpha"]
a better wording might be:
A character with the <i>_Alphabetic_</i> unicode property or one of the following Unicode general categories: <i>Mark</i>, <i>Decimal\_Number</i>,
<i>Connector\_Punctuation</i>
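as a rough check of that wording, the sketch below builds a character class by hand from the four properties named in it and compares it with \p{Word} on a few sample characters. the names `word_as_described` and `samples` are made up for this illustration, and agreement on a handful of samples does not prove the two are identical for every code point.

```ruby
# Hand-written class following the suggested wording (an approximation,
# not the actual Onigmo definition).
word_as_described =
  /[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}]/

samples = {
  "a"      => "Latin small letter a (Letter)",
  "Ⅷ"      => "Roman numeral eight (Letter_Number, Alphabetic)",
  "\u0301" => "combining acute accent (Mark)",
  "9"      => "ASCII digit nine (Decimal_Number)",
  "_"      => "low line (Connector_Punctuation)",
  "\u00B2" => "superscript two (Other_Number)",
  " "      => "space (Space_Separator)",
}

# Collect every sample on which the two patterns disagree.
disagreements = samples.keys.reject do |ch|
  word_as_described.match?(ch) == /\p{Word}/.match?(ch)
end

puts "disagreements: #{disagreements.inspect}"
```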
regarding the behavior, i think it could be changed to match number instead of decimal_number. some scripts (e.g. Malayalam) have characters for numbers higher than 9, and these would disrupt matching at the moment (e.g. the Malayalam 9 is matched but the 10 is not). this change would also make word match fractions and superscripts such as the one mentioned by OP ("²"). to me, this would seem like the less unexpected behavior.
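the Malayalam case can be sketched as follows. the second pattern is only a hypothetical stand-in for the proposed behavior (word widened to all of number), not something Ruby implements; the variable names are illustrative.

```ruby
malayalam_nine = "\u0D6F" # MALAYALAM DIGIT NINE, Decimal_Number (Nd)
malayalam_ten  = "\u0D70" # MALAYALAM NUMBER TEN, Other_Number (No)
text = "#{malayalam_nine}#{malayalam_ten}"

# Current behavior: the run of word characters stops before the "ten" sign.
current = text.scan(/\p{Word}+/)

# Hypothetical proposed behavior, emulated by adding \p{Number} to the class.
proposed = text.scan(/[\p{Word}\p{Number}]+/)

puts "current:  #{current.inspect}"
puts "proposed: #{proposed.inspect}"
```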
Updated by naruse (Yui NARUSE) over 1 year ago
The document is wrong. The definition of word is defined in Unicode® Technical Standard #18 UNICODE REGULAR EXPRESSIONS.
https://unicode.org/reports/tr18/#word
Updated by jeremyevans0 (Jeremy Evans) over 1 year ago
naruse (Yui NARUSE) wrote in #note-3:
The document is wrong. The definition of word is defined in Unicode® Technical Standard #18 UNICODE REGULAR EXPRESSIONS.
https://unicode.org/reports/tr18/#word
I've updated my pull request to match the description in the standard linked by @naruse (Yui NARUSE). @janosch-x (Janosch Müller) or @naruse (Yui NARUSE), could you review?
Updated by jeremyevans (Jeremy Evans) 12 months ago
- Status changed from Open to Closed
Applied in changeset git|060f14bf62ad3f426a6666901c45b82d4334fa26.
Update documentation for [[:word:]] and \p{Word} in regexps
Onigmo uses Decimal_Number and not Number for these.
Fixes [Bug #19417]