Bug #19417
Closed
Regexp \p{Word} and [[:word:]] do not match Unicode Other_Number character
Description
According to the documentation for Regexp, \p{Word} and [[:word:]] both match a character in one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation. However, neither matches U+00B2, which is in the Other_Number category (which is a subcategory of Number).
puts "Ruby version: %s" % RUBY_VERSION
puts "\\p{Word} matches? %s" % /\p{Word}/u.match?("\u00B2")
puts "[[:word:]] matches? %s" % /[[:word:]]/u.match?("\u00B2")
puts "Is a Number character? %s" % /\p{Number}/u.match?("\u00B2")
puts "Is an Other_Number character? %s" % /\p{Other_Number}/u.match?("\u00B2")
Expected output:
Ruby version: 3.2.0
\p{Word} matches? true
[[:word:]] matches? true
Is a Number character? true
Is an Other_Number character? true
Actual output:
Ruby version: 3.2.0
\p{Word} matches? false
[[:word:]] matches? false
Is a Number character? true
Is an Other_Number character? true
I notice that the upstream Onigmo library doc defines the [[:word:]] class as "Letter | Mark | Decimal_Number | Connector_Punctuation", meaning that it only matches certain number characters (which would exclude U+00B2). I am not sure how \p{Word} is defined though. But perhaps the documentation needs to be changed?
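The gap between Number and Decimal_Number can be checked directly. The following is a minimal sketch, assuming a Ruby whose regexp engine behaves as reported above; `number_report` is a made-up helper name used only for illustration, not part of Ruby.

```ruby
# number_report is a hypothetical helper: it records which of the
# relevant classes match a single character.
def number_report(char)
  {
    word:           /\p{Word}/.match?(char),
    decimal_number: /\p{Decimal_Number}/.match?(char),
    other_number:   /\p{Other_Number}/.match?(char),
    number:         /\p{Number}/.match?(char),
  }
end

# "2" is an ordinary decimal digit, U+00B2 is SUPERSCRIPT TWO.
["2", "\u00B2"].each do |char|
  puts "U+%04X %p" % [char.ord, number_report(char)]
end
```

If the Onigmo definition is the one in effect, the `word` and `decimal_number` entries should agree for both characters, while `number` is true for both.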
        
Updated by jeremyevans0 (Jeremy Evans) over 2 years ago
      Assuming this is a documentation bug, I submitted a pull request to fix it: https://github.com/ruby/ruby/pull/7287
        
Updated by janosch-x (Janosch Müller) over 2 years ago
regarding the documentation, "Letter" in the upstream doc is also incorrect, so the downstream doc actually has two errors.
as implemented here, word actually matches anything with the Alphabetic property (effectively a superset of the Letter category, with roughly 1600 additional characters).
demonstration:
%w[
  word
  letter
  mark
  decimal_number
  connector_punctuation
  alpha
].select { |p| eval("/\\p{#{p}}/ =~ ?Ⅷ") } # roman eight
# => ["word", "alpha"]
a better wording might be:
A character with the <i>_Alphabetic_</i> unicode property or one of the following Unicode general categories: <i>Mark</i>, <i>Decimal\_Number</i>,
  <i>Connector\_Punctuation</i>
regarding the behavior, i think it could be changed to match number instead of decimal_number. some scripts (e.g. Malayalam) have characters for numbers higher than 9, and these currently interrupt a match (e.g. the Malayalam 9 is matched but the 10 is not). this change would also make word match fractions and superscripts such as the one mentioned by OP ("²"). to me, this would seem like the less surprising behavior.
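The Malayalam case can be reproduced with a short sketch. It assumes the current behavior described in this report; the constant names, including the wider `WORD_OR_NUMBER` class, are hypothetical and only illustrate what matching number instead of decimal_number would change.

```ruby
MALAYALAM_NINE = "\u0D6F" # MALAYALAM DIGIT NINE, a Decimal_Number
MALAYALAM_TEN  = "\u0D70" # MALAYALAM NUMBER TEN, an Other_Number

text = "#{MALAYALAM_NINE}#{MALAYALAM_TEN}"

# With the current definition the match stops after the digit.
puts text[/\p{Word}+/] == MALAYALAM_NINE

# Hypothetical wider class that also accepts every Number character.
WORD_OR_NUMBER = /[\p{Word}\p{Number}]+/
puts text[WORD_OR_NUMBER] == text
```

The same wider class would also accept fractions such as "½" and superscripts such as "²", which is the behavior change suggested above.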
        
Updated by naruse (Yui NARUSE) over 2 years ago
      The document is wrong. The definition of word is defined in Unicode® Technical Standard #18 UNICODE REGULAR EXPRESSIONS.
https://unicode.org/reports/tr18/#word
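As a rough check against that definition, the built-in class can be compared with one assembled from its parts. This is only a sketch: it assumes the standard's word class is built from the Alphabetic property, Mark, Decimal_Number, Connector_Punctuation and Join_Control, and it deliberately leaves Join_Control out, so `WORD_SKETCH` (a hypothetical name) is an approximation rather than a rendering of the standard.

```ruby
# Hypothetical approximation of "word", built from properties Ruby exposes.
# Join_Control is intentionally omitted from this sketch.
WORD_SKETCH = /[\p{Alpha}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}]/

# Letter, digit, underscore, combining acute accent, roman numeral eight,
# superscript two, space.
SAMPLES = ["a", "5", "_", "\u0301", "\u2167", "\u00B2", " "]

SAMPLES.each do |char|
  builtin = /\p{Word}/.match?(char)
  sketch  = WORD_SKETCH.match?(char)
  puts "U+%04X builtin=%s sketch=%s" % [char.ord, builtin, sketch]
end
```

For these samples the two columns should agree, including the roman numeral (Alphabetic but not a Letter) and the superscript (a Number but not a Decimal_Number).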
        
Updated by jeremyevans0 (Jeremy Evans) over 2 years ago
naruse (Yui NARUSE) wrote in #note-3:
The document is wrong. The definition of word is defined in Unicode® Technical Standard #18 UNICODE REGULAR EXPRESSIONS.
https://unicode.org/reports/tr18/#word
I've updated my pull request to match the description in the standard linked by @naruse (Yui NARUSE). @janosch-x or @naruse (Yui NARUSE), could you review?
        
Updated by jeremyevans (Jeremy Evans) almost 2 years ago
      - Status changed from Open to Closed
Applied in changeset git|060f14bf62ad3f426a6666901c45b82d4334fa26.
Update documentation for [[:word:]] and \p{Word} in regexps
Onigmo uses Decimal_Number and not Number for these.
Fixes [Bug #19417]
        
Updated by mame (Yusuke Endoh) 4 months ago
      - Related to Bug #21503: \p{Word} does not match on \p{Join_Control} while docs say it does added