Bug #19417

Regexp \p{Word} and [[:word:]] do not match Unicode Other_Number character

Added by ObjectBoxPC (Philip Chung) about 1 year ago. Updated 5 months ago.

Status:
Closed
Assignee:
-
Target version:
-
[ruby-core:112223]

Description

According to the documentation for Regexp, \p{Word} and [[:word:]] both match a character in one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation. However, neither matches U+00B2, which is in the Other_Number category (which is a subcategory of Number).

puts "Ruby version: %s" % RUBY_VERSION
puts "\\p{Word} matches? %s" % /\p{Word}/u.match?("\u00B2")
puts "[[:word:]] matches? %s" % /[[:word:]]/u.match?("\u00B2")
puts "Is a Number character? %s" % /\p{Number}/u.match?("\u00B2")
puts "Is an Other_Number character? %s" % /\p{Other_Number}/u.match?("\u00B2")

Expected output:

Ruby version: 3.2.0
\p{Word} matches? true
[[:word:]] matches? true
Is a Number character? true
Is an Other_Number character? true

Actual output:

Ruby version: 3.2.0
\p{Word} matches? false
[[:word:]] matches? false
Is a Number character? true
Is an Other_Number character? true

I notice that the upstream Onigmo library doc defines the [[:word:]] class as "Letter | Mark | Decimal_Number | Connector_Punctuation", meaning that it only matches certain number characters (which would exclude U+00B2). I am not sure how \p{Word} is defined though. But perhaps the documentation needs to be changed?
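
To see why U+00B2 falls outside the Onigmo definition quoted above, one could test it against each listed category separately. This is only a sketch: `CATEGORIES` and `matching_categories` are names made up for the illustration, and Number and Other_Number are added to the list purely for comparison.

```ruby
# Categories from the Onigmo doc's definition of [[:word:]],
# plus Number and Other_Number for comparison (illustrative list).
CATEGORIES = %w[
  Letter Mark Decimal_Number Connector_Punctuation Number Other_Number
].freeze

# Hypothetical helper: returns the listed property names that match +char+.
def matching_categories(char)
  CATEGORIES.select { |name| Regexp.new("\\p{#{name}}").match?(char) }
end

p matching_categories("\u00B2") # SUPERSCRIPT TWO
# => ["Number", "Other_Number"]
```

None of the four categories in the Onigmo definition matches, which is consistent with the actual output reported above.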

Updated by jeremyevans0 (Jeremy Evans) about 1 year ago

Assuming this is a documentation bug, I submitted a pull request to fix it: https://github.com/ruby/ruby/pull/7287

Updated by janosch-x (Janosch Müller) about 1 year ago

regarding the documentation, letter in the upstream doc is also incorrect, so the downstream doc actually has two errors.

as implemented here, word actually matches anything with the alphabetic property (effectively a superset of the letter category, with roughly 1600 additional characters).

demonstration:

%w[
  word
  letter
  mark
  decimal_number
  connector_punctuation
  alpha
].select { |p| eval("/\\p{#{p}}/ =~ ?Ⅷ") } # roman eight
# => ["word", "alpha"]

a better wording might be:

A character with the <i>_Alphabetic_</i> unicode property or one of the following Unicode general categories: <i>Mark</i>, <i>Decimal\_Number</i>,
  <i>Connector\_Punctuation</i>

regarding the behavior, i think it could be changed to match number instead of decimal_number. some scripts (e.g. Malayalam) have characters for numbers higher than 9, and these currently disrupt matching (e.g. the Malayalam 9 is matched but the 10 is not). this change would also make word match fractions and superscripts such as the one mentioned by OP ("²"). to me, this would seem like the less unexpected behavior.
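
The Malayalam case can be sketched as follows. The constant and method names are hypothetical, and `wide_word_char?` is not a Ruby built-in; it only approximates the suggested behavior by adding all of Number to the existing class.

```ruby
MALAYALAM_DIGIT_NINE = "\u0D6F" # general category Decimal_Number (Nd)
MALAYALAM_NUMBER_TEN = "\u0D70" # general category Other_Number (No)

# Current behavior: \p{Word} covers Decimal_Number, not the rest of Number.
def word_char?(char)
  /\A\p{Word}\z/.match?(char)
end

# Hypothetical wider class: \p{Word} plus every Number character.
def wide_word_char?(char)
  /\A[\p{Word}\p{Number}]\z/.match?(char)
end

[MALAYALAM_DIGIT_NINE, MALAYALAM_NUMBER_TEN, "\u00B2"].each do |char|
  printf("U+%04X word=%s wide=%s\n",
         char.ord, word_char?(char), wide_word_char?(char))
end
```

With the current definition only the digit nine counts as a word character; the wider class accepts all three.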

Updated by naruse (Yui NARUSE) about 1 year ago

The document is wrong. The definition of word is defined in Unicode® Technical Standard #18 UNICODE REGULAR EXPRESSIONS.
https://unicode.org/reports/tr18/#word

Updated by jeremyevans0 (Jeremy Evans) about 1 year ago

naruse (Yui NARUSE) wrote in #note-3:

The document is wrong. The definition of word is defined in Unicode® Technical Standard #18 UNICODE REGULAR EXPRESSIONS.
https://unicode.org/reports/tr18/#word

I've updated my pull request to match the description in the standard linked by @naruse (Yui NARUSE). @janosch-x (Janosch Müller) or @naruse (Yui NARUSE), could you review?
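
For reviewers, a rough way to compare the built-in class with a hand-composed one is sketched below. It assumes \p{Word} can be approximated by the union of Alpha, Mark, Decimal_Number and Connector_Punctuation, the components named in this thread; any further properties listed in the standard are not checked. `COMPOSED_WORD`, `BUILTIN_WORD`, `SAMPLES` and `classes_agree?` are illustrative names.

```ruby
# Assumed decomposition, limited to the components discussed in this thread.
COMPOSED_WORD =
  /\A[\p{Alpha}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}]\z/
BUILTIN_WORD = /\A\p{Word}\z/

# Sample characters: letter, digit, underscore, combining acute accent,
# roman numeral eight, superscript two, Malayalam number ten.
SAMPLES = ["a", "5", "_", "\u0301", "\u2167", "\u00B2", "\u0D70"].freeze

# True when both classes give the same answer for +char+.
def classes_agree?(char)
  COMPOSED_WORD.match?(char) == BUILTIN_WORD.match?(char)
end

SAMPLES.each do |char|
  printf("U+%04X builtin=%s composed=%s\n",
         char.ord, BUILTIN_WORD.match?(char), COMPOSED_WORD.match?(char))
end
```

Agreement on a handful of samples does not prove the two classes are identical; it only checks the characters raised in this discussion.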

Updated by jeremyevans (Jeremy Evans) 5 months ago

  • Status changed from Open to Closed

Applied in changeset git|060f14bf62ad3f426a6666901c45b82d4334fa26.


Update documentation for [[:word:]] and \p{Word} in regexps

Onigmo uses Decimal_Number and not Number for these.

Fixes [Bug #19417]
