Bug #19417: Regexp \p{Word} and [[:word:]] do not match Unicode Other_Number character - Ruby - Ruby Issue Tracking System


Added by ObjectBoxPC (Philip Chung) about 3 years ago. Updated over 2 years ago.

Status:

Closed

Assignee:

Target version:

ruby -v:

3.2.0

Backport:

2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN

[ruby-core:112223]

Description

According to the documentation for Regexp, \p{Word} and [[:word:]] both match a character in one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation. However, neither matches U+00B2, which is in the Other_Number category (which is a subcategory of Number).

# U+00B2 is SUPERSCRIPT TWO (general category Other_Number)
puts "Ruby version: %s" % RUBY_VERSION
puts "\\p{Word} matches? %s" % /\p{Word}/u.match?("\u00B2")
puts "[[:word:]] matches? %s" % /[[:word:]]/u.match?("\u00B2")
puts "Is a Number character? %s" % /\p{Number}/u.match?("\u00B2")
puts "Is an Other_Number character? %s" % /\p{Other_Number}/u.match?("\u00B2")

Expected output:

Ruby version: 3.2.0
\p{Word} matches? true
[[:word:]] matches? true
Is a Number character? true
Is an Other_Number character? true

Actual output:

Ruby version: 3.2.0
\p{Word} matches? false
[[:word:]] matches? false
Is a Number character? true
Is an Other_Number character? true

I notice that the upstream Onigmo documentation defines the [[:word:]] class as "Letter | Mark | Decimal_Number | Connector_Punctuation", meaning that among number characters it matches only decimal digits (which excludes U+00B2). I am not sure how \p{Word} is defined, though. Perhaps the documentation needs to be changed?
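To make the distinction concrete, here is a minimal sketch that checks which of the properties named above match an ASCII digit and U+00B2. The variable names (samples, properties, results) are illustrative only, and the results in the usage note are those implied by the report above rather than a guarantee for every Ruby version.

```ruby
# Compare how each property treats DIGIT TWO and SUPERSCRIPT TWO (U+00B2).
samples = { "digit two" => "2", "superscript two" => "\u00B2" }
properties = %w[Word Number Decimal_Number Other_Number]

# For each sample, collect the names of the properties that match it.
results = samples.to_h do |label, char|
  matched = properties.select { |prop| Regexp.new("\\p{#{prop}}").match?(char) }
  [label, matched]
end

results.each { |label, matched| puts "#{label}: #{matched.join(', ')}" }
```

On the Ruby version from this report, the digit should be listed under Word, Number and Decimal_Number, while the superscript should be listed only under Number and Other_Number.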

Related issues 1 (0 open — 1 closed)

Updated by jeremyevans0 (Jeremy Evans) about 3 years ago
#1 [ruby-core:112341]

Assuming this is a documentation bug, I submitted a pull request to fix it: https://github.com/ruby/ruby/pull/7287

Updated by janosch-x (Janosch Müller) about 3 years ago
#2 [ruby-core:112396]

regarding the documentation, letter in the upstream doc is also incorrect, so the downstream doc actually has two errors.

as implemented here, word actually matches anything with the alphabetic property (effectively a superset of the letter category, comprising roughly 1600 more characters).

demonstration:

%w[
  word
  letter
  mark
  decimal_number
  connector_punctuation
  alpha
].select { |p| eval("/\\p{#{p}}/ =~ ?Ⅷ") } # roman eight
# => ["word", "alpha"]

a better wording might be:

A character with the <i>_Alphabetic_</i> unicode property or one of the following Unicode general categories: <i>Mark</i>, <i>Decimal\_Number</i>,
  <i>Connector\_Punctuation</i>

regarding the behavior, i think it could be changed to match number instead of decimal_number. some scripts (e.g. Malayalam) have characters for numbers higher than 9, and these disrupt matching at the moment (e.g. the Malayalam 9 is matched but the 10 is not). this change would also make word match fractions and superscripts such as the one mentioned by OP ("²"). to me, this would be the less surprising behavior.
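The Malayalam case can be checked with a short sketch. The code points used are U+0D6F (MALAYALAM DIGIT NINE, Decimal_Number) and U+0D70 (MALAYALAM NUMBER TEN, Other_Number). The word_or_number pattern is a hypothetical user-side workaround, not something Ruby provides or that this thread agreed on.

```ruby
nine = "\u0D6F" # MALAYALAM DIGIT NINE (general category Decimal_Number)
ten  = "\u0D70" # MALAYALAM NUMBER TEN (general category Other_Number)

word = /\p{Word}/
# Hypothetical workaround: widen the class to the whole Number category.
word_or_number = /[\p{Word}\p{Number}]/

[nine, ten, "\u00B2"].each do |char|
  puts format("U+%04X word=%s word_or_number=%s",
              char.ord, word.match?(char), word_or_number.match?(char))
end
```

Code that needs every numeric character treated as part of a word can use such an explicit union without any change to \p{Word} itself.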

Updated by naruse (Yui NARUSE) about 3 years ago
#3 [ruby-core:112759]

The documentation is wrong. word is defined in Unicode® Technical Standard #18, Unicode Regular Expressions.
https://unicode.org/reports/tr18/#word
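For reference, the compatibility-properties table in UTS #18 describes word as the union of Alphabetic, Mark, Decimal_Number, Connector_Punctuation and Join_Control. The sketch below composes the first four into one character class and compares it with \p{Word} on a few sample characters. It leaves Join_Control out on purpose, since that part is the subject of the related Bug #21503. The names composed, samples and agreement are illustrative, and a handful of samples is only a spot check.

```ruby
# Union of the properties UTS #18 lists for "word", minus Join_Control
# (assumption: omitted here because of the related Bug #21503).
composed = /[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}]/

samples = {
  "a"      => "LATIN SMALL LETTER A",
  "\u2167" => "ROMAN NUMERAL EIGHT",
  "\u0301" => "COMBINING ACUTE ACCENT",
  "7"      => "DIGIT SEVEN",
  "_"      => "LOW LINE",
  "\u00B2" => "SUPERSCRIPT TWO",
  "-"      => "HYPHEN-MINUS",
}

# true for each sample on which the composed class and \p{Word} agree
agreement = samples.keys.map { |char| composed.match?(char) == /\p{Word}/.match?(char) }

samples.each do |char, name|
  puts format("%-24s composed=%-5s word=%s", name, composed.match?(char), /\p{Word}/.match?(char))
end
```

If the two classes disagree on any sample, the agreement array will contain false for that entry, which makes it easy to extend the sample list when checking other characters.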

Updated by jeremyevans0 (Jeremy Evans) about 3 years ago
#4 [ruby-core:112999]

naruse (Yui NARUSE) wrote in #note-3:

The documentation is wrong. word is defined in Unicode® Technical Standard #18, Unicode Regular Expressions.
https://unicode.org/reports/tr18/#word

I've updated my pull request to match the description in the standard linked by @naruse (Yui NARUSE). @janosch-x (Janosch Müller) or @naruse (Yui NARUSE), could you review?

Updated by jeremyevans (Jeremy Evans) over 2 years ago
#5

Status changed from Open to Closed

Applied in changeset git|060f14bf62ad3f426a6666901c45b82d4334fa26.

Update documentation for [[:word:]] and \p{Word} in regexps

Onigmo uses Decimal_Number and not Number for these.

Fixes [Bug #19417]

Updated by mame (Yusuke Endoh) 10 months ago
#6

Related to Bug #21503: \p{Word} does not match on \p{Join_Control} while docs say it does added
