Bug #10891
closed
/[[:punct:]]/ POSIX group broken (with string literals?)
Description
The regular expression: /[[:punct:]]/
should match the following characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
However, it only works for these characters:
! " # % & ' ( ) * , - . / : ; ? @ [ \ ] _ { }
And does not work for these characters:
$ + < = > ^ ` | ~
However, this is where it gets really weird... Consider the following:
60.chr == "<" # true
60.chr =~ /[[:punct:]]/ # => 0
"<" =~ /[[:punct:]]/ # => nil
So, it seems that the regular expression only fails for string literals!
Updated by nobu (Nobuyoshi Nakada) almost 10 years ago
- Description updated (diff)
It occurs with UTF-8 encoding only.
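The encoding dependence can be checked directly. The following is a minimal reproduction sketch, not part of the original report: `ascii_punct` and `punct_partition` are hypothetical helper names, and the characters reported as unmatched under UTF-8 will vary with the Ruby version in use.

```ruby
# The 32 ASCII punctuation characters listed in the description.
# Integer#chr returns US-ASCII strings for these code points.
def ascii_punct
  [33..47, 58..64, 91..96, 123..126].flat_map(&:to_a).map(&:chr)
end

# Hypothetical helper: split chars into [matching, non-matching] for
# /[[:punct:]]/ after tagging a copy of each one with the given encoding.
def punct_partition(chars, encoding)
  chars.partition do |c|
    !(c.dup.force_encoding(encoding) =~ /[[:punct:]]/).nil?
  end
end

[Encoding::US_ASCII, Encoding::UTF_8].each do |enc|
  _matched, unmatched = punct_partition(ascii_punct, enc)
  puts "#{enc.name}: unmatched = #{unmatched.join(' ')}"
end
```

On an affected Ruby, the UTF-8 line should list the nine characters from the description, while the US-ASCII line stays empty.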
Updated by tom-lord (Tom Lord) almost 10 years ago
Nobuyoshi Nakada wrote:
It occurs with UTF-8 encoding only.
Ahhhhh, of course - that's what the difference between 60.chr and "<" is!
Like you said, the issue only affects UTF-8 encodings:
#<Encoding:UTF-8>, #<Encoding:UTF8-MAC>, #<Encoding:UTF8-DoCoMo>, #<Encoding:UTF8-KDDI>, #<Encoding:UTF8-SoftBank>
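The two strings compare equal because `==` ignores the encoding tag for ASCII-only content, even though the tags differ. A small sketch of that difference follows. The literal is explicitly forced to UTF-8 to stand in for a literal in a UTF-8 source file, which is an assumption about how the original script was run.

```ruby
# Integer#chr yields a US-ASCII string for code points below 128.
from_chr = 60.chr
# Stand-in for a "<" literal in a UTF-8 source file (assumption).
literal = "<".dup.force_encoding(Encoding::UTF_8)

puts from_chr.encoding.name  # US-ASCII
puts literal.encoding.name   # UTF-8
puts from_chr == literal     # true

# Encodings whose names start with "UTF-8" or "UTF8"; the exact list
# depends on the Ruby build.
utf8_family = Encoding.list.map(&:name).select { |n| n.start_with?("UTF-8", "UTF8") }
puts utf8_family.join(", ")
```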
Updated by tom-lord (Tom Lord) almost 10 years ago
On further investigation, this is a known issue in Onigmo (Ruby 2.x's regexp engine).
However, it was apparently "fixed" way back in 2006: https://github.com/k-takata/Onigmo/blob/d0b3173893b9499a4e53ae1da16ba76c06d85571/HISTORY#L584-585 (Note: I can't find a reference to any Oniguruma/Onigmo source control dating back this far, to see the actual commit)
...And yet, it remains an open issue: https://github.com/k-takata/Onigmo/issues/42
Updated by shugo (Shugo Maeda) about 9 years ago
- Assignee changed from core to naruse (Yui NARUSE)
How about interpreting [[:punct:]] as [\p{P}\p{S}] for Unicode strings, so that [[:punct:]] is a superset of the POSIX class?
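This proposal works because the nine missing ASCII characters all fall in the Unicode Symbol category (S) rather than Punctuation (P). The check below is only a sketch: `ascii_punct_utf8` and `classify_punct` are hypothetical helper names, and the `u` flag is used to pin the patterns to UTF-8.

```ruby
# The 32 ASCII punctuation characters as UTF-8 strings.
def ascii_punct_utf8
  [33..47, 58..64, 91..96, 123..126].flat_map(&:to_a).map { |n| n.chr(Encoding::UTF_8) }
end

# Hypothetical helper: group characters by Unicode general category.
def classify_punct(chars)
  chars.group_by do |c|
    if c =~ /\p{P}/u
      :punctuation
    elsif c =~ /\p{S}/u
      :symbol
    else
      :other
    end
  end
end

groups = classify_punct(ascii_punct_utf8)
covered = ascii_punct_utf8.all? { |c| c =~ /[\p{P}\p{S}]/u }

puts "P: #{groups[:punctuation].join(' ')}"
puts "S: #{groups[:symbol].join(' ')}"   # $ + < = > ^ ` | ~
puts "all covered: #{covered}"           # true
```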
Updated by naruse (Yui NARUSE) about 9 years ago
- Status changed from Open to Feedback
It follows UTR#18's Standard Recommendation.
http://www.unicode.org/reports/tr18/#punct
Updated by shugo (Shugo Maeda) about 9 years ago
Yui NARUSE wrote:
It follows UTR#18's Standard Recommendation.
http://www.unicode.org/reports/tr18/#punct
In general, it would be a reasonable choice.
However, in Ruby, the problem is that it's hard to guess the programmer's intention from the code,
because the behavior is decided not by the regular expression, but by the encoding of the target string.
def do_something(s)
  ...
  if /[[:punct:]]/ =~ s # should "<" match, or shouldn't?
    ...
  end
  ...
end
If you want to reject symbols, /\p{P}/ can be used instead, and it's more readable.
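One way to make the intent explicit, regardless of the target string's encoding, is to spell out the character set in the pattern itself. The sketch below is an illustration only: `ascii_punct?` and `unicode_punct?` are hypothetical helpers, and the explicit-range pattern is an assumption rather than something proposed in this thread.

```ruby
# Hypothetical helper: true if s contains any of the 32 ASCII punctuation
# characters. The ranges are ! to /, : to @, [ to `, and { to ~, so the
# result does not depend on how [[:punct:]] is interpreted.
def ascii_punct?(s)
  !(s =~ /[!-\/:-@\[-`{-~]/).nil?
end

# Hypothetical helper: true if s contains Unicode punctuation (category P),
# which excludes symbols such as "<". The string is transcoded to UTF-8
# first, and the u flag pins the pattern to UTF-8.
def unicode_punct?(s)
  !(s.encode(Encoding::UTF_8) =~ /\p{P}/u).nil?
end

puts ascii_punct?("a < b")    # true
puts unicode_punct?("a < b")  # false
puts unicode_punct?("a, b")   # true
```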
Updated by jeremyevans0 (Jeremy Evans) over 5 years ago
- Status changed from Feedback to Closed
This was apparently fixed between Ruby 2.3 and 2.4:
$ ruby23 -e 'p("<".force_encoding("UTF-8") =~ /[[:punct:]]/)'
nil
$ ruby24 -e 'p("<".force_encoding("UTF-8") =~ /[[:punct:]]/)'
0