Bug #14137


Windows / MinGW - Regexp - Character Properties - General Category

Added by MSP-Greg (Greg L) over 3 years ago. Updated 1 day ago.

Target version:
ruby -v:
ruby 2.5.0dev (2017-11-28 trunk 60925) [x64-mingw32]


While testing RDoc on AppVeyor with the recently added literals.kpeg file, I got several errors across Ruby versions 2.2 through trunk.

It seems that many of the \p{} constructs listed in the documentation under 'General Category' raise an "invalid character property name {**}" error.

Conversely, the constructs listed before them (e.g. \p{Alpha}, \p{Lower}, \p{Space}) seem to work.

I briefly looked at the regexp tests, and they don't seem to test these.

Are these unavailable on Windows?

Updated by duerst (Martin Dürst) over 3 years ago

There is a C preprocessor flag USE_UNICODE_PROPERTIES that is used e.g. in enc/unicode/10.0.0/name2ctype.h. I have never actually seen this, but it may be possible that your version of Ruby is compiled without this flag on. I don't see any reason why this should be Windows-specific; these properties are useful independent of the OS.

Updated by jeremyevans0 (Jeremy Evans) about 2 months ago

  • Status changed from Open to Closed

I tested this using RubyInstaller versions on Windows. This appears related to regexp encoding, and not a bug, with the same behavior between Ruby 2.0 and 3.0:

C:\>c:\Ruby30-x64\bin\ruby -e "p(/\p{L}/.match('a'))"
-e:1: invalid character property name {L}: /\p{L}/

C:\>c:\Ruby30-x64\bin\ruby -e "p(/\p{L}/u.match('a'))"
#<MatchData "a">

C:\>c:\Ruby30-x64\bin\ruby -Ku -e "p(/\p{L}/.match('a'))"
#<MatchData "a">

C:\>c:\Ruby200-x64\bin\ruby -e "p(/\p{L}/.match('a'))"
-e:1: invalid character property name {L}: /\p{L}/

C:\>c:\Ruby200-x64\bin\ruby -e "p(/\p{L}/u.match('a'))"
#<MatchData "a">

C:\>c:\Ruby200-x64\bin\ruby -Ku -e "p(/\p{L}/.match('a'))"
#<MatchData "a">

The documentation for this feature says: "A Unicode character's General Category value can also be matched", which I think implies that this should only work for Unicode regexps, and not for other regexps. So I think the current behavior is expected and not a bug.
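The dependence on regexp encoding can be reproduced on any platform by building the same pattern from strings tagged with different encodings. This is only a minimal sketch: try_general_category is a hypothetical helper, not part of Ruby, and Windows-1252 is an assumed stand-in for whatever non-Unicode code page the Windows console is using.

```ruby
# Hypothetical helper (not part of Ruby): compile '\p{L}' from a pattern
# string tagged with the given encoding and report what happened.
def try_general_category(encoding_name)
  # Single quotes keep the backslash, so the pattern text is literally \p{L}.
  source = '\p{L}'.dup.force_encoding(encoding_name)
  regexp = Regexp.new(source)
  "compiled, regexp encoding: #{regexp.encoding}"
rescue RegexpError => e
  "RegexpError: #{e.message}"
end

# A Unicode encoding is expected to accept the General Category property.
puts try_general_category('UTF-8')

# A single-byte, non-Unicode encoding (assumed stand-in for the console
# code page) is expected to reject it.
puts try_general_category('Windows-1252')
```

The second call should report the same kind of "invalid character property name" error as the transcripts above, while the first should compile.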

Updated by duerst (Martin Dürst) 1 day ago

I agree with jeremyevans0 (Jeremy Evans), but would like to add that

ruby -e 'p (/\p{L}/.match("a"))'

will also produce #<MatchData "a"> in any environment that uses UTF-8. That is the case on almost all current Linux/Unix systems, and also on Windows if you first run the command chcp 65001.
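To check which situation a given environment is in, a short diagnostic along these lines may help. It is a sketch, not an official recipe: it prints the encodings Ruby picked up and uses the /u flag from the transcripts above to pin the regexp to UTF-8 regardless of the locale. The variable name letter is arbitrary.

```ruby
# The /u flag fixes the regexp encoding to UTF-8, independent of the
# source encoding that -e or the locale would otherwise supply.
letter = /\p{L}/u

puts "source encoding: #{__ENCODING__}"
puts "locale charmap:  #{Encoding.locale_charmap}"
puts "regexp encoding: #{letter.encoding}"
puts "matches 'a'?     #{letter.match?('a')}"
```

If the source encoding shown is not UTF-8, a plain /\p{L}/ passed via -e will likely fail as in the original report, whereas the /u form should still work.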

