Bug #7501

\w in a regular expression doesn't match international characters

Added by Tomas Partl over 1 year ago. Updated over 1 year ago.

[ruby-core:50516]
Status:Rejected
Priority:Normal
Assignee:-
Category:core
Target version:-
ruby -v:ruby 1.9.3p0 (2011-10-30 revision 33570) [i686-linux]
Backport:

Description

When using regexp matching, \w doesn't match characters which are not in the English alphabet.
For example, the characters "žščřďťňaáéíóůúý" should all be matched by \w but aren't.

This program demonstrates the bug:


# encoding: utf-8

match = /\w+/.match( "abcdefghijklmnopqrstuvwxyz" )
puts match.to_s

match = /\w+/.match( "áéíóůúýžščřďťň" ) # some Czech characters
puts match.to_s

match = /\w+/.match( "üäö" ) # some German characters
puts match.to_s

Expected output:

abcdefghijklmnopqrstuvwxyz
áéíóůúýžščřďťň
üäö

Actual output:

abcdefghijklmnopqrstuvwxyz


History

#1 Updated by Charlie Somerville over 1 year ago

/[[:alpha:]]+/ should behave as you expect
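
A minimal sketch of the suggestion above, assuming it refers to the POSIX bracket expression `[[:alpha:]]`. The `\p{L}` and `[[:word:]]` lines are additional alternatives that were not mentioned in this thread, and the sample string with the underscore and digits is made up for illustration:

```ruby
# encoding: utf-8

czech  = "áéíóůúýžščřďťň"
german = "üäö"

# \w only covers ASCII word characters here, so nothing matches.
ascii_word = czech[/\w+/]

# POSIX bracket expressions are Unicode-aware for UTF-8 strings.
alpha_czech  = czech[/[[:alpha:]]+/]
alpha_german = german[/[[:alpha:]]+/]

# Assumption: \p{L} (any Unicode letter) as an equivalent alternative.
letters = czech[/\p{L}+/]

# Assumption: [[:word:]] as the closest Unicode-aware stand-in for \w
# (letters, marks, digits and connector punctuation such as "_").
# The sample string is hypothetical.
word_like = "žluťoučký_kůň42"[/[[:word:]]+/]

puts ascii_word.inspect
puts alpha_czech
puts alpha_german
puts letters
puts word_like
```

If the goal is to keep `\w` semantics (letters plus digits and underscore) rather than letters only, `[[:word:]]` is probably the closer replacement than `[[:alpha:]]`.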

#2 Updated by Shyouhei Urabe over 1 year ago

  • Status changed from Open to Rejected

If I remember correctly, this is an intentional design. As new Unicode versions are released, the definition of what is a word character and what is not changes from time to time. It is hard for us to follow that.
