Bug #7501

closed

\w in a regular expression doesn't match international characters

Added by eltomito (Tomas Partl) over 13 years ago. Updated over 13 years ago.

Status:

Rejected

Assignee:

Target version:

ruby -v:

ruby 1.9.3p0 (2011-10-30 revision 33570) [i686-linux]

Backport:

[ruby-core:50516]

Description

When using regexp matching, \w doesn't match characters which are not in the English alphabet.
For example, the characters "žščřďťňaáéíóůúý" should all be matched by \w but aren't.

This program demonstrates the bug:

# encoding: utf-8

match = /\w+/.match( "abcdefghijklmnopqrstuvwxyz" )
puts match.to_s

match = /\w+/.match( "áéíóůúýžščřďťň" ) # some Czech characters
puts match.to_s

match = /\w+/.match( "üäö" ) # some German characters
puts match.to_s

Expected output:

abcdefghijklmnopqrstuvwxyz
áéíóůúýžščřďťň
üäö

Actual output:

abcdefghijklmnopqrstuvwxyz

Updated by Anonymous over 13 years ago
#1 [ruby-core:50522]

/[[:alpha:]]+/ should behave as you expect
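
A minimal sketch of that suggestion, reusing the strings from the report. It assumes Ruby 1.9 or later and a UTF-8 source file; the variable names are only illustrative, and \p{L} is shown as a related alternative that was not mentioned in this thread.

```ruby
# encoding: utf-8

czech  = "áéíóůúýžščřďťň" # sample strings taken from the report
german = "üäö"

# \w only covers ASCII letters, digits and underscore, so nothing matches here.
p czech[/\w+/]                # nil

# The POSIX bracket class [[:alpha:]] is Unicode-aware.
puts czech[/[[:alpha:]]+/]
puts german[/[[:alpha:]]+/]

# The Unicode property \p{L} (any letter) behaves the same way for these strings.
puts czech[/\p{L}+/]
```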

Updated by shyouhei (Shyouhei Urabe) over 13 years ago
#2 [ruby-core:50537]

Status changed from Open to Rejected

If I remember correctly, this is an intentional design. As new Unicode versions are released, the definition of what is and is not a word character changes from time to time, and it is hard for us to keep up with that.
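
Since \w stays ASCII-only by design, Unicode matching has to be asked for explicitly. One thing to keep in mind is that [[:alpha:]] covers letters only, while \w also covers digits and the underscore; [[:word:]] and \p{Word} are the closer Unicode counterparts. The sketch below assumes Ruby 1.9 or later; the sample string is made up for illustration and does not come from the report.

```ruby
# encoding: utf-8

text = "žluťoučký_kůň_42 běží" # hypothetical sample input

# ASCII-only \w stops at the first non-ASCII letter.
puts text[/\w+/]              # lu

# [[:alpha:]] is Unicode-aware but excludes "_" and digits.
puts text[/[[:alpha:]]+/]     # žluťoučký

# [[:word:]] and \p{Word} include letters, marks, digits and "_".
puts text[/[[:word:]]+/]      # žluťoučký_kůň_42
puts text[/\p{Word}+/]        # žluťoučký_kůň_42
```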
