Bug #7501

closed

\w in a regular expression doesn't match international characters

Added by eltomito (Tomas Partl) over 11 years ago. Updated over 11 years ago.

Status:
Rejected
Assignee:
-
Target version:
-
ruby -v:
ruby 1.9.3p0 (2011-10-30 revision 33570) [i686-linux]
Backport:
[ruby-core:50516]

Description

When using regexp matching, \w doesn't match characters which are not in the English alphabet.
For example, the characters "žščřďťňaáéíóůúý" should all be matched by \w but aren't.

This program demonstrates the bug:


# encoding: utf-8

match = /\w+/.match( "abcdefghijklmnopqrstuvwxyz" )
puts match.to_s

match = /\w+/.match( "áéíóůúýžščřďťň" ) #some Czech characters
puts match.to_s

match = /\w+/.match( "üäö" ) #some German characters
puts match.to_s

Expected output:

abcdefghijklmnopqrstuvwxyz
áéíóůúýžščřďťň
üäö

Actual output:

abcdefghijklmnopqrstuvwxyz


Updated by Anonymous over 11 years ago

/[[:alpha:]]+/ should behave as you expect
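A minimal sketch of that suggestion, run against the strings from the report. It assumes a UTF-8 source file and precomposed characters. `first_match` is a hypothetical helper written for this example, not part of Ruby, and `\p{L}` (the Unicode "Letter" property) is shown only as another Unicode-aware option.

```ruby
# encoding: utf-8

czech  = "áéíóůúýžščřďťň"
german = "üäö"

# Hypothetical helper: returns the first matched substring, or nil if nothing matches.
def first_match(pattern, text)
  m = pattern.match(text)
  m ? m.to_s : nil
end

# \w only covers [a-zA-Z0-9_], so it finds nothing in these strings.
p first_match(/\w+/, czech)            # nil

# The POSIX bracket class is Unicode-aware for UTF-8 strings.
puts first_match(/[[:alpha:]]+/, czech)
puts first_match(/[[:alpha:]]+/, german)

# Unicode property syntax: any character in the Letter category.
puts first_match(/\p{L}+/, czech)
```

Note that `[[:alpha:]]` and `\p{L}` cover letters only, so unlike `\w` they do not match digits or the underscore.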

Updated by shyouhei (Shyouhei Urabe) over 11 years ago

  • Status changed from Open to Rejected

If I remember correctly, this is intentional. As new Unicode versions are released, the definition of what is and is not a word character changes from time to time, and it is hard for us to keep up with that.
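To illustrate the consequence of that design: `\w` stays ASCII-only, while a Unicode-aware equivalent has to be requested explicitly. The sketch below uses the POSIX bracket class `[[:word:]]`, which the Ruby Regexp documentation describes as covering Unicode letters, marks, numbers, and connector punctuation. The sample string and variable names are made up for this example, and precomposed UTF-8 characters are assumed.

```ruby
# encoding: utf-8

# Hypothetical sample: Czech letters mixed with underscores and digits.
text = "žluťoučký_kůň_42"

# \w+ breaks the string apart at every non-ASCII letter.
ascii_words = text.scan(/\w+/)

# [[:word:]]+ treats the whole string as one run of word characters.
unicode_words = text.scan(/[[:word:]]+/)

p ascii_words
p unicode_words
```

With `[[:word:]]` the whole sample comes back as a single match, whereas `\w` returns only the ASCII fragments.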
