Bug #11859: Regexp matching with \p{Upper} and \p{Lower} for EUC-JP doesn’t work. - Ruby - Ruby Issue Tracking System

Bug #11859

closed

Regexp matching with \p{Upper} and \p{Lower} for EUC-JP doesn’t work.

Added by matsui (Kimihito Matsui) over 10 years ago. Updated about 10 years ago.

Status:

Rejected

Assignee:

Target version:

ruby -v:

ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-darwin14]

Backport:

2.0.0: UNKNOWN, 2.1: UNKNOWN, 2.2: UNKNOWN

[ruby-dev:49454]

Description

U+FF21 (Ａ, FULLWIDTH LATIN CAPITAL LETTER A) and U+00C0 (À, LATIN CAPITAL LETTER A WITH GRAVE) are both Uppercase_Letter, so the match should return 0 in the following cases, but it returns 1.

ruby -e 'puts "\uFF21A".encode("EUC-JP") =~ Regexp.compile("\\\p{Upper}".encode("EUC-JP"))' # => 1
ruby -e 'puts "\u00C0A".encode("EUC-JP") =~ Regexp.compile("\\\p{Upper}".encode("EUC-JP"))' # => 1

The same happens with lowercase matching.

ruby -e 'puts "\uFF41a".encode("EUC-JP") =~ Regexp.compile("\\\p{Lower}".encode("EUC-JP"))' # => 1

With a Unicode encoding, it works as expected:

ruby -e 'puts "\uFF21A" =~ Regexp.compile("\\\p{Upper}")'  # => 0

It looks like \p{Upper} and \p{Lower} in EUC-JP regexps match only ASCII characters.
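For side-by-side comparison, here is a minimal sketch that runs the same match both directly on the EUC-JP string and after transcoding it to UTF-8. The helper name `first_upper_index_utf8` is hypothetical (it is not part of Ruby), and transcoding before matching is only one possible way to get Unicode case properties, not something proposed in this report.

```ruby
# Hypothetical helper (not part of Ruby): transcode to UTF-8 first, then
# return the index of the first \p{Upper} match (or nil if there is none).
def first_upper_index_utf8(str)
  str.encode("UTF-8") =~ /\p{Upper}/
end

# FULLWIDTH LATIN CAPITAL LETTER A followed by ASCII "A", stored as EUC-JP.
euc = "\uFF21A".encode("EUC-JP")

# Matching directly in EUC-JP, as in the report above.
p(euc =~ Regexp.compile("\\p{Upper}".encode("EUC-JP")))

# Matching after transcoding to UTF-8, where Unicode properties apply.
p first_upper_index_utf8(euc)
```

The first line printed is the behavior reported in this ticket; the second uses the UTF-8 property tables, so the fullwidth letter at index 0 is found.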

Updated by matsui (Kimihito Matsui) over 10 years ago
#1 [ruby-dev:49455]

Description updated (diff)

Updated by matsui (Kimihito Matsui) over 10 years ago
#2 [ruby-dev:49456]

Description updated (diff)

Updated by naruse (Yui NARUSE) about 10 years ago
#3 [ruby-dev:49663]

Status changed from Open to Rejected

Ruby doesn't have case tables for non-Unicode encodings.

EUC-JP is also a legacy encoding, and I don't think such encodings should be extended.

Updated by duerst (Martin Dürst) about 10 years ago
#4 [ruby-dev:49664]

Some additional comments following up on the committers' meeting yesterday:

There are many single-byte non-Unicode encodings that have case tables.

Checking the paper versions of the standards in question, À (LATIN CAPITAL LETTER A WITH GRAVE) exists in JIS X 0212-1990 at position (区点) 10-2, and in JIS X 0213-2004 at position 9-23 on the first plane (面). JIS X 0213-2004 is the version I have at hand, but that character didn't change from the -2000 version.

Checking the actual encoding of À in EUC-JP in Ruby shows the following:

$ ruby -e 'puts "\u00C0".encode("EUC-JP").b.inspect'
"\x8F\xAA\xA2"

This is clearly the JIS X 0212-1990 version, using SS3 (0x8F) to switch to the JIS X 0212 plane at G3. The 1990 version of JIS X 0212 is the first one, so the À character didn't exist in EUC-JP before.
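The byte sequence above can be decoded by hand. A minimal sketch, assuming the usual EUC-JP convention that the two bytes after SS3 are the row (区) and cell (点) numbers each offset by 0xA0; the helper name `jisx0212_kuten` is hypothetical:

```ruby
# Hypothetical helper: given the bytes of one EUC-JP character, return
# [row, cell] if it is a three-byte SS3 sequence (JIS X 0212), else nil.
def jisx0212_kuten(euc_bytes)
  ss3 = 0x8F
  return nil unless euc_bytes.size == 3 && euc_bytes[0] == ss3
  # Row and cell are stored with an offset of 0xA0.
  [euc_bytes[1] - 0xA0, euc_bytes[2] - 0xA0]
end

bytes = "\u00C0".encode("EUC-JP").bytes
kuten = jisx0212_kuten(bytes)
puts kuten.join("-") if kuten
```

For the bytes 0x8F 0xAA 0xA2 this gives row 0xAA - 0xA0 = 10 and cell 0xA2 - 0xA0 = 2, which agrees with position 10-2 in JIS X 0212-1990 mentioned above.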

Updated by duerst (Martin Dürst) almost 9 years ago
#5

Related to Feature #13770: Can't create valid Cyrillic-named class/module added
