Bug #564

Regexp fails on UTF-16 & UTF-32 character encodings

Added by Michael Selig almost 7 years ago. Updated over 4 years ago.

[ruby-core:18594]
Status:Rejected
Priority:Normal
Assignee:-
ruby -v: Backport:

Description

=begin
UTF-16 & UTF-32 (and maybe other non-ASCII-compatible encodings) don't seem to work as Regexp patterns.

Regexp.new("abc".encode("UTF-16BE"))
==> EncodingCompatibilityError: incompatible character encodings: US-ASCII and UTF-16BE
=end

History

#1 Updated by Yukihiro Matsumoto almost 7 years ago

  • Status changed from Open to Rejected


#2 Updated by Yui NARUSE almost 7 years ago

=begin
Hi,

James Gray wrote:

On Sep 15, 2008, at 3:49 AM, Michael Selig wrote:

On Mon, 15 Sep 2008 18:08:14 +1000, Tanaka Akira akr@fsij.org wrote:

In article ,
Michael Selig redmine@ruby-lang.org writes:

UTF-16 & UTF-32 (and maybe other non-ASCII-compatible encodings)
don't seem to work as Regexp patterns.

Regexp.new("abc".encode("UTF-16BE"))
==> EncodingCompatibilityError: incompatible character encodings:
US-ASCII and UTF-16BE

% ruby -ve 'p Regexp.new("abc".encode("UTF-16BE")) =~ "abc".encode("UTF-16BE")'
ruby 1.9.0 (2008-09-15 revision 19356) [i686-linux]
0

I see, I misdiagnosed the problem. I was using irb.

ruby -ve 'p Regexp.new("abc".encode("UTF-16BE"))'
ruby 1.9.0 (2008-09-03 revision 19073) [i686-linux]
-e:1:in `p': incompatible character encodings: UTF-16BE and ASCII-8BIT (EncodingCompatibilityError)
        from -e:1:in `<main>'

This is the error I was getting in irb, and I mistakenly assumed it
was from the Regexp::new.
It is a different problem - not as bad as I thought!

So it's inspect() that has the issues, right?

Yes, the cause of this problem is Regexp#inspect.
A patch follows.

--- re.c (revision 19371)
+++ re.c (working copy)
@@ -381,7 +381,7 @@ rb_reg_desc(const char *s, long len, VAL
 {
     VALUE str = rb_str_buf_new2("/");
 
-    rb_enc_copy(str, re);
+    rb_enc_associate(str, rb_usascii_encoding());
     rb_reg_expr_str(str, s, len);
     rb_str_buf_cat2(str, "/");
     if (re) {

The result of Regexp#inspect is only used to view the content of a regexp while debugging,
so there may be no reason to keep the original encoding.
# Of course, Regexp#source must keep it.
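
A minimal sketch of the distinction described above, assuming a Ruby 1.9 or later interpreter: building and matching a UTF-16BE pattern works, and only the encoding of the inspect string is in question. The variable names are illustrative, and the encoding reported for inspect depends on the Ruby build in use, so no value is claimed for it here.

```ruby
# Build a pattern and a subject in a non-ASCII-compatible encoding.
pattern = "abc".encode("UTF-16BE")
subject = "abc".encode("UTF-16BE")

re = Regexp.new(pattern)            # construction itself does not raise
match_index = (re =~ subject)       # character index of the match

source_encoding  = re.source.encoding    # the source keeps the pattern's encoding
inspect_encoding = re.inspect.encoding   # debug string; encoding depends on the Ruby build

puts "match index:      #{match_index.inspect}"
puts "source encoding:  #{source_encoding}"
puts "inspect encoding: #{inspect_encoding}"
```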

Anyway, Regexp#to_s is an alias of Regexp#source now.
But Regexp#inspect is more readable.
How about making Regexp#to_s an alias of Regexp#inspect?

  r1 = /ab+c/ix           #=> /ab+c/ix
  s1 = r1.to_s            #=> "(?ix-m:ab+c)"
  r2 = Regexp.new(s1)     #=> /(?ix-m:ab+c)/
  r1 == r2                #=> false
  r1.source               #=> "ab+c"
  r2.source               #=> "(?ix-m:ab+c)"
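
As a rough illustration of what is at stake in the proposal (the names `from_to_s` and `from_inspect` are hypothetical), the sketch below feeds both string forms back into Regexp.new. The to_s form carries its options inline, while the inspect form is meant for reading, so its slashes and trailing flags are treated as literal pattern text.

```ruby
r1 = /ab+c/ix

from_to_s    = Regexp.new(r1.to_s)      # "(?ix-m:ab+c)" carries the options inline
from_inspect = Regexp.new(r1.inspect)   # "/ab+c/ix" is taken literally, slashes included

to_s_hit    = (from_to_s =~ "ABBC")         # options preserved, so this still matches
inspect_hit = (from_inspect =~ "ABBC")      # no match: the pattern now needs a literal "/"
literal_hit = (from_inspect =~ "/abbc/ix")  # matches only the literal text

p to_s_hit, inspect_hit, literal_hit
```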

--
NARUSE, Yui naruse@airemix.jp

=end

#3 Updated by Yukihiro Matsumoto almost 7 years ago

=begin
Hi,

In message "Re: Re: [Bug #564] Regexp fails on UTF-16 & UTF-32 character encodings"
on Tue, 16 Sep 2008 04:53:18 +0900, "NARUSE, Yui" naruse@airemix.jp writes:

|> So it's inspect() that has the issues, right?
|
|Yes, the cause of this problem is Regexp#inspect.
|A patch follows.

Can you commit?

                        matz.

=end
