Bug #564

closed

Regexp fails on UTF-16 & UTF-32 character encodings

Added by mike (Michael Selig) over 16 years ago. Updated over 13 years ago.

Status:
Rejected
Assignee:
-
Target version:
-
ruby -v:
Backport:
[ruby-core:18594]

Description

=begin
UTF-16 & UTF-32 (and maybe other non-ASCII-compatible encodings) don't seem to work as Regexp patterns.

Regexp.new("abc".encode("UTF-16BE"))
==> EncodingCompatibilityError: incompatible character encodings: US-ASCII and UTF-16BE
=end

Actions #1

Updated by matz (Yukihiro Matsumoto) over 16 years ago

  • Status changed from Open to Rejected

=begin

=end

Actions #2

Updated by naruse (Yui NARUSE) over 16 years ago

=begin
Hi,

James Gray wrote:

On Sep 15, 2008, at 3:49 AM, Michael Selig wrote:

On Mon, 15 Sep 2008 18:08:14 +1000, Tanaka Akira wrote:

In article ,
Michael Selig writes:

UTF-16 & UTF-32 (and maybe other non-ASCII-compatible encodings)
don't seem to work as Regexp patterns.

Regexp.new("abc".encode("UTF-16BE"))
==> EncodingCompatibilityError: incompatible character encodings:
US-ASCII and UTF-16BE

% ruby -ve 'p Regexp.new("abc".encode("UTF-16BE")) =~
"abc".encode("UTF-16BE")'
ruby 1.9.0 (2008-09-15 revision 19356) [i686-linux]
0

I see, I have diagnosed the problem wrongly. I was using irb.

ruby -ve 'p Regexp.new("abc".encode("UTF-16BE"))'
ruby 1.9.0 (2008-09-03 revision 19073) [i686-linux]
-e:1:in `p': incompatible character encodings: UTF-16BE and ASCII-8BIT (EncodingCompatibilityError)
	from -e:1:in `<main>'

This is the error I was getting in irb, and I mistakenly assumed it
was from the Regexp::new.
It is a different problem - not as bad as I thought!
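
The diagnosis above can be checked by running the three steps separately. This is only a minimal sketch: the variable names are made up for illustration, and what the inspect step does depends on the Ruby build, so it is wrapped in a rescue rather than assumed to succeed or fail.

```ruby
# Sketch: split construction, matching and inspect to see which step fails.
# Variable names are illustrative only.
pattern = "abc".encode("UTF-16BE")
subject = "abc".encode("UTF-16BE")

re  = Regexp.new(pattern)   # construction: this is not the step that raises
pos = re =~ subject         # matching against a string in the same encoding

p re.encoding
p pos                       # 0 in the session quoted above

# Regexp#inspect (called implicitly by p and by irb) is the step that raised
# in the 2008 builds; the outcome on other builds may differ.
begin
  shown = re.inspect
  puts "inspect ok, result encoding: #{shown.encoding}"
rescue EncodingError => e
  puts "inspect raised: #{e.class}"
end
```

Since irb prints results via inspect, an error from the last step is easy to mistake for an error from Regexp.new.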

So it's inspect() that has the issues, right?

Yes, the cause of this problem is Regexp#inspect.
A patch follows.

--- re.c (revision 19371)
+++ re.c (working copy)
@@ -381,7 +381,7 @@ rb_reg_desc(const char *s, long len, VAL
 {
     VALUE str = rb_str_buf_new2("/");
 
-    rb_enc_copy(str, re);
+    rb_enc_associate(str, rb_usascii_encoding());
     rb_reg_expr_str(str, s, len);
     rb_str_buf_cat2(str, "/");
     if (re) {

The result of Regexp#inspect is only used to look at the content of a regexp when debugging,
so there may be no reason to keep the original encoding.

Of course, Regexp#source must keep it.

Anyway, Regexp#to_s is an alias of Regexp#source now.
But Regexp#inspect is more readable.
How about making Regexp#to_s an alias of Regexp#inspect?

    r1 = /ab+c/ix           #=> /ab+c/ix
    s1 = r1.to_s            #=> "(?ix-m:ab+c)"
    r2 = Regexp.new(s1)     #=> /(?ix-m:ab+c)/
    r1 == r2                #=> false
    r1.source               #=> "ab+c"
    r2.source               #=> "(?ix-m:ab+c)"
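
For comparison, here is a small sketch putting the three string forms side by side. It uses only Regexp#source, Regexp#to_s and Regexp#inspect; the sample string "xABBC" and the variable names are made up for illustration.

```ruby
# Sketch: compare what source, to_s and inspect return for one regexp.
r1 = /ab+c/ix

src  = r1.source    # pattern text only, options dropped
str  = r1.to_s      # pattern wrapped with its options, usable in Regexp.new
insp = r1.inspect   # the literal form, meant for reading

r2 = Regexp.new(str)

# r2 is not == r1 (its source differs), but the embedded options make it
# match the same way on this sample.
sample = "xABBC"
p [src, str, insp]
p r1 == r2
p [r1 =~ sample, r2 =~ sample]
```

The to_s form is what allows a regexp to be embedded in another pattern with its options intact, which an inspect-style string would not give.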
    

--
NARUSE, Yui

=end

Actions #3

Updated by matz (Yukihiro Matsumoto) over 16 years ago

=begin
Hi,

In message "Re: [ruby-core:18610] Re: [Bug #564] Regexp fails on UTF-16 & UTF-32 character encodings"
on Tue, 16 Sep 2008 04:53:18 +0900, "NARUSE, Yui" writes:

|> So it's inspect() that has the issues, right?
|
|Yes, the cause of this problem is Regexp#inspect.
|A patch follows.

Can you commit?

						matz.

=end
