Bug #8653

closed

Unexpected result of String#succ with utf-16 and utf-32 string.

Added by phasis68 (Heesob Park) over 11 years ago. Updated over 11 years ago.

Status: Closed
Assignee: -
Target version: -
ruby -v: ruby 2.1.0dev (2013-07-17 trunk 42011) [i386-mingw32]
[ruby-core:56071]

Description

I found that String#succ returns an incorrect result for a UTF-16LE encoded string.

As a result, a Range of UTF-16LE encoded strings shows some unexpected behavior.

C:\work>irb
irb(main):001:0> a = 'A'.encode('UTF-16LE')
=> "A"
irb(main):002:0> b = 'B'.encode('UTF-16LE')
=> "B"
irb(main):003:0> a.succ
=> "\u0141"
irb(main):004:0> r = a..b
=> "A".."B"
irb(main):005:0> r.to_s
=> "A\u2E2EB"
irb(main):006:0> r.count
=> 3
irb(main):007:0> r.to_a
=> ["A", "\u0141", "\u0241"]
irb(main):008:0> r.include?(b)
=> false
irb(main):009:0> a = 'A'.encode('UTF-32LE')
=> "A"
irb(main):010:0> b = 'B'.encode('UTF-32LE')
=> "B"
irb(main):011:0> a.succ
=> "\u{1000041}"
irb(main):012:0> r = a..b
=> "A".."B"
irb(main):013:0> r.to_s
=> "A\u{422E2E}\x00\x00"
irb(main):014:0> r.count
=> 16777217
irb(main):015:0> r.to_a
[FATAL] failed to allocate memory

C:\work>
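
The element count reported for the UTF-32LE range is consistent with a successor that steps through raw byte values rather than code points. The following is a minimal sketch of that arithmetic only, under the assumption that the bytes are incremented from the right with carry to the left; `bytes_to_int` is a hypothetical helper for illustration, not part of Ruby. This model does not account for every result in the session above (the UTF-16LE range stops after three elements).

```ruby
# Hypothetical helper: read a byte array as one big integer,
# with the leftmost byte as the most significant digit.
def bytes_to_int(bytes)
  bytes.inject(0) { |acc, byte| (acc << 8) | byte }
end

first = "A".encode("UTF-32LE").bytes  # [0x41, 0x00, 0x00, 0x00]
last  = "B".encode("UTF-32LE").bytes  # [0x42, 0x00, 0x00, 0x00]

# Number of single-byte increments needed to get from "A" to "B"
# when the rightmost byte is bumped first and carries move left.
steps = bytes_to_int(last) - bytes_to_int(first)

# A range includes both endpoints, hence the extra element.
element_count = steps + 1
puts element_count  # => 16777217
```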

Updated by akr (Akira Tanaka) over 11 years ago

2013/7/18 phasis68 (Heesob Park) :

Bug #8653: Unexpected result of String#succ with utf-16 and utf-32 string.
https://bugs.ruby-lang.org/issues/8653

I found that String#succ returns an incorrect result for a UTF-16LE encoded string.

As a result, a Range of UTF-16LE encoded strings shows some unexpected behavior.

I wouldn't say the behavior is incorrect.

% ruby -e 'p "A".encode("UTF-16LE").force_encoding("ASCII-8BIT")'
"A\x00"
% ruby -e 'p "A".encode("UTF-16LE").succ.force_encoding("ASCII-8BIT")'
"A\x01"
% ruby -e 'p "B".encode("UTF-16LE").force_encoding("ASCII-8BIT")'
"B\x00"
% ruby -e 'p "B".encode("UTF-16LE").succ.force_encoding("ASCII-8BIT")'
"B\x01"

String#succ successfully generates the bytewise lexicographically next character.

I agree that this is not intuitive.
But it is very difficult to define String#succ in an encoding-neutral way.

Tanaka Akira
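
The byte dumps above can be reproduced without calling String#succ at all, which makes the effect independent of the Ruby version in use. The following is a minimal sketch, assuming the only operation is an increment of the final byte with no carry handling; `bump_last_byte` is a hypothetical helper written for illustration and is not how Ruby implements String#succ.

```ruby
# Hypothetical helper: increment only the last byte of the string,
# keeping the original encoding label. Carry is deliberately ignored.
def bump_last_byte(str)
  bytes = str.bytes
  bytes[-1] = (bytes[-1] + 1) & 0xFF
  bytes.pack("C*").force_encoding(str.encoding)
end

a = "A".encode("UTF-16LE")   # bytes: 0x41 0x00
n = bump_last_byte(a)        # bytes: 0x41 0x01

p n.bytes                    # => [65, 1]
# Read as one little-endian 16-bit unit, 0x41 0x01 is 0x0141,
# so the "next" string is U+0141 rather than "B".
p n.ord == 0x0141            # => true
```

Because the low-order byte of a little-endian code unit comes first, bumping the final byte changes the high-order part of the code point, which is why the result lands far away from "B".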

Updated by phasis68 (Heesob Park) over 11 years ago

I understand that String#succ is not easy to define for a UTF-16LE encoded string.

For a UTF-16 or UTF-32 string, it is possible to convert it to UTF-8, take the succ value, and convert the result back to the original encoding.

Here is a draft patch for rb_str_succ:

diff --git a/string.c b/string.c.new
index f7a12e0..f933052 100644
--- a/string.c
+++ b/string.c.new
@@ -3032,6 +3032,7 @@ enc_succ_alnum_char(char *p, long len, rb_encoding *enc, char *carry)
 VALUE
 rb_str_succ(VALUE orig)
 {
+    int idx;
     rb_encoding *enc;
     VALUE str;
     char *sbeg, *s, *e, *last_alnum = 0;
@@ -3041,12 +3042,26 @@ rb_str_succ(VALUE orig)
     long carry_pos = 0, carry_len = 1;
     enum neighbor_char neighbor = NEIGHBOR_FOUND;
-    str = rb_str_new5(orig, RSTRING_PTR(orig), RSTRING_LEN(orig));
-    rb_enc_cr_str_copy_for_substr(str, orig);
+    idx = ENCODING_GET(orig);
+    switch(idx) {
+        case ENCINDEX_UTF_16BE:
+        case ENCINDEX_UTF_16LE:
+        case ENCINDEX_UTF_32BE:
+        case ENCINDEX_UTF_32LE:
+        case ENCINDEX_UTF_16:
+        case ENCINDEX_UTF_32:
+            str = rb_str_encode(orig, rb_enc_from_encoding(rb_utf8_encoding()), 0, Qnil);
+            break;
+        default:
+            str = rb_str_new5(orig, RSTRING_PTR(orig), RSTRING_LEN(orig));
+            rb_enc_cr_str_copy_for_substr(str, orig);
+    }
+    OBJ_INFECT(str, orig);
     if (RSTRING_LEN(str) == 0) return str;
-    enc = STR_ENC_GET(orig);
+    enc = STR_ENC_GET(str);
     sbeg = RSTRING_PTR(str);
     s = e = sbeg + RSTRING_LEN(str);

@@ -3066,6 +3081,15 @@ rb_str_succ(VALUE orig)
           case NEIGHBOR_NOT_CHAR:
             continue;
           case NEIGHBOR_FOUND:
+            switch(idx) {
+                case ENCINDEX_UTF_16BE:
+                case ENCINDEX_UTF_16LE:
+                case ENCINDEX_UTF_32BE:
+                case ENCINDEX_UTF_32LE:
+                case ENCINDEX_UTF_16:
+                case ENCINDEX_UTF_32:
+                    str = rb_str_encode(str, rb_enc_from_encoding(rb_enc_from_index(idx)), 0, Qnil);
+            }
             return str;
           case NEIGHBOR_WRAPPED:
             last_alnum = s;

@@ -3103,6 +3127,17 @@ rb_str_succ(VALUE orig)
     STR_SET_LEN(str, RSTRING_LEN(str) + carry_len);
     RSTRING_PTR(str)[RSTRING_LEN(str)] = '\0';
     rb_enc_str_coderange(str);
+
+    switch(idx) {
+        case ENCINDEX_UTF_16BE:
+        case ENCINDEX_UTF_16LE:
+        case ENCINDEX_UTF_32BE:
+        case ENCINDEX_UTF_32LE:
+        case ENCINDEX_UTF_16:
+        case ENCINDEX_UTF_32:
+            str = rb_str_encode(str, rb_enc_from_encoding(rb_enc_from_index(idx)), 0, Qnil);
+    }
+    return str;
 }
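
The round trip that the draft patch performs in C (transcode to UTF-8, take the successor, transcode back) can be approximated at the Ruby level. This is a minimal sketch of the idea only, not the patch itself and not the change that was eventually committed; `WIDE_ENCODINGS` and `succ_via_utf8` are hypothetical names, and the BOM-dependent `UTF-16` and `UTF-32` encodings handled by the patch are left out for simplicity.

```ruby
# Hypothetical list of fixed-endianness wide encodings to special-case.
WIDE_ENCODINGS = [
  Encoding::UTF_16LE, Encoding::UTF_16BE,
  Encoding::UTF_32LE, Encoding::UTF_32BE
].freeze

# Hypothetical helper: compute the successor in UTF-8, where String#succ
# works on characters, then convert back to the original encoding.
def succ_via_utf8(str)
  return str.succ unless WIDE_ENCODINGS.include?(str.encoding)
  str.encode(Encoding::UTF_8).succ.encode(str.encoding)
end

a = "A".encode("UTF-16LE")
p succ_via_utf8(a) == "B".encode("UTF-16LE")            # => true
p succ_via_utf8("az".encode("UTF-32BE")).encode("UTF-8") # => "ba"
```

The cost of this approach is two transcoding passes per call, which is one reason a fix inside the per-character successor routine may be preferable.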

Updated by nobu (Nobuyoshi Nakada) over 11 years ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100

This issue was solved with changeset r42078.
Heesob, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


string.c: wchar succ

  • string.c (enc_succ_char, enc_pred_char): consider wchar case.
    [ruby-core:56071] [Bug #8653]
  • string.c (rb_str_succ): do not replace with invalid char.