Bug #8653

Unexpected result of String#succ with utf-16 and utf-32 string.

Added by Heesob Park 9 months ago. Updated 9 months ago.

[ruby-core:56071]
Status:Closed
Priority:Normal
Assignee:-
Category:-
Target version:-
ruby -v:ruby 2.1.0dev (2013-07-17 trunk 42011) [i386-mingw32]
Backport:1.9.3: UNKNOWN, 2.0.0: UNKNOWN

Description

I found that String#succ returns an incorrect result for a UTF-16LE encoded string.

As a result, a Range of UTF-16LE encoded strings shows some unexpected behavior.

C:\work>irb
irb(main):001:0> a = 'A'.encode('UTF-16LE')
=> "A"
irb(main):002:0> b = 'B'.encode('UTF-16LE')
=> "B"
irb(main):003:0> a.succ
=> "\u0141"
irb(main):004:0> r = a..b
=> "A".."B"
irb(main):005:0> r.to_s
=> "A\u2E2EB"
irb(main):006:0> r.count
=> 3
irb(main):007:0> r.to_a
=> ["A", "\u0141", "\u0241"]
irb(main):008:0> r.include?(b)
=> false
irb(main):009:0> a = 'A'.encode('UTF-32LE')
=> "A"
irb(main):010:0> b = 'B'.encode('UTF-32LE')
=> "B"
irb(main):011:0> a.succ
=> "\u{1000041}"
irb(main):012:0> r = a..b
=> "A".."B"
irb(main):013:0> r.to_s
=> "A\u{422E2E}\x00\x00"
irb(main):014:0> r.count
=> 16777217
irb(main):015:0> r.to_a
[FATAL] failed to allocate memory

C:\work>

Associated revisions

Revision 42078
Added by Nobuyoshi Nakada 9 months ago

string.c: wchar succ

  • string.c (enc_succ_char, enc_pred_char): consider wchar case. [Bug #8653]
  • string.c (rb_str_succ): do not replace with invalid char.
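
As a rough illustration of what stepping by character rather than by byte could look like at the Ruby level, the sketch below advances a single character by code point and keeps its encoding. The helper name succ_codepoint is hypothetical, and this is not the C code from r42078; it only relies on String#ord and Integer#chr with an encoding argument, and it does not handle surrogates or the end of the code space.

```ruby
# Hypothetical helper: advance one character by code point, keeping its encoding.
def succ_codepoint(ch)
  raise ArgumentError, "expected exactly one character" unless ch.length == 1
  # String#ord decodes the character according to its encoding,
  # so "A" in UTF-16LE is 0x41 regardless of the byte layout.
  (ch.ord + 1).chr(ch.encoding)
end

p succ_codepoint("A".encode("UTF-16LE")).bytes  # => [66, 0]
p succ_codepoint("A".encode("UTF-32LE")).bytes  # => [66, 0, 0, 0]
```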

History

#1 Updated by Akira Tanaka 9 months ago

2013/7/18 phasis68 (Heesob Park) phasis@gmail.com:

Bug #8653: Unexpected result of String#succ with utf-16 and utf-32 string.
https://bugs.ruby-lang.org/issues/8653

I found that String#succ returns an incorrect result for a UTF-16LE encoded string.

As a result, a Range of UTF-16LE encoded strings shows some unexpected behavior.

I wouldn't say the behavior is incorrect.

% ruby -e 'p "A".encode("UTF-16LE").force_encoding("ASCII-8BIT")'
"A\x00"
% ruby -e 'p "A".encode("UTF-16LE").succ.force_encoding("ASCII-8BIT")'
"A\x01"
% ruby -e 'p "B".encode("UTF-16LE").force_encoding("ASCII-8BIT")'
"B\x00"
% ruby -e 'p "B".encode("UTF-16LE").succ.force_encoding("ASCII-8BIT")'
"B\x01"

String#succ successfully generates the bytewise, lexicographically
next string.

I agree that this is not intuitive.
But it is very difficult to define String#succ in an encoding-neutral way.
--
Tanaka Akira
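
The bytewise behavior described in this comment can be reproduced on any Ruby version with a small simulation. This is only a sketch: bump_last_byte is a hypothetical helper that increments the final byte and ignores character boundaries and carries, which is enough to show where U+0141 and the out-of-range UTF-32 value in the report come from.

```ruby
# Hypothetical helper: increment the final byte of a string,
# ignoring character boundaries (no carry handling).
def bump_last_byte(str)
  bytes = str.bytes
  bytes[-1] += 1
  bytes.pack("C*").force_encoding(str.encoding)
end

n16 = bump_last_byte("A".encode("UTF-16LE"))
p n16.bytes          # => [65, 1]
p n16.ord.to_s(16)   # => "141", i.e. U+0141 rather than "B"

n32 = bump_last_byte("A".encode("UTF-32LE"))
p n32.bytes                         # => [65, 0, 0, 1]
p n32.unpack("V").first.to_s(16)    # => "1000041", beyond U+10FFFF
```

In little-endian encodings the last byte is the most significant one, so a bytewise increment jumps far away from the next character.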

#2 Updated by Heesob Park 9 months ago

I understand that String#succ is not easy to define for a UTF-16LE encoded string.

For a UTF-16 or UTF-32 string, it is possible to convert it to a UTF-8 string, take the succ value, and convert the result back to the original encoding.

Here is a draft patch for rb_str_succ:

diff --git a/string.c b/string.c.new
index f7a12e0..f933052 100644
--- a/string.c
+++ b/string.c.new
@@ -3032,6 +3032,7 @@ enc_succ_alnum_char(char *p, long len, rb_encoding *enc, char *carry)
 VALUE
 rb_str_succ(VALUE orig)
 {
+    int idx;
     rb_encoding *enc;
     VALUE str;
     char *sbeg, *s, *e, *last_alnum = 0;
@@ -3041,12 +3042,26 @@ rb_str_succ(VALUE orig)
     long carry_pos = 0, carry_len = 1;
     enum neighbor_char neighbor = NEIGHBOR_FOUND;

-    str = rb_str_new5(orig, RSTRING_PTR(orig), RSTRING_LEN(orig));
-    rb_enc_cr_str_copy_for_substr(str, orig);
+    idx = ENCODING_GET(orig);
+    switch(idx) {
+    case ENCINDEX_UTF_16BE:
+    case ENCINDEX_UTF_16LE:
+    case ENCINDEX_UTF_32BE:
+    case ENCINDEX_UTF_32LE:
+    case ENCINDEX_UTF_16:
+    case ENCINDEX_UTF_32:
+        str = rb_str_encode(orig, rb_enc_from_encoding(rb_utf8_encoding()), 0, Qnil);
+        break;
+    default:
+        str = rb_str_new5(orig, RSTRING_PTR(orig), RSTRING_LEN(orig));
+        rb_enc_cr_str_copy_for_substr(str, orig);
+    }
     OBJ_INFECT(str, orig);
     if (RSTRING_LEN(str) == 0) return str;

-    enc = STR_ENC_GET(orig);
+    enc = STR_ENC_GET(str);
     sbeg = RSTRING_PTR(str);
     s = e = sbeg + RSTRING_LEN(str);

@@ -3066,6 +3081,15 @@ rb_str_succ(VALUE orig)
         case NEIGHBOR_NOT_CHAR:
             continue;
         case NEIGHBOR_FOUND:
+            switch(idx) {
+            case ENCINDEX_UTF_16BE:
+            case ENCINDEX_UTF_16LE:
+            case ENCINDEX_UTF_32BE:
+            case ENCINDEX_UTF_32LE:
+            case ENCINDEX_UTF_16:
+            case ENCINDEX_UTF_32:
+                str = rb_str_encode(str, rb_enc_from_encoding(rb_enc_from_index(idx)), 0, Qnil);
+            }
             return str;
         case NEIGHBOR_WRAPPED:
             last_alnum = s;
@@ -3103,6 +3127,17 @@ rb_str_succ(VALUE orig)
     STR_SET_LEN(str, RSTRING_LEN(str) + carry_len);
     RSTRING_PTR(str)[RSTRING_LEN(str)] = '\0';
     rb_enc_str_coderange(str);
+
+    switch(idx) {
+    case ENCINDEX_UTF_16BE:
+    case ENCINDEX_UTF_16LE:
+    case ENCINDEX_UTF_32BE:
+    case ENCINDEX_UTF_32LE:
+    case ENCINDEX_UTF_16:
+    case ENCINDEX_UTF_32:
+        str = rb_str_encode(str, rb_enc_from_encoding(rb_enc_from_index(idx)), 0, Qnil);
+    }
+
     return str;
 }
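
The idea behind the patch above (transcode to UTF-8, take succ, transcode back) can also be sketched in plain Ruby. The names WIDE_ENCODINGS and succ_via_utf8 are hypothetical, and the list is an assumption: it covers only the explicit-endian encodings and leaves out the BOM-based UTF-16 and UTF-32 that the patch also lists.

```ruby
# Assumed set of fixed-endian wide encodings to route through UTF-8.
WIDE_ENCODINGS = [
  Encoding::UTF_16LE, Encoding::UTF_16BE,
  Encoding::UTF_32LE, Encoding::UTF_32BE
].freeze

# Hypothetical helper: compute succ in UTF-8, then restore the original encoding.
def succ_via_utf8(str)
  enc = str.encoding
  return str.succ unless WIDE_ENCODINGS.include?(enc)
  str.encode(Encoding::UTF_8).succ.encode(enc)
end

a = "A".encode("UTF-16LE")
b = "B".encode("UTF-16LE")
p succ_via_utf8(a) == b                                    # => true
p succ_via_utf8("Zz".encode("UTF-32BE")).encode("UTF-8")   # => "AAa"
```

The round trip costs two transcodings per call, but it reuses the existing UTF-8 rules for alphanumeric carry instead of redefining them for each wide encoding.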

#3 Updated by Nobuyoshi Nakada 9 months ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100

This issue was solved with changeset r42078.
Heesob, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


string.c: wchar succ

  • string.c (enc_succ_char, enc_pred_char): consider wchar case. [Bug #8653]
  • string.c (rb_str_succ): do not replace with invalid char.
