Bug #7090
Closed
UTF-16LE String#<< append 0x0 for certain codepoints
Description
$ irb193 -r unicode_utils/u
irb(main):001:0> RUBY_VERSION
=> "1.9.3"
irb(main):002:0> s1 = "".force_encoding('utf-16le')
=> ""
irb(main):003:0> s1 << 0x20
=> " "
irb(main):004:0> s1 << 0x300
=> " \u0000"
irb(main):005:0> U.debug s1
Char | Ordinal | Sid   | General Category | UTF-8
------+---------+-------+------------------+-------
" "  |      20 | SPACE | Space_Separator  | 20
N/A  |       0 | NULL  | Control          | 00
=> nil
irb(main):006:0> s2 = "".force_encoding('utf-8')
=> ""
irb(main):007:0> s2 << 0x20
=> " "
irb(main):008:0> s2 << 0x300
=> " ̀"
irb(main):009:0> U.debug s2
Char | Ordinal | Sid                    | General Category | UTF-8
------+---------+------------------------+------------------+-------
" "  |      20 | SPACE                  | Space_Separator  | 20
N/A  |     300 | COMBINING GRAVE ACCENT | Nonspacing_Mark  | CC 80
=> nil
IMO, the behaviour with the UTF-8 string is correct.
$ ri193 'String#<<'
= String#<<
(from ruby core)
str << integer       -> str
str.concat(integer)  -> str
str << obj           -> str
str.concat(obj)      -> str
Append---Concatenates the given object to str. If the object is a
Integer, it is considered as a codepoint, and is converted to a character
before concatenation.
a = "hello "
a << "world"   #=> "hello world"
a.concat(33)   #=> "hello world!"
AFAIK, a Ruby 1.9 string can be viewed as either 1) a sequence of raw bytes,
or 2) a sequence of codepoints.
Except perhaps for regexes, Ruby has no higher-level concept of a "character"
than a codepoint, so I don't know what "and is converted to
a character before concatenation" means.
If we take the sequence-of-codepoints view, then "str << integer" simply
appends a codepoint.
If we take the sequence-of-bytes view, then "str << integer" converts
the codepoint into the sequence of bytes that corresponds to the codepoint
in str.encoding and appends that sequence of bytes.
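To make the two views concrete, here is a minimal sketch of what I would expect on a Ruby where this bug is fixed. The helper name append_codepoints is made up for illustration and is not part of Ruby; it appends the same codepoints to an empty string in each encoding and returns both the codepoint view and the byte view.

```ruby
# Hypothetical helper (not part of Ruby): append each codepoint to an
# empty string in the given encoding, then return [codepoints, bytes].
def append_codepoints(encoding_name, *cps)
  str = String.new.force_encoding(encoding_name)
  cps.each { |cp| str << cp } # String#<< with an Integer appends a codepoint
  [str.codepoints, str.bytes]
end

%w[UTF-8 UTF-16LE UTF-16BE].each do |enc|
  cps, bytes = append_codepoints(enc, 0x20, 0x300)
  puts "#{enc}: codepoints=#{cps.inspect} bytes=#{bytes.inspect}"
end
```

The codepoint view should be identical for all three encodings; only the byte view should differ.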
        
Updated by stefan (Stefan Lang) about 13 years ago
UTF-16BE:
irb(main):003:0> s = "".force_encoding('utf-16be')
=> ""
irb(main):004:0> s << 0x20
=> "\u0000"
irb(main):005:0> s << 0x300
=> "\u0000\u0300"
        
Updated by stefan (Stefan Lang) about 13 years ago
With an older Ruby version, ruby 1.9.3p0 (2011-10-30 revision 33570) [x86_64-linux],
the string correctly contains 0x20, 0x300 for UTF-8, UTF-16LE and UTF-16BE.
        
Updated by naruse (Yui NARUSE) about 13 years ago
- Status changed from Open to Closed
- % Done changed from 0 to 100
This issue was solved with changeset r37058.
Stefan, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.
- string.c (rb_str_concat): use memcpy to copy a string which contains
 NUL characters. [ruby-core:47751] [Bug #7090]
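
The commit message suggests why only some codepoints were affected. The following is a rough Ruby simulation, not the actual string.c code, and it assumes the pre-fix code copied the encoded bytes with a C-string style routine such as strncpy, which stops at the first NUL byte and zero-pads the rest. The helpers strncpy_like and memcpy_like are invented for this sketch.

```ruby
# Simulates C's strncpy(dst, src, n) on a byte array: copy bytes up to the
# first NUL (at most n), then pad the remainder of the n bytes with NULs.
def strncpy_like(src_bytes, n)
  nul_at = src_bytes.index(0) || src_bytes.length
  copied = src_bytes.first([nul_at, n].min)
  copied + [0] * (n - copied.length)
end

# Simulates memcpy(dst, src, n): copy exactly n bytes, NULs included.
def memcpy_like(src_bytes, n)
  src_bytes.first(n)
end

[["UTF-16LE", 0x20], ["UTF-16LE", 0x300],
 ["UTF-16BE", 0x20], ["UTF-16BE", 0x300]].each do |enc, cp|
  encoded = cp.chr(enc).bytes # bytes of the codepoint in that encoding
  puts "#{enc} U+#{cp.to_s(16).upcase}: " \
       "strncpy-like=#{strncpy_like(encoded, encoded.length).inspect} " \
       "memcpy-like=#{memcpy_like(encoded, encoded.length).inspect}"
end
```

Under that assumption, the NUL-terminated copy turns any encoded character whose first byte is 0x00 into all zero bytes, which matches the "\u0000" seen for 0x300 in UTF-16LE and for 0x20 in UTF-16BE, while the length-based copy preserves every byte.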