Bug #18641

closed

UTF-16 surrogate pairs

Added by noraj (Alexandre ZANNI) about 3 years ago. Updated about 3 years ago.

Status: Rejected
Assignee: -
Target version: -
ruby -v: ruby 3.1.1p18 (2022-02-18 revision 53f5fc4236) [x86_64-linux]
[ruby-core:107959]

Description

It is expected that Ruby raises an invalid Unicode codepoint error when surrogate code points are used in a UTF-8 string; however, those code points should be valid in a UTF-16 string.
It is also expected that unpaired surrogates are invalid, but paired surrogates are valid, cf. https://unicode.org/faq/utf_bom.html#utf16-7.

Versions tested: 3.0.3p157, 3.1.0p0, and 3.1.1p18

 irb
irb(main):001:0> a = ''.force_encoding(Encoding::UTF_16)
=> ""
irb(main):002:0> a += "\uD83D\uDC69".force_encoding(Encoding::UTF_16)
/home/noraj/.asdf/installs/ruby/3.1.0/lib/ruby/3.1.0/irb/workspace.rb:119:in `eval': (irb):2: invalid Unicode codepoint (SyntaxError)
a += "\uD83D\uDC69".force_encoding(Encodi...
        ^~~~
(irb):2: invalid Unicode codepoint
a += "\uD83D\uDC69".force_encoding(Encoding::UT...
              ^~~~
        from /home/noraj/.asdf/installs/ruby/3.1.0/lib/ruby/gems/3.1.0/gems/irb-1.4.1/exe/irb:11:in `<top (required)>'
        from /home/noraj/.asdf/installs/ruby/3.1.0/bin/irb:25:in `load'
        from /home/noraj/.asdf/installs/ruby/3.1.0/bin/irb:25:in `<main>'

Also see Unicode 14.0 Implementation Guidelines - 5.4 Handling Surrogate Pairs in UTF-16
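
For comparison, here is a minimal sketch of one way to get the same surrogate pair into a UTF-16 string without the `\u` escape. The trace above shows a SyntaxError, so the literal is rejected while it is being parsed, before `force_encoding` runs. The sketch builds the string from raw 16-bit code units instead. The variable names are illustrative only, and `Encoding::UTF_16BE` is used here as an assumption in place of the BOM-dependent `Encoding::UTF_16` from the report.

```ruby
# The two UTF-16 code units from the report (high surrogate, low surrogate).
units = [0xD83D, 0xDC69]

# pack('n*') writes each integer as a 16-bit big-endian value, producing
# a binary string; force_encoding then only relabels those bytes.
utf16 = units.pack('n*').force_encoding(Encoding::UTF_16BE)

utf16.valid_encoding?            # => true, the surrogates are paired
utf16.encode(Encoding::UTF_8)    # => "\u{1F469}" (U+1F469 WOMAN)

# Going the other way: write the code point itself and let Ruby
# produce the surrogate pair during transcoding.
roundtrip = "\u{1F469}".encode(Encoding::UTF_16BE).unpack('n*')
# => [0xD83D, 0xDC69]

# An unpaired high surrogate is reported as invalid, as expected.
lone = [0xD83D].pack('n').force_encoding(Encoding::UTF_16BE)
lone.valid_encoding?             # => false
```

Both directions avoid writing a surrogate as `\uXXXX`: either the code units are supplied as integers, or the supplementary code point is written as `\u{1F469}` and transcoded.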
