Bug #18641

closed

UTF-16 surrogate pairs

Added by noraj (Alexandre ZANNI) about 3 years ago. Updated about 3 years ago.

Status:
Rejected
Assignee:
-
Target version:
-
ruby -v:
ruby 3.1.1p18 (2022-02-18 revision 53f5fc4236) [x86_64-linux]
[ruby-core:107959]

Description

It is expected that Ruby raises an invalid Unicode codepoint error when surrogate pairs are used in a UTF-8 string; however, those codepoints should be valid in a UTF-16 string.
It is also expected that unpaired surrogates are invalid, but paired surrogates are valid; cf. https://unicode.org/faq/utf_bom.html#utf16-7.

Version tested: 3.0.3p157, 3.1.0p0 and 3.1.1p18

 irb
irb(main):001:0> a = ''.force_encoding(Encoding::UTF_16)
=> ""
irb(main):002:0> a += "\uD83D\uDC69".force_encoding(Encoding::UTF_16)
/home/noraj/.asdf/installs/ruby/3.1.0/lib/ruby/3.1.0/irb/workspace.rb:119:in `eval': (irb):2: invalid Unicode codepoint (SyntaxError)                                                                            
a += "\uD83D\uDC69".force_encoding(Encodi...                                                
        ^~~~                                                                                
(irb):2: invalid Unicode codepoint                                                          
a += "\uD83D\uDC69".force_encoding(Encoding::UT...                                          
              ^~~~                                                                          
        from /home/noraj/.asdf/installs/ruby/3.1.0/lib/ruby/gems/3.1.0/gems/irb-1.4.1/exe/irb:11:in `<top (required)>'                                                                                           
        from /home/noraj/.asdf/installs/ruby/3.1.0/bin/irb:25:in `load'                     
        from /home/noraj/.asdf/installs/ruby/3.1.0/bin/irb:25:in `<main>'

Also see Unicode 14.0 Implementation Guidelines - 5.4 Handling Surrogate Pairs in UTF-16

Updated by noraj (Alexandre ZANNI) about 3 years ago

  • Description updated (diff)
  • ruby -v changed from ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-linux] to ruby 3.1.1p18 (2022-02-18 revision 53f5fc4236) [x86_64-linux]

Tested against 3.1.1p18.

Updated by duerst (Martin Dürst) about 3 years ago

  • Status changed from Open to Rejected

"\uD83D\uDC69" tries to create a UTF-8 string with surrogates. In UTF-8, surrogates are not allowed, and therefore you get an error. Adding .force_encoding(Encoding::UTF_16) does not change any of this; the error has already happened by the time it would be called. It is also conceptually wrong, because it would label a sequence of UTF-8 bytes as UTF-16, which would give very strange results.
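The two points above can be checked with a small sketch. It assumes `eval` is used only so that the parse-time SyntaxError can be rescued inside a single script, and it uses UTF-16BE rather than plain UTF-16 to keep the byte order explicit; the variable names are illustrative only.

```ruby
# The \u escapes are resolved while the literal is parsed, so the error is
# raised before force_encoding could ever run.
begin
  eval('"\uD83D\uDC69"')
  parse_error = nil
rescue SyntaxError => e
  parse_error = e
end

# force_encoding only relabels the existing bytes; encode transcodes them.
utf8 = "\u{1F469}"                                   # UTF-8 string literal
relabeled  = utf8.dup.force_encoding(Encoding::UTF_16BE)
transcoded = utf8.encode(Encoding::UTF_16BE)

relabeled_bytes  = relabeled.bytes    # still the four UTF-8 bytes
transcoded_bytes = transcoded.bytes   # the surrogate pair D83D DC69

puts parse_error.class
puts relabeled_bytes.map { |b| format('%02X', b) }.join(' ')
puts transcoded_bytes.map { |b| format('%02X', b) }.join(' ')
```

The relabeled string keeps the UTF-8 bytes under a UTF-16BE label, which is the "very strange results" case; only the transcoded string actually contains the surrogate pair.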

If you want the 'woman' emoji in UTF-16, then here are some choices:

"\u{1F469}".encode('UTF-16') # but this will prepend \uFEFF
"👩".encode('UTF-16') # but this will prepend \uFEFF
[0xD83D, 0xDC69].pack('S>*').force_encoding('UTF-16')

If it's something else that you want, please tell us what you want. Also, please note that the above worked on two of my systems, but may not work on your system, because it depends on the endianness of UTF-16 (whether it is actually UTF-16BE or UTF-16LE).
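One way to sidestep the endianness question, as a sketch rather than a recommendation from this thread, is to name the byte order explicitly in both the pack directive and the encoding label. The example assumes the `n` (16-bit big-endian) and `v` (16-bit little-endian) directives of Array#pack, and the variable names are illustrative.

```ruby
units = [0xD83D, 0xDC69]   # high and low surrogate for U+1F469

# 'n' and 'v' fix the byte order regardless of the host platform.
be = units.pack('n*').force_encoding(Encoding::UTF_16BE)
le = units.pack('v*').force_encoding(Encoding::UTF_16LE)

be_valid = be.valid_encoding?
le_valid = le.valid_encoding?

# Both strings decode to the same single code point.
be_codepoints = be.codepoints
le_codepoints = le.codepoints

# A lone high surrogate is not a valid UTF-16BE string.
lone = [0xD83D].pack('n').force_encoding(Encoding::UTF_16BE)
lone_valid = lone.valid_encoding?

puts be.encode(Encoding::UTF_8)
puts le.encode(Encoding::UTF_8)
```

With the byte order spelled out, no byte order mark is needed and the result does not depend on which variant the system treats as UTF-16.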

Updated by noraj (Alexandre ZANNI) about 3 years ago

Thank you Martin.

I'm actually working on a Unicode study. I was not interested in representing the emoji by its codepoint, but in actually being able to write a non-BMP glyph in UTF-16 by using the surrogates.

As far as I understand, it's not possible to have a native UTF-16 string; it will always be UTF-8 converted to UTF-16. So my only option to write surrogates directly is to use pack?

Updated by duerst (Martin Dürst) about 3 years ago

noraj (Alexandre ZANNI) wrote in #note-3:

As far as I understand, it's not possible to have a native UTF-16 string; it will always be UTF-8 converted to UTF-16. So my only option to write surrogates directly is to use pack?

Or write your own custom method, but that's unnecessary. When it comes to encodings for Unicode, Ruby is definitely heavily biased towards UTF-8, because UTF-8 is compatible with ASCII.
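For illustration, such a custom method might look like the sketch below. Both helpers, `surrogate_pair` and `utf16be_from_units`, are hypothetical names and not part of Ruby; the arithmetic follows the standard UTF-16 surrogate calculation, and UTF-16BE is assumed so the byte order is unambiguous.

```ruby
# Hypothetical helper: split a supplementary code point into its
# UTF-16 high and low surrogate code units.
def surrogate_pair(codepoint)
  unless (0x10000..0x10FFFF).cover?(codepoint)
    raise ArgumentError, 'code point is outside the supplementary planes'
  end

  offset = codepoint - 0x10000
  [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]
end

# Hypothetical helper: build a UTF-16BE string from raw 16-bit code units.
def utf16be_from_units(units)
  units.pack('n*').force_encoding(Encoding::UTF_16BE)
end

pair  = surrogate_pair(0x1F469)
woman = utf16be_from_units(pair)

puts pair.map { |u| format('%04X', u) }.join(' ')
puts woman.encode(Encoding::UTF_8)
```

This lets the surrogates be written as plain integers while the resulting string is the same as transcoding the emoji from UTF-8.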
