Feature #10770
Updated by nobu (Nobuyoshi Nakada) almost 10 years ago
`ord` raises an error when it meets an ill-formed byte sequence, so `each_char` and `each_codepoint` behave differently.

~~~ruby
str = "a\x80bc"
str.each_char {|c| puts c } # no error
str.each_codepoint {|c| puts c } # invalid byte sequence in UTF-8 (ArgumentError)
~~~

One way to keep them consistent is to change `ord` to return a substitute code point such as 0xFFFD, the one `scrub` uses.

Another consistency problem concerns surrogate code points. Although CRuby allows surrogate code points in Unicode literals, `ord` and `chr` don't allow them.

~~~ruby
"\uD800".ord # invalid byte sequence in UTF-8 (ArgumentError)
0xD800.chr('UTF-8') # invalid codepoint 0xD800 in UTF-8 (RangeError)
~~~

How about removing the restriction? One example of using surrogate code points is converting a 4-byte character to a pair of 3-byte characters for MySQL/MariaDB's utf8mb3.

~~~ruby
str = "\u{1F436}" # DOG FACE
cp = str.ord
if cp >= 0x10000 then # supplementary code point (U+10000 and above)
  # http://unicode.org/faq/utf_bom.html#utf16-4
  lead = 0xD800 - (0x10000 >> 10) + (cp >> 10)
  trail = 0xDC00 + (cp & 0x3FF)
  # currently raises RangeError; this is the restriction proposed for removal
  ret = lead.chr('UTF-8') + trail.chr('UTF-8')
end
~~~
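For comparison, here is a rough sketch of how the same conversion might be done today without `chr`, by writing out the 3-byte UTF-8 bit layout by hand. The helper names `three_byte_seq` and `surrogate_pair_bytes` are hypothetical and used only for illustration. The result is a binary string, because Ruby treats these bytes as invalid UTF-8.

```ruby
# Hypothetical helper: encode one code point in U+0800..U+FFFF as three raw
# bytes using the UTF-8 bit layout, skipping the surrogate check done by `chr`.
def three_byte_seq(cp)
  [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)].pack('C*')
end

# Hypothetical helper: split a supplementary code point into a surrogate pair
# and return both 3-byte sequences as one binary (ASCII-8BIT) string.
def surrogate_pair_bytes(cp)
  unless cp.between?(0x10000, 0x10FFFF)
    raise ArgumentError, 'not a supplementary code point'
  end
  # Same arithmetic as http://unicode.org/faq/utf_bom.html#utf16-4
  lead  = 0xD800 - (0x10000 >> 10) + (cp >> 10)
  trail = 0xDC00 + (cp & 0x3FF)
  three_byte_seq(lead) + three_byte_seq(trail)
end

bytes = surrogate_pair_bytes("\u{1F436}".ord) # DOG FACE
puts bytes.bytes.map { |b| format('%02X', b) }.join(' ')
# prints "ED A0 BD ED B0 B6" (U+D83D followed by U+DC36)
```

This works around the restriction rather than removing it. If `chr` accepted surrogates, the byte-level helper would be unnecessary.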