Ruby Issue Tracking System (https://bugs.ruby-lang.org/), updated 2015-01-22T01:12:27Z

Ruby master - Feature #10770: chr and ord behavior for ill-formed byte sequences and surrogate code points
https://bugs.ruby-lang.org/issues/10770?journal_id=51165
2015-01-22T01:12:27Z, masakielastic (Masaki Kagaya), masakielastic@gmail.com
<p>This issue comes from discussion about mruby's behavior (<a href="https://github.com/mruby/mruby/issues/2708" class="external">https://github.com/mruby/mruby/issues/2708</a>).</p>

Ruby master - Feature #10770: chr and ord behavior for ill-formed byte sequences and surrogate code points
https://bugs.ruby-lang.org/issues/10770?journal_id=51181
2015-01-22T10:19:44Z, nobu (Nobuyoshi Nakada), nobu@ruby-lang.org
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/51181/diff?detail_id=36840">diff</a>)</li></ul><p>Masaki Kagaya wrote:</p>
<blockquote>
<pre><code class="ruby syntaxhl" data-language="ruby"><span class="n">str</span> <span class="o">=</span> <span class="s2">"a</span><span class="se">\x80</span><span class="s2">bc"</span>
<span class="n">str</span><span class="p">.</span><span class="nf">each_char</span> <span class="p">{</span><span class="o">|</span><span class="n">c</span><span class="o">|</span> <span class="nb">puts</span> <span class="n">c</span> <span class="p">}</span>
<span class="c1"># no error</span>
</code></pre>
</blockquote>
<p>Sounds like a bug in <code>String#each_char</code>, but it may be intentional.</p>
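<p>To make the inconsistency concrete, here is a minimal sketch of the behavior under discussion, assuming CRuby with a UTF-8 string. The variable names are illustrative, and <code>force_encoding</code> is only there so the example does not depend on the script's source encoding.</p>

```ruby
# A UTF-8 string containing a stray continuation byte (0x80).
str = "a\x80bc".dup.force_encoding("UTF-8")

# The string as a whole is not validly encoded.
valid = str.valid_encoding?

# each_char still yields every character, including the broken one.
chars = str.each_char.to_a

# ord, by contrast, raises ArgumentError on the broken character.
ord_error =
  begin
    chars[1].ord
    nil
  rescue ArgumentError => e
    e.class
  end

puts "valid_encoding? = #{valid}"
puts "each_char yielded #{chars.length} characters"
puts "ord raised #{ord_error.inspect}"
```

<p>So iteration tolerates the ill-formed byte while <code>ord</code> rejects it, which is the mismatch the report describes.</p>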
<blockquote>
<p>One way of keeping consistency is to change <code>ord</code> to return a substitute code point such as 0xFFFD, as adopted by <code>scrub</code>.</p>
</blockquote>
<p>Implicit substitution doesn't feel like a nice idea to me.</p>
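<p>For comparison, substitution can already be requested explicitly. The following sketch assumes CRuby's <code>String#scrub</code> with its default replacement for UTF-8 (U+FFFD); it is one possible way to get code points out of an ill-formed string, not a proposal from this thread.</p>

```ruby
# A UTF-8 string containing a stray continuation byte (0x80).
str = "a\x80bc".dup.force_encoding("UTF-8")

# scrub returns a copy with invalid bytes replaced by U+FFFD.
scrubbed = str.scrub

# After scrubbing, ord works on every character.
code_points = scrubbed.each_char.map(&:ord)

puts code_points.map { |cp| format("U+%04X", cp) }.join(" ")
```

<p>Here the caller opts in to the replacement, so no information is lost silently.</p>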
<blockquote>
<p>How about removing the restriction? One example of using surrogate code points is converting a 4-byte character to a pair of 3-byte characters for MySQL/MariaDB's utf8mb3.</p>
</blockquote>
<p>That is primarily the responsibility of those bindings.</p>
<pre><code class="ruby syntaxhl" data-language="ruby"><span class="n">str</span><span class="p">.</span><span class="nf">encode</span><span class="p">(</span><span class="s2">"UTF-16BE"</span><span class="p">).</span><span class="nf">unpack</span><span class="p">(</span><span class="s2">"n*"</span><span class="p">).</span><span class="nf">pack</span><span class="p">(</span><span class="s2">"U*"</span><span class="p">)</span>
</code></pre>
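<p>As a worked example of the surrogate-pair conversion, the sketch below splits one 4-byte character into two 3-byte sequences. It assumes CRuby and reads the UTF-16BE bytes as big-endian 16-bit units (the <code>n</code> directive); the sample character U+1F600 and the variable names are illustrative.</p>

```ruby
# One character outside the BMP: 4 bytes in UTF-8.
str = "\u{1F600}"

# UTF-16BE represents it as a surrogate pair of two 16-bit units.
units = str.encode("UTF-16BE").unpack("n*")

# pack("U*") writes each unit as its own 3-byte sequence without
# checking for surrogates, so the result is not valid UTF-8 in Ruby.
pair = units.pack("U*")

# Integer#chr keeps the restriction and refuses a surrogate code point.
chr_error =
  begin
    units[0].chr("UTF-8")
    nil
  rescue RangeError => e
    e.class
  end

puts units.map { |u| format("0x%04X", u) }.join(" ")
puts "bytesize: #{str.bytesize} -> #{pair.bytesize}"
puts "valid_encoding? = #{pair.valid_encoding?}"
puts "chr raised #{chr_error.inspect}"
```

<p>The result is deliberately ill-formed from Ruby's point of view, which is why producing it is left to an explicit <code>pack</code> call rather than to <code>chr</code>.</p>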