Ruby Issue Tracking System: Ruby master - Bug #15210: UTF-8 BOM should be removed from String in internal representation https://bugs.ruby-lang.org/issues/15210?journal_id=74332 2018-10-06T19:51:02Z shevegen (Robert A. Heiler) shevegen@gmail.com
<ul></ul><blockquote>
<p>BTW: stdlib::CSV chokes on the BOM</p>
</blockquote>
<p>I can't say how common this is or whether it is a bug; but if it is,<br>
and the faulty behaviour affects other ruby hackers, then I agree that CSV<br>
should probably be able to handle input with a BOM as well, in one way<br>
or another (be it automatically or via another API).</p>
<p>I also agree that it could perhaps be mentioned somewhere, be it in the<br>
csv documentation or elsewhere.</p>
<p>Regarding the workaround: I assume you meant this only as a solution for others<br>
who face a similar problem, rather than a permanent addition to class String, yes?<br>
(I ask this because adding a specific method to class String permanently in<br>
ruby may be much harder to do and get approved, whereas an extension to ruby's<br>
CSV is most likely easier and possible.)</p> Ruby master - Bug #15210: UTF-8 BOM should be removed from String in internal representationhttps://bugs.ruby-lang.org/issues/15210?journal_id=743332018-10-07T00:53:10Znobu (Nobuyoshi Nakada)nobu@ruby-lang.org
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/74333/diff?detail_id=50028">diff</a>)</li><li><strong>Assignee</strong> set to <i>13939</i></li></ul><p>foonlyboy (Eike Dierks) wrote:</p>
<blockquote>
<p>I believe this to be a bug in how byte data is converted to the ruby internal String representation.</p>
</blockquote>
<p>Yes, a BOM should be removed at conversion, that is, when reading from a data stream.</p>
<blockquote>
<p>There is a workaround, but this needs to be documented:</p>
<pre><code class="ruby syntaxhl" data-language="ruby"><span class="no">IO</span><span class="p">.</span><span class="nf">read</span><span class="p">(</span><span class="n">mode</span><span class="ss">:'r:BOM|UTF-8'</span><span class="p">)</span>
</code></pre>
</blockquote>
<p>It is documented at <code>IO.new</code>, and you can use it at <code>CSV.open</code> too.</p>
<p>rdoc of <code>CSV.open</code>:</p>
<blockquote>
<p>You must pass a <code>filename</code> and may optionally add a <code>mode</code> for Ruby's <code>open()</code>.</p>
</blockquote>
<p>rdoc of <code>Kernel.open</code>:</p>
<blockquote>
<p>See the documentation of <code>IO.new</code> for full documentation of the <code>mode</code> string directives.</p>
</blockquote>
<p>rdoc of <code>IO.new</code>:</p>
<blockquote>
<p>If <code>"BOM|UTF-8"</code>, <code>"BOM|UTF-16LE"</code> or <code>"BOM|UTF-16BE"</code> are used, Ruby checks for<br>
a Unicode BOM in the input document to help determine the encoding. For<br>
UTF-16 encodings the file open mode must be binary. When present, the BOM<br>
is stripped and the external encoding from the BOM is used. When the BOM<br>
is missing the given Unicode encoding is used as <code>ext_enc</code>. (The BOM-set<br>
encoding option is case insensitive, so <code>"bom|utf-8"</code> is also valid.)</p>
</blockquote>
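<p>As a rough illustration of the mode string quoted above, the following sketch writes a small UTF-8 file that starts with a BOM and reads it back with and without <code>BOM|UTF-8</code>. The file contents and variable names are made up for the example; only <code>File.read</code>, <code>CSV.open</code> and the <code>"r:BOM|UTF-8"</code> mode come from the discussion.</p>

```ruby
require "csv"
require "tempfile"

plain = nil          # file read without BOM handling
with_bom_mode = nil  # file read with "BOM|UTF-8"
rows = nil           # parsed CSV table

# Hypothetical sample data: a UTF-8 BOM followed by two CSV lines.
Tempfile.create(["bom", ".csv"]) do |tmp|
  tmp.binmode
  tmp.write("\xEF\xBB\xBFname,age\nalice,30\n")
  tmp.flush

  # Without the BOM directive, the BOM stays in the String as U+FEFF.
  plain = File.read(tmp.path, mode: "r:UTF-8")

  # With the directive, the BOM is stripped while reading.
  with_bom_mode = File.read(tmp.path, mode: "r:BOM|UTF-8")

  # CSV.open forwards the mode string to Ruby's open().
  rows = CSV.open(tmp.path, "r:BOM|UTF-8", headers: true) { |csv| csv.read }
end
```

<p>Without the directive the first header would carry the invisible U+FEFF prefix, so a lookup by <code>"name"</code> would likely miss it.</p>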
<p>Documentation improvement patches are welcome.</p>
<blockquote>
<p>But I'm asking to improve the UTF-BOM handling:</p>
<ul>
<li>The BOM is only used for transfer encoding at the byte stream level.</li>
</ul>
</blockquote>
<p>This is half true.</p>
<p><a href="https://en.wikipedia.org/wiki/Byte_order_mark#Usage" class="external">https://en.wikipedia.org/wiki/Byte_order_mark#Usage</a></p>
<blockquote>
<p>If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space"</p>
</blockquote>
<p>The same character at any other position is not called a "BOM".</p>
<blockquote>
<ul>
<li>The BOM MUST NOT be part of the String in internal representation.</li>
</ul>
</blockquote>
<p>Yes, it should be removed when reading; that is the only chance to remove a BOM properly.</p> Ruby master - Bug #15210: UTF-8 BOM should be removed from String in internal representationhttps://bugs.ruby-lang.org/issues/15210?journal_id=744312018-10-12T18:44:57Zfoonlyboy (Eike Dierks)
<ul></ul><p>I looked into it a bit more closely:</p>
<p>io.c does this in</p>
<pre><code class="c syntaxhl" data-language="c"><span class="k">static</span> <span class="kt">int</span>
<span class="n">io_strip_bom</span><span class="p">(</span><span class="n">VALUE</span> <span class="n">io</span><span class="p">)</span>
</code></pre>
<p>which is called by:</p>
<pre><code class="c syntaxhl" data-language="c"><span class="k">static</span> <span class="kt">void</span>
<span class="n">io_set_encoding_by_bom</span><span class="p">(</span><span class="n">VALUE</span> <span class="n">io</span><span class="p">)</span>
</code></pre>
<blockquote>
<p>It is documented at <code>IO.new</code>, and you can use it at <code>CSV.open</code> too.</p>
</blockquote>
<p>Yes, I was aware of this.</p>
<p>I also agree that the conversion has to take place when opening the file.</p>
<p>But with rails I get an ActionDispatch::Http::UploadedFile<br>
(which returns an ASCII-8BIT byte stream)</p>
<p>And I could find no way to apply the io_strip_bom() to it,<br>
not even by going through StringIO.<br>
(but then Ruby is not about applying tricks anyway)</p>
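<p>For data that is already in memory as a binary String, one possible manual workaround is sketched below. It assumes the bytes are known to be UTF-8; <code>strip_utf8_bom</code> and <code>UTF8_BOM</code> are hypothetical names introduced for this example and are not part of Ruby or Rails.</p>

```ruby
# U+FEFF, which is encoded as the bytes EF BB BF in UTF-8.
UTF8_BOM = "\uFEFF"

# Hypothetical helper: relabel a binary String as UTF-8 and drop a
# leading BOM if one is present. The input String is left untouched.
def strip_utf8_bom(bytes)
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  text.delete_prefix(UTF8_BOM)
end

# Simulated upload body: ASCII-8BIT bytes starting with a UTF-8 BOM.
raw = "\xEF\xBB\xBFname,age\n".b
text = strip_utf8_bom(raw)

# Input without a BOM passes through unchanged apart from the encoding.
no_bom = strip_utf8_bom("plain".b)
```

<p>This only removes a BOM at the very start of the String, which matches the point made earlier in the thread that the same character elsewhere is not a BOM.</p>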
<p>It sounds to me that nobu also agrees, that the BOM should always be removed.</p>
<blockquote>
<p>If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space"</p>
</blockquote>
<p>I don't care so much about this for now.<br>
(while I can imagine this to happen when concatenating files ...)</p>
<p>But let's fix the more simple problems first.</p>
<p>I think the BOM is used for two reasons in byte streams:</p>
<ul>
<li>a magic number for UTF encoded data (which might even apply to UTF-8)</li>
<li>a magic number to distinguish different UTF byte orderings when using UTF-16 or UTF-32</li>
</ul>
<p>But in the ruby world, we have <strong>String</strong><br>
We should remove all artefacts from any external encoding.</p>
<p>Impact:</p>
<p>I believe this might need changes in more than just one place in the code,<br>
but I believe this should be fully upward compatible with <em>most</em> customers' code.</p>
<p>This should still agree with the ruby spec,<br>
because nowhere was it ever declared that String keeps the BOM.</p>
<hr>
<p>Please excuse my lengthy writings,<br>
but I thought these encoding problems were a thing from the past.</p>
<p>We might also look at the other languages around.<br>
Makes for a good rosetta code ...</p>
<p>~eike</p> Ruby master - Bug #15210: UTF-8 BOM should be removed from String in internal representationhttps://bugs.ruby-lang.org/issues/15210?journal_id=785172019-06-13T07:24:43Znobu (Nobuyoshi Nakada)nobu@ruby-lang.org
<ul></ul><p><a href="https://github.com/nobu/ruby/pull/new/feature/15210-detect_bom" class="external">https://github.com/nobu/ruby/pull/new/feature/15210-detect_bom</a></p> Ruby master - Bug #15210: UTF-8 BOM should be removed from String in internal representationhttps://bugs.ruby-lang.org/issues/15210?journal_id=785212019-06-13T08:36:54Znobu (Nobuyoshi Nakada)nobu@ruby-lang.org
<ul></ul><p>Renamed, and added an exception for unexpected conditions.<br>
<a href="https://github.com/nobu/ruby/pull/new/feature/15210-set_encoding_by_bom" class="external">https://github.com/nobu/ruby/pull/new/feature/15210-set_encoding_by_bom</a></p> Ruby master - Bug #15210: UTF-8 BOM should be removed from String in internal representationhttps://bugs.ruby-lang.org/issues/15210?journal_id=785232019-06-13T09:16:35Znobu (Nobuyoshi Nakada)nobu@ruby-lang.org
<ul><li><strong>Status</strong> changed from <i>Open</i> to <i>Closed</i></li></ul><p>Applied in changeset <a class="changeset" title="IO#set_encoding_by_bom * io.c (rb_io_set_encoding_by_bom): IO#set_encoding_by_bom to set the e..." href="https://bugs.ruby-lang.org/projects/ruby-master/repository/git/revisions/e717d6faa8463c70407e6aaf116c6b6181f30be6">git|e717d6faa8463c70407e6aaf116c6b6181f30be6</a>.</p>
<hr>
<p>IO#set_encoding_by_bom</p>
<ul>
<li>io.c (rb_io_set_encoding_by_bom): IO#set_encoding_by_bom to set<br>
the encoding by BOM if exists. [Bug <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: UTF-8 BOM should be removed from String in internal representation (Closed)" href="https://bugs.ruby-lang.org/issues/15210">#15210</a>]</li>
</ul> Ruby master - Bug #15210: UTF-8 BOM should be removed from String in internal representationhttps://bugs.ruby-lang.org/issues/15210?journal_id=785282019-06-13T10:02:56Znobu (Nobuyoshi Nakada)nobu@ruby-lang.org
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-5 priority-4 priority-default closed" href="/issues/15908">Bug #15908</a>: Detecting BOM with non-UTF encoding</i> added</li></ul>
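<p>The <code>IO#set_encoding_by_bom</code> method added by the changeset above can be tried roughly as follows. This is a minimal sketch that assumes a Ruby version that includes the method and a stream opened in binary mode with no encoding set yet; the temporary file and its contents are invented for the example.</p>

```ruby
require "tempfile"

detected = nil  # encoding reported by set_encoding_by_bom
text = nil      # remaining content after the BOM

Tempfile.create("bom") do |tmp|
  tmp.binmode
  tmp.write("\xEF\xBB\xBFhello")  # UTF-8 BOM followed by sample text
  tmp.flush

  # Open in binary mode, then let Ruby inspect the leading bytes.
  File.open(tmp.path, "rb") do |io|
    detected = io.set_encoding_by_bom
    text = io.read
  end
end
```

<p>When a BOM is found it is consumed and the stream's external encoding is set accordingly, so the subsequent read starts after the BOM.</p>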