Ruby master - Bug #15210: UTF-8 BOM should be removed from String in internal representation</h1>
<article> <p>2018-10-06T19:51:02Z</p> <blockquote> <p>BTW: stdlib::CSV chokes on the BOM</p> </blockquote> <p>I can't say how common this is, or whether it is actually a bug. But if it is,<br> and the faulty behaviour affects other ruby hackers, then I agree that CSV<br> should probably be able to handle input with a BOM as well, in one way<br> or another (be it automatically or via another API).</p> <p>I also agree that it could be mentioned somewhere, be it in the<br> csv documentation or elsewhere.</p> <p>About the workaround: I assume you meant it only as a solution for others who face<br> a similar problem, rather than as a permanent addition to class String, yes?<br> (I ask because adding a specific method to class String permanently in<br> ruby may be much harder to get approved, whereas an extension to ruby's<br> CSV is most likely easier.)</p> </article>
<article> <p>2018-10-07T00:53:10Z</p> <ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/74333/diff?detail_id=50028">diff</a>)</li><li><strong>Assignee</strong> set to <i>13939</i></li></ul><p>foonlyboy (Eike Dierks) wrote:</p> <blockquote> <p>I believe this to be a bug in how byte data is converted to the ruby internal String representation.</p> </blockquote> <p>Yes, a BOM should be removed at conversion, that is, when reading from a data stream.</p> <blockquote> <p>There is a workaround, but this needs to be documented:</p> <pre><code class="ruby syntaxhl" data-language="ruby"><span class="no">IO</span><span class="p">.</span><span class="nf">read</span><span
class="p">(</span><span class="n">mode</span><span class="ss">:'r:BOM|UTF-8'</span><span class="p">)</span> </code></pre> </blockquote> <p>It is documented at <code>IO.new</code>, and you can use it at <code>CSV.open</code> too.</p> <p>rdoc of <code>CSV.open</code>:</p> <blockquote> <p>You must pass a <code>filename</code> and may optionally add a <code>mode</code> for Ruby's <code>open()</code>.</p> </blockquote> <p>rdoc of <code>Kernel.open</code>:</p> <blockquote> <p>See the documentation of <code>IO.new</code> for full documentation of the <code>mode</code> string directives.</p> </blockquote> <p>rdoc of <code>IO.new</code>:</p> <blockquote> <p>If <code>"BOM|UTF-8"</code>, <code>"BOM|UTF-16LE"</code> or <code>"BOM|UTF-16BE"</code> are used, Ruby checks for<br> a Unicode BOM in the input document to help determine the encoding. For<br> UTF-16 encodings the file open mode must be binary. When present, the BOM<br> is stripped and the external encoding from the BOM is used. When the BOM<br> is missing the given Unicode encoding is used as <code>ext_enc</code>. 
(The BOM-set<br> encoding option is case insensitive, so <code>"bom|utf-8"</code> is also valid.)</p> </blockquote> <p>Documentation improvement patches are welcome.</p> <blockquote> <p>But I'm asking to improve the UTF-BOM handling:</p> <ul> <li>The BOM is only used for transfer encoding at the byte stream level.</li> </ul> </blockquote> <p>This is half true.</p> <p><a href="https://en.wikipedia.org/wiki/Byte_order_mark#Usage" class="external">https://en.wikipedia.org/wiki/Byte_order_mark#Usage</a></p> <blockquote> <p>If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space"</p> </blockquote> <p>At any other position, the character is not called a "BOM".</p> <blockquote> <ul> <li>The BOM MUST NOT be part of the String in internal representation.</li> </ul> </blockquote> <p>Yes, it should be removed at read time; that is the only chance to remove a BOM properly.</p> </article>
<article> <p>2018-10-12T18:44:57Z</p> <p>I looked into it a bit more closely:</p> <p>io.c does this in</p> <pre><code class="c syntaxhl" data-language="c"><span class="k">static</span> <span class="kt">int</span> <span class="n">io_strip_bom</span><span class="p">(</span><span class="n">VALUE</span> <span class="n">io</span><span class="p">)</span> </code></pre> <p>which is called by:</p> <pre><code class="c syntaxhl" data-language="c"><span class="k">static</span> <span class="kt">void</span> <span class="n">io_set_encoding_by_bom</span><span class="p">(</span><span class="n">VALUE</span> <span class="n">io</span><span class="p">)</span> </code></pre> <blockquote> <p>It is documented at <code>IO.new</code>, and you can use it at <code>CSV.open</code> too.</p> </blockquote> <p>Yes, I was aware of this.</p> <p>I also agree that the conversion has to take place when opening the file.</p> <p>But with rails I get 
an ActionDispatch::Http::UploadedFile<br> (which returns an ASCII-8BIT byte stream)</p> <p>And I could find no way to apply io_strip_bom() to it,<br> not even by going through StringIO.<br> (but then Ruby is not about applying tricks anyway)</p> <p>It sounds to me like nobu also agrees that the BOM should always be removed.</p> <blockquote> <p>If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space"</p> </blockquote> <p>I don't care so much about this for now.<br> (though I can imagine it happening when concatenating files ...)</p> <p>But let's fix the simpler problems first.</p> <p>I think the BOM is used for two reasons in byte streams:</p> <ul> <li>a magic number for UTF encoded data (which might even apply to UTF-8)</li> <li>a magic number to distinguish different UTF byte orderings when using UTF-16 or UTF-32</li> </ul> <p>But in the ruby world, we have <strong>String</strong>.<br> We should remove all artefacts from any external encoding.</p> <p>Impact:</p> <p>I believe this might need changes in more than just one place in the code,<br> but I believe this should be fully upward compatible with <em>most</em> customers' code.</p> <p>This should still agree with the ruby spec,<br> because nowhere was it ever declared that String keeps the BOM.</p> <hr> <p>Please excuse my lengthy writings,<br> but I thought these encoding problems were a thing of the past.</p> <p>We might also look at the other languages around.<br> Makes for a good rosetta code ...</p> <p>~eike</p> </article>
<article> <p>2019-06-13T07:24:43Z</p> <p><a href="https://github.com/nobu/ruby/pull/new/feature/15210-detect_bom" class="external">https://github.com/nobu/ruby/pull/new/feature/15210-detect_bom</a></p> </article>
<article> <p>2019-06-13T08:36:54Z</p> <p>Renamed, and added an exception for unexpected conditions.<br> <a href="https://github.com/nobu/ruby/pull/new/feature/15210-set_encoding_by_bom" class="external">https://github.com/nobu/ruby/pull/new/feature/15210-set_encoding_by_bom</a></p> </article>
<article> <p>2019-06-13T09:16:35Z</p> <ul><li><strong>Status</strong> changed from <i>Open</i> to <i>Closed</i></li></ul><p>Applied in changeset <a class="changeset" title="IO#set_encoding_by_bom * io.c (rb_io_set_encoding_by_bom): IO#set_encoding_by_bom to set the e..." href="https://bugs.ruby-lang.org/projects/ruby-master/repository/git/revisions/e717d6faa8463c70407e6aaf116c6b6181f30be6">git|e717d6faa8463c70407e6aaf116c6b6181f30be6</a>.</p> <hr> <p>IO#set_encoding_by_bom</p> <ul> <li>io.c (rb_io_set_encoding_by_bom): IO#set_encoding_by_bom to set<br> the encoding by BOM if exists. [Bug <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: UTF-8 BOM should be removed from String in internal representation (Closed)" href="https://bugs.ruby-lang.org/issues/15210">#15210</a>]</li> </ul> </article>
<article> <p>2019-06-13T10:02:56Z</p> <ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-5 priority-4 priority-default closed" href="/issues/15908">Bug #15908</a>: Detecting BOM with non-UTF encoding</i> added</li></ul> </article> </main></body></html>