Ruby Issue Tracking System
Ruby master - Bug #10584: String.valid_encoding?, String.ascii_only? fails to account for BOM. https://bugs.ruby-lang.org/issues/10584?journal_id=50349 2014-12-10T11:21:11Z duerst (Martin Dürst) duerst@it.aoyama.ac.jp
<p>This isn't as simple as you describe it. With respect to BOMs, there is a clear distinction between external data and internal data. A BOM is often very helpful in external data (e.g. a file). For internal data, on the other hand, it's not only useless but actually highly counterproductive (just think about concatenation).</p>
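<p>To illustrate the concatenation point, here is a minimal sketch. The variable names and sample strings are made up for illustration; the only assumption is that each string kept its leading BOM after being read in.</p>

```ruby
# Hypothetical example: two strings that each kept their leading BOM (U+FEFF).
bom = "\uFEFF"

first  = bom + "hello, "
second = bom + "world"

# Concatenation knows nothing about BOMs, so the second one ends up
# in the middle of the text, where it no longer marks anything.
joined = first + second

bom_count          = joined.count(bom)     # => 2
interior_bom_index = joined.index(bom, 1)  # => 8 (character index)
```

<p>The interior U+FEFF is invisible when printed, but it still affects length, comparison, and searching.</p>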
<p>The problem currently is that Ruby doesn't absorb that difference; it leaves it to the programmer. The reason for this is that it's difficult to define a clear external/internal boundary (the file example is the easy one). Also, some cases require a BOM (e.g. UTF-16 in XML), others forbid it, and still others merely allow it. It might be possible to deal with some of this as options on methods reading from files, but that would require careful analysis.</p>
<p>Because U+FFFE is a noncharacter in Unicode, your first two examples could be made true, and might indeed catch some errors. For your third example, a string with a BOM is definitely not ASCII, so <code>ascii_only?</code> should definitely return false. This is not only the definition of ASCII, but also tightly linked to Ruby's internals (including optimizations).</p>
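<p>A small sketch of the <code>ascii_only?</code> point. The sample text is hypothetical, and <code>delete_prefix</code> (Ruby 2.5+) is just one way a programmer might drop the BOM by hand, not something proposed in this issue.</p>

```ruby
plain    = "# comment"
with_bom = "\uFEFF# comment"   # same text with a leading U+FEFF

plain_ascii    = plain.ascii_only?      # => true
with_bom_ascii = with_bom.ascii_only?   # => false

# In UTF-8 the BOM is three non-ASCII bytes, which is why the check fails.
bom_bytes = with_bom.bytes.first(3)     # => [0xEF, 0xBB, 0xBF]

# Removing the BOM explicitly (programmer's responsibility today).
stripped_ascii = with_bom.delete_prefix("\uFEFF").ascii_only?  # => true
```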
<p>For your fourth example, once internal, it's unclear whether the BOM is actually a BOM or a zero-width non-breaking space. The latter can easily appear at the start of a piece of text. Although explicitly deprecated, it's still effective; I used it recently in a Web page.</p>
Ruby master - Bug #10584: String.valid_encoding?, String.ascii_only? fails to account for BOM. https://bugs.ruby-lang.org/issues/10584?journal_id=69349 2018-01-05T21:01:17Z naruse (Yui NARUSE) naruse@airemix.jp
<ul><li><strong>Target version</strong> deleted (<del><i>2.2.0</i></del>)</li></ul>
Ruby master - Bug #10584: String.valid_encoding?, String.ascii_only? fails to account for BOM. https://bugs.ruby-lang.org/issues/10584?journal_id=97921 2022-06-10T06:15:51Z mame (Yusuke Endoh) mame@ruby-lang.org
<ul><li><strong>Status</strong> changed from <i>Open</i> to <i>Rejected</i></li></ul><p>For the third and fourth examples, you can use the <code>BOM|UTF-8</code> encoding.</p>
<pre><code>$ ruby -e 'p File.read("utf-8-with-bom-file", encoding: "BOM|UTF-8").ascii_only?'
true
$ ruby -e 'p File.read("utf-8-with-bom-file", encoding: "BOM|UTF-8")[0]'
"#"
</code></pre>
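<p>The commands above depend on an existing file. A self-contained sketch of the same comparison follows, assuming a temporary file whose contents (<code># comment</code> preceded by a UTF-8 BOM) are made up for illustration.</p>

```ruby
require "tempfile"

raw = nil
stripped = nil

# Hypothetical file contents: a UTF-8 BOM followed by ASCII text.
Tempfile.create("utf-8-with-bom-file") do |f|
  f.binmode
  f.write("\xEF\xBB\xBF# comment\n".b)
  f.flush

  raw      = File.read(f.path, encoding: "UTF-8")      # BOM kept as U+FEFF
  stripped = File.read(f.path, encoding: "BOM|UTF-8")  # BOM consumed on read
end

raw_first      = raw[0]               # => "\uFEFF"
raw_ascii      = raw.ascii_only?      # => false
stripped_first = stripped[0]          # => "#"
stripped_ascii = stripped.ascii_only? # => true
```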
<p>For the first and second examples, I think it is a problem of the definition of <code>String#valid_encoding?</code> rather than of the BOM. Currently, <code>"\uFFFE".valid_encoding?</code> returns true. (Note that <code>U+FFFE</code> is not a character.) So I think the current behavior is considered part of the spec. If we change it as a new feature, we need to evaluate its value and estimate the impact on compatibility.</p>
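<p>A short sketch of that behavior, on the assumption that <code>valid_encoding?</code> only checks that the bytes are well-formed in the string's encoding. The truncated byte sequence is a made-up contrast case.</p>

```ruby
# U+FFFE is a noncharacter, but its UTF-8 encoding (EF BF BE) is well-formed.
noncharacter       = "\uFFFE"
noncharacter_valid = noncharacter.valid_encoding?   # => true

# Contrast: a truncated UTF-8 sequence (first two bytes of the BOM) is malformed.
truncated       = "\xEF\xBB".b.force_encoding(Encoding::UTF_8)
truncated_valid = truncated.valid_encoding?         # => false
```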