Bug #3407

Kernel.open Ignores 'BOM|' Prefix of :encoding Value

Added by Run Paint Run Run almost 5 years ago. Updated almost 4 years ago.

[ruby-core:30641]
Status:Closed
Priority:Low
Assignee:-
ruby -v:ruby 1.9.3dev (2010-06-01 trunk 28120) [i686-linux]
Backport:

Description

=begin
As reported in :

open('/tmp/bom', mode: ?w){|f| f << "\xEF\xBB\xBFfoo"}
[*open('/tmp/bom', encoding: 'BOM|utf-8').read.bytes]
=> [239, 187, 191, 102, 111, 111]
[*open('/tmp/bom', mode: 'r:BOM|utf-8').read.bytes]
=> [102, 111, 111]
[*open('/tmp/bom', 'r:BOM|utf-8').read.bytes]
=> [102, 111, 111]
=end
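For anyone wanting to re-check this on their own build, the report can be turned into a self-contained script. This is only a sketch: the helper name bom_read_results and the temporary directory are illustrative assumptions, not part of the original report. It writes the same BOM-prefixed file and collects the bytes seen through each of the three ways of asking for 'BOM|UTF-8'.

```ruby
require 'tmpdir'

# Hypothetical helper (not from the report): writes a UTF-8 BOM followed by
# "foo" to a scratch file and returns the bytes read back through each way
# of requesting BOM stripping, plus the raw bytes for comparison.
def bom_read_results
  Dir.mktmpdir do |dir|
    path = File.join(dir, 'bom')
    File.binwrite(path, "\xEF\xBB\xBFfoo".b)

    {
      # :encoding option, the case this report says ignored the prefix
      encoding_option: File.open(path, encoding: 'BOM|UTF-8') { |f| f.read.bytes },
      # :mode option carrying the encoding after the colon
      mode_option:     File.open(path, mode: 'r:BOM|UTF-8') { |f| f.read.bytes },
      # positional mode string
      mode_string:     File.open(path, 'r:BOM|UTF-8') { |f| f.read.bytes },
      # untouched file contents, BOM included
      raw:             File.binread(path).bytes
    }
  end
end

bom_read_results.each { |how, bytes| puts "#{how}: #{bytes.inspect}" }
```

On a build without the bug, the first three entries should agree with each other and differ from :raw only by the three leading BOM bytes.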

History

#1 Updated by Nobuyoshi Nakada almost 5 years ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100

=begin
This issue was solved with changeset r28199.
Run Paint, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.

=end

#2 Updated by Run Paint Run Run almost 5 years ago

=begin
Much obliged. Is the following intended?

File.read('/tmp/bom', external_encoding: 'BOM|UTF-8') 
#=> ArgumentError: unknown encoding name - BOM|UTF-8

(I also noticed that io_encname_bom_p() appears to allow all 'UTF-' encodings to be prefixed with 'BOM|', yet io_strip_bom() doesn't strip the UTF-7 BOM. If I'm correct, an encoding of 'BOM|UTF-7' should probably be forbidden rather than silently discarded.)
=end

#3 Updated by Yui NARUSE almost 5 years ago

=begin

File.read('/tmp/bom', external_encoding: 'BOM|UTF-8')
#=> ArgumentError: unknown encoding name - BOM|UTF-8

Use IO.read('/tmp/bom', encoding: 'BOM|UTF-8').
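The 'BOM|' prefix is not part of an encoding name; it is only accepted where a mode_enc is parsed (the encoding part of a mode string, or the :encoding option).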
=end
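The difference between the two options can be observed side by side. A minimal sketch, assuming a scratch file and a hypothetical helper named compare_bom_options; File.read is used here, which is the same method as IO.read inherited by File. The :external_encoding branch rescues ArgumentError so the script reports whichever behaviour the running Ruby has rather than assuming one.

```ruby
require 'tmpdir'

# Hypothetical helper: passes the same 'BOM|UTF-8' value through :encoding
# and through :external_encoding, returning what each one produced.
def compare_bom_options
  Dir.mktmpdir do |dir|
    path = File.join(dir, 'bom')
    File.binwrite(path, "\xEF\xBB\xBFfoo".b)

    # :encoding is parsed like the encoding part of a mode string,
    # so the 'BOM|' prefix is understood here.
    via_encoding = File.read(path, encoding: 'BOM|UTF-8').bytes

    # :external_encoding expects a plain encoding name; the thread reports
    # an ArgumentError for this value, so capture the message if it is raised.
    via_external =
      begin
        File.read(path, external_encoding: 'BOM|UTF-8').bytes
      rescue ArgumentError => e
        e.message
      end

    { encoding: via_encoding, external_encoding: via_external }
  end
end

p compare_bom_options
```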

#4 Updated by Run Paint Run Run almost 5 years ago

=begin
I suppose so. It just seems to add more complexity to an already confusing process. The format of a mode string is:

  • 'a' or 'r' or 'w'
  • Optionally followed by '+'
  • Optionally followed by either 'b' or 't'
  • Optionally followed by a colon, an optional 'BOM|' (if the external encoding is Unicode, and ignoring the UTF-7 case), followed by an encoding name.
  • Optionally followed by another colon, then either another encoding name or hyphen.

Then, the :encoding argument can take the value after the first colon in the mode string. The :internal_encoding argument can take the value after the second colon in the mode string. However, the :external_encoding argument takes the value between the two colons, but cannot have a 'BOM|' prefix. (Further, the rdoc (io.c:6363) claims that, w.r.t. :external_encoding, '-' is a synonym for Encoding.default_external, but this value raises an ArgumentError). It's a lot to explain. The fewer special cases, the better, IMHO.
=end
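The mode-string format listed in the comment above can be written down as a regular expression, which makes the number of moving parts easier to see. This is a sketch of the format as described in this thread only; it is not the parser in io.c, and the names MODE_STRING and parse_mode_string are hypothetical. Matching 'BOM|' case-insensitively is an assumption of this sketch.

```ruby
# Sketch of the mode-string format as described in this thread.
# NOT the actual io.c parser; names here are illustrative.
MODE_STRING = /\A
  (?<access>[arw])                 # 'a' or 'r' or 'w'
  (?<plus>\+)?                     # optionally followed by '+'
  (?<data_mode>[bt])?              # optionally followed by 'b' or 't'
  (?:
    :(?<bom>(?i:BOM\|))?           # colon, then an optional 'BOM|' prefix
    (?<external>[^:]+)             # external encoding name
    (?: :(?<internal>[^:]+) )?     # second colon, then encoding name or '-'
  )?
\z/x

# Returns a hash describing the parts of the mode string,
# or nil if the string does not fit the described format.
def parse_mode_string(mode)
  m = MODE_STRING.match(mode)
  return nil unless m

  {
    access:     m[:access],
    read_write: !m[:plus].nil?,
    data_mode:  m[:data_mode],
    bom:        !m[:bom].nil?,
    external:   m[:external],
    internal:   m[:internal]
  }
end

p parse_mode_string('r:BOM|utf-8')
p parse_mode_string('w+b:UTF-16LE:-')
```

In these terms, :encoding accepts everything after the first colon, :internal_encoding accepts the internal part, and :external_encoding accepts only the external part without the 'BOM|' prefix.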
