Bug #7200

Setting external encoding with BOM|

Added by Brian Shirai over 1 year ago. Updated about 1 year ago.

[ruby-core:48130]
Status:Rejected
Priority:Normal
Assignee:Yui NARUSE
Category:-
Target version:2.0.0
ruby -v:ruby 1.9.3p286 (2012-10-12 revision 37165) [x86_64-darwin10.8.0]
Backport:

Description

File.open will accept, for example, :encoding => "bom|utf-16be:euc-jp" or :encoding => "bom|utf-16be". However, :external_encoding => "bom|utf-16be" raises an ArgumentError. Likewise, IO#set_encoding will accept "bom|utf-16be:euc-jp" but raises an ArgumentError if passed "bom|utf-16be", "euc-jp".

It is inconsistent to accept "bom|utf-*" in some cases and not others.

See the following IRB transcript.

$ irb
1.9.3p286 :001 > f = File.open "foo.txt", "r", :encoding => "bom|utf-16be:euc-jp"
=> #<File:foo.txt>
1.9.3p286 :002 > f.internal_encoding
=> #<Encoding:EUC-JP>
1.9.3p286 :003 > f.external_encoding
=> #<Encoding:UTF-16BE>
1.9.3p286 :004 > f.close
=> nil
1.9.3p286 :005 > f = File.open "foo.txt", "r"
=> #<File:foo.txt>
1.9.3p286 :006 > f.set_encoding "bom|utf-16be:euc-jp"
=> #<File:foo.txt>
1.9.3p286 :007 > f.internal_encoding
=> #<Encoding:EUC-JP>
1.9.3p286 :008 > f.external_encoding
=> #<Encoding:UTF-16BE>
1.9.3p286 :009 > f.close
=> nil
1.9.3p286 :010 > f = File.open "foo.txt", "r"
=> #<File:foo.txt>
1.9.3p286 :011 > f.set_encoding "bom|utf-16be", "euc-jp"
ArgumentError: unknown encoding name - bom|utf-16be
from (irb):11:in `set_encoding'
from (irb):11
from /Users/brian/.rvm/rubies/ruby-1.9.3-p286/bin/irb:16:in `<main>'
1.9.3p286 :012 > f = File.open "foo.txt", "w", :external_encoding => "bom|utf-16be"
ArgumentError: unknown encoding name - bom|utf-16be
from (irb):12:in `initialize'
from (irb):12:in `open'
from (irb):12
from /Users/brian/.rvm/rubies/ruby-1.9.3-p286/bin/irb:16:in `<main>'
1.9.3p286 :013 > f = File.open "foo.txt", "rb", :encoding => "bom|utf-16be"
=> #<File:foo.txt>
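The call forms exercised in the transcript can be re-checked on whatever Ruby is at hand. The script below is only a sketch for doing that: `probe` and `probe_bom_forms` are hypothetical helper names, every call uses a read mode on a temporary file, and no outcome is assumed, since other versions may not behave like 1.9.3p286.

```ruby
require "tempfile"

# Hypothetical helper: run the block and report :accepted, or the class of
# the exception it raised.
def probe
  yield
  :accepted
rescue StandardError => e
  e.class
end

# Try each way of passing a "bom|..." string that the report mentions.
def probe_bom_forms(path)
  {
    open_encoding_pair: probe {
      File.open(path, "r", encoding: "bom|utf-16be:euc-jp") { |f| f.external_encoding }
    },
    open_encoding_single: probe {
      File.open(path, "rb", encoding: "bom|utf-16be") { |f| f.external_encoding }
    },
    open_external_encoding: probe {
      File.open(path, "r", external_encoding: "bom|utf-16be") { |f| f.external_encoding }
    },
    set_encoding_one_arg: probe {
      File.open(path, "r") { |f| f.set_encoding("bom|utf-16be:euc-jp") }
    },
    set_encoding_two_args: probe {
      File.open(path, "r") { |f| f.set_encoding("bom|utf-16be", "euc-jp") }
    }
  }
end

Tempfile.create("bom-probe") do |tmp|
  probe_bom_forms(tmp.path).each { |form, outcome| puts "#{form}: #{outcome}" }
end
```

Each line of output names one call form followed by either `accepted` or the exception class that was raised.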

Thanks,
Brian

History

#1 Updated by Yusuke Endoh over 1 year ago

  • Status changed from Open to Assigned
  • Assignee set to Yui NARUSE
  • Target version set to 2.0.0

Naruse-san, could you handle this?

Yusuke Endoh mame@tsg.ne.jp

#2 Updated by Yui NARUSE over 1 year ago

  • Status changed from Assigned to Rejected

The BOM| specifier is available only in a mode_enc.
:encoding of open and set_encoding(mode_enc) handle a mode_enc,
but :external_encoding of open and set_encoding(ext, int) handle encodings.

#3 Updated by Shyouhei Urabe over 1 year ago

  • Status changed from Rejected to Assigned

Yui, that's how it works, not why it should be rejected.

If you want to reject this, write why.

#4 Updated by Yui NARUSE over 1 year ago

  • Status changed from Assigned to Rejected

I meant that this is the reason.
A mode_enc and an encoding are different things in syntax, implementation, and meaning.

BOM|UTF-* is not the name of an encoding; it is part of a mode specifier.
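The distinction can be seen with `Encoding.find`, which resolves encoding names only. This is a minimal sketch; `encoding_name?` is a hypothetical helper name introduced for illustration.

```ruby
# Hypothetical helper: true if the string is a name Encoding.find knows,
# false if Encoding.find raises ArgumentError for it.
def encoding_name?(name)
  Encoding.find(name)
  true
rescue ArgumentError
  false
end

puts encoding_name?("utf-16be")      # prints "true": a real encoding name
puts encoding_name?("bom|utf-16be")  # prints "false": a mode specifier, not an encoding name
```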

#5 Updated by Shyouhei Urabe over 1 year ago

  • Status changed from Rejected to Assigned

naruse (Yui NARUSE) wrote:

I meant that this is the reason.
A mode_enc and an encoding are different things in syntax, implementation, and meaning.

BOM|UTF-* is not the name of an encoding; it is part of a mode specifier.

That's OK.

But #set_encoding is confusing, or at least inconsistent, because it sets either an encoding or a mode depending on its arguments. Should we separate that method into two, like #set_mode and #set_encoding?

#6 Updated by Shyouhei Urabe over 1 year ago

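It seems I'm not getting through, so I'll write this in Japanese: you have no intention of solving the reporter's problem, do you?

Could you reread what the reporter's problem was? And then could you explain why this is not a problem?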

#7 Updated by Yui NARUSE over 1 year ago

shyouhei (Shyouhei Urabe) wrote:

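It seems I'm not getting through, so I'll write this in Japanese: you have no intention of solving the reporter's problem, do you?

Could you reread what the reporter's problem was? And then could you explain why this is not a problem?

Because it comes from Brian, I understand it is not a real-world problem.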

#8 Updated by Shyouhei Urabe over 1 year ago

naruse (Yui NARUSE) wrote:

shyouhei (Shyouhei Urabe) wrote:

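It seems I'm not getting through, so I'll write this in Japanese: you have no intention of solving the reporter's problem, do you?

Could you reread what the reporter's problem was? And then could you explain why this is not a problem?

Because it comes from Brian, I understand it is not a real-world problem.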

How dare you.

#9 Updated by Shyouhei Urabe over 1 year ago

So yui says this issue is illustrative because it was reported by Brian. What a ...

I feel very sorry, Brian. I can do nothing anymore.

#10 Updated by Martin Dürst over 1 year ago

Brian (or others),

[written in part to help Shouhei a bit]

Do you have an actual use case where you need something like
f.set_encoding "bom|utf-16be", "euc-jp"
If yes, can you explain?

The current behavior is in part influenced by implementation. But there is also a conceptual issue, because "bom|" only applies at the start of the file, and may have different implications for input (check for a BOM) and output (add a BOM). So we have to think carefully about the best way to make this easy for programmers to use correctly.
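A small sketch of the two sides, under stated assumptions: on output the BOM is written by hand, so the example does not depend on what "BOM|" does for writers, and on input the "BOM|UTF-8" mode specifier is used to skip a leading BOM. `write_with_bom` and `read_skipping_bom` are hypothetical helper names.

```ruby
require "tempfile"

BOM = "\uFEFF"

# Output side: prepend the BOM explicitly before the text.
def write_with_bom(path, text)
  File.open(path, "w:UTF-8") { |f| f.write(BOM + text) }
end

# Input side: let the mode specifier check for and skip a leading BOM.
def read_skipping_bom(path)
  File.open(path, "r:BOM|UTF-8", &:read)
end

Tempfile.create("bom-demo") do |tmp|
  # The text itself contains a second U+FEFF, in the middle of the file.
  write_with_bom(tmp.path, "a#{BOM}b")
  p File.binread(tmp.path).bytes.first(3)  # => [239, 187, 191]
  # Only the leading U+FEFF is skipped; the one between "a" and "b" remains.
  p read_skipping_bom(tmp.path) == "a#{BOM}b"  # => true
end
```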

Regards, Martin.

#11 Updated by Yui NARUSE over 1 year ago

shyouhei (Shyouhei Urabe) wrote:

So yui says this issue is illustrative because it was reported by Brian. What a ...

I feel very sorry, Brian. I can do nothing anymore.

Don't spread FUD.

Brian said they are inconsistent, even though a mode_enc merely looks like an encoding.
I showed the reason why they differ: they are different things and take different types of arguments.

If Brian is not satisfied with that reason and has a better idea, he should show it with an actual use case.
I thought Brian created this ticket out of Rubinius/RubySpec interest, so rejecting it should be reasonable because there is no use case.
I object to people imagining a fictional desire on his behalf and blaming me.

duerst (Martin Dürst) wrote:

The current behavior is in part influenced by implementation. But there is also a conceptual issue, because "bom|" only applies at the start of the file, and may have different implications for input (check for a BOM) and output (add a BOM). So we have to think carefully about the best way to make this easy for programmers to use correctly.

Mainly it is conceptual.
This BOM|UTF-* specifier has two main functions:
* skip U+FEFF at the beginning of the file
* set the external encoding according to the BOM that is found
Such behavior is considered a derivative of the mode; it is not an encoding.
Because it is not an encoding, it can't be used in contexts that expect an encoding.
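The two functions can be observed with a short script. This is a sketch rather than a statement about the implementation: `open_and_inspect` is a hypothetical helper, and the last call only prints whatever encoding the running Ruby picks for a file that starts with a UTF-16LE BOM.

```ruby
require "tempfile"

# Hypothetical helper: write raw bytes to a temporary file, reopen it with
# the given mode string, and return [external encoding, content].
def open_and_inspect(bytes, mode)
  Tempfile.create("bom-detect") do |tmp|
    File.binwrite(tmp.path, bytes)
    File.open(tmp.path, mode) { |f| [f.external_encoding, f.read] }
  end
end

utf8_with_bom = "\xEF\xBB\xBFhi".b

p open_and_inspect(utf8_with_bom, "r:UTF-8")      # plain encoding: U+FEFF stays in the content
p open_and_inspect(utf8_with_bom, "r:BOM|UTF-8")  # mode specifier: leading U+FEFF is skipped
p open_and_inspect("hi".b, "r:BOM|UTF-8")         # no BOM present: the named encoding is used

# A UTF-16LE BOM while the specifier names UTF-8: print what gets detected.
utf16le_with_bom = "\xFF\xFEh\x00i\x00".b
p open_and_inspect(utf16le_with_bom, "rb:BOM|UTF-8").first
```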

See also http://bugs.ruby-lang.org/issues/1951 and related tickets.

#12 Updated by Yui NARUSE about 1 year ago

  • Status changed from Assigned to Rejected

#13 Updated by Brian Shirai about 1 year ago

#set_encoding accepts ("bom|utf-16be:euc-jp") but rejects ("bom|utf-16be", "euc-jp"). This is inconsistent, confusing, and has nothing to do with the artificial mode vs encoding justification above. This inconsistency requires additional code that is subject to bugs.

The fact that there were no tests for this until I wrote the RubySpecs illustrates the inconsistency, confusion, and susceptibility to ad hoc implementation-defined semantics. I still don't see a single test for #set_encoding with "bom|" arguments in the MRI tests. Or am I missing something?

Cheers,
Brian
