Bug #178

File.open on sprintf-formatted string fails with encoding conversion error on OS X

Added by Eric Hodel almost 7 years ago. Updated about 4 years ago.

[ruby-core:17310]
Status:Closed
Priority:Normal
Assignee:Yui NARUSE

Description

=begin
String#% and File.open interact strangely on OS X. Opening a file whose name was produced by String#% raises an ArgumentError:

$ ruby19 -vwe 'File.new("foo" % [])'
ruby 1.9.0 (2008-06-18 revision 15873) [i686-darwin9.3.0]
-e:1:in `initialize': transcoding not supported (from US-ASCII to UTF8-MAC) (ArgumentError)
        from -e:1:in `new'
        from -e:1:in `<main>'

Using just "foo" as the filename works fine:

$ ruby19 -we 'File.new("foo")'

As does String#<<:

$ ruby19 -we 'File.new("foo" << "")'
=end
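
For anyone trying to narrow this down, a small probe along these lines may help. It is only a sketch: `probe` is a hypothetical helper that does not appear in the report, and it assumes the failure depends on the encoding label that each expression leaves on the string. It runs in a temporary directory so that no stray `foo` file is left behind.

```ruby
require "tmpdir"

# Hypothetical helper (not from the report): create a file called `name`
# and return the name's encoding label together with the outcome.
def probe(name)
  File.new(name, "w").close
  [name.encoding.name, "ok"]
rescue ArgumentError, EncodingError => e
  [name.encoding.name, e.class.name]
end

results = Dir.mktmpdir do |dir|
  Dir.chdir(dir) do
    {
      "literal"   => probe("foo"),
      "String#%"  => probe("foo" % []),
      "String#<<" => probe("foo".dup << ""),
    }
  end
end

results.each do |label, (encoding, outcome)|
  puts "#{label}: #{encoding} -> #{outcome}"
end
```

If one of the three lines shows a different encoding or outcome from the others, that expression is the one to look at.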

History

#1 Updated by Anonymous almost 7 years ago

=begin
I'm not sure why UTF8-MAC was introduced. UTF8-MAC indeed
isn't supported currently for transcoding.

I don't even know what UTF8-MAC is. It is defined as a replica
of UTF-8 in enc/utf_8.c. It is not defined at
http://www.iana.org/assignments/character-sets.

It may be that it is an attempt to refer to the fact that UTF-8
usually is used in decomposed form (NFD) on the Mac. But that
would not be relevant for opening a file, because the Mac OS
accepts any kind of normalization, and converts to NFD by itself
(similar to a file system that accepts both upper- and lower-case,
but internally uses only one case).

Also, the issue of normalization is orthogonal to which
encoding form is used for Unicode, so folding it into
an encoding is something that we should consider much more
carefully. Overall, UTF-8 should be UTF-8; it's a bad idea to
create variants.

Regards, Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

=end
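
To make the composed/decomposed distinction in the comment above concrete, here is a minimal sketch. It assumes a Ruby new enough to provide `String#unicode_normalize` (2.2 or later, so not the 1.9.0 build in the report), and the sample word is purely illustrative. Both strings carry the same UTF-8 encoding label; only the code point sequence differs, which is the sense in which normalization is orthogonal to the encoding form.

```ruby
composed   = "caf\u00E9"                       # "é" as a single code point (NFC)
decomposed = composed.unicode_normalize(:nfd)  # "e" followed by U+0301 (NFD)

[composed, decomposed].each do |s|
  hex = s.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
  puts "#{s.encoding}: #{hex} (#{s.bytesize} bytes)"
end

puts composed == decomposed                          # byte-wise comparison
puts composed == decomposed.unicode_normalize(:nfc)  # comparison after normalizing
```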

#2 Updated by Yui NARUSE almost 7 years ago

  • Status changed from Open to Closed
  • Assignee set to Yui NARUSE
  • % Done changed from 0 to 100

=begin
This problem has the same cause as Bug #179,
and it was fixed in r17403.
Thanks.

> It may be that it is an attempt to refer to the fact that UTF-8
> usually is used in decomposed form (NFD) on the Mac. But that
> would not be relevant for opening a file, because the Mac OS
> accepts any kind of normalization, and converts to NFD by itself
> (similar to a file system that accepts both upper- and lower-case,
> but internally uses only one case).

Yeah, that's true when you write to the filesystem,
but when you read from the filesystem you may want to know
whether the names are composed or decomposed.
=end
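
One way to find out whether a name read back from the filesystem is composed or decomposed is to inspect the string itself rather than rely on an encoding label. The sketch below assumes a Ruby with `String#unicode_normalized?` (2.2 or later) and UTF-8 input; `normalization_form` is a hypothetical helper, not an existing API, and it checks standard NFC/NFD, not Apple's variant.

```ruby
# Hypothetical helper: classify a UTF-8 string by standard normalization form.
def normalization_form(name)
  nfc = name.unicode_normalized?(:nfc)
  nfd = name.unicode_normalized?(:nfd)
  if nfc && nfd
    :both        # e.g. plain ASCII, where the two forms coincide
  elsif nfc
    :composed
  elsif nfd
    :decomposed
  else
    :mixed       # contains both precomposed and decomposed sequences
  end
end

["foo", "caf\u00E9", "cafe\u0301", "\u00E9e\u0301"].each do |name|
  puts "#{name.codepoints.inspect}: #{normalization_form(name)}"
end
```

The same helper could be applied to entries returned by a directory listing, provided those strings are labeled as UTF-8.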

#3 Updated by Anonymous almost 7 years ago

=begin
At 15:36 08/06/18, Yui NARUSE wrote:

> This problem has the same cause as Bug #179,
> and it was fixed in r17403.
> Thanks.

Great, thanks!

> > It may be that it is an attempt to refer to the fact that UTF-8
> > usually is used in decomposed form (NFD) on the Mac. But that
> > would not be relevant for opening a file, because the Mac OS
> > accepts any kind of normalization, and converts to NFD by itself
> > (similar to a file system that accepts both upper- and lower-case,
> > but internally uses only one case).
>
> Yeah, that's true when you write to the filesystem,
> but when you read from the filesystem you may want to know
> whether the names are composed or decomposed.

That may indeed be the case. But this really only applies to
filenames (and maybe similar names of resources) on the Mac.
For such a small subset of data, I think it's overkill if
as a consequence, processing together with other data is
blocked (as we saw in the bug report).
As far as I understand, it doesn't apply to file contents or
other data on the Mac. Also, as soon as you concatenate two
strings, there is no guarantee that NFD is kept (unless of
course you implement separate string concatenation for this
specific encoding). In general, the best thing to do if you
want to know is to check, and the best thing if you want to
be sure is to check, and then to change if necessary. But we
still have to implement this (maybe for -3?).

[Also, if the meaning of UTF8-MAC is really NFD, it might
be better to actually call it that way.]

Regards, Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

=end
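
The remark above that concatenation does not preserve NFD can be checked with a short sketch. It assumes a Ruby with `String#unicode_normalize` (2.2 or later) and uses standard NFD, not Apple's variant; the sample strings are illustrative. Each half is in NFD on its own, but the joined string has its combining marks in the wrong canonical order, so the "check, then change if necessary" step is needed.

```ruby
left  = "a\u0301"  # "a" + COMBINING ACUTE ACCENT (combining class 230)
right = "\u0323"   # COMBINING DOT BELOW (combining class 220)

joined = left + right

puts left.unicode_normalized?(:nfd)
puts right.unicode_normalized?(:nfd)
puts joined.unicode_normalized?(:nfd)

# Check first, then normalize only if the check fails.
fixed = joined.unicode_normalized?(:nfd) ? joined : joined.unicode_normalize(:nfd)
puts fixed.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
```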

#4 Updated by Yui NARUSE almost 7 years ago

=begin
Martin Duerst wrote:

> > > It may be that it is an attempt to refer to the fact that UTF-8
> > > usually is used in decomposed form (NFD) on the Mac. But that
> > > would not be relevant for opening a file, because the Mac OS
> > > accepts any kind of normalization, and converts to NFD by itself
> > > (similar to a file system that accepts both upper- and lower-case,
> > > but internally uses only one case).
> >
> > Yeah, that's true when you write to the filesystem,
> > but when you read from the filesystem you may want to know
> > whether the names are composed or decomposed.
>
> That may indeed be the case. But this really only applies to
> filenames (and maybe similar names of resources) on the Mac.
> For such a small subset of data, I think it's overkill if
> as a consequence, processing together with other data is
> blocked (as we saw in the bug report).

This bug has a different cause.

> As far as I understand, it doesn't apply to file contents or
> other data on the Mac. Also, as soon as you concatenate two
> strings, there is no guarantee that NFD is kept (unless of
> course you implement separate string concatenation for this
> specific encoding). In general, the best thing to do if you
> want to know is to check, and the best thing if you want to
> be sure is to check, and then to change if necessary. But we
> still have to implement this (maybe for -3?).

Of course, the encoding of other data on the Mac may be
something other than UTF8-MAC; it may be composed UTF-8.
My intent is that strings labeled as UTF8-MAC may need to be
converted or normalized. If you don't care about that,
you can use force_encoding.

> [Also, if the meaning of UTF8-MAC is really NFD, it might
> be better to actually call it that way.]

It is not real NFD but Apple's NFD, as I commented in enc/utf_8.c.

--
NARUSE, Yui naruse@airemix.jp

=end
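
As a rough illustration of the `force_encoding` suggestion above: relabeling changes only the encoding tag and leaves the bytes alone, so composing the name is a separate, optional step. The sketch assumes a Ruby that defines `Encoding::UTF8_MAC` and `String#unicode_normalize` (2.2 or later); the UTF8-MAC string is simulated rather than read from a real filesystem, and standard NFC is only an approximation of the inverse of Apple's decomposition.

```ruby
# Assumption: a file name that came back labeled UTF8-MAC (simulated here).
from_fs = "cafe\u0301".dup.force_encoding(Encoding::UTF8_MAC)

# force_encoding only swaps the label; the bytes are untouched.
relabeled = from_fs.dup.force_encoding(Encoding::UTF_8)

# Optional second step: compose using standard NFC.
composed = relabeled.unicode_normalize(:nfc)

puts from_fs.encoding
puts relabeled.encoding
puts relabeled.bytes == from_fs.bytes
puts composed.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
```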
