Bug #14863

Array#join with empty array returns empty string always in US-ASCII encoding

Added by xsimov (Xavier Simó) over 1 year ago. Updated over 1 year ago.

Status: Rejected
Priority: Normal
Assignee: -
Target version: -
[ruby-core:87562]

Description

Calling

irb(main):001:0> [].join.encoding
=> #<Encoding:US-ASCII>

returns an empty string and that empty string is always in US-ASCII encoding.

The expected result is that the returned empty string would be in UTF-8 since it seems to be the default for Ruby strings since 2.0.

History

Updated by shevegen (Robert A. Heiler) over 1 year ago

Interesting.

I kind of agree with you - that surprised me. Perhaps there is an
explanation for it, but perhaps it is a bug. Either way, I think
if it is a bug, it should obviously be fixed; and if it is not
a bug, perhaps the reason for this behaviour could be
explained in some detail.

I just tested with this snippet stored in a .rb file:

#!/System/Index/bin/ruby -w
# Encoding: ISO-8859-1
# frozen_string_literal: true
# =========================================================================== #
puts [].join.encoding

joined = ['a','b','c'].join

puts joined.encoding

puts 'abc'.encoding

The above generated three outputs:

US-ASCII
ISO-8859-1
ISO-8859-1

I sort of have to agree with Xavier. I think
the first result, for an empty [], is awkward
in particular when I already specified ANOTHER
encoding in the magic encoding comment - the other
two strings work fine and honour that comment. (The
third one was just for testing; the second line
was the result of a .join on an Array that
has some strings.)

[].join probably creates a new, empty string,
but I think that new string should default to
the main encoding (e. g. UTF-8) - or the one
specified in the magic encoding comment (if
applicable). So I think it is most likely a
bug or perhaps just an oversight. I myself
never ran into this issue in my code so far.

Updated by jeremyevans0 (Jeremy Evans) over 1 year ago

xsimov (Xavier Simó) wrote:

Calling

irb(main):001:0> [].join.encoding
=> #<Encoding:US-ASCII>

returns an empty string and that empty string is always in US-ASCII encoding.

The expected result is that the returned empty string would be in UTF-8 since it seems to be the default for Ruby strings since 2.0.

UTF-8 is the default for literal strings, not the default for all strings. Note that strings will automatically change their encoding from US-ASCII to UTF-8 if a UTF-8 string that uses non-ASCII characters is appended to them.

$ ruby -e 'p ([].join << "\u1234").encoding'
#<Encoding:UTF-8>
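
The same distinction can be sketched in a short script. This is only an illustration: the variable names are made up here, and `__ENCODING__` is used so that the comparison holds whatever the source encoding of the file happens to be.

```ruby
# Illustrative sketch; variable names are not from the thread.
computed = [].join   # built by a method, not written as a literal
literal  = ""        # plain literal, takes the source encoding

computed_encoding = computed.encoding   # US-ASCII, as reported above
literal_encoding  = literal.encoding    # same as __ENCODING__ (UTF-8 by default)

# Appending non-ASCII UTF-8 text switches the US-ASCII string to UTF-8.
appended = [].join << "\u1234"

puts computed_encoding
puts literal_encoding == __ENCODING__
puts appended.encoding
```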

Updated by xsimov (Xavier Simó) over 1 year ago

Taking into account
jeremyevans0 (Jeremy Evans) wrote:

UTF-8 is the default for literal strings, not the default for all strings. Note that strings will automatically change their encoding from US-ASCII to UTF-8 if a UTF-8 string that uses non-ASCII characters is appended to them.

$ ruby -e 'p ([].join << "\u1234").encoding'
#<Encoding:UTF-8>

And that Array#join also takes into consideration the encoding of the strings within that contain non-ASCII characters:

$ ruby -e 'p (["\u1234"].join).encoding'
#<Encoding:UTF-8>

maybe it would make sense, since UTF-8 is the default for literal strings, for it also to be the default for the empty string returned by joining an empty array.

My proposal to return the string in the locale encoding of the running ruby is so that the encoding returned by #join is consistent, since most of the time I see #join used, the array contains UTF-8 strings.

Thanks for your feedback! ;)

Updated by nobu (Nobuyoshi Nakada) over 1 year ago

xsimov (Xavier Simó) wrote:

My proposal to return the string in the locale encoding of the running ruby is so that the encoding returned by #join is consistent, since most of the time I see #join used, the array contains UTF-8 strings.

You seem to be confusing the locale encoding and the source code encoding, but they are different things.
The latter is UTF-8 by default now, but the former is not.

What problems are you trying to solve with this proposal?
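
To make the distinction concrete, here is a small sketch that prints each of these encodings for the running interpreter. The variable names are illustrative, and the values printed depend on the environment and on how the script is invoked, so no particular output is claimed.

```ruby
# Illustrative sketch; results vary with the environment.
source_encoding   = __ENCODING__               # encoding of this source file
locale_encoding   = Encoding.find("locale")    # taken from the process locale
external_encoding = Encoding.default_external  # default for IO; usually follows the locale

puts "source:   #{source_encoding}"
puts "locale:   #{locale_encoding}"
puts "external: #{external_encoding}"
```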

Updated by nobu (Nobuyoshi Nakada) over 1 year ago

  • Status changed from Open to Feedback

Updated by xsimov (Xavier Simó) over 1 year ago

Thanks for the clarification. What I am trying to solve with this proposal is that #join on an empty array returns a string in a different encoding than #join on an array of strings; the goal is to improve consistency.

I propose then to use the locale encoding.

I published a Pull Request with my proposed change in the Ruby git repository: https://github.com/ruby/ruby/pull/1897

Updated by nobu (Nobuyoshi Nakada) over 1 year ago

The result encoding is determined by the encodings of the array's contents, and the locale is not related to it.
US-ASCII is the most "plain", so it is the fallback when the array has no contents.
I don't think that it is consistent enough.

Updated by nobu (Nobuyoshi Nakada) over 1 year ago

nobu (Nobuyoshi Nakada) wrote:

I don't think that it is consistent enough.

Correction:
I think that it is consistent.

Updated by znz (Kazuhiro NISHIYAMA) over 1 year ago

[1].join.encoding is #<Encoding:US-ASCII>.
So I think the current behavior is consistent too.
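
The rule described in the last few comments can be checked with a short script. This is a sketch using only the cases already mentioned in this thread; the hash and its labels are illustrative.

```ruby
# Illustrative sketch: the result encoding follows the array's contents.
results = {
  "empty array"     => [].join.encoding,           # nothing to inherit from
  "integers only"   => [1].join.encoding,          # Integer#to_s is US-ASCII
  "non-ASCII UTF-8" => ["\u1234"].join.encoding,   # \u escapes are always UTF-8
  "ASCII literals"  => ["a", "b", "c"].join.encoding
}

results.each { |label, encoding| puts "#{label}: #{encoding}" }

# Joined literals keep the encoding those literals already had.
puts results["ASCII literals"] == "abc".encoding
```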

Updated by xsimov (Xavier Simó) over 1 year ago

I think it is not consistent, because what I have seen most often is arrays of strings, or arrays containing strings, being joined into one string.

In a case like that it so happens that all of the strings are in UTF-8 (because it is my locale encoding and the default on all the machines I have worked with), so I get different encodings depending on whether the array is filled or empty.

The fact that an array containing only numbers returns a US-ASCII string follows from #join being well written, since it takes the encoding of its elements, and from #to_s on numbers returning US-ASCII, which I also find unexpected. But I think that falls outside the scope of this change.

irb(main):001:0> 1.to_s.encoding
=> #<Encoding:US-ASCII>
irb(main):002:0> "a".encoding
=> #<Encoding:UTF-8>

So when the empty array suddenly returns a different (hardcoded) encoding it can break users' programs with an "invalid byte sequence in UTF-8" error, and it feels inconsistent for users working in an environment where the default encoding is anything other than US-ASCII.

Updated by cremno (cremno phobia) over 1 year ago

I also believe the current behavior makes sense. It won't cause the invalid byte sequence in UTF-8 issue. The US-ASCII character set is THE subset (see BINARY which is actually called ASCII-8BIT). Sure, the characters are sometimes encoded differently (e.g. UTF-32BE uses 4 bytes), but US-ASCII is fully compatible with UTF-8. Especially in this case, as an empty string doesn't contain any characters/byte sequences.

Meanwhile your proposal isn't radical enough to make sense to me: what about nil.to_s? Or the non-empty 65.chr?
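
A minimal sketch of the compatibility point, with illustrative variable names: combining the empty US-ASCII string with non-ASCII UTF-8 text raises no error and yields valid UTF-8. The last two lines show the encodings of the other two expressions mentioned above.

```ruby
# Illustrative sketch; variable names are not from the thread.
empty = [].join
utf8  = "caf\u00E9"   # \u escapes always produce a UTF-8 string

compatible = Encoding.compatible?(empty, utf8)   # encoding the combination would use
combined   = empty + utf8                        # no Encoding::CompatibilityError

puts compatible
puts combined.encoding
puts combined.valid_encoding?

# Other built-ins that also return US-ASCII strings:
puts nil.to_s.encoding
puts 65.chr.encoding
```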

Updated by nobu (Nobuyoshi Nakada) over 1 year ago

May I close this?

Updated by duerst (Martin Dürst) over 1 year ago

nobu (Nobuyoshi Nakada) wrote:

May I close this?

Yes, please do.

I agree that US-ASCII is the greatest common denominator for source encodings, and there isn't any program that would fail (except if somebody checks for UTF-8 or so explicitly).

On the other hand, if we change it to the source encoding, then the implementation gets more difficult (it has to somehow get the source encoding). Also, some programs that use this functionality in a setup where the encoding of the data and the source encoding aren't the same, and where occasionally an array is empty, may stop working.
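
For code that does check for UTF-8 explicitly, one option is to normalize the result at the call site. The helper below is hypothetical (it is not part of Ruby and was not proposed in this thread); it is only a sketch of how such a caller could get a UTF-8 string even when the array is empty.

```ruby
# Hypothetical helper, not part of Ruby: always returns a UTF-8 string.
def join_as_utf8(array, separator = "")
  # String#encode transcodes the joined result to UTF-8.
  array.join(separator).encode(Encoding::UTF_8)
end

puts [].join.encoding == Encoding::UTF_8           # the explicit check that fails
puts join_as_utf8([]).encoding == Encoding::UTF_8  # passes after normalizing
puts join_as_utf8(["a", "b"], "-")
```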

Updated by nobu (Nobuyoshi Nakada) over 1 year ago

  • Status changed from Feedback to Rejected

duerst (Martin Dürst) wrote:

On the other hand, if we change it to the source encoding, then the implementation gets more difficult (it has to somehow get the source encoding). Also, some programs that use this functionality in a setup where the encoding of the data and the source encoding aren't the same, and where occasionally an array is empty, may stop working.

There is no way to get the source encoding in a caller (unless its binding is available) from library methods, so the source encoding can't be a choice. ;)
