Feature #695

More flexibility when combining ASCII-8BIT strings with other encodings

Added by mike (Michael Selig) about 11 years ago. Updated over 8 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:19590]

Description

=begin
Consider the following 3 Ruby statements:

# String#pack always returns ASCII-8BIT
s1 = [97, 98, 99, 1589].pack("U*")

# \xNN returns the source encoding (even if it is an invalid string), or ASCII-8BIT if not set
s2 = "abc\xD8\xB5"

# \uNNNN always returns UTF-8
s3 = "abc\u0635"

All of s1, s2, and s3 have the same contents, but different encodings. When you try to combine them, you get different "encoding compatibility" problems, which can change depending on the source encoding, due to the treatment of s2.
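For reference, here is how the three values compare in current Ruby (an editorial observation, not part of the original report; Array#pack's result encoding for a pure "U*" template was changed as a result of this ticket and now returns UTF-8):

```ruby
# Behavior in current Ruby (post-1.9). Note: unlike the 2008 behavior
# described in the comment above, pack("U*") now returns a UTF-8 string.
s1 = [97, 98, 99, 1589].pack("U*")
s2 = "abc\xD8\xB5"   # takes the source encoding (UTF-8 in a UTF-8 file)
s3 = "abc\u0635"
s1.bytes == s2.bytes && s2.bytes == s3.bytes  # true: identical bytes
s1.encoding                                   # #<Encoding:UTF-8>
```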

I would like to see Ruby be able to combine all the above without error. I don't think it is reasonable to have to use "force_encoding" in these cases. This would

  • give better compatibility with 1.8,
  • make handling of methods returning ASCII-8BIT strings much easier (e.g. String#pack and libraries which return strings in ASCII-8BIT because the encoding is unknown),
  • reduce the confusion caused by "\x" producing a string whose encoding depends on the source encoding (which I dislike - I think it should always return ASCII-8BIT).

So the feature request is:

When combining 2 strings, one being ASCII-8BIT and the other in encoding "E":
1) If the ASCII-8BIT string is valid when forced to encoding E, then treat the ASCII-8BIT string as being in encoding E;
2) Otherwise treat both strings as ASCII-8BIT.

Part (2) is less important, and can probably be omitted if it is hard to implement.
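The requested rule can be sketched in plain Ruby. `flexible_concat` is a hypothetical helper name used only for illustration, not an existing API:

```ruby
# Sketch of the proposed rule. Rule (1): if the ASCII-8BIT operand is
# valid when forced to the other operand's encoding E, treat it as E.
# Rule (2): otherwise treat both operands as ASCII-8BIT.
def flexible_concat(a, b)
  binary = Encoding::ASCII_8BIT
  unless [a.encoding, b.encoding].include?(binary) && a.encoding != b.encoding
    return a + b  # no ASCII-8BIT operand involved: normal behavior
  end
  bin, other = a.encoding == binary ? [a, b] : [b, a]
  forced = bin.dup.force_encoding(other.encoding)
  if forced.valid_encoding?
    a.encoding == binary ? forced + b : a + forced              # rule (1)
  else
    a.dup.force_encoding(binary) + b.dup.force_encoding(binary) # rule (2)
  end
end

s2 = "abc\xD8\xB5".b   # ASCII-8BIT bytes
s3 = "abc\u0635"       # UTF-8
flexible_concat(s2, s3).encoding  # Encoding::UTF_8 under rule (1)
```

Under rule (1) the binary s2 is relabeled UTF-8 (its bytes are valid UTF-8) and concatenation succeeds; with an invalid byte such as "\xFF", rule (2) drops both operands to ASCII-8BIT.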

Thank you
Michael Selig
=end

History

#1

Updated by mike (Michael Selig) about 11 years ago

=begin
Sorry, I meant Array#pack, not String#pack of course.

Mike.
=end

#2

Updated by nobu (Nobuyoshi Nakada) about 11 years ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100

=begin
Applied in changeset r20021.
=end

#3

Updated by duerst (Martin Dürst) about 11 years ago

=begin
At 07:14 08/10/31, Michael Selig wrote:

Hi,

Feature #695 was closed & marked done, but unfortunately it does not seem to have been implemented :-(

I think it should have been marked part done, part rejected,
I guess.

The request was:

When combining 2 strings, with one being ASCII-8BIT, and the other is encoding "E":
1) If the ASCII-8BIT string is valid if forced to encoding E, then treat the ASCII-8BIT string as being in encoding E;
2) Otherwise treat both strings as ASCII-8BIT.

Part (2) is less important, and can probably be omitted if it is hard to implement.

In my understanding, this would be a rather strong departure
from the current Ruby multilingual architecture, and not necessarily
a desirable one. It would be much more appropriate to start with
automatic conversion between labeled real encodings than to introduce
some conversion between arbitrary bytes and characters.
This distinction is already present in Ruby, you have to use
String#force_encoding in the above case, but String#encode
for actual conversion.
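The distinction can be seen directly (the Latin-1 string below is an editorial illustration, not from the original mail):

```ruby
# force_encoding merely relabels the bytes; encode actually transcodes them.
latin = "caf\xE9".b.force_encoding("ISO-8859-1")  # \xE9 is "é" in Latin-1
utf   = latin.encode("UTF-8")                     # bytes become \xC3\xA9
mislabeled = latin.dup.force_encoding("UTF-8")    # same bytes, new label
utf.valid_encoding?         # true
mislabeled.valid_encoding?  # false: a lone \xE9 is not valid UTF-8
```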

While things might 'just work' in some cases, treating arbitrary
ASCII-8BIT as a specific encoding if the byte pattern is okay
can result in many garbage-in-garbage-out cases. Some encodings
are more restrictive (e.g. UTF-8), but others, in particular all
single-byte encodings, have no restrictions at all.
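A quick check of that difference in restrictiveness (the junk bytes are an editorial illustration):

```ruby
# Every byte is a defined character in ISO-8859-1, so any byte string is
# "valid" there; UTF-8's structure rejects many byte sequences.
junk = "\x81\xFE\x90".b
junk.dup.force_encoding("ISO-8859-1").valid_encoding?  # true
junk.dup.force_encoding("UTF-8").valid_encoding?       # false: \x81 cannot start a character
```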

I don't think it is by chance that most programming languages I
know, even if they have a somewhat different internationalization
model, more focused on Unicode than Ruby, make a clear distinction
between characters and bytes. It also isn't by chance that one
of the first things people have to learn when they learn about
internationalization is "bytes are not characters".

The above change would also be very difficult and tedious to
implement in Ruby currently. I was looking into this just a little
bit to see how easy it would be to implement automatic conversions
between actual character sets.

However:

ruby -Kn -ve 'p "abc\xD8\xB5" + "abc\u0635"'
ruby 1.9.0 (2008-10-30 revision 20062) [i686-linux]
-e:1:in `': incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)

(The -Kn is only necessary here because with -e ruby uses the locale to determine the encoding of the string containing "\x".)
I thought this feature was implemented very quickly!

What appears to have been implemented is the encoding of "Array#pack" output with "U".
However, I am not totally convinced that even this was done correctly, as the pack output seems now to be marked UTF-8 even if the pack option contains a mixture of "U" with other options which then can result in an invalid UTF-8 string.
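That concern can be illustrated with an editorial example (the exact result encoding of mixed templates has varied between Ruby versions, so only the bytes and their validity are shown):

```ruby
# "U" emits the UTF-8 bytes of a codepoint; "C" emits one raw byte.
# Mixing them can yield a byte sequence that is not valid UTF-8.
t = [0x635, 0xFF].pack("UC")
t.bytes                                        # [0xD8, 0xB5, 0xFF]
t.dup.force_encoding("UTF-8").valid_encoding?  # false
```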

My feature request would mean that "pack" and "\x" string literals could be left as ASCII-8BIT, and be "forced" to another encoding transparently depending on how the programmer uses it.

I think this is totally the wrong way. The problems are with
pack and \x in string literals, and it would be a bad idea to
try and solve them by introducing a general "bytes become characters"
feature.

You can liken this feature to the transparent conversion of an integer to a float when doing arithmetic.

Well, it's not very similar. The conversion of an integer to a float is very predictable, but the 'conversion' of ASCII-8BIT to some real encoding is just a wild guess.

If you agree that this is a good idea, I don't mind trying to produce a patch for it myself. Please let me know.

I don't know about Matz or Nobu, but I don't think at all that this
is the way to go.

Regards, Martin.

Cheers
Mike

On Wed, 29 Oct 2008 14:53:15 +1100, Michael Selig redmine@ruby-lang.org wrote:

Feature #695: More flexibility when combining ASCII-8BIT strings with other encodings
http://redmine.ruby-lang.org/issues/show/695

Author: Michael Selig
Status: Open, Priority: Normal
Category: M17N


#-#-# Martin J. Dürst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

=end

#4

Updated by nobu (Nobuyoshi Nakada) about 11 years ago

=begin
Hi,

At Fri, 31 Oct 2008 07:14:21 +0900,
Michael Selig wrote in [ruby-core:19646]:

Feature #695 was closed & marked done, but unfortunately it does not seem to have been implemented :-(

Martin kindly replied already, so I don't have much to add to his post.

If you agree that this is a good idea, I don't mind trying to produce a patch for it myself. Please let me know.

I don't agree, but feel free to post your patch, of course.

--
Nobu Nakada

=end

#5

Updated by duerst (Martin Dürst) about 11 years ago

=begin
At 14:07 08/10/31, Michael Selig wrote:

Hi,

On Fri, 31 Oct 2008 15:42:55 +1100, Nobuyoshi Nakada nobu@ruby-lang.org wrote:

If you agree that this is a good idea, I don't mind trying to produce a
patch for it myself. Please let me know.

I don't agree, but feel free to post your patch, of course.

There seems little point in making the effort to produce a patch if it is going to be rejected.

I kinda like the idea mentioned at the end of my previous post, of separating "BINARY" and "ASCII-8BIT" into different encodings

This has been discussed quite extensively, but rejected.
I'm not very good at explaining why, because I was leaning
towards having this separation. Probably the person best
qualified to explain it is Akira Tanaka.

which function identically except when it comes to combining with other encodings.

While the separation was discussed (and rejected), the main
difference envisioned was that with ASCII-8BIT, the ASCII part
works as ASCII, but with BINARY, that wouldn't be the case.

So even if the separation would happen, I don't think your
proposal of using the two different encodings would be adopted.

Regards, Martin.

But before I do anything I would like some more discussion & feedback (like what happened with "default_internal"). Then I'd be happy to put the work in to do a patch, assuming that there is a reasonable likelihood it will be accepted (at least in part).

Cheers
Mike

#-#-# Martin J. Dürst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

=end

#6

Updated by duerst (Martin Dürst) about 11 years ago

=begin
At 13:57 08/10/31, Michael Selig wrote:

Hi

On Fri, 31 Oct 2008 13:51:53 +1100, Martin Duerst duerst@it.aoyama.ac.jp wrote:

Feature #695 was closed & marked done, but unfortunately it does not seem to have been implemented :-(

I think it should have been marked part done, part rejected,
I guess.

Some sort of explanation would also have been nice.

Sometimes things just happen. Often, that's enough, and
if not, it's always possible to ask (as you did).

Bug tracking systems give the impression of perfection,
but one always has to remember that they are only an
attempt.

But at least we are now discussing it - I was expecting this to happen before implementation :-)

I don't think it is by chance that most programming languages I
know, even if they have a somewhat different internationalization
model, more focused on Unicode than Ruby, make a clear distinction
between characters and bytes. It also isn't by chance that one
of the first things people have to learn when they learn about
internationalization is "bytes are not characters".

Yes, I agree with you, and I have raised this "ambiguity" before - in Ruby ASCII-8BIT can either be a byte string or a character string of uncertain encoding.

The problem I am trying to address here is for simple scripts which don't care about internationalisation.

Well, we could make some simple scripts simpler, but only at the
expense of making bigger scripts much more brittle. In my opinion,
once you use \x string escapes or pack, you have to know about the
distinction between bytes and characters, and should be able to
add the necessary force-encoding (or whatever else is needed).

My feature request would mean that "pack" and "\x" string literals could
be left as ASCII-8BIT, and be "forced" to another encoding transparently
depending on how the programmer uses it.

I think this is totally the wrong way. The problems are with
pack and \x in string literals, and it would be a bad idea to
try and solve them by introducing a general "bytes become characters"
feature.

"default_internal" has gone a long way to help solve M17N issues, but there still remain "encoding compatibility" issues even in simple, single-encoding scripts, i.e. between the locale's encoding and ASCII-8BIT. The motivation behind this feature request was to address this latter point.

I agree with you that there is a problem with "\x" in string literals. However I am not sure I agree that the problem is in pack. The root of the problem is this ambiguity with ASCII-8BIT between bytes and characters - the way I think it should work is really like a "wild card" encoding.

Well, I think there is a problem in pack. It has so many different
template characters that it's impossible in general to say what
encoding the result should be. Matz did some followup work on
your proposal at revision 20057
(see http://svn.ruby-lang.org/cgi-bin/viewvc.cgi/trunk/pack.c?view=log),
which tries to get the best result possible for simple cases.
For cases that use many different template characters at the
same time, it's simply impossible to figure out what the intent
of the programmer is, so the programmer will have to tell.

Pack is one simple example of a bunch of methods that return strings, but cannot easily determine what encoding to return them in.

I'd guess pack is one of the more complex ones. If you know others,
please tell us, I think nobody is claiming that all i's are dotted
and all t's crossed in this area.

Other examples are decryption and uncompression methods where often the original encoding is not known. In many cases there is no alternative other than to return them as ASCII-8BIT and let the application worry about interpreting the contents.

This is forcing the programmer to use "force_encoding()"

Or whatever else is appropriate.

where in 1.8 it was not necessary, and in 1.9 it can seem rather annoying.

It can seem annoying until you realize that it's necessary.

There is even a weird exception to this - if the ASCII-8BIT string happens to be all 7-bit chars, then it CAN be combined with other ASCII-compatible encodings.

Yes, that's one point where it may make sense to split ASCII-8BIT
and BINARY.

This probably allows some 1.8 legacy scripts to work, but only ones working in ASCII.
I do not think this sort of thing - one that works in some cases, but not in others - is desirable at all.

Yes, but in my view, you are just proposing to go down the slippery slope a bit further. The chances that ASCII is ASCII (and that otherwise, you'll find out pretty quickly when looking at the data) are much higher than the chances that any more specific encoding will be 'guessed' right.

So in fact Ruby already has what you describe as a "bytes become characters feature", but it only works in certain circumstances!

You can liken this feature to the transparent conversion of an integer to a float when doing arithmetic.

Well, it's not very similar. The conversion of an integer to a float is very predictable, but the 'conversion' of ASCII-8BIT to some real encoding is just a wild guess.

A "wild guess" is overstating it. If a program attempts to combine an ASCII-8BIT string with another encoded string, AND it happens to be a valid encoding, I think that the chances are very high that the program is expecting the byte string to be in the other encoding. I think that a heuristic like this is reasonable as it keeps the language backward compatible & neat.

Furthermore as I said, this conversion already happens with ASCII-8BIT character strings consisting only of 7 bit chars,

Well, yes, but then that's clearly reflected in the name "ASCII-8BIT".

so extending it to all encodings seems an obvious thing to do. Look at:

a) 7-bit char strings work, irrespective of encoding:
ruby -e 'p ("abc".force_encoding("ASCII-8BIT") + "abc".force_encoding("UTF-8")).encoding'
=> #<Encoding:UTF-8>

but:
b) Legal 8-bit encoding string:
ruby -e 'p ("ab\xE0".force_encoding("ASCII-8BIT") + "ab\xE0".force_encoding("ISO-8859-8")).encoding'
=> -e:1:in `': incompatible character encodings: ASCII-8BIT and ISO-8859-8 (Encoding::CompatibilityError)

c) Legal multibyte encoding string:
ruby -ve 'p ("ab\u0635".force_encoding("ASCII-8BIT") + "ab\u0635".force_encoding("UTF-8")).encoding'
=> -e:1:in `': incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)

I think you have to come up with much more realistic examples
than these.

Certainly I don't see the downside in the conversion to a single-byte encoding (e.g. example (b) above). Even if it converted when it shouldn't have, the indexing and "codepoint values" are the same as if the result were ASCII-8BIT.

The bytes are of course the same. But what counts is whether we have
the right characters.

One other idea: maybe we should distinguish between 2 encodings "BINARY" and "ASCII-8BIT", which are currently aliases. Essentially they are the same, but "BINARY" would mean "bytestring" and will generate an error if you try to combine it with any other encoding, while "ASCII-8BIT" would mean "unknown encoding", which can be combined transparently with other encodings.

See separate mail on this topic.

Regards, Martin.

Maybe there is a better solution - any ideas?

Cheers
Mike

#-#-# Martin J. Dürst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

=end
