Feature #18576

Rename `ASCII-8BIT` encoding to `BINARY`

Added by byroot (Jean Boussier) 5 months ago. Updated 3 months ago.

Status:
Open
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:107514]

Description

Context

I'm now used to it, but something that confused me for years was errors such as:

>> "fée" + "\xFF".b
(irb):3:in `+': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)

When you aren't that familiar with Ruby, it's really not evident that ASCII-8BIT basically means "no encoding" or "binary".

And even when you know it, if you don't read carefully it's very easily confused with US-ASCII.

The Encoding::BINARY alias is much more telling IMHO.
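To make the point concrete, here is a minimal sketch in plain Ruby showing that both constants name the very same object, and that mixing non-ASCII UTF-8 text with non-ASCII binary bytes fails. The variable names are only illustrative, and since the wording of the error message can differ between Ruby versions, only the exception class is captured.

```ruby
# Both constants refer to the very same Encoding object.
same_object = Encoding::BINARY.equal?(Encoding::ASCII_8BIT)

utf8   = "f\u00E9e"   # "fée", UTF-8 text with one non-ASCII character
binary = "\xFF".b     # a single non-ASCII byte, tagged as binary

# Concatenating the two raises, because neither side is ASCII-only.
error_class =
  begin
    utf8 + binary
    nil
  rescue Encoding::CompatibilityError => e
    e.class
  end

puts same_object
puts error_class
```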

Proposal

Since Encoding::ASCII_8BIT has been aliased as Encoding::BINARY for years, I think renaming it to BINARY and then making ASCII_8BIT the alias would significantly improve usability without backward compatibility concerns.

The only concern I could see would be the consistency with a handful of C API functions:

  • rb_encoding *rb_ascii8bit_encoding(void)
  • int rb_ascii8bit_encindex(void)
  • VALUE rb_io_ascii8bit_binmode(VALUE io)

But that's for much more advanced users, so I don't think it's much of a concern.

Updated by duerst (Martin Dürst) 5 months ago

Well, it's actually not just binary. Binary would mean that none of the bytes have any 'meaning' (i.e. characters) assigned to them. But ASCII-8BIT actually has character 'meaning' assigned to the ASCII range.
You can for example do the following:

u = (b = "abcde".force_encoding('ASCII-8BIT')).encode('UTF-8')

This gives you the string "abcde" with the encoding UTF-8. This shows that the lower 7 bits are interpreted the same as US-ASCII. The range with the 8th bit set, on the other hand, is just binary values, so

"\xCD".force_encoding('ASCII-8BIT').encode('UTF-8')

produces this error:

Encoding::UndefinedConversionError ("\xCD" from ASCII-8BIT to UTF-8)

I choose UTF-8 as the target encoding because that contains all of Unicode, so the error cannot be because the source character doesn't exist in the target encoding.

So there's indeed some complexity here, but it's not exactly what you think.

Updated by byroot (Jean Boussier) 5 months ago

@duerst (Martin Dürst) I'm aware of this, but I don't quite see how it's a concern. It's a fairly subtle behavior, and I doubt the ASCII-8BIT name particularly reveals it.

Also nitpick, but a better example would be:

"\xC3\xA9".b.encode(Encoding::UTF_8) # => Encoding::UndefinedConversionError

Since it's valid UTF-8.
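The two behaviours discussed so far can be put side by side in a short sketch. It assumes nothing beyond core String methods; the variable names are illustrative. Transcoding with encode only succeeds for the ASCII range, while relabelling with force_encoding changes no bytes and recovers the text when the bytes really are UTF-8.

```ruby
# ASCII-only bytes transcode from the binary encoding without complaint.
ascii = "abcde".b.encode(Encoding::UTF_8)

# "\xC3\xA9" is the UTF-8 byte sequence for "é", but while the string is
# tagged as binary those bytes carry no character meaning, so encode fails.
raw = "\xC3\xA9".b
transcode_error =
  begin
    raw.encode(Encoding::UTF_8)
    nil
  rescue Encoding::UndefinedConversionError => e
    e.class
  end

# Relabelling (no byte is changed) is what actually recovers the text.
relabelled = raw.dup.force_encoding(Encoding::UTF_8)

puts ascii.encoding
puts transcode_error
puts relabelled.valid_encoding?
```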

Updated by naruse (Yui NARUSE) 5 months ago

duerst (Martin Dürst) wrote in #note-1:

Well, it's actually not just binary. Binary would mean that none of the bytes have any 'meaning' (i.e. characters) assigned to them. But ASCII-8BIT actually has character 'meaning' assigned to the ASCII range.

I agree with the principle.
But we should read this proposal as: "The ASCII range of binary data in the world is usually ASCII. Why give it such a complex name as ASCII-8BIT?"

I think the name of the encoding is a communication tool. We should compare pros and cons between ASCII-8BIT and BINARY.

Updated by Eregon (Benoit Daloze) 5 months ago

+1000 for this, I think ASCII-8BIT is always extremely confusing, and BINARY is much more revealing (= we don't know what the actual encoding is, or it might be binary data and not text).
I've seen many Ruby users confused by this.
I'm not sure why I never thought to propose it here TBH.

I've literally never used the Encoding::ASCII_8BIT form in code (and rarely if ever seen it) but Encoding::BINARY many times.

The property that bytes < 128 are interpreted as US-ASCII is nothing special, every Encoding#ascii_compatible? behaves like that.
And almost all non-dummy Ruby encodings are #ascii_compatible?, the only two exceptions are UTF-16/32 (both LE/BE).
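The claim about ASCII compatibility can be checked on any given interpreter with a short query. This is only a sketch: the exact list depends on the Ruby version and build, so no output is shown here.

```ruby
# Real (non-dummy) encodings that are NOT ASCII-compatible.
non_ascii_compatible = Encoding.list
  .reject(&:dummy?)
  .reject(&:ascii_compatible?)
  .map(&:name)

puts non_ascii_compatible.sort

# The binary encoding itself is a regular, ASCII-compatible encoding.
puts Encoding::BINARY.ascii_compatible?
puts Encoding::BINARY.dummy?
```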

Two things particularly confusing about the name ASCII-8BIT:

  • It's completely unclear it might mean binary data or unknown encoding
  • ISO-8859-* and many other encodings are 8-bit ascii-compatible encodings. Yet ASCII-8BIT which name seems to imply something close is nothing like that (the 8th bit is undefined, uninterpreted but valid).

(FWIW JCodings, the Java library for Ruby encodings has ASCIIEncoding.INSTANCE for BINARY, that's even worse as it's even more confusing with US-ASCII, I've been thinking how to fix that in JCodings in a compatible way)

Updated by Eregon (Benoit Daloze) 5 months ago

BTW Python has the "bytes" encoding and it behaves very similarly to Ruby's BINARY encoding (it's a different type in Python but that's a detail).
e.g.

>>> bytes("abcdé", 'utf-8')
b'abcd\xc3\xa9'

That's also a more telling name than ASCII-8BIT.
BINARY is better for Ruby because it's already an established name for it.

There is also already String#b for binary, it's not String#a or so.
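For comparison with the Python snippet, a rough Ruby counterpart using String#b might look like this. It is a sketch only: String#b returns a copy of the same bytes tagged with the binary encoding, so the byte sequence is unchanged while character-level operations start counting bytes.

```ruby
text  = "abcd\u00E9"   # "abcdé" in UTF-8
bytes = text.b         # copy of the same bytes, tagged as binary

puts text.encoding  == Encoding::UTF_8    # true
puts bytes.encoding == Encoding::BINARY   # true
puts text.length                          # 5 (characters)
puts bytes.length                         # 6 (bytes)
puts bytes.bytes == text.bytes            # true
```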

Updated by naruse (Yui NARUSE) 5 months ago

The name ASCII-8BIT expresses how deeply we considered what "binary" is. Ruby 1.9's encoding system is a series of inventions. Ruby invented some ideas: ASCII COMPATIBLE and ASCII-8BIT.

Two things particularly confusing about the name ASCII-8BIT:

  • It's completely unclear it might mean binary data or unknown encoding
  • ISO-8859-* and many other encodings are 8-bit ascii-compatible encodings. Yet ASCII-8BIT which name seems to imply something close is nothing like that (the 8th bit is undefined, uninterpreted but valid).

Your two questions raise very good points. The answer to them is tightly coupled with the name ASCII-8BIT.

  • It's completely unclear it might mean binary data or unknown encoding

I want to ask how often you can actually distinguish them. Ruby's assumption is that developers cannot distinguish them in normal use cases. If so, Ruby may not provide two objects. If Ruby provides only one object for them, developers don't need to clarify it.

ISO-8859-* and many other encodings are 8-bit ascii-compatible encodings. Yet ASCII-8BIT which name seems to imply something close is nothing like that (the 8th bit is undefined, uninterpreted but valid).

This is a very good question. Ruby's answer is "yes, ASCII-8BIT is similar to ISO-8859-*". As you say, an ASCII-8BIT string's 8-bit range is undefined. But that doesn't matter to Ruby. In the real world such a phenomenon is sometimes encountered.

For example, the charset of an HTTP header is usually ISO-8859-1. Many languages have struggled with how to handle these octets. Python and .NET handle them as binary, which prevents applying powerful String methods to such data. Ruby handles them as ASCII-8BIT. Ruby's insight is that the binary data Ruby handles usually consists of such octets. The name ASCII-8BIT reflects that insight.

Therefore the conclusion for your question is that they are just what the real world is. The name just reflects that.

Anyway, Rails programmers don't usually need such understanding. If renaming helps people who only scratch the surface of this chaos, it might be worth considering, though changing encoding.name may hit the compatibility issue.

Updated by tenderlovemaking (Aaron Patterson) 5 months ago

First, I agree with this proposal. Second, I think this example should raise an exception:

u = (b = "abcde".force_encoding('ASCII-8BIT')).encode('UTF-8')

But I can open a different ticket for that. The point I actually want to make is that I've never seen this use case in the wild. 100% of the cases I've seen for force_encoding('ASCII-8BIT') are when the developer knows the string is binary (or unknown) data and they want to treat it as binary / unknown data not as "might be US-ASCII sometimes". The name "binary" would more accurately reflect real world usage IMO.

Updated by Eregon (Benoit Daloze) 5 months ago

naruse (Yui NARUSE) wrote in #note-6:

I want to ask how often you can actually distinguish them.

I think in many cases it is possible to distinguish.
For instance, an HTTP header might initially be in the binary encoding and mean "unknown encoding" (can often find the real encoding through Content-Type's charset, but not always and could be invalid)
Another example is socket.read(N) which might be actual binary data (e.g. for a binary protocol), or text and the actual encoding depends then on what's communicated on that socket.
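A minimal sketch of that workflow, assuming a pipe stands in for the socket: bytes arrive tagged as binary, and are relabelled only once the real charset is known and the bytes are valid in it. The helper name label_bytes is hypothetical and not part of any library.

```ruby
# Hypothetical helper: label raw bytes with a charset learned out of band
# (e.g. from a Content-Type header) and keep them binary if that fails.
def label_bytes(raw, charset)
  candidate = raw.dup.force_encoding(charset)
  candidate.valid_encoding? ? candidate : raw
end

reader, writer = IO.pipe
writer.binmode
writer.write("caf\u00E9")   # 5 bytes of UTF-8
writer.close

raw = reader.read(5)        # IO#read with a length returns a binary string
reader.close

text    = label_bytes(raw, "UTF-8")            # valid, becomes UTF-8
garbage = label_bytes("\xFF\xFE".b, "UTF-8")   # invalid, stays binary

puts raw.encoding == Encoding::BINARY
puts text.encoding
puts garbage.encoding == Encoding::BINARY
```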

And I would think most Ruby programs need to handle the binary encoding somehow, and can only leave a String as binary if it's only bytes < 128, otherwise things break.

If so, Ruby may not provide two objects.

I don't think two different "binary" Encodings are useful, one seems enough in practice and can be used for both meanings, which are very close (as a binary byte array, or a marker for unknown encoding).

This is a very good question. Ruby's answer is "yes, ASCII-8BIT is similar to ISO-8859-*". As you say, an ASCII-8BIT string's 8-bit range is undefined. But that doesn't matter to Ruby. In the real world such a phenomenon is sometimes encountered.

I think such situations need to be handled somehow and given a real encoding.
"ASCII-8BIT" feels confusing because there is no such thing as a "8th" bit of ASCII, without a more specific encoding which defines that.
So it really means unknown, and "ASCII-8BIT" seems far from "unknown encoding".

Also "ASCII-8BIT" sounds clearly wrong if it's actual binary data (which might not use any ASCII concept at all).
The behavior that this pseudo-encoding is ASCII compatible and e.g. shows byte 65 as A is fine, after all hexdump utilities typically do the same for bytes < 128 and it's helpful if there is ASCII text in the middle of binary data.

Anyway, Rails programmers don't usually need such understanding. If renaming helps people who only scratch the surface of this chaos, it might be worth considering, though changing encoding.name may hit the compatibility issue.

Not just Rails programmers, I think most Ruby programmers are confused when they see ASCII-8BIT, and not only the first time.
I believe renaming to BINARY would help them understand the meaning much better.

@tenderlovemaking (Aaron Patterson) One issue is that, for example, error messages in CRuby are encoded in the binary encoding (probably for the legacy reason of using rb_str_new()), so I think that would be a wide-reaching change with a high chance of causing real compatibility issues. It seems too incompatible to me.
As an example, the encoding negotiation rules (e.g. for concatenation) in Ruby are all based around whether one side is #ascii_only? and if yes then just use the other side's encoding. Preventing to e.g. concat with a ASCII-only binary string would break lots of programs.
Anyway, I think that's a separate issue indeed.
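The negotiation rule described here can be observed with Encoding.compatible?, which returns the encoding a combined string would get, or nil when the two operands cannot be combined. A short sketch with illustrative variable names:

```ruby
utf8         = "f\u00E9e"   # UTF-8 text with a non-ASCII character
ascii_binary = "abc".b      # binary string, ASCII-only
high_binary  = "\xFF".b     # binary string with a non-ASCII byte

# When one side is ASCII-only, the other side's encoding wins.
joined   = utf8 + ascii_binary
reversed = ascii_binary + utf8

# When neither side is ASCII-only, there is no compatible encoding.
verdict = Encoding.compatible?(utf8, high_binary)

puts joined.encoding
puts reversed.encoding
puts verdict.inspect
```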

Updated by jeremyevans0 (Jeremy Evans) 5 months ago

I'm also in favor of renaming ASCII-8BIT to BINARY, but I don't have strong feelings about it. I'm strongly against breaking String#encode for binary strings.

Updated by tenderlovemaking (Aaron Patterson) 5 months ago

jeremyevans0 (Jeremy Evans) wrote in #note-9:

I'm also in favor of renaming ASCII-8BIT to BINARY, but I don't have strong feelings about it. I'm strongly against breaking String#encode for binary strings.

Ya, sorry, I should be more clear. I think concatenation shouldn't try to guess at the encoding. If the user calls "encode" then it seems fine.

Eregon (Benoit Daloze) wrote in #note-8:

As an example, the encoding negotiation rules (e.g. for concatenation) in Ruby are all based around whether one side is #ascii_only? and if yes then just use the other side's encoding. Preventing to e.g. concat with a ASCII-only binary string would break lots of programs.
Anyway, I think that's a separate issue indeed.

Yes, this is the issue I have. IME the code is already broken, it just hasn't had the right input to break it yet (where would the binary string come from other than an external location?). Regardless, I made a ticket here: https://bugs.ruby-lang.org/issues/18579 😄

Updated by duerst (Martin Dürst) 5 months ago

Eregon (Benoit Daloze) wrote in #note-4:

The property that bytes < 128 are interpreted as US-ASCII is nothing special, every Encoding#ascii_compatible? behaves like that.
And almost all non-dummy Ruby encodings are #ascii_compatible?, the only two exceptions are UTF-16/32 (both LE/BE).

Two things particularly confusing about the name ASCII-8BIT:

  • It's completely unclear it might mean binary data or unknown encoding

Well, binary data can be character data with unknown encoding (or with encoding not yet set), or it can be truly binary data (e.g. as in a .jpg file or .zip file,...).

  • ISO-8859-* and many other encodings are 8-bit ascii-compatible encodings. Yet ASCII-8BIT which name seems to imply something close is nothing like that (the 8th bit is undefined, uninterpreted but valid).

ASCII-8BIT is an 8-bit ascii-compatible encoding, isn't it?

I think the idea of ASCII-8BIT goes back to the fact that in Ruby, many encodings can be used for source code, and as long as you only use ASCII in the code, it doesn't actually matter. That's to a large extent how Ruby 1.8 operated, and that was carried over into Ruby 1.9.

Now that the default source encoding is UTF-8, we have an encoding pragma for source files in other encodings, and so on, the importance of "something where we know ASCII is ASCII, but we are not sure about the upper half of the byte values" may be quite a bit less.

Updated by byroot (Jean Boussier) 5 months ago

though changing encoding.name may hit the compatibility issue.

I personally don't think it's much of a concern, but if it is, then a possible alternative would be to only change Encoding::ASCII_8BIT.inspect so that it shows up as BINARY in EncodingError and such, but that Encoding::ASCII_8BIT.name is unchanged.

Unless people think this would be even more confusing.

Updated by Eregon (Benoit Daloze) 5 months ago

byroot (Jean Boussier) wrote in #note-12:

though changing encoding.name may hit the compatibility issue.

I personally don't think it's much of a concern

I agree, this sounds very unlikely to cause compatibility issues, and if it does it would be extremely rare.
I believe the vast majority of programs simply don't rely on Encoding#name values.
(and of course Encoding.find(name) would still work for both "binary" & "ascii-8bit")
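That lookup behaviour can be checked directly. The sketch below relies only on Encoding.find being case-insensitive and on Encoding#names listing the registered aliases:

```ruby
by_new_name = Encoding.find("binary")
by_old_name = Encoding.find("ascii-8bit")

# Both lookups resolve to the same Encoding object.
puts by_new_name.equal?(by_old_name)

# Both spellings are registered names of that encoding.
puts Encoding::BINARY.names.include?("BINARY")
puts Encoding::BINARY.names.include?("ASCII-8BIT")
```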

Updated by matz (Yukihiro Matsumoto) 4 months ago

  • Status changed from Open to Rejected

I don't object to the proposal itself. But as @ko1 (Koichi Sasada) searched, there are so many gems that compare Encoding#name and ASCII-8BIT.
So I don't accept the proposal for the sake of compatibility.

Matz.

Updated by byroot (Jean Boussier) 4 months ago

Can I make a counter proposal?

We could keep Encoding#name as "ASCII-8BIT", but change Encoding#inspect and make sure EncodingError uses the BINARY name in its error messages.

What do you think?

Updated by matz (Yukihiro Matsumoto) 4 months ago

Does this counter-proposal solve the original problem?
It seems it introduces another inconsistency (and possible confusion).

Matz.

Updated by byroot (Jean Boussier) 4 months ago

Does this counter-proposal solve the original problem?

I believe so because the main way users are exposed to ASCII-8BIT is through EncodingError.

It seems it introduces another inconsistency (and possible confusion).

Indeed, my personal belief is that Encoding#name is both an advanced API and one that you don't really want to use. So I think the few users that would encounter this inconsistency would have the background to not be tricked by it.

But ultimately this is your call.

Updated by Eregon (Benoit Daloze) 4 months ago

Link to the gem-codesearch results from @ko1 (Koichi Sasada): https://hackmd.io/koJLPz4eRXKzaaDvVqji7w#Feature-18576-Rename-ASCII-8BIT-encoding-to-BINARY-byroot

This seems like very few usages and IMHO such gems should be fixed (if they are still used, probably not for most).
It's only 71 gems: https://gist.github.com/eregon/2b5de829d9aeb8b91b551fa05677b4db#file-gem-names

str.encoding.name == "ASCII-8BIT" is also needlessly slow and brittle.
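As a sketch of the difference, with hypothetical helper names chosen for illustration only: the first check depends on the exact name string, while the second compares Encoding objects and would keep working whatever the encoding ends up being called.

```ruby
# Hypothetical helper names, for illustration only.

# Brittle: allocates a String and breaks if the name ever changes.
def binary_by_name?(str)
  str.encoding.name == "ASCII-8BIT"
end

# Robust: compares the Encoding object itself.
def binary?(str)
  str.encoding == Encoding::BINARY
end

puts binary?("payload".b)   # true
puts binary?("payload")     # false for a UTF-8 literal
```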

It seems many matches are about old versions of rack/lint.rb and that's already fixed since https://github.com/rack/rack/pull/982.
nokogiri still uses it but that could be easily fixed: https://github.com/sparklemotion/nokogiri/blob/e324a91477fe3b199c95b52c3985647dd2aeb847/lib/nokogiri/html5/document.rb#L33

IMHO from a compatibility perspective it would be fair enough to change the Encoding#name too.
But I guess others will disagree, so I believe @byroot's proposal is still a big step forward (i.e. adding def Encoding::BINARY.name; 'ASCII-8BIT'; end or so for compatibility).

Updated by matz (Yukihiro Matsumoto) 4 months ago

  • Status changed from Rejected to Open

Making Encoding#name return a name different from the encoding name is unacceptable.
Besides that, in general, compatibility issue is hard to estimate beforehand, so we tend to be very conservative.
If you (or someone) estimate the compatibility issue is minimal, and want to experiment to see if it's true during pre-release, I'd say go.
Will you?

Matz.

Updated by byroot (Jean Boussier) 4 months ago

Will you?

I'd like to champion this. I already started opening pull requests on the affected gems.

Updated by byroot (Jean Boussier) 4 months ago

Ok, so I went over all 71 matches after filtering vendored code: https://gist.github.com/casperisfine/5a26c7b85f7d15c4acd63d62d67eafbb

I opened 31 pull requests, all of which were trivial changes from str.encoding.name == "ASCII-8BIT" to str.encoding == Encoding::BINARY, with the notable exception of vcr because it stores the encoding names in files.

The vast majority of the matches are abandoned gems with no update since 2013 or older (I still opened PRs when I could). Some are even just old versions of rack republished under another name.

The few high-profile gems impacted are:

  • Nokogiri: patch sent
  • VCR: patch sent
  • mongo: patch sent

That being said, it's impossible to measure how much proprietary code may use the same pattern.

Updated by byroot (Jean Boussier) 4 months ago

I prepared the patch for this: https://github.com/ruby/ruby/pull/5571

If there are no objections I'd like to merge it so it's part of the upcoming 3.2.0-preview1

Updated by byroot (Jean Boussier) 4 months ago

@matz (Yukihiro Matsumoto) could you confirm you are OK to merge the ASCII-8BIT -> BINARY rename for 3.2.0-preview1?

I think the earlier this happens the more likely it can go well. So far all the PRs I made in gems were received very positively.

Updated by matz (Yukihiro Matsumoto) 3 months ago

The risk of compatibility has been reduced thanks to @byroot's effort, but probably there still are many applications potentially affected by the change. Considering the benefit (of being slightly more descriptive) and risk (of incompatibility), I don't think it pays.

Matz.

Updated by Eregon (Benoit Daloze) 3 months ago

I think it's worth changing, the current name is confusing to most Ruby users, and there were only 71 gems out of 170000+ gems, and those gems were patched.
It seems equally unlikely that many applications would depend on enc.name == "ASCII-8BIT", and that those applications would update to latest Ruby.
If we don't change it now, we will probably never change it and stay forever with that confusing name, that seems really bad for future Ruby.

@matz (Yukihiro Matsumoto) How about we try it (as experimental or so) before the preview, and based on feedback keep it or revert it?
From your comment in #19 I thought that's what you offered.

Updated by larskanis (Lars Kanis) 3 months ago

Having solved a lot of encoding issues for co-workers, especially on Windows, I'm with @Eregon (Benoit Daloze). As the programmer's best friend, I think it's worth trying out this minor incompatibility. At least compared to something like the removal of rb_cData, which breaks lots of older gems just to clean up the C-API (after 2 years of deprecation warnings).
