Feature #18576: Rename `ASCII-8BIT` encoding to `BINARY` - Ruby - Ruby Issue Tracking System

Feature #18576

closed

Rename `ASCII-8BIT` encoding to `BINARY`

Added by byroot (Jean Boussier) over 3 years ago. Updated over 1 year ago.

Status:

Closed

Assignee:

Target version:

3.4

[ruby-core:107514]

Description

Context

I'm now used to it, but something that confused me for years was errors such as:

>> "fée" + "\xFF".b
(irb):3:in `+': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)

When you aren't that familiar with Ruby, it's really not evident that ASCII-8BIT basically means "no encoding" or "binary".

And even when you know it, if you don't read carefully it's very easily confused with US-ASCII.

The Encoding::BINARY alias is much more telling IMHO.

Proposal

Since Encoding::ASCII_8BIT has been aliased as Encoding::BINARY for years, I think renaming it to BINARY and then making ASCII_8BIT the alias would significantly improve usability without backward compatibility concerns.

The only concern I could see would be the consistency with a handful of C API functions:

rb_encoding *rb_ascii8bit_encoding(void)
int rb_ascii8bit_encindex(void)
VALUE rb_io_ascii8bit_binmode(VALUE io)

But that's for much more advanced users, so I don't think it's much of a concern.

#1 [ruby-core:107515]

Updated by duerst (Martin Dürst) over 3 years ago

Well, it's actually not just binary. Binary would mean that none of the bytes have any 'meaning' (i.e. characters) assigned to them. But ASCII-8BIT actually has character 'meaning' assigned to the ASCII range.
You can for example do the following:

u = (b = "abcde".force_encoding('ASCII-8BIT')).encode('UTF-8')

This gives you the string "abcde" with the encoding UTF-8. This shows that the lower 7 bits are interpreted the same as US-ASCII. The range with the 8th bit set, on the other hand, is just binary values, so

"\xCD".force_encoding('ASCII-8BIT').encode('UTF-8')

produces this error:

Encoding::UndefinedConversionError ("\xCD" from ASCII-8BIT to UTF-8)

I chose UTF-8 as the target encoding because it contains all of Unicode, so the error cannot be because the source character doesn't exist in the target encoding.

So there's indeed some complexity here, but it's not exactly what you think.
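
The two cases described above can be put side by side in a small sketch. It uses String#b instead of force_encoding so that no literal is mutated; the variable names are only illustrative.

```ruby
# ASCII-range bytes tagged as binary transcode to UTF-8 without complaint.
ascii_bytes = "abcde".b
as_utf8 = ascii_bytes.encode(Encoding::UTF_8)

# A byte with the high bit set has no character assigned, so transcoding fails.
high_byte = "\xCD".b
conversion_error =
  begin
    high_byte.encode(Encoding::UTF_8)
    nil
  rescue Encoding::UndefinedConversionError => e
    e
  end

puts as_utf8.encoding
puts conversion_error.class
```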

#2 [ruby-core:107516]

Updated by byroot (Jean Boussier) over 3 years ago

@duerst (Martin Dürst) I'm aware of this, but I don't quite see how it's a concern. It's a fairly subtle behavior, and I doubt the ASCII-8BIT name particularly reveals it.

Also nitpick, but a better example would be:

"\xC3\xA9".b.encode(Encoding::UTF_8) # => Encoding::UndefinedConversionError

Since those bytes are valid UTF-8.
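
To make the distinction explicit, here is a minimal sketch contrasting transcoding (String#encode) with relabeling (String#force_encoding) on the same two bytes. The variable names are illustrative only.

```ruby
# The two bytes C3 A9 are the UTF-8 encoding of U+00E9 (e-acute).
e_acute_bytes = "\xC3\xA9".b

# encode transcodes: it asks which characters the binary bytes stand for,
# and bytes >= 0x80 stand for nothing in the binary encoding.
transcode_error =
  begin
    e_acute_bytes.encode(Encoding::UTF_8)
    nil
  rescue Encoding::UndefinedConversionError => e
    e
  end

# force_encoding only relabels the same bytes, which here form valid UTF-8.
relabeled = e_acute_bytes.dup.force_encoding(Encoding::UTF_8)

puts transcode_error.class
puts relabeled.valid_encoding?
```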

#3 [ruby-core:107517]

Updated by naruse (Yui NARUSE) over 3 years ago

duerst (Martin Dürst) wrote in #note-1:

Well, it's actually not just binary. Binary would mean that none of the bytes have any 'meaning' (i.e. characters) assigned to them. But ASCII-8BIT actually has character 'meaning' assigned to the ASCII range.

I agree with the principle.
But we should read this proposal as asking: "The ASCII range of binary data in the real world is usually ASCII anyway, so why give it a complex name like ASCII-8BIT?"

I think the name of the encoding is a communication tool. We should compare pros and cons between ASCII-8BIT and BINARY.

#4 [ruby-core:107518]

Updated by Eregon (Benoit Daloze) over 3 years ago

+1000 for this, I think ASCII-8BIT is always extremely confusing, and BINARY is much more revealing (= we don't know what the actual encoding is, or it might be binary data and not text).
I've seen many Ruby users confused by this.
I'm not sure why I never thought to propose it here TBH.

I've literally never used the Encoding::ASCII_8BIT form in code (and rarely if ever seen it) but Encoding::BINARY many times.

The property that bytes < 128 are interpreted as US-ASCII is nothing special, every Encoding#ascii_compatible? behaves like that.
And almost all non-dummy Ruby encodings are #ascii_compatible?, the only two exceptions are UTF-16/32 (both LE/BE).

Two things particularly confusing about the name ASCII-8BIT:

It's completely unclear it might mean binary data or unknown encoding
ISO-8859-* and many other encodings are 8-bit ascii-compatible encodings. Yet ASCII-8BIT which name seems to imply something close is nothing like that (the 8th bit is undefined, uninterpreted but valid).
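
A rough sketch of both observations, using only core Encoding and String methods: binary is ASCII-compatible like most encodings, yet unlike ISO-8859-1 it assigns no character to a high byte. The variable names are illustrative.

```ruby
# Binary is ASCII-compatible, like most non-dummy encodings.
binary_compat = Encoding::BINARY.ascii_compatible?
latin1_compat = Encoding::ISO_8859_1.ascii_compatible?
utf16_compat  = Encoding::UTF_16LE.ascii_compatible?

# Non-dummy encodings that are not ASCII-compatible.
incompatible = Encoding.list.reject(&:dummy?).reject(&:ascii_compatible?)

# The same high byte has a defined character in ISO-8859-1 ...
latin1_e_acute = "\xE9".b.force_encoding(Encoding::ISO_8859_1)
from_latin1 = latin1_e_acute.encode(Encoding::UTF_8)

# ... but is valid yet uninterpreted in the binary encoding.
binary_byte = "\xE9".b
binary_valid = binary_byte.valid_encoding?
binary_convertible =
  begin
    binary_byte.encode(Encoding::UTF_8)
    true
  rescue Encoding::UndefinedConversionError
    false
  end

puts incompatible.map(&:name).inspect
puts from_latin1 == "\u00E9"
puts binary_valid
puts binary_convertible
```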

(FWIW JCodings, the Java library for Ruby encodings has ASCIIEncoding.INSTANCE for BINARY, that's even worse as it's even more confusing with US-ASCII, I've been thinking how to fix that in JCodings in a compatible way)

#5 [ruby-core:107519]

Updated by Eregon (Benoit Daloze) over 3 years ago

BTW Python has the "bytes" encoding and it behaves very similarly to Ruby's BINARY encoding (it's a different type in Python but that's a detail).
e.g.

>>> bytes("abcdé", 'utf-8')
b'abcd\xc3\xa9'

That's also a more telling name than ASCII-8BIT.
BINARY is better for Ruby because it's already an established name for it.

There is also already String#b for binary, it's not String#a or so.
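
For comparison with the Python snippet, a minimal Ruby sketch of String#b on the same text: the bytes are unchanged and only the encoding tag differs. The \u escape stands in for the e-acute so the literal is UTF-8 whatever the source encoding.

```ruby
text = "abcd\u00E9" # 5 characters, 6 bytes in UTF-8
raw = text.b        # copy with the binary encoding; text itself is untouched

puts raw.encoding == Encoding::BINARY
puts text.length    # counts characters
puts raw.length     # counts bytes
puts raw.bytes == text.bytes
```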

#6 [ruby-core:107527]

Updated by naruse (Yui NARUSE) over 3 years ago

The name ASCII-8BIT expresses how deeply we considered what "binary" is. Ruby 1.9's encoding system is a series of inventions; among the ideas Ruby invented are ASCII COMPATIBLE and ASCII-8BIT.

Two things particularly confusing about the name ASCII-8BIT:

It's completely unclear it might mean binary data or unknown encoding

ISO-8859-* and many other encodings are 8-bit ascii-compatible encodings. Yet ASCII-8BIT which name seems to imply something close is nothing like that (the 8th bit is undefined, uninterpreted but valid).

Your two questions raise very good points. The answer to them is tightly coupled with the name ASCII-8BIT.

It's completely unclear it might mean binary data or unknown encoding

I want to ask you that how often you can actually distinguish them. Ruby's assumption is that developers cannot distinguish them in normal use cases. If so, Ruby may not provide two objects. If Ruby provides only one object for them, developers don't need to clarify which one they mean.

ISO-8859-* and many other encodings are 8-bit ascii-compatible encodings. Yet ASCII-8BIT which name seems to imply something close is nothing like that (the 8th bit is undefined, uninterpreted but valid).

This is very good question. Ruby's answer is "yes, ASCII-8BIT is similar to ISO-8859-*". As you say, an ASCII-8BIT string's 8-bit range is undefined. But Ruby doesn't matter that. In the real world such phenomenon is sometimes discovered.

For example, the charset of HTTP headers is usually ISO-8859-1. Many languages have struggled with how to handle these octets. Python and .NET handle them as binary, which prevents leveraging powerful String methods on such data. Ruby handles them as ASCII-8BIT. Ruby's insight is that the binary data it handles usually consists of such octets, and the name ASCII-8BIT reflects that insight.
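
A small sketch of what this looks like in practice, assuming a made-up header line (the X-Author header and its value are hypothetical): ordinary String methods keep working on binary-tagged octets because the ASCII range keeps its meaning.

```ruby
# A header line as it might arrive off the wire; \xE9 is a raw high byte.
header_line = "X-Author: Andr\xE9\r\n".b

# Ordinary String methods work because the ASCII range keeps its meaning.
is_author = header_line.start_with?("X-Author:")
name, value = header_line.chomp.split(": ", 2)

puts is_author
puts name
puts value.bytesize
puts value.encoding == Encoding::BINARY
```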

Therefore the conclusion for your question is that they are just what the real world is. The name just reflects that.

Anyway Rails programmers don't need such understanding usually. If renaming cares people who just hit the surface of this chaos, it might be worth considered, though changing encoding.name may hit the compatibility issue.

#7 [ruby-core:107531]

Updated by tenderlovemaking (Aaron Patterson) over 3 years ago

First, I agree with this proposal. Second, I think this example should raise an exception:

u = (b = "abcde".force_encoding('ASCII-8BIT')).encode('UTF-8')

But I can open a different ticket for that. The point I actually want to make is that I've never seen this use case in the wild. 100% of the cases I've seen for force_encoding('ASCII-8BIT') are when the developer knows the string is binary (or unknown) data and they want to treat it as binary / unknown data not as "might be US-ASCII sometimes". The name "binary" would more accurately reflect real world usage IMO.

#8 [ruby-core:107532]

Updated by Eregon (Benoit Daloze) over 3 years ago

naruse (Yui NARUSE) wrote in #note-6:

I want to ask you that how often you can actually distinguish them.

I think in many cases it is possible to distinguish.
For instance, an HTTP header might initially be in the binary encoding and mean "unknown encoding" (can often find the real encoding through Content-Type's charset, but not always and could be invalid)
Another example is socket.read(N) which might be actual binary data (e.g. for a binary protocol), or text and the actual encoding depends then on what's communicated on that socket.
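
A sketch of the socket case, with StringIO standing in for a real socket (an assumption made only to keep the example self-contained): a fixed-length read yields a binary-tagged String, and the program relabels it once the actual encoding is known.

```ruby
require "stringio"

# StringIO stands in for a socket here (an assumption for the sketch).
io = StringIO.new("f\u00E9e".b)

# Reading a fixed number of bytes yields a binary-tagged String.
chunk = io.read(4)

# Once the protocol tells us the bytes are UTF-8 text, relabel a copy.
text = chunk.dup.force_encoding(Encoding::UTF_8)

puts chunk.encoding == Encoding::BINARY
puts text.valid_encoding?
```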

And I would think most Ruby programs need to handle the binary encoding somehow, and can only leave a String as binary if it's only bytes < 128, otherwise things break.

If so, Ruby may not provide two objects.

I don't think two different "binary" Encodings are useful, one seems enough in practice and can be used for both meanings, which are very close (as a binary byte array, or a marker for unknown encoding).

This is very good question. Ruby's answer is "yes, ASCII-8BIT is similar to ISO-8859-*". As you say, an ASCII-8BIT string's 8-bit range is undefined. But Ruby doesn't matter that. In the real world such phenomenon is sometimes discovered.

I think such situations need to be handled somehow and given a real encoding.
"ASCII-8BIT" feels confusing because there is no such thing as a "8th" bit of ASCII, without a more specific encoding which defines that.
So it really means unknown, and "ASCII-8BIT" seems far from "unknown encoding".

Also "ASCII-8BIT" sounds clearly wrong if it's actual binary data (which might not use any ASCII concept at all).
The behavior that this pseudo-encoding is ASCII compatible and e.g. shows byte 65 as A is fine, after all hexdump utilities typically do the same for bytes < 128 and it's helpful if there is ASCII text in the middle of binary data.

Anyway Rails programmers don't need such understanding usually. If renaming cares people who just hit the surface of this chaos, it might be worth considered, though changing encoding.name may hit the compatibility issue.

Not just Rails programmers, I think most Ruby programmers are confused when they see ASCII-8BIT, and not only the first time.
I believe renaming to BINARY would help them understand the meaning much better.

@tenderlovemaking (Aaron Patterson) One issue is e.g. error messages in CRuby are encoded in the binary encoding (probably for the legacy reason of using rb_str_new()), and so that would be I think a wide-reaching change with a high chance of causing real compatibility issues, it seems too incompatible to me.
As an example, the encoding negotiation rules (e.g. for concatenation) in Ruby are all based around whether one side is #ascii_only? and if yes then just use the other side's encoding. Preventing to e.g. concat with a ASCII-only binary string would break lots of programs.
Anyway, I think that's a separate issue indeed.
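
The negotiation rule described above can be observed directly. This is only a sketch of the current behavior with illustrative variable names; it uses Encoding.compatible? to ask which encoding a concatenation would produce.

```ruby
utf8_text    = "f\u00E9e" # UTF-8 with a non-ASCII character
ascii_binary = "abc".b     # binary, but only bytes < 128
high_binary  = "\xFF".b    # binary with a byte >= 128

# ASCII-only operand: the other side's encoding wins, in either order.
left  = utf8_text + ascii_binary
right = ascii_binary + utf8_text

# Neither side is ASCII-only: there is no common encoding.
common = Encoding.compatible?(utf8_text, high_binary)

puts left.encoding
puts right.encoding
puts common.inspect
```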

#9 [ruby-core:107533]

Updated by jeremyevans0 (Jeremy Evans) over 3 years ago

I'm also in favor of renaming ASCII-8BIT to BINARY, but I don't have strong feelings about it. I'm strongly against breaking String#encode for binary strings.

#10 [ruby-core:107537]

Updated by tenderlovemaking (Aaron Patterson) over 3 years ago

jeremyevans0 (Jeremy Evans) wrote in #note-9:

I'm also in favor of renaming ASCII-8BIT to BINARY, but I don't have strong feelings about it. I'm strongly against breaking String#encode for binary strings.

Ya, sorry, I should be more clear. I think concatenation shouldn't try to guess at the encoding. If the user calls "encode" then it seems fine.

Eregon (Benoit Daloze) wrote in #note-8:

As an example, the encoding negotiation rules (e.g. for concatenation) in Ruby are all based around whether one side is #ascii_only? and if yes then just use the other side's encoding. Preventing to e.g. concat with a ASCII-only binary string would break lots of programs.
Anyway, I think that's a separate issue indeed.

Yes, this is the issue I have. IME the code is already broken, it just hasn't had the right input to break it yet (where would the binary string come from other than an external location?). Regardless, I made a ticket here: https://bugs.ruby-lang.org/issues/18579 😄

#11 [ruby-core:107549]

Updated by duerst (Martin Dürst) over 3 years ago

Eregon (Benoit Daloze) wrote in #note-4:

The property that bytes < 128 are interpreted as US-ASCII is nothing special, every Encoding#ascii_compatible? behaves like that.
And almost all non-dummy Ruby encodings are #ascii_compatible?, the only two exceptions are UTF-16/32 (both LE/BE).

Two things particularly confusing about the name ASCII-8BIT:

It's completely unclear it might mean binary data or unknown encoding

Well, binary data can be character data with unknown encoding (or with encoding not yet set), or it can be truly binary data (e.g. as in a .jpg file or .zip file,...).

ISO-8859-* and many other encodings are 8-bit ascii-compatible encodings. Yet ASCII-8BIT which name seems to imply something close is nothing like that (the 8th bit is undefined, uninterpreted but valid).

ASCII-8BIT is an 8-bit ascii-compatible encoding, isn't it?

I think the idea of ASCII-8BIT goes back to the fact that in Ruby, many encodings can be used for source code, and as long as you only use ASCII in the code, it doesn't actually matter. That's to a large extent how Ruby 1.8 operated, and that was carried over into Ruby 1.9.

Now that the default source encoding is UTF-8, we have an encoding pragma for source files in other encodings, and so on, the notion of "something where we know ASCII is ASCII, but we are not sure about the upper half of the byte values" may be quite a bit less important.

#12 [ruby-core:107550]

Updated by byroot (Jean Boussier) over 3 years ago

though changing encoding.name may hit the compatibility issue.

I personally don't think it's much of a concern, but if it is, then a possible alternative would be to only change Encoding::ASCII_8BIT.inspect so that it shows up as BINARY in EncodingError and such, while Encoding::ASCII_8BIT.name stays unchanged.

Unless people think this would be even more confusing.

#13 [ruby-core:107553]

Updated by Eregon (Benoit Daloze) over 3 years ago

byroot (Jean Boussier) wrote in #note-12:

though changing encoding.name may hit the compatibility issue.

I personally don't think it's much of a concern

I agree, this sounds very unlikely to cause compatibility issues, and if it does it would be extremely rare.
I believe the vast majority of programs simply don't rely on Encoding#name values.
(and of course Encoding.find(name) would still work for both "binary" & "ascii-8bit")
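
The lookup behavior mentioned above is easy to check; a minimal sketch with illustrative variable names:

```ruby
by_new_name = Encoding.find("binary")
by_old_name = Encoding.find("ascii-8bit")

# Both lookups return the very same Encoding object.
puts by_new_name.equal?(by_old_name)

# Encoding#names lists every name the encoding answers to.
puts Encoding::BINARY.names.inspect
```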

#14 [ruby-core:107619]

Updated by matz (Yukihiro Matsumoto) over 3 years ago

Status changed from Open to Rejected

I don't object to the proposal itself. But as @ko1 (Koichi Sasada) searched, there are so many gems that compare Encoding#name and ASCII-8BIT.
So I don't accept the proposal for the sake of compatibility.

Matz.

#15 [ruby-core:107620]

Updated by byroot (Jean Boussier) over 3 years ago

Can I make a counter proposal?

We could keep Encoding#name as "ASCII-8BIT", but change Encoding#inspect and make sure EncodingError use the BINARY name in its error messages.

What do you think?

#16 [ruby-core:107621]

Updated by matz (Yukihiro Matsumoto) over 3 years ago

Does this counter-proposal solve the original problem?
It seems it introduces another inconsistency (and possible confusion).

Matz.

#17 [ruby-core:107622]

Updated by byroot (Jean Boussier) over 3 years ago

Does this counter-proposal solve the original problem?

I believe so because the main way users are exposed to ASCII-8BIT is through EncodingError.

It seems it introduces another inconsistency (and possible confusion).

Indeed, my personal belief is that Encoding#name is both an advanced API and one that you don't really want to use. So I think the few users that would encounter this inconsistency would have the background to not be tricked by it.

But ultimately this is your call.

#18 [ruby-core:107634]

Updated by Eregon (Benoit Daloze) over 3 years ago

Link to the gem-codesearch results from @ko1 (Koichi Sasada): https://hackmd.io/koJLPz4eRXKzaaDvVqji7w#Feature-18576-Rename-ASCII-8BIT-encoding-to-BINARY-byroot

This seems like very few usages and IMHO such gems should be fixed (if they are still used, probably not for most).
It's only 71 gems: https://gist.github.com/eregon/2b5de829d9aeb8b91b551fa05677b4db#file-gem-names

str.encoding.name == "ASCII-8BIT" is also needlessly slow and brittle.
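
A sketch of the alternative: a hypothetical helper (binary_string? is not part of Ruby or of any gem mentioned here) that compares Encoding objects rather than name strings, so it does not depend on what Encoding#name returns.

```ruby
# Hypothetical helper: compares Encoding objects, never their names.
def binary_string?(str)
  str.encoding == Encoding::BINARY
end

puts binary_string?("\xFF".b)
puts binary_string?("plain text".encode(Encoding::UTF_8))
```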

It seems many matches are about old versions of rack/lint.rb and that's already fixed since https://github.com/rack/rack/pull/982.
nokogiri still uses it but that could be easily fixed: https://github.com/sparklemotion/nokogiri/blob/e324a91477fe3b199c95b52c3985647dd2aeb847/lib/nokogiri/html5/document.rb#L33

IMHO from a compatibility perspective it would be fair enough to change the Encoding#name too.
But I guess others will disagree, so I believe @byroot's proposal is still a big step forward (i.e. adding def Encoding::BINARY.name; 'ASCII-8BIT'; end or so for compatibility).

#19 [ruby-core:107636]

Updated by matz (Yukihiro Matsumoto) over 3 years ago

Status changed from Rejected to Open

Making Encoding#name return a name different from the encoding name is unacceptable.
Besides that, in general, compatibility issues are hard to estimate beforehand, so we tend to be very conservative.
If you (or someone) estimate the compatibility issue is minimal, and want to experiment to see if it's true during pre-release, I'd say go.
Will you?

Matz.

#20 [ruby-core:107637]

Updated by byroot (Jean Boussier) over 3 years ago

Will you?

I'd like to champion this. I already started opening pull requests on the affected gems.

#21 [ruby-core:107640]

Updated by byroot (Jean Boussier) over 3 years ago

Ok, so I went over all 71 matches after filtering vendored code: https://gist.github.com/casperisfine/5a26c7b85f7d15c4acd63d62d67eafbb

I opened 31 pull requests, all of which were trivial changes (str.encoding.name == "" -> str.encoding == Encoding::BINARY), with the notable exception of vcr, because it stores the encoding names in files.
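
To illustrate why stored encoding names are a harder case, here is a hypothetical persistence sketch. It is not VCR's actual code or file format; the record layout is invented for the example. Restoring through Encoding.find accepts either name of the encoding.

```ruby
# Hypothetical persistence format: body bytes plus the encoding's name.
body = "\xFF\x00".b
record = { "encoding" => body.encoding.name, "bytes" => body.bytes }

# Restoring through Encoding.find works with whichever name was stored.
restored = record["bytes"].pack("C*").force_encoding(Encoding.find(record["encoding"]))

# A file written with either name resolves to the same Encoding object.
legacy  = [0xFF].pack("C*").force_encoding(Encoding.find("ASCII-8BIT"))
renamed = [0xFF].pack("C*").force_encoding(Encoding.find("BINARY"))

puts restored == body
puts legacy.encoding.equal?(renamed.encoding)
```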

The vast majority of the matches are abandoned gems with no update since 2013 or older (I still opened PRs when I could). Some are even just old versions of rack republished under another name.

The few high-profile gems impacted are:

Nokogiri: patch sent
VCR: patch sent
mongo: patch sent

That being said, it's impossible to measure how much proprietary code may use the same pattern.

#22 [ruby-core:107666]

Updated by byroot (Jean Boussier) over 3 years ago

I prepared the patch for this: https://github.com/ruby/ruby/pull/5571

If there are no objections I'd like to merge it so it's part of the upcoming 3.2.0-preview1

#23 [ruby-core:107680]

Updated by byroot (Jean Boussier) over 3 years ago

@matz (Yukihiro Matsumoto) could you confirm you are OK to merge the ASCII-8BIT -> BINARY rename for 3.2.0-preview1?

I think the earlier this happens the more likely it can go well. So far all the PRs I made in gems were received very positively.

#24 [ruby-core:107943]

Updated by matz (Yukihiro Matsumoto) over 3 years ago

The risk of compatibility has been reduced thanks to @byroot's effort, but probably there still are many applications potentially affected by the change. Considering the benefit (of being slightly more descriptive) and risk (of incompatibility), I don't think it pays.

Matz.

#25 [ruby-core:107944]

Updated by Eregon (Benoit Daloze) over 3 years ago

I think it's worth changing, the current name is confusing to most Ruby users, and there were only 71 gems out of 170000+ gems, and those gems were patched.
It seems equally unlikely that many applications would depend on enc.name == "ASCII-8BIT", and that those applications would update to latest Ruby.
If we don't change it now, we will probably never change it and stay forever with that confusing name, that seems really bad for future Ruby.

@matz (Yukihiro Matsumoto) How about we try it (as experimental or so) before the preview, and based on feedback keep it or revert it?
From your comment in #19 I thought that's what you offered.

#26 [ruby-core:107956]

Updated by larskanis (Lars Kanis) over 3 years ago

Having solved a lot of encoding issues for co-workers, especially on Windows, I'm with @Eregon (Benoit Daloze). With Ruby as the programmer's best friend, I think it's worth trying out this minor incompatibility. At least compared to something like the removal of rb_cData, which breaks lots of older gems just to clean up the C API (after 2 years of deprecation warnings).

#27 [ruby-core:115604]

Updated by Eregon (Benoit Daloze) over 1 year ago

Target version set to 3.4

@matz (Yukihiro Matsumoto) Could we try this again for 3.4, soon after the 3.3 release?

Then there is plenty of time to discover any issue related to it (probably very few as gems have been patched, and applications using encoding names instead of encoding constants are likely very old and unlikely to use a recent Ruby).

#28 [ruby-core:115813]

Updated by naruse (Yui NARUSE) over 1 year ago

I strongly object to changing the Encoding#name of the ASCII-8BIT encoding to "BINARY", for compatibility reasons.
I don't want people to have to fix code that is running correctly now.

However, supporting people who are writing new code is reasonable.
I agree with adding more information to Encoding#inspect and to error messages.

#29 [ruby-core:116170]

Updated by Eregon (Benoit Daloze) over 1 year ago

@naruse (Yui NARUSE) Do you have evidence of (latest release and not ancient) gems or applications comparing encoding.name to 'ASCII-8BIT'?
I think it's so obviously a bad idea to compare the encoding name as a String, AFAIK there was never a valid reason to use it (over enc == Encoding::BINARY, which works since Ruby 1.9) and it's inefficient, brittle and unnecessary.

FWIW https://github.com/search?q=%22name+%3D%3D+%27ASCII-8BIT%27%22&type=code&p=1 shows very few matches and it's mostly copies of old VCR code.
The chance of that code running on Ruby 3.4+ seems almost nonexistent, there would likely be many more serious compatibility issues with such old code (e.g. kwargs changes).
And fixing it is really easy.

@matz (Yukihiro Matsumoto) Can we experiment for 3.4?
If we have pushback based on actual code then let's go more conservative, but otherwise I think we should do the clean fix here.

#30 [ruby-core:116172]

Updated by Eregon (Benoit Daloze) over 1 year ago

Also given the efforts of @byroot (Jean Boussier) in https://bugs.ruby-lang.org/issues/18576#note-21 and the offer from @matz (Yukihiro Matsumoto) in https://bugs.ruby-lang.org/issues/18576#note-19, I'd like to do exactly what matz said:

If you (or someone) estimate the compatibility issue is minimal, and want to experiment to see if it's true during pre-release, I'd say go.

I estimate it to be minimal.
We can know from the experiment if it's true or not, there are more than 11 months before 3.4, so plenty of time to discover any potential issue with it.

#31 [ruby-core:116173]

Updated by byroot (Jean Boussier) over 1 year ago

I would also like to try this again for 3.4. If we do it early, any remaining issues will have a chance to be noticed with the first preview release.

#32 [ruby-core:116266]

Updated by naruse (Yui NARUSE) over 1 year ago

Even if you "fix" gems, the number of affected gems suggests there are at least as many affected private Rails applications.
Such incompatibility is not acceptable.

#33 [ruby-core:116268]

Updated by byroot (Jean Boussier) over 1 year ago

@naruse (Yui NARUSE) no one is denying that there is private code out there that will be broken by such a change. The question is how much, how hard it would be to detect and fix, and how much the change improves Ruby for its users.

We regularly make changes with much more breaking potential. So that alone isn't a reason to refuse the change in my opinion.

But if there is consensus that the cost/benefit isn't positive, then I'd like to propose again:

We could keep Encoding#name as "ASCII-8BIT", but change Encoding#inspect and make sure EncodingError use the BINARY name in its error messages.

But slightly modified:

I'd like to change Encoding::BINARY.inspect from "#<Encoding:ASCII-8BIT>" to "#<Encoding:ASCII-8BIT (BINARY)>".

Would that be acceptable?

#34 [ruby-core:116269]

Updated by zverok (Victor Shepelev) over 1 year ago

Such incompatibility is not acceptable.

In all honesty, a selective application of this dogma doesn’t always look justified.
For better or worse, we break compatibility constantly.

One of the recent telling examples was the removal of File.exists? (an alias of .exist?), which, while "deprecated a long time ago," actually

broke a lot of gems/other software (because even with the "typically we have bare words as predicates" rule, it was more natural for people to write exists?, so while it was available, a lot of code was using it);
improved absolutely nothing in Ruby’s friendliness and learnability save for "removed a reason to ask for String#starts_with? and similar methods" (while, say, Rails continues to prefer third-person verbs in its core extensions, like String#starts_with? or Range#overlaps?)

OTOH, renaming the unfortunately named encoding:

makes Ruby friendlier (as a mentor, I saw a lot of people confused with ASCII-8BIT),
breaks not a lot of code: while fixing gems wouldn't fix all of its usages, the (minuscule) amount of gems to fix gives a good estimation of how frequently this might be a problem,
breaks code that is mostly written in the "unexpected" way, so rethinking it might be a good idea anyway.

#35 [ruby-core:116280]

Updated by Dan0042 (Daniel DeLorme) over 1 year ago

tenderlovemaking (Aaron Patterson) wrote in #note-7:

I think this example should raise an exception:
u = (b = "abcde".force_encoding('ASCII-8BIT')).encode('UTF-8')

I'm worried about the above misconception. No, this example shouldn't raise an exception, because being ascii-compatible is the entire reason there's "ASCII" in "ASCII-8BIT". If even @tenderlovemaking (Aaron Patterson) can have this misconception, I would wager it's a fairly common one. And if the encoding was renamed to "BINARY" it would further encourage the misconception. We'd wind up with a kind of Frankenstein encoding that pretends to be true binary by its name, but having the behavior of ascii-compatible encodings. This thread has several people currently agreeing that the ascii-compatible behavior should not change, but if the name was changed I can easily predict some people will call for a change in behavior because the name "binary" has that overtone.

zverok (Victor Shepelev) wrote in #note-34:

For better or worse, we break compatibility constantly.
One of the recent telling examples was the removal of File.exists?

I won't say we can never break compatibility, but there's a very big qualitative difference here. If you run into File.exists?, the program simply crashes with NoMethodError. If you run into enc.name == "ASCII-8BIT" the return value changes from true to false; the program may crash later or not, the bug can remain undetected for a long time, there's a potential for corrupted data. This is 2-3 orders of magnitude harder to debug than NoMethodError. Even if not many people are affected by this, it's a very nasty kind of incompatibility.

byroot (Jean Boussier) wrote in #note-15:

We could keep Encoding#name as "ASCII-8BIT", but change Encoding#inspect and make sure EncodingError use the BINARY name in its error messages.

I would really like that.

#36 [ruby-core:116298]

Updated by Eregon (Benoit Daloze) over 1 year ago

I think everyone's opinion on the thread is pretty clear and not everyone agrees, that's fine.

@matz (Yukihiro Matsumoto) Could you decide whether it's OK to experiment with the Encoding#name changing to "BINARY" or not?
If not, is @byroot's proposal in https://bugs.ruby-lang.org/issues/18576#note-33 accepted?

#37 [ruby-core:116355]

Updated by byroot (Jean Boussier) over 1 year ago

@byroot's proposal

To clarify what I'm proposing if the rename is not acceptable is:

>> Encoding::BINARY
=> #<Encoding:ASCII-8BIT>

becomes:

>> Encoding::BINARY
=> #<Encoding:ASCII-8BIT (BINARY)>

And:

>> "fée" + "fée".b
(irb):8:in `+': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)

Becomes:

>> "fée" + "fée".b
(irb):8:in `+': incompatible character encodings: UTF-8 and ASCII-8BIT (BINARY) (Encoding::CompatibilityError)
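
A small sketch for inspecting the three places the name surfaces: Encoding#name, Encoding#inspect, and the error message. The exact wording printed for the last two depends on the Ruby version in use, so the sketch only prints them rather than assuming a particular string.

```ruby
binary = Encoding::BINARY

message =
  begin
    "f\u00E9e" + "f\u00E9e".b
    nil
  rescue Encoding::CompatibilityError => e
    e.message
  end

puts binary.name    # the string that gems compare against
puts binary.inspect # what shows up in irb
puts message        # what shows up in error output
```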

#38 [ruby-core:116363]

Updated by Eregon (Benoit Daloze) over 1 year ago

I think for that last example, omitting ASCII-8BIT would be much clearer, also two sets of parens seems too much.
So:

(irb):8:in `+': incompatible character encodings: UTF-8 and BINARY (Encoding::CompatibilityError)

Otherwise we would likely still have the confusion that "ASCII" is not compatible with UTF-8 (which is untrue of course).

#39 [ruby-core:116393]

Updated by shyouhei (Shyouhei Urabe) over 1 year ago

@naruse (Yui NARUSE) is actually positive about changing error messages (see #note-28). I guess everybody here agrees with @byroot's list of proposed changes in #note-37 (except the wording)?

#40 [ruby-core:116738]

Updated by naruse (Yui NARUSE) over 1 year ago

byroot (Jean Boussier) wrote in #note-33:

I'd like to change Encoding::BINARY.inspect from "#<Encoding:ASCII-8BIT>" to "#<Encoding:ASCII-8BIT (BINARY)>".

Would that be acceptable?

I agree with the idea.

#41 [ruby-core:116845]

Updated by byroot (Jean Boussier) over 1 year ago

Proposed patch: https://github.com/ruby/ruby/pull/10018

I used my initial suggestion: ASCII-8BIT (BINARY), but if the parentheses are deemed too much, I'm happy to adjust.

#42 [ruby-core:116855]

Updated by Dan0042 (Daniel DeLorme) over 1 year ago

I've come to realize something; when an ASCII-8BIT string contains only ascii characters, it behaves exactly like a US-ASCII string and in such a case it feels unnatural to call it "binary" (at least for me). But as soon as there is a non-ascii byte, it becomes incompatible with every other encoding and then truly deserves to be called BINARY. And that's when encoding errors occur. So in error messages, "BINARY" makes perfect sense to me since the error occurs due to the string being in "binary" state rather than "ascii-only" state. The distinction may be irrelevant to others but at least it has helped me put into words and understand why it felt so uncomfortable to change the name to "BINARY". Just my 2¢
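
The two "states" described above can be observed with String#ascii_only? and String#==. This is a minimal sketch; the variable names are illustrative.

```ruby
ascii_state  = "abc".b      # binary-tagged, only bytes < 128
binary_state = "\xC3\xA9".b # binary-tagged, bytes >= 128

# In the "ascii-only" state the string compares equal across encodings.
puts ascii_state.ascii_only?
puts ascii_state == "abc".encode(Encoding::US_ASCII)

# In the "binary" state, identical bytes no longer compare equal to UTF-8.
utf8_e_acute = "\u00E9"
puts binary_state.ascii_only?
puts binary_state.bytes == utf8_e_acute.bytes
puts binary_state == utf8_e_acute
```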

#43 [ruby-core:116857]

Updated by duerst (Martin Dürst) over 1 year ago

What about

>> "fée" + "fée".b
(irb):8:in `+': incompatible character encodings: UTF-8 and BINARY (ASCII-8BIT) (Encoding::CompatibilityError)

This still leaves "ASCII-8BIT" in (because I think it's important to help people understand that BINARY and ASCII-8BIT are the same).

[It also keeps the wart of consecutive parentheticals, but that can be dealt with separately.]

#44 [ruby-core:116868]

Updated by byroot (Jean Boussier) over 1 year ago

>> "fée" + "fée".b
(irb):8:in `+': incompatible character encodings: UTF-8 and BINARY (ASCII-8BIT) (Encoding::CompatibilityError)

I don't mind BINARY being first or last. I'll adjust my PR.

As for the consecutive parentheses, what about:

>> "fée" + "fée".b
(irb):8:in `+': incompatible character encodings: UTF-8 and BINARY / ASCII-8BIT (Encoding::CompatibilityError)

#45 [ruby-core:116875]

Updated by Eregon (Benoit Daloze) over 1 year ago

BINARY (ASCII-8BIT) seems a good compromise.

The / seems potentially confusing for:
incompatible character encodings: BINARY / ASCII-8BIT and EUC-JP (Encoding::CompatibilityError).
So I think using parentheses is OK and clearer than /.

#46 [ruby-core:117508]

Updated by alexander-s (Alexander S) over 1 year ago

matz (Yukihiro Matsumoto) wrote in #note-14:

I don't object to the proposal itself. But as @ko1 (Koichi Sasada) searched, there are so many gems that compare Encoding#name and ASCII-8BIT.
So I don't accept the proposal for the sake of compatibility.

Matz.

I've been developing with Ruby for some 10+ years now, and overall I really like the language.

However, I also feel that Ruby sometimes seems too focused on being backwards compatible, to a point where it risks hurting the ecosystem. I think this thread is a good example, because it seems like such a small and benign change, yet it's taken several years and lots of back and forth, and in the end the proposed change wasn't even applied(!?).

At the same time, several parts of the standard library feel outdated (Net::HTTP for example), and others misplaced (OLE-automation anyone?). On the other hand, new "cool features" are sometimes introduced that I don't really see any value in. For example 'endless ranges' and 'single line end-less method definition'. In short, I share much of Bbatsov's (RuboCop author) sentiment from this article (https://metaredux.com/posts/2019/04/02/ruby-s-creed.html).

There is good progress too, I'll happily admit. A few examples that come to mind are 'keyword params', 'unifying Integer/Fixnum', 'UTF-8 encoding by default', the Prism parser and the focus on performance. All these seemed like sensible improvements, in alignment with development in other modern languages.

Others probably have much better ideas about what old stuff could be improved, but it could be for example:

Remove or deprecate globals
Update the Rubydoc system (many other languages have better documentation systems)
Continue cleaning up the stdlib (some progress has been made in recent Ruby releases, which is good)
Look at popular rules in RuboCop etc, for stuff that people are frequently disabling with linting, and consider deprecating them.
Take it easy with new syntax, Ruby already has 'many ways to solve the same problem'. Something like end-less method definition seems like a pointless addition. On our team, we just disabled it with linting on day one.

To summarize, obviously backwards compatibility is important. But progress is inevitable and a language that doesn't develop at a reasonable pace will eventually stagnate and die. I don't think Ruby is there yet, but I'd hate to see it go down that path. I also think much of this can be managed with deprecation messages and the like.

#47

Updated by byroot (Jean Boussier) over 1 year ago

Status changed from Open to Closed

Applied in changeset git|3a7846b1aa4c10d86dc5a91c6df94f89d60bb0c3.

Add a hint of ASCII-8BIT being BINARY

[Feature #18576]

Since outright renaming ASCII-8BIT is deemed too backward incompatible,
the next best thing would be to only change its #inspect, particularly
in exception messages.
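
For code that compares `Encoding#name` with "ASCII-8BIT", as discussed earlier in the thread, the effect of this change can be sketched as follows. This is a sketch that takes the commit message at its word that only `#inspect` changes; the exact `#inspect` text depends on the Ruby version, so it is printed here rather than assumed. `binary_encoding?` is a hypothetical helper, not part of the patch.

```ruby
# Hypothetical helper: compares against the constant rather than a name string,
# so it does not depend on how the encoding is displayed.
def binary_encoding?(string)
  string.encoding == Encoding::BINARY
end

enc = "\xFF".b.encoding

puts enc.equal?(Encoding::ASCII_8BIT)   # true: BINARY and ASCII_8BIT are the same object
puts enc.name                           # the name that existing gems compare against
puts enc.inspect                        # display form; varies by Ruby version
puts binary_encoding?("abc".b)          # true
puts binary_encoding?("abc")            # false
```

Comparing against the `Encoding::BINARY` constant sidesteps the question of which spelling appears in the display string.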
