Feature #18949: Deprecate and remove replicate and dummy encodings - Ruby - Ruby Issue Tracking System

Updated by byroot (Jean Boussier) over 3 years ago Actions
Copy link
#1 [ruby-core:109372]

Based on my recent work on String#<< performance, I concur that encoding management often takes longer than the actual string operation, and that get_actual_encoding() is a very big offender.

I'm also curious what was the intended use for dummy encodings and whether it's actually in use somewhere.

So I'm in total support of this proposal.

Updated by noahgibbs (Noah Gibbs) over 3 years ago Actions
Copy link
#2 [ruby-core:109377]

I've been working on YJIT string handling. At this point our string handling is so simple that I don't think this would affect us either way. For encodings, we mostly call CRuby functions that handle them. So short-term, I don't think this would cause us any implementation difficulties.

Updated by Eregon (Benoit Daloze) over 3 years ago Actions
Copy link
#3 [ruby-core:109378]

noahgibbs (Noah Gibbs) wrote in #note-2:

I've been working on YJIT string handling. At this point our string handling is so simple that I don't think this would affect us either way. For encodings, we mostly call CRuby functions that handle them. So short-term, I don't think this would cause us any implementation difficulties.

Right, so this should also benefit YJIT as much as it benefits the CRuby interpreter performance-wise since it will make string/encoding-related functions faster.
I assume "would affect us either way/would cause us any implementation difficulties" means this change would be easy to adopt for YJIT, i.e., no or low effort but yet the performance gains would be valuable.
Since it simplify String encodings it benefits all Ruby implementations and all Ruby JITs.

Updated by Eregon (Benoit Daloze) over 3 years ago Actions
Copy link
#4 [ruby-core:109384]

I have looked at usage of Encoding#replicate and rb_define_dummy_encoding().

Encoding#replicate seems only used in 1 CRuby test, 1 ruby/spec file and exactly 1 gem (!), using the command below:

$ gem-codesearch '\.replicate\b'
# I removed irrelevant matches
/srv/gems/ruby_encoding_iroha-0.1.0/lib/ruby_encoding_iroha.rb:Encoding::ASCII_8BIT.replicate("IROHA")

https://github.com/ima1zumi/ruby_encoding_iroha seems a demo gem (it's a self-made encoding, not a real one), from what I understand not something serious.
There was a talk related to it at RubyKaigi 2021: https://speakerdeck.com/ima1zumi/dive-into-encoding
That related encoding_iroha gem uses rb_encdb_replicate(), rb_declare_transcoder() and rb_register_transcoder() which are private APIs.
rb_encdb_replicate() is only used by this gem.
To support new encodings properly, they should be added to Ruby, example. This has been done multiple times already (CESU-8, IBM720).

rb_define_dummy_encoding() is used in just a few gems:

$ gem-codesearch rb_define_dummy_encoding
# I removed redundant matches and matches from CRuby /*.c and header files
/srv/gems/jberkel-mysql-ruby-2.8.1/ext/mysql.c:	enc_index = rb_define_dummy_encoding(aliases[0]);
/srv/gems/nkf-0.1.1/ext/nkf/nkf.c:	    idx = rb_define_dummy_encoding(name);
/srv/gems/pg-1.4.2/ext/pg.c:	enc_index = rb_define_dummy_encoding(aliases[0]);
/srv/gems/tk-0.4.0/ext/tk/tcltklib.c:  if (RTEST(rb_define_dummy_encoding(RSTRING_PTR(name)))) {
/srv/gems/win32ole-1.8.8/ext/win32ole/win32ole.c:	idx = rb_define_dummy_encoding(enc_cstr);

So that is just 5 unique usages of rb_define_dummy_encoding() among all public gems.

jberkel-mysql-ruby seems to be an old version of ruby-mysql, which no longer uses rb_define_dummy_encoding().
nkf calls it in rb_nkf_enc_get(). rb_nkf_enc_get() seems only called with encodings that Ruby knows, so the rb_define_dummy_encoding() call seems dead/unreachable code.
I note it does refer to the existing dummy CP50220/CP50221 encodings: code
pg uses it to create the JOHAB encoding. I will take a look at how to replace that usage: https://github.com/ged/ruby-pg/pull/472.
The tk gem seems to already have a fallback for when rb_define_dummy_encoding() is not available. It might also be never reached but it is hard to tell from the code. Otherwise it could likely be replaced by the BINARY encoding.
win32ole uses it here. It could be dead code given this path is only taken for CP* encodings Ruby does not already know. Otherwise it could likely be replaced by the BINARY encoding.

Updated by Eregon (Benoit Daloze) over 3 years ago Actions
Copy link
#5

Description updated (diff)

Updated by Eregon (Benoit Daloze) over 3 years ago Actions
Copy link
#6 [ruby-core:109393]

I have updated the issue description with a To Decide section to summarize for the dev meeting: https://bugs.ruby-lang.org/issues/18949#To-Decide

Updated by Eregon (Benoit Daloze) over 3 years ago Actions
Copy link
#7

Description updated (diff)

Updated by duerst (Martin Dürst) over 3 years ago Actions
Copy link
#8 [ruby-core:109400]

For Encoding.list.filter { |e| e.dummy? }, I get the following list:

#<Encoding:UTF-16 (dummy)>,
#<Encoding:UTF-32 (dummy)>,
#<Encoding:IBM037 (dummy)>,
#<Encoding:ISO-2022-JP (dummy)>,
#<Encoding:ISO-2022-JP-2 (dummy)>,
#<Encoding:CP50220 (dummy)>,
#<Encoding:CP50221 (dummy)>,
#<Encoding:UTF-7 (dummy)>,
#<Encoding:ISO-2022-JP-KDDI (dummy)>

This is quite a diverse collection. I have no clue for example whether ISO-2022-JP-2 is, or ever was, in use. But on the other hand, ISO-2022-JP, UTF-16, and UTF-32 definitely have their uses. They are not so much used directly when processing strings (because indeed for these encodings, most string operations don't work, or don't work correctly), but they are used when transcoding to another encoding.

Updated by nirvdrum (Kevin Menard) over 3 years ago Actions
Copy link
#9 [ruby-core:109402]

But on the other hand, ISO-2022-JP, UTF-16, and UTF-32 definitely have their uses. They are not so much used directly when processing strings (because indeed for these encodings, most string operations don't work, or don't work correctly), but they are used when transcoding to another encoding.

I think UTF-16 and UTF-32 can be handled without general support for dummy encodings. They already receive special consideration in get_actual_encoding. What Benoit proposed was to allow those encodings, but resolve them to the correct endianness at string creation time. That way you don't have any dummy encodings floating through the runtime.

Can you please provide additional context on when ISO-2022-JP is used? As far as I can tell, you can't perform most string operations on a string with that encoding. I'm curious what the use cases are.

Updated by znz (Kazuhiro NISHIYAMA) over 3 years ago Actions
Copy link
#10 [ruby-core:109405]

I think ISO-2022-JP is used for mail archives, and IRCnet even now.

Updated by naruse (Yui NARUSE) over 3 years ago Actions
Copy link
#11 [ruby-core:109406]

Status changed from Open to Rejected

String is a container and an encoding is a label of it. While data whose encoding is an encoding categorized in dummy encodings in Ruby, we cannot avoid such encodings.

Updated by Eregon (Benoit Daloze) over 3 years ago Actions
Copy link
#12 [ruby-core:109433]

Status changed from Rejected to Open

This is a disappointing answer, it sounds like there was not enough time at the dev meeting to consider this issue properly maybe? (that's fine, but then don't reject please)
I reopen because this issue is still vastly unanswered.
Also it feels not so respectful to me to answer with a single short unclear sentence to a carefully-crafted issue.
I estimate I have a lot of experience in this area since I reimplemented dummy/replicate encodings in TruffleRuby recently.

The dev meeting log has no notes about a discussion: https://github.com/ruby/dev-meeting-log/blob/master/DevMeeting-2022-07-21.md#feature-18949-deprecate-and-remove-replicate-and-dummy-encodings-eregon

naruse (Yui NARUSE) wrote in #note-11:

String is a container and an encoding is a label of it. While data whose encoding is an encoding categorized in dummy encodings in Ruby, we cannot avoid such encodings.

So, there are 3 things we can decide, which I pointed out in https://bugs.ruby-lang.org/issues/18949#note-6
It seems you are against "3. Deprecate and then remove other dummy encodings".
What about the other points, 1. and 2.?

Could you explain your point of view better?
Who uses dummy encodings and what for? Since no string operations really works on them, they are indeed nothing more than a label + the transcoding table.

I agree 3. is more difficult, and is less important than 1./2..

seems straightforward, there seems to be extremely few usages of it, and they can be replaced by Encoding::BINARY easily.
In fact Encoding#replicate is used in zero gems, so I think it's fair enough to deprecate and then remove it without further discussion.
I think we need to evaluate usages of the dummy UTF-16 and UTF-32 encodings, e.g., with gem-codesearch.

For better compatibility between Rubies we need to do 2. as well, as mentioned above the overhead and complexity for dummy UTF-16/UTF-32 is too much, and JRuby & TruffleRuby already do not deal with it for many String operations.
It is also clearly a significant performance cost for CRuby.

Updated by Eregon (Benoit Daloze) over 3 years ago Actions
Copy link
#13 [ruby-core:109434]

Usages of Encoding::UTF_32:

$ gem-codesearch 'Encoding::UTF_32\b'
/srv/gems/chupa-text-decomposer-html-1.0.3/test/test-html.rb:            assert_equal([Encoding::UTF_32], decompose(@data))
Easily replaced by LE/BE: https://github.com/ranguba/chupa-text-decomposer-html/blob/4ede88e7cb868bf7d6e6bf30b5bc2e4d6b50b338/test/test-html.rb#L249-L256
/srv/gems/flydata-0.8.10.2/flydata-core/lib/flydata-core/table_def/mysql_table_def.rb:    "utf32"    => Encoding::UTF_32,
No repository available, seems just an incorrect mapping
/srv/gems/rbs-2.6.0/core/encoding.rbs:Encoding::UTF_32: Encoding
Just a mapping of Ruby Encoding constants, safe to remove
/srv/gems/rsense-core-0.6.6/stubs/1.8/_builtin.rb:      Encoding::UTF_32,
/srv/gems/rsense-core-0.6.6/stubs/1.8/_builtin.rb:Encoding::UTF_32 = Encoding.new
Just a mapping of Ruby Encoding constants, safe to remove
/srv/gems/ruby-lint-2.3.1/lib/ruby-lint/definitions/core/encoding.rb:  defs.define_constant('Encoding::UTF_32') do |klass|
Just a mapping of Ruby Encoding constants, safe to remove
/srv/gems/ruby_encoding_iroha-0.1.0/lib/ruby_encoding_iroha.rb:      Encoding::UTF_32,
Test gem

Usages of Encoding::UTF_16:

$ gem-codesearch 'Encoding::UTF_16\b' | grep -v vendor
/srv/gems/code_owners-2.0.1/lib/code_owners.rb:      input.encode!(Encoding::UTF_16, invalid: :replace, replace: '�')
/srv/gems/code_owners-2.0.1/lib/code_owners.rb:      input.encode!(Encoding::UTF_8, Encoding::UTF_16)
/srv/gems/dependabot-common-0.205.0/lib/dependabot/clients/azure.rb:        pr_description = pr_description.dup.force_encoding(Encoding::UTF_16)
/srv/gems/dependabot-common-0.205.0/lib/dependabot/clients/azure.rb:          truncated_msg = "...\n\n_Description has been truncated_".dup.force_encoding(Encoding::UTF_16)
/srv/gems/eightball-0.2.2/lib/eightball.rb:    string = string.encode(Encoding::UTF_16, :invalid => :replace, :undef => :replace, :replace => replacement_string)
/srv/gems/encodable_validator-0.2.0/spec/encodable_validator_spec.rb:            TestRecord.validates :target_text, :encodable => { :encodings => [Encoding::UTF_16] }
/srv/gems/fastlane-2.208.0/fastlane_core/lib/fastlane_core/ui/implementations/shell.rb:      return message.encode(Encoding::UTF_8, Encoding::UTF_16) if test_message.force_encoding(Encoding::UTF_16).valid_encoding?
/srv/gems/fastlane_hotfix-2.187.0/fastlane_core/lib/fastlane_core/ui/implementations/shell.rb:      return message.encode(Encoding::UTF_8, Encoding::UTF_16) if test_message.force_encoding(Encoding::UTF_16).valid_encoding?
/srv/gems/flydata-0.8.10.2/flydata-core/lib/flydata-core/redshift/string.rb:                  string.encode(Encoding::UTF_16, options).encode(Encoding::UTF_8) :
/srv/gems/flydata-0.8.10.2/flydata-core/lib/flydata-core/table_def/mysql_table_def.rb:    "utf16"    => Encoding::UTF_16,
/srv/gems/fmod-0.9.6/lib/fmod/core/tag.rb:        when TagDataType::STRING_UTF16 then raw.force_encoding(Encoding::UTF_16)
/srv/gems/honeybadger-mubi-2.1.1/lib/honeybadger/util/sanitizer.rb:          data.encode(Encoding::UTF_16, ENCODE_OPTS).encode!(Encoding::UTF_8)
/srv/gems/id3tag-1.0.0/lib/id3tag/encoding_util.rb:      0b1 => Encoding::UTF_16,
/srv/gems/id3tag-1.0.0/lib/id3tag/frames/v2/text_frame.rb:          0b1 => Encoding::UTF_16,
/srv/gems/id3taginator-0.8/README.md:  .default_encode_for_destination(Encoding::UTF_16)
/srv/gems/id3taginator-0.8/README.md:audio_file.default_encode_for_destination(Encoding::UTF_16)
/srv/gems/id3taginator-0.8/lib/id3taginator/extensions/encodable.rb:        when Encoding::UTF_16
/srv/gems/id3taginator-0.8/lib/id3taginator/extensions/encodable.rb:          Encoding::UTF_16
/srv/gems/id3taginator-0.8/lib/id3taginator/options/options.rb:      def initialize(default_encode_dest = Encoding::UTF_16, default_decode_dest = Encoding::UTF_8, padding_bytes = 25,
/srv/gems/json-canonicalization-0.3.0/lib/json/canonicalization.rb:    return k.encode(Encoding::UTF_16)
/srv/gems/lhj-tools-0.1.59/lib/lhj/ui/implementations/shell.rb:      return message.encode(Encoding::UTF_8, Encoding::UTF_16) if test_message.force_encoding(Encoding::UTF_16).valid_encoding?
/srv/gems/msgpack-1.5.4/spec/format_spec.rb:    v = pack_unpack('string'.encode(Encoding::UTF_16))
/srv/gems/net-http-0.2.2/lib/net/http/response.rb:      charset = Encoding::UTF_8 if charset == Encoding::UTF_16
/srv/gems/oj-3.13.19/test/json_gem/json_parser_test.rb:    assert_equal Encoding::UTF_16, source.encoding
/srv/gems/oj-3.13.19/test/json_gem/json_parser_test.rb:  end if defined?(Encoding::UTF_16)
/srv/gems/pdfinfo-1.4.2/lib/pdfinfo.rb:    str = str.encode(Encoding::UTF_16, invalid: :replace, undef: :replace, replace: '')
/srv/gems/rbs-2.6.0/core/encoding.rbs:Encoding::UTF_16: Encoding
/srv/gems/rosette-core-1.0.1/spec/core/commands/translations/export_command_spec.rb:      command.set_encoding(Encoding::UTF_16)
/srv/gems/rsense-core-0.6.6/stubs/1.8/_builtin.rb:      Encoding::UTF_16,
/srv/gems/rsense-core-0.6.6/stubs/1.8/_builtin.rb:Encoding::UTF_16 = Encoding.new
/srv/gems/ruby-lint-2.3.1/lib/ruby-lint/definitions/core/encoding.rb:  defs.define_constant('Encoding::UTF_16') do |klass|
/srv/gems/ruby_encoding_iroha-0.1.0/lib/ruby_encoding_iroha.rb:      Encoding::UTF_16,
/srv/gems/sqlite3-fiddle-1.0.0/lib/sqlite3/database.rb:        str.force_encoding(Encoding::UTF_16)
/srv/gems/txgh-7.0.3/spec/github_api_spec.rb:      expect(result[:content].encoding).to eq(Encoding::UTF_16)
/srv/gems/txgh-7.0.3/spec/gitlab_api_spec.rb:      expect(result[:content].encoding).to eq(Encoding::UTF_16)

More usages, but all these usages seem incorrect, and should use either UTF_16LE or UTF_16BE.

What we probably need here is a custom deprecation messages for these two constants, to tell to use the LE/BE variants instead.

Updated by Eregon (Benoit Daloze) over 3 years ago Actions
Copy link
#14 [ruby-core:109435]

znz (Kazuhiro NISHIYAMA) wrote in #note-10:

I think ISO-2022-JP is used for mail archives, and IRCnet even now.

AFAIK Encoding::ISO_2022_JP (in Ruby) is only useful to say "this String is encoded in ISO-2022-JP", but Ruby can't do anything with that except String#encode.
Is the use case then to String#encode to another encoding before any String operation?
Because all character/codepoints String operations are broken on ISO-2022-JP (and other dummy encodings).

Could it be replaced by the BINARY encoding in programs using it?
Or just an instance variable on the String to remember its original encoding?
Or eagerly converting to a non-dummy encoding as soon as receiving ISO-2022-JP bytes from an external source?

Updated by duerst (Martin Dürst) over 3 years ago Actions
Copy link
#15 [ruby-core:109440]

Eregon (Benoit Daloze) wrote in #note-14:

znz (Kazuhiro NISHIYAMA) wrote in #note-10:

I think ISO-2022-JP is used for mail archives, and IRCnet even now.

AFAIK Encoding::ISO_2022_JP (in Ruby) is only useful to say "this String is encoded in ISO-2022-JP", but Ruby can't do anything with that except String#encode.
Is the use case then to String#encode to another encoding before any String operation?

Yes, exactly. I already said so in https://bugs.ruby-lang.org/issues/18949#note-8.

Could it be replaced by the BINARY encoding in programs using it?

No, that would lose encoding information.

Or just an instance variable on the String to remember its original encoding?

Theoretically yes, but that would make dealing with encodings more complex.

Or eagerly converting to a non-dummy encoding as soon as receiving ISO-2022-JP bytes from an external source?

Theoretically yes, but that would again complicate code somewhere. We don't just want to move around complexity, because it stays roughly the same, and moving it around is work and may introduce additional errors.

Eregon (Benoit Daloze) wrote in #note-13:

Usages of Encoding::UTF_32:
(details omitted)
Usages of Encoding::UTF_16:
(details omitted)
More usages, but all these usages seem incorrect, and should use either UTF_16LE or UTF_16BE.

What we probably need here is a custom deprecation messages for these two constants, to tell to use the LE/BE variants instead.

You are looking for Encoding::UTF_32 in code, but what's more relevant is UTF-32 and UTF-16 in data. These are officially registered 'charset' labels, see https://www.iana.org/assignments/charset-reg/charset-reg.xhtml.

Eregon (Benoit Daloze) wrote in #note-12:

This is a disappointing answer, it sounds like there was not enough time at the dev meeting to consider this issue properly maybe? (that's fine, but then don't reject please)

I wasn't at the meeting, so I can't say.

I reopen because this issue is still vastly unanswered.
Also it feels not so respectful to me to answer with a single short unclear sentence to a carefully-crafted issue.

Naruse-san usually writes very short answeres, but that doesn't mean he doesn't understand the matter at hand.

I estimate I have a lot of experience in this area since I reimplemented dummy/replicate encodings in TruffleRuby recently.

(Naruse-san and others have a lot of experience because they implemented these encodings in CRuby.)

The dev meeting log has no notes about a discussion: https://github.com/ruby/dev-meeting-log/blob/master/DevMeeting-2022-07-21.md#feature-18949-deprecate-and-remove-replicate-and-dummy-encodings-eregon

naruse (Yui NARUSE) wrote in #note-11:

String is a container and an encoding is a label of it. While data whose encoding is an encoding categorized in dummy encodings in Ruby, we cannot avoid such encodings.

So, there are 3 things we can decide, which I pointed out in https://bugs.ruby-lang.org/issues/18949#note-6
It seems you are against "3. Deprecate and then remove other dummy encodings".
What about the other points, 1. and 2.?

For point 2, please see above. For 1, these functions are internal, and I guess the same thing could be done with some different functions, but as far as I understand, the functionality is needed for Ruby itself. And I don't see the point of a fixed number of encodings. That it is possible to support new encodings may no longer be as important as it was 10 or 20 years ago, but may come in handy anyway.

Could you explain your point of view better?
Who uses dummy encodings and what for? Since no string operations really works on them, they are indeed nothing more than a label + the transcoding table.

Yes. That's exactly what they are. And I have difficulties understanding why there would be a problem with this. Supporting a real encoding is real work. But supporting dummy encodings can't be that much work, because there isn't much functionality.

I agree 3. is more difficult, and is less important than 1./2..

seems straightforward, there seems to be extremely few usages of it, and they can be replaced by Encoding::BINARY easily.
In fact Encoding#replicate is used in zero gems, so I think it's fair enough to deprecate and then remove it without further discussion.

enc_replicate is used internally as far as I understand. Maybe we can deprecate the Ruby method, but not the C function.

I think we need to evaluate usages of the dummy UTF-16 and UTF-32 encodings, e.g., with gem-codesearch.

See above.

For better compatibility between Rubies we need to do 2. as well, as mentioned above the overhead and complexity for dummy UTF-16/UTF-32 is too much, and JRuby & TruffleRuby already do not deal with it for many String operations.
It is also clearly a significant performance cost for CRuby.

Updated by nobu (Nobuyoshi Nakada) over 3 years ago Actions
Copy link
#16 [ruby-core:109441]

Eregon (Benoit Daloze) wrote in #note-14:

Could it be replaced by the BINARY encoding in programs using it?

Yes, it could, and was exactly the way in 1.8 or earlier.
I don't want it again.

Updated by nobu (Nobuyoshi Nakada) over 3 years ago Actions
Copy link
#17 [ruby-core:109442]

Eregon (Benoit Daloze) wrote:

Replicate and dummy encodings seem to clearly fall into this category, almost nobody uses them but they add a significant complexity and also add a significant performance overhead.
Notably, the existence of those means the number of encodings in a Ruby runtime is actually variable and not fixed.
That means extra synchronization, hashtable lookups, indirections, function calls, etc.

These are not related to replicate/dummy encodings.

Updated by Eregon (Benoit Daloze) over 3 years ago Actions
Copy link
#18 [ruby-core:109447]

nobu (Nobuyoshi Nakada) wrote in #note-17:

Eregon (Benoit Daloze) wrote:

Replicate and dummy encodings seem to clearly fall into this category, almost nobody uses them but they add a significant complexity and also add a significant performance overhead.
Notably, the existence of those means the number of encodings in a Ruby runtime is actually variable and not fixed.
That means extra synchronization, hashtable lookups, indirections, function calls, etc.

These are not related to replicate/dummy encodings.

It very much is, as explained in the description (https://bugs.ruby-lang.org/issues/18949#Summary).
Replicate (Encoding#replicate) and extra dummy encodings (rb_define_dummy_encoding) cause a variable number of encodings, which means extra synchronization, hashtable lookups, indirections, function calls, etc.
A fixed number of encodings would simplify that a lot in complexity and in performance for getting the encoding of a String.
The dummy UTF-16/UTF-32 encodings cause the overhead of get_actual_encoding().
Other dummy encodings do not cause performance overheads (item 3. above).

Updated by Eregon (Benoit Daloze) over 3 years ago Actions
Copy link
#19 [ruby-core:109448]

duerst (Martin Dürst) wrote in #note-15:

Yes, exactly. I already said so in https://bugs.ruby-lang.org/issues/18949#note-8.

Right, I wasn't sure this was what you meant there, thank you for confirming.

Or just an instance variable on the String to remember its original encoding?

Theoretically yes, but that would make dealing with encodings more complex.

Only for extremely-rarely used encodings that don't even exist in the list of 100+ encodings in Ruby by default, so I think that is reasonable.

Or eagerly converting to a non-dummy encoding as soon as receiving ISO-2022-JP bytes from an external source?

Theoretically yes, but that would again complicate code somewhere. We don't just want to move around complexity, because it stays roughly the same, and moving it around is work and may introduce additional errors.

AFAIK we know the encoding of an external string at the point it enters Ruby (e.g. reading from an external source).
So associating the encoding (e.g. with force_encoding) must be done at that point.
Eager conversion could also be done at that point with very little changes, and it would have the significant advantage that all String operations are actually implemented/correct and not some of them are broken (for Strings with a dummy encoding).
In code, it'd means instead of io.gets.force_encoding("some-dummy-encoding") we'd use io.gets.encode("UTF-8", "some-dummy-encoding").

You are looking for Encoding::UTF_32 in code, but what's more relevant is UTF-32 and UTF-16 in data. These are officially registered 'charset' labels, see https://www.iana.org/assignments/charset-reg/charset-reg.xhtml.

Do you know what this "UTF-16" charset does or mean? https://www.iana.org/assignments/charset-reg/UTF-16 does not say much.
Is it always the native-endian UTF-16, does it use the BOM, what if there is no BOM?
The reality AFAIK is there no such thing as an UTF-16 encoding without an endian, it's either UTF-16LE or UTF-16BE and the user should know.
For the rare cases where it can be guessed from the BOM, we can add a convenience method to pick the right encoding between these two based on the BOM (as proposed in the description).

This is what Wikipedia says on the matter: https://en.wikipedia.org/wiki/UTF-16#Byte-order_encoding_schemes
Basically, we need the resolve the endian, before that we can't do anything and "UTF-16" is meaningless without a BOM or specifying LE/BE.
Re-reading the BOM on every string operation is a needless overhead, it should be done once, and only paid if using UTF-16/UTF-32 strings with BOM.

Giving the dummy UTF-16/32 encoding to a String is almost always a mistake, for instance taking a substring of it with a non-zero offset will result as treating it like single-byte BINARY (because the BOM is gone then), which is clearly never what anyone wants.
This resolution of the endian needs to be happen before the encoding is assigned to the String to be meaningful, so a UTF-* string always knows its endian.

For 1, these functions are internal, and I guess the same thing could be done with some different functions, but as far as I understand, the functionality is needed for Ruby itself. And I don't see the point of a fixed number of encodings. That it is possible to support new encodings may no longer be as important as it was 10 or 20 years ago, but may come in handy anyway.

Yes, the fact that e.g. enc_replicate is used internally is an implementation detail (and it could easily be changed to something else, the encoding indices could be determined statically, no need for dynamically growing the list of encodings, etc).
The point of a fixed number of encodings:

We know the number of encodings, they can all fit in 7 bits, and so ENCODING_GET can be simply RB_ENCODING_GET_INLINED (no need for the complicated and slow fallback logic).
We can store all encodings in a read-only global array, there is no need to handle resizing or anything
We can optimize the lookup of encodings by name better, by having the map read-only and potentially use perfect-hashing
No need to synchronize changes to these array and map.
A significant amount of complexity can be removed in all Ruby implementations currently implementing replicate/extra dummy encodings.
Other points mentioned in https://bugs.ruby-lang.org/issues/18949#Summary

Yes. That's exactly what they are. And I have difficulties understanding why there would be a problem with this. Supporting a real encoding is real work. But supporting dummy encodings can't be that much work, because there isn't much functionality.

Those dummy encodings expose broken behavior for String operations, so IMHO they hurt more than help.
But I can already see some people disagree on that, so let's postpone 3. for now and leave existing/predefined dummy encoding untouched.

Updated by duerst (Martin Dürst) over 3 years ago Actions
Copy link
#20 [ruby-core:109452]

Eregon (Benoit Daloze) wrote in #note-18:

Replicate (Encoding#replicate) and extra dummy encodings (rb_define_dummy_encoding) cause a variable number of encodings, which means extra synchronization, hashtable lookups, indirections, function calls, etc.
A fixed number of encodings would simplify that a lot in complexity and in performance for getting the encoding of a String.

I think it should be okay to in your implementation assume a maximum number of encodings (currently, the number of actual encodings in CRuby is a bit more than 100, so a max of 127 or whatever might be a good choice). Then you don't need a hashtable, you can just use an array.

Regards, Martin.

Updated by duerst (Martin Dürst) over 3 years ago Actions
Copy link
#21 [ruby-core:109453]

Eregon (Benoit Daloze) wrote in #note-19:

duerst (Martin Dürst) wrote in #note-15:

Or just an instance variable on the String to remember its original encoding?

Theoretically yes, but that would make dealing with encodings more complex.

Only for extremely-rarely used encodings that don't even exist in the list of 100+ encodings in Ruby by default, so I think that is reasonable.

Having a special case that is used extremely rarely is prone to hidden bugs, and is therefore bad engineering.

It's unclear whether you want to have Ruby internally deal with such rare encodings in a special way, or whether you want the user to do that, but both are bad choices.

You are looking for Encoding::UTF_32 in code, but what's more relevant is UTF-32 and UTF-16 in data. These are officially registered 'charset' labels, see https://www.iana.org/assignments/charset-reg/charset-reg.xhtml.

Do you know what this "UTF-16" charset does or mean? https://www.iana.org/assignments/charset-reg/UTF-16 does not say much.

It references RFC 2781 (see https://www.rfc-editor.org/rfc/rfc2781). That provides quite a bit of details.

Is it always the native-endian UTF-16, does it use the BOM, what if there is no BOM?
The reality AFAIK is there no such thing as an UTF-16 encoding without an endian, it's either UTF-16LE or UTF-16BE and the user should know.

Basically, we need the resolve the endian, before that we can't do anything and "UTF-16" is meaningless without a BOM or specifying LE/BE.

[A string may be big-endian or little-endian, but it's a string's endianness, not a strings 'endian'.]

Updated by Eregon (Benoit Daloze) over 3 years ago Actions
Copy link
#22 [ruby-core:109454]

duerst (Martin Dürst) wrote in #note-20:

I think it should be okay to in your implementation assume a maximum number of encodings (currently, the number of actual encodings in CRuby is a bit more than 100, so a max of 127 or whatever might be a good choice). Then you don't need a hashtable, you can just use an array.

That would get only a small part of the benefits, would introduce a weird arbitrary limit for Encoding#replicate/rb_define_dummy_encoding and would not remove the need for synchronization.
So I don't think that's a good choice in practice for any Ruby implementation.

Updated by Eregon (Benoit Daloze) over 3 years ago Actions
Copy link
#23 [ruby-core:109455]

duerst (Martin Dürst) wrote in #note-21:

It references RFC 2781 (see https://www.rfc-editor.org/rfc/rfc2781). That provides quite a bit of details.

Thanks for the link.
So that says strings in "UTF-16" (no explicit endian) SHOULD have a BOM, and if not SHOULD be interpreted as UTF-16BE (and the Wikipedia article links to some disagreement there, which means no BOM => undefined or would need to scan bytes which seems unreasonable and unreliable).
CRuby treats it as BINARY if there is no BOM, which seems to ignore this, but does not matter if we remove Encoding::UTF_16.
The RFC more or less say "UTF-16 bytes have no meaning unless the endian is specified or there is a BOM".
So Ruby should probably raise on dummy UTF-16 without BOM and force resolution at String creation time (described in resolve these encodings on String creation in the description).
It is much simpler and cleaner to not have Encoding::UTF_16 at all.

As we see in the list above, the dummy UTF-16 encoding seems extremely rarely used (https://bugs.ruby-lang.org/issues/18949#note-13), in practice users seem to already know to use UTF_16BE or UTF_16LE.
Even gem-codesearch "'UTF-16'" | grep '\.rb:' and gem-codesearch '"UTF-16"' | grep '\.rb:' are fairly short.
If the encoding comes dynamically from an external source then it really ought to specify the endianness.

But to truly know, I think we need to deprecate dummy UTF-16/32 and see if someone really has a use-case where they cannot do without these.
That seems clearly impossible if we provide a method to look at the BOM and return the encoding-with-endian, unless they actually want the BINARY behavior which makes no sense.

Updated by duerst (Martin Dürst) over 3 years ago Actions
Copy link
#24 [ruby-core:109459]

@Eregon (Benoit Daloze): If you want, you can add this issue to the developers' meeting next week. I will be in Switzerland, but I should be able to participate. Maybe you can also participate?

Eregon (Benoit Daloze) wrote in #note-22:

duerst (Martin Dürst) wrote in #note-20:

I think it should be okay to in your implementation assume a maximum number of encodings (currently, the number of actual encodings in CRuby is a bit more than 100, so a max of 127 or whatever might be a good choice). Then you don't need a hashtable, you can just use an array.

That would get only a small part of the benefits, would introduce a weird arbitrary limit for Encoding#replicate/rb_define_dummy_encoding and would not remove the need for synchronization.
So I don't think that's a good choice in practice for any Ruby implementation.

Maybe you have different information, but the number of additional encodings people may create is very, very low. I guess I should know, because otherwise, people would ask me for additional transcodings.

Updated by Eregon (Benoit Daloze) over 3 years ago Actions
Copy link
#25 [ruby-core:109461]

duerst (Martin Dürst) wrote in #note-24:

@Eregon (Benoit Daloze): If you want, you can add this issue to the developers' meeting next week. I will be in Switzerland, but I should be able to participate. Maybe you can also participate?

OK, I will add it.
BTW if you pass by Zurich I'd be happy to meet.
I can try to attend, however the meeting is 6am-10am local time, that is quite early for me.
Maybe we can discuss towards the end of the meeting i.e. 9am CEST/16 JST? I'll mention that on the dev meeting issue.

Maybe you have different information, but the number of additional encodings people may create is very, very low. I guess I should know, because otherwise, people would ask me for additional transcodings.

rb_define_dummy_encoding() doesn't actually have any transcoding (Encoding::ConverterNotFoundError: code converter not found (foo to UTF-8), https://gist.github.com/eregon/8ac387dc77b03d08f2259a64b9361934),
and Encoding#replicate just inherits the transcoding from the original encoding.
So rb_define_dummy_encoding() is even less useful than the existing dummy encodings, because one can literally do nothing with them in Ruby, it's just bytes + an encoding name, but no String operation or transcoding works.

Hence I really want to remove these 2 ways to create extra encodings dynamically, because they are basically useless (except to "label" a string, but there are other simpler ways to do that) and they cause a lot of complexity and overhead.

For example this created a large number of replicate encodings for testing purposes:
https://github.com/ged/ruby-pg/pull/472/files#diff-7c29db9c84b8ebcf77654f5b4a4fc1f6aebd918cf7ac8c2c773354ee56a71ec5L1880
BTW that's also the PR removing the usage of rb_define_dummy_encoding() in pg, so that is one less usage of rb_define_dummy_encoding().
It's very rare in practice, I agree with you there should never be more than e.g. 127 encodings in Ruby (unless many real encodings are created and added to Ruby).
But a max limit is not nearly as helpful as a fixed number of encodings in terms of complexity and overhead.

Updated by Dan0042 (Daniel DeLorme) over 3 years ago Actions
Copy link
#26 [ruby-core:109463]

Eregon (Benoit Daloze) wrote in #note-25:

But a max limit is not nearly as helpful as a fixed number of encodings in terms of complexity and overhead.

Can you clarify this please? With a max limit you need synchronization only when adding encodings. I hope you'll agree such a small level of complexity is not an issue. Apart from that, the two approaches should have the same complexity/overhead. So I don't understand why it would be "not nearly as helpful".

Updated by Eregon (Benoit Daloze) over 3 years ago Actions
Copy link
#27 [ruby-core:109464]

Dan0042 (Daniel DeLorme) wrote in #note-26:

Can you clarify this please? With a max limit you need synchronization only when adding encodings. I hope you'll agree such a small level of complexity is not an issue. Apart from that, the two approaches should have the same complexity/overhead. So I don't understand why it would be "not nearly as helpful".

In general there is no such thing as "synchronization only on writes, nothing on read", one would need something (some form of synchronization/checks) on reads too.
E.g. the array which holds the encodings by index could have volatile accesses but that's a big overhead as it prevents optimizations (if not volatile, stale reads are possible).
Or it could check if the read value is not NULL in a append-only array, but that's an extra cost.
Note the GVL does NOT help for these issues, and there is still locking needed in CRuby, since Encodings are shared across Ractors.
Perfect hashing as mentioned in https://bugs.ruby-lang.org/issues/18949#Summary is also not possible with a dynamic number of encodings.
The map/Hash holding encoding names to encodings (used by Encoding.find) would need synchronization with a max limit, but would not with a fixed number of encodings.

But also I feel it's very bad semantics to specify "usages of Encoding#replicate and rb_define_dummy_encoding() are limited to 127-103=24 times, after that they raise an error".
Do you know any language or core method which says you can call me N times total (with N a small number) and after that it raises an exception?

Also I don't see the point to preserve functionality basically nobody uses (3 gems in the world, rb_define_dummy_encoding() only used in weird edge cases, and they can be replaced by BINARY easily: https://bugs.ruby-lang.org/issues/18949#note-4).
I will go through these gems and make PRs to replace those usages.

Updated by Dan0042 (Daniel DeLorme) over 3 years ago Actions
Copy link
#28 [ruby-core:109472]

Thank you for the response.
It's true synchronization is required for a number of read operations, but I think the most important for performance, get_encoding, could become very fast with a max limit. But I digress.
More importantly I agree there's not much point to preserve functionality nobody uses. The flexibility is nice to have but doesn't seem worth the cost. 2¢

Updated by Eregon (Benoit Daloze) over 3 years ago Actions
Copy link
#29 [ruby-core:109543]

Copying from the dev meeting log:

We will try to change the encoding lookup for stings so that the fast path gets faster, and less checks are involved, in a way that should reduce or eliminate locking.
- https://github.com/ruby/ruby/blob/b0b9f7201acab05c2a3ad92c3043a1f01df3e17f/encoding.c#L423
For 'UTF-16' and 'UTF-32', we agreed that there is no point in getting the endianness dynamically every time the string is accessed. We will try to change the implementation so that the endianness is evaluated only once (string creation time) and that information can then be reused. This will evaluate some checks in 'hot' code.
- https://github.com/ruby/ruby/blob/b0b9f7201acab05c2a3ad92c3043a1f01df3e17f/string.c#L364
@matz (Yukihiro Matsumoto) agreed to deprecate and remove Encoding#replicate
Other committers are unsure if we can remove rb_define_dummy_encoding(), but if @Eregon (Benoit Daloze) makes PRs to replace usages in existing gems it may be OK

Updated by matz (Yukihiro Matsumoto) over 3 years ago Actions
Copy link
#30 [ruby-core:109566]

We still need to discuss about the implementation detail, but the baseline is what @Eregon (Benoit Daloze) explained above.

Matz.

Updated by Eregon (Benoit Daloze) over 3 years ago Actions
Copy link
#31 [ruby-core:109831]

Here is the dev meeting log: https://github.com/ruby/dev-meeting-log/blob/master/DevMeeting-2022-08-18.md#feature-18949-deprecate-and-remove-replicate-and-dummy-encodings-eregon

Conclusion: Just remove the dynamic UTF-16/32 endian detection. People should convert the string first.

Awesome, I think this is the best outcome and it simplifies things the most.
I'll deprecate Encoding::UTF_16/Encoding::UTF_32 (saying to use LE/BE explicitly instead) and in the next version we remove them.

rb_enc_from_index's lock: Conclusion: Change the encoding table into fixed array whose size is 256

This might still need a lock when resolving index to encoding (i.e., when reading), if e.g. a new dummy encoding or its index is passed to another Ractor (possibly passed via a C extension). WDYT @ko1 (Koichi Sasada)?

Updated by Eregon (Benoit Daloze) over 3 years ago Actions
Copy link
#32 [ruby-core:109833]

Assignee set to Eregon (Benoit Daloze)

PR for deprecating Encoding#replicate and the dummy Encoding::UTF_16/32 encodings: https://github.com/ruby/ruby/pull/6322

Updated by Eregon (Benoit Daloze) over 3 years ago Actions
Copy link
#33

Status changed from Open to Closed

Applied in changeset git|14bcf69c9c36875c7956d0ec8c04bfeaec514dd1.

Deprecate Encoding#replicate

See [Feature #18949].

Updated by Eregon (Benoit Daloze) over 3 years ago Actions
Copy link
#34

Status changed from Closed to Open

Updated by Eregon (Benoit Daloze) over 3 years ago Actions
Copy link
#35

Status changed from Open to Closed

Applied in changeset git|6525b6f760ccd9612c9546b0313ab1c7e4af5e66.

Remove get_actual_encoding() and the dynamic endian detection for dummy UTF-16/UTF-32

And simplify callers of get_actual_encoding().
See [Feature #18949].
See https://github.com/ruby/ruby/pull/6322#issuecomment-1242758474

Updated by Eregon (Benoit Daloze) over 3 years ago Actions
Copy link
#36 [ruby-core:109889]

Status changed from Closed to Open

Done:

Deprecate Encoding#replicate
Remove get_actual_encoding() and the dynamic endian detection for dummy UTF-16/UTF-32 (https://github.com/ruby/ruby/pull/6322)

Still to do:

Limit the number of dynamic encodings defined by rb_define_dummy_encoding(), or deprecate it after addressing all usages.
Based on a fixed number of encodings, optimize rb_enc_from_index() to avoid locking and other places which can be simplified due to the fixed number of encodings

Updated by Eregon (Benoit Daloze) over 3 years ago Actions
Copy link
#37 [ruby-core:109890]

In https://github.com/ruby/ruby/pull/6323 I managed to replace all usages of rb_define_dummy_encoding() by just rb_raise(rb_eArgError, "unknown encoding name - %s", name);, and all tests pass.
Feedback/review appreciated.

Updated by Eregon (Benoit Daloze) about 3 years ago Actions
Copy link
#38 [ruby-core:111690]

@ko1 (Koichi Sasada) did https://github.com/ruby/ruby/pull/6935, great, thank you!

Updated by Eregon (Benoit Daloze) about 3 years ago 1Actions
Copy link
#39 [ruby-core:111697]

Target version set to 3.3

This is all done now, only https://github.com/ruby/ruby/pull/7079 left and I'll merge that when it passes CI.

Overall:

We deprecated and removed Encoding#replicate
We removed get_actual_encoding()
We limited to 256 encodings and kept rb_define_dummy_encoding() with that constraint.
There is a single flat array to lookup encodings, rb_enc_from_index() is fast now.

Since the limit is 256 and not 128 though it means ENCODING_GET is not just RB_ENCODING_GET_INLINED but still has the check and slow fallback.

Thank you for the discussion, @ko1 (Koichi Sasada) for implementing the fixed-size table and let's close this.
Of course for all builtin encodings the cost is just the extra check.
Maybe the limit could be changed later to 128 if this optimization is wanted.

Updated by Eregon (Benoit Daloze) about 3 years ago Actions
Copy link
#40

Status changed from Open to Closed

Project

General

Profile

Ruby

Custom queries

Feature #18949

Deprecate and remove replicate and dummy encodings

Replicate Encodings¶

Dummy Encodings¶

UTF-16 and UTF-32 dummy encodings¶

Existing dummy encodings¶

rb_define_dummy_encoding(name)¶

Summary¶

To Decide¶

Updated by byroot (Jean Boussier) over 3 years ago ActionsCopy link #1 [ruby-core:109372]

Updated by noahgibbs (Noah Gibbs) over 3 years ago ActionsCopy link #2 [ruby-core:109377]

Updated by Eregon (Benoit Daloze) over 3 years ago ActionsCopy link #3 [ruby-core:109378]

Updated by Eregon (Benoit Daloze) over 3 years ago ActionsCopy link #4 [ruby-core:109384]

Updated by Eregon (Benoit Daloze) over 3 years ago ActionsCopy link #5

Updated by Eregon (Benoit Daloze) over 3 years ago ActionsCopy link #6 [ruby-core:109393]

Updated by Eregon (Benoit Daloze) over 3 years ago ActionsCopy link #7

Updated by duerst (Martin Dürst) over 3 years ago ActionsCopy link #8 [ruby-core:109400]

Updated by nirvdrum (Kevin Menard) over 3 years ago ActionsCopy link #9 [ruby-core:109402]

Updated by znz (Kazuhiro NISHIYAMA) over 3 years ago ActionsCopy link #10 [ruby-core:109405]

Updated by naruse (Yui NARUSE) over 3 years ago ActionsCopy link #11 [ruby-core:109406]

Updated by Eregon (Benoit Daloze) over 3 years ago ActionsCopy link #12 [ruby-core:109433]

Updated by Eregon (Benoit Daloze) over 3 years ago ActionsCopy link #13 [ruby-core:109434]

Updated by Eregon (Benoit Daloze) over 3 years ago ActionsCopy link #14 [ruby-core:109435]

Updated by duerst (Martin Dürst) over 3 years ago ActionsCopy link #15 [ruby-core:109440]

Updated by nobu (Nobuyoshi Nakada) over 3 years ago ActionsCopy link #16 [ruby-core:109441]

Updated by nobu (Nobuyoshi Nakada) over 3 years ago ActionsCopy link #17 [ruby-core:109442]

Updated by Eregon (Benoit Daloze) over 3 years ago ActionsCopy link #18 [ruby-core:109447]

Updated by Eregon (Benoit Daloze) over 3 years ago ActionsCopy link #19 [ruby-core:109448]

Updated by duerst (Martin Dürst) over 3 years ago ActionsCopy link #20 [ruby-core:109452]

Updated by duerst (Martin Dürst) over 3 years ago ActionsCopy link #21 [ruby-core:109453]

Updated by Eregon (Benoit Daloze) over 3 years ago ActionsCopy link #22 [ruby-core:109454]

Updated by Eregon (Benoit Daloze) over 3 years ago ActionsCopy link #23 [ruby-core:109455]

Updated by duerst (Martin Dürst) over 3 years ago ActionsCopy link #24 [ruby-core:109459]

Updated by Eregon (Benoit Daloze) over 3 years ago ActionsCopy link #25 [ruby-core:109461]

Updated by Dan0042 (Daniel DeLorme) over 3 years ago ActionsCopy link #26 [ruby-core:109463]

Updated by Eregon (Benoit Daloze) over 3 years ago ActionsCopy link #27 [ruby-core:109464]

Updated by Dan0042 (Daniel DeLorme) over 3 years ago ActionsCopy link #28 [ruby-core:109472]

Updated by Eregon (Benoit Daloze) over 3 years ago ActionsCopy link #29 [ruby-core:109543]

Updated by matz (Yukihiro Matsumoto) over 3 years ago ActionsCopy link #30 [ruby-core:109566]

Updated by Eregon (Benoit Daloze) over 3 years ago ActionsCopy link #31 [ruby-core:109831]

Updated by Eregon (Benoit Daloze) over 3 years ago ActionsCopy link #32 [ruby-core:109833]

Updated by Eregon (Benoit Daloze) over 3 years ago ActionsCopy link #33

Updated by Eregon (Benoit Daloze) over 3 years ago ActionsCopy link #34

Updated by Eregon (Benoit Daloze) over 3 years ago ActionsCopy link #35

Updated by Eregon (Benoit Daloze) over 3 years ago ActionsCopy link #36 [ruby-core:109889]

Updated by Eregon (Benoit Daloze) over 3 years ago ActionsCopy link #37 [ruby-core:109890]

Updated by Eregon (Benoit Daloze) about 3 years ago ActionsCopy link #38 [ruby-core:111690]

Updated by Eregon (Benoit Daloze) about 3 years ago 1ActionsCopy link #39 [ruby-core:111697]

Updated by Eregon (Benoit Daloze) about 3 years ago ActionsCopy link #40