Bug #20009: Marshal.load raises exception when load dumped class include non-ASCII - Ruby - Ruby Issue Tracking System

Actions

Copy link

Bug #20009

closed

Marshal.load raises exception when load dumped class include non-ASCII

Added by ippachi (Kazuya Hatanaka) almost 2 years ago. Updated about 1 month ago.

Status:

Closed

Assignee:

Target version:

ruby -v:

ruby 3.2.2 (2023-03-30 revision e51014f9c0) [arm64-darwin22]

Backport:

3.2: REQUIRED, 3.3: DONE, 3.4: DONE

[ruby-core:115422]

Description

Reproduction code¶

class Cクラス; end
Marshal.load(Marshal.dump(Cクラス))

Actual result¶

<internal:marshal>:34:in `load': undefined class/module C\xE3\x82\xAF\xE3\x83\xA9\xE3\x82\xB9 (ArgumentError)
        from marshal.rb:2:in `<main>'

Expected result¶

Returns Cクラス

Impacted area¶

An exception is raised in Rails under the following conditions

minitest is used with default settings
Parallel execution with parallelize
test class names contain non-ASCII characters

The default parallelization uses DRb, and Marshal is used inside DRb.

Other¶

After trying various things, I thought I could fix it by making rb_path_to_class support strings containing non-ASCII characters, but I couldn't find anything more than that.

Actions

Copy link

#1 [ruby-core:115433]

Updated by byroot (Jean Boussier) almost 2 years ago

I dug into this bug, and I'm not sure if it's possible to fix it.

Classes are serialized this way:

          case T_CLASS:
            w_byte(TYPE_CLASS, arg);
            {
                VALUE path = class2path(obj);
                w_bytes(RSTRING_PTR(path), RSTRING_LEN(path), arg);
                RB_GC_GUARD(path);
            }
            break;

We write the TYPE_CLASS prefix, and then write the bytes of the class name, without any encoding indication.

Then on load, we just read the bytes and try to lookup the class:

      case TYPE_CLASS:
        {
            VALUE str = r_bytes(arg);

            v = path2class(str);

So on load we're looking for "Cクラス".b.to_sym, which doesn't match :"Cクラス".

To fix this we'd need to include the encoding in the format, but that would mean breaking backward and forward compatibility which is a huge deal.

Half-way solution¶

Some possible half-way solution would be:

Assume non-ASCII class names are UTF-8
Raise on dump for class names with non-UTF8 compatible class names.

It's far from ideal though.

Actions

Copy link

#2 [ruby-core:122043]

Updated by make_now_just (Hiroya Fujinami) 3 months ago · Edited

In my opinion, we need to introduce a new format for dumping classes/modules correctly.

Marshal uses c and m (TYPE_CLASS and TYPE_MODULE) as format prefixes currently, so the format is the following:

| 1 byte  | ...       | ...                     |
| 'c'/'m' | path size | path name binary string |

And, this format lacks the encoding, then the bug is happened.

To solve this issue, adding the encoding information to dump results is necessary, and introducing a new format seems a natural way to me. Therefore, a new format is the following (as a new type prefix is K):

| 1 byte | ...                                          |
| 'K'    | a dump of the path symbol (`:`, `;`, or `I`) |

Such a format is used for dumping other kinds of objects. e.g., o and S (TYPE_OBJECT and TYPE_STRUCT).

This idea does not break the backward compatibility, but we need to increment MARSHAL_MINOR.

Actions

Copy link

#3 [ruby-core:122048]

Updated by Eregon (Benoit Daloze) 3 months ago

byroot (Jean Boussier) wrote in #note-1:

Half-way solution¶

Some possible half-way solution would be:

Assume non-ASCII class names are UTF-8

Raise on dump for class names with non-UTF8 compatible class names.

It's far from ideal though.

I think this would be a pretty good solution actually, it seems very unlikely to have class names which can't be encoded in UTF-8.

Actions

Copy link

#4 [ruby-core:122137]

Updated by nobu (Nobuyoshi Nakada) 3 months ago · Edited

Currently instance variables of class/module are prohibited.
It may be possible to put the encoding information there.

Actions

Copy link

#5 [ruby-core:122139]

Updated by nobu (Nobuyoshi Nakada) 3 months ago · Edited

I made a patch which works well except for tests added by ruby/spec@907cb35.
Since the dumped strings in these tests have not been loadable, I think they are useless actually.

https://github.com/ruby/ruby/pull/13362

Actions

Copy link

#6 [ruby-core:122143]

Updated by Eregon (Benoit Daloze) 3 months ago

Mmh, but Marshal.dump+load of such non-7-bit modules/classes works on TruffleRuby, although it needs a tiny fix:

diff --git a/src/main/ruby/truffleruby/core/marshal.rb b/src/main/ruby/truffleruby/core/marshal.rb
index 102468e774..ea7469ea4a 100644
--- a/src/main/ruby/truffleruby/core/marshal.rb
+++ b/src/main/ruby/truffleruby/core/marshal.rb
@@ -786,19 +786,19 @@ module Marshal
     end
 
     def construct_class
-      obj = const_lookup(get_byte_sequence.to_sym, Class)
+      obj = const_lookup(get_byte_sequence.force_encoding(Encoding::UTF_8).to_sym, Class)
       store_unique_object obj
       obj
     end
 
     def construct_module
-      obj = const_lookup(get_byte_sequence.to_sym, Module)
+      obj = const_lookup(get_byte_sequence.force_encoding(Encoding::UTF_8).to_sym, Module)
       store_unique_object obj
       obj
     end
 
     def construct_old_module
-      obj = const_lookup(get_byte_sequence.to_sym)
+      obj = const_lookup(get_byte_sequence.force_encoding(Encoding::UTF_8).to_sym)
       store_unique_object obj
       obj
     end

class MultibyteぁあぃいClass
end

source_object = MultibyteぁあぃいClass

p Marshal.dump(source_object)
p Marshal.load(Marshal.dump(source_object))

$ ruby -v marshal_class.rb
truffleruby 25.0.0-dev-a65bde3d, like ruby 3.3.7, Interpreted JVM [x86_64-linux]
"\x04\bc\x1FMultibyte\xE3\x81\x81\xE3\x81\x82\xE3\x81\x83\xE3\x81\x84Class"
MultibyteぁあぃいClass

I think at least if no encoding information is present we should assume UTF-8, because it's by far the most common source encoding.
I think there is no value to look up the name in BINARY encoding as currently, such a constant wouldn't even print well.

(FWIW TruffleRuby stores constant names as Java Strings, which means no encoding information. I'm not convinced it's a good idea to e.g. have two constants É in e.g. UTF-8 and ISO-8859-1 on the same module, it just seems needless confusion. Having non-7-bit BINARY-encoded constants seems no good either.)

Actions

Copy link

#7 [ruby-core:122144]

Updated by Eregon (Benoit Daloze) 3 months ago · Edited

So my suggestion would be:

Interpret the serialized module/class name as UTF-8, not as BINARY. And of course if it's only 7-bit as US-ASCII (already the case).
If the module/class name uses another encoding, we could either transcode it to UTF-8, or use nobu's trick of serializing the encoding name as a fake instance variable. Transcoding to UTF-8 seems simpler, and the only case it wouldn't work is for BINARY with non-7-bit, which seems of no value anyway. EDIT: mmh but I suppose transcoding to UTF-8 wouldn't work on CRuby because the lookup would fail.

Actions

Copy link

#8 [ruby-core:122145]

Updated by Eregon (Benoit Daloze) 3 months ago

@nobu (Nobuyoshi Nakada) What happens when using your patch to Marshal.dump a class and then trying to load it on an older Ruby? Will it create an instance variable E or error?
I guess if there is no matching BINARY constant it would error anyway, but if there is it would set that instance variable E or different error.

Actions

Copy link

#9 [ruby-core:122168]

Updated by nobu (Nobuyoshi Nakada) 3 months ago

Eregon (Benoit Daloze) wrote in #note-8:

@nobu (Nobuyoshi Nakada) What happens when using your patch to Marshal.dump a class and then trying to load it on an older Ruby? Will it create an instance variable E or error?

Just an error.

I guess if there is no matching BINARY constant it would error anyway, but if there is it would set that instance variable E or different error.

If the matching BINARY class/module exists, the instance variable causes an error.

Actions

Copy link

#10 [ruby-core:122169]

Updated by nobu (Nobuyoshi Nakada) 3 months ago

Eregon (Benoit Daloze) wrote in #note-7:

So my suggestion would be:

Interpret the serialized module/class name as UTF-8, not as BINARY. And of course if it's only 7-bit as US-ASCII (already the case).

If the module/class name uses another encoding, we could either transcode it to UTF-8, or use nobu's trick of serializing the encoding name as a fake instance variable. Transcoding to UTF-8 seems simpler, and the only case it wouldn't work is for BINARY with non-7-bit, which seems of no value anyway. EDIT: mmh but I suppose transcoding to UTF-8 wouldn't work on CRuby because the lookup would fail.

I'm not against to interpret the default encoding as UTF-8, but don't think transcoding is intuitive as different encoding symbols are different even if they are same when transcoded.

Actions

Copy link

#11

Updated by nobu (Nobuyoshi Nakada) 3 months ago

Status changed from Open to Closed
Backport changed from 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN to 3.2: REQUIRED, 3.3: REQUIRED, 3.4: REQUIRED

Actions

Copy link

#12 [ruby-core:122536]

Updated by nagachika (Tomoyuki Chikanaga) 2 months ago

In my understanding, with 097d742a1ed53afb91e83aef01365d68b763357b Marshal.load could cause error with bytes stream which was dumped from older version ruby. If this is right, I don't want to backport the change to stable branch.

Actions

Copy link

#13 [ruby-core:122760]