Project

General

Profile

Actions

Bug #20009

open

Marshal.load raises exception when load dumped class include non-ASCII

Added by ippachi (Kazuya Hatanaka) over 1 year ago. Updated 14 days ago.

Status:
Open
Assignee:
-
Target version:
-
ruby -v:
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [arm64-darwin22]
[ruby-core:115422]

Description

Reproduction code

class Cクラス; end
Marshal.load(Marshal.dump(Cクラス))

Actual result

<internal:marshal>:34:in `load': undefined class/module C\xE3\x82\xAF\xE3\x83\xA9\xE3\x82\xB9 (ArgumentError)
        from marshal.rb:2:in `<main>'

Expected result

Returns Cクラス

Impacted area

An exception is raised in Rails under the following conditions

  • minitest is used with default settings
  • Parallel execution with parallelize
  • test class names contain non-ASCII characters

The default parallelization uses DRb, and Marshal is used inside DRb.

Other

After trying various things, I thought I could fix it by making rb_path_to_class support strings containing non-ASCII characters, but I couldn't find anything more than that.

Updated by byroot (Jean Boussier) over 1 year ago

I dug into this bug, and I'm not sure if it's possible to fix it.

Classes are serialized this way:

          case T_CLASS:
            w_byte(TYPE_CLASS, arg);
            {
                VALUE path = class2path(obj);
                w_bytes(RSTRING_PTR(path), RSTRING_LEN(path), arg);
                RB_GC_GUARD(path);
            }
            break;

We write the TYPE_CLASS prefix, and then write the bytes of the class name, without any encoding indication.

Then on load, we just read the bytes and try to lookup the class:

      case TYPE_CLASS:
        {
            VALUE str = r_bytes(arg);

            v = path2class(str);

So on load we're looking for "Cクラス".b.to_sym, which doesn't match :"Cクラス".

To fix this we'd need to include the encoding in the format, but that would mean breaking backward and forward compatibility which is a huge deal.

Half-way solution

Some possible half-way solution would be:

  • Assume non-ASCII class names are UTF-8
  • Raise on dump for class names with non-UTF8 compatible class names.

It's far from ideal though.

Updated by make_now_just (Hiroya Fujinami) 18 days ago · Edited

In my opinion, we need to introduce a new format for dumping classes/modules correctly.

Marshal uses c and m (TYPE_CLASS and TYPE_MODULE) as format prefixes currently, so the format is the following:

| 1 byte  | ...       | ...                     |
| 'c'/'m' | path size | path name binary string |

And, this format lacks the encoding, then the bug is happened.

To solve this issue, adding the encoding information to dump results is necessary, and introducing a new format seems a natural way to me. Therefore, a new format is the following (as a new type prefix is K):

| 1 byte | ...                                          |
| 'K'    | a dump of the path symbol (`:`, `;`, or `I`) |

Such a format is used for dumping other kinds of objects. e.g., o and S (TYPE_OBJECT and TYPE_STRUCT).

This idea does not break the backward compatibility, but we need to increment MARSHAL_MINOR.

Updated by Eregon (Benoit Daloze) 18 days ago

byroot (Jean Boussier) wrote in #note-1:

Half-way solution

Some possible half-way solution would be:

  • Assume non-ASCII class names are UTF-8
  • Raise on dump for class names with non-UTF8 compatible class names.

It's far from ideal though.

I think this would be a pretty good solution actually, it seems very unlikely to have class names which can't be encoded in UTF-8.

Updated by nobu (Nobuyoshi Nakada) 15 days ago · Edited

Currently instance variables of class/module are prohibited.
It may be possible to put the encoding information there.

Updated by nobu (Nobuyoshi Nakada) 15 days ago · Edited

I made a patch which works well except for tests added by ruby/spec@907cb35.
Since the dumped strings in these tests have not been loadable, I think they are useless actually.

https://github.com/ruby/ruby/pull/13362

Updated by Eregon (Benoit Daloze) 15 days ago

Mmh, but Marshal.dump+load of such non-7-bit modules/classes works on TruffleRuby, although it needs a tiny fix:

diff --git a/src/main/ruby/truffleruby/core/marshal.rb b/src/main/ruby/truffleruby/core/marshal.rb
index 102468e774..ea7469ea4a 100644
--- a/src/main/ruby/truffleruby/core/marshal.rb
+++ b/src/main/ruby/truffleruby/core/marshal.rb
@@ -786,19 +786,19 @@ module Marshal
     end
 
     def construct_class
-      obj = const_lookup(get_byte_sequence.to_sym, Class)
+      obj = const_lookup(get_byte_sequence.force_encoding(Encoding::UTF_8).to_sym, Class)
       store_unique_object obj
       obj
     end
 
     def construct_module
-      obj = const_lookup(get_byte_sequence.to_sym, Module)
+      obj = const_lookup(get_byte_sequence.force_encoding(Encoding::UTF_8).to_sym, Module)
       store_unique_object obj
       obj
     end
 
     def construct_old_module
-      obj = const_lookup(get_byte_sequence.to_sym)
+      obj = const_lookup(get_byte_sequence.force_encoding(Encoding::UTF_8).to_sym)
       store_unique_object obj
       obj
     end
class MultibyteぁあぃいClass
end

source_object = MultibyteぁあぃいClass

p Marshal.dump(source_object)
p Marshal.load(Marshal.dump(source_object))
$ ruby -v marshal_class.rb
truffleruby 25.0.0-dev-a65bde3d, like ruby 3.3.7, Interpreted JVM [x86_64-linux]
"\x04\bc\x1FMultibyte\xE3\x81\x81\xE3\x81\x82\xE3\x81\x83\xE3\x81\x84Class"
MultibyteぁあぃいClass

I think at least if no encoding information is present we should assume UTF-8, because it's by far the most common source encoding.
I think there is no value to look up the name in BINARY encoding as currently, such a constant wouldn't even print well.

(FWIW TruffleRuby stores constant names as Java Strings, which means no encoding information. I'm not convinced it's a good idea to e.g. have two constants É in e.g. UTF-8 and ISO-8859-1 on the same module, it just seems needless confusion. Having non-7-bit BINARY-encoded constants seems no good either.)

Updated by Eregon (Benoit Daloze) 15 days ago · Edited

So my suggestion would be:

  • Interpret the serialized module/class name as UTF-8, not as BINARY. And of course if it's only 7-bit as US-ASCII (already the case).
  • If the module/class name uses another encoding, we could either transcode it to UTF-8, or use nobu's trick of serializing the encoding name as a fake instance variable. Transcoding to UTF-8 seems simpler, and the only case it wouldn't work is for BINARY with non-7-bit, which seems of no value anyway. EDIT: mmh but I suppose transcoding to UTF-8 wouldn't work on CRuby because the lookup would fail.

Updated by Eregon (Benoit Daloze) 15 days ago

@nobu (Nobuyoshi Nakada) What happens when using your patch to Marshal.dump a class and then trying to load it on an older Ruby? Will it create an instance variable E or error?
I guess if there is no matching BINARY constant it would error anyway, but if there is it would set that instance variable E or different error.

Updated by nobu (Nobuyoshi Nakada) 14 days ago

Eregon (Benoit Daloze) wrote in #note-8:

@nobu (Nobuyoshi Nakada) What happens when using your patch to Marshal.dump a class and then trying to load it on an older Ruby? Will it create an instance variable E or error?

Just an error.

I guess if there is no matching BINARY constant it would error anyway, but if there is it would set that instance variable E or different error.

If the matching BINARY class/module exists, the instance variable causes an error.

Updated by nobu (Nobuyoshi Nakada) 14 days ago

Eregon (Benoit Daloze) wrote in #note-7:

So my suggestion would be:

  • Interpret the serialized module/class name as UTF-8, not as BINARY. And of course if it's only 7-bit as US-ASCII (already the case).
  • If the module/class name uses another encoding, we could either transcode it to UTF-8, or use nobu's trick of serializing the encoding name as a fake instance variable. Transcoding to UTF-8 seems simpler, and the only case it wouldn't work is for BINARY with non-7-bit, which seems of no value anyway. EDIT: mmh but I suppose transcoding to UTF-8 wouldn't work on CRuby because the lookup would fail.

I'm not against to interpret the default encoding as UTF-8, but don't think transcoding is intuitive as different encoding symbols are different even if they are same when transcoded.

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0