Marshal limit of >= 2 GiB
Using a gem to handle matrix operations called Numo-array I found the following error when save large matrix:
in `dump': long too big to dump (TypeError)
Github thread: https://github.com/ruby-numo/numo-narray/issues/144
Digging with the authors, we found the following code that reproduces the error:
ruby -e 'Marshal.dump(" "*2**31)'
Executed in :
ruby 2.7.0dev (2019-11-12T12:03:22Z master 3816622fbe) [x86_64-linux]
The marshal library has a limit that is checked with the SIZEOF_LONG constant. This check is performed in this line https://github.com/ruby/ruby/blob/e7ea6e078fecb70fbc91b04878b69f696749afac/marshal.c#L301 to 321 of the Marshal.c file. I don't understand the motivation of this limit and has a great impact in libraries that need to serialize large objects as numeric matrix. In this case, the limit of >= 2 GiB it's reached easily and it blocks the ruby development in scientifical projects as cited. I found other bug related: #1560, but the Marshal problem itself was not addressed in this case.
Thank you in advance
Updated by shyouhei (Shyouhei Urabe) 2 months ago
- Description updated (diff)
This behaviour has been there since the beginning. No ruby version since 0.49 has successfully dumped such long string. Same thing happens for a very big bignum, a very long array, a class that has very long classpath (Q::W::E::R::...), an object of 2**31 instance variables (which isn't impossible these days), and much much more.
The limitation is due to marshal's binary format. I guess the reason behind this is simply because at the time the format was designed (back in 1990s), there simply was no such thing like a 64 bit integer type. To properly reroute we have to reconsider all use of
long in marshal format. I guess that is essentially a format change. That should hurt data portability so not that easy.
Any nice idea to fix the situation?
Updated by shevegen (Robert A. Heiler) 2 months ago
I don't understand the motivation of this limit and has a great impact in libraries that need to serialize large objects as numeric matrix.
In this case, the limit of >= 2 GiB it's reached easily and it blocks the ruby development in scientifical projects as cited.
Shyouhei already pointed out the historic reason. I believe you can quite easily convince the ruby core team that a change may
be necessary in the long run (most likely past ruby 3.0) based on use cases. Matz likes to hear real world use cases, so the
more information may be given the better. :)
As for possibility of change, I guess the Marshal format could be kept by default, but another variant could perhaps be added
where people could switch to another format - a bit like syck and psych could be used interchangably for yaml to some extent
(I used syck for quite some time even after psych was added, before I transitioned into Unicode finally; I used to specify
the yaml engine via e. g. YAML.engine = or something like that).