Bug #21021
"try to mark T_NONE object" with 3.4.1
Status: Open
Added by Benoit_Tigeot (Benoit Tigeot) 7 days ago. Updated about 12 hours ago.
Description
Hello
We upgraded to 3.4.1 yesterday, but we have been seeing crashes since then.
/bundle/ruby/3.4.0/gems/activejob-7.2.2.1/lib/active_job/enqueuing.rb:93: [BUG] try to mark T_NONE object
I saw the other issue related to the ffi gem: https://bugs.ruby-lang.org/issues/20694
But in our case the C level backtrace information looks different.
https://gist.github.com/benoittgt/13507c2000281aa7740bc782adab68c5
We migrated this part of the code from parallel to concurrent-ruby and have not seen the error again so far, but I am a little worried that it could come back.
Updated by Benoit_Tigeot (Benoit Tigeot) 7 days ago
Benoit_Tigeot (Benoit Tigeot) wrote:
We migrated this part of the code from parallel to concurrent-ruby and have not seen the error again so far, but I am a little worried that it could come back.
I was wrong. We still have the issue. Here is a new crash dump: https://gist.github.com/benoittgt/f0ad6476002b2a33c30070833e1d17c5
Updated by Benoit_Tigeot (Benoit Tigeot) 7 days ago
Benoit_Tigeot (Benoit Tigeot) wrote in #note-1:
I was wrong. We still have the issue. Here is a new crash dump: https://gist.github.com/benoittgt/f0ad6476002b2a33c30070833e1d17c5
Same with the latest psych update (psych was present in the crash dump, but in an old version). https://gist.github.com/benoittgt/13507c2000281aa7740bc782adab68c5?permalink_comment_id=5380956#gistcomment-5380956
Updated by tenderlovemaking (Aaron Patterson) 7 days ago
Are you able to get a core file or a backtrace from gdb? The bug is that some object has a T_NONE reference and is trying to mark that reference. We can't really tell what object has a broken reference without a core file (or possibly a gdb backtrace).
Updated by alanwu (Alan Wu) 7 days ago · Edited
There seems to be a weakmap bug that's been around since at least November 2024 that could be responsible: http://ci.rvm.jp/results/trunk-O0@ruby-sp2-noble-docker/5392991
rb_obj_info_dump: @)��
/tmp/ruby/src/trunk-O0/test/ruby/test_weakkeymap.rb:142: [BUG] try to mark T_NONE object
ruby 3.4.0dev (2024-11-05T22:08:35Z master 4203c70dfa) +PRISM [x86_64-linux]
-- Control frame information -----------------------------------------------
c:0018 p:---- s:0114 e:000113 CFUNC :new
c:0017 p:0004 s:0110 e:000109 BLOCK /tmp/ruby/src/trunk-O0/test/ruby/test_weakkeymap.rb:142
Latest occurrence from 2 days ago: http://ci.rvm.jp/results/trunk-yjit@ruby-sp2-noble-docker/5513233
Updated by Benoit_Tigeot (Benoit Tigeot) 6 days ago · Edited
Thanks for your answers.
tenderlovemaking (Aaron Patterson) wrote in #note-3:
Are you able to get a core file or a backtrace from gdb? The bug is that some object has a T_NONE reference and is trying to mark that reference. We can't really tell what object has a broken reference without a core file (or possibly a gdb backtrace).
I'm gonna try but it will take some time.
Updated by Benoit_Tigeot (Benoit Tigeot) 6 days ago
We are not seeing the issue if we disable YJIT, but it could be a side effect.
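When comparing runs with and without YJIT, it can help to confirm from inside the process which mode is actually active. The following is a minimal sketch, not something posted in this thread; the helper name yjit_status is hypothetical, and it only relies on RubyVM::YJIT.enabled?.

```ruby
# Hypothetical helper: report whether YJIT is compiled in and currently
# enabled in this process.
def yjit_status
  # RubyVM::YJIT is not defined on builds without YJIT support.
  return :unavailable unless defined?(RubyVM::YJIT)

  RubyVM::YJIT.enabled? ? :enabled : :disabled
end

puts "YJIT: #{yjit_status}"
```

Logging this value at boot of the cron task would make it easier to tie each crash dump to a YJIT or non-YJIT run.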
Updated by Benoit_Tigeot (Benoit Tigeot) about 18 hours ago
Sorry for the delay. I removed the concurrency mechanism and let our cron task run multiple times. The crash output seems to be more interesting.
/bundle/ruby/3.4.0/gems/psych-5.2.2/lib/psych.so(parse+0x5c5) [0x7f3274e2bbd5] /bundle/ruby/3.4.0/gems/psych-5.2.2/ext/psych/psych_parser.c:384
[0x7f326bd3b3cf]
Updated by tenderlovemaking (Aaron Patterson) about 17 hours ago
Odd. This may be a weak map bug as @alanwu (Alan Wu) is saying.
The C level backtrace has these lines:
/usr/local/lib/libruby.so.3.4(rb_gc_mark_vm_stack_values) /usr/include/ruby-3.4.1/gc.c:2346
/usr/local/lib/libruby.so.3.4(rb_execution_context_mark+0x39) [0x7f329134af49] /usr/include/ruby-3.4.1/vm.c:3415
The GC is scanning the VM stack marking any Ruby objects it finds in the stack. This means something has pushed an invalid reference on the Ruby stack.
Do you know if any of the code in your Ruby level backtrace is using WeakMaps?
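For readers less familiar with weak maps, here is a minimal sketch of what ObjectSpace::WeakMap usage looks like. It is only an illustration of the API being asked about, not a reproduction of the crash; the method name weakmap_demo is hypothetical.

```ruby
# Hypothetical illustration of ObjectSpace::WeakMap semantics.
# This does not attempt to reproduce the T_NONE crash.
def weakmap_demo
  map = ObjectSpace::WeakMap.new
  key = Object.new
  value = Object.new
  map[key] = value

  # While key and value are still strongly referenced by local variables,
  # the entry is visible through the map.
  [map.key?(key), map[key].equal?(value)]
end

p weakmap_demo
```

The map itself does not keep its entries alive, so once nothing else references them the GC is free to reclaim them and drop the entry. That is why a bug in weak map bookkeeping could plausibly leave a stale reference behind.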
Updated by alanwu (Alan Wu) about 15 hours ago
T_NONE on the stack is reminiscent of a class of YJIT bugs we see during development. I recommend building Ruby while passing --enable-yjit=dev to ./configure, then attempting to re-trigger the crash. This build configuration runs debug assertions that can reveal more information about the bug. Note that you'll need cargo for this development build configuration and the build process will download some Rust dependencies from the internet.
If you use a third-party tool to build Ruby, you'll need to pass options to ./configure through that tool.
- For ruby-install, it's: $ ruby-install -- --enable-yjit=dev
- For ruby-build, you can use the CONFIGURE_OPTS environment variable, e.g.: $ CONFIGURE_OPTS=--enable-yjit=dev ruby-build ....
You should be able to verify that you have a dev build by checking $ ruby --yjit -v. It should include "+YJIT dev" like the following:
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT dev +PRISM [arm64-darwin24]
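The same check can be made from inside a running process, which is handy when the app is started by a wrapper and it is not obvious which Ruby binary it picked up. This is a sketch under the assumption that RUBY_DESCRIPTION matches the ruby -v output of the running interpreter; the helper name yjit_dev_build? is hypothetical.

```ruby
# Hypothetical helper: true when the version banner advertises a YJIT dev build.
# Assumes the banner only contains "+YJIT" when YJIT is enabled for the
# process (for example when started with --yjit).
def yjit_dev_build?(description = RUBY_DESCRIPTION)
  description.include?("+YJIT dev")
end

puts RUBY_DESCRIPTION
puts "YJIT dev build: #{yjit_dev_build?}"
```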
Updated by Benoit_Tigeot (Benoit Tigeot) about 12 hours ago · Edited
tenderlovemaking (Aaron Patterson) wrote in #note-8:
Do you know if any of the code in your Ruby level backtrace is using WeakMaps?
I see no match between the two:
~/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems ❯ rg WeakMap -g '*.rb' --max-count 1
debug-1.10.0/lib/debug/source_repository.rb
32: @cmap = ObjectSpace::WeakMap.new
bundler-2.6.2/lib/bundler/vendor/connection_pool/lib/connection_pool.rb
49: INSTANCES = ObjectSpace::WeakMap.new
connection_pool-2.5.0/lib/connection_pool.rb
49: INSTANCES = ObjectSpace::WeakMap.new
activerecord-7.2.2.1/lib/active_record/connection_adapters/pool_config.rb
16: INSTANCES = ObjectSpace::WeakMap.new
activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/transaction.rb
190: @lazy_enrollment_records ||= ObjectSpace::WeakMap.new
mustermann-3.0.3/lib/mustermann/equality_map.rb
3:[Omitted long line with 1 matches]
sorbet-runtime-0.5.11751/lib/types/types/typed_array.rb
32: ObjectSpace::WeakMap.new[1] = 1
sorbet-runtime-0.5.11751/lib/types/types/typed_class.rb
50: ObjectSpace::WeakMap.new[1] = 1
sorbet-runtime-0.5.11751/lib/types/types/simple.rb
81: ObjectSpace::WeakMap.new[1] = 1
activesupport-7.2.2.1/lib/active_support/descendants_tracker.rb
18: # On MRI `ObjectSpace::WeakMap` keys are weak references.
drb-2.2.1/lib/drb/weakidconv.rb
17: @map = ObjectSpace::WeakMap.new
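The search above covers every installed gem. A narrower check is to look only at files the process has actually loaded. The following is a rough sketch of that idea, assuming it is run inside the app (for example from a console); the helper name files_mentioning is hypothetical, and $LOADED_FEATURES reflects what has been required so far, not what appears in a given backtrace.

```ruby
# Hypothetical helper: return the .rb paths from `paths` whose contents
# mention `needle`. Missing files and non-Ruby files (e.g. .so) are skipped.
def files_mentioning(paths, needle = "WeakMap")
  paths.select do |path|
    File.extname(path) == ".rb" &&
      File.file?(path) &&
      # binread avoids encoding errors on files with unusual bytes.
      File.binread(path).include?(needle)
  end
end

if __FILE__ == $PROGRAM_NAME
  puts files_mentioning($LOADED_FEATURES)
end
```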
Thanks Alan for the detailed guide. I was able to use the YJIT dev build and get a crash, but the output seems to be quite similar at first sight. I have a valid version:
$ ruby --yjit -v
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT dev +PRISM [x86_64-linux]
Here is a dump https://gist.github.com/benoittgt/74d83534b9a2d8837d643cdcad318367
I've looked at it a little bit, but those are mostly app logs. I'm going to look at the YJIT source code to see what can be investigated.
I saw that someone posted a core file https://bugs.ruby-lang.org/issues/21034
Thanks