Bug #21021

"try to mark T_NONE object" with 3.4.1

Added by Benoit_Tigeot (Benoit Tigeot) 7 days ago. Updated about 10 hours ago.

Status: Open
Assignee: -
Target version: -
ruby -v: ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [x86_64-linux]
[ruby-core:120580]

Description

Hello

We upgraded to 3.4.1 yesterday, and we have been seeing crashes since then.

/bundle/ruby/3.4.0/gems/activejob-7.2.2.1/lib/active_job/enqueuing.rb:93: [BUG] try to mark T_NONE object

I saw the other issue, related to the ffi gem: https://bugs.ruby-lang.org/issues/20694

But in our case the C level backtrace information looks different.

https://gist.github.com/benoittgt/13507c2000281aa7740bc782adab68c5

We migrated this part of the code from parallel to concurrent-ruby and have not seen the error again so far, but I am a little worried that it could come back.

Updated by Benoit_Tigeot (Benoit Tigeot) 7 days ago

Benoit_Tigeot (Benoit Tigeot) wrote:

We migrated this part of the code from parallel to concurrent-ruby and have not seen the error again so far, but I am a little worried that it could come back.

I was wrong. We still have the issue. Here is a new crash dump: https://gist.github.com/benoittgt/f0ad6476002b2a33c30070833e1d17c5

Updated by Benoit_Tigeot (Benoit Tigeot) 7 days ago

Benoit_Tigeot (Benoit Tigeot) wrote in #note-1:

I was wrong. We still have the issue. Here is a new crash dump: https://gist.github.com/benoittgt/f0ad6476002b2a33c30070833e1d17c5

Same with the latest psych update (psych appeared in the earlier crash dump, but in an old version). https://gist.github.com/benoittgt/13507c2000281aa7740bc782adab68c5?permalink_comment_id=5380956#gistcomment-5380956

Updated by tenderlovemaking (Aaron Patterson) 7 days ago

Are you able to get a core file or a backtrace from gdb? The bug is that some object has a T_NONE reference and is trying to mark that reference. We can't really tell what object has a broken reference without a core file (or possibly a gdb backtrace).

Updated by alanwu (Alan Wu) 6 days ago · Edited

There seems to be a weakmap bug that's been around since at least November 2024 that could be responsible: http://ci.rvm.jp/results/trunk-O0@ruby-sp2-noble-docker/5392991

rb_obj_info_dump: @)��
/tmp/ruby/src/trunk-O0/test/ruby/test_weakkeymap.rb:142: [BUG] try to mark T_NONE object
ruby 3.4.0dev (2024-11-05T22:08:35Z master 4203c70dfa) +PRISM [x86_64-linux]
-- Control frame information -----------------------------------------------
c:0018 p:---- s:0114 e:000113 CFUNC  :new
c:0017 p:0004 s:0110 e:000109 BLOCK  /tmp/ruby/src/trunk-O0/test/ruby/test_weakkeymap.rb:142

Latest occurrence from 2 days ago: http://ci.rvm.jp/results/trunk-yjit@ruby-sp2-noble-docker/5513233

Updated by Benoit_Tigeot (Benoit Tigeot) 6 days ago · Edited

Thanks for your answers.

tenderlovemaking (Aaron Patterson) wrote in #note-3:

Are you able to get a core file or a backtrace from gdb? The bug is that some object has a T_NONE reference and is trying to mark that reference. We can't really tell what object has a broken reference without a core file (or possibly a gdb backtrace).

I'm gonna try but it will take some time.

Updated by Benoit_Tigeot (Benoit Tigeot) 6 days ago

We are not seeing the issue if we disable YJIT, but it could be a side effect.
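When comparing runs with and without YJIT, it can help to log the YJIT state from inside the process itself, so each crash report can be matched to the configuration that produced it. The following is a minimal sketch; the helper name `yjit_status` is hypothetical, and the thread does not say how YJIT was switched off in the reporter's setup.

```ruby
# Hypothetical helper: report whether YJIT is compiled into this Ruby
# and whether it is currently enabled in the running process.
def yjit_status
  # defined? returns nil (no exception) when RubyVM or RubyVM::YJIT is missing.
  compiled = defined?(RubyVM::YJIT) ? true : false
  enabled = compiled && RubyVM::YJIT.enabled? ? true : false
  { compiled: compiled, enabled: enabled }
end

status = yjit_status
puts "YJIT compiled in: #{status[:compiled]}, enabled: #{status[:enabled]}"
```

Logging this once at boot of the job process should be enough to tell afterwards which configuration a given crash dump belongs to.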

Updated by Benoit_Tigeot (Benoit Tigeot) about 16 hours ago

Sorry for the delay. I removed the concurrency mechanism and let our cron task run multiple times. The crash output looks more informative.

https://gist.github.com/benoittgt/13507c2000281aa7740bc782adab68c5?permalink_comment_id=5391753#gistcomment-5391753

/bundle/ruby/3.4.0/gems/psych-5.2.2/lib/psych.so(parse+0x5c5) [0x7f3274e2bbd5] /bundle/ruby/3.4.0/gems/psych-5.2.2/ext/psych/psych_parser.c:384
[0x7f326bd3b3cf]

Updated by tenderlovemaking (Aaron Patterson) about 15 hours ago

Odd. This may be a weak map bug as @alanwu (Alan Wu) is saying.

The C level backtrace has these lines:

/usr/local/lib/libruby.so.3.4(rb_gc_mark_vm_stack_values) /usr/include/ruby-3.4.1/gc.c:2346
/usr/local/lib/libruby.so.3.4(rb_execution_context_mark+0x39) [0x7f329134af49] /usr/include/ruby-3.4.1/vm.c:3415

The GC is scanning the VM stack, marking any Ruby objects it finds there. This means something has pushed an invalid reference onto the Ruby stack.

Do you know if any of the code in your Ruby level backtrace is using WeakMaps?

Updated by alanwu (Alan Wu) about 13 hours ago

T_NONE on the stack is reminiscent of a class of YJIT bugs we see during development. I recommend building Ruby while passing --enable-yjit=dev to ./configure then attempting to re-trigger the crash. This build configuration runs debug assertions that can reveal more information about the bug. Note that you'll need cargo for this development build configuration and the build process will download some Rust dependencies from the internet.

If you use a third-party tool to build Ruby, you'll need to pass options to ./configure through that tool.

  • For ruby-install, it's $ ruby-install -- --enable-yjit=dev
  • For ruby-build, you can use the CONFIGURE_OPTS environment variable, e.g. $ CONFIGURE_OPTS=--enable-yjit=dev ruby-build ...

You should be able to verify that you have a dev build by checking $ ruby --yjit -v. It should include "+YJIT dev" like the following:

ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT dev +PRISM [arm64-darwin24] 
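The same check can be scripted, for example in a deploy step that should refuse to run the reproduction on a non-dev build. This is only a sketch: the method name `yjit_dev_build?` is hypothetical, and it simply looks for the "+YJIT dev" marker in a version string such as `RUBY_DESCRIPTION`, so the process has to be started with YJIT enabled (as with `--yjit` in the command above) for the marker to be present.

```ruby
# Hypothetical helper: true when a `ruby -v` style description
# advertises a YJIT dev build via the "+YJIT dev" marker.
def yjit_dev_build?(description = RUBY_DESCRIPTION)
  description.include?("+YJIT dev")
end

# Version strings copied from this thread.
release = "ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [x86_64-linux]"
dev     = "ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT dev +PRISM [arm64-darwin24]"

puts yjit_dev_build?(release) # release build: no "dev" marker
puts yjit_dev_build?(dev)     # dev build
```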

Updated by Benoit_Tigeot (Benoit Tigeot) about 10 hours ago · Edited

tenderlovemaking (Aaron Patterson) wrote in #note-8:

Do you know if any of the code in your Ruby level backtrace is using WeakMaps?

I don't see any match between the two:

~/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems ❯ rg WeakMap -g '*.rb' --max-count 1
debug-1.10.0/lib/debug/source_repository.rb
32:        @cmap = ObjectSpace::WeakMap.new

bundler-2.6.2/lib/bundler/vendor/connection_pool/lib/connection_pool.rb
49:    INSTANCES = ObjectSpace::WeakMap.new

connection_pool-2.5.0/lib/connection_pool.rb
49:    INSTANCES = ObjectSpace::WeakMap.new

activerecord-7.2.2.1/lib/active_record/connection_adapters/pool_config.rb
16:      INSTANCES = ObjectSpace::WeakMap.new

activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/transaction.rb
190:          @lazy_enrollment_records ||= ObjectSpace::WeakMap.new

mustermann-3.0.3/lib/mustermann/equality_map.rb
3:[Omitted long line with 1 matches]

sorbet-runtime-0.5.11751/lib/types/types/typed_array.rb
32:          ObjectSpace::WeakMap.new[1] = 1

sorbet-runtime-0.5.11751/lib/types/types/typed_class.rb
50:            ObjectSpace::WeakMap.new[1] = 1

sorbet-runtime-0.5.11751/lib/types/types/simple.rb
81:          ObjectSpace::WeakMap.new[1] = 1

activesupport-7.2.2.1/lib/active_support/descendants_tracker.rb
18:      # On MRI `ObjectSpace::WeakMap` keys are weak references.

drb-2.2.1/lib/drb/weakidconv.rb
17:        @map = ObjectSpace::WeakMap.new
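For environments where rg is not installed (a production container, for instance), roughly the same search can be done in plain Ruby. This is a sketch under assumptions: the method name `first_weakmap_matches` is hypothetical, and it mirrors `--max-count 1` by keeping only the first matching line of each `.rb` file.

```ruby
# Hypothetical helper: for every .rb file under `root`, record the first
# line matching `pattern` as [line_number, stripped_line], keyed by path.
def first_weakmap_matches(root, pattern = /WeakMap/)
  matches = {}
  Dir.glob(File.join(root, "**", "*.rb")).sort.each do |path|
    next unless File.file?(path)

    File.foreach(path).with_index(1) do |line, number|
      # Replace invalid byte sequences so the regexp match cannot raise.
      line = line.scrub
      next unless line.match?(pattern)

      matches[path] = [number, line.strip]
      break # keep only the first match per file, like --max-count 1
    end
  end
  matches
end
```

Calling it with the gems directory, e.g. `first_weakmap_matches(File.join(Gem.dir, "gems"))`, and printing each path with its line number should give a list comparable to the rg output above, which can then be compared against the gems named in the Ruby level backtrace.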

Thanks, Alan, for the detailed guide. I was able to build with YJIT dev and reproduce a crash, but at first sight the output looks quite similar. I do have a valid dev build:

$ ruby --yjit -v
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT dev +PRISM [x86_64-linux]

Here is a dump: https://gist.github.com/benoittgt/74d83534b9a2d8837d643cdcad318367

I've looked at it a little, but it is mostly app logs. I'm going to look at the YJIT source code to see what could be investigated.

I saw that someone posted a core file: https://bugs.ruby-lang.org/issues/21034

Thanks
