Project

General

Profile

Actions

Bug #21710

closed

Segfault when reading object_id after it is set inside RUBY_INTERNAL_EVENT_NEWOBJ

Bug #21710: Segfault when reading object_id after it is set inside RUBY_INTERNAL_EVENT_NEWOBJ

Added by ivoanjo (Ivo Anjo) 21 days ago. Updated 4 days ago.

Status:
Closed
Assignee:
-
Target version:
ruby -v:
ruby 4.0.0dev (2025-11-24T08:44:28Z master aeb7689e69) +PRISM [x86_64-linux]
[ruby-core:123902]

Description

Hey ๐Ÿ‘‹. I caught a following segfault when running the Datadog Ruby Profiler test suite on 4.0.0-preview2.

The Datadog Ruby Profiler still uses object_ids in RUBY_INTERNAL_EVENT_NEWOBJ to simulate weak references. (We know this is deprecated and we plan to move to rb_gc_mark_weak in the future).

I was able to reproduce this bug with both 4.0.0-preview2 as well as a ruby-head build from 2025-11-24.

Here's a very simple reproducer:

static void on_newobj_event(VALUE tpval, void *data) {
  VALUE obj = rb_tracearg_object(rb_tracearg_from_tracepoint(tpval));
  if (!rb_objspace_internal_object_p(obj)) rb_obj_id(obj);
}

static VALUE add_object_id(RB_UNUSED_VAR(VALUE _)) {
  VALUE tp = rb_tracepoint_new(0, RUBY_INTERNAL_EVENT_NEWOBJ, on_newobj_event, NULL);
  rb_tracepoint_enable(tp); rb_yield(Qnil); rb_tracepoint_disable(tp);
  return Qnil;
}
puts RUBY_DESCRIPTION

require "lowlevel-toolkit"

foo = Struct.new(:foo)
bar = nil
LowlevelToolkit.add_object_id { bar = foo.new(1) }
bar.object_id

and here's the segfault:

examples/add_object_id.rb:8: [BUG] Segmentation fault at 0x0000000000000000
ruby 4.0.0dev (2025-11-24T08:44:28Z master aeb7689e69) +PRISM [x86_64-linux]

-- Control frame information -----------------------------------------------
c:0003 p:---- s:0012 e:000011 l:y b:---- CFUNC  :object_id
c:0002 p:0044 s:0008 E:002170 l:n b:---- EVAL   examples/add_object_id.rb:8 [FINISH]
c:0001 p:0000 s:0003 E:001dc0 l:y b:---- DUMMY  [FINISH]

-- Ruby level backtrace information ----------------------------------------
examples/add_object_id.rb:8:in '<main>'
examples/add_object_id.rb:8:in 'object_id'

-- Threading information ---------------------------------------------------
Total ractor count: 1
Ruby thread count for this ractor: 1

-- Machine register context ------------------------------------------------
 RIP: 0x000079c33b33e2cd RBP: 0x00007ffdcee65d10 RSP: 0x00007ffdcee65cd0
 RAX: 0x0000000000000000 RBX: 0x000079c33a4ff058 RCX: 0x000079c339000000
 RDX: 0x0000000000000000 RDI: 0x0000000000000000 RSI: 0x00000000000fe000
  R8: 0x000079c33b35c078  R9: 0x000079c31f2814d0 R10: 0x000058e4893a6ee0
 R11: 0xb84a1fbe2b4aab11 R12: 0x0000000000000002 R13: 0x0000000000000000
 R14: 0x000058e489384b60 R15: 0x000079c33a5fefa0 EFL: 0x0000000000010246

-- C level backtrace information -------------------------------------------
.rvm/rubies/ruby-head/lib/libruby.so.4.0(rb_print_backtrace+0x24) [0x79c33b392149] .rvm/src/ruby-head/vm_dump.c:1105
.rvm/rubies/ruby-head/lib/libruby.so.4.0(rb_vm_bugreport+0x33f) [0x79c33b39291b] .rvm/src/ruby-head/vm_dump.c:1450
.rvm/rubies/ruby-head/lib/libruby.so.4.0(rb_bug_for_fatal_signal+0x147) [0x79c33b13a3e2]
.rvm/rubies/ruby-head/lib/libruby.so.4.0(sigsegv+0x84) [0x79c33b2bea4d] .rvm/src/ruby-head/signal.c:948
.rvm/rubies/ruby-head/lib/libruby.so.4.0(sigill) (null):0
/lib/x86_64-linux-gnu/libc.so.6(0x79c33ac45330) [0x79c33ac45330]
.rvm/rubies/ruby-head/lib/libruby.so.4.0(rb_obj_field_get+0x105) [0x79c33b33e2cd] .rvm/src/ruby-head/variable.c:1412
.rvm/rubies/ruby-head/lib/libruby.so.4.0(object_id_get+0x4e) [0x79c33b16ed28] .rvm/src/ruby-head/gc.c:1875
.rvm/rubies/ruby-head/lib/libruby.so.4.0(object_id0+0x46) [0x79c33b16ed78] .rvm/src/ruby-head/gc.c:1895
.rvm/rubies/ruby-head/lib/libruby.so.4.0(object_id+0xb1) [0x79c33b16ee93] .rvm/src/ruby-head/gc.c:1936
.rvm/rubies/ruby-head/lib/libruby.so.4.0(rb_find_object_id+0x43) [0x79c33b16f6ea] .rvm/src/ruby-head/gc.c:2179
.rvm/rubies/ruby-head/lib/libruby.so.4.0(rb_obj_id+0x2e) [0x79c33b16f75a] .rvm/src/ruby-head/gc.c:2234
.rvm/rubies/ruby-head/lib/libruby.so.4.0(ractor_safe_call_cfunc_0+0x30) [0x79c33b35c0a8] .rvm/src/ruby-head/vm_insnhelper.c:3718
.rvm/rubies/ruby-head/lib/libruby.so.4.0(vm_call_cfunc_with_frame_+0x221) [0x79c33b35ccb1] .rvm/src/ruby-head/vm_insnhelper.c:3902
.rvm/rubies/ruby-head/lib/libruby.so.4.0(vm_call_cfunc_with_frame+0x76) [0x79c33b35cf26] .rvm/src/ruby-head/vm_insnhelper.c:3948
.rvm/rubies/ruby-head/lib/libruby.so.4.0(vm_call_cfunc_other+0x12b) [0x79c33b35d053] .rvm/src/ruby-head/vm_insnhelper.c:3974
.rvm/rubies/ruby-head/lib/libruby.so.4.0(vm_call_cfunc+0x147) [0x79c33b35d49a] .rvm/src/ruby-head/vm_insnhelper.c:4056
.rvm/rubies/ruby-head/lib/libruby.so.4.0(vm_call_method_each_type+0x180) [0x79c33b360161] .rvm/src/ruby-head/vm_insnhelper.c:4888
.rvm/rubies/ruby-head/lib/libruby.so.4.0(vm_call_method+0xa1) [0x79c33b360c1d] .rvm/src/ruby-head/vm_insnhelper.c:5014
.rvm/rubies/ruby-head/lib/libruby.so.4.0(vm_call_general+0x2f) [0x79c33b360e1f] .rvm/src/ruby-head/vm_insnhelper.c:5058
.rvm/rubies/ruby-head/lib/libruby.so.4.0(vm_sendish+0x1d3) [0x79c33b363689] .rvm/src/ruby-head/vm_insnhelper.c:6124
.rvm/rubies/ruby-head/lib/libruby.so.4.0(vm_exec_core+0x3b1d) [0x79c33b36b42c] .rvm/src/ruby-head/insns.def:903
.rvm/rubies/ruby-head/lib/libruby.so.4.0(rb_vm_exec+0x140) [0x79c33b384d94] .rvm/src/ruby-head/vm.c:2784
.rvm/rubies/ruby-head/lib/libruby.so.4.0(rb_iseq_eval_main+0x3d) [0x79c33b385be2] .rvm/src/ruby-head/vm.c:3050
.rvm/rubies/ruby-head/lib/libruby.so.4.0(rb_ec_exec_node+0x128) [0x79c33b145eb4] .rvm/src/ruby-head/eval.c:283
.rvm/rubies/ruby-head/lib/libruby.so.4.0(ruby_run_node+0x8a) [0x79c33b146025] .rvm/src/ruby-head/eval.c:321
.rvm/rubies/ruby-head/bin/ruby(rb_main+0x4c) [0x58e47500e51e] ./main.c:42
.rvm/rubies/ruby-head/bin/ruby(main+0x62) [0x58e47500e596] ./main.c:62

Let me know if I can provide any more info!


Files

bug-21710.patch (1.25 KB) bug-21710.patch nobu (Nobuyoshi Nakada), 11/25/2025 02:19 PM

Related issues 1 (1 open0 closed)

Related to Ruby - Feature #21722: Expose rb_gc_mark_weak API for use in extensionsOpenActions

Updated by byroot (Jean Boussier) 21 days ago Actions #1 [ruby-core:123904]

I believe I understand what's going on. The NEWOBJ callback is invoked before struct_alloc has set the necessary flags such as RSTRUCT_GEN_FIELDS and RSTRUCT_EMBED_LEN_MASK.

This cause rb_object_id to look for, and set, the fields_obj at the wrong place.

The ideal fix would be for struct_alloc to invoke NEWOBJ_OF with all these flags, instead of setting them later. Not 100% sure it's possible though, but I'll keep digging.

Updated by byroot (Jean Boussier) 21 days ago Actions #2 [ruby-core:123905]

  • Backport changed from 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN to 3.2: DONTNEED, 3.3: DONTNEED, 3.4: DONTNEED

I got a fix here: https://github.com/ruby/ruby/pull/15320

But I'd like to find some time to add a regression test, or at the very least to audit the other types to see if a similar issue is possible with the other types that also optimize generic fields access (e.g. T_DATA).

Updated by nobu (Nobuyoshi Nakada) 21 days ago Actions #3 [ruby-core:123906]

With the attached patch, GH-15320 works fine for Struct.

$ ./ruby -r-test-/tracepoint -e 'foo = Struct.new(:foo); bar = nil; Bug.tracepoint_add_object_id { bar = foo.new(1) }; p bar.object_id'
16

Crashes for Object.

$ ./ruby -r-test-/tracepoint -e 'bar = nil; Bug.tracepoint_add_object_id { bar = Object.new }; p bar.object_id'
../src/shape.c:1295: Assertion Failed: rb_shape_verify_consistency:flags_heap_index > 0
ruby 4.0.0dev (2025-11-25T14:15:17Z master 24268ceb68) +PRISM [arm64-darwin25]

C level backtrace is:

/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(rb_vm_bugreport+0xbd8) [0x101726c8c] /Users/nobu/build/ruby/master/aarch64-darwin/../src/vm_dump.c:1450
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(rb_vm_bugreport) (null):0
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(rb_assert_failure_detail+0xd4) [0x101894e28] /Users/nobu/build/ruby/master/aarch64-darwin/../src/error.c:1216
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(rb_assert_failure_detail+0x0) [0x101894d54] /Users/nobu/build/ruby/master/aarch64-darwin/../src/error.c:1192
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(rb_assert_failure) (null):0
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(rb_shape_verify_consistency.cold.8+0x0) [0x1018ba9a0] /Users/nobu/build/ruby/master/aarch64-darwin/../src/shape.c:1295
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(rb_shape_verify_consistency.cold.7) (null):0
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(rb_shape_verify_consistency+0x154) [0x101668428] /Users/nobu/build/ruby/master/aarch64-darwin/../src/shape.c:1295
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(obj_field_set+0xec) [0x1016e3c98] ../src/shape.h:177
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(RB_BUILTIN_TYPE+0x0) [0x10155f91c] /Users/nobu/build/ruby/master/aarch64-darwin/../src/gc.c:1901
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(rbimpl_RB_TYPE_P_fastpath) ../src/include/ruby/internal/value_type.h:352
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(object_id0) ../src/shape.h:145
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(exec_hooks_body+0x20) [0x101728dc0] /Users/nobu/build/ruby/master/aarch64-darwin/../src/vm_trace.c:358
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(exec_hooks_unprotected) /Users/nobu/build/ruby/master/aarch64-darwin/../src/vm_trace.c:387
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(rb_exec_event_hooks+0xc8) [0x101728d04] /Users/nobu/build/ruby/master/aarch64-darwin/../src/vm_trace.c:433
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(rb_exec_event_hook_orig+0x58) [0x10155f2d0] ../src/vm_core.h:2293
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(rb_gc_event_hook+0x30) [0x101555d3c] /Users/nobu/build/ruby/master/aarch64-darwin/../src/gc.c:237
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(newobj_of.cold.6+0x5c) [0x101898cb8] /Users/nobu/build/ruby/master/aarch64-darwin/../src/gc.c:1012
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(newobj_of+0x258) [0x10154be38] /Users/nobu/build/ruby/master/aarch64-darwin/../src/gc.c:1001
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(RB_SPECIAL_CONST_P+0x0) [0x1015c7ebc] /Users/nobu/build/ruby/master/aarch64-darwin/../src/object.c:127
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(rb_class_allocate_instance) ../src/shape.h:144
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(rb_class_alloc+0x68) [0x1015c9d40] /Users/nobu/build/ruby/master/aarch64-darwin/../src/object.c:2214
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(vm_exec_core+0xf50) [0x1016f28f8] /Users/nobu/build/ruby/master/aarch64-darwin/insns.def:928
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(rb_vm_exec+0x170) [0x1016efce8] /Users/nobu/build/ruby/master/aarch64-darwin/../src/vm.c:2784
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(invoke_block_from_c_bh+0x2f4) [0x10171ed48] /Users/nobu/build/ruby/master/aarch64-darwin/../src/vm.c:1814
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(rb_yield_0+0x80) [0x101700e9c] /Users/nobu/build/ruby/master/aarch64-darwin/../src/vm.c:1865
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(rb_yield+0x60) [0x101700f2c] ../src/vm_eval.c:1378
/Users/nobu/build/ruby/master/aarch64-darwin/.ext/arm64-darwin25/-test-/tracepoint.bundle(add_object_id+0x34) [0x1008e0afc] /Users/nobu/build/ruby/master/aarch64-darwin/ext/-test-/tracepoint/../../../../src/ext/-test-/tracepoint/tracepoint.c:103
/Users/nobu/build/ruby/master/aarch64-darwin/.ext/arm64-darwin25/-test-/tracepoint.bundle(add_object_id) (null):0
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(vm_call_cfunc_with_frame_+0xe8) [0x1017155d0] ../src/vm_insnhelper.c:3902
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(vm_sendish+0x6fc) [0x1016ef9cc]
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(vm_exec_core+0xb70) [0x1016f2518]
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(rb_vm_exec+0x170) [0x1016efce8] /Users/nobu/build/ruby/master/aarch64-darwin/../src/vm.c:2784
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(rb_ec_exec_node+0x74) [0x101539908] /Users/nobu/build/ruby/master/aarch64-darwin/../src/eval.c:283
/Users/nobu/build/ruby/master/aarch64-darwin/libruby.4.0.dylib(ruby_run_node+0x64) [0x101539838] /Users/nobu/build/ruby/master/aarch64-darwin/../src/eval.c:321
/Users/nobu/build/ruby/master/aarch64-darwin/ruby(rb_main+0x1c) [0x100828600] /Users/nobu/build/ruby/master/aarch64-darwin/../src/main.c:42
/Users/nobu/build/ruby/master/aarch64-darwin/ruby(main) /Users/nobu/build/ruby/master/aarch64-darwin/../src/main.c:62
/Users/nobu/build/ruby/master/aarch64-darwin/ruby(main) (null):0

Updated by byroot (Jean Boussier) 21 days ago Actions #4 [ruby-core:123907]

Thank you @nobu (Nobuyoshi Nakada), I'll integrate your test case and fix the remaining issues.

Updated by byroot (Jean Boussier) 21 days ago Actions #5 [ruby-core:123908]

I'm still waiting on CI, but https://github.com/ruby/ruby/pull/15320 now handles T_OBJECT too and include @nobu's test. Review welcome.

Updated by ivoanjo (Ivo Anjo) 20 days ago Actions #6 [ruby-core:123912]

Thanks for the hyper-quick investigation & PR ๐Ÿ™

Updated by byroot (Jean Boussier) 18 days ago Actions #7

  • Target version set to 4.0

Updated by jhawthorn (John Hawthorn) 17 days ago ยท Edited Actions #8 [ruby-core:123946]

I don't think we should support this (at least not long term, I'm fine if it is patched for Ruby 4.0.0). object_id has always potentially allocated (either to enlarge the backing hashes, or to make a BIGNUM on 32-bit). It's only in the past ~year that hasn't crashed (https://github.com/ruby/ruby/commit/f4476f0d07c781c906ed1353d8e1be5a7314d6e7) so calling rb_obj_id has always been unsafe (it's just more likely to crash now).

IMO the long term expectation should be that NEWOBJ callbacks do not allocate (new newobj, no xmalloc) or modify the Ruby objects they are given in any way (and also they should be thread-safe, but that's another discussion).

Updated by byroot (Jean Boussier) 17 days ago Actions #9 [ruby-core:123950]

IMO the long term expectation should be that NEWOBJ callbacks do not allocate (new newobj, no xmalloc) or modify the Ruby objects they are given in any way

While I generally agree with this, I suspect it's unworkable for use cases akin to memory profiling. It seems quite reasonable to want to keep track of allocated objects so you can recognize them in the FREE callback.

And since rb_gc_mark_weak isn't public API, there's no decent way to have weakrefs in a C extensions (except if you assume/enforce no compaction, then you can use just the address).

Since you also happen to maintain a memory profiler, I wonder how you think this should be done.

Updated by ivoanjo (Ivo Anjo) 15 days ago Actions #10 [ruby-core:123960]

And since rb_gc_mark_weak isn't public API, there's no decent way to have weakrefs in a C extensions (except if you assume/enforce no compaction, then you can use just the address).

Ahhhhh I totally had missed that one. Ever since I spotted https://github.com/ruby/ruby/pull/8113 I've had an item on my TODO list on "move away from id2obj for heap profiling in the Datadog profiler" but I completely missed that rb_gc_mark_weak is not public.

I'll open a separate issue for that -- definitely with the change to the object id implementation in Ruby 4.0 having rb_gc_mark_weak be public would be a great migration path.

Updated by ivoanjo (Ivo Anjo) 15 days ago Actions #11 [ruby-core:123962]

Created https://bugs.ruby-lang.org/issues/21722 to ask if rb_gc_mark_weak could be made a public symbol

Updated by byroot (Jean Boussier) 13 days ago Actions #12

  • Status changed from Open to Closed

Applied in changeset git|8c3909935e2ba9f79bf3492772c77c305a0d370b.


Handle NEWOBJ tracepoints settings fields

[Bug #21710]

  • struct.c: struct_alloc

It is possible for a NEWOBJ tracepoint call back to write fields
into a newly allocated object before struct_alloc had the time
to set the RSTRUCT_GEN_FIELDS flags and such.

Hence we can't blindly initialize the fields_obj reference to 0
we first need to check no fields were added yet.

  • object.c: rb_class_allocate_instance

Similarly, if a NEWOBJ tracepoint tries to set fields on the object,
the shape_id must already be set, as it's required on T_OBJECT to
know where to write fields.

NEWOBJ_OF had to be refactored to accept a shape_id.

Updated by byroot (Jean Boussier) 13 days ago Actions #13 [ruby-core:123994]

Alright, I merged the fix, but as @jhawthorn (John Hawthorn) mentioned, we should really figure out a way so you don't have to do this, because all we could think of when looking at the code was: "How has this ever worked?".

rb_gc_mark_weak can be an option, another may be to have a mechanism to prevent compaction, so you can simply rely on addresses.

Updated by ivoanjo (Ivo Anjo) 13 days ago Actions #14 [ruby-core:123996]

rb_gc_mark_weak can be an option, another may be to have a mechanism to prevent compaction, so you can simply rely on addresses.

Maybe, yeah! Effectively the two requirements I believe are:

a) Be able to identify or reference object without preventing its GC, even if the object is moved
b) Be able to tell if an object is alive or been GC at some later point

Right now we're using object_id for a) and _id2ref for b).

Preventing compaction, or at least knowing a bit more about when it runs and the ability to ask about what moved would probably solve a), not sure for b) (I guess FREEOBJ tracepoint... although that one's a bit costly?).

Updated by byroot (Jean Boussier) 13 days ago Actions #15 [ruby-core:123997]

That makes me wonder if a MOVEDOBJ tracepoint would be a good solution. At least for now you can assume compaction almost never happens, and even when it happens, not so many objects would be moved.

So you'd get the benefit of using addresses as identifiers, which is cheap and fast, and in the rare event of a move, you'd have the necessary data to update your datastructure.

MOVEDOBJ would be triggered with the old ref, so that you can call rb_gc_location() to get the new ref.

@jhawthorn (John Hawthorn) and @ivoanjo how does that sound?

NB: I do think rb_gc_mark_weak would be a very valuable to expose to extensions, but it might be a much higher bar to clear.

Updated by ivoanjo (Ivo Anjo) 13 days ago Actions #16 [ruby-core:123998]

Having MOVEDOBJ would be very useful for being able to use the address as identity + I think it may come in handy for caches -- I think it would solve the "a) Be able to identify or reference object without preventing its GC, even if the object is moved".

But I'm not sure it solves the whole puzzle, since we'd still need to query/be notified when such "home-grown weak referenced" object is collected.

That's where rb_gc_mark_weak would indeed be a nice solution, but TBH having:

  • a "is this ref a valid object"?
  • or something closer to Java's Reference Queues where Ruby would throw the addresses of the objects it GC in an array that could be post-processed later
  • be able to somehow mark or tell Ruby which objects we want to know about in FREEOBJ (possibly even MOVEDOBJ)

would be enough for the use-case.

Updated by Eregon (Benoit Daloze) 13 days ago ยท Edited Actions #17 [ruby-core:124003]

ivoanjo (Ivo Anjo) wrote in #note-14:

Maybe, yeah! Effectively the two requirements I believe are:

a) Be able to identify or reference object without preventing its GC, even if the object is moved
b) Be able to tell if an object is alive or been GC at some later point

That sounds like a use case for WeakRef or ObjectSpace::WeakMap.

In #21722 you mention that would need the GVL, but don't you need the GVL for object_id or EDIT: "your own weakref with" rb_gc_mark_weak anyway ? (I may be missing something)

Updated by Eregon (Benoit Daloze) 13 days ago Actions #18

  • Related to Feature #21722: Expose rb_gc_mark_weak API for use in extensions added

Updated by ivoanjo (Ivo Anjo) 13 days ago Actions #19 [ruby-core:124004]

That sounds like a use case for WeakRef or ObjectSpace::WeakMap.

In #21722 you mention that would need the GVL, but don't you need the GVL for object_id or EDIT: "your own weakref with" rb_gc_mark_weak anyway ? (I may be missing something)

In retrospect I was a bit confusing there -- The key missing bit is having something that's safe to use from NEWOBJ tracepoint, which is what WeakRef and WeakMap don't currently provide.

You're right that rb_gc_mark_weak or even the solution we use right now (_id2ref) both rely on having the GVL so what I wrote there wasn't entirely accurate. I'll tweak the text a bit.

Updated by ko1 (Koichi Sasada) 12 days ago Actions #20 [ruby-core:124010]

+1 for MOVEDOBJ (or MOVEOBJ like NEWOBJ/FREEOBJ. It should be called just before/after moving) for object profiling purpose.

I'm not sure how to "simulate weak references"...

Updated by jhawthorn (John Hawthorn) 12 days ago Actions #21 [ruby-core:124016]

I'm fine with adding MOVEOBJ, it makes sense, but I don't think it's currently necessary. ObjectSpace.trace_object_allocations handles this by subscribing both to NEWOBJ and FREEOBJ and tracking the objects in its own hash table (subscribing to FREEOBJ giving it weakref). It then has a mostly normal compact routine to move the objects https://github.com/ruby/ruby/blob/master/ext/objspace/object_tracing.c#L200-L236 (although the rb_gc_pointer_to_heap_p is odd, I believe that's only necessary after the object tracing has finished). I used the same technique in Vernier's object tracing and it seems to work fine with compaction.

I think this is a problem for Datadog's profiler vs the other users because it (to my understanding) does not subscribe to FREEOBJ and do its own bookkeeping. So it wants a way for Ruby to handle weakrefs for it (#21722)

To be clear, Ruby 3.4 is the only version of Ruby it has ever been somewhat safe to call rb_obj_id inside of NEWOBJ. In 3.3 and below if you call it enough you will eventually miss a write barrier and eventually crash. This was done to work around bugs in the ObjectSpace extension, not to provide guarantees for profilers or other consumers of this hook, and in Ruby 4.1 I hope/expect we'll return to the previous behaviour and rb_obj_id will again be unsafe inside of NEWOBJ.

Updated by ivoanjo (Ivo Anjo) 12 days ago Actions #22 [ruby-core:124023]

I'm fine with adding MOVEOBJ, it makes sense, but I don't think it's currently necessary. ObjectSpace.trace_object_allocations handles this by subscribing both to NEWOBJ and FREEOBJ and tracking the objects in its own hash table (subscribing to FREEOBJ giving it weakref). It then has a mostly normal compact routine to move the objects https://github.com/ruby/ruby/blob/master/ext/objspace/object_tracing.c#L200-L236 (although the rb_gc_pointer_to_heap_p is odd, I believe that's only necessary after the object tracing has finished). I used the same technique in Vernier's object tracing and it seems to work fine with compaction.

I think this is a problem for Datadog's profiler vs the other users because it (to my understanding) does not subscribe to FREEOBJ and do its own bookkeeping. So it wants a way for Ruby to handle weakrefs for it (#21722)

Very true! So to clarify, the main reason we took this approach is to try to minimize overhead impact on applications.

Having the NEWOBJ tracepoint active all the time in production does already by itself add overhead, and that's even with a as-simple-as-possible skip most objects check.

The addition of a FREEOBJ that for every single object needs to look up a hashtable is a cost that we were hoping to avoid if we can ๐Ÿ˜€

(We also had a smaller worry that FREEOBJ could sometimes be skipped and thus this would become a source of memory leaks, there were some comments in the VM code about it, but unclear if that's something that happens in practice or not)

In 3.3 and below if you call it enough you will eventually miss a write barrier and eventually crash.

I'd very much like to know more about this -- if you have the time/ability to share any details (even privately?) ๐Ÿ™

For what it's worth, we run a lot of tests with 3.3 and below and we've not spotted the issue so far so I wonder if we've just been lucky or there's something we can do to help there...

Updated by jhawthorn (John Hawthorn) 5 days ago Actions #23 [ruby-core:124118]

  • Status changed from Closed to Open

This continues to be broken (in a way that I hope helps illustrate that this was always unsafe). I think there's two issues.

Since the change was introduced we're regularly seeing failures on CI related to object_id and _id2ref. Here we can see that object_id is -1. I think this is because the GC assertions set all fields to -1

The second issue is that allocating objects inside of a NEWOBJ hook causes crashes, and setting an object_id may allocate a fields object. Allocating an object will run GC even if we've called rb_gc_disable_no_rest(). Though that prevents a GC due to malloc, to prevent a GC due to newobj we need to call the version that finishes incremental GC.

I adjusted the test to run a lot more times and it crashes consistently (even without any debug assertions)

    Bug.tracepoint_add_object_id do
      1_000.times do
        klass = Struct.new
        20_000.times { klass.new }

        klass = Struct.new(:a)
        20_000.times { klass.new }

        klass = Struct.new(:a, :b, :c)
        20_000.times { klass.new }

        20_000.times { Set.new } # To test T_DATA / TypedData RUBY_TYPED_EMBEDDABLE
        20_000.times { Proc.new { } } # To test T_DATA / TypedData non embeddable

        20_000.times { Object.new }
      end
    end

Continuing GC immediately after we've allocated an object's slot is unsafe because we haven't filled in the object's values, so the mark function can see garbage and other issues can happen.

On Ruby 3.4 this probably mostly worked because we would only allocate an object when calling rb_obj_id if on a 32-bit platform and we'd called it enough times to overflow. On Ruby 3.3 an below we made no effort to disable GC and so any xmalloc could have caused GC and triggered these types of bug.

I just don't think we should support this. It's always been broken and trying to make our GC match these constraints is too challenging and only more likely to cause and mask other bugs.

Updated by byroot (Jean Boussier) 5 days ago Actions #24 [ruby-core:124119]

I just don't think we should support this.

I said previously, I generally agree, but the datadog gem is quite popular, so I think we should try to find a short term solution if we can. At least that's what I tried to do.

Updated by jhawthorn (John Hawthorn) 5 days ago Actions #25 [ruby-core:124120]

It's a noble goal, but in my view it's negative to even temporarily support this as it invites new extensions to be badly behaved and misuse these hooks.

Updated by byroot (Jean Boussier) 5 days ago Actions #26 [ruby-core:124122]

Yeah... Either way, with the object_id now stored in imemo/fields, generating an object_id for anything other than T_OBJECT will inevitably allocate an IMEMO.

I'll remove the test case.

Updated by ivoanjo (Ivo Anjo) 5 days ago Actions #27 [ruby-core:124135]

byroot (Jean Boussier) wrote in #note-24:

I just don't think we should support this.

I said previously, I generally agree, but the datadog gem is quite popular, so I think we should try to find a short term solution if we can. At least that's what I tried to do.

To be clear, we are aware we're doing really shady weird things to get a bit less overhead / a few more features, etc and we don't expect upstream to have to make promises that those things keep working.

So if the answer is "you can't do this" (anymore?) while it'll affect Datadog and our customers until we find another way of doing it (and thanks John for the great suggestion in #21722) I think it's very fair and it's up to us downstream to find a solution or come up with a reasonable proposal that fits within the VM design.

Thus, if you want to close the ticket with "won't do", that's fine.

On Ruby 3.4 this probably mostly worked because we would only allocate an object when calling rb_obj_id if on a 32-bit platform and we'd called it enough times to overflow. On Ruby 3.3 an below we made no effort to disable GC and so any xmalloc could have caused GC and triggered these types of bug.

Ah, I think that helps explain why we haven't seen issues from here in our tests/in practice:

  1. For profiling in particular (not the rest of the datadog gem) we only support x86-64/arm64 linux
  2. If we do spot a bignum object id, we stop profiling immediately with an error

Updated by luke-gru (Luke Gruber) 4 days ago Actions #28

  • Status changed from Open to Closed
Actions

Also available in: PDF Atom