Bug #19041

closed

Weakref is still alive after major garbage collection

Added by parker (Parker Finch) over 1 year ago. Updated about 1 year ago.

Status: Closed
Assignee: -
Target version: -
ruby -v: ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-darwin21]
[ruby-core:110218]

Description

I am able to get into an infinite loop waiting for garbage collection to take a WeakRef.

Reproduction Process

The following script prints a "0", then a "1", and then hangs forever. I expect it to keep printing numbers.

require "weakref"

iterations = 0

loop do
  print "\r#{iterations}"
  obj = WeakRef.new(Object.new)
  GC.start while obj.weakref_alive?
  iterations += 1
end

Ruby Version

I have tested this on Ruby 3.1.2, 3.1.0, 3.0.4, 3.0.0, 2.7.6, and 2.7.0 on macOS. All exhibit this behavior.

Further Investigation

Sleeping

Sleeping before the garbage collection allows the loop to continue. The below exhibits the expected behavior:

require "weakref"

iterations = 0

loop do
  print "\r#{iterations}"
  obj = WeakRef.new(Object.new)
  (sleep(0.5); GC.start) while obj.weakref_alive?
  iterations += 1
end

However, sleeping after the garbage collection still shows the buggy behavior (loop hangs):

require "weakref"

iterations = 0

loop do
  print "\r#{iterations}"
  obj = WeakRef.new(Object.new)
  (GC.start; sleep(0.5)) while obj.weakref_alive?
  iterations += 1
end

Running Garbage Collection Multiple Times

Explicitly running garbage collection multiple times allows the loop to continue. This has the expected behavior: numbers continue to be printed.

require "weakref"

iterations = 0

loop do
  print "\r#{iterations}"
  obj = WeakRef.new(Object.new)
  while obj.weakref_alive?
    GC.start
    GC.start
    GC.start
  end
  iterations += 1
end

However, with certain rubies, running those garbage collection calls in a times block prevents even a single iteration from completing. The following prints only "0" with ruby 3.0.4 on macOS, ruby 2.7.6 on macOS, and ruby 3.1.2 on linux (ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux] on a virtual machine). It shows the expected behavior on ruby 3.1.2 on macOS.

require "weakref"

iterations = 0

loop do
  print "\r#{iterations}"
  obj = WeakRef.new(Object.new)
  3.times { GC.start } while obj.weakref_alive?
  iterations += 1
end

Files

manifest_weakref_issue.rb (1.52 KB): Script to reproduce the issue. parker (Parker Finch), 02/03/2023 02:49 PM

Related issues: 1 (0 open, 1 closed)

Related to Ruby master - Bug #19460: Class not able to be garbage collected (Closed)

Updated by byroot (Jean Boussier) over 1 year ago

I don't think this is a bug per se. The Ruby GC is conservative. That means it goes over the whole stack in search of potential references to objects, and marks them.

As a result, it can happen that an object ref stays in an unused saved register and prevents an object from being collected.

Updated by jeremyevans0 (Jeremy Evans) over 1 year ago

  • Status changed from Open to Closed

Updated by parker (Parker Finch) over 1 year ago

Thanks @byroot (Jean Boussier)! I think this could be considered a bug in the documentation, since the docs for WeakRef imply that a WeakRef should be collected after a garbage collection. Perhaps we could call this corner-case out?

I'm also curious to learn more about this case. (I'm unfamiliar with Ruby's use of registers and how that interacts with live objects and garbage collection.) It seems like calling the weakref_alive? method is continually forcing the object ref into a register, and sleeping after calling that method gives time for the register to clear. Is that understanding correct? (I'm surprised that calling a method on the WeakRef object prevents the underlying object from being collected, since shouldn't that underlying one be collected even though the WeakRef itself still has a reference? Does the method call put the underlying object ref in a register?)

Is there a more reliable/direct way to get rid of the reference than sleeping?

One aspect of this where I'm still confused is why the loop given to reproduce this issue completes an iteration before hanging. What is different on the first iteration that allows this to succeed?

Updated by chrisseaton (Chris Seaton) over 1 year ago

The documentation could be clearer, but also note that this isn't in any way specific to Ruby - I would say that this is expected behaviour for a managed language. A weak-ref may be cleared if no other references exist. That should be the extent of the guarantee offered.
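
A minimal sketch of what that guarantee means for calling code, assuming the caller treats a cleared reference as a normal outcome instead of relying on when collection happens. The helper name with_referent and its fallback parameter are hypothetical; WeakRef#__getobj__ and WeakRef::RefError come from the standard weakref library.

```ruby
require "weakref"

# Hypothetical helper: yields the referent if it is still reachable through
# the WeakRef, otherwise returns the fallback value.
def with_referent(ref, fallback = nil)
  # __getobj__ raises WeakRef::RefError once the referent has been collected.
  yield ref.__getobj__
rescue WeakRef::RefError
  fallback
end

target = String.new("hello") # strong reference keeps the referent alive
ref = WeakRef.new(target)
length = with_referent(ref, 0) { |s| s.length }
```

Rescuing RefError around the actual use avoids the gap between a weakref_alive? check and a later method call, during which the referent could still be collected.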

Updated by tenderlovemaking (Aaron Patterson) over 1 year ago

parker (Parker Finch) wrote in #note-3:

Thanks @byroot (Jean Boussier)! I think this could be considered a bug in the documentation, since the docs for WeakRef imply that a WeakRef should be collected after a garbage collection. Perhaps we could call this corner-case out?

I'm also curious to learn more about this case. (I'm unfamiliar with Ruby's use of registers and how that interacts with live objects and garbage collection.)

Ruby's garbage collector is conservative. Ruby objects that are allocated inside of C code must be kept alive. Let's look at a simple example:

void neat_function(void) {
    VALUE list = rb_ary_new();
    rb_gc_start();
    rb_ary_push(list, Qnil);
}

The above C code is compiled into machine code, but the array's life span is managed by the garbage collector. How can the garbage collector ensure that the array stays alive even after the call to rb_gc_start()? We humans can clearly see that the array is used in the C code, but the GC cannot read the C code. In fact there is no C code for the GC because it's all machine code now! So how can the GC keep the reference alive? It will scan the machine registers as well as the stack memory looking for addresses that might be Ruby objects. The C compiler will probably have generated machine code that puts a reference to the local variable list in either a register or stack memory (there are cases where this doesn't happen, and we have to deal with it manually. See RB_GC_GUARD).

The GC will look at the values stored in the machine registers, as well as any values in stack memory, then check if those values are within the bounds of Ruby's GC heap memory. If the address is inside the bounds, then the GC will consider the object to be alive. The GC cannot know if a pointer stored in a machine register will ever be used again, so it takes a conservative approach and keeps the reference alive.

This conservative approach can lead to the behavior that you are seeing with the weak reference: a value that nobody is actually using or referencing is kept alive because the GC can't know that fact for sure. The reference may or may not stay alive, but it depends on what machine code has executed, if the value is in the stack, if any registers have been overwritten, etc.
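
Because the reference may or may not be cleared by any given GC run, code that waits for a WeakRef to die is safer with a bounded number of attempts than with an open-ended loop. A minimal sketch, assuming a hypothetical helper cleared_after_gc? with a made-up max_attempts parameter; it does not make collection more likely, it only guarantees termination.

```ruby
require "weakref"

# Hypothetical helper: runs GC up to max_attempts times and reports whether
# the WeakRef's referent was cleared. It never loops forever, even if a stale
# stack slot or register keeps the referent marked.
def cleared_after_gc?(ref, max_attempts: 5)
  max_attempts.times do
    return true unless ref.weakref_alive?
    GC.start
  end
  !ref.weakref_alive?
end

keep = Object.new          # strong reference, so the referent cannot be cleared
pinned = WeakRef.new(keep)
pinned_cleared = cleared_after_gc?(pinned)

# No strong reference here; the result depends on what the GC happens to find.
maybe_cleared = cleared_after_gc?(WeakRef.new(Object.new))
```

In this sketch pinned_cleared is false because keep still references the object, while maybe_cleared can be either true or false, which is the behavior described above.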

I hope this helps.

Updated by parker (Parker Finch) over 1 year ago

Thanks for that explanation @tenderlovemaking (Aaron Patterson), it helps and I truly appreciate it!

One misunderstanding I had was that I was thinking about this in terms of the Ruby VM. But it seems like garbage collection actually occurs down at the machine level (which makes much more sense now that I think about it) and that's why we're dealing with registers. (And the stack we're talking about is the C stack and not the Ruby VM stack.)

The recommendation to take a look at RB_GC_GUARD was helpful as well, that's a great comment there.

I'm still curious why calling #weakref_alive? on the WeakRef seems to put the underlying Object (that the WeakRef delegates to) in a register or on the stack. But the fact that this is happening so close to the actual machine makes it seem like it would be tricky to figure out.

Anyway, I'll keep learning more about how memory management works, thank you for the info here! I think the docs are fine as-is, so it makes sense to me to close this one.

Thank you all for your time and explanations!

Updated by tenderlovemaking (Aaron Patterson) over 1 year ago

parker (Parker Finch) wrote in #note-6:

I'm still curious why calling #weakref_alive? on the WeakRef seems to put the underlying Object (that the WeakRef delegates to) in a register or on the stack. But the fact that this is happening so close to the actual machine makes it seem like it would be tricky to figure out.

That method may not be putting the object in a register. Something else may have put it in a register or in the stack, and it just happens that no other machine code has overwritten the register or stack memory. If you dump the heap (ObjectSpace.dump_all), you'll probably see one of the roots (probably VM?) pointing at the object. Unfortunately the heap dump won't tell you how it found the reference, just that the reference exists. You could find whether it's a register or stack memory by adding some debugging code to the GC or by tracing the machine code via lldb.
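
A rough sketch of that heap-dump search, assuming the dump is small enough to hold in a String. The helper name referrers_of is hypothetical; ObjectSpace.dump and ObjectSpace.dump_all are from the objspace extension, and the "address" and "references" keys are the ones that appear in its JSON output.

```ruby
require "objspace"
require "json"

# Hypothetical helper: returns the parsed heap-dump entries (objects or roots)
# whose "references" list contains the address of obj.
def referrers_of(obj)
  address = JSON.parse(ObjectSpace.dump(obj)).fetch("address")
  dump = ObjectSpace.dump_all(output: :string)

  dump.each_line.filter_map do |line|
    next unless line.include?(address) # cheap pre-filter before parsing
    entry = begin
      JSON.parse(line)
    rescue JSON::ParserError
      next # skip any line that is not valid JSON
    end
    entry if Array(entry["references"]).include?(address)
  end
end

target = Object.new
holder = [target] # a known strong reference, for illustration only
referrers = referrers_of(target)
referrers.each { |e| puts "#{e['type']} #{e['address'] || e['root']}" }
```

Entries of type ROOT carry a "root" name instead of an address. As noted above, the dump only says that a reference exists, not whether it came from a register or from stack memory.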

It might be nice if ObjectSpace.dump_all could indicate whether the reference came from the stack or machine registers as I've also tried to figure that out. But it is work. 😅

Updated by parker (Parker Finch) over 1 year ago

tenderlovemaking (Aaron Patterson) wrote in #note-7:

That method may not be putting the object in a register. Something else may have put it in a register or in the stack, and it just happens that no other machine code has overwritten the register or stack memory.

There's some evidence that the weakref_alive? method is putting it in a register or the stack. Running garbage collection immediately after calling weakref_alive? will fail to collect the underlying object. But if there's a sleep between the weakref_alive? and running garbage collection then the garbage collection will succeed in collecting the underlying object.

To test if it was the weakref_alive? call itself that was causing the issue I ran a few different scenarios:

# This version does not manifest the issue. (It makes it through two iterations
# and terminates.)

require "weakref"

iterations = 0

while iterations < 2
  print "\r#{iterations}"
  obj = WeakRef.new(Object.new)
  while obj.weakref_alive?
    # Sleep to give registers a chance to clear.
    sleep(0.5)
    GC.start
  end
  iterations += 1
end

# This version does manifest the issue. (It gets stuck in the inner loop and
# never terminates.)
require "weakref"

iterations = 0

while iterations < 2
  print "\r#{iterations}"
  obj = WeakRef.new(Object.new)
  while obj.weakref_alive?
    # Sleep to give registers a chance to clear.
    sleep(0.5)

    # Call the `WeakRef#weakref_alive?` method to see if that causes the issue
    # to manifest. (It does, GC does _not_ clear out the underlying Object after
    # this.)
    obj.weakref_alive?

    GC.start
  end
  iterations += 1
end

# This version does not manifest the issue. (It makes it through two iterations
# and terminates.)

require "weakref"

iterations = 0

while iterations < 2
  print "\r#{iterations}"
  obj = WeakRef.new(Object.new)
  while obj.weakref_alive?
    # Sleep to give registers a chance to clear.
    sleep(0.5)

    # Reference the WeakRef object to see if that causes the issue to
    # manifest. (It does not, GC still clears out the underlying Object here.)
    obj

    GC.start
  end
  iterations += 1
end

# This version does not manifest the issue. (It makes it through two iterations
# and terminates.)

require "weakref"

iterations = 0

while iterations < 2
  print "\r#{iterations}"
  obj = WeakRef.new(Object.new)
  while obj.weakref_alive?
    # Sleep to give registers a chance to clear.
    sleep(0.5)

    # Call another method on the WeakRef object to see if that causes the issue
    # to manifest. (It does not, GC still clears out the underlying Object
    # here.)
    obj.object_id

    GC.start
  end
  iterations += 1
end

Sorry for the wall of code there — the summary is that the issue only seems to manifest when the weakref_alive? method is called immediately before garbage collecting.

The fact that the behavior is predictable in those different scenarios makes me think that the weakref_alive? method is doing something that adds a reference to the underlying Object to a register or the stack. Is there another explanation for the behavior there that I'm missing?


If you dump the heap (ObjectSpace.dump_all), you'll probably see one of the roots (probably VM?) pointing at the object. Unfortunately the heap dump won't tell you how it found the reference, just that the reference exists. You could find whether it's a register or stack memory by adding some debugging code to the GC or by tracing the machine code via lldb.

Thanks @tenderlovemaking (Aaron Patterson)! I didn't know about ObjectSpace.dump_all. I'll try exploring those options to see if I can pin down how it's finding the reference to the Object. Heads up that it will likely take me a while since I'm not yet familiar with C and lldb.

Updated by parker (Parker Finch) about 1 year ago

Hi @tenderlovemaking (Aaron Patterson)! I'm having difficulty interpreting the results of the ObjectSpace dump and I'm hoping you can help.

I've adjusted the script to print out the address of the underlying object, and then (when the issue manifests) print all lines from ObjectSpace.dump_all that match that address. The code is attached, here's some example output:

Ruby version: 3.3.0
Iteration: 0
Object address: 0x1051cd788
Inner iterations: 1
Iteration: 1
Object address: 0x105205ae8
Inner iterations: 1
Inner iterations: 2
Inner iterations: 3
{"address":"0x105205ae8", "type":"OBJECT", "shape_id":5, "slot_size":40, "class":"0x1029bfe80", "embedded":true, "ivars":0, "memsize":40, "flags":{"wb_protected":true, "old":true, "uncollectible":true, "marked":true}}
{"address":"0x10520da90", "type":"STRING", "shape_id":0, "slot_size":40, "class":"0x1029beda0", "embedded":true, "bytesize":11, "value":"0x105205ae8", "encoding":"UTF-8", "coderange":"7bit", "memsize":40, "flags":{"wb_protected":true, "old":true, "uncollectible":true, "marked":true}}

In that example, the underlying object was at 0x105205ae8. But as far as I can tell, there's nothing else that points at it. (The other object there is the String used to hold that address.) I would have expected that, if nothing was referencing it, it would be collected by GC.

One interesting tidbit is that just calling ObjectSpace.dump_all prevents the issue from manifesting. Is it possible that something was referencing the object address, then running dump_all caused that reference to be removed?

Updated by byroot (Jean Boussier) about 1 year ago

  • Related to Bug #19460: Class not able to be garbage collected added