ruby_vm_at_exit can sometimes cause a crash.
This behavior has been seen erratically, but one of our users got it to reproduce almost systematically. We didn't manage to work out what was special about his system that made the crash reproduce so reliably.
Here's one of the reports:
The current workaround to that one (alongside a few other comments) is done here: https://github.com/grpc/grpc/pull/5337/files
Note that removing the call to ruby_vm_at_exit makes everything load fine. Also note that the removed comment from that pull request is wrong: this has been happening on versions of Ruby other than 2.0.
It's interesting to note from the backtrace information that this is happening during a garbage collection. The fact that a garbage collection has to happen at that exact moment is probably the reason the bug is so difficult to reproduce. A modified version of Ruby, or very specific garbage collector settings, might help reproduce it.
The fault address (0x88) seems to indicate that a NULL pointer into a struct was being dereferenced.
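To illustrate that reasoning, here is a minimal sketch of why a small fault address such as 0x88 usually points at a member access through a NULL struct pointer. The struct below is hypothetical: its layout is invented so that one member lands at offset 0x88, and it is not a real Ruby structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct, NOT a real Ruby type: the layout is invented so
 * that one member sits at offset 0x88. */
struct example_page {
    char header[0x88];    /* 136 bytes of leading fields */
    uintptr_t mark_bits;  /* member located at offset 0x88 */
};

/* Check the invented layout at compile time. */
static_assert(offsetof(struct example_page, mark_bits) == 0x88,
              "mark_bits is expected at offset 0x88");

/* Address the CPU would touch when reading p->mark_bits for a given base
 * pointer value. This is arithmetic only; nothing is dereferenced. */
static uintptr_t member_address(uintptr_t base)
{
    return base + offsetof(struct example_page, mark_bits);
}
```

With a base of 0, member_address returns 0x88, which is the address a fault handler would report for such an access.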
Disassembling the corresponding execution address seems to point at a crash inside obj_info, from the first line of gc_writebarrier_incremental, but this is after a very quick inspection of the code, so don't take my word for it.
This problem has been reported to us on at least Ruby 2.0.0, 2.2.0, 2.2.3, and 2.3.0.
- vm_core.h (rb_vm_struct): make at_exit a single linked list but not RArray, not to mark the registered functions by the write barrier. based on the patches by Evan Phoenix. [Bug #12095]
merge revision(s) 54484: [Backport #12095]
* vm_core.h (rb_vm_struct): make at_exit a single linked list but not RArray, not to mark the registered functions by the write barrier. based on the patches by Evan Phoenix. [Bug #12095]
#1 [ruby-core:74312] Updated by evanphx (Evan Phoenix) about 1 year ago
I'm hitting this as well, and looking over the code in question on 2.3.0, I'm wondering if the problem is that the at_exit pseudo-object is actually allocated within the body of rb_vm_t. Its address is taken and passed to rb_ary_push, which performs OBJ_WRITE. That's where wb_incremental is invoked from.
Because the mark bits are not located with the object header anymore, the mark bitmap is consulted but the position in the mark bitmap is calculated against the address of at_exit, which isn't located on the main ruby heap at all!
The path to the bad pointer, given X as the address of at_exit within rb_vm_t is: RVALUE_BLACK_P(X) => RVALUE_MARKED(X) => RVALUE_MARK_BITMAP(X) => GET_HEAP_MARK_BITS(X) => GET_HEAP_PAGE(X) => GET_PAGE_HEADER(X) => GET_PAGE_BODY(X) => ((struct heap_page_body *)((bits_t)(x) & ~(HEAP_ALIGN_MASK))).
The value returned by the above sequence is supposed to be a page header that can itself be dereferenced to find the mark bits. But because at_exit is in a random place, the page header is basically random bytes, and thus the dereference crashes.
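The last step of that chain can be sketched in isolation. This is only an illustration: the SKETCH_ names are made up, and the 16 KiB page size (an alignment log of 14) is an assumption about gc.c in these versions rather than something taken from the report.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: heap pages are 16 KiB aligned (alignment log of 14). */
#define SKETCH_HEAP_ALIGN_LOG  14
#define SKETCH_HEAP_ALIGN_MASK ((((uintptr_t)1) << SKETCH_HEAP_ALIGN_LOG) - 1)

/* The mask must have the form 2^n - 1 for the trick below to work. */
static_assert((SKETCH_HEAP_ALIGN_MASK & (SKETCH_HEAP_ALIGN_MASK + 1)) == 0,
              "mask must be a power of two minus one");

/* Same shape as GET_PAGE_BODY(x): clear the low bits of an address to get
 * the start of the aligned block that contains it. */
static uintptr_t sketch_page_body(uintptr_t addr)
{
    return addr & ~SKETCH_HEAP_ALIGN_MASK;
}
```

The computation succeeds for any address at all. It never checks that the address belongs to a GC heap page, so for a field embedded in a malloc'ed struct it yields the start of some unrelated 16 KiB block, and whatever bytes sit there get interpreted as a page header.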
#2 [ruby-core:74319] Updated by evanphx (Evan Phoenix) about 1 year ago
Attached is a patch that fixes this issue by replacing the troublesome usage of a VALUE to store the at_exit functions with a simple linked list. This patch was created against the ruby_2_3 branch. It should apply cleanly to most branches because of its small size.
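For readers without the attachment, the general idea can be sketched as follows. This is not the patch itself: every sketch_ name, the void *ctx argument, and the run order are assumptions made for illustration. The point is that the hooks live in plain malloc'ed nodes, so registering one never goes through OBJ_WRITE or the write barrier.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical names throughout; a sketch of the technique only. */
typedef void sketch_at_exit_func(void *ctx);

struct sketch_at_exit_node {
    sketch_at_exit_func *func;
    struct sketch_at_exit_node *next;
};

struct sketch_vm {
    struct sketch_at_exit_node *at_exit; /* list head, NULL when empty */
};

/* Push a hook at the head of the list. Returns 0 on success and -1 if
 * the allocation fails. No GC-managed object is involved. */
static int sketch_vm_at_exit(struct sketch_vm *vm, sketch_at_exit_func *func)
{
    struct sketch_at_exit_node *node;

    assert(func != NULL);
    node = malloc(sizeof(*node));
    if (node == NULL) return -1;
    node->func = func;
    node->next = vm->at_exit;
    vm->at_exit = node;
    return 0;
}

/* Run the hooks, most recently registered first, freeing each node. */
static void sketch_vm_run_at_exit(struct sketch_vm *vm, void *ctx)
{
    struct sketch_at_exit_node *node = vm->at_exit;

    vm->at_exit = NULL;
    while (node != NULL) {
        struct sketch_at_exit_node *next = node->next;
        node->func(ctx);
        free(node);
        node = next;
    }
}

/* Two example hooks operating on an int passed through ctx. */
static void sketch_hook_add_one(void *ctx) { *(int *)ctx += 1; }
static void sketch_hook_double(void *ctx)  { *(int *)ctx *= 2; }
```

Registering sketch_hook_add_one and then sketch_hook_double, and running the list on an int that starts at 1, doubles first and then adds one, because the list is walked from the most recent registration.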
#3 [ruby-core:74324] Updated by nobu (Nobuyoshi Nakada) about 1 year ago
Thank you for the investigation and the patch, I've missed this.
- free the list in
- use the argument
- replace the existing
struct, as multiple
typedefs are not allowed in C, IIRC.
#7 Updated by nobu (Nobuyoshi Nakada) 12 months ago
- Status changed from Open to Closed