Project

General

Profile

Bug #12095

ruby_vm_at_exit can sometime cause a crash.

Added by nicolasnoble (Nicolas Noble) about 1 year ago. Updated 11 months ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:73908]

Description

This behavior has been seen erratically, but one of our users got it to reproduce almost systematically. We didn't managed to understand what made his system special that it would get that crash to reproduce so well.

Here's one of the reports:

https://gist.github.com/blowmage/7ebe774039013bc8c990

The current workaround to that one (alongside a few other comments) is done here: https://github.com/grpc/grpc/pull/5337/files

Note that removing the call to ruby_vm_at_exit makes everything load fine. Also note that the removed comment from that pull request is wrong: this has been happening on versions of Ruby other than 2.0.

It's interesting to note from the backtrace information that this is happening during a garbage collection. The fact that a garbage collection happens at that exact moment is probably the reason that bug is so difficult to reproduce. Perhaps a modified version of ruby might help reproducing it. Or very specific garbage collector settings.

The fault address (0x88) seems to indicate that a NULL pointer into a struct was being dereferenced.

Disassembling the corresponding execution address seems to point at a crash inside obj_info, from the first line of gc_writebarrier_incremental, but this is after a very quick inspection of the code, so don't take my word from it.

This problem has been repoted to us on Ruby 2.0.0, Ruby 2.2.0, Ruby 2.2.3, Ruby 2.3.0, at least.

at_exit_fix.patch View (2.08 KB) evanphx (Evan Phoenix), 03/15/2016 02:47 AM

at_exit_fix.patch View (2.18 KB) evanphx (Evan Phoenix), 03/15/2016 04:21 PM

Associated revisions

Revision 54484
Added by nobu (Nobuyoshi Nakada) 12 months ago

at_exit list

  • vm_core.h (rb_vm_struct): make at_exit a single linked list but not RArray, not to mark the registered functions by the write barrier. based on the patches by Evan Phoenix. [Bug #12095]

Revision 54633
Added by naruse (Yui NARUSE) 11 months ago

merge revision(s) 54484: [Backport #12095]

* vm_core.h (rb_vm_struct): make at_exit a single linked list but
  not RArray, not to mark the registered functions by the write
  barrier.  based on the patches by Evan Phoenix.
   [Bug #12095]

Revision 54683
Added by usa (Usaku NAKAMURA) 11 months ago

merge revision(s) 54484: [Backport #12095]

* vm_core.h (rb_vm_struct): make at_exit a single linked list but
  not RArray, not to mark the registered functions by the write
  barrier.  based on the patches by Evan Phoenix.
   [Bug #12095]

History

#1 [ruby-core:74312] Updated by evanphx (Evan Phoenix) about 1 year ago

I'm hitting this as well, and looking over the code in question on 2.3.0, I wondering if the problem is that the at_exit pseudo-object is actually allocated within the body of rb_vm_t. It's address is taken and passed to rb_ary_push, which perform OBJ_WRITE. That's where wb_incremental is invoked from.

Because the mark bits are not located with the object header anymore, the mark bitmap is consulted but the position in the mark bitmap is calculated against the address of at_exit, which isn't located on the main ruby heap at all!

The path to the bad pointer, given X as the address of at_exit within rb_vm_t is: RVALUE_BLACK_P(X) => RVALUE_MARKED(X) => RVALUE_MARK_BITMAP(X) => GET_HEAP_MARK_BITS(X) => GET_HEAP_PAGE(X) => GET_PAGE_HEADER(X) => GET_PAGE_BODY(X) => ((struct heap_page_body *)((bits_t)(x) & ~(HEAP_ALIGN_MASK))).

The value returned by that above sequence is supposed to return a page header that can itself be dereferenced to find the mark bits. But because the at_exit is in a random place, the page header is basically random bytes, and thus the deference crashes.

#2 [ruby-core:74319] Updated by evanphx (Evan Phoenix) about 1 year ago

Attached is a patch that fixes this issue by replacing the troublesome usage of a VALUE to store the at_exit functions with a simple linked list. This patch was created against the ruby_2_3 branch. It should apply cleanly to most branches because of it's small size.

#3 [ruby-core:74324] Updated by nobu (Nobuyoshi Nakada) about 1 year ago

Thank you for the investigation and the patch, I've missed this.

you should:

  • free the list in ruby_vm_run_at_exit_hooks,
  • use the argument vm instead of GET_VM(), and
  • replace the existing typedef of rb_vm_t with mere struct, as multiple typedefs are not allowed in C, IIRC.

#4 [ruby-core:74342] Updated by evanphx (Evan Phoenix) about 1 year ago

Thank you for the feedback nobu!

Attached is an updated version of the patch with the changes.

#5 [ruby-core:74796] Updated by naruse (Yui NARUSE) 12 months ago

  • Backport changed from 2.0.0: UNKNOWN, 2.1: UNKNOWN, 2.2: UNKNOWN, 2.3: UNKNOWN to 2.1: REQUIRED, 2.2: REQUIRED, 2.3: REQUIRED

#6 [ruby-core:74806] Updated by nobu (Nobuyoshi Nakada) 12 months ago

at_exit functions should be called in the inverse order, LIFO.

#7 Updated by nobu (Nobuyoshi Nakada) 12 months ago

  • Status changed from Open to Closed

Applied in changeset r54484.


at_exit list

  • vm_core.h (rb_vm_struct): make at_exit a single linked list but not RArray, not to mark the registered functions by the write barrier. based on the patches by Evan Phoenix. [Bug #12095]

#8 [ruby-core:74999] Updated by naruse (Yui NARUSE) 11 months ago

  • Backport changed from 2.1: REQUIRED, 2.2: REQUIRED, 2.3: REQUIRED to 2.1: REQUIRED, 2.2: REQUIRED, 2.3: DONE

ruby_2_3 r54633 merged revision(s) 54484.

#9 [ruby-core:75091] Updated by usa (Usaku NAKAMURA) 11 months ago

  • Backport changed from 2.1: REQUIRED, 2.2: REQUIRED, 2.3: DONE to 2.1: REQUIRED, 2.2: DONE, 2.3: DONE

ruby_2_2 r54683 merged revision(s) 54484.

Also available in: Atom PDF