Bug #4612
closedSegmentation fault in fiber GC mark cycle
Description
=begin
((|Fiber.current|)) can cause segfault on GC cycle when used in threads. Please find attached ruby sample which should help to pinpoint the problem.
The coredump shows the following backtrace:
....
#18
#19 0x000000010004cce5 in mark_locations_array (objspace=0x100838000, x=0x101dc8ab0, n=649) at gc.c:1315
#20 0x000000010004cea6 in gc_mark_locations (objspace=0x100838000, start=0x101dc8ab0, end=0x101dc9f00) at gc.c:1331
#21 0x000000010004f3dc in rb_gc_mark_machine_stack (th=0x1008a1048) at gc.c:2235
#22 0x0000000100177a4f in rb_thread_mark (ptr=0x1008a1048) at vm.c:1683
#23 0x000000010018179a in cont_mark (ptr=0x1008a1000) at cont.c:88
#24 0x0000000100181947 in fiber_mark (ptr=0x1008a1000) at cont.c:168
#25 0x000000010004dbb9 in gc_mark_children (objspace=0x100838000, ptr=4303907200, lev=1) at gc.c:1719
#26 0x000000010004d3fb in gc_mark (objspace=0x100838000, ptr=4303907200, lev=0) at gc.c:1514
#27 0x000000010004d428 in rb_gc_mark (ptr=4303907200) at gc.c:1520
#28 0x000000010017797f in rb_thread_mark (ptr=0x10049fa00) at vm.c:1673
....
It looks like in frame 28 ((|rb_thread_mark|)) is called on terminated thread, which is ok, but then it goes down to fiber's continuation member ((|saved_thread|)) in frame 22. I think saved_thread holds an earlier snapshot (still running) of the same thread that we see in 28 (thread_id are equal), and because its machine stack pointers are stale, ((|rb_gc_mark_machine_stack|)) starts marking inaccessible memory.
One possible quick-fix is to set machine stack pointers to 0 in ((|cont_init()|)), given the original thread will take care of that stuff and free as needed. It cures segfaults, but I wonder if that doesn't break some other code.
I was also wondering why continuation holds a copy of thread struct instead of a pointer to it. It's hard to correctly follow real thread life cycle with ((|saved_thread|)). So can it harm in other cases like the above?
BTW, while following the code around fibers and continuations, I've found another curious thing: ((|fiber_free()|)) calls ((|cont_free(&fib->cont)|)) on its cont member, and ((|cont_free()|)) calls ((|ruby_xfree()|)) on it. Is that ok, given cont was allocated as part of fiber, and don't we need to ((|ruby_xfree|)) the ((|fib|)) itself?
The attached patch is made against ruby_1_9_2 branch; trunk seems to have the segfault behavior too.
=end
Files