Bug #13875

closed

segfault in Enumerable#zip after GC

Added by kernigh (George Koehler) over 6 years ago. Updated over 6 years ago.

Status: Closed
Assignee: -
Target version: -
ruby -v: ruby 2.5.0dev (2017-09-06 trunk 59764) [x86_64-openbsd6.1]
[ruby-core:82681]

Description

There is a chance of segmentation fault in Enumerable#zip after garbage collection. This script reproduces the crash.

GC.stress = true

up = 1.upto(10)
down = 10.downto(1)
up.zip(down) {|a, b| a + b == 11 or fail 'oops'}
$ ruby crash.rb                                                                
crash.rb:5: [BUG] Segmentation fault at 0x0000000000000000
ruby 2.5.0dev (2017-09-06 trunk 59764) [x86_64-openbsd6.1]

-- Control frame information -----------------------------------------------
c:0006 p:---- s:0023 e:000021 IFUNC 
c:0005 p:---- s:0019 e:000018 CFUNC  :upto
c:0004 p:---- s:0016 e:000015 CFUNC  :each
c:0003 p:---- s:0013 e:000012 CFUNC  :zip
c:0002 p:0045 s:0008 E:000590 EVAL   crash.rb:5 [FINISH]
c:0001 p:0000 s:0003 E:0004f0 (none) [FINISH]

-- Ruby level backtrace information ----------------------------------------
crash.rb:5:in `<main>'
crash.rb:5:in `zip'
crash.rb:5:in `each'
crash.rb:5:in `upto'

-- Other runtime information -----------------------------------------------

* Loaded script: crash.rb

* Loaded features:

    0 enumerator.so
    1 thread.rb
    2 rational.so
    3 complex.so
    4 /home/kernigh/prefix/lib/ruby/2.5.0/x86_64-openbsd6.1/enc/encdb.so
    5 /home/kernigh/prefix/lib/ruby/2.5.0/x86_64-openbsd6.1/enc/trans/transdb.so
    6 /home/kernigh/prefix/lib/ruby/2.5.0/x86_64-openbsd6.1/rbconfig.rb
    7 /home/kernigh/prefix/lib/ruby/2.5.0/rubygems/compatibility.rb
    8 /home/kernigh/prefix/lib/ruby/2.5.0/rubygems/defaults.rb
    9 /home/kernigh/prefix/lib/ruby/2.5.0/rubygems/deprecate.rb
   10 /home/kernigh/prefix/lib/ruby/2.5.0/rubygems/errors.rb
   11 /home/kernigh/prefix/lib/ruby/2.5.0/rubygems/version.rb
   12 /home/kernigh/prefix/lib/ruby/2.5.0/rubygems/requirement.rb
   13 /home/kernigh/prefix/lib/ruby/2.5.0/rubygems/platform.rb
   14 /home/kernigh/prefix/lib/ruby/2.5.0/rubygems/basic_specification.rb
   15 /home/kernigh/prefix/lib/ruby/2.5.0/rubygems/stub_specification.rb
   16 /home/kernigh/prefix/lib/ruby/2.5.0/rubygems/util/list.rb
   17 /home/kernigh/prefix/lib/ruby/2.5.0/x86_64-openbsd6.1/stringio.so
   18 /home/kernigh/prefix/lib/ruby/2.5.0/rubygems/specification.rb
   19 /home/kernigh/prefix/lib/ruby/2.5.0/rubygems/exceptions.rb
   20 /home/kernigh/prefix/lib/ruby/2.5.0/rubygems/dependency.rb
   21 /home/kernigh/prefix/lib/ruby/2.5.0/rubygems/core_ext/kernel_gem.rb
   22 /home/kernigh/prefix/lib/ruby/2.5.0/monitor.rb
   23 /home/kernigh/prefix/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb
   24 /home/kernigh/prefix/lib/ruby/2.5.0/rubygems.rb
   25 /home/kernigh/prefix/lib/ruby/2.5.0/rubygems/path_support.rb

[NOTE]
You may have encountered a bug in the Ruby interpreter or extension libraries.
Bug reports are welcome.
For details: http://www.ruby-lang.org/bugreport.html

Abort trap (core dumped) 
$ gdb ruby ruby.core
GNU gdb 6.3
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-unknown-openbsd6.1"...
Core was generated by `ruby'.
Program terminated with signal 6, Aborted.
Reading symbols from /usr/lib/libpthread.so.23.0...done.
Loaded symbols for /usr/lib/libpthread.so.23.0
Loaded symbols for /home/kernigh/prefix/bin/ruby
Reading symbols from /usr/lib/libm.so.10.0...done.
Loaded symbols for /usr/lib/libm.so.10.0
Symbols already loaded for /usr/lib/libpthread.so.23.0
Reading symbols from /usr/lib/libc.so.89.3...done.
Loaded symbols for /usr/lib/libc.so.89.3
Reading symbols from /usr/libexec/ld.so...done.
Loaded symbols for /usr/libexec/ld.so
Reading symbols from /home/kernigh/prefix/lib/ruby/2.5.0/x86_64-openbsd6.1/enc/encdb.so...done.
Loaded symbols for /home/kernigh/prefix/lib/ruby/2.5.0/x86_64-openbsd6.1/enc/encdb.so
Reading symbols from /home/kernigh/prefix/lib/ruby/2.5.0/x86_64-openbsd6.1/enc/trans/transdb.so...done.
Loaded symbols for /home/kernigh/prefix/lib/ruby/2.5.0/x86_64-openbsd6.1/enc/trans/transdb.so
Reading symbols from /home/kernigh/prefix/lib/ruby/2.5.0/x86_64-openbsd6.1/stringio.so...done.
Loaded symbols for /home/kernigh/prefix/lib/ruby/2.5.0/x86_64-openbsd6.1/stringio.so
#0  0x00001fdd990d45fa in thrkill () at {standard input}:5
5       {standard input}: No such file or directory.
        in {standard input}
(gdb) bt
#0  0x00001fdd990d45fa in thrkill () at {standard input}:5
#1  0x00001fdd9912a549 in *_libc_abort ()
    at /usr/src/lib/libc/stdlib/abort.c:52
#2  0x00001fdaa5b0abe1 in rb_bug_context (ctx=0x1fdcb77b1c80, 
    fmt=0x1fdaa5c5a905 "Segmentation fault at %p") at ../ruby/error.c:511
#3  0x00001fdaa5a01bc6 in sigsegv (sig=11, info=0x1fdcb77b1d70, 
    ctx=0x1fdcb77b1c80) at ../ruby/signal.c:932
#4  <signal handler called>
#5  rb_id_table_lookup (tbl=0x0, id=152, valp=0x7f7ffffdfd30) at id_table.c:131
#6  0x00001fdaa5a85f54 in vm_respond_to (th=0x1fdd2e063800, 
    klass=35034836138920, obj=35034836139360, id=3137, priv=1)
    at vm_method.c:182
#7  0x00001fdaa5a877a6 in rb_check_funcall_default (recv=35034836139360, 
    mid=3137, argc=0, argv=0x0, def=52) at vm_eval.c:347
#8  0x00001fdaa5974198 in rb_check_convert_type_with_id (val=35034836139360, 
    type=7, tname=0x1fdaa5c3b9ad "Array", method=3137) at ../ruby/object.c:2891
#9  0x00001fdaa5a70251 in vm_callee_setup_block_arg (th=0x1fdd2e063800, 
    calling=0x7f7ffffdff60, ci=Variable "ci" is not available.
) at vm_insnhelper.c:2563
#10 0x00001fdaa5a7dcd9 in rb_yield_force_blockarg (values=Variable "values" is not available.
)
    at vm_insnhelper.c:2626
#11 0x00001fdaa5afd562 in zip_i (val=Variable "val" is not available.
) at ../ruby/enum.c:59
#12 0x00001fdaa5a6b18f in vm_yield_with_cfunc (th=0x1fdd2e063800, 
    captured=0x1fdda3995f88, self=35035751942480, argc=1, argv=0x7f7ffffe0120, 
---Type <return> to continue, or q <return> to quit---
    block_handler=Variable "block_handler" is not available.
) at vm_insnhelper.c:2532
#13 0x00001fdaa5a7d8e1 in rb_yield (val=Variable "val" is not available.
) at ../ruby/vm.c:1057
#14 0x00001fdaa595fc61 in int_upto (from=3, to=21) at ../ruby/numeric.c:4884
#15 0x00001fdaa5a7fbfe in vm_call0_body (th=0x1fdd2e063800, calling=Variable "calling" is not available.
)
    at vm_eval.c:86
#16 0x00001fdaa5a8ab0c in iterate_method (obj=Variable "obj" is not available.
) at vm_eval.c:59
#17 0x00001fdaa5a73053 in rb_iterate0 (
    it_proc=0x1fdaa5a8aa10 <iterate_method>, data1=140187732411696, ifunc=0x0, 
    th=0x1fdd2e063800) at vm_eval.c:1129
#18 0x00001fdaa5a734eb in rb_block_call (obj=Variable "obj" is not available.
) at vm_eval.c:1161
#19 0x00001fdaa5a7fbfe in vm_call0_body (th=0x1fdd2e063800, calling=Variable "calling" is not available.
)
    at vm_eval.c:86
#20 0x00001fdaa5a8ab0c in iterate_method (obj=Variable "obj" is not available.
) at vm_eval.c:59
#21 0x00001fdaa5a73053 in rb_iterate0 (
    it_proc=0x1fdaa5a8aa10 <iterate_method>, data1=140187732412640, 
    ifunc=0x1fdd2ef67d88, th=0x1fdd2e063800) at vm_eval.c:1129
#22 0x00001fdaa5a734eb in rb_block_call (obj=Variable "obj" is not available.
) at vm_eval.c:1161
#23 0x00001fdaa5afacb0 in enum_zip (argc=1, argv=Variable "argv" is not available.
) at ../ruby/enum.c:2664
#24 0x00001fdaa5a6d681 in vm_call_cfunc_with_frame (th=0x1fdd2e063800, 
    reg_cfp=0x1fdda3995fa0, calling=Variable "calling" is not available.
) at vm_insnhelper.c:1903
#25 0x00001fdaa5a8c164 in vm_call_method_each_type (th=0x1fdd2e063800, 
    cfp=0x1fdda3995fa0, calling=0x7f7ffffe0e70, ci=0x1fdd52984f70, 
    cc=0x1fdda51bc178) at vm_insnhelper.c:1919
---Type <return> to continue, or q <return> to quit---
#26 0x00001fdaa5a8d4ae in vm_call_general (th=0x1fdd2e063800, 
    reg_cfp=0x1fdda3995fa0, calling=0x7f7ffffe0e70, ci=0x1fdd52984f70, 
    cc=0x1fdda51bc178) at vm_insnhelper.c:2367
#27 0x00001fdaa5a76701 in vm_exec_core (th=0x1fdd2e063800, initial=Variable "initial" is not available.
)
    at insns.def:789
#28 0x00001fdaa5a7b505 in vm_exec (th=0x1fdd2e063800) at ../ruby/vm.c:1793
#29 0x00001fdaa590319d in ruby_exec_internal (n=0x1fdd658c9990)
    at ../ruby/eval.c:246
#30 0x00001fdaa59073e0 in ruby_run_node (n=Variable "n" is not available.
) at ../ruby/eval.c:310
#31 0x00001fdaa59015b4 in main (argc=2, argv=0x7f7ffffe11f8)
    at ../ruby/main.c:42
Current language:  auto; currently asm
(gdb) 

Related issues 1 (0 open, 1 closed)

Related to Ruby master - Bug #13887: test/ruby/test_io.rb may get stuck with FIBER_USE_NATIVE=0 on Linux (Closed, ko1 (Koichi Sasada))

Updated by kernigh (George Koehler) over 6 years ago

The problem is with VALUE tmp; in zip_i() in enum.c. The garbage collector frees tmp too early. I tried to protect it with RB_GC_GUARD(tmp), but this doesn't fix the bug: Ruby still crashes.

diff --git a/enum.c b/enum.c
index 4613ab733c..bca63dab5e 100644
--- a/enum.c
+++ b/enum.c
@@ -2593,6 +2593,7 @@ zip_i(RB_BLOCK_CALL_FUNC_ARGLIST(val, memoval))
     }
 
     RB_GC_GUARD(args);
+    RB_GC_GUARD(tmp);
 
     return Qnil;
 }

I removed my RB_GC_GUARD(tmp) and added a trick with rb_ivar_set() to make a reference from another Ruby object to tmp. This seems to prevent the bug. Ruby doesn't crash.

diff --git a/enum.c b/enum.c
index 4613ab733c..5e5c50e37d 100644
--- a/enum.c
+++ b/enum.c
@@ -2568,6 +2568,7 @@ zip_i(RB_BLOCK_CALL_FUNC_ARGLIST(val, memoval))
     int i;
 
     tmp = rb_ary_new2(RARRAY_LEN(args) + 1);
+    rb_ivar_set(args, rb_intern("@tmp"), tmp);
     rb_ary_store(tmp, 0, rb_enum_values_pack(argc, argv));
     for (i=0; i<RARRAY_LEN(args); i++) {
        if (NIL_P(RARRAY_AREF(args, i))) {

But this trick might not be the correct fix. I fear a problem with fibers, because there is a fiber switch when zip_i() calls Enumerator#next. Perhaps the GC can't find tmp while the other fiber is running.
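
To illustrate the fiber switch mentioned above, here is a rough Ruby sketch of what zip does with a non-Array argument. zip_sketch is a hypothetical helper written for this illustration, not Ruby's implementation; the real logic is the C code in enum.c. In CRuby, Enumerator#next runs the enumerator's each inside a Fiber, so every call to next from within the receiver's block switches fibers.

```ruby
# Hypothetical helper (not part of Ruby): approximates zip with one
# non-Array argument, using external enumeration like zip_i() does.
def zip_sketch(receiver, other)
  other_enum = other.to_enum(:each)
  result = []
  receiver.each do |value|
    partner = begin
      other_enum.next          # resumes the enumerator's internal Fiber
    rescue StopIteration
      nil                      # the argument ran out of elements
    end
    result << [value, partner] # plays the role of tmp in zip_i()
  end
  result
end

pairs = zip_sketch(1.upto(10), 10.downto(1))
pairs.each { |a, b| a + b == 11 or fail 'oops' }
```

The array built inside the block is only referenced from the stack of the fiber running the block, which is why correct marking of each fiber's machine stack matters here.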

Updated by normalperson (Eric Wong) over 6 years ago

wrote:

https://bugs.ruby-lang.org/issues/13875#change-66516

But this trick might not be the correct fix. I fear a problem
with fibers, because there is a fiber switch when zip_i()
calls Enumerator#next. Perhaps the GC can't find tmp while the
other fiber is running.

Is FIBER_USE_NATIVE enabled in cont.c?
Does the problem go away if you flip that?

Also, which compiler + non-standard CFLAGS do you use?

We don't have a lot of OpenBSD users here; +cc Jeremy...

Does this affect older versions of Ruby, too? We've had
some recent movement in cont.c in trunk and maybe broke
something in Fiber stack marking...

I'm also curious which RB_GC_GUARD implementation you
use (it's compiler-dependent). The rb_gc_guarded_ptr_val
one in include/ruby/ruby.h should be strongest since it can't
be inlined (but slowest). Perhaps try that one if you're
not using it...

Thanks.

Updated by kernigh (George Koehler) over 6 years ago

In reply to Eric Wong:

FIBER_USE_NATIVE is 0. Flipping it to 1 causes compiler errors; OpenBSD doesn't have <ucontext.h>.

I'm using the system gcc, which is gcc (GCC) 4.2.1 20070719. I configured Ruby with ../ruby/configure --prefix=$HOME/prefix --with-baseruby=ruby23 and didn't add any extra CFLAGS. I have two other compilers (a newer gcc and clang), but I have not tried them with Ruby.

I forgot to try older Ruby versions. I have Ruby 2.3 and 2.4 from OpenBSD packages. My script doesn't reproduce the crash in 2.3 or 2.4, so the bug is only in trunk.

$ ruby24 -v
ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-openbsd]
$ ruby23 -v
ruby 2.3.3p222 (2016-11-21 revision 56859) [x86_64-openbsd]

Ruby was using the __GNUC__ version of RB_GC_GUARD. I edited include/ruby/ruby.h so that it uses the rb_gc_guarded_ptr_val version, and put my RB_GC_GUARD(tmp) back in zip_i(). My script still crashes Ruby, so rb_gc_guarded_ptr_val doesn't fix this bug.

Updated by jeremyevans0 (Jeremy Evans) over 6 years ago

normalperson (Eric Wong) wrote:

wrote:

https://bugs.ruby-lang.org/issues/13875#change-66516

But this trick might not be the correct fix. I fear a problem
with fibers, because there is a fiber switch when zip_i()
calls Enumerator#next. Perhaps the GC can't find tmp while the
other fiber is running.

Is FIBER_USE_NATIVE enabled in cont.c?

No. It is set to 0 on OpenBSD.

Does the problem go away if you flip that?

It doesn't compile, as ucontext.h is not available on OpenBSD:

cont.c:68:10: fatal error: 'ucontext.h' file not found

Does this affect older versions of Ruby, too? We've had
some recent movement in cont.c in trunk and maybe broke
something in Fiber stack marking...

I can't reproduce on OpenBSD-current or OpenBSD 6.1 using:

ruby 2.2.7p470 (2017-03-28 revision 58194) [x86_64-openbsd]
ruby 2.3.4p301 (2017-03-30 revision 58214) [x86_64-openbsd]
ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-openbsd]

Note that OpenBSD-current uses clang 4.0.0 as the system compiler, as opposed to OpenBSD 6.1 and previous versions, which use gcc 4.2.1.

I can reproduce on both OpenBSD-current and OpenBSD 6.1 using:

ruby 2.5.0dev (2017-09-06 trunk 59764) [x86_64-openbsd]

I'm also curious which RB_GC_GUARD implementation you
use (it's compiler-dependent). The rb_gc_guarded_ptr_val
one in include/ruby/ruby.h should be strongest since it can't
be inlined (but slowest). Perhaps try that one if you're
not using it...

OpenBSD defaults to the first branch (rb_gc_guarded_ptr), even when compiling with clang. Forcing it to use last branch (rb_gc_guarded_ptr_val) with the following diff still results in the same segfault using the code provided by kernigh:

--- a/include/ruby/ruby.h
+++ b/include/ruby/ruby.h
@@ -534,7 +534,7 @@ static inline int rb_type(VALUE obj);
        ((type) == RUBY_T_FLOAT) ? RB_FLOAT_TYPE_P(obj) : \
        (!RB_SPECIAL_CONST_P(obj) && RB_BUILTIN_TYPE(obj) == (type)))
 
-#ifdef __GNUC__
+#ifndef __GNUC__
 #define RB_GC_GUARD(v) \
     (*__extension__ ({ \
        volatile VALUE *rb_gc_guarded_ptr = &(v); \

Updated by normalperson (Eric Wong) over 6 years ago

Thanks, I can reproduce the bug on GNU/Linux with:

--- a/cont.c
+++ b/cont.c
@@ -57,6 +57,7 @@
 #  define FIBER_USE_NATIVE 1
 # endif
 #endif
+#undef FIBER_USE_NATIVE
 #if !defined(FIBER_USE_NATIVE)
 #define FIBER_USE_NATIVE 0
 #endif

Now, I'm testing the following patch:

https://80x24.org/spew/20170907193559.27639-1-e@80x24.org/raw

And I no longer get segfaults with the new test.

However, test/ruby/test_io.rb seems stuck when FIBER_USE_NATIVE is 0
on my system...

Updated by kernigh (George Koehler) over 6 years ago

Jeremy Evans wrote:

Note that OpenBSD-current uses clang 4.0.0 as the system compiler, as opposed to OpenBSD 6.1 and previous versions, which use gcc 4.2.1.

Thanks for testing with clang. You showed that the bug wasn't only with gcc 4.2.1.

Eric Wong wrote:

https://80x24.org/spew/20170907193559.27639-1-e@80x24.org/raw

This patch also prevents the segfault for me.

However, test/ruby/test_io.rb seems stuck when FIBER_USE_NATIVE is 0 on my system...

This file (test/ruby/test_io.rb) and a few other tests usually get stuck in OpenBSD. The cause is a bug that I reported to OpenBSD (fifo plus threads equals stuck: https://marc.info/?l=openbsd-bugs&m=146276089610123&w=2). I have local edits to those tests so they fail and don't get stuck when I run make test-all or make test-spec.

You can see my local edits here:
https://gist.github.com/kernigh/5770f8b90427ce6ede535dae729cb960

Your patch, with my OpenBSD machine, didn't cause any more tests (in test/ruby/test_io.rb or elsewhere) to get stuck. If you run Linux, you probably don't have the OpenBSD bug. So your stuck test might be different from my stuck test. You might have found a bug in Ruby that happens in GNU/Linux but I can't reproduce in OpenBSD.

Updated by jeremyevans0 (Jeremy Evans) over 6 years ago

kernigh (George Koehler) wrote:

Eric Wong wrote:

However, test/ruby/test_io.rb seems stuck when FIBER_USE_NATIVE is 0 on my system...

This file (test/ruby/test_io.rb) and a few other tests usually get stuck in OpenBSD. The cause is a bug that I reported to OpenBSD (fifo plus threads equals stuck: https://marc.info/?l=openbsd-bugs&m=146276089610123&w=2). I have local edits to those tests so they fail and don't get stuck when I run make test-all or make test-spec.

Some additional background: the OpenBSD ports for ruby also skip this test. The fifo pthread fdlock bug has been in OpenBSD probably since it moved from userland threads to kernel threads, and there has been a failing regress test for it since 2012 (https://github.com/openbsd/src/blob/master/regress/lib/libpthread/blocked_fifo/blocked_fifo.c). There was an attempt to fix it (https://github.com/openbsd/src/commit/4ca9b96f0bca4f64040c5f77f0c29ccfac8bd418#diff-3701716ce89e506e5b445acbe4095ee6), but it was backed out shortly after being committed due to regressions (https://github.com/openbsd/src/commit/4185654479fabb05682e85a51de78cbd2fa8dc5c#diff-3701716ce89e506e5b445acbe4095ee6).

If you don't want to skip the test in test_io.rb, the workaround is fairly simple:

-          open("fifo", "r") {|r|
+          open("fifo", "r+") {|r|
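
As a minimal sketch of the difference the "r+" mode makes, the snippet below opens a FIFO read-write from a single thread. It assumes a platform where File.mkfifo is available and where opening a FIFO read-write is supported (Linux and the BSDs allow it; POSIX leaves the behavior undefined). With "r", the open would wait for a writer; with "r+", the same descriptor holds both ends, so the open returns immediately. The temporary directory and the "ping" payload are made up for the illustration and are not taken from test_io.rb.

```ruby
require 'tmpdir'

line = Dir.mktmpdir do |dir|
  path = File.join(dir, 'fifo')
  File.mkfifo(path)

  # "r+" opens the FIFO for reading and writing, so no second
  # process or thread is needed for the open to complete.
  File.open(path, 'r+') do |io|
    io.write("ping\n")
    io.flush
    io.gets # reads back the line written above
  end
end

puts line
```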

Your patch, with my OpenBSD machine, didn't cause any more tests (in test/ruby/test_io.rb or elsewhere) to get stuck. If you run Linux, you probably don't have the OpenBSD bug. So your stuck test might be different from my stuck test. You might have found a bug in Ruby that happens in GNU/Linux but I can't reproduce in OpenBSD.

I also tested the patch using OpenBSD-current with clang 4.0.0, and it fixes the issue here too.

Updated by normalperson (Eric Wong) over 6 years ago

Thank you both for the extra info, I think there's a different
bug for FIBER_USE_NATIVE=0 on my GNU/Linux system...

Anyways, for this segfault here is an updated v2 patch (for
r59776) which I'll commit soonish:

https://80x24.org/spew/20170908062817.GA9144@dcvr/raw

I'll try to work on tracking down the test_io.rb stuckage with
FIBER_USE_NATIVE=0 tomorrow. It seems to affect 2.4.1, even.

Updated by Anonymous over 6 years ago

  • Status changed from Open to Closed

Applied in changeset trunk|r59785.


fiber: fix machine stack marking when FIBER_USE_NATIVE is 0

  • cont.c (cont_mark): mark Fiber machine stack correctly when
    FIBER_USE_NATIVE is 0
  • test/ruby/test_fiber.rb (test_mark_fiber): new test
    [Bug #13875] [ruby-core:82681]

This bug appears to be introduced with r59557.
("refactoring Fiber status")
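
A minimal sketch of how the fix could be checked outside the test suite, assuming only the reproduction script from the description: run it in a child interpreter and look at the exit status. This is not the test_mark_fiber test added by the changeset, whose contents are not shown in this thread.

```ruby
require 'rbconfig'

# Reproduction script from the bug description.
script = <<~'RUBY'
  GC.stress = true

  up = 1.upto(10)
  down = 10.downto(1)
  up.zip(down) {|a, b| a + b == 11 or fail 'oops'}
RUBY

# system returns true only when the child exits with status 0; a
# segfault, an uncaught 'oops', or a failure to start gives false or nil.
ok = system(RbConfig.ruby, '-e', script)
puts(ok ? 'no crash' : 'crashed or failed')
```

Running the script in a separate process keeps a segfault from taking down the process that performs the check.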

Updated by wanabe (_ wanabe) over 6 years ago

  • Related to Bug #13887: test/ruby/test_io.rb may get stuck with FIBER_USE_NATIVE=0 on Linux added