Project

General

Profile

Bug #16288

Segmentation fault with finalizers, threads

Added by davidw (David Welton) 16 days ago. Updated 3 days ago.

Status:
Open
Priority:
Normal
Assignee:
-
Target version:
-
ruby -v:
ruby 2.6.6p116 (2019-10-02 revision 67825) [x86_64-linux]
[ruby-core:95620]

Description

Hi,

This is a tricky one and I am still working on narrowing it down, but I will report what I have so far.

I compiled a version of 2_6_6 from github: ruby 2.6.6p116 (2019-10-02 revision 67825) [x86_64-linux]

I have a minimal Rails project that uses Mongoid. It crashes with a segmentation fault when rspec runs. The concurrent ruby gem is in some way involved, and I have been posting there: https://github.com/ruby-concurrency/concurrent-ruby/issues/808

However, I think there is a deeper problem - I would not expect a user level script to cause a segmentation fault.

I have been putting a lot of debugging statements in, and turned on Thread.DEBUG, and have noticed some things. I am not experienced with Ruby's internals, so some of these bits of data might be normal or irrelevant:

  • The concurrent-ruby gem uses ObjectSpace.define_finalizer to set a finalizer
  • That finalizer creates a new Thread
  • However, it appears as if that thread is running after the main thread is already dead, so code that expects to reference the main thread crashes, because it's a NULL reference.

I tried the following test code:

class Foo
  def initialize
    ObjectSpace.define_finalizer(self, proc do
                                   Foo.foo_finalizer
                                 end)
  end

  def bar
    puts 'bar'
  end

  def Foo.foo_finalizer
    puts "foo_finalizer"
    t = Thread.new do
      puts "Thread reporting for duty"
    end
    puts "foo_finalizer thread launched"
    sleep 5
  end
end

f = Foo.new
f.bar
f = nil

While trying to develop a simple test case to demonstrate the problem. It triggers rb_raise(rb_eThreadError, "can't alloc thread"); in thread_s_new, because it looks like the main thread has already been marked as 'killed' in this case. When I check the main thread status in thread_s_new with the above code, it reports 'dead'.

When I run my rspec code in the sample Rails project, thread_s_new shows the main thread's status as 'run' even if it should be dead?

I have seen some debugging things that shows some exceptions and thread_join interrupts and so on.

Is it possible that something like this is happening?

Main thread starts doing a cleanup, and gets an exception or something that generates an interrupt, and its KILLED status gets reset to RUNNABLE

Then, in the finalizer, it starts creating a Thread, but at this point the main thread actually does get killed, and when that finalizer thread tries to run it runs into a null reference?

I can provide the Rails sample project if needs be.

Sorry if any of the above isn't clear; I've been staring at the C code for several hours and am a bit cross-eyed!

Thank you for any insights.

History

Updated by davidw (David Welton) 16 days ago

A little bit more data:

I added some more debugging statements, and I found that the main thread goes from being marked with the THREAD_KILLED status to THREAD_STOPPED_FOREVER

This appears to happen in thread_join_sleep in thread.c.

After this happens, it would be possible for new threads to be started in thread_s_new, although I'm not sure why the check there happens after the rb_thread_alloc call.

Updated by mame (Yusuke Endoh) 15 days ago

Thank you for the report and the great investigation! I could reproduce the issue by using your example: https://github.com/mainameiz/segfault_app

Currently, starting a thread in a finalizer is dangerous. The termination of the interpreter is: (1) kill all threads except the main thread, (2) run all finalizers, and (3) destruct all (including all mutexes, the main thread, timer thread, VM itself, etc.). If a finalizer creates a thread, it starts running during or after (3), which leads to a catastrophic situation.

An easy solution is to prohibit thread creation after the process (3) is started. The following patch fixes the segfault of your example.

diff --git a/thread.c b/thread.c
index eff5d39b51..fc609907ef 100644
--- a/thread.c
+++ b/thread.c
@@ -833,6 +833,11 @@ thread_create_core(VALUE thval, VALUE args, VALUE (*fn)(void *))
    rb_raise(rb_eThreadError,
         "can't start a new thread (frozen ThreadGroup)");
     }
+    if (current_th->vm->main_thread->status == THREAD_KILLED) {
+        rb_warn("can't start a new thread after the main thread has stopped");
+        rb_raise(rb_eThreadError,
+                 "can't start a new thread (the main thread has already terminated)");
+    }

     if (fn) {
         th->invoke_type = thread_invoke_type_func;

By this patch, your example ends gracefully.

$ bundle exec rspec spec/models/user_spec.rb
/home/mame/work/ruby/local/lib/ruby/gems/2.7.0/gems/mongoid-7.0.5/lib/mongoid.rb:104: warning: The last argument is used as the keyword parameter
/home/mame/work/ruby/local/lib/ruby/gems/2.7.0/gems/activesupport-6.0.0/lib/active_support/core_ext/module/delegation.rb:171: warning: for `delegate' defined here
config.eager_load is set to nil. Please update your config/environments/*.rb files accordingly:

  * development - set it to false
  * test - set it to false (unless you use a tool that preloads your test environment)
  * production - set it to true

/home/mame/work/ruby/local/lib/ruby/gems/2.7.0/gems/tzinfo-1.2.5/lib/tzinfo/ruby_core_support.rb:142: warning: The last argument is used as the keyword parameter
/home/mame/work/ruby/local/lib/ruby/gems/2.7.0/gems/tzinfo-1.2.5/lib/tzinfo/ruby_core_support.rb:142: warning: The last argument is used as the keyword parameter
/home/mame/work/ruby/local/lib/ruby/gems/2.7.0/gems/actionpack-6.0.0/lib/action_dispatch/middleware/stack.rb:37: warning: The last argument is used as the keyword parameter
/home/mame/work/ruby/local/lib/ruby/gems/2.7.0/gems/actionpack-6.0.0/lib/action_dispatch/middleware/static.rb:110: warning: for `initialize' defined here
.

Finished in 0.00977 seconds (files took 1.22 seconds to load)
1 example, 0 failures

/home/mame/work/ruby/local/lib/ruby/2.7.0/timeout.rb:85: warning: can't start a new thread on a finalizer
/home/mame/work/ruby/local/lib/ruby/2.7.0/timeout.rb:85: warning: can't start a new thread on a finalizer

However, as the last two lines show, Timeout cannot be used safely in a finalizer. I'm unsure if it is acceptable, but to support thread creation in a finalizer, we need to revamp the termination process. ko1 (Koichi Sasada) and nobu (Nobuyoshi Nakada), what do you think?

davidw (David Welton) I could be wrong as I don't understand your statement about thread_join, but I couldn't see the behavior by running your example under gdb. Anyways, thanks for the great investigation. It is really helpful.

Updated by davidw (David Welton) 13 days ago

mame (Yusuke Endoh) wrote:

Thank you for the report and the great investigation! I could reproduce the issue by using your example: https://github.com/mainameiz/segfault_app

Thanks for checking in to it.

Currently, starting a thread in a finalizer is dangerous. The termination of the interpreter is: (1) kill all threads except the main thread, (2) run all finalizers, and (3) destruct all (including all mutexes, the main thread, timer thread, VM itself, etc.). If a finalizer creates a thread, it starts running during or after (3), which leads to a catastrophic situation.

Yes, I think there are multiple parts to this bug. It makes sense to me that starting threads in a finalizer like the concurrenty-ruby gem is doing is not correct.

An easy solution is to prohibit thread creation after the process (3) is started. The following patch fixes the segfault of your example.

Yes, that prevents the segfault in my test application as well.

However, as the last two lines show, Timeout cannot be used safely in a finalizer. I'm unsure if it is acceptable, but to support thread creation in a finalizer, we need to revamp the termination process. ko1 (Koichi Sasada) and nobu (Nobuyoshi Nakada), what do you think?

davidw (David Welton) I could be wrong as I don't understand your statement about thread_join, but I couldn't see the behavior by running your example under gdb. Anyways, thanks for the great investigation. It is really helpful.

Apologies, I was just kind of guessing at what was going on - I am not at all familiar with Ruby's internals! I think the interaction is kind of complex, as you can see from the simple code I posted in the first bug report, which only tries to launch a thread in a finalizer - and fails. It's strange that the timeout.rb code is able to successfully create a thread.

Thank you for your prompt response and looking into the problem.

Updated by davidw (David Welton) 12 days ago

modified   lib/timeout.rb
@@ -94,7 +94,7 @@ def timeout(sec, klass = nil, message = nil)   #:yield: +sec+
       ensure
         if y
           y.kill
-          y.join # make sure y is dead.
+          # y.join # make sure y is dead.
         end
       end
     end

This also stops the segfault from happening.

In my test app, the call to timeout seems to be coming from the Mongo gem:

      def connect!
        Timeout.timeout(options[:connect_timeout], Error::SocketTimeoutError) do
          socket.setsockopt(IPPROTO_TCP, TCP_NODELAY, 1)
          handle_errors { socket.connect(::Socket.pack_sockaddr_in(port, host)) }
          self
        end
      end

When I comment out that join, I get the same error message as with my simple example I pasted in the original report:

Exception `ThreadError' at /home/davidw/.rvm/gems/ruby-2.6.5/gems/concurrent-ruby-1.1.5/lib/concurrent/atomic/ruby_thread_local_var.rb:87 - can't alloc thread

I think that join is somehow marking the main thread as being in some state other than KILLED...?

Updated by davidw (David Welton) 3 days ago

This code pretty reliably produces a segmentation fault on my machine:

require 'timeout'

Thread.DEBUG = 1

class Foo
  def initialize
    ObjectSpace.define_finalizer(self, proc do
                                   Foo.foo_finalizer
                                 end)
  end

  def bar
    puts 'foo'
  end

  def Foo.foo_finalizer
    STDERR.puts "finalizing a Foo"
    Thread.new do
      sleep 5
      STDERR.puts "finalizing foo thread done"
    end
  end
end

class Bar
  def initialize
    ObjectSpace.define_finalizer(self, proc do
                                   Bar.bar_finalizer
                                 end)
  end

  def foo
    puts 'foo'
  end

  def Bar.bar_finalizer
    Timeout::timeout(2) do
      100.times do
        f = Foo.new
        f.bar
      end
    end
  end
end

b = Bar.new
b.foo

Written like that, it looks weird, but the fact that it's appearing for a lot of people is because there is some combination of finalizers and threads being used by, I think, the mongo, mongoid and concurrent-ruby gems, which is why a number of people have reported problems in the github issue I linked in the initial report. And I would guess that a larger number are simply seeing the segmentation fault and not knowing where to file a report.

Something that I think is important here is the Timeout. That runs a join against the main thread, and I think (I am not familiar with this code!) that is tickling thread_join_sleep, which does th->status = THREAD_STOPPED_FOREVER;, so that then in thread_s_new, it doesn't raise the "can't alloc thread" exception, because it's in a SLEEP_FOREVER state.

The previously mentioned patch that checks the main thread status in thread_create_core would likely break any code that uses Timeout in a finalizer.

Does that make sense?

Also available in: Atom PDF