Project

General

Profile

Bug #15499

Breaking behavior on ruby 2.6: rb_thread_call_without_gvl doesn't invoke unblock_function when used on the main thread

Added by apolcyn (alex polcyn) about 1 year ago. Updated about 1 year ago.

Status:
Closed
Priority:
Normal
Target version:
-
[ruby-core:90865]

Description

This issue was noticed when trying to add ruby 2.6 support to the "grpc" ruby gem (this gem is a native C-extension), and was caught by a unit test.

There are several APIs on the grpc ruby gem (https://github.com/grpc/grpc/tree/master/src/ruby) that invoke "rb_thread_call_without_gvl" on the current thread, doing a blocking operation in the "without gvl" callback and cancel that blocking operation in the "unblocking function". These APIs work in ruby versions prior to ruby 2.6 (e.g. ruby 2.5), but have problems when used on ruby 2.6

Minimal repro:

My system:

> lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 9.6 (stretch)
Release:    9.6
Codename:   stretch

> ruby -v
ruby 2.6.0p0 (2018-12-25 revision 66547) [x86_64-linux

# I installed ruby 2.6.0 with rvm - https://rvm.io/rvm/install

> GRPC_CONFIG=dbg gem install grpc --platform ruby # build grpc gem from source with debug symbols

ruby script, "repro.rb" that looks like this:

require 'grpc'

ch = GRPC::Core::Channel.new('localhost:1234', {}, :this_channel_is_insecure)
ch.watch_connectivity_state(ch.connectivity_state, Time.now + 360)

Run "ruby repro.rb" with an interactive shell, and it will hang there. At this point, ctrlC the process, and it will not terminate.
What should happen is this unblocking func should be invoked: https://github.com/grpc/grpc/blob/master/src/ruby/ext/grpc/rb_channel.c#L354, but as seen with logging or debuggers, that unblocking func is never ran. Thus the blocking operation never completes and the main thread is stuck.

When the same repro.rb is ran on e.g. ruby 2.5.3 or ruby 2.4.1, the blocking operation is unblocked and the process terminates, as expected, when sending it a SIGINT.

Also note that if the blocking operation is put in a background thread, e.g. with this script:

require 'grpc'

th = Thread.new do
  ch = GRPC::Core::Channel.new('localhost:1234', {}, :this_channel_is_insecure)
  ch.watch_connectivity_state(ch.connectivity_state, Time.now + 360)
end
th.join

then "unblocking" functions will in fact be invoked upon sending the process a SIGINT, so this looks like a problem specifically with rb_thread_call_without_gvl being used on the main thread.

Please let me know and I can provide more details or alternative repro cases.

Thanks in advance.

Updated by normalperson (Eric Wong) about 1 year ago

https://bugs.ruby-lang.org/issues/15499

Probably my fault with timer-thread elimination :<

I think we'll need to dynamically enable timer-thread
for certain unblock functions.

Can you try compiling with "cppflags=-DUBF_TIMER=2" ?

Updated by normalperson (Eric Wong) about 1 year ago

Can you try compiling with "cppflags=-DUBF_TIMER=2" ?

Nevermind, both need fixing...

Updated by apolcyn (alex polcyn) about 1 year ago

Thanks for the quick look! I didn't get a chance to try out UBF_TIMER=2 before your last comment, but let me know if there's something else to try.

Updated by normalperson (Eric Wong) about 1 year ago

apolcyn@google.com wrote:

Thanks for the quick look! I didn't get a chance to try out
UBF_TIMER=2 before your last comment, but let me know if
there's something else to try.

Sorry for the breakage. For now, you can workaround this
breakage by spawning a do-nothing thread to handle signals:

Thread.new { sleep }

I'm slowly working on a permanent fix which won't increase
overhead for the majority of use cases.

#5

Updated by normalperson (Eric Wong) about 1 year ago

  • Status changed from Open to Closed

Applied in changeset trunk|r66708.


thread.c (call_without_gvl): spawn thread for UBF iff single-threaded

We need another native thread to call some unblocking functions
which aren't RUBY_UBF_IO or RUBY_UBF_PROCESS. Instead of a
permanent thread in <= 2.5, we can now rely on the thread cache
feature to perform interrupts.

[ruby-core:90865] [Bug #15499]

Updated by normalperson (Eric Wong) about 1 year ago

  • Assignee set to ko1 (Koichi Sasada)
  • Status changed from Closed to Assigned

ko1: can you check the rb_nogvl C-API introduced in r66712?
Thanks.

Updated by normalperson (Eric Wong) about 1 year ago

Eric Wong normalperson@yhbt.net wrote:

apolcyn@google.com wrote:

Thanks for the quick look! I didn't get a chance to try out
UBF_TIMER=2 before your last comment, but let me know if
there's something else to try.

Sorry for the breakage. For now, you can workaround this
breakage by spawning a do-nothing thread to handle signals:

r66708 should fix the breakage in your particular case.
Again, deeply sorry for the breakage.

Thread.new { sleep }

I'm slowly working on a permanent fix which won't increase
overhead for the majority of use cases.

So r66708 uses a short-lived thread and thread-cache to avoid
permanent overhead.

r66712 introduces an experimental new API (rb_nogvl) which
allows C-API users to fire some unblock functions in signal
handlers. It won't affect grpc's case, but it affects zlib and
bignum in core, at least.

It turns out zlib and bignum had the same problem you encountered;
but I missed them.

#8

Updated by nagachika (Tomoyuki Chikanaga) about 1 year ago

  • Backport changed from 2.4: UNKNOWN, 2.5: UNKNOWN, 2.6: UNKNOWN to 2.4: DONTNEED, 2.5: DONTNEED, 2.6: REQUIRED

Updated by dentarg (Patrik Ragnarsson) about 1 year ago

After switching to Ruby 2.6.0 for our application on Heroku, we have been getting R12 - Exit timeout errors after each deploy. The application is using Puma in cluster mode. While we haven't tried to reproduce this on Heroku, one of our tests starts our application just to see that it can boot, and we noticed that with 2.6.0, the TERM signal wasn't enough to stop the application. Our test now sends KILL after a while.

I've tried to do a minimal repro of the situation at https://github.com/dentarg/gists/tree/master/gists/ruby-bug-15499. I've sampled the hanged processes (that starts eating ~100% CPU) with Activity Monitor, see the txt files, maybe that can give some clue.

Updated by normalperson (Eric Wong) about 1 year ago

Issue #15499 has been updated by dentarg (Patrik Ragnarsson).

Patrik: can you try r66708? (git 9e66910b3bd85de32e95cf019ed285a36aecfd9e)
It was committed before your comment.

Sorry, I barely have time for Ruby, anymore.

Updated by dentarg (Patrik Ragnarsson) about 1 year ago

Patrik: can you try r66708? (git 9e66910b3bd85de32e95cf019ed285a36aecfd9e)

Eric: Sorry, I should have mentioned it in my comment. I also tried r66716 (ruby 2.7.0dev (2019-01-05 trunk 66716) [x86_64-darwin18]), which can be seen at GitHub. The behaviour was the same as with 2.6.0.

Do you want me to try r66708 specifically and not the commits after? I can do that if so.

#12

Updated by ko1 (Koichi Sasada) about 1 year ago

  • Description updated (diff)

Updated by ko1 (Koichi Sasada) about 1 year ago

Sorry I didn't check this ticket (because it seems difficult).

r66708

I'm not sure why another thread is needed. could you explain about it?

r66712

rb_nogvl seems not good name.
Maybe Microsoft will name it rb_thread_call_without_gvl_ex().

I'm not sure the requirement of RB_NOGVL_UBF_ASYNC_SAFE because I can't understand why r66708 is needed.

     else if (ubf && vm_living_thread_num(th->vm) == 1) {
-        ubf_th = rb_thread_start_unblock_thread();
+        if (RB_NOGVL_UBF_ASYNC_SAFE) {
+            th->vm->ubf_async_safe = 1;
+        }
+        else {
+            ubf_th = rb_thread_start_unblock_thread();
+        }
     }

The condition is always true and maybe r66708 is disabled.
No test?

Updated by ko1 (Koichi Sasada) about 1 year ago

I recommend to backport only r66708 for ruby 2.6.1 now.
Honestly speaking I don't understand why it is needed, but eric should know more so I want to believe.

r66712 has several problem so we need discuss more.

#15

Updated by naruse (Yui NARUSE) about 1 year ago

  • Status changed from Assigned to Closed

Applied in changeset ruby_2_6|r66940.


merge revision(s) 66708: [Backport #15499]

thread.c (call_without_gvl): spawn thread for UBF iff single-threaded

We need another native thread to call some unblocking functions
which aren't RUBY_UBF_IO or RUBY_UBF_PROCESS.  Instead of a
permanent thread in <= 2.5, we can now rely on the thread cache
feature to perform interrupts.

[ruby-core:90865] [Bug #15499]

Updated by naruse (Yui NARUSE) about 1 year ago

  • Backport changed from 2.4: DONTNEED, 2.5: DONTNEED, 2.6: REQUIRED to 2.4: DONTNEED, 2.5: DONTNEED, 2.6: DONE

ruby_2_6 r66940 merged revision(s) 66708.

Also available in: Atom PDF