Project

General

Profile

Actions

Bug #15499

open

Breaking behavior on ruby 2.6: rb_thread_call_without_gvl doesn't invoke unblock_function when used on the main thread

Added by apolcyn (alex polcyn) over 2 years ago. Updated 4 months ago.

Status:
Assigned
Priority:
Normal
Target version:
-
[ruby-core:90865]

Description

This issue was noticed when trying to add ruby 2.6 support to the "grpc" ruby gem (this gem is a native C-extension), and was caught by a unit test.

There are several APIs on the grpc ruby gem (https://github.com/grpc/grpc/tree/master/src/ruby) that invoke "rb_thread_call_without_gvl" on the current thread, doing a blocking operation in the "without gvl" callback and cancel that blocking operation in the "unblocking function". These APIs work in ruby versions prior to ruby 2.6 (e.g. ruby 2.5), but have problems when used on ruby 2.6

Minimal repro:

My system:

> lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 9.6 (stretch)
Release:    9.6
Codename:   stretch

> ruby -v
ruby 2.6.0p0 (2018-12-25 revision 66547) [x86_64-linux

# I installed ruby 2.6.0 with rvm - https://rvm.io/rvm/install

> GRPC_CONFIG=dbg gem install grpc --platform ruby # build grpc gem from source with debug symbols

ruby script, "repro.rb" that looks like this:

require 'grpc'

ch = GRPC::Core::Channel.new('localhost:1234', {}, :this_channel_is_insecure)
ch.watch_connectivity_state(ch.connectivity_state, Time.now + 360)

Run "ruby repro.rb" with an interactive shell, and it will hang there. At this point, ctrlC the process, and it will not terminate.
What should happen is this unblocking func should be invoked: https://github.com/grpc/grpc/blob/master/src/ruby/ext/grpc/rb_channel.c#L354, but as seen with logging or debuggers, that unblocking func is never ran. Thus the blocking operation never completes and the main thread is stuck.

When the same repro.rb is ran on e.g. ruby 2.5.3 or ruby 2.4.1, the blocking operation is unblocked and the process terminates, as expected, when sending it a SIGINT.

Also note that if the blocking operation is put in a background thread, e.g. with this script:

require 'grpc'

th = Thread.new do
  ch = GRPC::Core::Channel.new('localhost:1234', {}, :this_channel_is_insecure)
  ch.watch_connectivity_state(ch.connectivity_state, Time.now + 360)
end
th.join

then "unblocking" functions will in fact be invoked upon sending the process a SIGINT, so this looks like a problem specifically with rb_thread_call_without_gvl being used on the main thread.

Please let me know and I can provide more details or alternative repro cases.

Thanks in advance.

Updated by normalperson (Eric Wong) over 2 years ago

https://bugs.ruby-lang.org/issues/15499

Probably my fault with timer-thread elimination :<

I think we'll need to dynamically enable timer-thread
for certain unblock functions.

Can you try compiling with "cppflags=-DUBF_TIMER=2" ?

Updated by normalperson (Eric Wong) over 2 years ago

Can you try compiling with "cppflags=-DUBF_TIMER=2" ?

Nevermind, both need fixing...

Updated by apolcyn (alex polcyn) over 2 years ago

Thanks for the quick look! I didn't get a chance to try out UBF_TIMER=2 before your last comment, but let me know if there's something else to try.

Updated by normalperson (Eric Wong) over 2 years ago

apolcyn@google.com wrote:

Thanks for the quick look! I didn't get a chance to try out
UBF_TIMER=2 before your last comment, but let me know if
there's something else to try.

Sorry for the breakage. For now, you can workaround this
breakage by spawning a do-nothing thread to handle signals:

Thread.new { sleep }

I'm slowly working on a permanent fix which won't increase
overhead for the majority of use cases.

Actions #5

Updated by normalperson (Eric Wong) over 2 years ago

  • Status changed from Open to Closed

Applied in changeset trunk|r66708.


thread.c (call_without_gvl): spawn thread for UBF iff single-threaded

We need another native thread to call some unblocking functions
which aren't RUBY_UBF_IO or RUBY_UBF_PROCESS. Instead of a
permanent thread in <= 2.5, we can now rely on the thread cache
feature to perform interrupts.

[ruby-core:90865] [Bug #15499]

Updated by normalperson (Eric Wong) over 2 years ago

  • Assignee set to ko1 (Koichi Sasada)
  • Status changed from Closed to Assigned

ko1: can you check the rb_nogvl C-API introduced in r66712?
Thanks.

Updated by normalperson (Eric Wong) over 2 years ago

Eric Wong normalperson@yhbt.net wrote:

apolcyn@google.com wrote:

Thanks for the quick look! I didn't get a chance to try out
UBF_TIMER=2 before your last comment, but let me know if
there's something else to try.

Sorry for the breakage. For now, you can workaround this
breakage by spawning a do-nothing thread to handle signals:

r66708 should fix the breakage in your particular case.
Again, deeply sorry for the breakage.

Thread.new { sleep }

I'm slowly working on a permanent fix which won't increase
overhead for the majority of use cases.

So r66708 uses a short-lived thread and thread-cache to avoid
permanent overhead.

r66712 introduces an experimental new API (rb_nogvl) which
allows C-API users to fire some unblock functions in signal
handlers. It won't affect grpc's case, but it affects zlib and
bignum in core, at least.

It turns out zlib and bignum had the same problem you encountered;
but I missed them.

Actions #8

Updated by nagachika (Tomoyuki Chikanaga) over 2 years ago

  • Backport changed from 2.4: UNKNOWN, 2.5: UNKNOWN, 2.6: UNKNOWN to 2.4: DONTNEED, 2.5: DONTNEED, 2.6: REQUIRED

Updated by dentarg (Patrik Ragnarsson) over 2 years ago

After switching to Ruby 2.6.0 for our application on Heroku, we have been getting R12 - Exit timeout errors after each deploy. The application is using Puma in cluster mode. While we haven't tried to reproduce this on Heroku, one of our tests starts our application just to see that it can boot, and we noticed that with 2.6.0, the TERM signal wasn't enough to stop the application. Our test now sends KILL after a while.

I've tried to do a minimal repro of the situation at https://github.com/dentarg/gists/tree/master/gists/ruby-bug-15499. I've sampled the hanged processes (that starts eating ~100% CPU) with Activity Monitor, see the txt files, maybe that can give some clue.

Updated by normalperson (Eric Wong) over 2 years ago

Issue #15499 has been updated by dentarg (Patrik Ragnarsson).

Patrik: can you try r66708? (git 9e66910b3bd85de32e95cf019ed285a36aecfd9e)
It was committed before your comment.

Sorry, I barely have time for Ruby, anymore.

Updated by dentarg (Patrik Ragnarsson) over 2 years ago

Patrik: can you try r66708? (git 9e66910b3bd85de32e95cf019ed285a36aecfd9e)

Eric: Sorry, I should have mentioned it in my comment. I also tried r66716 (ruby 2.7.0dev (2019-01-05 trunk 66716) [x86_64-darwin18]), which can be seen at GitHub. The behaviour was the same as with 2.6.0.

Do you want me to try r66708 specifically and not the commits after? I can do that if so.

Actions #12

Updated by ko1 (Koichi Sasada) over 2 years ago

  • Description updated (diff)

Updated by ko1 (Koichi Sasada) over 2 years ago

Sorry I didn't check this ticket (because it seems difficult).

r66708

I'm not sure why another thread is needed. could you explain about it?

r66712

rb_nogvl seems not good name.
Maybe Microsoft will name it rb_thread_call_without_gvl_ex().

I'm not sure the requirement of RB_NOGVL_UBF_ASYNC_SAFE because I can't understand why r66708 is needed.

     else if (ubf && vm_living_thread_num(th->vm) == 1) {
-        ubf_th = rb_thread_start_unblock_thread();
+        if (RB_NOGVL_UBF_ASYNC_SAFE) {
+            th->vm->ubf_async_safe = 1;
+        }
+        else {
+            ubf_th = rb_thread_start_unblock_thread();
+        }
     }

The condition is always true and maybe r66708 is disabled.
No test?

Updated by ko1 (Koichi Sasada) over 2 years ago

I recommend to backport only r66708 for ruby 2.6.1 now.
Honestly speaking I don't understand why it is needed, but eric should know more so I want to believe.

r66712 has several problem so we need discuss more.

Actions #15

Updated by naruse (Yui NARUSE) over 2 years ago

  • Status changed from Assigned to Closed

Applied in changeset ruby_2_6|r66940.


merge revision(s) 66708: [Backport #15499]

thread.c (call_without_gvl): spawn thread for UBF iff single-threaded

We need another native thread to call some unblocking functions
which aren't RUBY_UBF_IO or RUBY_UBF_PROCESS.  Instead of a
permanent thread in <= 2.5, we can now rely on the thread cache
feature to perform interrupts.

[ruby-core:90865] [Bug #15499]

Updated by naruse (Yui NARUSE) over 2 years ago

  • Backport changed from 2.4: DONTNEED, 2.5: DONTNEED, 2.6: REQUIRED to 2.4: DONTNEED, 2.5: DONTNEED, 2.6: DONE

ruby_2_6 r66940 merged revision(s) 66708.

Actions #17

Updated by naruse (Yui NARUSE) about 1 year ago

  • Backport changed from 2.4: DONTNEED, 2.5: DONTNEED, 2.6: DONE to 2.4: DONTNEED, 2.5: DONTNEED, 2.6: REQUIRED

Updated by nagachika (Tomoyuki Chikanaga) about 1 year ago

  • Backport changed from 2.4: DONTNEED, 2.5: DONTNEED, 2.6: REQUIRED to 2.4: DONTNEED, 2.5: DONTNEED, 2.6: DONE

The ff98931dac is the fix for 23444302d9 and the 23444302d9 have not been backported into ruby_2_6 branch. 23444302d9 seems a kind of performance improvement.
I will restore Backport field to 2.6: DONE.

Updated by apolcyn (alex polcyn) 5 months ago

Thanks for the followup on this issue. However, was the fix ever released?

I can still reproduce this issue on the following ruby versions, using the grpc gem at version 1.34.0:

$ ruby -v
ruby 2.6.6p146 (2020-03-31 revision 67876) [x86_64-linux]
$ ruby -v
ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]
$ ruby -v
ruby 3.0.0p0 (2020-12-25 revision 95aff21468) [x86_64-linux]

My system:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 10 (buster)
Release:    10
Codename:   buster

Note that this is still NOT an issue on:

$ ruby -v
ruby 2.5.8p224 (2020-03-31 revision 67882) [x86_64-linux]

Updated by ko1 (Koichi Sasada) 4 months ago

  • ruby -v changed from 2.6.0 to master
  • Status changed from Closed to Assigned

Thank you for your report. I can reproduce the issue.

Thread.new{ Ractor.receive }
Ractor.receive

It shouldn't be a Ractor.receive, but the condition are:

  • Make two or more threads.
  • All threads are waiting for some nogvl operation with custom ubf.

Maybe because it is by 9e66910b3bd85de32e95cf019ed285a36aecfd9e

We need another native thread to call some unblocking functions
which aren't RUBY_UBF_IO or RUBY_UBF_PROCESS.  Instead of a
permanent thread in <= 2.5, we can now rely on the thread cache
feature to perform interrupts.

But I can't understand what is the "thread cache feature".
I need to investigate more.

Actions

Also available in: Atom PDF