Project

General

Profile

Actions

Bug #14841

closed

Very rarely IO#readpartial does not raise EOFError

Added by hirura (Hiroyuki URANISHI) over 6 years ago. Updated almost 6 years ago.

Status:
Closed
Assignee:
-
Target version:
-
ruby -v:
ruby 2.6.0dev (2018-06-15 trunk 63671) [x86_64-linux]
[ruby-core:87467]

Description

Hi,

I'm now writing a multi thread application, which uses IO.pipe, IO#close and IO#readpartial.
During running a code many times, I found that, even though calling IO#close on a write side object instantiated by IO.pipe, IO#readpartial of the read side object very rarely does not raise EOFError and keeps blocking.

The below code is working fine in most cases.

  1. Thread t2 generates '0' and '1' values and writes them on w object.
  2. Thread t1 read from r object with IO#readpartial
  3. After thread t2 exited, close w object to finish thread t1
require 'logger'

logger = Logger.new 'logger.log', 1, 10000000
logger.level = Logger::DEBUG

r, w = IO.pipe

t0 = Thread.new{
  # This is a dummy thread to avoid being killed by Ruby interpreter automatically due to deadlock
  loop do
    sleep 1
    Thread.pass
  end
}

t1 = Thread.new{
  loop do
    begin
      logger.debug { "r.readpartial(10240)" }
      data_read = r.readpartial(10240)
      logger.debug { "data_read: #{data_read}" }
    rescue EOFError
      logger.info { "EOFError" }
      break
    rescue => e
      logger.error { e.message }
    end
  end
  logger.info { "exiting t1" }
}

t2 = Thread.new{
  2.times{ |i|
    sleep 0.1
    w.write i.to_s
  }
  logger.info { "exiting t2" }
}

begin
  t2.join
  logger.info { "t2 exited" }
  logger.info { "close w" }
  w.close
  logger.info { "w closed" }
rescue IOError
  logger.info { "IOError when closing w" }
end

t1.join
logger.info { "t1 exited" }

begin
  r.close
rescue IOError
  logger.info { "IOError when closing r" }
end

And I run the above code many times like:

$ i=0; while :; do echo $i; ruby readpartial_and_eof_test.rb; i=$(expr $i + 1); done
0
1
(Snip)
3892
3893
3894
(Stopped here)

As seen above, during 3894th iteration, the code is not finished.

The output of logger is the below. Closing w object is done, but EOFError looks not being raised.

$ tail -n 32 logger.log 
I, [2018-06-10T22:09:58.738087 #19893]  INFO -- : t1 exited
D, [2018-06-10T22:09:58.871906 #19916] DEBUG -- : r.readpartial(10240)
D, [2018-06-10T22:09:58.972198 #19916] DEBUG -- : data_read: 0
D, [2018-06-10T22:09:58.972327 #19916] DEBUG -- : r.readpartial(10240)
I, [2018-06-10T22:09:59.072294 #19916]  INFO -- : exiting t2
D, [2018-06-10T22:09:59.072441 #19916] DEBUG -- : data_read: 1
D, [2018-06-10T22:09:59.072758 #19916] DEBUG -- : r.readpartial(10240)
I, [2018-06-10T22:09:59.072667 #19916]  INFO -- : t2 exited
I, [2018-06-10T22:09:59.072915 #19916]  INFO -- : close w
I, [2018-06-10T22:09:59.073011 #19916]  INFO -- : w closed
I, [2018-06-10T22:09:59.073101 #19916]  INFO -- : EOFError
I, [2018-06-10T22:09:59.073166 #19916]  INFO -- : exiting t1
I, [2018-06-10T22:09:59.073268 #19916]  INFO -- : t1 exited
D, [2018-06-10T22:09:59.165037 #19944] DEBUG -- : r.readpartial(10240)
D, [2018-06-10T22:09:59.265229 #19944] DEBUG -- : data_read: 0
D, [2018-06-10T22:09:59.265279 #19944] DEBUG -- : r.readpartial(10240)
D, [2018-06-10T22:09:59.365360 #19944] DEBUG -- : data_read: 1
D, [2018-06-10T22:09:59.365469 #19944] DEBUG -- : r.readpartial(10240)
I, [2018-06-10T22:09:59.365252 #19944]  INFO -- : exiting t2
I, [2018-06-10T22:09:59.365640 #19944]  INFO -- : t2 exited
I, [2018-06-10T22:09:59.365709 #19944]  INFO -- : close w
I, [2018-06-10T22:09:59.365754 #19944]  INFO -- : w closed
I, [2018-06-10T22:09:59.365819 #19944]  INFO -- : EOFError
I, [2018-06-10T22:09:59.365959 #19944]  INFO -- : exiting t1
I, [2018-06-10T22:09:59.366078 #19944]  INFO -- : t1 exited
D, [2018-06-10T22:09:59.472484 #19967] DEBUG -- : r.readpartial(10240)
D, [2018-06-10T22:09:59.572743 #19967] DEBUG -- : data_read: 0
D, [2018-06-10T22:09:59.572891 #19967] DEBUG -- : r.readpartial(10240)
I, [2018-06-10T22:09:59.672976 #19967]  INFO -- : exiting t2
I, [2018-06-10T22:09:59.673165 #19967]  INFO -- : t2 exited
I, [2018-06-10T22:09:59.673199 #19967]  INFO -- : close w
I, [2018-06-10T22:09:59.673220 #19967]  INFO -- : w closed
$

This issue occurred in CentOS 7.2.1511 and Ubuntu 16.04.4 LTS.

  • CentOS
$ cat /etc/redhat-release 
CentOS Linux release 7.2.1511 (Core) 
$ uname -a
Linux mfz-remote-gw-01 3.10.0-327.18.2.el7.x86_64 #1 SMP Thu May 12 11:03:55 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ ruby -v
ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux]
$ rbenv -v
rbenv 1.1.1-30-gc8ba27f
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) 
  • Ubuntu
$ cat /etc/os-release 
NAME="Ubuntu"
VERSION="16.04.4 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.4 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
$ uname -a
Linux epsilon 4.4.0-122-generic #146-Ubuntu SMP Mon Apr 23 15:34:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ ruby -v
ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux]
$ rbenv -v
rbenv 1.1.1-37-g1c772d5
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 5.4.0-6ubuntu1~16.04.9' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9) 

Files

gdb_trace_normal.txt (15.6 KB) gdb_trace_normal.txt gdb trace output for normal case (break at rb_eof_error) hirura (Hiroyuki URANISHI), 06/19/2018 09:12 AM
gdb_trace_reproduced.txt (14.2 KB) gdb_trace_reproduced.txt gdb trace output for reproduced case hirura (Hiroyuki URANISHI), 06/19/2018 09:13 AM

Updated by hirura (Hiroyuki URANISHI) over 6 years ago

Hi,

I tried getting gdb backtrace of a reproduced, means stopped in this case, process.
Attached files are the outputs of reproduced case and normal case.

Could anyone help on this ?

hirura (Hiroyuki URANISHI) wrote:

Hi,

I'm now writing a multi thread application, which uses IO.pipe, IO#close and IO#readpartial.
During running a code many times, I found that, even though calling IO#close on a write side object instantiated by IO.pipe, IO#readpartial of the read side object very rarely does not raise EOFError and keeps blocking.

The below code is working fine in most cases.

  1. Thread t2 generates '0' and '1' values and writes them on w object.
  2. Thread t1 read from r object with IO#readpartial
  3. After thread t2 exited, close w object to finish thread t1
require 'logger'

logger = Logger.new 'logger.log', 1, 10000000
logger.level = Logger::DEBUG

r, w = IO.pipe

t0 = Thread.new{
  # This is a dummy thread to avoid being killed by Ruby interpreter automatically due to deadlock
  loop do
    sleep 1
    Thread.pass
  end
}

t1 = Thread.new{
  loop do
    begin
      logger.debug { "r.readpartial(10240)" }
      data_read = r.readpartial(10240)
      logger.debug { "data_read: #{data_read}" }
    rescue EOFError
      logger.info { "EOFError" }
      break
    rescue => e
      logger.error { e.message }
    end
  end
  logger.info { "exiting t1" }
}

t2 = Thread.new{
  2.times{ |i|
    sleep 0.1
    w.write i.to_s
  }
  logger.info { "exiting t2" }
}

begin
  t2.join
  logger.info { "t2 exited" }
  logger.info { "close w" }
  w.close
  logger.info { "w closed" }
rescue IOError
  logger.info { "IOError when closing w" }
end

t1.join
logger.info { "t1 exited" }

begin
  r.close
rescue IOError
  logger.info { "IOError when closing r" }
end

And I run the above code many times like:

$ i=0; while :; do echo $i; ruby readpartial_and_eof_test.rb; i=$(expr $i + 1); done
0
1
(Snip)
3892
3893
3894
(Stopped here)

As seen above, during 3894th iteration, the code is not finished.

The output of logger is the below. Closing w object is done, but EOFError looks not being raised.

$ tail -n 32 logger.log 
I, [2018-06-10T22:09:58.738087 #19893]  INFO -- : t1 exited
D, [2018-06-10T22:09:58.871906 #19916] DEBUG -- : r.readpartial(10240)
D, [2018-06-10T22:09:58.972198 #19916] DEBUG -- : data_read: 0
D, [2018-06-10T22:09:58.972327 #19916] DEBUG -- : r.readpartial(10240)
I, [2018-06-10T22:09:59.072294 #19916]  INFO -- : exiting t2
D, [2018-06-10T22:09:59.072441 #19916] DEBUG -- : data_read: 1
D, [2018-06-10T22:09:59.072758 #19916] DEBUG -- : r.readpartial(10240)
I, [2018-06-10T22:09:59.072667 #19916]  INFO -- : t2 exited
I, [2018-06-10T22:09:59.072915 #19916]  INFO -- : close w
I, [2018-06-10T22:09:59.073011 #19916]  INFO -- : w closed
I, [2018-06-10T22:09:59.073101 #19916]  INFO -- : EOFError
I, [2018-06-10T22:09:59.073166 #19916]  INFO -- : exiting t1
I, [2018-06-10T22:09:59.073268 #19916]  INFO -- : t1 exited
D, [2018-06-10T22:09:59.165037 #19944] DEBUG -- : r.readpartial(10240)
D, [2018-06-10T22:09:59.265229 #19944] DEBUG -- : data_read: 0
D, [2018-06-10T22:09:59.265279 #19944] DEBUG -- : r.readpartial(10240)
D, [2018-06-10T22:09:59.365360 #19944] DEBUG -- : data_read: 1
D, [2018-06-10T22:09:59.365469 #19944] DEBUG -- : r.readpartial(10240)
I, [2018-06-10T22:09:59.365252 #19944]  INFO -- : exiting t2
I, [2018-06-10T22:09:59.365640 #19944]  INFO -- : t2 exited
I, [2018-06-10T22:09:59.365709 #19944]  INFO -- : close w
I, [2018-06-10T22:09:59.365754 #19944]  INFO -- : w closed
I, [2018-06-10T22:09:59.365819 #19944]  INFO -- : EOFError
I, [2018-06-10T22:09:59.365959 #19944]  INFO -- : exiting t1
I, [2018-06-10T22:09:59.366078 #19944]  INFO -- : t1 exited
D, [2018-06-10T22:09:59.472484 #19967] DEBUG -- : r.readpartial(10240)
D, [2018-06-10T22:09:59.572743 #19967] DEBUG -- : data_read: 0
D, [2018-06-10T22:09:59.572891 #19967] DEBUG -- : r.readpartial(10240)
I, [2018-06-10T22:09:59.672976 #19967]  INFO -- : exiting t2
I, [2018-06-10T22:09:59.673165 #19967]  INFO -- : t2 exited
I, [2018-06-10T22:09:59.673199 #19967]  INFO -- : close w
I, [2018-06-10T22:09:59.673220 #19967]  INFO -- : w closed
$

This issue occurred in CentOS 7.2.1511 and Ubuntu 16.04.4 LTS.

  • CentOS
$ cat /etc/redhat-release 
CentOS Linux release 7.2.1511 (Core) 
$ uname -a
Linux mfz-remote-gw-01 3.10.0-327.18.2.el7.x86_64 #1 SMP Thu May 12 11:03:55 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ ruby -v
ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux]
$ rbenv -v
rbenv 1.1.1-30-gc8ba27f
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) 
  • Ubuntu
$ cat /etc/os-release 
NAME="Ubuntu"
VERSION="16.04.4 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.4 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
$ uname -a
Linux epsilon 4.4.0-122-generic #146-Ubuntu SMP Mon Apr 23 15:34:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ ruby -v
ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux]
$ rbenv -v
rbenv 1.1.1-37-g1c772d5
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 5.4.0-6ubuntu1~16.04.9' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9) 

Updated by normalperson (Eric Wong) over 6 years ago

wrote:

File gdb_trace_reproduced.txt added
ruby -v changed from ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux] to ruby 2.6.0dev (2018-06-15 trunk 63671) [x86_64-linux]

Odd, it is stuck in rb_mutex_sleep based on your backtrace (from
Logger, it looks like); and you can reproduce it on recent trunk.

I'm trying to reproduce the bug, but I cannot. I will let it
run while I do other things...

I suspect the problem is a bug in our deadlock detection
(vm->sleeper (*))

In your code, you have a t0 dummy thread. Is it really needed?

Curious, how many CPU cores and what model do you have?
Can you reproduce it if you use taskset or schedtool to limit
it to a single core?

Do you have a different hardware to try on?

(*) I also wonder why vm->sleeper is volatile, it should not
need to be.

Updated by normalperson (Eric Wong) over 6 years ago

Sorry, I left this out: If you can reproduce it again, can you
print vm->sleeper and vm->living_thread_num values from gdb?
Thanks.

Updated by hirura (Hiroyuki URANISHI) over 6 years ago

Hi Eric,

Thank you for replying.

I tried and reproduced this on

  • Ubuntu Linux on Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz (4cores / 8threads)
  • CentOS Linux on Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz (1cores /
    2threads; a VM)
    Unfortunately I don't have a Linux machine with AMD or other vendors'.

As for additional gdb outputs, I have a dumped core file of the
stopped process, and I tried printing the instructed.
But I couldn't get "vm" variable like

(gdb) print vm
No symbol "vm" in current context.

Could you give me a bit detailed way to get the output? I'm sorry I'm
not familiar with gdb.

The t0 dummy thread is used so that the process is not killed by
deadlock detection and I can get the core-dump.
Without the thread, the process is killed as deadlock.
The output of the case is the below. I suppose deadlock detection is
working as expected.
(Line 45 is t1.join in this case)

Traceback (most recent call last):
1: from readpartial_and_eof_test.rb:45:in `<main>'
readpartial_and_eof_test.rb:45:in `join': No live threads left.
Deadlock? (fatal)
2 threads, 2 sleeps current:0x00007f8c498cfc20 main thread:0x00007f8c49566470
* #<Thread:0x00007f8c49598c60 sleep_forever>
rb_thread_t:0x00007f8c49566470 native:0x00007f8c47674740 int:0
readpartial_and_eof_test.rb:45:in `join'
readpartial_and_eof_test.rb:45:in `<main>'
* #<Thread:0x00007f8c4988ce80@readpartial_and_eof_test.rb:11 sleep_forever>
rb_thread_t:0x00007f8c498cfc20 native:0x00007f8c3f908700 int:0
depended by: tb_thread_id:0x00007f8c49566470
/home/hirura/.rbenv/versions/2.5.1/lib/ruby/2.5.0/monitor.rb:185:in `lock'
/home/hirura/.rbenv/versions/2.5.1/lib/ruby/2.5.0/monitor.rb:185:in
`mon_enter'
/home/hirura/.rbenv/versions/2.5.1/lib/ruby/2.5.0/monitor.rb:224:in
`mon_synchronize'
/home/hirura/.rbenv/versions/2.5.1/lib/ruby/2.5.0/logger.rb:688:in `write'
/home/hirura/.rbenv/versions/2.5.1/lib/ruby/2.5.0/logger.rb:471:in `add'
/home/hirura/.rbenv/versions/2.5.1/lib/ruby/2.5.0/logger.rb:493:in `debug'
readpartial_and_eof_test.rb:16:in `block (2 levels) in <main>'
readpartial_and_eof_test.rb:12:in `loop'
readpartial_and_eof_test.rb:12:in `block in <main>'

Thanks,

Updated by normalperson (Eric Wong) over 6 years ago

hirura wrote:

I tried and reproduced this on

  • Ubuntu Linux on Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz (4cores / 8threads)
  • CentOS Linux on Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz (1cores /
    2threads; a VM)
    Unfortunately I don't have a Linux machine with AMD or other vendors'.

OK, I'm on AMD (Phenom II) with 4 cores and no HT.

As for additional gdb outputs, I have a dumped core file of the
stopped process, and I tried printing the instructed.
But I couldn't get "vm" variable like

(gdb) print vm
No symbol "vm" in current context.

Could you give me a bit detailed way to get the output? I'm sorry I'm
not familiar with gdb.

You can use "up" and "down" to move up/down stack frames until you're
in a function where "vm" is accessible. (I'm not a gdb expert, either)

The t0 dummy thread is used so that the process is not killed by
deadlock detection and I can get the core-dump.
Without the thread, the process is killed as deadlock.

Odd, I can't reproduce a deadlock with t0 commented out, either.

The output of the case is the below. I suppose deadlock detection is
working as expected.
(Line 45 is t1.join in this case)

Traceback (most recent call last):
        1: from readpartial_and_eof_test.rb:45:in `<main>'
readpartial_and_eof_test.rb:45:in `join': No live threads left.
Deadlock? (fatal)
2 threads, 2 sleeps current:0x00007f8c498cfc20 main thread:0x00007f8c49566470

OK, so that's 2 for both vm->living_thread_num and vm->sleeper;
in that case that's normal.

I'm wondering if a stuck process will show some strange values
(e.g. negative due to underflow).

Perhaps `volatile' is unnecessary and the compiler does
something strange with sleeper... Does removing it change
things?

--- a/vm_core.h
+++ b/vm_core.h
@@ -565,7 +565,7 @@ typedef struct rb_vm_struct {
unsigned int safe_level_: 1;

int trace_running;
-    volatile int sleeper;
+    int sleeper;

/* object management */
VALUE mark_object_ary;

Updated by hirura (Hiroyuki URANISHI) over 6 years ago

Thank you for letting me know how to touch gdb.
The below is the output, it seems that sleeper and living_thread_num
have reasonable values.

(gdb) t 1
[Switching to thread 1 (Thread 0x7f46c51c8700 (LWP 827))]
#0  pthread_cond_wait@@GLIBC_2.3.2 () at
../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
185 in ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S
(gdb) up 3
#3  0x000055bec706df9a in thread_join_sleep (arg=140725177158608) at
thread.c:932
932     native_sleep(th, 0);
(gdb) p th->vm
$1 = (rb_vm_t *) 0x55bec8085f20
(gdb) p th->vm->sleeper
$2 = 2
(gdb) p th->vm->living_thread_num
$3 = 3
(gdb)

(gdb) t 3
[Switching to thread 3 (Thread 0x7f46c37aa700 (LWP 841))]
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at
../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
225 ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: No
such file or directory.
(gdb) up 3
#2  0x000055bec7069920 in native_sleep (th=0x55bec8468fa0,
timeout_rel=0x7f46c37a7b50) at thread_pthread.c:1073
1073 native_cond_timedwait(cond, lock, &timeout);
(gdb) p th->vm
$1 = (rb_vm_t *) 0x55bec8085f20
(gdb) p th->vm->sleeper
$2 = 2
(gdb) p th->vm->living_thread_num
$3 = 3
(gdb)

(gdb) t 4
[Switching to thread 4 (Thread 0x7f46c35a8700 (LWP 842))]
#0  pthread_cond_wait@@GLIBC_2.3.2 () at
../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
185 in ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S
(gdb) up 2
#2  0x000055bec7069907 in native_sleep (th=0x55bec8425390,
timeout_rel=0x0) at thread_pthread.c:1071
1071 rb_native_cond_wait(cond, lock);
(gdb) p th->vm
$1 = (rb_vm_t *) 0x55bec8085f20
(gdb) p th->vm->sleeper
$2 = 2
(gdb) p th->vm->living_thread_num
$3 = 3
(gdb)

Nextly, I will try the same code with limited CPU cores, and will try
patching vm_core.h.

Updated by normalperson (Eric Wong) over 6 years ago

hirura wrote:

Nextly, I will try the same code with limited CPU cores, and will try
patching vm_core.h.

Thanks. I'm failing to reproduce the problem on my end.
I also got my netbook to overheat, but that's not Ruby's fault :x

Updated by normalperson (Eric Wong) over 6 years ago

wrote:

https://bugs.ruby-lang.org/issues/14841

I think the problem is mutex->th may be altered by
RUBY_VM_CHECK_INTS_BLOCKING(th->ec), and we miss
a point where "mutex->th = 0;"

Can you test?

--- a/thread_sync.c
+++ b/thread_sync.c
@@ -272,6 +272,7 @@ rb_mutex_lock(VALUE self)
	    list_add_tail(&mutex->waitq, &w.node);
	    native_sleep(th, timeout); /* release GVL */
	    list_del(&w.node);
+	    RUBY_VM_CHECK_INTS_BLOCKING(th->ec);
	    if (!mutex->th) {
		mutex->th = th;
	    }
@@ -289,8 +290,6 @@ rb_mutex_lock(VALUE self)
	    th->vm->sleeper--;

	    if (mutex->th == th) mutex_locked(th, self);
-
-	    RUBY_VM_CHECK_INTS_BLOCKING(th->ec);
	}
}
return self;

I think the bug existed in <2.4, but 2.5 made it easier-to-hit.

Updated by normalperson (Eric Wong) over 6 years ago

Eric Wong wrote:

wrote:

https://bugs.ruby-lang.org/issues/14841

Can you test?

Sorry, right idea, bad patch (broke during "make exam")
This should work:

--- a/thread_sync.c
+++ b/thread_sync.c
@@ -289,8 +289,11 @@ rb_mutex_lock(VALUE self)
	    th->vm->sleeper--;

	    if (mutex->th == th) mutex_locked(th, self);
-
-	    RUBY_VM_CHECK_INTS_BLOCKING(th->ec);
+	    RUBY_VM_CHECK_INTS_BLOCKING(th->ec); /* may release mutex */
+	    if (!mutex->th) {
+		mutex->th = th;
+	        mutex_locked(th, self);
+	    }
	}
}
return self;
Actions #10

Updated by normalperson (Eric Wong) over 6 years ago

  • Status changed from Open to Closed

Applied in changeset trunk|r63711.


thread_sync.c (rb_mutex_lock): fix deadlock

Actions #11

Updated by normalperson (Eric Wong) over 6 years ago

  • Backport changed from 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN to 2.3: REQUIRED, 2.4: REQUIRED, 2.5: REQUIRED
Actions #12

Updated by normalperson (Eric Wong) over 6 years ago

  • Status changed from Closed to Open

Updated by normalperson (Eric Wong) over 6 years ago

This affects all versions, but is most easily triggered in 2.5+

Actions #14

Updated by Anonymous over 6 years ago

  • Status changed from Open to Closed

Applied in changeset commit:ruby-git|501069b8a4013f2e3fdde35c50e9527ef0061963.


thread_sync.c (rb_mutex_lock): fix deadlock

git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@63711 b2dd03c8-39d4-4d8f-98ff-823fe69b080e

Updated by hirura (Hiroyuki URANISHI) over 6 years ago

Hi,

Sorry to reply late.
Thank you for your quick help and fix.

I have tried the same code more than 1,000,000 times on "ruby 2.6.0dev (2018-06-22 trunk 63723) [x86_64-linux]" version that includes the fix, and I confirmed the issue is not reproduced.
And just for information, I tried the older version, that does not include the fix, with "taskset" tool, and the issue was not reproduced as well.

Thanks,

Updated by nagachika (Tomoyuki Chikanaga) about 6 years ago

  • Backport changed from 2.3: REQUIRED, 2.4: REQUIRED, 2.5: REQUIRED to 2.3: REQUIRED, 2.4: REQUIRED, 2.5: DONE

ruby_2_5 r64998 merged revision(s) 63711,64398.

Updated by usa (Usaku NAKAMURA) almost 6 years ago

  • Backport changed from 2.3: REQUIRED, 2.4: REQUIRED, 2.5: DONE to 2.3: REQUIRED, 2.4: DONE, 2.5: DONE

ruby_2_4 r65115 merged revision(s) 63711,64398.

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0