Project

General

Profile

Actions

Backport #8080

closed

Segfault in rb_fd_set

Added by jonleighton (Jon Leighton) about 11 years ago. Updated almost 11 years ago.

Status:
Closed
[ruby-core:53349]

Description

I have experienced a segfault with Ruby 2 during an IO.select call:

See https://travis-ci.org/jonleighton/spring/jobs/5393025 or https://gist.github.com/jonleighton/5147785 to see the crash output.

I cannot reproduce on a different version of Linux (Fedora). However I was able to reproduce by downloading a VM image of the Travis CI environment and running the code on there (see http://pivotallabs.com/debugging-travis-builds/ for how to do that).

I tried to produce a simple script to reproduce, but without success. I also tried to build Ruby 2 with debugging symbols, but this did not produce the crash. I'm not sure why - perhaps related to compiler optimisations.

I found a workaround for the crash with https://github.com/jonleighton/spring/commit/c8a7afdd3238ef88bffc2c8f56baa21042400e15.


Related issues 1 (0 open1 closed)

Has duplicate Ruby master - Bug #6653: 1.9.2/1.9.3 exhibit SEGV with many threads+tcp connectionsClosedakr (Akira Tanaka)06/27/2012Actions

Updated by normalperson (Eric Wong) about 11 years ago

"jonleighton (Jon Leighton)" wrote:

I have experienced a segfault with Ruby 2 during an IO.select call:

See https://travis-ci.org/jonleighton/spring/jobs/5393025 or https://gist.github.com/jonleighton/5147785 to see the crash output.

I cannot reproduce on a different version of Linux (Fedora). However I was able to reproduce by downloading a VM image of the Travis CI environment and running the code on there (see http://pivotallabs.com/debugging-travis-builds/ for how to do that).

I tried to produce a simple script to reproduce, but without success. I also tried to build Ruby 2 with debugging symbols, but this did not produce the crash. I'm not sure why - perhaps related to compiler optimisations.

I found a workaround for the crash with https://github.com/jonleighton/spring/commit/c8a7afdd3238ef88bffc2c8f56baa21042400e15.

Looking at your workaround, I think a better one is "watcher.to_io"
method needs to memoize its return value.

I think Ruby expects the return value of obj.to_io to be persistent
for the lifetime of obj.

Making IO.select cache the value of to_io internally might be alright...

Updated by kosaki (Motohiro KOSAKI) about 11 years ago

  • Category set to core
  • Status changed from Open to Assigned
  • Assignee set to kosaki (Motohiro KOSAKI)
  • Target version set to 2.1.0

Updated by jonleighton (Jon Leighton) about 11 years ago

normalperson (Eric Wong) wrote:

Looking at your workaround, I think a better one is "watcher.to_io"
method needs to memoize its return value.

I think Ruby expects the return value of obj.to_io to be persistent
for the lifetime of obj.

Making IO.select cache the value of to_io internally might be alright...

It seems that IO.select does cache the value of to_io internally. At least that seems to be the case from https://gist.github.com/jonleighton/5172263.

Running the script prints "to_io" once and then hangs.

Updated by normalperson (Eric Wong) about 11 years ago

"jonleighton (Jon Leighton)" wrote:

Issue #8080 has been updated by jonleighton (Jon Leighton).
normalperson (Eric Wong) wrote:

Looking at your workaround, I think a better one is "watcher.to_io"
method needs to memoize its return value.

I think Ruby expects the return value of obj.to_io to be persistent
for the lifetime of obj.

Making IO.select cache the value of to_io internally might be alright...

It seems that IO.select does cache the value of to_io internally. At least that seems to be the case from https://gist.github.com/jonleighton/5172263.

Running the script prints "to_io" once and then hangs.

That's because nothing else is running while IO.select is running.
IO.select does not cache, it accesses once the data once.
See comments below:

---------------------------------8<-------------------------------
class Foo
def to_io
puts "to_io"
IO.pipe.first
end
end
n = 0

trap(:USR1) { n += 1 }

Generate garbage to trigger GC, and then EINTR for select()

Thread.new do
loop do
# make some garbage here
(1..1000000).each { |z| z.to_s.dup }
puts "KILL #{n}"
Process.kill("USR1", $$)
end
end

This will now return empty arrays

If you swap Foo.new for $stdin, this will never return (as expected)

p IO.select([Foo.new])
---------------------------------8<-------------------------------

Actions #5

Updated by kosaki (Motohiro KOSAKI) about 11 years ago

  • Status changed from Assigned to Closed
  • % Done changed from 0 to 100

This issue was solved with changeset r39775.
Jon, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


  • thread.c: disabled _FORTIFY_SOURCE for avoid to hit glibc bug.
    [Bug #8080] [ruby-core:53349]
  • test/ruby/test_io.rb (TestIO#test_io_select_with_many_files):
    test for the above.
Actions #6

Updated by kosaki (Motohiro KOSAKI) about 11 years ago

  • Tracker changed from Bug to Backport
  • Project changed from Ruby master to Backport200
  • Category deleted (core)
  • Status changed from Closed to Assigned
  • Assignee changed from kosaki (Motohiro KOSAKI) to nagachika (Tomoyuki Chikanaga)
  • Priority changed from Normal to 5
  • Target version deleted (2.1.0)

The root cause is, Linux's select implementation supports >1024 files, but glibc doesn't. glibc doesn't correctly understand linux select(2) spec.
This issue is only happen when _FORTIFY_SOURCE >= 1. That said, 2.0 and 1.9.x on Ubuntu. (because Ubuntu enable FORTIFY_SOURCE by default).

I think this fix is important for server usecase. Please backport r39772-r39775.

Actions #7

Updated by kosaki (Motohiro KOSAKI) about 11 years ago

The root cause is, Linux's select implementation supports >1024 files, but glibc doesn't. glibc doesn't correctly understand linux select(2) spec.
This issue is only happen when _FORTIFY_SOURCE >= 1. That said, 2.0 and 1.9.x on Ubuntu. (because Ubuntu enable FORTIFY_SOURCE by default).

I think this fix is important for server usecase. Please backport r39772-r39775.

Updated by kosaki (Motohiro KOSAKI) about 11 years ago

Hi nagachika-sanm

You need backport 39777, 39779, 39781 and 39783 too.

Actions #9

Updated by nagachika (Tomoyuki Chikanaga) about 11 years ago

  • Status changed from Assigned to Closed

This issue was solved with changeset r39838.
Jon, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


merge revision(s) 39772,39773: [Backport #8080]

* configure.in: check struct timeval exist or not.

* include/ruby/missing.h (struct timeval): check HAVE_STRUCT_TIMEVAL
  properly. and don't include sys/time.h if struct timeval exist.

* file.c: include sys/time.h explicitly.

* random.c: ditto.

* thread_pthread.c: ditto.

* time.c: ditto.

* ext/date/date_strftime.c: ditto.

* include/ruby/missing.h (struct timespec): include <sys/time.h>

Updated by nagachika (Tomoyuki Chikanaga) about 11 years ago

r39772-r39775, r39777, r39779, r39781 and r39783 are backported.
I also merged r39160, r39162, r39174, r39198 and r39200 for clean merge of configure.in.

Actions #11

Updated by kosaki (Motohiro KOSAKI) about 11 years ago

  • Project changed from Backport200 to Backport193
  • Status changed from Closed to Assigned
  • Assignee changed from nagachika (Tomoyuki Chikanaga) to usa (Usaku NAKAMURA)
Actions #12

Updated by usa (Usaku NAKAMURA) almost 11 years ago

  • Status changed from Assigned to Closed

This issue was solved with changeset r39985.
Jon, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


merge revision(s) 39772,39773,39774,39775,39777,39779,39781,39783: [Backport #8080]

* configure.in: check struct timeval exist or not.

* include/ruby/missing.h (struct timeval): check HAVE_STRUCT_TIMEVAL
  properly. and don't include sys/time.h if struct timeval exist.

* file.c: include sys/time.h explicitly.

* random.c: ditto.

* thread_pthread.c: ditto.

* time.c: ditto.

* ext/date/date_strftime.c: ditto.

* include/ruby/missing.h (struct timespec): include <sys/time.h>

* include/ruby/missing.h (__syscall): moved to...

* io.c: here. because __syscall() is only used from io.c.

* include/ruby/missing.h: move "#include <sys/type.h>" to ....

* include/ruby/intern.h: here. because it was introduced for
  fixing NFDBITS issue. [ruby-core:05179].

* thread.c: disabled _FORTIFY_SOURCE for avoid to hit glibc bug.
  [Bug #8080] [ruby-core:53349]

* test/ruby/test_io.rb (TestIO#test_io_select_with_many_files):
  test for the above.

* include/ruby/missing.h: removed __linux__. it's unnecessary.

Updated by znz (Kazuhiro NISHIYAMA) almost 11 years ago

  • Status changed from Closed to Assigned

This backport to ruby_1_9_3 is not enough.
Build failed on Mac.
see https://gist.github.com/hsbt/5263190

Actions #14

Updated by usa (Usaku NAKAMURA) almost 11 years ago

  • Status changed from Assigned to Closed

This issue was solved with changeset r39995.
Jon, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


  • include/ruby/missing.h: fixed merge mistake of r39985.
    [Backport #8080]
Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0