Bug #6653

1.9.2/1.9.3 exhibit SEGV with many threads+tcp connections

Added by Erik Hollensbe almost 2 years ago. Updated about 1 year ago.

[ruby-core:45902]
Status:Closed
Priority:Low
Assignee:Akira Tanaka
Category:-
Target version:next minor
ruby -v:ruby 1.9.2p290 (2011-07-09 revision 32553) [x86_64-linux] Backport:

Description

the script: https://gist.github.com/4f36f8543ad702861096
the trace + output of the run: https://gist.github.com/cf7dd137ad65802c46ae

ruby -v is 1.9.2-p290, but we're seeing this in 1.9.3-p194 as well.

This does not exhibit on OS X, only linux, we tested on Ubuntu 12.04.

I can get more information if desired.

Just guessing, this appears to be a bug in how FD_SETSIZE is handled.

Thank you!


Related issues

Duplicates Backport93 - Backport #8080: Segfault in rb_fd_set Closed 03/13/2013

History

#1 Updated by Eric Wong almost 2 years ago

"erikh (Erik Hollensbe)" erik@hollensbe.org wrote:

Issue #6653 has been reported by erikh (Erik Hollensbe).


Bug #6653: 1.9.2/1.9.3 exhibit SEGV with many threads+tcp connections
https://bugs.ruby-lang.org/issues/6653

Author: erikh (Erik Hollensbe)
Status: Open
Priority: Normal
Assignee:
Category:
Target version:
ruby -v: ruby 1.9.2p290 (2011-07-09 revision 32553) [x86_64-linux]

the script: https://gist.github.com/4f36f8543ad702861096
the trace + output of the run: https://gist.github.com/cf7dd137ad65802c46ae

Private gist for public bug reports makes no sense. Private gists
requires account + ssh key on github to "git clone" from.

ruby -v is 1.9.2-p290, but we're seeing this in 1.9.3-p194 as well.

This does not exhibit on OS X, only linux, we tested on Ubuntu 12.04.

I can't reproduce this on a similar system (Debian testing (wheezy))
with 1.9.3-p194 nor Ruby 1.9.2-p290.

rbfdset() should not get called under 1.9.3 on Linux from
rbthreadfd_writable(), can you show a backtrace from 1.9.3?

Are you certain /opt/ruby/lib/libruby.so.1.9 got changed/upgraded
to the 1.9.3 version?

The ruby/config.h header for 1.9.3 should have detected ppoll() and
set: #define HAVE_PPOLL 1

ppoll() usage would prevent rbfdset() usage in your particular code
path.

Also, what is the value of HAVERBFD_INIT in ruby/config.h?
(it should be 1 on Linux for all Ruby 1.9.x)

If you have build logs handy, can you see if ppoll() got detected
on 1.9.3?

#2 Updated by Motohiro KOSAKI almost 2 years ago

  • Status changed from Open to Feedback

#3 Updated by Tommy Odom over 1 year ago

I've hit a similar issue while using Chef with Ruby 1.9.3 on Ubuntu 12.04 x8664. I've tried with both the Ubuntu 1.9.3 packages as well as the packages provided by Brightbox (ruby 1.9.3p194 (2012-04-20 revision 35410) [x8664-linux]) and with both I've hit a very similar stack trace. One thing I have noticed though is that this does not occur if the max open files is set to <= 1700.

You can see the stack trace at: https://gist.github.com/3294941

The code in Chef that is failing is: https://github.com/opscode/mixlib-shellout/blob/master/lib/mixlib/shellout/unix.rb

** Update **
I figured out that I had a piece of code that was opening a bunch of file handles (around 1700) using File.new and wasn't closing them. So it appears that in my case having 1700 open files was contributing to the issue.

#4 Updated by Yusuke Endoh over 1 year ago

  • Priority changed from Normal to Low

Please write a complete reproducing procedure. It requires memcached, right?
I cannot repro on Ubuntu 12.04.

Yusuke Endoh mame@tsg.ne.jp

#5 Updated by Yusuke Endoh over 1 year ago

Erik Hollensbe, ping?

Yusuke Endoh mame@tsg.ne.jp

#6 Updated by Erik Hollensbe over 1 year ago

Sorry for the abysmally late response -- I can't seem to get the redmine here to send me email for some reason.

Hi Folks, so I actually sorted this out with some help from others. It's not an issue of memcached, or rather, didn't appear to be when I looked into it.

If you adjust the limit (either with ulimit or the Process:: tooling) it goes away. Conversely you should see this problem if you adjust the ulimit threshold below the amount of descriptors you're trying to work with.

I will also say that it has been a significant amount of time since I had this problem and have changed jobs since then, so I don't have access to specifics on build env, etc anymore.

The problem seems to be the handling of the case where the system says "I can't give you any more descriptors", not any specific value. I was using a lot of threads too, if that matters.

#7 Updated by Yusuke Endoh over 1 year ago

  • Status changed from Feedback to Assigned
  • Assignee set to Akira Tanaka
  • Target version set to 2.0.0

Erik, thank you for the reply!
Well, it seems that there is something wrong in the handling of file descriptors bigger than FD_SETSIZE.

Akr-san, kosaki-san, ko1, do you have any idea?

Yusuke Endoh mame@tsg.ne.jp

#8 Updated by Motohiro KOSAKI over 1 year ago

Unfortunately, I've seen nothing wrong even if file descriptor limits are greater than FD_SETSIZE.

#9 Updated by Yusuke Endoh about 1 year ago

  • Target version changed from 2.0.0 to next minor

#10 Updated by Motohiro KOSAKI about 1 year ago

  • Status changed from Assigned to Closed

closed. because it is duplicated.

Also available in: Atom PDF